Overview
Gunni provides access to 17+ models across Google, Black Forest Labs, OpenAI, Kling, Topaz, ElevenLabs, and more — all through a single API key. You never need to manage provider accounts or swap endpoints. Gunni routes each request to the right provider automatically.
Models are grouped by capability. Each section below lists the available models, which one is the default, and what each is optimized for.
Image Generation
These models generate a new image from a text prompt.
| Model | Provider | Best for | Notes |
|---|---|---|---|
| nano-banana (default) | Google Gemini | All-round generation | Up to 4K. Best starting point for most use cases. |
| recraft-v4 | Recraft | Text in image | Specialist for generating images that include legible text and typography. |
| flux-2-pro | Black Forest Labs | Photorealism | FLUX.2 Pro. Best for photorealistic images with fine detail. |
| gpt-image | OpenAI | Text rendering | GPT Image 1.5. Best text rendering in complex scenes. |
| gpt-image-mini | OpenAI | Fast, cost-optimized | Lighter variant of gpt-image. Faster and cheaper for bulk or draft generation. |
Image Editing
These models edit an existing image using a text prompt or reference images. Editing mode is triggered when you pass an image parameter alongside a prompt.
| Model | Provider | Best for | Notes |
|---|---|---|---|
| nano-banana-edit (default) | Google Gemini | Multi-image editing | Accepts multiple reference images. Strong instruction following. |
| flux-kontext | Black Forest Labs | Targeted local edits | FLUX Kontext. Precise local edits with multiple reference images. |
| flux-2-pro-edit | Black Forest Labs | Multi-reference editing | Supports up to 9 reference images. Best for complex compositing. |
| gpt-image-edit | OpenAI | Surgical inpainting | Mask-based inpainting with GPT Image. High fidelity on detail work. |
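The switch between generation and editing can be sketched as a small client-side helper. This is only an illustration of the rule described above (an image alongside a prompt means editing); the function name `default_image_model` and its arguments are hypothetical and not part of the Gunni API.

```python
from typing import Optional, Sequence

# Defaults taken from the Image Generation and Image Editing tables.
DEFAULT_GENERATION_MODEL = "nano-banana"
DEFAULT_EDIT_MODEL = "nano-banana-edit"


def default_image_model(prompt: str, images: Optional[Sequence[str]] = None) -> str:
    """Return the default model name for an image request.

    Hypothetical helper: editing applies when at least one input image
    accompanies the prompt; otherwise the request is plain generation.
    """
    if not prompt:
        raise ValueError("a prompt is required")
    return DEFAULT_EDIT_MODEL if images else DEFAULT_GENERATION_MODEL
```

Calling `default_image_model("a red bicycle")` yields the generation default, while adding one or more image references yields the editing default.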
Upscale
Upscale models increase image resolution. Pass upscale: true to trigger this mode. Use the scale parameter to choose 2x or 4x.
| Model | Provider | Best for | Notes |
|---|---|---|---|
| topaz-upscale (default) | Topaz Labs | Industry-standard upscaling | 2x or 4x. Best-in-class detail recovery. Used by professionals for print-ready output. |
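As a rough sketch, an upscale request could be assembled and validated like this before sending. The `upscale` and `scale` parameter names come from the text above; the overall payload shape, the `image` and `model` keys, and the default of 2x are assumptions made for illustration.

```python
def build_upscale_request(image: str, scale: int = 2) -> dict:
    """Build a request body for upscale mode.

    Assumed payload shape; only `upscale` and `scale` are named in the docs.
    """
    if scale not in (2, 4):
        raise ValueError("scale must be 2 or 4")
    return {
        "image": image,            # assumed key for the input image
        "upscale": True,           # triggers upscale mode
        "scale": scale,            # 2x or 4x
        "model": "topaz-upscale",  # default upscale model
    }
```

Rejecting unsupported scale values on the client avoids spending a round trip on a request that cannot succeed.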
Video
Video models generate short clips from an image (image-to-video) or a text prompt (text-to-video). The -t2v suffix denotes a text-to-video variant; the base model is image-to-video.
| Model | Provider | Best for | Notes |
|---|---|---|---|
| kling-v3-pro (default i2v) | Kling AI | Cinematic motion | Image-to-video. Fluid motion, cinematic quality. |
| kling-v3-pro-t2v (default t2v) | Kling AI | Text-to-video | Auto-selected when no input image is provided. |
| veo-3.1 | Google | Video with sound | Image-to-video. Generates synchronized audio alongside video. |
| veo-3.1-t2v | Google | Text-to-video with sound | Text-to-video variant of Veo 3.1. Includes generated audio. |
| veo-3.1-fast | Google | Budget video (i2v) | 62% cheaper than veo-3.1. Lower quality. Good for drafts. |
| veo-3.1-fast-t2v | Google | Budget text-to-video | Fast text-to-video. Cost-optimized for iteration. |
| minimax-i2v | MiniMax | High-resolution i2v | Hailuo 2.3 Pro. Image-to-video at 1080p. |
| wan-2.6 | Alibaba | Video with audio | Wan 2.6. Up to 1080p with audio generation. |
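The -t2v naming convention lends itself to a simple lookup. The sketch below mirrors the table: it maps each image-to-video model that has a listed text-to-video variant to that variant, and falls back to the Kling defaults. The helper name `resolve_video_model` and its arguments are hypothetical, and the pass-through behavior for explicit choices is an assumption.

```python
from typing import Optional

DEFAULT_I2V = "kling-v3-pro"

# Image-to-video models that have a -t2v variant in the table.
T2V_VARIANTS = {
    "kling-v3-pro": "kling-v3-pro-t2v",
    "veo-3.1": "veo-3.1-t2v",
    "veo-3.1-fast": "veo-3.1-fast-t2v",
}


def resolve_video_model(model: Optional[str] = None, has_image: bool = False) -> str:
    """Pick the video model name for a request (hypothetical helper)."""
    base = model or DEFAULT_I2V
    if has_image or base.endswith("-t2v"):
        # Assumption: with an image, or with an explicit -t2v name,
        # the chosen name is used unchanged.
        return base
    if base not in T2V_VARIANTS:
        raise ValueError(f"{base} has no text-to-video variant in the table")
    return T2V_VARIANTS[base]
```

With no model and no image this returns `kling-v3-pro-t2v`, matching the auto-selection noted in the table.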
Audio
Text-to-speech models for natural voice synthesis. Pass a voice parameter to select a specific voice within a model.
| Model | Provider | Best for | Notes |
|---|---|---|---|
| minimax-speech (default) | MiniMax | High-quality TTS | Multiple voices. Excellent prosody and naturalness. |
| elevenlabs-tts | ElevenLabs | Conversational TTS | Natural conversational tone. Good for narration and dialogue. |
Lipsync
Lipsync models synchronize audio to a video or animate a portrait image to speak. Provide a video for lip sync mode, or an image for avatar mode.
| Model | Provider | Best for | Notes |
|---|---|---|---|
| kling-lipsync (default) | Kling AI | Best quality lip sync | Lip sync to existing video. Highest quality. |
| kling-avatar | Kling AI | Talking head avatar | Animate a still portrait image to speak audio. |
| sync-lipsync | Sync Labs | Natural motion lipsync | Advanced lip sync with natural head and body motion. |
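The two modes can be told apart by which visual input is supplied. The following is a minimal sketch, assuming that a request carries either a video or an image but not both, and that `kling-avatar` is the model chosen for avatar mode; the function name and return shape are hypothetical.

```python
from typing import Optional, Tuple


def lipsync_mode(video: Optional[str] = None, image: Optional[str] = None) -> Tuple[str, str]:
    """Return (mode, model) for a lipsync request (hypothetical helper)."""
    if video and image:
        raise ValueError("provide a video or an image, not both")
    if video:
        # Lip sync mode: sync audio onto an existing video.
        return ("lipsync", "kling-lipsync")
    if image:
        # Avatar mode: animate a still portrait (model choice is an assumption).
        return ("avatar", "kling-avatar")
    raise ValueError("a video or an image is required")
```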
Describe & Utility
Utility models analyze or transform images without generating new content from scratch.
| Model | Provider | Operation | Notes |
|---|---|---|---|
| florence-2 | Microsoft | Image captioning / describe | Detailed descriptions of image content, style, and composition. Default for describe mode. |
| bria-bg-remove | Bria AI | Background removal | Clean background removal. Default when remove_bg: true. |
Pricing
Gunni uses a credit system. Credits are consumed per generation based on the model tier and output duration.
| Tier | Credits | Models |
|---|---|---|
| Standard | 1 credit | nano-banana, recraft-v4, gpt-image-mini, minimax-speech, elevenlabs-tts, florence-2, bria-bg-remove |
| Premium | 2 credits | flux-2-pro, gpt-image, nano-banana-edit, flux-kontext, flux-2-pro-edit, gpt-image-edit, topaz-upscale |
| Video (under 10s) | base rate | All video models at base credit rate |
| Video (10s+) | 2× multiplier | Doubles the base credit cost for longer clips |
| Lipsync | 10 credits | kling-lipsync, kling-avatar, sync-lipsync |
New accounts receive 10 free credits. Purchase additional credits at gunni.ai/billing.
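The credit arithmetic above can be sketched as follows. The flat tiers are copied from the table; the per-model video base rate is not listed here, so it is passed in as an argument, and the value used in any call is purely illustrative. Function names are hypothetical.

```python
STANDARD = {"nano-banana", "recraft-v4", "gpt-image-mini", "minimax-speech",
            "elevenlabs-tts", "florence-2", "bria-bg-remove"}
PREMIUM = {"flux-2-pro", "gpt-image", "nano-banana-edit", "flux-kontext",
           "flux-2-pro-edit", "gpt-image-edit", "topaz-upscale"}
LIPSYNC = {"kling-lipsync", "kling-avatar", "sync-lipsync"}


def flat_credits(model: str) -> int:
    """Credits per generation for the flat-rate tiers."""
    if model in STANDARD:
        return 1
    if model in PREMIUM:
        return 2
    if model in LIPSYNC:
        return 10
    raise KeyError(f"{model} is not in a flat-rate tier")


def video_credits(base_rate: float, duration_seconds: float) -> float:
    """Video cost: the base rate, doubled for clips of 10 seconds or more."""
    multiplier = 2 if duration_seconds >= 10 else 1
    return base_rate * multiplier
```

At these rates, the 10 free credits cover ten Standard generations, five Premium generations, or a single lipsync job.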
Model Selection Tips
Not sure which model to use? Start here.
Image generation
- nano-banana: it handles most tasks well and is fast.
- recraft-v4 when the image needs to contain readable text (labels, logos, posters).
- flux-2-pro for photorealistic output where detail matters (product shots, portraits).
- gpt-image when you need the most accurate text rendering in complex scenes.

Image editing
- nano-banana-edit for most edits: it follows natural language instructions well.
- flux-kontext for targeted, local edits (change one element without affecting the rest).
- flux-2-pro-edit when compositing from many reference images.
- gpt-image-edit for surgical inpainting on specific masked regions.

Video
- kling-v3-pro for most image-to-video work: cinematic output with fluid motion.
- veo-3.1 when you want synchronized audio in the output.
- veo-3.1-fast or veo-3.1-fast-t2v for low-cost draft iterations.
- minimax-i2v for high-resolution 1080p video from a still image.

Audio & lipsync
- minimax-speech for narration and voiceover.
- elevenlabs-tts for conversational or dialogue-style speech.
- kling-lipsync to sync existing audio onto a talking-head video.
- kling-avatar to animate a portrait still image to speak.