Documentation

Overview

Gunni provides access to 17+ models across Google, Black Forest Labs, OpenAI, Kling, Topaz, ElevenLabs, and more — all through a single API key. You never need to manage provider accounts or swap endpoints. Gunni routes each request to the right provider automatically.

Models are grouped by capability. Each section below lists the available models, which one is the default, and what each is optimized for.
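To make the single-key model concrete, here is a minimal sketch of assembling a request. The endpoint URL, header, and payload field names below are illustrative assumptions, not Gunni's documented wire format — check your dashboard for the real values.

```python
# Hypothetical sketch of a Gunni request. The endpoint and field
# names are assumptions for illustration only.

def build_request(prompt, model="nano-banana", api_key="YOUR_KEY", **options):
    """Assemble a request dict; Gunni routes it to the provider for `model`."""
    return {
        "url": "https://api.gunni.ai/v1/generate",        # assumed endpoint
        "headers": {"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        "json": {"model": model, "prompt": prompt, **options},
    }

req = build_request("a red fox in snow", model="flux-2-pro")
```

Switching providers is just a change to the `model` string; the key and endpoint stay the same.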

Image Generation

These models generate a new image from a text prompt.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| nano-banana (default) | Google Gemini | All-round generation | Up to 4K. Best starting point for most use cases. |
| recraft-v4 | Recraft | Text in image | Specialist for generating images that include legible text and typography. |
| flux-2-pro | Black Forest Labs | Photorealism | FLUX 2 Pro. Best for photorealistic images with fine detail. |
| gpt-image | OpenAI | Text rendering | GPT Image 1.5. Best text rendering in complex scenes. |
| gpt-image-mini | OpenAI | Fast, cost-optimized | Lighter variant of gpt-image. Faster and cheaper for bulk or draft generation. |

Image Editing

These models edit an existing image using a text prompt or reference images. Editing mode is triggered when you pass an image parameter alongside a prompt.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| nano-banana-edit (default) | Google Gemini | Multi-image editing | Accepts multiple reference images. Strong instruction following. |
| flux-kontext | Black Forest Labs | Targeted local edits | FLUX Kontext. Precise local edits with multiple reference images. |
| flux-2-pro-edit | Black Forest Labs | Multi-reference editing | Supports up to 9 reference images. Best for complex compositing. |
| gpt-image-edit | OpenAI | Surgical inpainting | Mask-based inpainting with GPT Image. High fidelity on detail work. |
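The generation/editing trigger can be pictured as a small dispatch rule: a reference image in the request selects the editing default instead of the generation default. The model names follow the tables above; the function itself is an illustrative sketch, not Gunni's routing code.

```python
# Hypothetical sketch: an image parameter alongside a prompt switches
# the request from generation to editing.

def default_image_model(params):
    """Pick the default model based on whether a reference image is present."""
    if "image" in params or "images" in params:
        return "nano-banana-edit"   # editing default
    return "nano-banana"            # generation default

model = default_image_model({"prompt": "add a hat", "image": "cat.png"})
# selects "nano-banana-edit"
```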

Upscale

Upscale models increase image resolution. Pass upscale: true to trigger this mode. Use the scale parameter to choose 2x or 4x.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| topaz-upscale (default) | Topaz Labs | Industry-standard upscaling | 2x or 4x. Best-in-class detail recovery. Used by professionals for print-ready output. |
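A sketch of an upscale request body, validating the documented scale options. The parameter names come from the section above; the payload shape itself is an assumption.

```python
# Hypothetical sketch of an upscale request body. Only 2x and 4x
# are documented, so anything else is rejected up front.

def upscale_request(image, scale=2):
    """Build an upscale request for the default topaz-upscale model."""
    if scale not in (2, 4):
        raise ValueError("scale must be 2 or 4")
    return {"model": "topaz-upscale", "image": image,
            "upscale": True, "scale": scale}

body = upscale_request("photo.png", scale=4)
```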

Video

Video models generate short clips from an image (image-to-video) or a text prompt (text-to-video). The -t2v suffix denotes a text-to-video variant; the base model is image-to-video.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| kling-v3-pro (default i2v) | Kling AI | Cinematic motion | Image-to-video. Fluid motion, cinematic quality. |
| kling-v3-pro-t2v (default t2v) | Kling AI | Text-to-video | Auto-selected when no input image is provided. |
| veo-3.1 | Google | Video with sound | Image-to-video. Generates synchronized audio alongside video. |
| veo-3.1-t2v | Google | Text-to-video with sound | Text-to-video variant of Veo 3.1. Includes generated audio. |
| veo-3.1-fast | Google | Budget video (i2v) | 62% cheaper than veo-3.1. Lower quality. Good for drafts. |
| veo-3.1-fast-t2v | Google | Budget text-to-video | Fast text-to-video. Cost-optimized for iteration. |
| minimax-i2v | MiniMax | High-resolution i2v | Hailuo 2.3 Pro. Image-to-video at 1080p. |
| wan-2.6 | Alibaba | Video with audio | Wan 2.6. Up to 1080p with audio generation. |
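The i2v/t2v default selection described above can be sketched as a one-line rule. The suffix convention comes from this section; the function is illustrative, not Gunni's actual selection logic.

```python
# Hypothetical sketch: the i2v default is used when an input image is
# provided, and its -t2v variant is auto-selected otherwise.

def default_video_model(image=None):
    """Return kling-v3-pro for image-to-video, kling-v3-pro-t2v otherwise."""
    return "kling-v3-pro" if image is not None else "kling-v3-pro-t2v"
```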

Audio

Text-to-speech models for natural voice synthesis. Pass a voice parameter to select a specific voice within a model.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| minimax-speech (default) | MiniMax | High-quality TTS | Multiple voices. Excellent prosody and naturalness. |
| elevenlabs-tts | ElevenLabs | Conversational TTS | Natural conversational tone. Good for narration and dialogue. |

Lipsync

Lipsync models synchronize audio to a video or animate a portrait image to speak. Provide a video for lip sync mode, or an image for avatar mode.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| kling-lipsync (default) | Kling AI | Best quality lip sync | Lip sync to existing video. Highest quality. |
| kling-avatar | Kling AI | Talking head avatar | Animate a still portrait image to speak audio. |
| sync-lipsync | Sync Labs | Natural motion lipsync | Advanced lip sync with natural head and body motion. |
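The two lipsync modes map directly onto the input type: a video selects lip sync mode, an image selects avatar mode. A hedged sketch of that dispatch (defaults follow the table above; the function is illustrative):

```python
# Hypothetical sketch: a video input means lip sync mode, an image
# input means avatar mode.

def default_lipsync_model(video=None, image=None):
    """Pick the default lipsync model from the provided input type."""
    if video is not None:
        return "kling-lipsync"   # sync audio onto an existing video
    if image is not None:
        return "kling-avatar"    # animate a still portrait
    raise ValueError("lipsync requires a video or an image input")
```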

Describe & Utility

Utility models analyze or transform images without generating new content from scratch.

| Model | Provider | Operation | Notes |
|---|---|---|---|
| florence-2 | Microsoft | Image captioning / describe | Detailed descriptions of image content, style, and composition. Default for describe mode. |
| bria-bg-remove | Bria AI | Background removal | Clean background removal. Default when remove_bg: true. |

Pricing

Gunni uses a credit system. Credits are consumed per generation based on the model tier and output duration.

| Tier | Credits | Models |
|---|---|---|
| Standard | 1 credit | nano-banana, recraft-v4, gpt-image-mini, minimax-speech, elevenlabs-tts, florence-2, bria-bg-remove |
| Premium | 2 credits | flux-2-pro, gpt-image, nano-banana-edit, flux-kontext, flux-2-pro-edit, gpt-image-edit, topaz-upscale |
| Video (under 10s) | base rate | All video models at base credit rate |
| Video (10s+) | 2× multiplier | Doubles the base credit cost for longer clips |
| Lipsync | 10 credits | kling-lipsync, kling-avatar, sync-lipsync |
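The pricing rules above can be read as a small cost function. A sketch under stated assumptions: tier membership follows the table, but per-model video base rates are not listed here, so the base rate is a parameter rather than a looked-up value.

```python
# Sketch of the credit rules above. Video base rates are not listed
# in this section, so the caller supplies one.

STANDARD = {"nano-banana", "recraft-v4", "gpt-image-mini", "minimax-speech",
            "elevenlabs-tts", "florence-2", "bria-bg-remove"}
PREMIUM = {"flux-2-pro", "gpt-image", "nano-banana-edit", "flux-kontext",
           "flux-2-pro-edit", "gpt-image-edit", "topaz-upscale"}
LIPSYNC = {"kling-lipsync", "kling-avatar", "sync-lipsync"}

def credit_cost(model, duration_s=0, video_base_rate=None):
    """Credits consumed for one generation, per the pricing table."""
    if model in STANDARD:
        return 1
    if model in PREMIUM:
        return 2
    if model in LIPSYNC:
        return 10
    if video_base_rate is not None:
        # Video: clips of 10s or longer double the base rate.
        return video_base_rate * (2 if duration_s >= 10 else 1)
    raise ValueError(f"unknown model: {model}")
```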

New accounts receive 10 free credits. Purchase additional credits at gunni.ai/billing.

Model Selection Tips

Not sure which model to use? Start here.

Image generation

- Default to nano-banana — it handles most tasks well and is fast.
- Use recraft-v4 when the image needs to contain readable text (labels, logos, posters).
- Use flux-2-pro for photorealistic output where detail matters (product shots, portraits).
- Use gpt-image when you need the most accurate text rendering in complex scenes.
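The heuristics above can be sketched as a picker. The boolean flags are illustrative names invented for this example, not API parameters.

```python
# Sketch of the image-generation tips as a selection function.
# The flag names are hypothetical, not part of the Gunni API.

def pick_image_model(needs_text=False, photorealistic=False,
                     complex_scene_text=False):
    """Apply the tips above in priority order."""
    if complex_scene_text:
        return "gpt-image"      # most accurate text in complex scenes
    if needs_text:
        return "recraft-v4"     # legible labels, logos, posters
    if photorealistic:
        return "flux-2-pro"     # fine detail, product shots, portraits
    return "nano-banana"        # fast all-round default
```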

Image editing

- Default to nano-banana-edit for most edits — it follows natural language instructions well.
- Use flux-kontext for targeted, local edits (change one element without affecting the rest).
- Use flux-2-pro-edit when compositing from many reference images.
- Use gpt-image-edit for surgical inpainting on specific masked regions.

Video

- Start with kling-v3-pro for most image-to-video work — cinematic output with fluid motion.
- Use veo-3.1 when you want synchronized audio in the output.
- Use veo-3.1-fast or veo-3.1-fast-t2v for low-cost draft iterations.
- Use minimax-i2v for high-resolution 1080p video from a still image.

Audio & lipsync

- Default to minimax-speech for narration and voiceover.
- Use elevenlabs-tts for conversational or dialogue-style speech.
- Use kling-lipsync to sync existing audio onto a talking-head video.
- Use kling-avatar to animate a portrait still image to speak.