Documentation

Overview

Gunni provides access to 17+ models across Google, Black Forest Labs, OpenAI, Kling, Topaz, ElevenLabs, and more — all through a single API key. You never need to manage provider accounts or swap endpoints. Gunni routes each request to the right provider automatically.

Models are grouped by capability. Each section below lists the available models, which one is the default, and what each is optimized for.
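To make the single-key model concrete, here is a minimal sketch of assembling a request. The endpoint URL, header, and payload field names below are illustrative assumptions, not Gunni's documented wire format — check your dashboard for the real values.

```python
# Hypothetical sketch of a Gunni request. The endpoint and field
# names are assumptions for illustration only.

def build_request(prompt, model="nano-banana", api_key="YOUR_KEY", **options):
    """Assemble a request dict; Gunni routes it to the provider for `model`."""
    return {
        "url": "https://api.gunni.ai/v1/generate",        # assumed endpoint
        "headers": {"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        "json": {"model": model, "prompt": prompt, **options},
    }

req = build_request("a red fox in snow", model="flux-2-pro")
```

Switching providers is just a change to the `model` string; the key and endpoint stay the same.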

Image Generation

These models generate a new image from a text prompt.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| nano-banana (default) | Google Gemini | All-round generation | Up to 4K. Best starting point for most use cases. |
| recraft-v4 | Recraft | Text in image | Specialist for generating images that include legible text and typography. |
| flux-2-pro | Black Forest Labs | Photorealism | FLUX 2 Pro. Best for photorealistic images with fine detail. |
| gpt-image | OpenAI | Text rendering | GPT Image 1.5. Best text rendering in complex scenes. |
| gpt-image-mini | OpenAI | Fast, cost-optimized | Lighter variant of gpt-image. Faster and cheaper for bulk or draft generation. |

Image Editing

These models edit an existing image using a text prompt or reference images. Editing mode is triggered when you pass an image parameter alongside a prompt.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| nano-banana-edit (default) | Google Gemini | Multi-image editing | Accepts multiple reference images. Strong instruction following. |
| flux-kontext | Black Forest Labs | Targeted local edits | FLUX Kontext. Precise local edits with multiple reference images. |
| flux-2-pro-edit | Black Forest Labs | Multi-reference editing | Supports up to 9 reference images. Best for complex compositing. |
| gpt-image-edit | OpenAI | Surgical inpainting | Mask-based inpainting with GPT Image. High fidelity on detail work. |
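The generation/editing trigger can be pictured as a small dispatch rule: a reference image in the request selects the editing default instead of the generation default. The model names follow the tables above; the function itself is an illustrative sketch, not Gunni's routing code.

```python
# Hypothetical sketch: an image parameter alongside a prompt switches
# the request from generation to editing.

def default_image_model(params):
    """Pick the default model based on whether a reference image is present."""
    if "image" in params or "images" in params:
        return "nano-banana-edit"   # editing default
    return "nano-banana"            # generation default

model = default_image_model({"prompt": "add a hat", "image": "cat.png"})
# selects "nano-banana-edit"
```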

Upscale

Upscale models increase image resolution. Pass upscale: true to trigger this mode. Use the scale parameter to choose 2x or 4x.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| topaz-upscale (default) | Topaz Labs | Industry-standard upscaling | 2x or 4x. Best-in-class detail recovery. Used by professionals for print-ready output. |
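A sketch of an upscale request body, validating the documented scale options. The parameter names come from the section above; the payload shape itself is an assumption.

```python
# Hypothetical sketch of an upscale request body. Only 2x and 4x
# are documented, so anything else is rejected up front.

def upscale_request(image, scale=2):
    """Build an upscale request for the default topaz-upscale model."""
    if scale not in (2, 4):
        raise ValueError("scale must be 2 or 4")
    return {"model": "topaz-upscale", "image": image,
            "upscale": True, "scale": scale}

body = upscale_request("photo.png", scale=4)
```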

Video

Video models generate short clips from an image (image-to-video) or a text prompt (text-to-video). The -t2v suffix denotes a text-to-video variant; the base model is image-to-video.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| kling-v3-pro (default i2v) | Kling AI | Cinematic motion | Image-to-video. Fluid motion, cinematic quality. |
| kling-v3-pro-t2v (default t2v) | Kling AI | Text-to-video | Auto-selected when no input image is provided. |
| veo-3.1 | Google | Video with sound | Image-to-video. Generates synchronized audio alongside video. |
| veo-3.1-t2v | Google | Text-to-video with sound | Text-to-video variant of Veo 3.1. Includes generated audio. |
| veo-3.1-fast | Google | Budget video (i2v) | 62% cheaper than veo-3.1. Lower quality. Good for drafts. |
| veo-3.1-fast-t2v | Google | Budget text-to-video | Fast text-to-video. Cost-optimized for iteration. |
| minimax-i2v | MiniMax | High-resolution i2v | Hailuo 2.3 Pro. Image-to-video at 1080p. |
| wan-2.6 | Alibaba | Video with audio | Wan 2.6. Up to 1080p with audio generation. |
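The i2v/t2v default selection described above can be sketched as a one-line rule. The suffix convention comes from this section; the function is illustrative, not Gunni's actual selection logic.

```python
# Hypothetical sketch: the i2v default is used when an input image is
# provided, and its -t2v variant is auto-selected otherwise.

def default_video_model(image=None):
    """Return kling-v3-pro for image-to-video, kling-v3-pro-t2v otherwise."""
    return "kling-v3-pro" if image is not None else "kling-v3-pro-t2v"
```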

Audio

Text-to-speech models for natural voice synthesis. Pass a voice parameter to select a specific voice within a model.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| minimax-speech (default) | MiniMax | High-quality TTS | Multiple voices. Excellent prosody and naturalness. |
| elevenlabs-tts | ElevenLabs | Conversational TTS | Natural conversational tone. Good for narration and dialogue. |

Lipsync

Lipsync models synchronize audio to a video or animate a portrait image to speak. Provide a video for lip sync mode, or an image for avatar mode.

| Model | Provider | Best for | Notes |
|---|---|---|---|
| kling-lipsync (default) | Kling AI | Best quality lip sync | Lip sync to existing video. Highest quality. |
| kling-avatar | Kling AI | Talking head avatar | Animate a still portrait image to speak audio. |
| sync-lipsync | Sync Labs | Natural motion lipsync | Advanced lip sync with natural head and body motion. |
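The two lipsync modes map directly onto the input type: a video selects lip sync mode, an image selects avatar mode. A hedged sketch of that dispatch (defaults follow the table above; the function is illustrative):

```python
# Hypothetical sketch: a video input means lip sync mode, an image
# input means avatar mode.

def default_lipsync_model(video=None, image=None):
    """Pick the default lipsync model from the provided input type."""
    if video is not None:
        return "kling-lipsync"   # sync audio onto an existing video
    if image is not None:
        return "kling-avatar"    # animate a still portrait
    raise ValueError("lipsync requires a video or an image input")
```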

Describe & Utility

Utility models analyze or transform images without generating new content from scratch.

| Model | Provider | Operation | Notes |
|---|---|---|---|
| florence-2 | Microsoft | Image captioning / describe | Detailed descriptions of image content, style, and composition. Default for describe mode. |
| bria-bg-remove | Bria AI | Background removal | Clean background removal. Default when remove_bg: true. |

Pricing

Gunni uses a credit system. Credits are consumed per generation based on the model tier and output duration.

| Tier | Credits | Models |
|---|---|---|
| Standard | 1 credit | nano-banana, recraft-v4, gpt-image-mini, minimax-speech, elevenlabs-tts, florence-2, bria-bg-remove |
| Premium | 2 credits | flux-2-pro, gpt-image, nano-banana-edit, flux-kontext, flux-2-pro-edit, gpt-image-edit, topaz-upscale |
| Video (under 10s) | base rate | All video models at base credit rate |
| Video (10s+) | 2× multiplier | Doubles the base credit cost for longer clips |
| Lipsync | 10 credits | kling-lipsync, kling-avatar, sync-lipsync |
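The pricing rules above can be read as a small cost function. A sketch under stated assumptions: tier membership follows the table, but per-model video base rates are not listed here, so the base rate is a parameter rather than a looked-up value.

```python
# Sketch of the credit rules above. Video base rates are not listed
# in this section, so the caller supplies one.

STANDARD = {"nano-banana", "recraft-v4", "gpt-image-mini", "minimax-speech",
            "elevenlabs-tts", "florence-2", "bria-bg-remove"}
PREMIUM = {"flux-2-pro", "gpt-image", "nano-banana-edit", "flux-kontext",
           "flux-2-pro-edit", "gpt-image-edit", "topaz-upscale"}
LIPSYNC = {"kling-lipsync", "kling-avatar", "sync-lipsync"}

def credit_cost(model, duration_s=0, video_base_rate=None):
    """Credits consumed for one generation, per the pricing table."""
    if model in STANDARD:
        return 1
    if model in PREMIUM:
        return 2
    if model in LIPSYNC:
        return 10
    if video_base_rate is not None:
        # Video: clips of 10s or longer double the base rate.
        return video_base_rate * (2 if duration_s >= 10 else 1)
    raise ValueError(f"unknown model: {model}")
```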

New accounts receive 10 free credits. Purchase additional credits at gunni.ai/billing.

Model Selection Tips

Not sure which model to use? Start here.

Image generation

- Default to nano-banana — it handles most tasks well and is fast.
- Use recraft-v4 when the image needs to contain readable text (labels, logos, posters).
- Use flux-2-pro for photorealistic output where detail matters (product shots, portraits).
- Use gpt-image when you need the most accurate text rendering in complex scenes.
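The heuristics above can be sketched as a picker. The boolean flags are illustrative names invented for this example, not API parameters.

```python
# Sketch of the image-generation tips as a selection function.
# The flag names are hypothetical, not part of the Gunni API.

def pick_image_model(needs_text=False, photorealistic=False,
                     complex_scene_text=False):
    """Apply the tips above in priority order."""
    if complex_scene_text:
        return "gpt-image"      # most accurate text in complex scenes
    if needs_text:
        return "recraft-v4"     # legible labels, logos, posters
    if photorealistic:
        return "flux-2-pro"     # fine detail, product shots, portraits
    return "nano-banana"        # fast all-round default
```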

Image editing

- Default to nano-banana-edit for most edits — it follows natural language instructions well.
- Use flux-kontext for targeted, local edits (change one element without affecting the rest).
- Use flux-2-pro-edit when compositing from many reference images.
- Use gpt-image-edit for surgical inpainting on specific masked regions.

Video

- Start with kling-v3-pro for most image-to-video work — cinematic output with fluid motion.
- Use veo-3.1 when you want synchronized audio in the output.
- Use veo-3.1-fast or veo-3.1-fast-t2v for low-cost draft iterations.
- Use minimax-i2v for high-resolution 1080p video from a still image.

Audio & lipsync

- Default to minimax-speech for narration and voiceover.
- Use elevenlabs-tts for conversational or dialogue-style speech.
- Use kling-lipsync to sync existing audio onto a talking-head video.
- Use kling-avatar to animate a portrait still image to speak.