Two API keys do the work. OPENROUTER_API_KEY for media, LLM, and transcription. ELEVENLABS_API_KEY for voice and music. Nothing else is required, nothing else is supported. The repo's MODELS.md is the file we read before each model call; this page mirrors it, rebuilt on every landing deploy.
How to use this file
- Before any model call, open the matching section. The top pick has the reason it's the top pick.
- For video, also run ralphy models show <id>. It returns the live supported_durations, supported_resolutions, supported_aspect_ratios, supported_frame_images from OR. Don't hand-pick parameters that aren't in those arrays; the submit will fail validation (ralphy generate video runs the check pre-flight; bypass with --no-validate).
- For image, the --size flag is a prompt-level hint, not an enforced constraint. Gemini and gpt image models round to their internal natural sizes (1024², 768×1376, …). If you need an exact dimension, post-process with ralphy video extract-segment or ffmpeg after generation.
- For cost preview, every video gen accepts --dry-run. It prints the resolved request body and cost estimate without spending credits.
- If the task is new (not in this file), DO NOT invent a provider. Tell the user the task is out of scope or needs a model-list extension.
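The pre-flight check described in the list above can be pictured roughly as follows. This is a minimal sketch, not the CLI's actual validator: the field names come from the text, but the element types, the `VideoParams` shape, and the `validateVideoParams` helper are assumptions made for illustration (`supported_frame_images` is left out because its shape is not documented here).

```typescript
// Assumed shape: the field names come from the text above;
// the element types are a guess for illustration.
type VideoModelCaps = {
  supported_durations: number[];
  supported_resolutions: string[];
  supported_aspect_ratios: string[];
};

// Hypothetical request shape, not the CLI's real one.
type VideoParams = { duration: number; resolution: string; aspectRatio: string };

// Returns human-readable problems; an empty array means the request passes.
function validateVideoParams(caps: VideoModelCaps, params: VideoParams): string[] {
  const problems: string[] = [];
  if (caps.supported_durations.indexOf(params.duration) === -1) {
    problems.push("duration " + params.duration + "s not in [" + caps.supported_durations.join(", ") + "]");
  }
  if (caps.supported_resolutions.indexOf(params.resolution) === -1) {
    problems.push("resolution " + params.resolution + " not supported");
  }
  if (caps.supported_aspect_ratios.indexOf(params.aspectRatio) === -1) {
    problems.push("aspect ratio " + params.aspectRatio + " not supported");
  }
  return problems;
}

// Example values copied from the google/veo-3.1 row of the matrix on this page.
const veoCaps: VideoModelCaps = {
  supported_durations: [4, 6, 8],
  supported_resolutions: ["720p", "1080p"],
  supported_aspect_ratios: ["9:16", "16:9"],
};

// A 5s 1:1 request fails on duration and aspect ratio, but not on resolution.
console.log(validateVideoParams(veoCaps, { duration: 5, resolution: "720p", aspectRatio: "1:1" }));
```

Returning a list of problems instead of throwing on the first one lets a caller report every bad parameter in a single pass.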
Image generation
Endpoint: POST /api/v1/chat/completions with modalities: ["image","text"]. Output bytes arrive on choices[0].message.images[0].image_url.url as a data: URI or http URL. cli/lib/providers/media.ts → generateImage() decodes both. Don't make direct fetches to fal.ai or openai.com.
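As an illustration of what "decodes both" can mean, here is a minimal sketch that handles the two URL forms named above. It is not the repo's generateImage(); the `DecodedImage` type and `decodeImageUrl` helper are hypothetical, and the sketch assumes a base64-encoded `data:` URI.

```typescript
// Hypothetical result type: inline bytes for data: URIs, a URL to download otherwise.
type DecodedImage =
  | { kind: "inline"; mime: string; bytes: Uint8Array }
  | { kind: "remote"; url: string };

function decodeImageUrl(url: string): DecodedImage {
  // Assumes the data: URI is base64-encoded.
  const match = /^data:([^;,]+);base64,(.*)$/.exec(url);
  if (match) {
    const binary = atob(match[2]);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
      bytes[i] = binary.charCodeAt(i);
    }
    return { kind: "inline", mime: match[1], bytes: bytes };
  }
  if (/^https?:\/\//.test(url)) {
    // The caller still has to download this URL.
    return { kind: "remote", url: url };
  }
  throw new Error("unsupported image_url scheme");
}

// "iVBORw==" is the base64 form of the first four bytes of a PNG file.
const sample = decodeImageUrl("data:image/png;base64,iVBORw==");
console.log(sample.kind);
```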
| Use case | Model | Price | Why |
|---|---|---|---|
| Default — multi-ref / character consistency | google/gemini-3-pro-image-preview (= nano-banana-pro lineage) | ~$0.15 / image | Holds face / wardrobe / product identity across multiple references. Tolerates ≥4 concurrent in the OR catalog; in-process semaphore caps at 2 (concurrency.ts) to stay under shared-key OR limits. Pass 2-3 --ref images for "same model + same product across 5 scenes" workflows. Made default 2026-05-20 (re-flip from the 2026-05-12 gpt-5.4-image-2 default — multi-ref wins for almost every UGC workflow). |
| Premium typography / label accuracy | openai/gpt-5.4-image-2 | ~$0.20 / image | Best typography on labels, fewer hallucinations on small details, cleanest photorealism for hero product shots where the wordmark must read crisp. Concurrency: 2 in-process (#007 — validated 2026-05-29, NOT hard-capped to 1 as earlier docs claimed). Self-throttled by cli/lib/providers/concurrency.ts; the old 403 "Key limit exceeded" misread is now surfaced as "concurrent-call limit on …; try --concurrency 1 or switch model". |
| Budget OpenAI | openai/gpt-5-image-mini | ~$0.08 / image | Cheap iteration during prompt exploration. |
| Cheapest viable | google/gemini-2.5-flash-image | ~$0.02 / image | Smoke-test only — quality dip is visible. |
Reference images: --ref accepts URL, local path, or data: URI. Local paths are auto-converted to data: URI in-process; there's no upload step. Both gpt-5.4-image-2 and gemini-3-pro-image-preview accept image inputs; gemini is much better at multi-ref consistency (2+ refs).
Size / aspect ratio: ralphy generate image maps --size WxH to OpenRouter's structured image_config.aspect_ratio (nearest of 1:1 2:3 3:2 3:4 4:3 4:5 5:4 9:16 16:9 21:9) and forwards it alongside the in-prompt hint (openrouter.ts → sizeToAspectRatio). This is the only reliable lever for non-square output: openai/gpt-5.4-image-2 ignores the in-prompt size hint and defaults to 1024² unless image_config is sent — with it, --size 1080x1920 returns a real 9:16 (720×1280 native bucket), validated on loud-kids-poster-001 (2026-05-27). The result still lands on the model's nearest native bucket (1024² for 1:1, ~768×1376 / 720×1280 for 9:16, ~1280×720 for 16:9), not pixel-exact WxH. Downstream HyperFrames / ffmpeg compositions handle the scale-to-cover. image_size (1K/2K/4K) is available in the OR API but not yet wired to a flag.
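A minimal sketch of the WxH-to-bucket mapping follows. It is not the repo's sizeToAspectRatio: the helper name `nearestAspectRatio` is hypothetical, and measuring "nearest" as the distance between log-ratios is an assumption (the real function may use a different metric). The ratio list is the one quoted above.

```typescript
// The ten ratios listed in the paragraph above.
const ASPECT_RATIOS = ["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"];

// Assumption: "nearest" is measured on log(width / height), so that
// 2:3 and 3:2 are equally far from 1:1.
function nearestAspectRatio(width: number, height: number): string {
  const target = Math.log(width / height);
  let best = ASPECT_RATIOS[0];
  let bestDist = Infinity;
  for (const label of ASPECT_RATIOS) {
    const parts = label.split(":");
    const dist = Math.abs(Math.log(Number(parts[0]) / Number(parts[1])) - target);
    if (dist < bestDist) {
      best = label;
      bestDist = dist;
    }
  }
  return best;
}

console.log(nearestAspectRatio(1080, 1920));
```

Under this metric, `--size 1080x1920` lands on the 9:16 bucket, matching the behavior validated in the paragraph above.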
Prompt cookbook: mode-by-mode masters in docs/prompts/image/ (product-shot, lifestyle-scene, closeup-with-person, macro-detail, flat-lay, virtual-model-tryout, hero-banner, conceptual-product, iteration-edit). The agent fills slots from a user request, then calls ralphy generate image --prompt "<filled>". There's no new CLI flag for this; it's a curated library, not a feature.
Avoid:
- Any model more than a year old (stable-diffusion-xl, flux/schnell, dall-e-3). Quality is below the current top picks at the same price.
- gpt-image-1: legacy line. gpt-5.4-image-2 is the current stable OpenAI image model (not to be confused with the "gpt-image-2" naming some external docs use).
- Hard-coded fal.ai endpoints. We left fal.ai behind in Sprint 2.
Failure modes (per-model quick reference)
Append-only. Add a row when a new quirk costs >5 min of debugging or a re-roll.
| Model | Failure mode | Workaround |
|---|---|---|
| google/gemini-3-pro-image-preview | "skeleton null" transient: response returns finish_reason: null, content: null, native_finish_reason: null with no error body. Sporadic, not prompt-related. | Retry up to 3× (same prompt, same refs). If all 3 fail, fall back to openai/gpt-5.4-image-2 (pass --size so image_config.aspect_ratio is sent, otherwise it defaults to 1024²; in-process concurrency cap is 2). Tracked by cli/lib/providers/openrouter.ts retry loop. |
| google/gemini-3-pro-image-preview | Body-horror IMAGE_SAFETY refusal (cap #10 below): empty content + native_finish_reason: IMAGE_SAFETY on cryptid / skinwalker / Cronenberg prompts. | Route to openai/gpt-5.4-image-2 for the anchor frame; carry scene identity via --ref. |
| google/gemini-3-pro-image-preview | Typography smudging on embedded labels (kanji buttons, LED digits, brand wordmarks). | Switch to openai/gpt-5.4-image-2 when copy must be legible; it holds glyphs cleanly. Lesson from flipper-hypermotion-001. |
| openai/gpt-5.4-image-2 | Concurrency cap = 2 (per #007, validated 2026-05-29; the earlier "cap of 1" was conservative). 3+ parallel may still trip OR's per-key limit and return a misleading 403 "Key limit exceeded (total limit)". | Self-throttled to 2 in-flight by cli/lib/providers/concurrency.ts; the 403 is now rewritten as "concurrent-call limit on …; try --concurrency 1 or switch model — NOT a $ balance issue". For >2 parallel, swap to gemini-3-pro-image-preview. |
| openai/gpt-5.4-image-2 | Ignores in-prompt size hints, defaults to 1024². | Pass image_config.aspect_ratio via --size WxH (mapped automatically in openrouter.ts → sizeToAspectRatio). |
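The "retry up to 3×, then fall back" workaround from the first row can be sketched as below. This is an illustrative outline under stated assumptions, not the retry loop in openrouter.ts: `ImageCall`, `ImageResult`, and `generateWithFallback` are hypothetical names, and the provider call is injected so the logic stays independent of any network code.

```typescript
// Only the three fields named in the table above; everything else is omitted.
type ImageResult = {
  content: string | null;
  finish_reason: string | null;
  native_finish_reason: string | null;
};

// Hypothetical injected provider call: takes a model id, returns the parsed response.
type ImageCall = (model: string) => Promise<ImageResult>;

const PRIMARY_MODEL = "google/gemini-3-pro-image-preview";
const FALLBACK_MODEL = "openai/gpt-5.4-image-2";

// "Skeleton null": all three fields are null and there is no error body.
function isSkeletonNull(r: ImageResult): boolean {
  return r.content === null && r.finish_reason === null && r.native_finish_reason === null;
}

async function generateWithFallback(
  call: ImageCall,
  maxTries: number = 3,
): Promise<{ model: string; result: ImageResult }> {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    const result = await call(PRIMARY_MODEL);
    if (!isSkeletonNull(result)) {
      return { model: PRIMARY_MODEL, result: result };
    }
  }
  // Every primary attempt came back empty: switch model once.
  return { model: FALLBACK_MODEL, result: await call(FALLBACK_MODEL) };
}
```

The sketch retries only on the skeleton-null signature; an IMAGE_SAFETY refusal is deterministic, so retrying it would only burn calls.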
Video generation (text-to-video + image-to-video)
Endpoint: async-job pattern at POST /api/v1/videos. Submit returns { id, status, polling_url }; poll until completed; download via GET /api/v1/videos/{id}/content?index=0 (auth required). The legacy /api/v1/videos/generations returns 404. cli/lib/providers/media.ts → generateVideo() handles the full job lifecycle (15s × 80-poll = 20 min budget; tunable).
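The polling half of that lifecycle can be sketched as follows. This is a simplified illustration, not generateVideo() itself: `pollUntilCompleted` is a hypothetical helper, the status fetch and the sleep are injected, and treating a `"failed"` status as terminal is an assumption (the text above only names `completed`). The defaults mirror the 15s × 80-poll budget.

```typescript
// Subset of the submit/poll payload described above.
type JobStatus = { id: string; status: string };

async function pollUntilCompleted(
  getStatus: () => Promise<JobStatus>,       // injected: wraps the GET on polling_url
  sleep: (ms: number) => Promise<void>,      // injected so callers can fake time
  intervalMs: number = 15000,
  maxPolls: number = 80,
): Promise<JobStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const job = await getStatus();
    if (job.status === "completed") {
      return job;
    }
    // Assumption: a "failed" status is terminal. The source does not list failure states.
    if (job.status === "failed") {
      throw new Error("video job " + job.id + " failed");
    }
    await sleep(intervalMs);
  }
  throw new Error("video job still pending after " + maxPolls + " polls");
}
```

With the defaults, the loop gives up after 80 × 15 s = 20 minutes, which is the budget quoted above.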
Per-model matrix (live from /api/v1/videos/models, snapshot 2026-05-08)
Always recheck via ralphy models list. These arrays change.
| Model | Durations (s) | Resolutions | Aspects | Frame anchors | $/sec billed |
|---|---|---|---|---|---|
| kwaivgi/kling-v3.0-pro | 3-15 | 720p | 9:16, 16:9, 1:1 | first + last | $0.14 ✓ |
| kwaivgi/kling-v3.0-std | 3-15 | 720p | 9:16, 16:9, 1:1 | first + last | $0.14 ✓ (same rate as pro, not ½) |
| kwaivgi/kling-video-o1 | 5, 10 | 720p | 9:16, 16:9, 1:1 | first + last | ~$0.14 |
| google/veo-3.1 | 4, 6, 8 | 720p, 1080p | 9:16, 16:9 | first + last | ~$0.50 |
| google/veo-3.1-fast | 4, 6, 8 | 720p, 1080p, 4K | 9:16, 16:9 | first + last | $0.14 ✓ (earlier ~$0.25 was wrong) |
| google/veo-3.1-lite | 4, 6, 8 | 720p, 1080p | 9:16, 16:9 | first + last | ~$0.09 |
| openai/sora-2-pro | model-dependent | model-dependent | model-dependent | model-dependent | ~$0.50 |
| minimax/hailuo-2.3 | 6, 10 | 1080p | 16:9 only | first only | ~$0.10 |
| alibaba/wan-2.6 | 5, 10 | 720p, 1080p | 9:16, 16:9 | first only | ~$0.10 |
| alibaba/wan-2.7 | 2-10 | 720p, 1080p | 9:16, 16:9, 1:1, 4:3, 3:4 | first + last | ~$0.10 |
| bytedance/seedance-2.0 | 4-15 | 480p, 720p, 1080p | 7 aspects incl. 21:9 cinema | first + last | $0.14 ✓ |
| bytedance/seedance-2.0-fast | 4-15 | 480p, 720p | 7 aspects | first + last | $0.14 ✓ (earlier ~$0.05 was wrong) |
| bytedance/seedance-1-5-pro | 4-12 | 480p, 720p, 1080p | 7 aspects | first + last | ~$0.10 |
Pricing reality check (2026-05-11): OpenRouter bills video generation as a flat per-clip charge fixed by the requested duration: billed cost ≈ rate × duration. For example, a 5s kling-v3.0-pro clip bills about 5 × $0.14 = $0.70. A ✓ in the rate column means we verified the rate against actual OR billing on 2026-05-11 (see docs/render-test-2026-05-11.md §1.1). Earlier docs claimed half-price std and per-second steps that didn't match observation; those have been corrected here and in cli/lib/providers/media.ts:VIDEO_PRICE_PER_SEC. Models without ✓ are ballparks from the OR catalog. Verify on first use and add ✓ once confirmed.
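The rate × duration arithmetic can be written out as a small estimator. This is a sketch only: `VERIFIED_RATE_PER_SEC` and `estimateClipCost` are hypothetical names (the repo's own table is VIDEO_PRICE_PER_SEC in media.ts, whose contents are not reproduced here), and only the ✓-verified rates from the matrix are included.

```typescript
// Rates marked ✓ in the matrix on this page (USD per billed second).
const VERIFIED_RATE_PER_SEC: Record<string, number> = {
  "kwaivgi/kling-v3.0-pro": 0.14,
  "kwaivgi/kling-v3.0-std": 0.14,
  "google/veo-3.1-fast": 0.14,
  "bytedance/seedance-2.0": 0.14,
  "bytedance/seedance-2.0-fast": 0.14,
};

// Estimated charge for one clip, rounded to whole cents.
function estimateClipCost(model: string, durationSec: number): number {
  const rate = VERIFIED_RATE_PER_SEC[model];
  if (rate === undefined) {
    throw new Error("no verified rate for " + model);
  }
  return Math.round(rate * durationSec * 100) / 100;
}

console.log(estimateClipCost("kwaivgi/kling-v3.0-pro", 5)); // 0.7
```

Rounding to cents avoids floating-point noise such as 0.7000000000000001 leaking into a printed estimate.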
When to pick which
| You need | Pick |
|---|---|
| Default narrative i2v, character consistency, hold a keyframe | kwaivgi/kling-v3.0-pro |
| Cheap batch (≥10 clips) where keyframe drift is acceptable | kwaivgi/kling-v3.0-std (NOTE: same per-second price as pro on OR — go to veo-3.1-lite for the real cheap-batch tier) |
| Talking-head / face / lip-sync style | google/veo-3.1 (model-native audio with --audio works in EN; off for RU/UA — only Chinese + English are clean) |
| 4K mastered hero piece | google/veo-3.1-fast (only model in catalog with 4K) |
| Sharp physics motion (parkour, sports, falling) | bytedance/seedance-2.0 (also the only path to 21:9 cinema aspect) |
| 3:4 / 4:3 portrait magazine | alibaba/wan-2.7 (only model with these in stock) |
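| Cheapest viable | bytedance/seedance-2.0-fast (NOTE: verified at $0.14/s on OR, the same rate as kling and seedance-2.0; google/veo-3.1-lite at ~$0.09/s is the lower catalog rate) |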
Lessons from this session (2026-05-08)
- kwaivgi/kling-v3.0-pro rotates "wide" prompts inside the 9:16 container. Phrases like "wide overhead cityscape", "massive crowd in town square", "dancers under starlit sky" bias the model toward landscape composition; OR returns a 1080×1920 file but the content is laid out for 16:9. Fix: anchor with --first-frame <portrait-image> and rewrite the prompt with explicit vertical wording ("tall vertical portrait shot, low camera angle looking up, narrow alley framing, subjects centered vertically, half-timbered houses tower vertically on both sides"). The first-frame image overrides the model's compositional bias.
- --resolution 720p is silently upgraded to 1080p by kwaivgi/kling-v3.0-pro even though the catalog only lists 720p. The output dimension is whatever the model decides; treat resolution as a soft hint and let downstream HyperFrames / ffmpeg crop+scale to the composition's exact frame.
- OR's per-clip billing is fixed (e.g., a 5s kling-pro clip is ~$0.70 regardless of "duration in body" precision). The per-second figures above are therefore ballparks; run --dry-run pre-flight to see the estimate before submitting.
- generate_audio: true is unsafe outside English. Confirmed for kwaivgi/kling-v3.0-pro, bytedance/seedance-2.0, and google/veo-3.1 on Russian: accent slips, voice age drifts, text gets cut. Default is false; only enable for EN with --audio.
- kwaivgi/kling-v3.0-pro --last-frame historically returned 400 "File is not in a valid base64 format" even with a clean PNG anchor. Root cause was C2PA / EXIF metadata that the provider parsed too eagerly; the 2026-05-19 resolveImageRef() strip helped on single-frame but the multi-frame (first+last together) path kept failing across flipper / glitter-cream / playdate / venom regardless of source. As of 2026-05-30 (#008), generateVideo() preflight-rejects the kling-v3.0-pro multi-frame combination as TerminalProviderError with a pointer to bytedance/seedance-2.0, which honors --last-frame natively across all aspects. No more wasted round-trips. Postmortems flagging this: playdate / flipper / venom / glitter-cream.
- bytedance/seedance-2.0 privacy filter rejects photoreal-human anchors with InputImageSensitiveContentDetected.PrivacyInformation, even when the human was itself AI-generated. Reserve seedance for cartoons / non-human anchors / landscapes / hands / abstract motion. For photoreal humans, default to kwaivgi/kling-v3.0-pro (single-frame only: per #5, the first+last multi-frame combination is preflight-rejected). Postmortems: tokyo / noski / venom.
- kwaivgi/kling-v3.0-pro 2500-char prompt cap. OR returns 400 after a round-trip if you exceed it. As of 2026-05-30 (#008), generateVideo() preflight-rejects prompts >2500 chars as TerminalProviderError before any submit, so no more 4× wasted round-trips per session. The cap lives in a per-model MAX_PROMPT_CHARS map in cli/lib/providers/openrouter.ts; add a row when another model documents a hard cap. Trim atmosphere/setting paragraphs first; voice-tag, no-music, on-camera-EN clauses are load-bearing and should never be cut. Postmortem: glitter-cream.
- ElevenLabs Music 2-concurrent cap per subscription. Three+ parallel calls return 429 concurrent_limit_exceeded and pollute generations.jsonl with error rows. Serialize music gen or stay ≤2 in-flight. Postmortem: tokyo.
- Per-endpoint concurrent-call caps are now self-throttled in-process (cli/lib/providers/concurrency.ts, #007). The semaphore wraps every network round-trip so the CLI never round-trips an over-cap call; the retry helper sits OUTSIDE the semaphore so backoff sleeps don't pin a slot. Current caps:
  - elevenlabs/tts: 3 concurrent (choose-your-guide-001: 9 parallel → 6 hard-failed 429).
  - elevenlabs/music_v1: 2 concurrent (tokyo-y2k-001: 3 parallel → 1 hard-failed 429).
  - openrouter:openai/gpt-5.4-image-2: 2 concurrent (validated 2026-05-29; appstore-takeaminute-001 hit 73/73 403 on uncapped fan-out).
  - openrouter:google/gemini-3-pro-image-preview: 2 concurrent (catalog tolerates ≥4, kept at 2 for shared-key safety).
  - openrouter:bytedance/seedance-2.0: 1 concurrent (queue depth + multi-block extend is inherently sequential).
  - openrouter:kwaivgi/kling-v3.0-pro: 2 concurrent.
  - LLM chat-completions default: 4 concurrent.
  - Default fallback for unknown endpoints: 2 concurrent.
  When the per-key OR cap still trips despite the semaphore (e.g. two ralphy processes sharing one key), shared.ts → rewriteUpstreamError() now surfaces the 403 as "concurrent-call limit on <model>; try --concurrency 1 or switch model. (This is NOT a $ balance issue — check 'ralphy doctor' for credits.)", so the error is no longer misread as a credits problem.
  Follow-up: the in-process semaphore does NOT span processes. The queue daemon's own worker count gates cross-process for now; a file-locked queue is a future enhancement.
  Postmortems: appstore / analog-horror / tokyo / choose-your-guide.
- google/gemini-3-pro-image-preview image-safety filter is materially stricter than openai/gpt-5.4-image-2 on body-horror / cryptid / skinwalker / werewolf register. Gemini returns native_finish_reason: IMAGE_SAFETY with an empty content body even when the prompt uses softened surreal-anatomy language (concentric maws lined with tongue-like protrusions, biological apertures with internal teeth on a recognizable creature in a real-world setting). The reasoning trace literally describes the requested transformation before the filter refuses. openai/gpt-5.4-image-2 accepts the same prompts and delivers; validated on voidstomper-test-001's skinwalker BBQ frame after three gemini refusals. Route rule: for voidstomper-adjacent / Cronenberg / The Thing / mid-warp body-horror anchor images, start at --model openai/gpt-5.4-image-2. Pass --size so image_config.aspect_ratio is sent (without it gpt-image falls back to 1024×1024; at the time of this postmortem the mapping did not exist yet); use --ref <gemini-9:16-scene> to carry scene + character identity, then post-process with ffmpeg pad if your downstream i2v requires exactly matching dimensions. Postmortem: voidstomper-test-001 (2026-05-25).
- google/veo-3.1 body-horror filter rejects both the prompt AND the input frame independently. Sanitizing the prompt (stripping skinwalker / werewolf / vertebrae / Cronenberg words) does NOT unblock the path: Google's filter ALSO scans the first/last-frame anchor and refuses when the anchor itself is clearly body-horror. Combined with seedance's photoreal-human rejection (cap #6), this leaves kwaivgi/kling-v3.0-pro as the only viable i2v provider for the photoreal-human + body-horror combination that voidstomper-style content requires. Skip seedance and veo round-trips on those jobs; go straight to Kling. Postmortem: voidstomper-test-001 (2026-05-25).
- kwaivgi/kling-v3.0-pro and bytedance/seedance-2.0 overshoot --duration by ~1s. Both models return clips ~1 second longer than the requested --duration; the overshoot is silent and is billed against the requested duration, not the delivered one. tokyo-y2k-001 measured 5s/4s/9s storyboard → 6.04/5.04/10.04 actual; planned 75s of clips landed as 90.7s of raw mp4. Editor recipe (pre-shorten at art-director stage, or budget a vision-trim pass) lives in docs/playbooks/editor/render-pipeline.md → Source-clip duration overshoot. Filed long-form: notes/issues/done/042-clip-duration-overshoot-undocumented.md. Postmortem: tokyo-y2k-001 (workflow-fixes #3).
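The self-throttling lesson above (semaphore around each round-trip, retry helper outside it) can be sketched as follows. This is a minimal illustration, not the code in concurrency.ts: the `Semaphore` class, `capFor`, and `withRetry` are hypothetical names, and only a few of the caps listed above are reproduced.

```typescript
// A few caps copied from the list above; the key format mirrors that list.
const ENDPOINT_CAPS: Record<string, number> = {
  "elevenlabs/tts": 3,
  "elevenlabs/music_v1": 2,
  "openrouter:openai/gpt-5.4-image-2": 2,
  "openrouter:bytedance/seedance-2.0": 1,
};
const DEFAULT_CAP = 2; // fallback for unknown endpoints, per the list above

function capFor(endpoint: string): number {
  const cap = ENDPOINT_CAPS[endpoint];
  return cap === undefined ? DEFAULT_CAP : cap;
}

class Semaphore {
  private active = 0;
  private waiters: Array<() => void> = [];
  private cap: number;
  constructor(cap: number) { this.cap = cap; }
  get inFlight(): number { return this.active; }
  get waiting(): number { return this.waiters.length; }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.cap) {
      // Park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => { this.waiters.push(resolve); });
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) { next(); } else { this.active--; } // hand over the slot, or free it
    }
  }
}

// The retry loop wraps sem.run(), so the backoff sleep happens with no slot held.
async function withRetry<T>(sem: Semaphore, task: () => Promise<T>, tries: number, backoffMs: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < tries; attempt++) {
    try {
      return await sem.run(task);
    } catch (err) {
      lastError = err;
      if (attempt < tries - 1) {
        await new Promise<void>((resolve) => { setTimeout(resolve, backoffMs); });
      }
    }
  }
  throw lastError;
}
```

Putting the sleep inside `run()` instead would keep a slot occupied while nothing is in flight, which is the pinning problem the lesson describes. Like the real semaphore, this sketch is per-process only.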
Failure modes (per-model quick reference)
Append-only. Each row is a quirk that has already cost ≥1 re-roll across sessions.
| Model | Failure mode | Workaround |
|---|---|---|
| kwaivgi/kling-v3.0-pro | Multi-frame (first+last) base64 reliably 400s even after the 2026-05-19 C2PA strip; flipper / glitter-cream / playdate / venom all reproduced. | Preflight-rejected client-side as of 2026-05-30 (#008). Use bytedance/seedance-2.0 for multi-frame anchoring (honors --last-frame natively for non-photoreal-human anchors). |
| kwaivgi/kling-v3.0-pro | --audio slips on Russian / UA: accent slip, voice-age drift, occasional cut text. EN is clean. | Use --audio for EN VO only. For RU / non-EN, generate silent clip + post-mix ElevenLabs VO in the editor stage. (Memory: feedback_kling_no_ru_audio.) |
| kwaivgi/kling-v3.0-pro | 2500-char prompt cap: OR returns 400 after a round-trip if exceeded. | Preflight-rejected client-side as of 2026-05-30 (#008). Trim atmosphere/setting paragraphs first; never cut voice-tag, no-music, on-camera-EN clauses (load-bearing). |
| kwaivgi/kling-v3.0-pro | Wide-prompt rotation inside 9:16: landscape wording biases composition to 16:9 inside the portrait container. | Anchor with --first-frame <portrait> and rewrite prompt with explicit vertical wording (cap #1 above). |
| bytedance/seedance-2.0 | Privacy filter blocks photoreal-human i2v anchors: InputImageSensitiveContentDetected.PrivacyInformation, even when the human is AI-generated. | Reserve seedance for cartoons / non-human / landscapes / hands / abstract motion. Photoreal humans → kwaivgi/kling-v3.0-pro. (Memory: feedback_seedance_rejects_realistic_people.) |
| bytedance/seedance-2.0 | Concurrency cap = 1 (effective, per current OR keys). | Serialize multi-clip jobs; parallelize across providers, not within seedance. |
| kling + seedance (both) | --duration overshoots by ~1s (cap #12 above). | Pre-shorten at art-director stage OR budget a vision-trim pass in the editor. See docs/playbooks/editor/render-pipeline.md + notes/issues/042. |
| google/veo-3.1 | Body-horror filter scans the anchor frame independently of the prompt; sanitizing the prompt does not unblock if the input frame reads as body-horror. | For photoreal-human + body-horror, use kwaivgi/kling-v3.0-pro; skip veo + seedance round-trips on those jobs (cap #11 above). |
| google/veo-3.1 | 8s clip cap; longer narrative beats need Kling's 15s. | Use kwaivgi/kling-v3.0-pro for >8s; chain veo clips for talking-head only when the cut is acceptable. |
| google/gemini-3.1-pro-preview (video-analysis via callLLM) | Occasional HTTP 502 with empty body on full-length source mp4s. Vision-trim + ref-analysis pipelines hit this most. | Compress source mp4 to 540×960 CRF28 (ffmpeg -vf scale=540:960 -crf 28) + retry. If still 502, fall back to google/gemini-2.5-flash (lower context but reliable). Last resort: ralphy ref frames + frame-level analysis. |
| elevenlabs/music_v1 | Artist-name ToS rejection: naming any rapper / producer / band returns 400 bad_prompt with a prompt_suggestion (see "Prompt content policy" in the Music section). Modern hip-hop especially. | Genre + tempo + instrumentation only; resubmit the API's prompt_suggestion verbatim. (Memory: feedback_elevenlabs_music_no_artist_names.) |
| elevenlabs/music_v1 | Concurrency cap = 2: 3+ parallel returns 429 concurrent_limit_exceeded and pollutes generations.jsonl. | Serialize music gen or stay ≤2 in-flight (--concurrency 2). |
| elevenlabs/scribe_v1 + voice endpoints | Default Node UA → Cloudflare 403. Geo-blocked regions return HTTP 200 + HTML body cast as .mp3. | Send User-Agent: Mozilla/5.0 (...) header (already wired). From geo-blocked region, fall back to Kokoro for VO and openai/whisper-1 for transcription. (Memory: feedback_elevenlabs_geoblock_html_in_mp3.) |
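The two client-side preflight rejections in the kling rows above can be pictured like this. It is a sketch under assumptions, not generateVideo(): the `VideoSubmit` shape and `preflightVideo` helper are hypothetical, and the sketch returns a message string where the real connector is described as raising TerminalProviderError.

```typescript
// Per-model hard caps; the 2500-char kling entry is the only one documented on this page.
const MAX_PROMPT_CHARS: Record<string, number> = {
  "kwaivgi/kling-v3.0-pro": 2500,
};

// Hypothetical request shape for illustration.
type VideoSubmit = { model: string; prompt: string; firstFrame?: string; lastFrame?: string };

// Returns a rejection reason, or null when the request may be submitted.
function preflightVideo(req: VideoSubmit): string | null {
  const cap = MAX_PROMPT_CHARS[req.model];
  if (cap !== undefined && req.prompt.length > cap) {
    return "prompt is " + req.prompt.length + " chars; " + req.model + " caps at " + cap;
  }
  if (req.model === "kwaivgi/kling-v3.0-pro" && req.firstFrame && req.lastFrame) {
    return "kling-v3.0-pro first+last frame combination is rejected; try bytedance/seedance-2.0";
  }
  return null;
}

console.log(preflightVideo({ model: "kwaivgi/kling-v3.0-pro", prompt: "short prompt", firstFrame: "a.png" }));
```

Rejecting before submit is what saves the round-trip: both conditions are known locally, so there is no reason to wait for the provider's 400.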
Routing rules (which video model for what)
Apply these before drafting a ralphy generate video prompt. They short-circuit re-rolls.
- Hyper-motion (explosions, runway-sprint, coin-arcs, particle bursts, sports collisions, parkour, falling) → bytedance/seedance-2.0. Kling reads these as narrative beats and softens the physics. Caveat: seedance's privacy filter rejects photoreal-human i2v anchors, so only use seedance hyper-motion on cartoon / non-human / landscape / hands / abstract subjects.
- Talking-head, photoreal humans, slow narrative, lip-sync → kwaivgi/kling-v3.0-pro. Single-frame anchors only: the 2026-05-19 C2PA strip fixed single-frame, but the first+last multi-frame combination is preflight-rejected as of 2026-05-30 (#008). --audio for EN VO only.
- 4K hero piece → google/veo-3.1-fast (the only catalog entry with 4K).
- 3:4 / 4:3 portrait magazine aspect → alibaba/wan-2.7 (only model with these in stock).
- Cinema 21:9 → bytedance/seedance-2.0 (only model with 21:9 in stock).
- Photoreal humans + body-horror anchor → kwaivgi/kling-v3.0-pro only. Seedance rejects the anchor, veo rejects both prompt and anchor (caps #6 + #11).
- Cheapest acceptable batch → google/veo-3.1-lite (~$0.09/s) for non-photoreal-human work; otherwise kwaivgi/kling-v3.0-std (same rate as pro but worse keyframe holds).
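The routing rules above can be collapsed into one decision function, sketched below. The `VideoJob` fields and `pickVideoModel` are hypothetical, and the order in which the rules are checked is an assumption: the list above does not say which rule wins when two apply (for example 21:9 with a photoreal human), so treat the precedence as illustrative.

```typescript
// Hypothetical job description; field names are made up for this sketch.
type VideoJob = {
  hyperMotion?: boolean;
  photorealHuman?: boolean;
  bodyHorror?: boolean;
  needs4K?: boolean;
  aspect?: string;
  cheapBatch?: boolean;
};

function pickVideoModel(job: VideoJob): string {
  // Seedance rejects the anchor and veo rejects prompt and anchor, so Kling is the only path.
  if (job.photorealHuman && job.bodyHorror) return "kwaivgi/kling-v3.0-pro";
  if (job.needs4K) return "google/veo-3.1-fast";
  if (job.aspect === "3:4" || job.aspect === "4:3") return "alibaba/wan-2.7";
  if (job.aspect === "21:9") return "bytedance/seedance-2.0";
  // Seedance physics only when the anchor is not a photoreal human.
  if (job.hyperMotion && !job.photorealHuman) return "bytedance/seedance-2.0";
  if (job.cheapBatch) {
    return job.photorealHuman ? "kwaivgi/kling-v3.0-std" : "google/veo-3.1-lite";
  }
  // Talking-head, photoreal humans, slow narrative.
  return "kwaivgi/kling-v3.0-pro";
}

console.log(pickVideoModel({ hyperMotion: true, photorealHuman: true }));
```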
--audio policy: SPEECH vs AMBIENT/DIEGETIC
Two distinct audio jobs hide behind the single --audio flag — different rules apply.
- SPEECH (on-camera VO / dialogue / lip-sync). Stakes are high: voice age, accent, and word-cut are visible.
  - Default: google/veo-3.1 for English on-camera dialogue.
  - kwaivgi/kling-v3.0-pro --audio works for EN only. Slips on RU / UA / non-English (memory feedback_kling_no_ru_audio). For RU, render silent + post-mix ElevenLabs eleven_multilingual_v2 in the editor stage.
  - bytedance/seedance-2.0 --audio is not validated for SPEECH; treat as off.
- AMBIENT / DIEGETIC (background noise, footsteps, crowd murmur, wind, traffic, room tone, prop SFX). Stakes are low: language doesn't apply.
  - Any of kling / seedance / veo --audio: true are fine. Generate alongside the visuals; cheaper than a separate SFX pass.
  - If the brief explicitly bans music (most UGC beds do), keep the "no music — only diegetic ambient + sparse SFX" clause in the prompt regardless of which model.
When in doubt: SPEECH = veo (EN) / kling-EN-only / silent+ElevenLabs (everything else). AMBIENT = whichever video model is already in the shot list.
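The policy above reduces to a small lookup, sketched here. `planAudio` and its types are hypothetical, and the language check assumes a two-letter code such as "en"; nothing here is part of the CLI.

```typescript
type AudioKind = "speech" | "ambient";
type AudioPlan = { audioFlag: boolean; note: string };

// Decides whether to pass --audio for a clip, following the SPEECH vs AMBIENT split above.
function planAudio(kind: AudioKind, model: string, lang: string): AudioPlan {
  if (kind === "ambient") {
    return { audioFlag: true, note: "diegetic ambient; keep the no-music clause if the brief bans music" };
  }
  // SPEECH: only veo and kling are validated, and only for English.
  const speechValidated = model === "google/veo-3.1" || model === "kwaivgi/kling-v3.0-pro";
  if (lang === "en" && speechValidated) {
    return { audioFlag: true, note: "model-native EN speech" };
  }
  return { audioFlag: false, note: "render silent, post-mix ElevenLabs VO in the editor stage" };
}

console.log(planAudio("speech", "kwaivgi/kling-v3.0-pro", "ru"));
```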
Tried-and-dropped (postmortem cross-reference)
| Model | Context where it failed | Why | Postmortem |
|---|---|---|---|
| bytedance/seedance-2.0 | photoreal-human i2v anchors | privacy filter InputImageSensitiveContentDetected.PrivacyInformation | tokyo, noski, venom |
| kwaivgi/kling-v3.0-pro multi-frame | first+last-frame i2v | 400 "not in a valid base64 format"; the 2026-05-19 auto-C2PA-strip in resolveImageRef() fixed single-frame only, and the multi-frame combination is preflight-rejected as of 2026-05-30 (#008) | flipper, playdate, venom, glitter-cream |
| kwaivgi/kling-v3.0-pro --audio | non-English VO | accent slip + voice-age drift | noski, venom |
| google/veo-3.1 | 15s+ clips | 8s cap; needed Kling 15s with --audio instead | kbo |
| openai/gpt-5.4-image-2 | uncapped concurrent batches | 403 "Key limit exceeded"; originally read as a cap of 1 per OR key, now validated at 2 (2026-05-29) and self-throttled | appstore |
| eleven_multilingual_v2 (Ava, Marcus) | analog-horror PSA monotone | too much human inflection; switched to "Alerter" community voice with stability~0.5 / style 0 | analog-horror |
| kwaivgi/kling-v3.0-pro for hyper-motion | 8-cut Japanese product ad | too narrative / slow for explosions; needed seedance physics | flipper |
| google/gemini-3-pro-image-preview | product-fidelity with embedded text (kanji buttons, LED digits) | smudges typography; gpt-5.4-image-2 holds it | flipper |
| Generic "1-bit / pixel-art" prompt vocabulary | duotone halftone aesthetic | ambiguous 3-way (1-bit vs 8-bit vs hand-illustrated); needed named-corpus refs | playdate |
| bytedance/seedance-2.0 t2v + i2v anchor poster | spider-verse skater | image-anchor with baked-in text confuses the model; pure t2v with strong subject-block worked | skater |
Avoid:
- kling-video/v1.6 or v2.x: outdated.
- luma-dream-machine: worse than kling at the same price; out of OR catalog now.
- fal-ai/* endpoints: the stack moved to OpenRouter in Sprint 2.
Voiceover (TTS)
| Use case | Model | Price | Why |
|---|---|---|---|
| Default — Russian | ElevenLabs eleven_multilingual_v2 | subscription | The only path to clean deadpan Russian without regional accent slip. User-owned voice clones work best. |
| English premium | ElevenLabs eleven_v3 | subscription | Most emotional for English. Validated 2026-05-08 against the brainrot test (Adam preset, dramatic narrator, 45-55s). Unstable on Russian; don't use in production for RU. |
Voice settings (deadpan young Russian):
{ "model_id": "eleven_multilingual_v2", "voice_settings": { "stability": 0.55, "similarity_boost": 0.8, "style": 0.25, "use_speaker_boost": true }, "output_format": "mp3_44100_128" }
Voice settings (English brainrot dramatic narrator):
{ "model_id": "eleven_v3", "voice_settings": { "stability": 0.30, "similarity_boost": 0.75, "style": 0.40, "use_speaker_boost": true } }
Voice picks: Adam (pNInz6obpgDQGcFmaJgB) for dramatic narrator, Brian (nPczCjzI2devNBz1zQrb) for Reddit-monotone, Daniel (onwK4e9ZLuTAKqWW03F9) for British-sarcastic.
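To show how the settings and voice IDs above fit together, here is a sketch that assembles a text-to-speech request without sending it. It assumes ElevenLabs' public `/v1/text-to-speech/{voice_id}` route with `output_format` as a query parameter; the repo's connector may pass these differently. `buildTtsRequest` is a hypothetical helper, and the User-Agent value is a parameter because the page elides the exact string.

```typescript
type VoiceSettings = {
  stability: number;
  similarity_boost: number;
  style: number;
  use_speaker_boost: boolean;
};

// Builds a request description; the caller decides how to send it.
function buildTtsRequest(
  voiceId: string,
  text: string,
  modelId: string,
  settings: VoiceSettings,
  apiKey: string,
  userAgent: string, // browser-like UA; exact value is not given on this page
) {
  return {
    // Assumption: output_format travels as a query parameter.
    url: "https://api.elevenlabs.io/v1/text-to-speech/" + voiceId + "?output_format=mp3_44100_128",
    method: "POST",
    headers: {
      "xi-api-key": apiKey,
      "Content-Type": "application/json",
      "User-Agent": userAgent,
    },
    body: JSON.stringify({ text: text, model_id: modelId, voice_settings: settings }),
  };
}

// English dramatic-narrator settings and the Adam voice ID from this section.
const narratorRequest = buildTtsRequest(
  "pNInz6obpgDQGcFmaJgB",
  "Nobody expected the pigeon to win.",
  "eleven_v3",
  { stability: 0.3, similarity_boost: 0.75, style: 0.4, use_speaker_boost: true },
  "YOUR_ELEVENLABS_API_KEY",
  "Mozilla/5.0",
);
console.log(narratorRequest.url);
```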
Failure modes:
- Default UA on Node 20+ → Cloudflare 403. Send User-Agent: Mozilla/5.0 (...).
- Free/starter cap is 3 concurrent → 429. Run sequentially, not in parallel.
- Default library voices (clyde-warvet, daniel-deep, etc.): too theatrical for RU. A custom clone is mandatory for RU production.
- VO drift: dramatic narration on eleven_v3 consistently runs ~15-25% longer than scripted word-count would suggest (Strasbourg 45s scenario rendered at 54.6s). Time-budget compositions to actual VO duration via ralphy project transcribe, not scenario text length.
Avoid:
- OpenAI tts-1-hd on Russian: flat American accent.
- ElevenLabs eleven_v3 on Russian in production: unstable.
Music generation
| Use case | Model | Price | Why |
|---|---|---|---|
| Default — instrumental beds | ElevenLabs Music (music_v1) | subscription (binary audio response) | Same key as VO. Validated 2026-05-08: 8s instrumental delivered cleanly via ralphy generate music. |
Endpoint contract (for cli/lib/providers/media.ts → generateMusic()):
POST https://api.elevenlabs.io/v1/music
Headers: xi-api-key: $ELEVENLABS_API_KEY, Content-Type: application/json
Body: { "prompt": "...", "music_length_ms": 30000, "force_instrumental": true, "output_format": "mp3_44100_128", "model_id": "music_v1" }
Response: 200 → binary mp3 (application/octet-stream)
400 → JSON ToS rejection (`bad_prompt`) — see "Prompt content policy"
422 → JSON validation error
Prompt content policy (#006). ElevenLabs Music ToS rejects prompts that name specific artists / producers / copyrighted tracks (rappers, named producers, song titles, scored themes) with HTTP 400 bad_prompt. The 400 envelope carries a detail.data.prompt_suggestion field containing a provider-sanitized rewrite the CLI can resubmit verbatim. The connector surfaces this on TerminalProviderError.promptSuggestion; ralphy generate music --auto-retry-on-tos-rejection will log the original failure and resubmit once using the rewrite. Use genre + tempo + instrumentation + mood framing instead of named entities (e.g. "trap beat, 140 BPM, 808 sub-bass, dark minor-key piano stab"). The CLI runs a soft pre-submit linter (cli/lib/music-prompt-lint.ts) against a known artist/track regex set and warns to stderr before submit — non-blocking, false positives are cheaper than missed catches.
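The resubmit-once behavior described above can be sketched like this. It is an illustration, not the connector: `MusicResponse`, `MusicCall`, and `generateMusicWithRewrite` are hypothetical, and the HTTP call is injected with the response already parsed, so `promptSuggestion` stands in for `detail.data.prompt_suggestion`.

```typescript
// Parsed outcomes for the three status codes in the endpoint contract above.
type MusicResponse =
  | { status: 200; audio: Uint8Array }
  | { status: 400; promptSuggestion?: string }
  | { status: 422; message: string };

// Hypothetical injected call: sends one prompt, returns the parsed outcome.
type MusicCall = (prompt: string) => Promise<MusicResponse>;

async function generateMusicWithRewrite(call: MusicCall, prompt: string): Promise<Uint8Array> {
  const first = await call(prompt);
  if (first.status === 200) {
    return first.audio;
  }
  if (first.status === 400 && first.promptSuggestion) {
    // Log the original failure, then resubmit the provider's rewrite exactly once.
    console.error("ToS rejection for: " + prompt);
    const second = await call(first.promptSuggestion);
    if (second.status === 200) {
      return second.audio;
    }
  }
  throw new Error("music generation failed with HTTP " + first.status);
}
```

Resubmitting only once keeps a rejected rewrite from looping; a second 400 surfaces to the caller as an error.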
Trend-music rule: if a template references a specific trend track (assets/trend-*.mp3), copy the file from the ralphy-assets companion repo. Don't generate a substitute. Track recognition is half of what makes a trend video a trend.
Fallback (if ElevenLabs Music quality regresses): temporarily route to fal-ai/lyria2 via FAL_KEY as a documented exception. As of 2026-05-08 the fallback is not active; ElevenLabs Music is the only path.
Avoid:
- Suno (not on OpenRouter).
- lyria2 directly via FAL_KEY while ElevenLabs Music works: don't multiply optional keys.
Audio transcription / Captions
| Use case | Model | Price | Why |
|---|---|---|---|
| Default — word-level captions for compositions | ElevenLabs scribe_v1 (default in cli/lib/transcribe.ts) | ~$0.004 per audio-minute | Returns word-level timestamps natively in the shape HyperFrames caption layers expect; no second normalization pass. Verified end-to-end on the brainrot 54.6s VO → 121 word entries. |
| Fallback — when ElevenLabs is down | OpenRouter openai/whisper-1 (--backend openrouter) | ~$0.006 / min | One key covers it. Sometimes 400s on long audio; re-encode to 64kbps mono mp3 if you hit them. |
CLI: ralphy generate captions --project <id> --audio <vo.mp3> --language en (writes captions.json next to the project).
Avoid:
- Local whisper.cpp: large binary, no real benefit over the cloud at our volumes.
- Direct OpenAI API: we route through OpenRouter so one key covers vision + scoring + transcription.
LLM (for skills and analytics)
Provider routing. All LLM / vision calls go through cli/lib/providers/llm.ts → callLLM(). The only provider is OpenRouter.
| Use case | Model | Why |
|---|---|---|
| Vision analysis of images / videos (extract-design, find-viral-moments, face-bbox, scoreImage, scoreVideo) | google/gemini-2.5-flash | Cheap vision (~$0.001/frame), accurate enough for smart-crop and moment detection. Use pro when long context is needed. |
| Deep vision (extract-design on complex landing pages) | google/gemini-2.5-pro | Best quality on long prompts + complex screenshots. ~3× the cost of flash. |
| Scenarist / VO rewrite / feedback parsing | anthropic/claude-sonnet-4.6 or anthropic/claude-opus-4.6 | RU/EN at the same level, nuances revisions well. |
| This chat | Claude Opus 4.7 | The one you're reading right now. |
Avoid:
- Direct fetch("https://openrouter.ai/...") calls in new scripts. Go through callLLM() so users can switch providers via ralphy setup.
- Hard-coded anthropic.com or openai.com URLs; everything goes through OpenRouter.
Out-of-scope / dropped
These models / families were removed during OpenRouter consolidation (Sprint 1.3 / 2). Don't bring them back without an explicit plan upgrade:
- fal-ai/nano-banana-pro/edit: replaced with google/gemini-3-pro-image-preview via OR.
- fal-ai/flux-pro/v1.1-ultra, fal-ai/flux/dev/inpainting, fal-ai/flux-pro/v1.1-ultra/redux: out.
- fal-ai/luma-dream-machine/image-to-video: out (worse than kling).
- fal-ai/wan-25: lipsync stage dropped entirely in v2 (no FAL_KEY pipeline).
- fal-ai/sync-lipsync, fal-ai/veed/lipsync: out.
- fal-ai/lyria2: out (fallback reserved in the Music section).
- fal-ai/seedream: out.
- Replicate wav2lip: out (no token, no stage).
- openai/gpt-image-1, dall-e-3, stable-diffusion-xl, flux/schnell: outdated.
- Apify: replaced with a Playwright scraper in /researcher (deferred to v2).
- Higgsfield Soul, Fireworks Whisper: require separate keys, not in the stack.
- Vercel AI Gateway, direct OpenAI API: single-provider OpenRouter in v2.
When to update this file
- On the first session in a new chat: check Last reviewed, refresh if stale.
- After every failure mode on a new model: add it to "Avoid" / "Lessons" with the reason.
- When you change a default in a skill or script: sync it here.
- When you add a verb to ralphy generate or a flag: sync the price / param notes here.
- At least once a month, even if nothing broke.
docs/prompts/image/(product-shot, lifestyle-scene, closeup-with-person, macro-detail, flat-lay, virtual-model-tryout, hero-banner, conceptual-product, iteration-edit). The agent fills slots from a user request, then callsralphy generate image --prompt "<filled>". There's no new CLI flag for this; it's a curated library, not a feature.Avoid:
stable-diffusion-xl,flux/schnell,dall-e-3). Quality is below the current top picks at the same price.gpt-image-1: legacy line.gpt-5.4-image-2is the current stable OpenAI image model (not to be confused with the "gpt-image-2" naming some external docs use).Failure modes (per-model quick reference)
Append-only. Add a row when a new quirk costs >5 min of debugging or a re-roll.
google/gemini-3-pro-image-previewfinish_reason: null, content: null, native_finish_reason: nullwith no error body. Sporadic, not prompt-related.openai/gpt-5.4-image-2(accept the 1024² default + concurrency=1). Tracked bycli/lib/providers/openrouter.tsretry loop.google/gemini-3-pro-image-previewIMAGE_SAFETYrefusal (cap #10 below) — empty content +native_finish_reason: IMAGE_SAFETYon cryptid / skinwalker / Cronenberg prompts.openai/gpt-5.4-image-2for the anchor frame; carry scene identity via--ref.google/gemini-3-pro-image-previewopenai/gpt-5.4-image-2when copy must be legible — it holds glyphs cleanly. Lesson from flipper-hypermotion-001.openai/gpt-5.4-image-2403 "Key limit exceeded (total limit)".cli/lib/providers/concurrency.ts; the 403 is now rewritten as "concurrent-call limit on …; try --concurrency 1 or switch model — NOT a $ balance issue". For >2 parallel, swap togemini-3-pro-image-preview.openai/gpt-5.4-image-2image_config.aspect_ratiovia--size WxH(mapped automatically inopenrouter.ts → sizeToAspectRatio).Video generation (text-to-video + image-to-video)
Endpoint: async-job pattern at
POST /api/v1/videos. Submit returns{ id, status, polling_url }; poll untilcompleted; download viaGET /api/v1/videos/{id}/content?index=0(auth required). The legacy/api/v1/videos/generationsreturns 404.cli/lib/providers/media.ts → generateVideo()handles the full job lifecycle (15s × 80-poll = 20 min budget; tunable).Per-model matrix (live from
/api/v1/videos/models, snapshot 2026-05-08)Always recheck via
ralphy models list. These arrays change.kwaivgi/kling-v3.0-prokwaivgi/kling-v3.0-stdkwaivgi/kling-video-o1google/veo-3.1google/veo-3.1-fastgoogle/veo-3.1-liteopenai/sora-2-prominimax/hailuo-2.3alibaba/wan-2.6alibaba/wan-2.7bytedance/seedance-2.0bytedance/seedance-2.0-fastbytedance/seedance-1-5-proWhen to pick which
kwaivgi/kling-v3.0-prokwaivgi/kling-v3.0-std(NOTE: same per-second price as pro on OR — go toveo-3.1-litefor the real cheap-batch tier)google/veo-3.1(model-native audio with--audioworks in EN; off for RU/UA — only Chinese + English are clean)google/veo-3.1-fast(only model in catalog with 4K)bytedance/seedance-2.0(also the only path to 21:9 cinema aspect)alibaba/wan-2.7(only model with these in stock)bytedance/seedance-2.0-fastLessons from this session (2026-05-08)
kwaivgi/kling-v3.0-prorotates "wide" prompts inside the 9:16 container. Phrases like "wide overhead cityscape", "massive crowd in town square", "dancers under starlit sky" bias the model toward landscape composition; OR returns a 1080×1920 file but the content is laid out for 16:9. Fix: anchor with--first-frame <portrait-image>and rewrite the prompt with explicit vertical wording ("tall vertical portrait shot, low camera angle looking up, narrow alley framing, subjects centered vertically, half-timbered houses tower vertically on both sides"). The first-frame image overrides the model's compositional bias.--resolution 720pis silently upgraded to 1080p bykwaivgi/kling-v3.0-proeven though the catalog only lists 720p. The output dimension is whatever the model decides; treat resolution as a soft hint and let downstream HyperFrames / ffmpeg crop+scale to the composition's exact frame.OR's per-clip billing is fixed (e.g., a 5s kling-pro clip is ~$0.70 regardless of "duration in body" precision). The per-second figures above are therefore ballparks; pre-flight
--dry-runto see the estimate before submitting.generate_audio: trueis unsafe outside English. Confirmed forkwaivgi/kling-v3.0-pro,bytedance/seedance-2.0, andgoogle/veo-3.1on Russian — accent slips, voice age drifts, text gets cut. Default isfalse; only enable for EN with--audio.kwaivgi/kling-v3.0-pro --last-framehistorically returned400 "File is not in a valid base64 format"even with a clean PNG anchor. Root cause was C2PA / EXIF metadata that the provider parsed too eagerly; the 2026-05-19resolveImageRef()strip helped on single-frame but the multi-frame (first+last together) path kept failing across flipper / glitter-cream / playdate / venom regardless of source. As of 2026-05-30 (#008),generateVideo()preflight-rejects the kling-v3.0-pro multi-frame combination asTerminalProviderErrorwith a pointer tobytedance/seedance-2.0— which honors--last-framenatively across all aspects. No more wasted round-trips. Postmortems flagging this: playdate / flipper / venom / glitter-cream.bytedance/seedance-2.0privacy filter rejects photoreal-human anchors withInputImageSensitiveContentDetected.PrivacyInformation, even when the human was itself AI-generated. Reserve seedance for cartoons / non-human anchors / landscapes / hands / abstract motion. For photoreal humans, default tokwaivgi/kling-v3.0-pro(single-frame or multi-frame after #5 lands). Postmortems: tokyo / noski / venom.kwaivgi/kling-v3.0-pro2500-char prompt cap. OR returns 400 after a round-trip if you exceed it. As of 2026-05-30 (#008),generateVideo()preflight-rejects prompts >2500 chars asTerminalProviderErrorbefore any submit — no more 4× wasted round-trips per session. The cap lives in a per-modelMAX_PROMPT_CHARSmap incli/lib/providers/openrouter.ts; add a row when another model documents a hard cap. Trim atmosphere/setting paragraphs first — voice-tag, no-music, on-camera-EN clauses are load-bearing and should never be cut. Postmortem: glitter-cream.ElevenLabs Music 2-concurrent cap per subscription. 
Three+ parallel calls return 429
concurrent_limit_exceededand pollutegenerations.jsonlwith error rows. Serialize music gen or stay ≤2 in-flight. Postmortem: tokyo.Per-endpoint concurrent-call caps are now self-throttled in-process (
cli/lib/providers/concurrency.ts, #007). The semaphore wraps every network round-trip so the CLI never round-trips an over-cap call; the retry helper sits OUTSIDE the semaphore so backoff sleeps don't pin a slot. Current caps:elevenlabs/tts— 3 concurrent (choose-your-guide-001: 9 parallel → 6 hard-failed 429).elevenlabs/music_v1— 2 concurrent (tokyo-y2k-001: 3 parallel → 1 hard-failed 429).openrouter:openai/gpt-5.4-image-2— 2 concurrent (validated 2026-05-29; appstore-takeaminute-001 hit 73/73 403 on uncapped fan-out).openrouter:google/gemini-3-pro-image-preview— 2 concurrent (catalog tolerates ≥4, kept at 2 for shared-key safety).openrouter:bytedance/seedance-2.0— 1 concurrent (queue depth + multi-block extend is inherently sequential).openrouter:kwaivgi/kling-v3.0-pro— 2 concurrent.ralphyprocesses sharing one key),shared.ts → rewriteUpstreamError()now surfaces the 403 as"concurrent-call limit on <model>; try --concurrency 1 or switch model. (This is NOT a $ balance issue — check 'ralphy doctor' for credits.)"— no more credits-misread. Follow-up: the in-process semaphore does NOT span processes. The queue daemon's own worker count gates cross-process for now; a file-locked queue is a future enhancement. Postmortems: appstore / analog-horror / tokyo / choose-your-guide.google/gemini-3-pro-image-previewimage-safety filter is materially stricter thanopenai/gpt-5.4-image-2on body-horror / cryptid / skinwalker / werewolf register. Gemini returnsnative_finish_reason: IMAGE_SAFETYwith an empty content body even when the prompt uses softened surreal-anatomy language (concentric maws lined with tongue-like protrusions, biological apertures with internal teeth on a recognizable creature in a real-world setting). The reasoning trace literally describes the requested transformation before the filter refuses.openai/gpt-5.4-image-2accepts the same prompts and delivers — validated on voidstomper-test-001's skinwalker BBQ frame after three gemini refusals. 
Route rule: for voidstomper-adjacent / Cronenberg / The Thing / mid-warp body-horror anchor images, start at--model openai/gpt-5.4-image-2. Accept the 1024×1024 default (gpt-image ignores arbitrary--size); use--ref <gemini-9:16-scene>to carry scene + character identity, then post-process to 9:16 with ffmpeg pad if your downstream i2v requires matching dimensions. Postmortem: voidstomper-test-001 (2026-05-25).google/veo-3.1body-horror filter rejects both the prompt AND the input frame independently. Sanitizing the prompt (stripping skinwalker / werewolf / vertebrae / Cronenberg words) does NOT unblock the path — Google's filter ALSO scans the first/last-frame anchor and refuses when the anchor itself is clearly body-horror. Combined with seedance's photoreal-human rejection (cap #6), this leaveskwaivgi/kling-v3.0-proas the only viable i2v provider for the photoreal-human + body-horror combination that voidstomper-style content requires. Skip seedance and veo round-trips on those jobs; go straight to Kling. Postmortem: voidstomper-test-001 (2026-05-25).kwaivgi/kling-v3.0-proandbytedance/seedance-2.0overshoot--durationby ~1s. Both models return clips ~1 second longer than the requested--duration— silent, billed against the requested duration, not the delivered one. tokyo-y2k-001 measured5s/4s/9sstoryboard →6.04/5.04/10.04actual; planned 75s of clips landed as 90.7s of raw mp4. Editor recipe (pre-shorten at art-director stage, or budget a vision-trim pass) lives indocs/playbooks/editor/render-pipeline.md → Source-clip duration overshoot. Filed long-form:notes/issues/done/042-clip-duration-overshoot-undocumented.md. Postmortem: tokyo-y2k-001 (workflow-fixes #3).Failure modes (per-model quick reference)
Append-only. Each row is a quirk that has already cost ≥1 re-roll across sessions.
kwaivgi/kling-v3.0-probytedance/seedance-2.0for multi-frame anchoring (honors--last-framenatively for non-photoreal-human anchors).kwaivgi/kling-v3.0-pro--audioslips on Russian / UA — accent slip, voice-age drift, occasional cut text. EN is clean.--audiofor EN VO only. For RU / non-EN, generate silent clip + post-mix ElevenLabs VO in the editor stage. (Memory:feedback_kling_no_ru_audio.)kwaivgi/kling-v3.0-prokwaivgi/kling-v3.0-pro--first-frame <portrait>and rewrite prompt with explicit vertical wording (cap #1 above).bytedance/seedance-2.0InputImageSensitiveContentDetected.PrivacyInformation, even when the human is AI-generated.kwaivgi/kling-v3.0-pro. (Memory:feedback_seedance_rejects_realistic_people.)bytedance/seedance-2.0kling+seedance(both)--durationovershoots by ~1s (cap #12 above).docs/playbooks/editor/render-pipeline.md+notes/issues/042.google/veo-3.1kwaivgi/kling-v3.0-pro; skip veo + seedance round-trips on those jobs (cap #11 above).google/veo-3.1kwaivgi/kling-v3.0-profor >8s; chain veo clips for talking-head only when the cut is acceptable.google/gemini-3.1-pro-preview(video-analysis viacallLLM)ffmpeg -vf scale=540:960 -crf 28) + retry. If still 502, fall back togoogle/gemini-2.5-flash(lower context but reliable). Last resort:ralphy ref frames+ frame-level analysis.elevenlabs/music_v1prompt_suggestion. Modern hip-hop especially.prompt_suggestionverbatim. (Memory:feedback_elevenlabs_music_no_artist_names.)elevenlabs/music_v1429 concurrent_limit_exceededand pollutesgenerations.jsonl.--concurrency 2).elevenlabs/scribe_v1+ voice endpoints.mp3.User-Agent: Mozilla/5.0 (...)header (already wired). From geo-blocked region, fall back to Kokoro for VO andopenai/whisper-1for transcription. (Memory:feedback_elevenlabs_geoblock_html_in_mp3.)Routing rules (which video model for what)
Apply these before drafting a
ralphy generate videoprompt. They short-circuit re-rolls.bytedance/seedance-2.0. Kling reads these as narrative beats and softens the physics. Caveat: seedance's privacy filter rejects photoreal-human i2v anchors — only use seedance hyper-motion on cartoon / non-human / landscape / hands / abstract subjects.kwaivgi/kling-v3.0-pro. Multi-frame works post-2026-05-19.--audiofor EN VO only.google/veo-3.1-fast(the only catalog entry with 4K).alibaba/wan-2.7(only model with these in stock).bytedance/seedance-2.0(only model with 21:9 in stock).kwaivgi/kling-v3.0-proonly. Seedance rejects the anchor, veo rejects both prompt and anchor (caps #6 + #11).google/veo-3.1-lite(~$0.09/s) for non-photoreal-human work; otherwisekwaivgi/kling-v3.0-std(same rate as pro but worse keyframe holds).--audiopolicy: SPEECH vs AMBIENT/DIEGETICTwo distinct audio jobs hide behind the single
--audioflag — different rules apply.google/veo-3.1for English on-camera dialogue.kwaivgi/kling-v3.0-pro --audioworks for EN only. Slips on RU / UA / non-English (memoryfeedback_kling_no_ru_audio). For RU, render silent + post-mix ElevenLabseleven_multilingual_v2in the editor stage.bytedance/seedance-2.0 --audiois not validated for SPEECH — treat as off.--audio: trueare fine. Generate alongside the visuals; cheaper than a separate SFX pass.no music — only diegetic ambient + sparse SFXclause in the prompt regardless of which model.When in doubt: SPEECH = veo (EN) / kling-EN-only / silent+ElevenLabs (everything else). AMBIENT = whichever video model is already in the shot list.
Tried-and-dropped (postmortem cross-reference)
bytedance/seedance-2.0InputImageSensitiveContentDetected.PrivacyInformationkwaivgi/kling-v3.0-promulti-frame400 not in a valid base64 format— mitigated 2026-05-19 by auto-C2PA-strip inresolveImageRef()kwaivgi/kling-v3.0-pro --audiogoogle/veo-3.1--audioinsteadopenai/gpt-5.4-image-2eleven_multilingual_v2(Ava, Marcus)kwaivgi/kling-v3.0-profor hyper-motiongoogle/gemini-3-pro-image-previewbytedance/seedance-2.0t2v + i2v anchor posterAvoid:
kling-video/v1.6orv2.x: outdated.luma-dream-machine: worse than kling at the same price; out of OR catalog now.fal-ai/*endpoints: the stack moved to OpenRouter in Sprint 2.Voiceover (TTS)
eleven_multilingual_v2eleven_v3Voice settings (deadpan young Russian):
{ "model_id": "eleven_multilingual_v2", "voice_settings": { "stability": 0.55, "similarity_boost": 0.8, "style": 0.25, "use_speaker_boost": true }, "output_format": "mp3_44100_128" }Voice settings (English brainrot dramatic narrator):
{ "model_id": "eleven_v3", "voice_settings": { "stability": 0.30, "similarity_boost": 0.75, "style": 0.40, "use_speaker_boost": true } }Voice picks: Adam (
pNInz6obpgDQGcFmaJgB) for dramatic narrator, Brian (nPczCjzI2devNBz1zQrb) for Reddit-monotone, Daniel (onwK4e9ZLuTAKqWW03F9) for British-sarcastic.Failure modes:
User-Agent: Mozilla/5.0 (...).clyde-warvet,daniel-deep, etc.): too theatrical for RU. A custom clone is mandatory for RU production.eleven_v3consistently runs ~15-25% longer than scripted word-count would suggest (Strasbourg 45s scenario rendered at 54.6s). Time-budget compositions to actual VO duration viaralphy project transcribe, not scenario text length.Avoid:
tts-1-hdon Russian — flat American accent.eleven_v3on Russian in production — unstable.Music generation
music_v1)ralphy generate music.Endpoint contract (for
cli/lib/providers/media.ts → generateMusic()):POST https://api.elevenlabs.io/v1/music Headers: xi-api-key: $ELEVENLABS_API_KEY, Content-Type: application/json Body: { "prompt": "...", "music_length_ms": 30000, "force_instrumental": true, "output_format": "mp3_44100_128", "model_id": "music_v1" } Response: 200 → binary mp3 (application/octet-stream) 400 → JSON ToS rejection (`bad_prompt`) — see "Prompt content policy" 422 → JSON validation errorPrompt content policy (#006). ElevenLabs Music ToS rejects prompts that name specific artists / producers / copyrighted tracks (rappers, named producers, song titles, scored themes) with HTTP 400
bad_prompt. The 400 envelope carries adetail.data.prompt_suggestionfield containing a provider-sanitized rewrite the CLI can resubmit verbatim. The connector surfaces this onTerminalProviderError.promptSuggestion;ralphy generate music --auto-retry-on-tos-rejectionwill log the original failure and resubmit once using the rewrite. Use genre + tempo + instrumentation + mood framing instead of named entities (e.g. "trap beat, 140 BPM, 808 sub-bass, dark minor-key piano stab"). The CLI runs a soft pre-submit linter (cli/lib/music-prompt-lint.ts) against a known artist/track regex set and warns to stderr before submit — non-blocking, false positives are cheaper than missed catches.Trend-music rule: if a template references a specific trend track (
assets/trend-*.mp3), copy the file from theralphy-assetscompanion repo. Don't generate a substitute. Track recognition is half of what makes a trend video a trend.Fallback (if ElevenLabs Music quality regresses): temporarily route to
fal-ai/lyria2viaFAL_KEYas a documented exception. As of 2026-05-08 the fallback is not active; ElevenLabs Music is the only path.Avoid:
lyria2directly viaFAL_KEYwhile ElevenLabs Music works: don't multiply optional keys.Audio transcription / Captions
scribe_v1(default incli/lib/transcribe.ts)openai/whisper-1(--backend openrouter)CLI:
ralphy generate captions --project <id> --audio <vo.mp3> --language en(writescaptions.jsonnext to the project).Avoid:
LLM (for skills and analytics)
Provider routing. All LLM / vision calls go through
cli/lib/providers/llm.ts → callLLM(). The only provider is OpenRouter.google/gemini-2.5-flashprowhen long context is needed.google/gemini-2.5-proanthropic/claude-sonnet-4.6oranthropic/claude-opus-4.6Avoid:
fetch("https://openrouter.ai/...")calls in new scripts. Go throughcallLLM()so users can switch providers viaralphy setup.anthropic.comoropenai.comURLs; everything goes through OpenRouter.Out-of-scope / dropped
These models / families were removed during OpenRouter consolidation (Sprint 1.3 / 2). Don't bring them back without an explicit plan upgrade:
fal-ai/nano-banana-pro/edit— replaced withgoogle/gemini-3-pro-image-previewvia OR.fal-ai/flux-pro/v1.1-ultra,fal-ai/flux/dev/inpainting,fal-ai/flux-pro/v1.1-ultra/redux— out.fal-ai/luma-dream-machine/image-to-video— out (worse than kling).fal-ai/wan-25— lipsync stage dropped entirely in v2 (no FAL_KEY pipeline).fal-ai/sync-lipsync,fal-ai/veed/lipsync— out.fal-ai/lyria2— out (fallback reserved in the Music section).fal-ai/seedream— out.wav2lip— out (no token, no stage).openai/gpt-image-1,dall-e-3,stable-diffusion-xl,flux/schnell— outdated./researcher(deferred to v2).When to update this file
Last reviewed, refresh if stale.ralphy generateor a flag: sync the price / param notes here.