Serving a global audience through AI translation.
AI-Powered Localization & Audio Narration
Much of the web is built for a narrow audience: English speakers who can read a screen. This guide demonstrates how to reach a global audience using AI translation and voice synthesis across 106 languages.
The techniques described here open content to non-English speakers, people with visual impairments or reading difficulties, and anyone who prefers listening to reading.
What you'll need
- Vercel account (free tier)
- Google AI Studio API key for Gemini Flash (free tier: 1,500 requests/day)
- ElevenLabs API key for voice synthesis ($5/mo starter plan or pay-as-you-go)
- Node.js for running build scripts
Gemini 3 Flash's free tier (1,500 requests/day) handles translation for most sites at no cost. ElevenLabs offers $5-22/mo subscription plans or pay-as-you-go pricing at $0.30 per 1,000 characters. After the initial run, updates regenerate only the content that changed, keeping ongoing costs minimal.
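As a rough illustration of the pay-as-you-go arithmetic, the sketch below estimates narration cost from a character count at the $0.30 per 1,000 characters rate quoted above. The function name and the sample figures are hypothetical, and it assumes every language version is about as long as the source text; actual billing depends on the ElevenLabs plan in use.

```javascript
// Pay-as-you-go rate quoted in this guide: $0.30 per 1,000 characters.
const USD_PER_1K_CHARS = 0.30;

// Estimate the cost (in USD) of narrating `charCount` characters
// in `languageCount` languages. Assumes each translation is roughly
// the same length as the source text, which is only an approximation.
function estimateNarrationCost(charCount, languageCount = 1) {
  const totalChars = charCount * languageCount;
  return (totalChars / 1000) * USD_PER_1K_CHARS;
}

// Example: a 5,000-character page narrated in all 6 audio languages.
console.log(estimateNarrationCost(5000, 6).toFixed(2)); // "9.00"
```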
How it works
Content is translated at build time using Gemini Flash, cached on Vercel's edge network, and optionally synthesized into audio using ElevenLabs. Hash-based change detection ensures you only pay to regenerate what's changed.
Gemini Flash
Context-aware translation to 100+ languages
Vercel KV
Global edge caching for instant delivery
ElevenLabs
Voice synthesis in 6 languages
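One way to picture how caching and hash-based change detection fit together is to include a content hash in the cache key, so edited text misses the cache while unchanged text is served without a new model call. The sketch below is illustrative only: a Map stands in for Vercel KV, `translate` is a hypothetical async function standing in for the Gemini request, and the key format is an assumption rather than the one used by the actual scripts.

```javascript
const { createHash } = require("node:crypto");

// Short, stable fingerprint of a piece of source text.
function contentHash(text) {
  return createHash("sha256").update(text, "utf8").digest("hex").slice(0, 16);
}

// Cache key that changes whenever the source text changes (hypothetical format).
function cacheKey(lang, id, text) {
  return `l10n:${lang}:${id}:${contentHash(text)}`;
}

// Return a cached translation, calling the translator only on a cache miss.
// `cache` needs has/get/set like a Map; `translate` is async (text, lang) => string.
async function getTranslation(cache, translate, lang, id, text) {
  const key = cacheKey(lang, id, text);
  if (cache.has(key)) return cache.get(key);
  const translated = await translate(text, lang);
  cache.set(key, translated);
  return translated;
}
```

Because the hash is part of the key, no explicit invalidation step is needed: stale entries are simply never requested again.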
User Experience
A flag icon in the header provides access to a searchable language selector supporting over 100 languages. Languages with audio narration available display an audio indicator icon.
Audio narration is available in six languages: English, Spanish, Chinese, Hindi, Arabic, and French. For these languages, a "Narrate this page" button appears that plays synthesized audio while highlighting paragraphs in sequence. Playback can be paused and resumed via the button or Option+P (Alt+P on Windows).
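The keyboard shortcut can be sketched as follows. This is not the code from speech.js; the function names are hypothetical. The sketch checks `event.code` rather than `event.key` because on macOS with a US layout Option+P types "π", so `event.key` would not be "p".

```javascript
// True for Option+P (macOS) or Alt+P (Windows/Linux).
// `event.code` identifies the physical key, so the check is unaffected
// by the character that Option/Alt produces.
function isNarrationToggle(event) {
  return Boolean(event.altKey) && !event.ctrlKey && !event.metaKey && event.code === "KeyP";
}

// Pause if playing, resume if paused. `audio` is anything exposing
// `paused`, `play()`, and `pause()`, such as an HTMLAudioElement.
function togglePlayback(audio) {
  if (audio.paused) audio.play();
  else audio.pause();
}

// Attach the shortcut to an event target (normally `document`).
function bindNarrationShortcut(target, audio) {
  target.addEventListener("keydown", (event) => {
    if (!isNarrationToggle(event)) return;
    event.preventDefault(); // keep "π" out of focused text fields on macOS
    togglePlayback(audio);
  });
}
```

In a browser this would be wired up with something like `bindNarrationShortcut(document, audioElement)`, where `audioElement` is the element currently playing a narration clip.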
Developer Experience: Visual Workflow
The development environment provides immediate visual feedback. Modified content is flagged automatically, and the interface displays the exact commands needed to regenerate translations and audio.
The Workflow
Content changes are detected, flagged, and resolved through a three-step process.
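The detect-and-flag portion of this workflow can be approximated as follows: hash each page's current text, compare it with the stored hash, and list the regeneration commands for anything that differs. This is a minimal sketch under stated assumptions. The `{ pageName: hash }` layout is hypothetical and the real content-hashes.json may be structured differently; the commands are the ones shown elsewhere in this guide.

```javascript
const { createHash } = require("node:crypto");

// SHA-256 fingerprint of a page's translatable text.
function hashContent(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Names of pages whose current text no longer matches the stored hash.
// `pages` maps page name -> current text; `storedHashes` maps page name -> hash.
function findModifiedPages(pages, storedHashes) {
  return Object.keys(pages).filter(
    (name) => storedHashes[name] !== hashContent(pages[name])
  );
}

// Commands a developer would run to bring one modified page up to date.
function regenerationCommands(pageName) {
  return [
    `node pre-translate.js --page=${pageName}`,
    `node generate-narration.js ${pageName}/index.html --all-langs`,
  ];
}

// Example: "about" is unchanged, "localization" was edited after hashing.
const stored = { about: hashContent("About us"), localization: hashContent("Old text") };
const current = { about: "About us", localization: "New text" };
for (const page of findModifiedPages(current, stored)) {
  console.log(`MODIFIED: ${page}`);
  regenerationCommands(page).forEach((cmd) => console.log(`  ${cmd}`));
}
```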
Data Attributes
Implementation relies on two data attributes: data-l10n-id marks an element for translation, and data-narration additionally marks it for narration.
<!-- For translation only -->
<p data-l10n-id="page-1">This paragraph will be translated.</p>
<!-- For translation AND narration -->
<p data-narration="0" data-l10n-id="page-2">This will be translated and read aloud.</p>
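To show how a build script might collect these elements, the sketch below pulls the ID, narration index, and text out of markup like the sample above. It is an illustration, not the extraction logic of pre-translate.js: the function name is hypothetical, and a regular expression is used only to keep the example dependency-free. It handles simple, non-nested `<p>` elements; a production script would more likely use an HTML parser.

```javascript
// Sample markup using the two attributes described above.
const html = `
<p data-l10n-id="page-1">This paragraph will be translated.</p>
<p data-narration="0" data-l10n-id="page-2">This will be translated and read aloud.</p>
`;

// Collect { id, narrationIndex, text } for every <p> carrying data-l10n-id.
function extractTranslatable(source) {
  const items = [];
  const paragraph = /<p\b([^>]*)>([\s\S]*?)<\/p>/g;
  for (const [, attrs, inner] of source.matchAll(paragraph)) {
    const id = /data-l10n-id="([^"]*)"/.exec(attrs);
    if (!id) continue; // not marked for translation
    const narration = /data-narration="(\d+)"/.exec(attrs);
    items.push({
      id: id[1],
      narrationIndex: narration ? Number(narration[1]) : null,
      text: inner.trim(),
    });
  }
  return items;
}

console.log(extractTranslatable(html));
```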
Translation System: Gemini 3 Flash + Vercel KV
The translation pipeline runs at build time, not runtime. A Node.js script extracts all translatable content, sends it to Google's Gemini 3 Flash model in batches, and stores results in Vercel KV. The system supports 100+ languages.
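The batching step might look something like the following. This is a sketch under assumptions, not the actual script: `translateBatch` is a hypothetical async function standing in for the Gemini request, `store` stands in for Vercel KV and only needs a `set(key, value)` method, and both the batch size of 20 and the key format are arbitrary choices for illustration.

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Translate every item into one language, issuing one model call per batch.
// `items` is an array of { id, text }; `translateBatch` receives an array of
// strings plus a language code and resolves to translations in the same order.
async function translateAll(items, lang, translateBatch, store, batchSize = 20) {
  for (const batch of toBatches(items, batchSize)) {
    const translated = await translateBatch(batch.map((item) => item.text), lang);
    batch.forEach((item, i) => store.set(`l10n:${lang}:${item.id}`, translated[i]));
  }
}
```

Batching keeps the number of requests well below the item count, which matters when a free tier is limited by requests per day rather than by characters.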
Running the Script
# From project root, start local dev server (required for API access)
vercel dev
# In another terminal (also from project root), run translation
node pre-translate.js
# Translate a specific page only
node pre-translate.js --page=localization
# Translate to a specific language only
node pre-translate.js --lang=es
# Combine filters for one page, one language
node pre-translate.js --page=localization --lang=fr
# NOTE: If content is unchanged, the script will skip translation.
# To force regeneration (e.g., after API failures or for testing),
# delete the "_translationHash" line for that page in content-hashes.json
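For readers curious how the `--page` and `--lang` filters could be handled, here is a minimal sketch using Node's built-in `util.parseArgs` (available in Node.js 18.3 and later). The helper names are hypothetical and the actual pre-translate.js may parse its arguments differently.

```javascript
const { parseArgs } = require("node:util");

// Parse the optional --page and --lang filters from an argument list.
function parseFilters(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      page: { type: "string" },
      lang: { type: "string" },
    },
  });
  return { page: values.page ?? null, lang: values.lang ?? null };
}

// Decide whether a page/language pair passes the filters.
// A null filter means "no restriction".
function matchesFilters(filters, page, lang) {
  return (
    (filters.page === null || filters.page === page) &&
    (filters.lang === null || filters.lang === lang)
  );
}

// Equivalent to: node pre-translate.js --page=localization --lang=fr
const filters = parseFilters(["--page=localization", "--lang=fr"]);
console.log(filters);
```

In a real script the argument list would come from `process.argv.slice(2)`.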
Narration System: ElevenLabs Voice Synthesis
Audio narration uses ElevenLabs' text-to-speech API. For English, a cloned voice provides consistency. For other languages, ElevenLabs' multilingual voices handle the synthesis. Audio files are saved to a structured folder hierarchy.
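A single synthesis request could be sketched as below. The endpoint, the `xi-api-key` header, and the `eleven_multilingual_v2` model ID follow the publicly documented ElevenLabs REST API, but they should be checked against the current API reference before use. The voice IDs and the per-language voice table are placeholders, and none of this is taken from generate-narration.js.

```javascript
// Build the fetch() arguments for one text-to-speech request.
function buildTtsRequest(text, voiceId, apiKey, modelId = "eleven_multilingual_v2") {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    init: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
        Accept: "audio/mpeg",
      },
      body: JSON.stringify({ text, model_id: modelId }),
    },
  };
}

// Pick a voice for a language: a dedicated (for example, cloned) voice where
// one is configured, otherwise a shared multilingual default.
// The IDs passed in `voices` are placeholders supplied by the caller.
function voiceFor(lang, voices) {
  return voices[lang] ?? voices.default;
}

// Send the request and return the MP3 bytes (requires Node.js 18+ for fetch).
async function synthesize(text, voiceId, apiKey) {
  const { url, init } = buildTtsRequest(text, voiceId, apiKey);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`ElevenLabs request failed with status ${response.status}`);
  }
  return Buffer.from(await response.arrayBuffer());
}
```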
Audio File Structure
/audio/
├── en/
│   └── page-name/
│       ├── p0.mp3    # First narrated element
│       ├── p1.mp3    # Second narrated element
│       └── p2.mp3    # ...and so on
├── es/
│   └── page-name/
│       └── ...
├── zh/
├── hi/
├── ar/
└── fr/
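Given this layout, the URL of any clip follows directly from the language code, the page name, and the element's position. The helper below is a hypothetical sketch of that mapping; it assumes the six language folders listed above and that the numeric index matches the order of narrated elements on the page.

```javascript
// Language codes that have an audio folder in the structure above.
const NARRATION_LANGS = ["en", "es", "zh", "hi", "ar", "fr"];

// URL of the clip for the `index`-th narrated element of a page,
// or null when the language has no narration.
function narrationUrl(lang, pageName, index) {
  if (!NARRATION_LANGS.includes(lang)) return null;
  return `/audio/${lang}/${pageName}/p${index}.mp3`;
}

console.log(narrationUrl("es", "page-name", 0)); // "/audio/es/page-name/p0.mp3"
console.log(narrationUrl("de", "page-name", 0)); // null
```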
Running the Script
# All commands run from project root
# Generate English narration
node generate-narration.js page-name/index.html
# Generate for a specific language
node generate-narration.js page-name/index.html --lang es
# Generate for all 6 narration-enabled languages
node generate-narration.js page-name/index.html --all-langs
# Resume interrupted generation (skips existing files)
node generate-narration.js page-name/index.html --all-langs --resume
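The behavior of `--resume` (skipping files that already exist) can be approximated with a simple existence check before each synthesis call. The sketch below is an assumption about how such a flag might work, with a hypothetical function name; it is not the code from generate-narration.js.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Indexes of narrated elements whose MP3 has not been written yet.
// `outputDir` is a language/page folder such as audio/es/page-name,
// and `clipCount` is the number of narrated elements on the page.
function pendingClips(outputDir, clipCount) {
  const pending = [];
  for (let i = 0; i < clipCount; i += 1) {
    const file = path.join(outputDir, `p${i}.mp3`);
    if (!fs.existsSync(file)) pending.push(i);
  }
  return pending;
}
```

A resumed run would then synthesize only the returned indexes, so an interruption partway through a page costs nothing extra for clips that were already saved.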
Code & Implementation
The complete implementation is available as open-source code. The following scripts form the core of the system.
Key Scripts
pre-translate.js
The translation generation script. Extracts content, calls Gemini Flash, and stores results in Vercel KV.
View on GitHub Gist →
translate-api.js
The Vercel serverless API endpoint. Handles translation requests via Gemini Flash and caches results in Vercel KV.
View on GitHub Gist →
generate-narration.js
The audio generation script. Synthesizes speech via ElevenLabs and saves MP3 files.
View on GitHub Gist →
speech.js
The client-side narration player. Manages playback, paragraph highlighting, and keyboard shortcuts.
View on GitHub Gist →
localization-manager.js
The client-side translation handler. Manages language switching, caching, and DOM updates.
View on GitHub Gist →
content-manager.js
The dev workflow tools. Handles change detection, MODIFIED badges, and generation buttons.
View on GitHub Gist →
Caveats & Limitations
This approach involves tradeoffs that warrant consideration:
- Cost: Gemini 3 Flash's free tier covers most translation needs. ElevenLabs offers subscriptions ($5-22/mo) or pay-as-you-go ($0.30/1K characters). Hash-based caching ensures you only regenerate changed content.
- Translation quality: AI translation has limitations. Idioms, cultural references, and domain-specific terminology may be mistranslated. For critical content (legal, medical), human translation remains the standard.
- Voice consistency: Non-English narration uses different voice models, so the "speaker" sounds different across languages.
- API dependencies: The system relies on external APIs that may change pricing or deprecate features over time.
This system is designed for content-focused sites with relatively static text. Real-time chat, user-generated content, and highly dynamic interfaces require different architectures.
About the Author
Burton Rast is a designer, a photographer, and a public speaker who loves to make things.