AI-Powered Localization & Audio Narration

Much of the web is built for a narrow audience: English speakers who can read a screen. This guide demonstrates how to reach a global audience using AI translation and voice synthesis across 106 languages.

The techniques described here open up access for non-English speakers, people with visual impairments or reading difficulties, and those who prefer listening to reading.

What you'll need

Gemini 3 Flash's free tier (1,500 requests/day) handles translation for most sites at no cost. ElevenLabs offers subscription plans at $5-22/mo or pay-as-you-go pricing at $0.30 per 1,000 characters. After the first run, updates regenerate only the content that changed, which keeps ongoing costs minimal.

How it works

Content is translated at build time using Gemini Flash, cached on Vercel's edge network, and optionally synthesized into audio using ElevenLabs. Hash-based change detection ensures you only pay to regenerate what's changed.

Gemini Flash

Context-aware translation to 100+ languages

→

Vercel KV

Global edge caching for instant delivery

→

ElevenLabs

Voice synthesis in 6 languages
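
The hash-based change detection mentioned above can be sketched in a few lines. This is a minimal illustration, not the guide's actual code: the choice of SHA-256 and the function names `contentHash` and `needsRegeneration` are assumptions.

```javascript
const crypto = require("node:crypto");

// Hash the text of a page so that any edit produces a different digest.
// SHA-256 is an assumed choice; any stable digest would serve.
function contentHash(text) {
  return crypto.createHash("sha256").update(text, "utf8").digest("hex");
}

// True when the current text no longer matches the digest stored
// from the previous run, i.e. the content must be regenerated.
function needsRegeneration(text, storedHash) {
  return contentHash(text) !== storedHash;
}

const stored = contentHash("Hello, world.");
console.log(needsRegeneration("Hello, world.", stored)); // unchanged text
console.log(needsRegeneration("Hello, world!", stored)); // edited text
```

Comparing digests rather than full text keeps the stored state small, and an unchanged page costs no API calls.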

User Experience

A flag icon in the header provides access to a searchable language selector supporting over 100 languages. Languages with audio narration available display a 🔊 indicator.
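
As a rough sketch of how a searchable selector like this might filter its list, the snippet below matches a query against language names and codes and marks the entries that have narration. The `LANGUAGES` array, its field names, and `searchLanguages` are hypothetical and stand in for the real list.

```javascript
// Hypothetical sample data; the real selector covers 100+ languages.
const LANGUAGES = [
  { code: "en", name: "English", hasAudio: true },
  { code: "es", name: "Spanish", hasAudio: true },
  { code: "de", name: "German", hasAudio: false },
  { code: "fr", name: "French", hasAudio: true },
];

// Case-insensitive match on the language name or an exact match on the code.
// Languages with narration get the speaker indicator appended to their label.
function searchLanguages(languages, query) {
  const q = query.trim().toLowerCase();
  return languages
    .filter((l) => q === "" || l.name.toLowerCase().includes(q) || l.code === q)
    .map((l) => (l.hasAudio ? `${l.name} 🔊` : l.name));
}

console.log(searchLanguages(LANGUAGES, "ger"));
console.log(searchLanguages(LANGUAGES, "fr"));
```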

Language Selector

Audio narration is available in six languages: English, Spanish, Chinese, Hindi, Arabic, and French. For these languages, a "Narrate this page" button appears that plays synthesized audio while highlighting paragraphs in sequence. Playback can be paused and resumed via the button or Option+P (Alt+P on Windows).
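
The playback behavior described above can be modeled as a small state machine. The sketch below is an assumption about how it could be structured, not the site's actual player; `isToggleShortcut`, `createNarrationState`, `toggle`, and `advance` are hypothetical names. It checks `event.code` rather than `event.key` because Option+P on a US macOS layout types "π" instead of "p".

```javascript
// True for Option+P (macOS) or Alt+P (Windows/Linux).
// event.code identifies the physical key, so the check is layout-independent.
function isToggleShortcut(event) {
  return Boolean(event.altKey) && !event.ctrlKey && !event.metaKey && event.code === "KeyP";
}

// Playback state: which narrated paragraph is highlighted, and whether audio is paused.
function createNarrationState(paragraphCount) {
  return { index: 0, paused: true, count: paragraphCount };
}

// Pause or resume without losing the current position.
function toggle(state) {
  return { ...state, paused: !state.paused };
}

// Called when the current clip ends: highlight the next paragraph,
// or stop and rewind after the last one.
function advance(state) {
  const next = state.index + 1;
  return next < state.count
    ? { ...state, index: next }
    : { ...state, index: 0, paused: true };
}
```

In a browser, a `keydown` listener would call `toggle` when `isToggleShortcut(event)` is true, and each audio element's `ended` event would call `advance`.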

Developer Experience: Visual Workflow

The development environment provides immediate visual feedback. Modified content is flagged automatically, and the interface displays the exact commands needed to regenerate translations and audio.

The Workflow

Content changes are detected, flagged, and resolved through a three-step process.

1. Edit content, see MODIFIED badges

[MODIFIED] Much of the web is built for a narrow audience: English speakers who can read a screen.

[MODIFIED] This guide demonstrates how to expand that reach using AI translation and voice synthesis.

The techniques described here enable access for non-English speakers, people with visual impairments, or those who prefer listening.

2. Click buttons to copy commands

PRE-TRANSLATE SITE
node pre-translate.js (requires vercel dev @ 3000)

GENERATE NARRATION ⚡
node generate-narration.js localization/index.html --all-langs

3. Run in terminal, badges disappear

Much of the web is built for a narrow audience: English speakers who can read a screen.

This guide demonstrates how to expand that reach using AI translation and voice synthesis.

The techniques described here enable access for non-English speakers, people with visual impairments, or those who prefer listening.
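
One plausible way to decide which paragraphs receive a MODIFIED badge is to compare a digest per element against the digests saved at the last run. The guide does not say how its development UI does this, so the function name `findModified` and the id-to-digest maps below are assumptions for illustration.

```javascript
// Both arguments map an element id to a content digest.
// An element is flagged when its digest changed or when it is new.
function findModified(currentHashes, storedHashes) {
  return Object.keys(currentHashes).filter(
    (id) => storedHashes[id] !== currentHashes[id]
  );
}

// Hypothetical digests: page-1 was edited, page-3 was added.
const stored = { "page-1": "aaa", "page-2": "bbb" };
const current = { "page-1": "ccc", "page-2": "bbb", "page-3": "ddd" };
console.log(findModified(current, stored));
```

Once the commands have run and the stored digests are refreshed, the same comparison returns an empty list and the badges disappear.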

Data Attributes

Implementation relies on two data attributes: data-l10n-id marks an element for translation, and data-narration additionally marks it for narration:

<!-- For translation only -->
<p data-l10n-id="page-1">This paragraph will be translated.</p>

<!-- For translation AND narration -->
<p data-narration="0" data-l10n-id="page-2">This will be translated and read aloud.</p>
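
To illustrate how a script might consume these attributes, the sketch below collects every marked element from a DOM-like root. It uses only the standard `querySelectorAll`, `getAttribute`, and `hasAttribute` methods; the function name `collectEntries` and the shape of the returned objects are assumptions, not the guide's actual extraction code.

```javascript
// Gather every element marked for translation, noting its narration
// index when the element is also marked for narration.
function collectEntries(root) {
  const entries = [];
  for (const el of root.querySelectorAll("[data-l10n-id]")) {
    entries.push({
      id: el.getAttribute("data-l10n-id"),
      text: el.textContent.trim(),
      narrationIndex: el.hasAttribute("data-narration")
        ? Number(el.getAttribute("data-narration"))
        : null,
    });
  }
  return entries;
}
```

In a browser this could be called as `collectEntries(document)`. In a Node.js build script, the same function would work against any HTML parser that exposes the DOM API.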

Translation System: Gemini 3 Flash + Vercel KV

The translation pipeline runs at build time, not runtime. A Node.js script extracts all translatable content, sends it to Google's Gemini 3 Flash model in batches, and stores results in Vercel KV. The system supports 100+ languages.
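
The batch-then-store flow can be sketched as follows. To avoid guessing at the Gemini or Vercel KV client APIs, the translator and the store are passed in as functions. Everything named here (`toBatches`, `cacheKey`, `translatePage`, the key layout, the batch size of 20) is a hypothetical illustration.

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical cache key layout: one record per language and page.
function cacheKey(lang, page) {
  return `l10n:${lang}:${page}`;
}

// translateBatch(texts, lang) and store.set(key, value) are injected,
// so this sketch assumes nothing about the real API clients.
async function translatePage(entries, lang, page, translateBatch, store, batchSize = 20) {
  const translated = {};
  for (const batch of toBatches(entries, batchSize)) {
    const texts = await translateBatch(batch.map((e) => e.text), lang);
    batch.forEach((e, i) => {
      translated[e.id] = texts[i];
    });
  }
  await store.set(cacheKey(lang, page), translated);
  return translated;
}
```

Batching reduces the number of requests, which matters when working within a daily request quota.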

Running the Script

# From project root, start local dev server (required for API access)
vercel dev

# In another terminal (also from project root), run translation
node pre-translate.js

# Translate a specific page only
node pre-translate.js --page=localization

# Translate to a specific language only
node pre-translate.js --lang=es

# Combine filters for one page, one language
node pre-translate.js --page=localization --lang=fr

# NOTE: If content is unchanged, the script will skip translation.
# To force regeneration (e.g., after API failures or for testing),
# delete the "_translationHash" line for that page in content-hashes.json
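
Removing the hash by hand can also be scripted. The helper below is a sketch that assumes content-hashes.json maps each page name to an object containing a "_translationHash" key. That layout is inferred from the note above and is not confirmed by the source, and the function names are hypothetical.

```javascript
const fs = require("node:fs");

// Assumed file shape: { "<page>": { "_translationHash": "<digest>", ... }, ... }
// Removes the translation hash for one page; returns true if one was present.
function clearTranslationHash(hashes, page) {
  if (!hashes[page] || !("_translationHash" in hashes[page])) return false;
  delete hashes[page]._translationHash;
  return true;
}

// Read the file, clear the hash for `page`, and write it back only if it changed.
function forceRetranslate(file, page) {
  const hashes = JSON.parse(fs.readFileSync(file, "utf8"));
  const changed = clearTranslationHash(hashes, page);
  if (changed) fs.writeFileSync(file, JSON.stringify(hashes, null, 2) + "\n");
  return changed;
}
```

Calling `forceRetranslate("content-hashes.json", "localization")` before running the translation script would then cause that page to be treated as changed.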

Narration System: ElevenLabs Voice Synthesis

Audio narration uses ElevenLabs' text-to-speech API. For English, a cloned voice provides consistency. For other languages, ElevenLabs' multilingual voices handle the synthesis. Audio files are saved to a structured folder hierarchy.

Audio File Structure

/audio/
├── en/
│   └── page-name/
│       ├── p0.mp3    # First narrated element
│       ├── p1.mp3    # Second narrated element
│       └── p2.mp3    # ...and so on
├── es/
│   └── page-name/
│       └── ...
├── zh/
├── hi/
├── ar/
└── fr/
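
Given this layout, the URL of any clip follows directly from the language, the page name, and the element's position. The helper below is a small sketch of that mapping. The name `audioPath` is hypothetical, and it assumes the file index matches the element's data-narration value.

```javascript
// The six narration-enabled languages listed in this guide.
const NARRATION_LANGS = ["en", "es", "zh", "hi", "ar", "fr"];

// Build the path of one narration clip, e.g. /audio/es/page-name/p0.mp3.
function audioPath(lang, pageName, index) {
  if (!NARRATION_LANGS.includes(lang)) {
    throw new Error(`No narration available for "${lang}"`);
  }
  return `/audio/${lang}/${pageName}/p${index}.mp3`;
}

console.log(audioPath("es", "page-name", 0));
```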

Running the Script

# All commands run from project root

# Generate English narration
node generate-narration.js page-name/index.html

# Generate for a specific language
node generate-narration.js page-name/index.html --lang es

# Generate for all 6 narration-enabled languages
node generate-narration.js page-name/index.html --all-langs

# Resume interrupted generation (skips existing files)
node generate-narration.js page-name/index.html --all-langs --resume
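
Resuming can be as simple as checking which output files already exist. The sketch below shows that idea under the assumption that clips are named p0.mp3, p1.mp3, and so on inside one output directory. `pendingClips` is a hypothetical helper, not the script's actual code.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Return only the clips whose output file does not exist yet,
// so an interrupted run can continue without paying for finished clips.
function pendingClips(outputDir, clipCount) {
  const pending = [];
  for (let i = 0; i < clipCount; i++) {
    const file = path.join(outputDir, `p${i}.mp3`);
    if (!fs.existsSync(file)) pending.push({ index: i, file });
  }
  return pending;
}
```

A file-existence check does not detect a clip that was only partially written, so a more careful version might write to a temporary name and rename the file on success.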

Code & Implementation

The complete implementation is available as open-source code. The following scripts form the core of the system.

Key Scripts

Caveats & Limitations

This approach involves tradeoffs that warrant consideration:

  • Cost: Gemini 3 Flash's free tier covers most translation needs. ElevenLabs offers subscriptions ($5-22/mo) or pay-as-you-go ($0.30/1K characters). Hash-based caching ensures you only regenerate changed content.
  • Translation quality: AI translation has limitations. Idioms, cultural references, and domain-specific terminology may be mistranslated. For critical content (legal, medical), human translation remains the standard.
  • Voice consistency: Non-English narration uses different voice models, so the "speaker" sounds different across languages.
  • API dependencies: The system relies on external APIs that may change pricing or deprecate features over time.
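
To make the pay-as-you-go figure concrete, the arithmetic can be written out as below. The only number taken from this guide is the $0.30 per 1,000 characters rate. The function name is hypothetical, and the estimate assumes every language's text is about as long as the source, which real translations will only approximate.

```javascript
// $0.30 per 1,000 characters, kept in cents to avoid floating-point drift.
const PRICE_CENTS_PER_1K_CHARS = 30;

// Estimated one-time narration cost in US dollars.
function narrationCostUsd(charCount, languageCount = 1) {
  const cents = (charCount * languageCount * PRICE_CENTS_PER_1K_CHARS) / 1000;
  return cents / 100;
}

// A 5,000-character page narrated in all six languages.
console.log(narrationCostUsd(5000, 6).toFixed(2));
```

At that rate, a 5,000-character page in six languages comes to about $9.00 for the first run. Later runs cost only as much as the changed paragraphs.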

This system is designed for content-focused sites with relatively static text. Real-time chat, user-generated content, and highly dynamic interfaces require different architectures.

✶✶✶✶

About the Author

Burton Rast is a designer, a photographer, and a public speaker who loves to make things.