AI Systems · hindi-ai · multilingual-automation · voice-agents · indian-sme · whatsapp

Your AI Chatbot Doesn’t Speak Hindi. Your Customers Do.

4 April 2026 · Nayikala Team · 6 min read

90% of India’s new internet users don’t speak English. Most business automation stacks are English-only. The gap between those two facts is where leads go to die.


Your customer in Indore types "mujhe price batao" into your WhatsApp chatbot. The bot responds in English. The customer sends a voice note in Hindi. The bot doesn’t understand. The customer leaves. Your CRM logs it as "session ended" — no lead, no sale, no trace of what went wrong.

This happens thousands of times a day across Indian businesses that have invested in automation but built it in the wrong language. The tools work. The intent is there. The customer showed up ready to buy. But somewhere between the Roman-script Hindi message and the English-only NLP pipeline, the conversation died.

The thing is, the customer was never going to switch to English. 90% of new internet users in India are non-English speakers (KPMG-Google, directionally confirmed by IAMAI 2024). The Hindi belt — Uttar Pradesh, Madhya Pradesh, Rajasthan, Bihar — has the lowest MSME Digital Maturity Index score at 55.8. The largest population of potential customers, the lowest digital adoption. That’s not a coincidence. It’s a language problem dressed up as a technology problem.

The Language Gap in Numbers: 90% non-English new users, 93% Hindi in Roman script, 42.6% Whisper WER baseline, $0.009/min Sarvam stack cost

The English-Only Stack Is Quietly Expensive

Every business we’ve onboarded that had "automated" customer communication was running an English-first stack — sometimes without realizing it. The chatbot provider defaults to English. The CRM templates are in English. The IVR menu might offer Hindi, but the actual AI processing happens in English with a translation layer bolted on top.

Messy ops desk in a Bangalore office — CRM dashboards with red notifications, Hindi notes in a notebook, tangled cables

Here’s what that costs you in practice.

Speech-to-text accuracy collapses. OpenAI’s Whisper — the most commonly used open-source transcription model — has a baseline Word Error Rate of 42.64% on Hindi (arXiv:2412.19785). That means nearly half the words are wrong. For Bengali, it’s worse: 134.98% WER, meaning the model hallucinates more words than the speaker actually said. For comparison, English WER on the same model sits under 10%.

Whisper Speech-to-Text Word Error Rate: Hindi baseline 42.64% drops to 9.24% after fine-tuning; Bengali 134.98% to 11.78%; Tamil 63.90% to 23.08%

Your customers type in Roman script, but your models expect Devanagari. 93.17% of native Hindi speakers write on social media in Roman script — "kya price hai" instead of "क्या प्राइस है" (arXiv:2512.10780). Roman-script Hindi produces significantly higher token-level entropy — 5.6 bits versus 3.6 for Devanagari. The LLM is working harder and getting worse results. Without a transliteration layer that converts Roman Hindi to Devanagari before inference, you’re looking at a 9.9-point F1 score drop in classification tasks.
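Before anything else, your pipeline needs to know which script a message arrived in. A minimal standard-library sketch (character heuristics only; real Hinglish detection needs a trained classifier):

```python
def detect_script(text: str) -> str:
    """Classify a message by script, using the Unicode
    Devanagari block (U+0900-U+097F) as the signal."""
    deva = sum("\u0900" <= ch <= "\u097f" for ch in text)
    latin = sum(ch.isascii() and ch.isalpha() for ch in text)
    if deva and latin:
        return "mixed"       # code-mixed, e.g. "delivery कब होगी"
    if deva:
        return "devanagari"  # safe to send straight to the LLM
    return "roman"           # transliterate/normalize before inference

print(detect_script("क्या प्राइस है"))  # devanagari
print(detect_script("kya price hai"))   # roman
```

This gate is what decides whether a message goes straight to inference or through the transliteration layer first.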

Script Gap: F1 scores drop 2.5-5.1 points when Hindi input is Roman script vs native Devanagari across Claude 4.5, GPT-4o, Qwen3-80B, LLaMA 4, DeepSeek-V3

The cost math is upside down. Running a Hindi voice agent on global providers (Google STT + GPT-4o + Azure TTS) costs roughly $0.055 per minute — 3x the cost of an equivalent English stack at $0.017/min. The teams we’ve talked to see this and conclude Hindi automation is too expensive. They’re looking at the wrong price list.
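To see why teams misread the price list, run the arithmetic at volume. A quick sketch using the per-minute figures above (the 10,000 minutes/month volume is an assumption for illustration):

```python
RATES = {  # $/minute, the figures cited above
    "english_global_stack": 0.017,
    "hindi_global_providers": 0.055,
    "hindi_sarvam_stack": 0.009,
}

MINUTES_PER_MONTH = 10_000  # assumed call volume, illustration only

for stack, rate in RATES.items():
    print(f"{stack}: ${rate * MINUTES_PER_MONTH:,.0f}/month")
# english_global_stack: $170/month
# hindi_global_providers: $550/month
# hindi_sarvam_stack: $90/month
```

At that volume, the India-native Hindi stack is roughly half the cost of the English-only global stack, not triple it.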

Cost per minute comparison: English global stack $0.017, Hindi global providers $0.055, Hindi Sarvam stack $0.009 — India-native is cheapest

The businesses we’ve seen abandon chatbots within six months almost always built on an English-first architecture and tried to patch in Hindi afterward. Industry-wide, the attrition is staggering — 90% of AI chatbots deployed in 2025 failed to sustain engagement (LoopReply). The architecture was wrong from day one.

What You Can Do Monday Morning

You don’t need a new vendor or a budget approval to start fixing this.

Audit your language gap

For one week, export every WhatsApp conversation your business had. Categorize each message: English, Hindi (Devanagari), Hindi (Roman script), Hinglish, or other regional language. When we’ve run this audit during onboarding, the Roman Hindi percentage lands above 70% every time, and the bot’s comprehension rate on those messages sits below 40%.
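A first pass at that audit is a short script over the exported chat log. This is a heuristic sketch: the marker-word list is illustrative, and a serious audit should use a proper language-ID model.

```python
from collections import Counter

# Tiny set of common Roman-script Hindi marker words -- illustrative
# only; extend from your own chat exports or use a language-ID model.
ROMAN_HINDI_MARKERS = {"hai", "kya", "batao", "nahi", "karna", "kab", "mujhe"}

def categorize(message: str) -> str:
    if any("\u0900" <= ch <= "\u097f" for ch in message):
        return "hindi_devanagari"
    if set(message.lower().split()) & ROMAN_HINDI_MARKERS:
        return "hindi_roman"
    return "english_or_other"

# Hypothetical export -- replace with your WhatsApp message dump.
chats = ["mujhe price batao", "What is the price?",
         "क्या प्राइस है", "delivery kab hogi"]
print(Counter(categorize(m) for m in chats))
```

Even this crude tally is usually enough to show the Roman Hindi share dominating the export.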

Test your chatbot in the customer’s language

Send your own chatbot ten messages in Roman-script Hindi. Common ones: "price batao", "delivery kab hogi", "kal ka order cancel karna hai", "koi discount milega kya". Screenshot the responses. If even three of those ten get a useful reply, you’re ahead of most.

Switch your WhatsApp Business greeting to Hindi-first

WhatsApp supports 11 Indian languages natively. Change your auto-greeting from English to Hindi (or your dominant customer language). Max Life Insurance saw 40% of users complete journeys in their native language after adding vernacular support, with an 80% completion rate among vernacular customers (Haptik case study). Your greeting is the first signal that your business speaks their language.

Max Life Insurance case study: 5x ROI on WhatsApp ads, 4x lead conversion, 40% native language journeys, 80% vernacular service completion

Use Bhashini for free translation

The government’s Bhashini platform handles 100 million+ monthly inference calls across 35+ languages, and API access is free. It won’t replace a production system, but it’s good enough to prototype a translation layer for your most common customer queries. Build a simple lookup: top 50 customer questions in Hindi, mapped to your standard responses.
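The lookup itself can be prototyped in the standard library; fuzzy matching absorbs some of the spelling variation in Roman Hindi. The FAQ entries and cutoff below are hypothetical:

```python
import difflib

FAQ = {  # hypothetical top queries mapped to canned Hindi replies
    "price batao": "हमारी कीमतें ₹499 से शुरू होती हैं।",
    "delivery kab hogi": "डिलीवरी 3-5 दिनों में होती है।",
    "koi discount milega kya": "पहले ऑर्डर पर 10% छूट है।",
}

def answer(query):
    """Return the canned reply for the closest known question,
    or None so the message escalates to a human."""
    match = difflib.get_close_matches(query.lower(), FAQ, n=1, cutoff=0.6)
    return FAQ[match[0]] if match else None

print(answer("price batado"))   # close enough to "price batao"
print(answer("hello there"))    # no match -> escalate
```

The cutoff is a judgment call: too low and you send wrong answers, too high and everything escalates.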

Track voice notes separately

If your customers send voice notes on WhatsApp — and in Tier 2/3 cities, they overwhelmingly do — start logging how many you receive versus how many get a useful response. Every business we’ve audited has zero automation on voice notes. That’s the baseline you need to measure against.

Hand holding a phone recording a WhatsApp voice note in a textile shop in Surat — the customer who will never type in English

Where It Gets Harder

The Monday-morning fixes give you visibility into the problem. The structural fix is a different kind of work.

Whiteboard in a Bangalore startup war room with STT-LLM-TTS pipeline diagrams and a laptop showing terminal code

A Hindi-first AI stack isn’t a chatbot with a translation plugin. It’s a pipeline with at least four distinct components, each with its own failure modes.

Script detection and transliteration. Before any message hits your LLM, you need to detect whether it’s Roman-script Hindi, Devanagari, Hinglish code-mixed, or something else. Then normalize it. The word "मैं" appears as "main", "mein", "mien", "men", and "me" in Roman script — all in common use, no standard spelling. A naive transliteration layer gets this wrong constantly. Pre-normalization from Roman to Devanagari before LLM inference recovers about 5 F1 points (75.3% to 80.1% on GPT-4o), but the normalization model itself needs to handle the entropy.
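At its most naive, that layer is a variant map. The sketch below shows the idea and its limit: "me" is also an English word, so a pure lookup table can't include it without clobbering English input; that ambiguity is exactly what the trained normalization model has to resolve.

```python
# Illustrative variant map -- a production system needs a trained
# transliteration model, since Roman spellings are open-ended.
VARIANTS = {
    "main": "मैं", "mein": "मैं", "mien": "मैं", "men": "मैं",
    "kya": "क्या", "kia": "क्या",
    "hai": "है",
}

def normalize(message: str) -> str:
    """Replace known Roman-Hindi variants with Devanagari,
    leaving unknown tokens (names, English words) untouched."""
    return " ".join(VARIANTS.get(w, w) for w in message.lower().split())

print(normalize("main price puchna chahta hun"))
# मैं price puchna chahta hun
```

Note what the lookup leaves behind: everything it doesn't recognize passes through unchanged, which is the safe failure mode.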

STT model selection for telephony. Fine-tuned Whisper on Hindi gets WER down to 9.24%, but that’s on clean audio. Telephony audio is 8 kHz, compressed, with background noise. Sarvam’s Saaras V3 is optimized for exactly this — sub-150ms time-to-first-token, Rs 0.50/min, and it handles mid-sentence Hindi-English code-switching. The Indian-native stack (Sarvam STT + Sarvam LLM + Bulbul TTS) comes in at roughly $0.009 per minute — cheaper than an English-only global stack. But stitching those components together with proper failover, latency management, and conversation state is where the engineering lives.
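The stitching itself follows a standard failover pattern. A sketch with placeholder provider callables (not real SDK calls):

```python
import time

def transcribe_with_failover(audio, providers, timeout_s=2.0):
    """Try each STT provider in priority order; move to the next
    on error or an over-budget response. `providers` is a list
    of (name, callable) pairs -- placeholders, not real SDKs."""
    last_err = None
    for name, stt in providers:
        start = time.monotonic()
        try:
            text = stt(audio)
            if time.monotonic() - start <= timeout_s:
                return text
            last_err = TimeoutError(f"{name} exceeded {timeout_s}s")
        except Exception as err:
            last_err = err  # in production: log, then try the next one
    raise RuntimeError(f"all STT providers failed: {last_err}")

# Hypothetical usage: Indian-native primary, global fallback.
providers = [
    ("primary", lambda audio: "ठीक है, ऑर्डर कन्फर्म करें"),
    ("fallback", lambda audio: "theek hai"),
]
print(transcribe_with_failover(b"\x00", providers))
```

The hard part isn't this loop; it's keeping conversation state consistent when the fallback provider returns a different script or formatting than the primary.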

STT provider cost comparison: Sarvam 0.6 cents/min, Deepgram 0.43 cents, Azure 1.7 cents, Google 2.4 cents, AWS 1.67 cents

The RAG layer needs to be script-aware. Your knowledge base is probably in English. Your customer asks in Roman Hindi. Standard embedding models (even multilingual ones like mE5) lose retrieval precision across that script boundary. Hindi-specific retrieval models like DeepRAG show a 23% improvement in retrieval precision over general multilingual embeddings (arXiv:2503.08213). The canonical knowledge base needs to live in Devanagari, with incoming queries normalized before lookup. That’s a data architecture decision, not a plugin toggle.
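The script boundary is easy to demonstrate with a toy similarity function standing in for the embedding model (Jaccard over characters here, purely for illustration):

```python
def jaccard(a: str, b: str) -> float:
    """Toy character-overlap similarity -- a stand-in for an
    embedding model, only to make the script gap visible."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Canonical knowledge base kept in Devanagari, as argued above.
KB = ["डिलीवरी 3-5 दिनों में होती है", "कीमत ₹499 से शुरू होती है"]

def retrieve(query: str) -> str:
    return max(KB, key=lambda doc: jaccard(query, doc))

raw = "delivery kab hogi"        # Roman query: almost no overlap with KB
normalized = "डिलीवरी कब होगी"    # after upstream transliteration

print(jaccard(raw, KB[0]))       # near zero -- retrieval is blind
print(retrieve(normalized))      # finds the delivery answer
```

Real embeddings degrade more gracefully than raw character overlap, but the direction is the same: normalize the query to the knowledge base's script before lookup.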

End-to-end latency budget. A viable Hindi voice agent needs to respond in under 1.5 seconds to feel natural. The current achievable stack — Sarvam Saaras V3 (<150ms) + GPT-4o (800-1,200ms) + Bulbul V3 (<400ms) — lands at roughly 1.3-1.8 seconds on first turn. That’s tight. Every additional processing step — transliteration, RAG lookup, consent check — eats into that budget. The latency architecture determines which features you can actually ship.
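The budget arithmetic is worth writing down, because the headroom is already negative before any extra step is added:

```python
THRESHOLD_MS = 1500  # perceived-natural ceiling cited above

budget = {  # typical first-turn figures from the text
    "stt_sarvam_saaras": 150,
    "llm_gpt4o": 1000,
    "tts_bulbul": 400,
}

total = sum(budget.values())
print(f"first turn: {total}ms, headroom: {THRESHOLD_MS - total}ms")
# first turn: 1550ms, headroom: -50ms
```

Any added step (transliteration, RAG lookup, consent check) has to fit inside that deficit, which is why such steps tend to be overlapped with streaming rather than run serially.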

Voice AI latency budget: Sarvam STT 150ms + LLM 1000ms + Sarvam TTS 400ms = 1550ms total first turn, near the 1500ms natural threshold

The transliteration pipeline, the script-aware RAG layer, and the telephony-grade latency budget are where the real design decisions live.

