Voice AI Finds Its Voice
Driving business efficiency with voice agents
Vanessa Li · August 2025
Since the dawn of time, voice has been our most natural and information-dense form of communication. Every utterance carries more than just words: tone, pace, and emotion reveal our true intent. What once required human interpretation (context, meaning, sentiment) can now be processed instantly by AI. We're archiving enormous amounts of spoken content. Voice has become both the richest data source and the most underutilized one.
"Language is the foundation of civilization" – Arrival movie.

Voice is no longer just a medium of human interaction; it is becoming a machine-readable interface. It has seen wide adoption, from automating customer support and enabling hands-free productivity to training more human-like AI agents. As foundation models continue to improve, voice is poised to emerge as the dominant interface between humans and machines.
People still rely heavily on phone calls. From a customer's perspective, calling a business directly after finding it online is the fastest way to get what they need, especially in urgent situations. Yet research revealed that 62% of phone calls to small businesses go unanswered, and 70% of businesses answered fewer than half of their calls. (source)
Industry-specific voice solutions
It's no surprise that many of today's voice AI applications are concentrated in customer service and lead conversion, where businesses handle massive volumes of human interactions. The benefits are immediate: faster response times, reduced costs, and improved customer experiences.
To deliver this value, companies must be industry-focused. Generic voice agents cannot answer questions specific to a business, so the agent must be trained on proprietary information, the way a customer service representative would be. In practice this means tailoring commands, workflows, and integrations to industry needs. Voice agents are here to replace company representatives.
Voice agents are powerful tools for converting inquiries into business deals. Unlike a human, a voice AI agent is available 24/7 and never runs out of patience.
Industry-focused voice solutions can unlock productivity in any sector where work is time-sensitive, hands-on, or highly transactional. I'm thinking of:
- SMB plumbing/home renovation services
- Mental health therapy
- Customer service / call centers
- In-car voice assistants
- Accessibility support
- Education (paired with AR/VR)
- Dubbing in entertainment
- Elderly care
- Voice biometrics / identity detection
- Offline AI (on-device voice agents)
- Negotiation / bill disputes
- Outbound calls (with a caveat: in the US, automated calls made without the recipient's prior consent are restricted under the Telephone Consumer Protection Act)
Current Market Landscape
- Car Dealers (lead conversion, service scheduling, financing inquiries): numo, toma
- Real Estate (leasing assistant for property managers): Uniti AI, Colleen.ai, HostAI, EliseAI
- Medical (appointment scheduling, insurance verification): hyro, Infinitus, Outbound AI
- Call Center (customer service): GigaML, PolyAI, Replicant
- Recruiting (candidate screening): ConverzAI, Ribbon, Mercor, Humanly, heyMilo
- Restaurants (you can imagine): Slang
- Logistics (coordination): HappyRobot, FleetWorks
Voice agents create value for both end users and the companies that build them. They can also be integrated with industry software. Once a business has adopted one, switching to another provider is costly, which makes these agents highly sticky and defensible for the companies building them.
Challenge
The challenge, however, lies in reliability. Voice models can hallucinate or fail in other ways. Building a truly high-quality product means orchestrating the right mix of models, integrations, conversational flows, and error-handling.
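To make the error-handling point concrete, here is a minimal sketch of one possible fallback policy: ask the caller to repeat when the transcript looks unreliable, and hand off to a human when the agent itself fails. The function name, the threshold, and both messages are assumptions made for illustration, not part of any particular product.

```python
from typing import Callable

# Hypothetical values chosen for illustration only.
MIN_CONFIDENCE = 0.6
REPROMPT_MESSAGE = "Sorry, I didn't catch that. Could you repeat it?"
HANDOFF_MESSAGE = "Let me connect you with a team member."


def answer_with_fallback(
    text: str,
    confidence: float,
    agent: Callable[[str], str],
) -> str:
    """Return the agent's reply, or a safe fallback when something looks wrong."""
    # Low transcription confidence: ask again rather than guess.
    if confidence < MIN_CONFIDENCE:
        return REPROMPT_MESSAGE
    try:
        return agent(text)
    except Exception:
        # Any failure in the agent escalates to a human instead of going silent.
        return HANDOFF_MESSAGE
```

A real deployment would likely add retries, logging, and checks on the reply itself, but the shape stays the same: every failure path ends in something the caller can act on.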
Voice AI infrastructure
Foundation
- Speech-to-Text (STT) / Automatic Speech Recognition (ASR): Converts speech to text (e.g., Whisper, DeepSpeech).
- Text-to-Speech (TTS): Produces realistic voices (e.g., ElevenLabs, Speechify).
- Language Models (NLP): Understand context, intent, and semantics (e.g., GPT-based models, Claude).
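The three foundation pieces are typically chained into one conversational turn: audio in, text, a reply, audio out. The sketch below shows that chaining with stub stages so it runs without any model. The type aliases, the `VoicePipeline` class, and the `fake_*` functions are all hypothetical stand-ins, not the API of any vendor named above.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real services have their own client APIs.
Transcriber = Callable[[bytes], str]   # STT/ASR: audio -> text
Responder = Callable[[str], str]       # language model: text -> reply text
Synthesizer = Callable[[str], bytes]   # TTS: reply text -> audio


@dataclass
class VoicePipeline:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one caller turn through STT, the language model, and TTS."""
        text = self.transcribe(audio_in)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stub stages: "audio" is just UTF-8 bytes so the example is self-contained.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
```

Keeping each stage behind a plain callable makes it easy to swap one provider for another without touching the rest of the turn logic.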
Middleware: This is the "brain" that decides what to do once speech is captured.
- Intent Recognition: Mapping raw speech into user intent (e.g., "book me a flight" → API call).
- Workflow Engines: Handling multi-step tasks ("Send an email, then schedule a meeting").
- Context Memory: Keeping track of previous conversations and user state (e.g., Letta, Redis).
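As a rough illustration of how these three middleware pieces fit together, the sketch below maps an utterance to intents with keyword rules, splits a multi-step request on "then", and records the result in a per-caller session. The intent names, the patterns, and the `Session` class are assumptions for the example; a production system would more likely use a language model for intent recognition and a persistent store for memory.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword rules, one per intent.
INTENT_PATTERNS = {
    "book_flight": re.compile(r"\bbook\b.*\bflight\b"),
    "send_email": re.compile(r"\bsend\b.*\bemail\b"),
    "schedule_meeting": re.compile(r"\bschedule\b.*\bmeeting\b"),
}


@dataclass
class Session:
    """In-memory context: the intents seen so far for one caller."""
    history: list[str] = field(default_factory=list)


def recognize_intents(utterance: str) -> list[str]:
    """Split a request on 'then' and map each clause to its first matching intent."""
    intents = []
    for clause in re.split(r"\bthen\b", utterance.lower()):
        for name, pattern in INTENT_PATTERNS.items():
            if pattern.search(clause):
                intents.append(name)
                break
    return intents


def handle(utterance: str, session: Session) -> list[str]:
    """Produce the ordered workflow steps and remember them in the session."""
    steps = recognize_intents(utterance)
    session.history.extend(steps)
    return steps
```

For example, `handle("Send an email, then schedule a meeting", Session())` yields the two steps in order, and each step would then be dispatched to the matching API call.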
Major problems to be solved
- Latency optimization: we want sub-second response
- Integration frameworks
- Security & authentication: handling access to sensitive user systems (emails, Slack…)
- Understanding multiple speakers
- Expressing emotional intelligence: tone, pauses, filler words
- Understanding the speaker's emotion
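On the latency point, one simple way to reason about a sub-second target is as a budget shared across the stages of a turn. The sketch below checks a set of per-stage timings against that budget and reports the slowest stage. The stage names and the helper functions are assumptions for illustration, and real timings would come from measurement rather than from this code.

```python
import time
from typing import Callable

BUDGET_SECONDS = 1.0  # the "sub-second" target


def timed(stage: Callable[[], object]) -> float:
    """Return how long a single stage call takes, in seconds."""
    start = time.perf_counter()
    stage()
    return time.perf_counter() - start


def within_budget(latencies: dict[str, float], budget: float = BUDGET_SECONDS) -> bool:
    """True when the summed stage latencies stay under the budget."""
    return sum(latencies.values()) < budget


def slowest_stage(latencies: dict[str, float]) -> str:
    """Name of the stage that consumes the most time."""
    return max(latencies, key=latencies.get)
```

Summing stages assumes they run strictly one after another. Streaming partial transcripts and partial audio lets stages overlap, which is one common way to get under the budget.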
Application Layer
- Horizontal: General-purpose assistants (e.g., Martin).
- Vertical: Industry-specific voice solutions (B2B).
I expect foundation models to keep getting dramatically better over time, fueling applications across industries. From converting leads to coordinating logistics, these agents are becoming tightly woven into workflows. Companies that embrace voice AI early will capture enormous efficiency gains. The question ahead isn't whether voice will matter, but who will execute it best. What began as human dialogue is now the foundation of the next great computing interface.
Check out alexis-ai.com (coming soon).