OpenAI Enhances Its AI Voice & Transcription Models—Here's What's New


OpenAI has launched new AI models for voice generation and transcription, promising major improvements in realism, accuracy, and usability.

The updated models, now available through OpenAI's API, are part of the company's broader push toward AI-powered agents that can handle complex user interactions on their own.

One of the key advancements is the new text-to-speech model, gpt-4o-mini-tts, which OpenAI claims can produce more natural-sounding speech with better emotional range.

Developers can now instruct the AI on how to express itself, such as speaking in a calming tone or even mimicking the enthusiasm of a sports commentator.

Jeff Harris, a member of OpenAI's product team, highlighted the significance of this development, saying, "Our big belief here is that developers and users want to really control not just what is spoken, but how things are spoken."

This could revolutionize industries like customer service, where the AI can adjust its tone to the context of the conversation, according to TechCrunch.
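In the API, that steering amounts to passing a plain-language instruction alongside the text to synthesize. Below is a minimal sketch using OpenAI's Python SDK; the voice name, output file, and instruction wording are illustrative choices, not values from OpenAI's announcement.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate speech with gpt-4o-mini-tts, steering *how* the text is spoken
# via the `instructions` parameter. Voice and wording here are illustrative.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling. I can help you reschedule that appointment.",
    instructions="Speak in a calm, empathetic customer-service tone.",
) as response:
    response.stream_to_file("reply.mp3")  # write the synthesized audio to disk
```

Changing only the instructions string changes the delivery of the same text, which is the kind of control Harris describes.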

In addition to its voice generation improvements, OpenAI has introduced gpt-4o-transcribe and gpt-4o-mini-transcribe, which are designed to replace the older Whisper model.

These transcription models are trained on diverse, high-quality audio datasets, enabling them to accurately capture spoken words—even in noisy environments or with strong accents.

Harris noted that these models significantly reduce errors and hallucinations, cases where the AI generates words or phrases that were never actually spoken.

This was a major issue with Whisper, which sometimes inserted incorrect details into transcripts. He explained, "Making sure the models are accurate is completely essential to getting a reliable voice experience."
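For developers already using Whisper, the switch is essentially a change of model name on the same transcription endpoint. Here is a minimal sketch with the Python SDK; the filename is a placeholder.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

# Transcribe an audio file with the new gpt-4o-transcribe model.
# "gpt-4o-mini-transcribe" is the lower-cost alternative and takes the
# same call with a different model string.
with open("support_call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)  # the recognized text
```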

OpenAI Keeps Latest AI Transcription Models Proprietary, Citing Complexity

Unlike previous OpenAI transcription models, these new versions will not be openly available.

While Whisper was accessible under an MIT license, OpenAI has decided to keep gpt-4o-transcribe and gpt-4o-mini-transcribe proprietary due to their complexity and scale.

"They're not the kind of model that you can just run locally on your laptop," Harris stated. He emphasized that OpenAI is being cautious about releasing models as open source, prioritizing thoughtful deployment.

The improvements in AI voice and transcription models open doors for various industries. Call centers, virtual assistants, and interactive voice response (IVR) systems can now offer more personalized and human-like interactions.

According to Inc., OpenAI also introduced an integration with its Agents SDK, allowing developers to convert text-based AI agents into voice-driven assistants with minimal coding.
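In practice, that conversion wraps an existing text agent in a speech-to-speech pipeline. The sketch below follows the Agents SDK's voice quickstart and assumes the openai-agents package with its voice extras installed (pip install 'openai-agents[voice]'); the class names come from the quickstart and may shift between SDK versions.

```python
import asyncio

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An ordinary text-based agent; the name and instructions are illustrative.
agent = Agent(
    name="Leasing assistant",
    instructions="Answer tenant questions about leasing and maintenance.",
)

# Wrap the agent in a voice pipeline: speech-to-text in, the text agent
# in the middle, text-to-speech out.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def main() -> None:
    # Three seconds of silence as stand-in input; a real application
    # would pass captured microphone audio here.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Stream synthesized audio chunks back as they are produced.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            ...  # play event.data through your audio device

asyncio.run(main())
```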

Companies like EliseAI have already tested the new voice technology, using it to enhance tenant communication in the housing sector.

Co-founder Tony Stoyanov praised the AI's effectiveness, stating that its "emotionally rich voices encourage callers to comfortably engage with AI for leasing inquiries, maintenance requests, and tour scheduling."
