First Impressions on GPT-4o
The new model is largely about an interface change. Until now, GPT was fed inputs in its original text-prompt format, and more recently with images. GPT-4o opens up the possibility of GPT acting more like a smart speaker: listening, understanding, and responding all in one go. It appears to have been tuned specifically for the responsiveness a conversational model needs. The "time to first token" metric measures how long it takes from the moment a model receives its input until it begins generating an answer. It matters less how long the model takes to respond completely if it can start streaming part of the answer sooner, and this appears to be a great deal of the focus of GPT-4o.
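The time-to-first-token framing is easy to see from the client side. Below is a minimal sketch, assuming the openai Python SDK (v1+), a configured API key, and the "gpt-4o" model name; the prompt and timing logic are illustrative, not from the announcement.

```python
# Minimal sketch: measuring "time to first token" on a streamed chat
# completion. Assumes the openai Python SDK (v1+) and OPENAI_API_KEY
# in the environment; the prompt is illustrative.
import time
from openai import OpenAI

client = OpenAI()

t_start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,  # ask for tokens as they are generated
)

t_first = None
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if t_first is None:
            t_first = time.perf_counter()  # first token has arrived
        print(delta, end="", flush=True)

print(f"\ntime to first token: {t_first - t_start:.2f}s")
```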
What Differentiates GPT-4o From Other AI Models
Anyone who tracked the AI space before GenAI knows that "speech to text," also known as voice recognition, was the frontier of AI until it was largely solved a few years ago. Similarly, generating audio from text, or "text to speech," was long an unsolved problem. In recent years, many providers, including OpenAI with Whisper and Google with gTTS, have served up these speech-to-text and text-to-speech models separately from GPT. The new approach simply eliminates the latencies in these human interfaces by combining them all into one model.
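For contrast, the pre-4o pipeline looked roughly like the sketch below: three separate calls, each with its own round-trip latency. This uses OpenAI's hosted Whisper, chat, and TTS endpoints via the openai Python SDK; the file names and prompt wiring are illustrative assumptions, not a reference implementation.

```python
# Rough sketch of the pre-4o pipeline: separate speech-to-text, chat,
# and text-to-speech calls, each adding its own latency.
# Assumes the openai Python SDK (v1+); file names are illustrative.
from openai import OpenAI

client = OpenAI()

# 1. Speech to text (Whisper)
with open("question.wav", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_in
    )

# 2. Text in, text out (GPT)
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text to speech
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as audio_out:
    audio_out.write(speech.read())
```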
If the underlying GenAI technology were substantially different, OpenAI would have revved the "4" in the model name. By calling it GPT-4o, they are signaling that it belongs to the GPT-4 family, alongside GPT-4 Turbo and GPT-4V. This implies that the transformer technology that is truly the intelligent part is largely unchanged; what is new is the engineering of combining all input and output with the underlying AI model.
How GPT-4o Enables OpenAI to Compete with Google and Other LLM Vendors
GPT-4o's ability to handle multiple languages seamlessly, without requiring the language of an audio file to be specified, gives it a significant advantage over competitors like Google. In Google's stack, the models are tuned to the native language of the speaker, which means, for example, that the Python APIs require the caller to declare what language an audio file contains. With OpenAI's Whisper model, this requirement is gone: the model determines what language is being spoken and then transcribes it in that language seamlessly.
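A rough sketch of that contrast, assuming the google-cloud-speech and openai Python packages with credentials already configured (file names are illustrative): the Google Cloud Speech-to-Text v1 API takes a language_code up front, while the Whisper endpoint detects the language on its own.

```python
# Sketch of the difference: Google Cloud Speech-to-Text expects the
# caller to name the language; Whisper detects it automatically.
# Assumes credentials are configured; file names are illustrative.
from google.cloud import speech
from openai import OpenAI

# Google: the caller declares the language of the audio.
g_client = speech.SpeechClient()
with open("question.flac", "rb") as f:
    g_audio = speech.RecognitionAudio(content=f.read())
g_config = speech.RecognitionConfig(language_code="en-US")  # language declared up front
g_result = g_client.recognize(config=g_config, audio=g_audio)

# Whisper: no language parameter; the model detects and transcribes it.
o_client = OpenAI()
with open("question.flac", "rb") as f:
    transcript = o_client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)
```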
AI-powered smart speakers offer a tantalizing view into a universe where speech becomes the new user experience and screens disappear altogether. While the concept is visible in basic interactions with Alexa or Siri, those implementations are widely considered tedious and, frankly, dumb. There have been several promising demonstrations of more intelligent interaction, but they suffer from high latencies that disrupt conversation and make the exchanges awkward.
A world of applications opens up if this technology works seamlessly, and OpenAI is the first mover. Drive-through points of sale, any sort of form intake, tech support, coaching, counseling, teaching, companionship: these are all applications where the product is the conversation. If a model can provide the content, and now also the conversation, the automation is complete.
There's nothing especially remarkable about the engineering being presented here, and Google and others will follow quickly with similar assemblies of their own stacks. OpenAI's advantage will be establishing the software API first, which positions them as thought leaders and trendsetters: they are defining the connectors that will power the AI building blocks of the future.
Potential Dangers with GPT-4o
One possible danger to consider is impersonation. With very low latency and a large context window, this model can pretend to be a person cheaply enough to automate large-scale robocalling fraud, and it would be hard to tell over the phone that you are talking to a model. The same qualities that are an advantage in legitimate applications become a liability in fraudulent ones. Traditional problems like hallucinations are also more likely to slip through as valid responses, because the model is fast and a low-latency voice conversation leaves little time to scrutinize an answer. Think of it as a credible-sounding, fast-talking pitchman.
One of the things we have seen with it is that it begins generating the response to the user (the "time to first token" metric) while it is still deciding what tools it needs to use to finish the reply, a kind of "thinking on its feet" that happens live. As a result, the model answers faster while simultaneously giving itself more time to think, all for half the price of prior models.
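You can watch this interleaving from the client side. The sketch below assumes the openai Python SDK and a hypothetical get_weather tool definition; in a streamed response, the deltas can carry both partial answer text and tool-call fragments, so the application sees content before the tool selection has finished arriving.

```python
# Sketch: observing streamed deltas that can carry both answer text and
# tool-call fragments. The get_weather tool is hypothetical; only the
# stream-handling pattern matters here.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
    stream=True,
)

tool_args = ""
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:        # partial answer text, shown immediately
        print(delta.content, end="", flush=True)
    if delta.tool_calls:     # tool selection arriving in fragments
        for call in delta.tool_calls:
            if call.function and call.function.arguments:
                tool_args += call.function.arguments

print("\ntool arguments so far:", tool_args or "(none)")
```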
What’s Next
This release also lets OpenAI establish branding and features that will carry into future models. For example, the Turbo moniker was added to GPT-3.5 and then GPT-4, so we would expect to continue to see GPT releases followed by Turbo versions that are cheaper and faster. Similarly, GPT-4 has offered V and now o options, and we expect to see those same options on GPT-4.5 and GPT-5.0, speculated for later this year.