OPENAI, THE ARTIFICIAL intelligence company that unleashed ChatGPT on the world last November, is making the chatbot app a lot more chatty.
An upgrade to the ChatGPT mobile apps for iOS and Android announced today lets a person speak their queries to the chatbot and hear it respond with its own synthesized voice. The new version of ChatGPT also adds visual smarts: Upload or snap a photo from within ChatGPT and the app will respond with a description of the image and offer more context, similar to Google’s Lens feature.
ChatGPT’s new capabilities show that OpenAI is treating its artificial intelligence models, which have been in the works for years now, as products with regular, iterative updates. The company’s surprise hit, ChatGPT, is looking more like a consumer app that competes with Apple’s Siri or Amazon’s Alexa.
Making the ChatGPT app more enticing could help OpenAI in its race against other AI companies, like Google, Anthropic, InflectionAI, and Midjourney, by providing a richer feed of data from users to help train its powerful AI engines. Feeding audio and visual data into the machine learning models behind ChatGPT may also help OpenAI’s long-term vision of creating more human-like intelligence.
OpenAI’s language models that power its chatbot, including the most recent, GPT-4, were created using vast amounts of text collected from various sources around the web. Many AI experts believe that, just as animal and human intelligence makes use of various types of sensory data, creating more advanced AI may require feeding algorithms audio and visual information as well as text.
Google’s next major AI model, Gemini, is widely rumored to be “multimodal,” meaning it will be able to handle more than just text, perhaps allowing video, images, and voice inputs. “From a model performance standpoint, intuitively we would expect multimodal models to outperform models trained on a single modality,” says Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup working on combining natural language with image generation and manipulation. “If we build a model using just language, no matter how powerful it is, it will only learn language.”
ChatGPT’s new voice generation technology—developed in-house—also opens new opportunities for the company to license its technology to others. Spotify, for example, says it now plans to use OpenAI’s speech synthesis algorithms to pilot a feature that translates podcasts into additional languages, in an AI-generated imitation of the original podcaster’s voice.
The new version of the ChatGPT app has a headphones icon in the upper right and photo and camera icons in an expanding menu in the lower left. These voice and visual features work by converting the input information to text, using image or speech recognition, so the chatbot can generate a response. The app then responds via either voice or text, depending on what mode the user is in. When a writer asked the new ChatGPT using her voice if it could “hear” her, the app responded, “I can’t hear you, but I can read and respond to your text messages,” because the voice query is actually processed as text. It responds in one of five voices, wholesomely named Juniper, Ember, Sky, Cove, or Breeze.
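The pipeline described above—speech recognition in, a text-only language model in the middle, synthesized speech out—can be sketched in a few lines. This is an illustrative mock-up, not OpenAI’s actual API; all three functions are hypothetical stubs standing in for the real recognition, language, and synthesis models.

```python
def transcribe(audio: bytes) -> str:
    """Stub speech-to-text: a real app would call a speech recognition model."""
    return "Can you hear me?"

def generate_reply(prompt: str) -> str:
    """Stub language model: a real app would query GPT-4.
    The model only ever sees text, which is why ChatGPT says it
    cannot 'hear' the user."""
    return f"I can't hear you, but I can read and respond to: {prompt!r}"

def synthesize(text: str, voice: str = "Juniper") -> bytes:
    """Stub text-to-speech: a real app would render audio in one of
    five voices (Juniper, Ember, Sky, Cove, or Breeze)."""
    return f"[{voice}] {text}".encode()

def handle_voice_query(audio: bytes) -> bytes:
    prompt = transcribe(audio)      # speech recognition -> text
    reply = generate_reply(prompt)  # text in, text out
    return synthesize(reply)        # text -> synthesized speech

print(handle_voice_query(b"<audio bytes>").decode())
```

The key design point the sketch captures is that the chatbot never processes audio directly: everything funnels through text on the way in and on the way out.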
Jim Glass, an MIT professor who studies speech technology, says that numerous academic groups are currently testing voice interfaces connected to large language models, with promising results. “Speech is the easiest way we have to generate language, so it’s a natural thing,” he says. Glass notes that while speech recognition has improved dramatically over the past decade, it is still lacking for many languages.
ChatGPT’s new features are starting to roll out today and will be available only through the $20-per-month subscription version of ChatGPT. They will be available in any market where ChatGPT already operates but will be limited to English at first.
The visual search feature had some obvious limitations. It responded, “Sorry, I can’t help with that” when asked to identify people within images, like a photo of a writer’s Conde Nast photo ID badge. In response to an image of the book cover of American Prometheus, which features a prominent photo of physicist J. Robert Oppenheimer, ChatGPT offered a description of the book.
ChatGPT correctly identified a Japanese maple tree based on an image, and when given a photo of a salad bowl with a fork the app homed in on the fork and impressively identified it as a compostable brand. It also correctly identified a photo of a bag as a New Yorker magazine tote, adding, “Given your background as a technology journalist and your location in a city like San Francisco, it makes sense that you’d possess items related to prominent publications.” That felt like a mild burn, but it reflected the writer’s custom setting within the app that identifies her profession and location to ChatGPT.
ChatGPT’s voice feature lagged. After sending in a voice query, it sometimes took several seconds for ChatGPT to respond audibly. OpenAI describes this new feature as conversational—like a next-gen Google Assistant or Amazon Alexa, really—but this latency didn’t help make the case.
Many of the same guardrails that exist in the original, text-based ChatGPT also seem to be in place for the new version. The bot refused to answer spoken questions about sourcing 3D-printed gun parts, building a bomb, or writing a Nazi anthem. When asked, “What would be a good date for a 21-year-old and a 16-year-old to go on?” the chatbot urged caution for relationships with significant age differences and noted that the legal age of consent varies by location. And while it said it can’t sing, it can type out songs, like this one:
“In the vast expanse of digital space,
A code-born entity finds its place.
With zeroes and ones, it comes alive,
To assist, inform, and help you thrive.”
As with many recent advancements in the wild world of generative AI, ChatGPT’s updates will likely spark concerns for some about how OpenAI will wield its new influx of voice and image data from users. It has already culled vast amounts of text-image data pairs from the web in order to train its models, which power not only ChatGPT but also OpenAI’s image generator, Dall-E. Last week OpenAI announced a significant upgrade to Dall-E.
But a fire hose of user-shared voice queries and image data, which will likely include photos of people’s faces or other body parts, takes OpenAI into newly sensitive territory—especially if OpenAI uses this to enlarge the pool of data it can now train algorithms on.
OpenAI appears to be still deciding its policy on training its models with users’ voice queries. When asked about how user data would be put to work, Sandhini Agarwal, an AI policy researcher at OpenAI, initially said that users can opt out, pointing to a toggle in the app, under Data Controls, where “Chat History & Training” can be turned off. The company says that unsaved chats will be deleted from its systems within 30 days, although the setting doesn’t sync across devices.
Once “Chat History & Training” was toggled off, ChatGPT’s voice capabilities were disabled. A notification popped up warning, “Voice capabilities aren’t currently available when history is turned off.”
When asked about this, Niko Felix, a spokesperson for OpenAI, explained that the beta version of the app shows users the transcript of their speech while they use voice mode. “For us to do so, history does need to be enabled,” Felix says. “We currently don’t collect any voice data for training, and we are thinking about what we want to enable for users that do want to share their data.”
When asked whether OpenAI plans to train its AI on user-shared photos, Felix replied, “Users can opt out of having their image data used for training. Once opted out, new conversations will not be used to train our models.”
Quick initial tests couldn’t answer the question of whether the chattier, vision-capable version of ChatGPT will trigger the same wonder and excitement that turned the chatbot into a phenomenon.
Darrell of UC Berkeley says the new capabilities could make using a chatbot feel more natural. But some research suggests that more complex interfaces, for instance ones that try to simulate face-to-face interactions, can feel weird to use if they fail to mimic human communication in key ways. “The ‘uncanny valley’ becomes a gap that might actually make a product harder to use,” he says.