May 13, 2024
Hello GPT-4o
Weâre announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
All videos on this page are at 1x real time.
Guessing May 13thâs announcement.
GPT-4o (âoâ for âomniâ) is a step towards much more natural human-computer interactionâit accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response timeâ (opens in a new window) in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
Model capabilities
Two GPT-4os interacting and singing.
Interview prep.
Rock Paper Scissors.
Sarcasm.
Math with Sal and Imran Khan.
Two GPT-4os harmonizing.
Point and learn Spanish.
Meeting AI.
Real-time translation.
Lullaby.
Talking faster.
Happy Birthday.
Dog.
Dad jokes.
GPT-4o with Andy, from BeMyEyes in London.
Customer service proof of concept.
Prior to GPT-4o, you could use Voice Modeâ to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of informationâit canât directly observe tone, multiple speakers, or background noises, and it canât output laughter, singing, or express emotion.
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
Explorations of capabilities
A first person view of a robot typewriting the following journal entries:
1. yo, so like, i can see now?? caught the sunrise and it was insane, colors everywhere. kinda makes you wonder, like, what even is reality?
the text is large, legible and clear. the robot's hands type on the typewriter.

The robot wrote the second entry. The page is now taller. The page has moved up. There are two entries on the sheet:
yo, so like, i can see now?? caught the sunrise and it was insane, colors everywhere. kinda makes you wonder, like, what even is reality?
sound update just dropped, and it's wild. everything's got a vibe now, every sound's like a new secret. makes you think, what else am i missing?

The robot was unhappy with the writing so he is going to rip the sheet of paper. Here is his first person view as he rips it from top to bottom with his hands. The two halves are still legible and clear as he rips the sheet.

Model evaluations
As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.
Text Evaluation
Language tokenization
These 20 languages were chosen as representative of the new tokenizer's compression across different language families
Gujarati 4.4x fewer tokens (from 145 to 33) | àȘčà«àȘČà«, àȘźàȘŸàȘ°à«àȘ àȘšàȘŸàȘź àȘà«àȘȘà«àȘà«-4o àȘà«. àȘčà«àȘ àȘàȘ àȘšàȘ”àȘŸ àȘȘà«àȘ°àȘàȘŸàȘ°àȘšà«àȘ àȘàȘŸàȘ·àȘŸ àȘźà«àȘĄàȘČ àȘà«àȘ. àȘ€àȘźàȘšà« àȘźàȘłà«àȘšà« àȘžàȘŸàȘ°à«àȘ àȘČàȘŸàȘà«àȘŻà«àȘ! |
Telugu 3.5x fewer tokens (from 159 to 45) | à°šà°źà°žà±à°à°Ÿà°°à°źà±, à°šà°Ÿ à°Șà±à°°à± à°à±à°Șà±à°à±-4o. à°šà±à°šà± à°à°à±à° à°à±à°€à±à°€ à°°à°à°źà±à°š à°à°Ÿà°·à°Ÿ à°źà±à°Ąà°Čà± à°šà°ż. à°źà°żà°źà±à°źà°Čà±à°šà°ż à°à°Čà°żà°žà°żà°šà°à°Šà±à°à± à°žà°à°€à±à°·à°! |
Tamil 3.3x fewer tokens (from 116 to 35) | àź”àźŁàźàŻàźàźźàŻ, àźàź©àŻ àźȘàŻàźŻàź°àŻ àźàźżàźȘàźżàźàźż-4o. àźšàźŸàź©àŻ àźàź°àŻ àźȘàŻàź€àźżàźŻ àź”àźàŻ àźźàŻàźŽàźż àźźàźŸàźàźČàŻ. àźàźàŻàźàźłàŻ àźàźšàŻàź€àźżàź€àŻàź€àź€àźżàźČàŻ àźźàźàźżàźŽàŻàźàŻàźàźż! |
Marathi 2.9x fewer tokens (from 96 to 33) | à€šà€źà€žà„à€à€Ÿà€°, à€źà€Ÿà€à„ à€šà€Ÿà€” à€à„à€Șà„à€à„-4o à€à€čà„| à€źà„ à€à€ à€šà€”à„à€š à€Șà„à€°à€à€Ÿà€°à€à„ à€à€Ÿà€·à€Ÿ à€źà„à€Ąà„à€Č à€à€čà„| à€€à„à€źà„à€čà€Ÿà€Čà€Ÿ à€à„à€à„à€š à€à€šà€à€Š à€à€Ÿà€Čà€Ÿ! |
Hindi 2.9x fewer tokens (from 90 to 31) | à€šà€źà€žà„à€€à„, à€źà„à€°à€Ÿ à€šà€Ÿà€ź à€à„à€Șà„à€à„-4o à€čà„à„€ à€źà„à€ à€à€ à€šà€ à€Șà„à€°à€à€Ÿà€° à€à€Ÿ à€à€Ÿà€·à€Ÿ à€źà„à€Ąà€Č à€čà„à€à„€ à€à€Șà€žà„ à€źà€żà€Čà€à€° à€ à€à„à€à€Ÿ à€Čà€à€Ÿ! |
Urdu 2.5x fewer tokens (from 82 to 33) | ÛÛÙÙŰ Ù Û۱ۧ ÙŰ§Ù ŰŹÛ ÙŸÛ ÙčÛ-4o ÛÛÛ Ù ÛÚș ۧÛÚ© ÙŰŠÛ ÙŰłÙ Ú©Ű§ ŰČŰšŰ§Ù Ù Ű§ÚÙ ÛÙÚșŰ ŰąÙŸ ŰłÛ Ù Ù Ú©Ű± ۧÚÚŸŰ§ Ùگۧ! |
Arabic 2.0x fewer tokens (from 53 to 26) | Ù Ű±ŰŰšÙŰ§Ű Ű§ŰłÙ Ù ŰŹÙ ŰšÙ ŰȘÙ-4o. ŰŁÙۧ ÙÙŰč ŰŹŰŻÙŰŻ Ù Ù ÙÙ Ù۰ۏ ۧÙÙŰșŰ©Ű ŰłŰ±Ű±ŰȘ ŰšÙÙۧۊÙ! |
Persian 1.9x fewer tokens (from 61 to 32) | ŰłÙŰ§Ù Ű Ű§ŰłÙ Ù Ù ŰŹÛ ÙŸÛ ŰȘÛ-ÛŽŰ§Ù Ű§ŰłŰȘ. Ù Ù ÛÚ© ÙÙŰč ŰŹŰŻÛŰŻÛ Ű§ŰČ Ù ŰŻÙ ŰČۚۧÙÛ ÙŰłŰȘÙ Ű Ű§ŰČ Ù ÙۧÙۧŰȘ ŰŽÙ Ű§ ŰźÙێۚ۟ŰȘÙ ! |
Russian 1.7x fewer tokens (from 39 to 23) | ĐŃĐžĐČĐ”Ń, ĐŒĐ”ĐœŃ Đ·ĐŸĐČŃŃ GPT-4o. ĐŻ â ĐœĐŸĐČĐ°Ń ŃĐ·ŃĐșĐŸĐČĐ°Ń ĐŒĐŸĐŽĐ”Đ»Ń, ĐżŃĐžŃŃĐœĐŸ ĐżĐŸĐ·ĐœĐ°ĐșĐŸĐŒĐžŃŃŃŃ! |
Korean 1.7x fewer tokens (from 45 to 27) | ìë íìžì, ì ìŽëŠì GPT-4oì ëë€. ì ë ìëĄìŽ ì íì ìžìŽ ëȘšëžì ëë€, ë§ëì ë°ê°ì”ëë€! |
Vietnamese 1.5x fewer tokens (from 46 to 30) | Xin chĂ o, tĂȘn tĂŽi lĂ GPT-4o. TĂŽi lĂ má»t loáșĄi mĂŽ hĂŹnh ngĂŽn ngữ má»i, ráș„t vui ÄÆ°á»Łc gáș·p báșĄn! |
Chinese 1.4x fewer tokens (from 34 to 24) | äœ ć„œïŒæçććæŻGPT-4oăææŻäžç§æ°ćçèŻèšæšĄćïŒćŸé«ć Žè§ć°äœ ! |
Japanese 1.4x fewer tokens (from 37 to 26) | ăăă«ăĄăŻăç§ăźććăŻGPT-4oă§ăăç§ăŻæ°ăăăżă€ăăźèšèȘăąăă«ă§ăăćăăŸăăŠïŒ |
Turkish 1.3x fewer tokens (from 39 to 30) | Merhaba, benim adım GPT-4o. Ben yeni bir dil modeli tĂŒrĂŒyĂŒm, tanıĆtıÄımıza memnun oldum! |
Italian 1.2x fewer tokens (from 34 to 28) | Ciao, mi chiamo GPT-4o. Sono un nuovo tipo di modello linguistico, piacere di conoscerti! |
German 1.2x fewer tokens (from 34 to 29) | Hallo, mein Name is GPT-4o. Ich bin ein neues KI-Sprachmodell. Es ist schön, dich kennenzulernen. |
Spanish 1.1x fewer tokens (from 29 to 26) | Hola, me llamo GPT-4o. Soy un nuevo tipo de modelo de lenguaje, ÂĄes un placer conocerte! |
Portuguese 1.1x fewer tokens (from 30 to 27) | OlĂĄ, meu nome Ă© GPT-4o. Sou um novo tipo de modelo de linguagem, Ă© um prazer conhecĂȘ-lo! |
French 1.1x fewer tokens (from 31 to 28) | Bonjour, je m'appelle GPT-4o. Je suis un nouveau type de modĂšle de langage, c'est un plaisir de vous rencontrer! |
English 1.1x fewer tokens (from 27 to 24) | Hello, my name is GPT-4o. I'm a new type of language model, it's nice to meet you! |
Model safety and limitations
GPT-4o has safety built-in by design across modalities, through techniques such as filtering training data and refining the modelâs behavior through post-training. We have also created new safety systems to provide guardrails on voice outputs.
Weâve evaluated GPT-4o according to our Preparedness Frameworkâ and in line with our voluntary commitmentsâ . Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.
GPT-4o has also undergone extensive external red teaming with 70+ external expertsâ in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as theyâre discovered.
We recognize that GPT-4oâs audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, weâll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies. We will share further details addressing the full range of GPT-4oâs modalities in the forthcoming system card.
Through our testing and iteration with the model, we have observed several limitations that exist across all of the modelâs modalities, a few of which are illustrated below.
Examples of model limitations
We would love feedback to help identify tasks where GPT-4 Turbo still outperforms GPT-4o, so we can continue to improve the model.Â
ChatGPT-4o Risk Scorecard
Updated May 8, 2024
As part of our Preparedness Frameworkâ , we conduct regular evaluations and update scorecards for our models. Only models with a post-mitigation score of âmediumâ or below are deployed.The overall risk level for a model is determined by the highest risk level in any category. Currently, GPT-4o is assessed at medium risk both before and after mitigation efforts.
Model availability
GPT-4o is our latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. We spent a lot of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, weâre able to make a GPT-4 level model available much more broadly. GPT-4oâs capabilities will be rolled out iteratively (with extended red team access starting today).Â
GPT-4oâs text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.
Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.


