Comparing Latency of GPT-4o vs. GPT-4o Mini

August 29, 2024

TLDR: GPT-4o is significantly slower than GPT-4o Mini and GPT-3.5 Turbo, especially as token sizes increase, highlighting a trade-off between capability and speed.

Latency is a critical factor when selecting a language model, as it directly impacts user experience and system performance. Faster response times enable smoother interactions and quicker processing, which is essential for applications requiring real-time feedback or handling high volumes of requests. To provide a clearer picture of the latency differences among GPT models, Workorb AI benchmarked the GPT LLMs hosted on Azure OpenAI, measuring API response times across a range of input token sizes.
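
To make the measurement concrete, below is a minimal sketch of how a single latency sample might be taken with the official openai Python SDK against Azure OpenAI. It is an illustration under assumptions, not our exact harness: the environment variable names, API version, fixed max_tokens, and the deployment names in the loop are placeholders you would adapt to your own Azure resource.

```python
import os
import time

from openai import AzureOpenAI

# Assumed environment variables; substitute your own Azure OpenAI endpoint and key.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def time_completion(deployment: str, prompt: str) -> float:
    """Return wall-clock seconds for one non-streaming chat completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=deployment,  # on Azure, "model" takes the deployment name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,     # pin the output length so input size dominates latency
        temperature=0,
    )
    return time.perf_counter() - start

# Hypothetical deployment names; use whatever your Azure resource calls them.
for name in ("gpt-35-turbo", "gpt-4o-mini", "gpt-4o"):
    latency = time_completion(name, "Summarize in five words: " + "lorem ipsum " * 200)
    print(f"{name}: {latency:.2f}s")
```

In practice you would repeat each measurement several times and average the results, since network jitter and server-side load can swing individual calls by a noticeable margin.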

The graph above shows API response times across various token sizes for three models: GPT-3.5 Turbo (blue), GPT-4o Mini (red), and GPT-4o (yellow). The results indicate that GPT-3.5 Turbo consistently has the lowest latency, followed closely by GPT-4o Mini. While GPT-4o Mini is only slightly slower than GPT-3.5 Turbo, GPT-4o shows a marked increase in latency, especially as the token count grows. At around 10,000 tokens, GPT-4o Mini begins to show noticeable delays relative to GPT-3.5 Turbo; GPT-4o's response time, by contrast, climbs sharply, exceeding 5 seconds at 40,000 tokens and approaching 10 seconds by 80,000 tokens.
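
Sweeping the token-size axis of a graph like this requires prompts of controlled length. One simple approach, sketched below under the assumption that the tiktoken library is available, is to repeat a filler passage until the encoded prompt reaches roughly the target token count. The o200k_base encoding is the tokenizer used by the GPT-4o family, while GPT-3.5 Turbo uses cl100k_base, so counts for the same text differ slightly between models.

```python
import tiktoken

# Tokenizer for the GPT-4o family; GPT-3.5 Turbo would use cl100k_base instead.
enc = tiktoken.get_encoding("o200k_base")

FILLER = "The quick brown fox jumps over the lazy dog. "

def make_prompt(target_tokens: int) -> str:
    """Build a prompt of approximately target_tokens tokens by repeating filler text."""
    tokens_per_filler = len(enc.encode(FILLER))
    repeats = max(1, target_tokens // tokens_per_filler)
    return "Summarize the following text in one sentence:\n" + FILLER * repeats

# Token sizes matching the points discussed above.
for n in (10_000, 40_000, 80_000):
    print(n, len(enc.encode(make_prompt(n))))
```

Feeding prompts like these to the timing helper shown earlier, once per model and token size, produces the kind of latency curves described in this section.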

This analysis demonstrates a clear trade-off: while GPT-4o offers enhanced capabilities over GPT-4o Mini and GPT-3.5 Turbo, it comes at the cost of significantly higher latency. For applications where speed is crucial, GPT-3.5 Turbo or GPT-4o Mini might be more suitable. However, for tasks that benefit from the advanced capabilities of GPT-4o and where latency is less of a concern, GPT-4o could be the preferred choice.

You can read our previous benchmark, which compared LLMs from OpenAI, Anthropic, and Cohere.

By understanding these differences, developers can make more informed decisions based on the specific needs of their applications, balancing model capability with acceptable response times.