Introduction: Large language models (LLMs) have become increasingly integral to applications ranging from customer service chatbots to research assistants. Three of the most widely used systems built on them are ChatGPT-4o, Gemini, and Perplexity.ai. Each is designed to handle a wide range of queries, but all share a common challenge: “hallucinations”, instances where the model generates incorrect or unsupported information. The WildHallucinations benchmark was developed to assess how well these systems manage this challenge, especially in real-world scenarios involving diverse and complex queries.

Understanding Hallucinations in LLMs: Hallucinations are a significant issue for LLMs. They occur when a model produces text that seems credible but is factually incorrect. For users who rely on these models for accurate information, such errors can be misleading and potentially harmful. The WildHallucinations benchmark evaluates the factual accuracy of LLMs like ChatGPT-4o, Gemini, and Perplexity.ai by challenging them with queries derived from real user interactions, covering a broad spectrum of topics and entities.
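
To make the idea concrete, the sketch below shows the general shape of claim-level factuality scoring: split a response into individual claims, check each one against a reference knowledge source, and report the fraction that goes unsupported. This is a deliberately minimal toy, not the actual WildHallucinations pipeline, which relies on model-based claim extraction and web-sourced knowledge; the sentence splitter and exact-match verifier here are stand-ins.

```python
# Toy illustration of claim-level factuality scoring. The real WildHallucinations
# pipeline uses model-based claim extraction and web-sourced knowledge; both steps
# are replaced with trivial stand-ins here so the idea stays readable.

def decompose_into_claims(response: str) -> list[str]:
    """Stand-in claim extractor: treat each sentence as one atomic claim."""
    return [s.strip() for s in response.split(".") if s.strip()]


def is_supported(claim: str, knowledge: set[str]) -> bool:
    """Stand-in verifier: exact string match against a reference knowledge set."""
    return claim in knowledge


def hallucination_score(response: str, knowledge: set[str]) -> float:
    """Fraction of claims in the response that the knowledge source does not support."""
    claims = decompose_into_claims(response)
    if not claims:
        return 0.0
    unsupported = sum(1 for c in claims if not is_supported(c, knowledge))
    return unsupported / len(claims)


if __name__ == "__main__":
    knowledge = {"Kathmandu is the capital of Nepal"}
    response = "Kathmandu is the capital of Nepal. It has a population of 40 million"
    print(f"hallucination score: {hallucination_score(response, knowledge):.2f}")  # prints 0.50
```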

Performance of ChatGPT-4o, Gemini, and Perplexity.ai

ChatGPT-4o: ChatGPT-4o is a continuation of OpenAI’s series of GPT models, designed to excel in generating human-like text. According to the WildHallucinations benchmark, ChatGPT-4o demonstrates strong performance in domains that are well-represented in its training data, such as computing, geography, and general knowledge topics. The model is particularly effective when responding to queries about entities that have extensive documentation, such as those with Wikipedia pages.

However, ChatGPT-4o’s performance declines when dealing with lesser-known entities or emerging topics that lack comprehensive documentation. This is a significant limitation: the model tends to hallucinate when a query is not directly supported by its training data. The WildHallucinations benchmark highlights that while ChatGPT-4o can provide accurate information in many cases, it struggles with factuality in more obscure or newly emerged areas.

Gemini: Google’s Gemini is a newer entrant in the LLM space, designed with the promise of enhanced capabilities across multiple domains. The WildHallucinations benchmark reveals that Gemini performs comparably to ChatGPT-4o in well-documented areas, such as geography and computing. However, Gemini has a notable tendency to abstain from answering when it is uncertain, which results in a lower rate of hallucinations but also fewer complete responses.

This cautious approach means that Gemini often avoids the outright errors that can plague other models, but it also means that users might receive less information or none at all for certain queries. When Gemini does respond, its accuracy is generally high, especially in areas where its training data overlaps with the queries. However, like ChatGPT-4o, it faces challenges with less common entities, leading to gaps in coverage and occasional factual inaccuracies.
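
To illustrate the trade-off, here is a minimal sketch of a confidence-gated answering policy: answer only when an uncertainty estimate clears a threshold, otherwise abstain. How Gemini actually decides to hold back is not public, so the estimate_confidence function below is purely a hypothetical stand-in for whatever signal a real system might use.

```python
# Minimal sketch of an abstain-vs-answer policy. The confidence estimator is a
# hypothetical stand-in; Gemini's actual abstention mechanism is not public.
from typing import Callable, Optional


def answer_or_abstain(
    query: str,
    generate: Callable[[str], str],
    estimate_confidence: Callable[[str], float],
    threshold: float = 0.7,
) -> Optional[str]:
    """Answer only when the estimated confidence clears the threshold; otherwise abstain."""
    if estimate_confidence(query) < threshold:
        return None  # abstain: fewer hallucinations, but no answer for the user
    return generate(query)


if __name__ == "__main__":
    # Toy stand-ins: "confident" only when the query mentions a well-covered entity.
    well_covered = {"paris", "python"}
    conf = lambda q: 0.9 if any(w in q.lower() for w in well_covered) else 0.3
    gen = lambda q: f"[model answer to: {q}]"

    print(answer_or_abstain("What is Paris known for?", gen, conf))       # answers
    print(answer_or_abstain("Who founded Acme Widgets LLC?", gen, conf))  # None (abstains)
```

Raising the threshold pushes a policy toward the behavior the benchmark observes in Gemini: fewer hallucinations, but also fewer answered queries.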

Perplexity.ai: Perplexity.ai employs a retrieval-augmented generation (RAG) model, setting it apart from the more conventional approaches used by ChatGPT-4o and Gemini. This model’s defining feature is its ability to perform real-time web searches to supplement its responses, theoretically reducing the likelihood of hallucinations by providing more up-to-date and relevant information.
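
For readers unfamiliar with RAG, the sketch below shows the generic retrieve-then-generate loop: fetch supporting passages for the query, then condition the generator on them. Perplexity.ai’s actual system is proprietary, so the search and generate callables here are hypothetical stand-ins rather than its real APIs.

```python
# Generic retrieve-then-generate (RAG) loop. The `search` and `generate` callables
# are hypothetical stand-ins for a web search API and an LLM call.
from typing import Callable, List


def rag_answer(
    query: str,
    search: Callable[[str, int], List[str]],   # returns top-k passages for the query
    generate: Callable[[str], str],            # returns model text for a prompt
    k: int = 3,
) -> str:
    """Retrieve supporting passages, then condition the generator on them."""
    passages = search(query, k)
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    corpus = ["WildHallucinations evaluates factuality on entities drawn from real user queries."]
    toy_search = lambda q, k: corpus[:k]
    toy_generate = lambda prompt: f"[LLM completion for a prompt of {len(prompt)} characters]"
    print(rag_answer("What does WildHallucinations evaluate?", toy_search, toy_generate))
```

The pattern makes the dependency obvious: whatever the search step returns is what the generator reasons over, which is exactly why retrieval quality dominates the results discussed next.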

The WildHallucinations benchmark indicates that Perplexity.ai does indeed benefit from this retrieval mechanism, showing a lower overall rate of hallucinations compared to models that do not use retrieval. However, this advantage is not without its flaws. The accuracy of Perplexity.ai’s responses is heavily dependent on the quality and relevance of the retrieved information. In some cases, the model’s performance is hampered by the retrieval of outdated or irrelevant sources, which can lead to new types of hallucinations. Furthermore, while retrieval helps with obscure entities, it does not always guarantee a correct or complete answer, particularly when reliable information is sparse or contradictory.

Comparative Performance Insights

Factual Accuracy Across Models: When comparing the factual accuracy of ChatGPT-4o, Gemini, and Perplexity.ai, each model has its strengths and weaknesses. ChatGPT-4o excels in areas where it has strong prior knowledge from its training data, making it highly effective in well-covered domains. However, its reliance on existing data can lead to higher hallucination rates when dealing with new or obscure entities.

Gemini, while cautious in its approach, provides accurate responses when it chooses to engage. Its strategy of abstaining from uncertain queries means it generates fewer hallucinations but at the cost of being less responsive. This trade-off may be favorable in high-stakes situations where accuracy is more important than completeness.

Perplexity.ai’s use of real-time retrieval gives it an edge in handling up-to-date or niche information. Nevertheless, this advantage is tempered by the model’s occasional retrieval of incorrect or irrelevant data, which can introduce new hallucinations. The model’s performance, therefore, varies significantly depending on the query and the quality of the information it retrieves.

Hallucination Rates: Hallucination rates provide a clear measure of how often each model produces incorrect or unsupported information. ChatGPT-4o shows a comparatively low hallucination rate in familiar domains but struggles with rarer entities. Gemini, due to its cautious response strategy, exhibits the lowest hallucination rate, but at the expense of sometimes not providing a response at all.

Perplexity.ai, while designed to minimize hallucinations through retrieval, does not completely avoid them. The model’s reliance on real-time web searches can lead to fluctuating hallucination rates depending on the reliability of the sources it accesses. Although it performs well in some scenarios, particularly with current or less documented topics, its success is not consistent across all domains.
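
As a concrete illustration of this accuracy-versus-coverage bookkeeping, the short sketch below computes a hallucination rate over answered queries alongside a response rate that captures abstentions. The labels are assumed to come from a fact-checking step like the one sketched earlier, and the numbers are made up for illustration, not results from the benchmark.

```python
# Accuracy-vs-coverage bookkeeping: hallucination rate over answered queries,
# response rate over all queries. All labels below are hypothetical.

def summarize(results: list[dict]) -> dict:
    """Each result: {'answered': bool, 'hallucinated': bool}; 'hallucinated' is ignored when not answered."""
    answered = [r for r in results if r["answered"]]
    response_rate = len(answered) / len(results) if results else 0.0
    hallucination_rate = (
        sum(1 for r in answered if r["hallucinated"]) / len(answered) if answered else 0.0
    )
    return {"response_rate": response_rate, "hallucination_rate": hallucination_rate}


if __name__ == "__main__":
    # Hypothetical labels contrasting two policies; these are not benchmark numbers.
    always_answers = [{"answered": True, "hallucinated": h} for h in (True, False, False, True)]
    often_abstains = [{"answered": a, "hallucinated": h}
                      for a, h in ((True, False), (False, False), (False, False), (True, False))]
    print("always answers:", summarize(always_answers))   # high coverage, more hallucinations
    print("often abstains:", summarize(often_abstains))   # lower coverage, fewer hallucinations
```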

Challenges and Future Directions

Enhancing Model Reliability: The performance of ChatGPT-4o, Gemini, and Perplexity.ai on the WildHallucinations benchmark underscores the need for continued improvement in LLMs. For ChatGPT-4o, the primary challenge is expanding its ability to accurately handle less common or newly emerging topics. This might involve incorporating more diverse data sources into its training or enhancing its ability to recognize when it lacks sufficient information to provide a reliable response.

Gemini’s focus should be on balancing its cautious approach with the need to provide more complete answers. While its strategy of abstaining helps reduce errors, there is a risk of it being too conservative, which could limit its utility in some scenarios.

For Perplexity.ai, refining the retrieval process is critical. Ensuring that the model retrieves relevant and reliable information consistently will be key to improving its overall accuracy. This might involve better filtering mechanisms or more sophisticated methods for integrating retrieved data into the model’s responses.
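
One simple way to picture such a filtering step is shown below: score each retrieved passage against the query and discard anything below a relevance threshold before it ever reaches the generator. The word-overlap scorer is a deliberately crude stand-in for a real relevance model, such as a trained reranker.

```python
# Minimal sketch of filtering retrieved passages before generation. The
# word-overlap scorer is a crude stand-in for a real relevance model.

def overlap_score(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def filter_passages(query: str, passages: list[str], min_score: float = 0.5) -> list[str]:
    """Drop passages whose relevance to the query falls below the threshold."""
    return [p for p in passages if overlap_score(query, p) >= min_score]


if __name__ == "__main__":
    query = "founding year of Acme Robotics"  # hypothetical example query
    retrieved = [
        "Acme Robotics lists its founding year as 2016 on its site.",
        "Robotics conferences are held every year around the world.",
    ]
    print(filter_passages(query, retrieved))  # keeps only the first passage
```

In practice the scoring function matters far more than the threshold; the point is simply that low-relevance sources can be dropped before they have a chance to seed new hallucinations.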

The Role of WildHallucinations: Benchmarks like WildHallucinations are crucial for identifying the strengths and weaknesses of LLMs. They provide a realistic evaluation of how these models perform in real-world scenarios, highlighting areas where improvement is needed. As LLMs continue to evolve, such benchmarks will be essential in guiding their development and ensuring that they become more reliable and trustworthy tools.

Conclusion

The WildHallucinations benchmark offers valuable insights into the performance of ChatGPT-4o, Gemini, and Perplexity.ai. Each model has demonstrated strong capabilities, but also faces unique challenges in handling real-world queries accurately. ChatGPT-4o excels in familiar domains but struggles with less common entities. Gemini’s cautious approach reduces hallucinations but sometimes limits responsiveness. Perplexity.ai’s retrieval-augmented strategy offers potential, but the quality of its responses depends on the reliability of the information it retrieves.

As the field of AI continues to advance, addressing these challenges will be critical to improving the factual accuracy and overall performance of LLMs. Benchmarks like WildHallucinations will play a key role in this process, helping to ensure that future iterations of these models are not only powerful but also reliable and trustworthy.



By John Mecke

John is a 25 year veteran of the enterprise technology market. He has led six global product management organizations for three public companies and three private equity-backed firms. He played a key role in delivering a $115 million dividend for his private equity backers – a 2.8x return in less than three years. He has led five acquisitions for a total consideration of over $175 million. He has led eight divestitures for a total consideration of $24.5 million in cash. John regularly blogs about product management and mergers/acquisitions.