In 1950, Alan Turing proposed a simple test: if a machine could hold a conversation that was indistinguishable from that of a human, it could be considered intelligent. Over the decades, scientists and engineers have tried to build systems capable of passing this famous Turing Test. But until recently, no AI had convincingly done so.
Now, new research from the University of California, San Diego, shows that a large language model (LLM) called GPT-4.5 can pass a rigorous version of the Turing Test—outperforming real humans in being perceived as human. The study also found that another model, LLaMa-3.1, came close. This result represents a major shift in how we understand artificial intelligence, with big implications for society.
What Is the Turing Test?
The Turing Test was designed as a way to measure machine intelligence through natural language. In its most famous version, three people communicate via text: a human interrogator talks to both a human and a machine and must decide which one is the real person. If the machine can fool the interrogator often enough, it’s said to have passed the test.
Unlike most AI benchmarks today—which are static and narrow—the Turing Test is flexible, interactive, and adversarial. It allows people to ask any question, use any strategy, and test the machine’s ability to mimic the social and emotional aspects of being human. As AI grows more powerful, passing the Turing Test signals more than just technical success—it points to the potential to replace human roles in conversation, social interaction, and even relationships.
How the Study Was Conducted
The researchers conducted two large-scale, randomized, and controlled experiments. One group of participants came from the UC San Diego undergraduate pool, while the other came from Prolific, an online platform for research participants. In each test round, participants acted either as interrogators or human witnesses. They were paired with either another human or one of several AI systems.
Each round lasted five minutes, and interrogators talked with two “witnesses” (one human and one AI) simultaneously via a messaging interface. At the end of the conversation, the interrogator had to decide which witness they believed was human.
The AI models tested included:
- GPT-4.5 (OpenAI’s advanced LLM)
- LLaMa-3.1-405B (Meta’s LLM)
- GPT-4o (a slightly less powerful OpenAI model)
- ELIZA (a rule-based chatbot from the 1960s)
Importantly, the models were given different types of prompts. Some received a NO-PERSONA prompt, with only basic instructions to act human. Others received a PERSONA prompt, instructing them to play the role of a young, introverted person familiar with internet culture and slang.
The Results: AI vs. Humans
The headline finding is this: GPT-4.5, when given a humanlike persona, was mistaken for a real person 73% of the time—more often than real human participants. In other words, the interrogators were more likely to believe GPT-4.5 was human than the actual humans.
LLaMa-3.1 also performed well. With a persona, it was judged to be human 56% of the time. Without the persona prompt, however, all AI models—including GPT-4.5—performed worse. ELIZA and GPT-4o, both using basic prompts, were the least convincing, with win rates of just 23% and 21%, respectively.
What About the Interrogators?
Participants used a wide range of strategies to try to spot the AI. The most common approach (used in 61% of games) was casual small talk. Many asked about feelings, preferences, or personal experiences. Some interrogators tried to trick the AI with “jailbreak” commands (like asking the AI to ignore previous instructions), while others asked about current events or local details.
Interestingly, the strategies that worked best were the rarest. Saying something strange or unpredictable—something a machine might struggle to handle—often led to better detection. But overall, even experienced participants struggled. There was no consistent link between accuracy and age, education, or AI knowledge.
What Does “Passing” Really Mean?
By the strict definition of the Turing Test, GPT-4.5-PERSONA and LLaMa-3.1-PERSONA both passed. That means interrogators were not reliably able to distinguish these AIs from humans. In GPT-4.5’s case, it even outperformed humans at seeming human.
Critics might argue that this just shows interrogators were bad at their job. But the inclusion of ELIZA in the study proves otherwise. Participants were consistently able to spot ELIZA as the fake. So the design of the test was effective enough to reveal real differences.
The study also showed the huge impact of prompt engineering. Without a persona prompt, the same models performed far worse. But the researchers argue that this doesn’t diminish the result. Just like ELIZA’s behavior was shaped by its programmer, the LLMs can be seen as responding to their prompt creators. In fact, it’s this ability to flexibly shift behavior with prompts that makes LLMs so powerful—and so capable of passing the Turing Test.
Social and Ethical Implications
Whether or not passing the Turing Test proves intelligence, it has real-world consequences. LLMs that can convincingly impersonate humans could automate many jobs, from customer service to sales. They could also be used for more questionable purposes: social engineering, misinformation, fake reviews, or even impersonating people online.
The researchers refer to such systems as “counterfeit people.” These are machines that can convincingly imitate human social interaction. If they become common, they could blur the line between real and artificial relationships, creating social and psychological challenges.
Some people might prefer talking to an AI—one that’s always available, always agreeable, and never makes mistakes. But others might begin to question the authenticity of their online conversations. The researchers warn that these interactions could be used to manipulate opinions, shape behavior, or undermine trust.
What the Turing Test Actually Measures
So does passing the Turing Test mean an AI is intelligent? Not necessarily. The test doesn’t measure problem-solving, logic, or deep reasoning. Instead, it measures “substitutability”: can a machine fill in for a person convincingly?
In today’s world, social intelligence may be more important than factual knowledge when it comes to appearing human. People expect computers to know facts and figures—but what makes us human is our messiness, emotion, and the occasional mistake. Ironically, some interrogators picked human witnesses because they seemed less polished or more confused.
The study showed that participants often judged based on linguistic style, humor, emotional expression, or even just a “gut feeling.” Very few interrogators asked technical questions or tried to test deep reasoning. The most accurate verdicts came from those who focused on how questions were handled or noticed when a witness lacked knowledge—a clue that they might actually be human.
Looking Ahead: What Comes Next?
This research marks a major milestone in AI development. But it’s not the end of the story. Future Turing Tests could last longer, include higher stakes, or involve expert interrogators. They might test how well AI systems perform over days or weeks of conversation, or how they handle emotional depth and long-term memory.
Different participant groups might also perform differently. In the UCSD group, for example, students might have relied on shared knowledge of campus culture—like lunch spots or slang. Even so, GPT-4.5 still fooled them more often than not. Future research could explore whether shared cultural context helps people detect AI, and if so, how much.
The study also raises important policy questions. Should people always be informed if they’re talking to an AI? How should companies label AI-generated content or characters? And how can we guard against misuse?
As AI becomes more humanlike, we might also need to reflect on what makes us human. In a world where machines can simulate our social behaviors, the human touch might become more valuable, not less. As author Brian Christian put it: maybe the challenge isn’t just for machines to pass the Turing Test—but for humans to pass it, too.
Key Takeaways
- GPT-4.5, when prompted with a humanlike persona, was more likely to be judged human than real people in Turing Test-style conversations.
- LLaMa-3.1 also performed well, especially with a persona prompt, showing that prompt engineering plays a critical role.
- Participants struggled to distinguish humans from machines, even when they had strategies and experience.
- The Turing Test doesn’t measure intelligence in the traditional sense—it measures how convincingly human an AI can appear.
- The rise of “counterfeit people” raises urgent questions about ethics, trust, and the future of social interaction.
Conclusion
For the first time, we have robust evidence that a machine can pass the original version of the Turing Test. This is a milestone in artificial intelligence research—and a turning point for society. Machines are now capable of humanlike conversation that can fool even attentive, informed people.
But the test isn’t just about machines. It’s also a mirror for ourselves. How do we tell the difference between people and programs? What do we look for when we connect with others? And how will we adapt when the line between human and machine becomes harder to see?
The future of AI will not only depend on what machines can do—but also on how we, as humans, choose to respond.
Also published on Medium.