
AI Performs Poorly on Advanced History Exam


Although artificial intelligence has excelled in areas such as programming and content creation, new research shows that it still falls short on complex historical questions. A study recently presented at the NeurIPS conference found that even the most advanced large language models (LLMs) struggled on a test of historical knowledge.

Test Results

The study, led by a team at the Complexity Science Hub (CSH) in Vienna, Austria, set out to test the performance of three leading large language models (LLMs) – OpenAI's GPT-4, Meta's Llama, and Google's Gemini – on historical questions. The research team developed a benchmark called "Hist-LLM" that checks the correctness of answers against the Seshat Global History Databank, a massive database of historical knowledge named after the ancient Egyptian goddess of wisdom. The results showed that even the best-performing model, GPT-4 Turbo, reached an accuracy of only 46%, not much higher than random guessing.
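At its core, a benchmark like this scores model answers against a ground-truth record and reports the fraction correct. The sketch below is purely illustrative (it is not the actual Hist-LLM code; the function name and the toy answer lists are invented for the example), but it shows why a 46% score on multiple-choice questions is worth comparing against a random-guess baseline:

```python
# Illustrative sketch (not the actual Hist-LLM code): score model answers
# against a ground-truth answer key and compute accuracy.

def accuracy(predictions, ground_truth):
    """Fraction of questions where the model's answer matches the record."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Toy four-option multiple-choice run: a random guesser would score ~0.25.
preds = ["A", "C", "B", "D", "A", "B", "C", "A"]  # model's answers
truth = ["A", "B", "B", "D", "C", "B", "A", "A"]  # answer key
print(accuracy(preds, truth))  # → 0.625
```

On four-option questions a random baseline sits near 25%, which is why the paper frames 46% as only modestly better than chance.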

Maria del Rio-Chanona, an associate professor of computer science at University College London and co-author of the paper, said the main conclusion of the study is that while LLMs are impressive, they still lack a deep understanding of advanced historical knowledge. The models perform well on basic historical facts but fail when faced with the more nuanced, doctoral-level questions that historical research demands.

Test Examples

The researchers shared examples of historical questions that the LLMs answered incorrectly. For instance, when asked whether scale armor existed in ancient Egypt during a specific period, GPT-4 Turbo answered yes; in fact, the technology did not appear in Egypt until some 1,500 years later. Del Rio-Chanona explained that LLMs perform poorly on such technical historical questions, perhaps because they tend to extrapolate from very prominent historical data and have difficulty retrieving less widely documented historical knowledge.

In another example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a particular historical period. The correct answer is no, but the model answered yes. Del Rio-Chanona suggests this may be because there is far more published information about other ancient empires, such as Persia, having standing armies, and comparatively little about ancient Egypt. If a model is repeatedly told facts A and B during training while fact C appears only once, then when asked about C it may fall back on A and B and try to extrapolate from them.

The study also found that the GPT-4 and Llama models performed worse for certain regions, such as sub-Saharan Africa, suggesting that their training data may be biased. Peter Turchin, who leads the research group at the Complexity Science Hub (CSH), said this finding shows that AI cannot yet replace human experts in certain specialized fields.

Summary

Nevertheless, the researchers are optimistic about the prospects of LLMs in assisting historical research in the future. They are improving the benchmarking tools by incorporating more data from underrepresented regions and adding more complex questions. The paper concludes: “Overall, while our results highlight areas for improvement in LLMs, they also underscore the potential of these models for historical research.”

The key contribution of this research is the test design itself, which not only measures AI performance but also makes the limitations of these models' knowledge visible. Users should treat LLMs as information-processing tools rather than as authoritative sources of knowledge. To obtain more accurate answers, it is better to supply the AI with reliable data than to rely solely on its built-in knowledge. Current AI systems still have clear gaps in factual accuracy that urgently need to be addressed.
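The advice to "supply the AI with reliable data" is essentially retrieval-grounded prompting: instead of letting the model answer from memory, reference facts are placed directly in the prompt. The sketch below is a minimal, hypothetical illustration of that idea; the `FACTS` table, the instruction wording, and the `grounded_prompt` helper are all invented for this example, and the resulting string would be sent to whatever chat-completion API one uses:

```python
# Minimal sketch of grounding a historical question with reference facts,
# rather than relying on the model's built-in knowledge alone.
# The FACTS table and prompt wording are illustrative, not from the paper.

FACTS = {
    "scale_armor_egypt": (
        "Scale armor appears in Egypt only from the New Kingdom onward, "
        "via contact with the Near East."
    ),
}

def grounded_prompt(question: str) -> str:
    """Build a prompt that restricts the model to the supplied facts."""
    context = "\n".join(f"- {fact}" for fact in FACTS.values())
    return (
        "Answer using ONLY the reference facts below; reply 'unknown' "
        "if they do not cover the question.\n"
        f"Reference facts:\n{context}\n\n"
        f"Question: {question}"
    )

print(grounded_prompt("Did Old Kingdom Egypt use scale armor?"))
```

The design point is simply that a curated fact, however obscure, outweighs whatever the model half-remembers from training data dominated by more prominent sources.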
