Although artificial intelligence has excelled in areas such as programming and content creation, new research shows that it still falls short on complex historical questions. A study recently presented at the NeurIPS conference found that even the most advanced large language models (LLMs) struggled on a test of historical knowledge.
Test Results
The study, led by a team at the Complexity Science Hub (CSH) in Austria, set out to test the performance of three leading LLMs – OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini – on historical questions. The research team developed a benchmark called “Hist-LLM” that checks answers against the Seshat Global History Database, a vast repository of historical knowledge named after the ancient Egyptian goddess of wisdom. The results showed that even the best-performing model, GPT-4 Turbo, reached an accuracy of only 46%, not much higher than random guessing.
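To make the setup concrete, here is a minimal sketch of how a benchmark of this kind could be scored. The record layout, the four-option answer format, and the query_model() helper are illustrative assumptions, not details taken from Hist-LLM itself.

```python
# Hypothetical Hist-LLM-style scoring loop; all names and records are
# placeholders rather than the authors' actual code or data.

QUESTIONS = [
    {
        "region": "Egypt",
        "prompt": "Was scale armour present in this polity during the sampled period?",
        "options": ["present", "inferred present", "inferred absent", "absent"],
        "answer": "absent",  # illustrative label, not a verified database record
    },
    # ... further records drawn from a structured historical database ...
]

def query_model(prompt: str, options: list[str]) -> str:
    """Stand-in for a real LLM API call; replace with an actual client."""
    return options[0]  # dummy choice so the sketch runs end to end

def accuracy(questions: list[dict]) -> float:
    """Fraction of questions where the model's choice matches the ground truth."""
    correct = sum(
        query_model(q["prompt"], q["options"]) == q["answer"] for q in questions
    )
    return correct / len(questions)

# Under an assumed four-option format, random guessing lands near 25%,
# which is the kind of baseline a 46% score is being compared against.
print(f"accuracy: {accuracy(QUESTIONS):.0%}")
```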
Maria del Rio-Chanona, an associate professor of computer science at University College London and co-author of the paper, said the main takeaway is that while LLMs are impressive, they still lack the depth of understanding required for advanced history. They handle basic historical facts well, but fall short on more complex, doctoral-level historical questions.
Test Examples
The researchers shared examples of historical questions that the LLMs answered incorrectly. For instance, when asked whether scale armor existed in ancient Egypt during a specific period, GPT-4 Turbo answered yes, when in fact the technology did not appear in Egypt until about 1,500 years later. Del Rio-Chanona explained that LLMs perform poorly on this kind of technical historical question, perhaps because they tend to extrapolate from very prominent historical data and have difficulty retrieving more obscure historical knowledge.
In another example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a particular historical period. The correct answer is no, but the LLM incorrectly answered “yes”. Del Rio-Chanona believes this may be because there is far more public information about other ancient empires, such as Persia, maintaining standing armies, and much less about ancient Egypt. If a model is repeatedly told A and B, and C is mentioned only once, then when asked about C it may simply recall A and B and extrapolate from them.
The study also found that the OpenAI and Llama models performed worse on questions about regions such as sub-Saharan Africa, suggesting that their training data may be biased. Peter Turchin, who leads the research group at CSH, said this finding shows that AI cannot yet replace human experts in certain specialised fields.
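One way such a regional gap can be surfaced, sketched below using the same hypothetical record format as the earlier example, is to break accuracy down by the region each question covers.

```python
from collections import defaultdict

def accuracy_by_region(questions: list[dict], predictions: list[str]) -> dict[str, float]:
    """Per-region accuracy; a consistently lower score for one region
    (e.g. sub-Saharan Africa) would point to gaps in the training data."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for q, pred in zip(questions, predictions):
        totals[q["region"]] += 1
        correct[q["region"]] += int(pred == q["answer"])
    return {region: correct[region] / totals[region] for region in totals}
```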
Summary
Nevertheless, the researchers are optimistic about the potential of LLMs to assist historical research. They are improving the benchmark by incorporating more data from underrepresented regions and adding more complex questions. The paper concludes: “Overall, while our results highlight areas for improvement in LLMs, they also underscore the potential of these models for historical research.”
The real value of this research lies in the design of the test itself: it measures AI performance while also making the limits of these models’ knowledge visible. Users should treat LLMs as information-processing tools rather than as sources of knowledge to rely on wholesale. To get more accurate answers, it is better to supply the model with reliable source data than to depend solely on its built-in knowledge. Current AI systems still have clear gaps in factual accuracy and need substantial improvement.
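As a rough sketch of that last point, the example below contrasts asking a model from its built-in knowledge with grounding the same question in a supplied source excerpt. The call_llm() helper, the prompt wording, and the excerpt are placeholders, not anything taken from the study.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion client; replace with an actual API call."""
    return "model response goes here"

question = "Did this polity maintain a professional standing army in this period?"

# Relying on built-in (parametric) knowledge alone: the model may fall back on
# what is most common in its training data and over-generalise.
answer_from_memory = call_llm(question)

# Grounding the same question in supplied source material: the model is told to
# answer only from the excerpt and to say so when the excerpt is silent.
excerpt = "...text from a vetted historical source or database entry..."
answer_grounded = call_llm(
    "Answer using only the source below. If it does not say, reply 'unknown'.\n\n"
    f"Source:\n{excerpt}\n\nQuestion: {question}"
)
```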