Artificial Intelligence Struggles to Master History, Recent Study Finds

Artificial intelligence (AI) may excel at tasks like writing code or generating podcasts, but it struggles to answer high-level history exam questions, according to a recently published research paper.

Researchers developed a benchmark to test three leading large language models, OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini, on history questions. The benchmark, called Hist-LLM, grades answers against the Seshat Global History Databank, a vast repository of historical knowledge.

The results, presented at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub. The best performer, OpenAI’s GPT-4 Turbo, achieved only about 46% accuracy, not much better than random guessing.
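To make the evaluation setup concrete, here is a minimal sketch of how a benchmark like Hist-LLM might score a model against a reference database and compare it to a random-guess baseline. The sample questions, the yes/no answer format, and the `query_model` function are illustrative assumptions, not the paper’s actual harness or data.

```python
import random

# Hypothetical illustration of benchmark scoring, loosely in the spirit of
# Hist-LLM. The questions, answer format, and query_model() are assumptions;
# the real benchmark grades answers against the Seshat Global History Databank.

# Each item pairs a question with a ground-truth answer from a reference source.
QUESTIONS = [
    {"prompt": "Was scale armor present in Egypt circa 1500 BCE?", "answer": "no"},
    {"prompt": "Was papyrus used as a writing material in ancient Egypt?", "answer": "yes"},
]

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call; here it just guesses at random."""
    return random.choice(["yes", "no"])

def accuracy(questions: list[dict]) -> float:
    """Fraction of questions where the model's answer matches the ground truth."""
    correct = sum(query_model(q["prompt"]) == q["answer"] for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    print(f"accuracy: {accuracy(QUESTIONS):.0%}")
```

With two answer choices, random guessing would hover around 50% on average, which is why an accuracy figure is only meaningful relative to the chance baseline for the question format used.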

“Our analysis revealed that large language models, despite their progress, still don’t possess the detailed understanding required for complex history. They are good at basic facts, but they fall short on nuanced, scholarly-level history questions,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared sample historical questions that the models answered incorrectly. For instance, when asked whether scale armor was present in ancient Egypt during a specific period, GPT-4 Turbo said yes, even though the technology only appeared in Egypt about 1,500 years later.

The researchers plan to refine the benchmark by adding data from underrepresented regions and including more complex questions. Despite the models’ current shortcomings, they remain hopeful that LLMs can assist historical research in the future.

Original source: Read the full article on TechCrunch