Faculty News

Professor Smaranda Muresan and Four Barnard Students Publish Research on Large Language Models’ Capacity for Abstract Reasoning

In November 2024, Smaranda Muresan, associate professor of computer science, published a conference paper titled “Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game” with four Barnard student researchers: Prisha Samdarshi ’24, Mariam Mustafa ’24, Anushka Kulkarni ’24, and Raven Rothkopf ’24. The study was presented at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), held in Miami, Florida, from November 12 to 16, 2024, and appears in the conference proceedings. It grew out of work that began in Muresan’s Spring 2024 seminar, Large Language Models: Foundations and Ethics. Tuhin Chakrabarty, a recent graduate of Columbia University’s computer science doctoral program who studied under Muresan, is a senior co-author of the paper.

The study assesses how well large language models (LLMs) perform abstract reasoning, such as finding patterns and solving problems, by comparing them with human players of the New York Times Connections game. The researchers tested 438 games, analyzing how advanced LLMs and humans performed. Claude 3.5 Sonnet, the best-performing LLM tested, fully solved only 18% of the games, while novice and expert human players performed better, with experts significantly outperforming artificial intelligence (AI). The research showed that succeeding in the game requires different types of knowledge, such as understanding word meanings, encyclopedic facts, and phrases. While LLMs did well with basic word relationships, they struggled with more complex knowledge. The findings suggest the Connections game is a challenging but valuable way to measure AI’s capacity for abstract reasoning.

This research was featured earlier this year in a Bloomberg news article titled “AI Has a Way to Go to Become A Connections Puzzle Champ.”
