Evaluating language models for mathematics through interactions.
Collins, Katherine M; Jiang, Albert Q; Frieder, Simon; Wong, Lionel; Zilka, Miri; Bhatt, Umang; Lukasiewicz, Thomas; Wu, Yuhuai; Tenenbaum, Joshua B; Hart, William; Gowers, Timothy; Li, Wenda; Weller, Adrian; Jamnik, Mateja.
Affiliation
  • Collins KM; University of Cambridge, Cambridge CB2 1TN, United Kingdom.
  • Jiang AQ; University of Cambridge, Cambridge CB2 1TN, United Kingdom.
  • Frieder S; University of Oxford, Oxford OX1 4BH, United Kingdom.
  • Wong L; Massachusetts Institute of Technology, Cambridge, MA 02139.
  • Zilka M; University of Cambridge, Cambridge CB2 1TN, United Kingdom.
  • Bhatt U; University of Cambridge, Cambridge CB2 1TN, United Kingdom.
  • Lukasiewicz T; The Alan Turing Institute, London NW1 2DB, United Kingdom.
  • Wu Y; New York University, New York, NY 10011.
  • Tenenbaum JB; University of Oxford, Oxford OX1 4BH, United Kingdom.
  • Hart W; Vienna University of Technology, Vienna 1040, Austria.
  • Gowers T; x.AI, New York, NY 10038.
  • Li W; Massachusetts Institute of Technology, Cambridge, MA 02139.
  • Weller A; University of Cambridge, Cambridge CB2 1TN, United Kingdom.
  • Jamnik M; University of Cambridge, Cambridge CB2 1TN, United Kingdom.
Proc Natl Acad Sci U S A; 121(24): e2318124121, 2024 Jun 11.
Article in En | MEDLINE | ID: mdl-38830100
ABSTRACT
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4's mathematical problem-solving through a series of case studies contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations may constitute better assistants. Humans should inspect LLM output carefully, given these models' current shortcomings and potential for surprising fallibility.
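The protocol the abstract describes (free-form mathematical queries interleaved with per-response human ratings of correctness and perceived helpfulness) can be made concrete with a short sketch. The Python below is an illustration only, not the authors' CheckMate implementation: query_model, the Interaction record, and the 0-6 rating scale are hypothetical stand-ins, and any chat-capable model client could be wired in.

    # Minimal sketch of interactive LLM evaluation in the spirit of CheckMate.
    # Hypothetical stand-ins, not the paper's actual code: query_model must be
    # wired to a real chat model, and the 0-6 rating scale is illustrative.
    from dataclasses import dataclass
    from statistics import correlation  # Pearson r; Python 3.10+

    @dataclass
    class Interaction:
        """One turn: user prompt, model reply, and two human ratings."""
        prompt: str
        reply: str
        correctness: int  # 0-6, rated by the participant
        helpfulness: int  # 0-6, rated by the participant

    def query_model(history: list[dict], prompt: str) -> str:
        """Placeholder for a real chat-model call (e.g., an API client)."""
        raise NotImplementedError

    def run_session(problem: str) -> list[Interaction]:
        """Interleave free-form user queries with per-response ratings."""
        history = [{"role": "system", "content": f"Assist with: {problem}"}]
        log = []
        while (prompt := input("Query (blank to stop): ").strip()):
            reply = query_model(history, prompt)
            print(reply)
            history += [{"role": "user", "content": prompt},
                        {"role": "assistant", "content": reply}]
            log.append(Interaction(prompt, reply,
                                   correctness=int(input("Correctness 0-6: ")),
                                   helpfulness=int(input("Helpfulness 0-6: "))))
        return log

    def correctness_helpfulness_r(log: list[Interaction]) -> float:
        """Pearson r between the two rating dimensions across rated turns."""
        return correlation([i.correctness for i in log],
                           [i.helpfulness for i in log])

Collecting both ratings on every turn is what lets the divergence mentioned in the abstract surface: turns rated helpful but incorrect (or vice versa) are exactly the cases a static, correctness-only benchmark cannot reveal.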
Subjects
Keywords

Full text: 1 Databases: MEDLINE Main subject: Problem Solving / Language / Mathematics Limits: Humans Language: En Journal: Proc Natl Acad Sci U S A Publication year: 2024 Document type: Article Affiliation country: United Kingdom
