NEWTON:
Are Large Language Models Capable of Physical Reasoning?

EMNLP 2023 Findings

University of Washington, NVIDIA


NEWTON is a Repository, Pipeline, and Benchmark designed to evaluate the physical reasoning capability of LLMs

Abstract



Training language models on extensive unprocessed text data has yielded impressive advances in natural language processing (NLP), particularly in tasks such as question answering and reading comprehension. Numerous studies have shown that the contextualized representations of these models encapsulate syntactic, semantic, word-sense, and common-sense knowledge. However, their physical reasoning abilities have received limited attention, specifically with respect to the attributes that are crucial for comprehending everyday objects. To address this gap, we introduce NEWTON, a Repository, Pipeline, and Benchmark designed to facilitate streamlined evaluation of Large Language Models (LLMs) in the context of physical reasoning. The Repository comprises a vast collection of object-attribute pairs, providing the foundation for generating infinite-scale assessment templates when combined with the NEWTON Pipeline. Leveraging this infrastructure, we construct a large-scale QA dataset to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks.

Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings.


NEWTON

A novel Repository

Building the NEWTON Repository involves identifying and shortlisting objects and attributes, and obtaining a set of consistent object-attribute annotations.


Templates for generating the three different tracks of QA.

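To make the idea of combining object-attribute annotations with question templates concrete, here is a minimal sketch in Python. It is not the NEWTON Pipeline itself: the objects, the attribute ratings, the single comparison template, and the function name `generate_questions` are all hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical annotations (NOT taken from NEWTON):
# object -> attribute -> rating on a small ordinal scale.
ANNOTATIONS = {
    "glass cup": {"brittleness": 3, "elasticity": 1},
    "rubber ball": {"brittleness": 1, "elasticity": 3},
    "wooden spoon": {"brittleness": 2, "elasticity": 1},
}

# Hypothetical comparison template (NOT one of the NEWTON templates).
TEMPLATE = "Which object has higher {attribute}: {a} or {b}?"


def generate_questions(annotations, template):
    """Fill the template for every ordered object pair and shared attribute."""
    questions = []
    for a, b in permutations(annotations, 2):
        shared = annotations[a].keys() & annotations[b].keys()
        for attribute in sorted(shared):
            rating_a = annotations[a][attribute]
            rating_b = annotations[b][attribute]
            if rating_a == rating_b:
                continue  # skip ties: there is no unambiguous answer
            questions.append({
                "question": template.format(attribute=attribute, a=a, b=b),
                "answer": a if rating_a > rating_b else b,
            })
    return questions


if __name__ == "__main__":
    for item in generate_questions(ANNOTATIONS, TEMPLATE):
        print(item["question"], "->", item["answer"])
```

Because every ordered pair is used, each comparison appears with both option orders, and the number of questions grows quickly as objects, attributes, and templates are added.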

Comparison of NEWTON with other datasets.


Results

Results for Track 1


Results for Track 2


Results for Track 3


Ablation Studies

We provide an analysis of the NEWTON dataset, focusing on potential ways of leveraging NEWTON to enhance model performance in a physical reasoning context, and examining the consistency of LLMs with regard to model size, question polarity, and answer positioning.

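One way to probe sensitivity to answer positioning is to ask the same multiple-choice question under every ordering of its options and check how often the model's pick stays the same. The sketch below is an assumption about how such a check could be written, not the metric used in the paper; `position_consistency` and the two stand-in "models" are hypothetical, and a real evaluation would replace them with calls to an actual LLM.

```python
from itertools import permutations


def position_consistency(question, options, answer_fn):
    """Share of option orderings on which answer_fn returns its most frequent pick.

    A value of 1.0 means the pick never changes when the options are reordered.
    """
    picks = [
        answer_fn(question, list(ordering))
        for ordering in permutations(options)
    ]
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)


# Stand-in "models" (hypothetical), each returning one of the given options.
def always_first(question, options):
    """Position-biased: always picks whichever option is listed first."""
    return options[0]


def knows_answer(question, options):
    """Position-invariant: always picks the same option."""
    return "glass cup"


if __name__ == "__main__":
    question = "Which object is the most brittle?"
    options = ["glass cup", "rubber ball", "wooden spoon"]
    print(position_consistency(question, options, always_first))
    print(position_consistency(question, options, knows_answer))
```

With three options there are six orderings, so a purely position-driven responder scores one third here, while a position-invariant one scores 1.0.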

BibTeX

@article{wang2023newton,
  title     = {NEWTON: Are Large Language Models Capable of Physical Reasoning?}, 
  author    = {Wang, Yi Ru and Duan, Jiafei and Fox, Dieter and Srinivasa, Siddhartha},
  journal   = {arXiv preprint arXiv:2310.07018},
  year      = {2023},
}