Exploring astronomy-focused LLMs provides an unprecedented opportunity to quantify capabilities that were previously difficult to assess. Progress in astronomy demands a wide range of reasoning skills, often combining factual retrieval, mathematical derivation, and common-sense inference, which makes LLM performance in this domain challenging to evaluate.
Until recently, there was a significant gap: no gold-standard question-and-answer datasets tailored to scientific reasoning in astronomy. To address this, we have focused on establishing a standard benchmark for astronomical Q&A in this non-trivial and largely unexplored field.
In collaboration with Argonne National Laboratory, we have developed a comprehensive astronomical benchmarking dataset, designed to evaluate LLM performance in astronomical research contexts. This dataset comprises both Q&A and Multiple Choice Question (MCQ) components. Key features include:
This dataset is designed to:
Here are some examples from our dataset:
These examples demonstrate the depth and breadth of astronomical knowledge covered in our benchmarking dataset, ensuring a robust evaluation of LLM capabilities in this field.
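To make the evaluation protocol concrete, here is a minimal sketch of how MCQ responses can be scored. The answer-extraction regex and the A-D letter set are illustrative assumptions, not our exact pipeline:

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Pull the first standalone answer letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None

def mcq_accuracy(responses: list, gold: list) -> float:
    """Fraction of responses whose extracted letter matches the gold answer."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

# Toy example with hypothetical model outputs:
responses = ["The answer is B.", "C", "I believe (A) is correct.", "D"]
gold = ["B", "C", "A", "B"]
print(f"{mcq_accuracy(responses, gold):.1%}")  # → 75.0%
```

In practice, answer extraction is the fragile step: models often explain before answering, so the regex deliberately matches the first standalone capital letter rather than requiring the response to begin with it.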
Our benchmarking results reveal significant variations in performance across both open-weight and proprietary large language models when tested on astronomy-specific questions. The following tables summarize the scores achieved by various models, grouped by series. These results provide insights into the current capabilities of LLMs in handling complex astronomical concepts and questions. For a more comprehensive analysis and detailed discussion of these results, please refer to Ting et al. (2024).

## Updated LLM Astronomy Performance Scores

### Open-Weight Models
Model | Score (%) |
---|---|
AstroMLab/AstroSage Series | |
AstroSage-8B (AstroMLab, de Haan et al., in prep) | 79.1 |
Meta/LLaMA Series | |
LLaMA-2-7B | 50.3 |
AstroLLaMA-2-7B (UniverseTBD, Perkowski et al., 2024) | 44.3 🔻 |
LLaMA-2-70B | 70.7 |
AstroLLaMA-2-70B (AstroMLab, Pan et al., 2024) | 72.3 ▲ |
LLaMA-3-8B | 72.9 |
LLaMA-3-70B | 80.4 |
LLaMA-3.1-8B | 73.7 |
LLaMA-3.1-70B | 80.1 |
LLaMA-3.1-405B | 83.8 🥇 |
LLaMA-3.2-11B | 74.9 |
LLaMA-3.2-90B | 80.6 🥈 |
Mistral AI Series | |
Mistral-7B-v0.1 | 48.1 |
Mixtral-8x7B-v0.1 | 73.7 |
Mixtral-8x22B-v0.1 | 77.7 |
Mistral-7B-v0.2 | 62.1 |
Mistral-7B-v0.3 | 63.9 |
Mistral-12B-Nemo | 71.6 |
Microsoft/Phi Series | |
Phi-2-3B | 65.6 |
Phi-3-4B | 71.7 |
Phi-3-14B | 75.6 |
Phi-3.5-4B | 72.8 |
Google/Gemma Series | |
Gemma-1-2B | 44.1 |
Gemma-1-7B | 56.1 |
Gemma-2-2B | 58.5 |
Gemma-2-9B | 71.5 |
Gemma-2-27B | 75.3 |
Alibaba/Qwen(通义千问) Series | |
Qwen-1-7B | 57.4 |
Qwen-1.5-7B | 63.7 |
Qwen-1.5-14B | 67.7 |
Qwen-1.5-32B | 73.2 |
Qwen-1.5-110B | 72.7 |
Qwen-2-7B | 68.0 |
Qwen-2-57B | 71.8 |
Qwen-2-72B | 77.7 |
Qwen-2.5-7B | 70.4 |
Qwen-2.5-72B | 78.6 |
Nvidia/Nemotron Series | |
Nemotron-340B | 80.6 🥈 |
01/Yi(零一万物) Series | |
Yi-1.5-6B | 61.0 |
Yi-1.5-9B | 68.4 |
Yi-1.5-34B | 73.1 |
Deepseek(深度求索) Series | |
Deepseek-67B | 63.1 |
Zhipu(智谱)/ChatGLM Series | |
ChatGLM3-6B | 50.4 |
GLM-4-9B | 67.0 |
PJ Lab(浦语)/InternLM(书生) Series | |
InternLM-2.5-7B | 64.5 |
InternLM-2.5-20B | 66.7 |

### Proprietary Models

Model | Score (%) |
---|---|
OpenAI/GPT Series | |
GPT-3.5 | 70.4 |
GPT-4 | 74.5 |
GPT-4o-Mini | 76.1 |
GPT-4o | 80.4 |
Anthropic/Claude Series | |
Claude-2.0 | 75.3 |
Claude-3.0-Haiku | 77.9 |
Claude-3.0-Sonnet | 76.7 |
Claude-3.0-Opus | 82.7 🥈 |
Claude-3.5-Sonnet | 85.0 🥇 |
Google/Gemini Series | |
Gemini-1.0-Pro-001 | 71.0 |
Gemini-1.5-Flash-001 | 73.6 |
Gemini-1.5-Pro-001 | 77.6 |
Gemini-1.5-Flash-002 | 76.5 |
Gemini-1.5-Pro-002 | 78.2 |
Mistral AI Series | |
Mistral-Large-1 | 76.4 |
Mistral-Large-2 | 80.8 🥉 |
Zhipu(智谱)/GLM Series | |
GLM-3-Turbo | 64.3 |
GLM-4-Flash | 67.1 |
GLM-4-Air | 72.9 |
GLM-4-AirX | 72.5 |
GLM-4-0520 | 75.1 |
Baidu/ERNIE(文心一言) Series | |
ERNIE-3.5 | 72.1 |
ERNIE-4.0 | 75.1 |
Deepseek(深度求索) Series | |
Deepseek-v2 | 73.6 |
Deepseek-v2.5 | 73.9 |
Step(阶跃星辰) Series | |
Step-1 | 75.2 |
Step-2 | 76.6 |
ByteDance/Doubao(豆包) Series | |
Doubao-Lite | 60.5 |
Doubao-Pro | 70.1 |
MiniMax AI Series | |
ABAB-5.5 | 69.5 |
ABAB-6.5 | 72.7 |
01/Yi(零一万物) Series | |
Yi-Medium | 70.3 |
Yi-Large | 77.3 |
Moonshot(月之暗面)/Kimi Series | |
Moonshot-v1 | 72.3 |
Perplexity Series | |
Perplexity-Llama-3.1-Sonar-Small | 72.0 |
Perplexity-Llama-3.1-Sonar-Large | 76.7 |
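The tables above can also be explored programmatically. A minimal sketch, using a subset of the scores transcribed from the tables (the selection of models is ours, for illustration):

```python
# Subset of benchmark scores from the tables above (model -> score in %).
scores = {
    "Claude-3.5-Sonnet": 85.0,
    "LLaMA-3.1-405B": 83.8,
    "Claude-3.0-Opus": 82.7,
    "Mistral-Large-2": 80.8,
    "Nemotron-340B": 80.6,
    "GPT-4o": 80.4,
    "LLaMA-3-70B": 80.4,
    "AstroSage-8B": 79.1,
    "GPT-3.5": 70.4,
    "LLaMA-2-7B": 50.3,
}

# Rank models from best to worst.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {model:20s} {score:.1f}")
```

Note how AstroSage-8B, an 8-billion-parameter model, ranks alongside models one to two orders of magnitude larger, which is the headline result of the first table.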
Cost efficiency is crucial when deploying LLMs as research agents in astronomy. Balancing model performance against operational cost matters both for institutions with limited resources and for large-scale projects that process vast amounts of astronomical data. The figure below illustrates the cost-performance relationship across various models, offering insights into efficient choices for different research scenarios and budget constraints.
Figure: Cost and performance trade-off in astronomical Q&A. The dual x-axes show the cost per 0.1 million tokens (typical for agent deployment on one astronomical source) and the cost to process the full ArXiv astro-ph corpus (~3B tokens). We use the average of input and output token costs. To avoid crowding, only representative models as of September 2024 are shown; full performances are in the tables above.
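The corpus-scale costs on the figure's second axis follow from simple arithmetic. A sketch, assuming illustrative placeholder prices rather than any provider's actual rates:

```python
# Estimate the cost of processing a token corpus from a per-million-token price.
# Prices used below are illustrative placeholders, not quotes from any provider.
ASTRO_PH_TOKENS = 3e9  # ~3B tokens: the full ArXiv astro-ph corpus, per the figure caption

def corpus_cost(price_per_mtok_in: float, price_per_mtok_out: float,
                tokens: float = ASTRO_PH_TOKENS) -> float:
    """Cost in dollars, using the average of input and output token prices
    (the same convention as the figure)."""
    avg_price = (price_per_mtok_in + price_per_mtok_out) / 2
    return avg_price * tokens / 1e6

# Example: a hypothetical model priced at $3/M input and $9/M output tokens.
print(f"${corpus_cost(3.0, 9.0):,.0f}")  # → $18,000
```

At billions of tokens, even modest per-million-token price differences translate into thousands of dollars per corpus pass, which is why the cost axis spans orders of magnitude.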