Our benchmarking dataset is available for download on Hugging Face.
Evaluating LLMs on astronomy provides an opportunity to quantify capabilities that were previously difficult to assess. Progress in astronomy requires a wide range of reasoning skills, which often combine factual retrieval, competition-style mathematical derivation, and common-sense reasoning. This combination makes it challenging to assess LLM performance in this domain.
Addressing the Gap in Scientific Reasoning Datasets
Until recently, there were no question-answering datasets specifically tailored to scientific reasoning in astronomy. To address this gap, we have focused on establishing a standard for astronomical Q&A in this non-trivial and largely unexplored field.
Our Benchmarking Dataset
In collaboration with Argonne National Laboratory, we have developed a comprehensive astronomical benchmarking dataset, designed to evaluate LLM performance in astronomical research contexts. This dataset comprises both Q&A and Multiple Choice Question (MCQ) components. Key features include:
- Source: 885 articles from the Annual Review of Astronomy and Astrophysics (ARAA), 1963-2023
- Quality Control:
  - Questions are specific to article content but general enough for independent use
  - Answers are general, not pointing to specific sections
  - Answer options are balanced in length
- Explanation and Citation: Each question includes an explanation and supporting paragraphs from the review
- Human Validation: Experts reviewed a subset of questions to ensure adequacy
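The "balanced in length" criterion above can be made concrete in several ways. The sketch below shows one possibility: it compares the longest and shortest answer options by character count. The function names and the `max_ratio` threshold are illustrative assumptions, not the procedure actually used to build the dataset.

```python
def length_ratio(options):
    """Return the longest-to-shortest character-length ratio of the options."""
    lengths = [len(text.strip()) for text in options]
    if not lengths or min(lengths) == 0:
        raise ValueError("options must be a non-empty list of non-empty strings")
    return max(lengths) / min(lengths)


def is_balanced(options, max_ratio=2.0):
    """Hypothetical criterion: the longest option is at most max_ratio times
    the length of the shortest one. The default of 2.0 is an assumption."""
    return length_ratio(options) <= max_ratio


# Answer options from the circumgalactic medium example on this page.
cgm_options = [
    "Star formation rate and supernova feedback.",
    "Gas cooling and stellar winds.",
    "Gravity-driven infall and gas cooling.",
    "Magnetic fields and thermal conduction.",
]
print(is_balanced(cgm_options))
```

A check of this kind matters because a correct option that is noticeably longer or more detailed than the distractors can be guessed without any astronomical knowledge.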
This dataset is designed to:
- Evaluate LLM performance in astronomical research contexts
- Test knowledge of astronomical facts and community consensus
- Assess models’ abilities to link insights across diverse subfields
- Gauge understanding of the interdisciplinary nature of astronomical research
Example Questions
Here are some examples from our dataset:
Topic: Quasar Number Density
Question: What is the primary reason for the decline in the number density of luminous quasars at redshifts greater than 5?
A) A decrease in the overall star formation rate, leading to fewer potential host galaxies for quasars.
B) An increase in the neutral hydrogen fraction in the intergalactic medium, which obscures the quasars' light.
C) A decrease in the number of massive black hole seeds that can form and grow into supermassive black holes.
D) An increase in the average metallicity of the Universe, leading to a decrease in the efficiency of black hole accretion.
Correct Answer: C
Topic: Cosmological Simulations
Question: What is the primary goal of calibrating subgrid feedback models in cosmological simulations?
A) To ensure that simulations accurately reproduce the observed properties of the interstellar medium.
B) To create a diverse range of galaxy morphologies in the simulations.
C) To achieve convergence in simulation results across different resolutions and box sizes.
D) To steer simulations towards producing a broadly realistic galaxy population that is consistent with key observational constraints.
Correct Answer: D
Topic: Circumgalactic Medium
Question: The properties of the circumgalactic medium (CGM) primarily depend on the competition between:
A) Star formation rate and supernova feedback.
B) Gas cooling and stellar winds.
C) Gravity-driven infall and gas cooling.
D) Magnetic fields and thermal conduction.
Correct Answer: C
These examples demonstrate the depth and breadth of astronomical knowledge covered in our benchmarking dataset.
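As a rough illustration of how MCQ items like these could be scored, the sketch below formats a question as a prompt, extracts a letter from a model's reply, and computes accuracy. The `MCQ` class, its field names, and the letter-extraction heuristic are assumptions made for this example; they do not describe the dataset's actual schema or an official evaluation script.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQ:
    # Hypothetical record layout; the released dataset may use other fields.
    topic: str
    question: str
    options: dict  # maps a letter ("A".."D") to the option text
    answer: str    # letter of the correct option


def format_prompt(item):
    """Render a question and its lettered options as a plain-text prompt."""
    lines = [f"Question: {item.question}"]
    lines += [f"{letter}) {text}" for letter, text in sorted(item.options.items())]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def extract_choice(response):
    """Naive heuristic: return the first standalone capital A-D, else None."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy(items, responses):
    """Fraction of responses whose extracted letter matches the answer key."""
    if not items or len(items) != len(responses):
        raise ValueError("items and responses must be non-empty and equal in length")
    correct = sum(extract_choice(r) == item.answer for item, r in zip(items, responses))
    return correct / len(items)


cgm = MCQ(
    topic="Circumgalactic Medium",
    question=("The properties of the circumgalactic medium (CGM) primarily "
              "depend on the competition between:"),
    options={
        "A": "Star formation rate and supernova feedback.",
        "B": "Gas cooling and stellar winds.",
        "C": "Gravity-driven infall and gas cooling.",
        "D": "Magnetic fields and thermal conduction.",
    },
    answer="C",
)

print(format_prompt(cgm))
print(accuracy([cgm, cgm], ["C", "The answer is B."]))
```

The extraction step is deliberately simple and would misread a reply that begins with a sentence such as "A decrease in ...", so a real evaluation would likely constrain the output format or parse replies more carefully.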