AstroBench

Our benchmarking dataset is available for download on Hugging Face.
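
For a quick start, the dataset can be pulled with the Hugging Face `datasets` library. A minimal sketch, assuming an illustrative dataset ID and split (check the AstroMLab organization page on Hugging Face for the exact listing):

```python
# Minimal sketch: load the benchmark with the Hugging Face `datasets` library.
# The dataset ID, split, and printed fields are illustrative assumptions;
# see the AstroMLab page on Hugging Face for the published values.
from datasets import load_dataset

ds = load_dataset("AstroMLab/AstroBench_MCQ_v1_Public", split="train")  # assumed ID
print(len(ds), "questions loaded")
print(ds[0])  # inspect one record
```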

Exploring LLMs for astronomy provides an unprecedented opportunity to quantify capabilities that were previously difficult to assess. Progress in astronomy draws on a wide range of reasoning skills, often combining factual recall, mathematical derivation, and common-sense inference, which makes LLM performance in this domain challenging to assess.

Addressing the Gap in Scientific Reasoning Datasets

Until recently, there was a void in question-and-answer datasets specifically tailored to scientific reasoning in astronomy. To address this gap, we have focused on establishing a standard benchmark for astronomical Q&A in this non-trivial and largely unexplored field.

Our Benchmarking Dataset

In collaboration with Argonne National Laboratory, we have developed a comprehensive astronomical benchmarking dataset designed to evaluate LLM performance in astronomical research contexts. The dataset comprises both Q&A and multiple-choice question (MCQ) components. Key features include (see the record sketch after this list):

  • Source: 885 articles from the Annual Review of Astronomy and Astrophysics (ARAA), 1963-2023
  • Quality Control:
    • Questions are specific to article content but general enough for independent use
    • Answers are self-contained and do not point back to specific sections of the source article
    • Answer options are balanced in length
  • Explanation and Citation: Each question includes an explanation and supporting paragraphs from the review
  • Human Validation: Domain experts reviewed a subset of questions to verify their quality
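
Putting the features above together, a single MCQ record might look roughly like the sketch below. The field names are our assumption for illustration, not the published schema:

```python
# Hypothetical shape of one MCQ record, reflecting the features listed above.
# All field names are assumptions for illustration, not the published schema.
record = {
    "question": "What is the primary reason for the decline in the number "
                "density of luminous quasars at redshifts greater than 5?",
    "options": {
        "A": "A decrease in the overall star formation rate ...",
        "B": "An increase in the neutral hydrogen fraction ...",
        "C": "A decrease in the number of massive black hole seeds ...",
        "D": "An increase in the average metallicity of the Universe ...",
    },
    "answer": "C",
    "explanation": "Short justification of the correct choice.",
    "support": ["Paragraph(s) from the ARAA review backing the answer."],
    "source": "ARAA article identifier (1963-2023 volumes)",
}
```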

This dataset is designed to:

  • Evaluate LLM performance in astronomical research contexts
  • Test astronomical facts and community consensus
  • Assess models’ abilities to link insights across diverse subfields
  • Gauge understanding of the interdisciplinary nature of astronomical research

Example Questions

Here are some examples from our dataset:

Topic: Quasar Number Density

Question: What is the primary reason for the decline in the number density of luminous quasars at redshifts greater than 5?

A) A decrease in the overall star formation rate, leading to fewer potential host galaxies for quasars.
B) An increase in the neutral hydrogen fraction in the intergalactic medium, which obscures the quasars' light.
C) A decrease in the number of massive black hole seeds that can form and grow into supermassive black holes.
D) An increase in the average metallicity of the Universe, leading to a decrease in the efficiency of black hole accretion.

Correct Answer: C

Topic: Cosmological Simulations

Question: What is the primary goal of calibrating subgrid feedback models in cosmological simulations?

A) To ensure that simulations accurately reproduce the observed properties of the interstellar medium.
B) To create a diverse range of galaxy morphologies in the simulations.
C) To achieve convergence in simulation results across different resolutions and box sizes.
D) To steer simulations towards producing a broadly realistic galaxy population that is consistent with key observational constraints.

Correct Answer: D

Topic: Circumgalactic Medium

Question: The properties of the circumgalactic medium (CGM) primarily depend on the competition between:

A) Star formation rate and supernova feedback.
B) Gas cooling and stellar winds.
C) Gravity-driven infall and gas cooling.
D) Magnetic fields and thermal conduction.

Correct Answer: C

These examples demonstrate the depth and breadth of astronomical knowledge covered in our benchmarking dataset.

Benchmarking Results

Our benchmarking results reveal significant variations in performance across both open-weights and proprietary large language models when tested on astronomy-specific questions. For a comprehensive analysis and detailed discussion of these results, please refer to Ting et al. (2024).
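
As a rough illustration of how such accuracy figures can be computed, here is a minimal Python scoring loop for the MCQ component, assuming records shaped like the sketch above. The zero-shot prompt and first-letter extraction rule are simplifying assumptions, not the exact protocol of Ting et al. (2024):

```python
import re

def format_prompt(q: dict) -> str:
    """Render one MCQ as a zero-shot prompt that asks for a single letter."""
    opts = "\n".join(f"{k}) {v}" for k, v in q["options"].items())
    return f"{q['question']}\n{opts}\nAnswer with a single letter (A, B, C, or D)."

def extract_letter(reply: str) -> str | None:
    """Take the first standalone A-D in the model's reply as its answer."""
    m = re.search(r"\b([ABCD])\b", reply)
    return m.group(1) if m else None

def score(questions: list[dict], ask) -> float:
    """Accuracy (%) of `ask` (a callable: prompt string -> reply string)."""
    hits = sum(extract_letter(ask(format_prompt(q))) == q["answer"] for q in questions)
    return 100.0 * hits / len(questions)
```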

Open-Weights Models

Scores are percentage accuracy on the MCQ benchmark. ▲/🔻 mark a domain-tuned model's gain or regression relative to its base model; 🥇🥈🥉 mark the top three scores within each table.

| Model | Score (%) |
| --- | --- |
| **AstroMLab/AstroSage Series** | |
| AstroSage-8B (AstroMLab, de Haan et al., 2024) | 80.9 ▲ ▲ ▲ 🥈 |
| **Meta/LLaMA Series** | |
| LLaMA-2-7B | 50.3 |
| AstroLLaMA-2-7B (UniverseTBD, Perkowski et al., 2024) | 44.3 🔻 |
| LLaMA-2-70B | 70.7 |
| AstroLLaMA-2-70B (AstroMLab, Pan et al., 2024) | 72.3 ▲ |
| LLaMA-3-8B | 72.9 |
| LLaMA-3-70B | 80.4 |
| LLaMA-3.1-8B | 73.7 |
| LLaMA-3.1-70B | 80.1 |
| LLaMA-3.1-405B | 83.8 🥇 |
| LLaMA-3.2-11B | 74.9 |
| LLaMA-3.2-90B | 80.6 🥉 |
| **Mistral AI Series** | |
| Mistral-7B-v0.1 | 48.1 |
| Mixtral-8x7B-v0.1 | 73.7 |
| Mixtral-8x22B-v0.1 | 77.7 |
| Mistral-7B-v0.2 | 62.1 |
| Mistral-7B-v0.3 | 63.9 |
| Mistral-12B-Nemo | 71.6 |
| **Microsoft/Phi Series** | |
| Phi-2-3B | 65.6 |
| Phi-3-4B | 71.7 |
| Phi-3-14B | 75.6 |
| Phi-3.5-4B | 72.8 |
| **Google/Gemma Series** | |
| Gemma-2-2B | 58.5 |
| Gemma-2-9B | 71.5 |
| Gemma-2-27B | 75.3 |
| **Alibaba/Qwen (通义千问) Series** | |
| Qwen-2.5-7B | 70.4 |
| Qwen-2.5-72B | 78.6 |
| **Nvidia/Nemotron Series** | |
| Nemotron-340B | 80.6 🥉 |
| **01/Yi (零一万物) Series** | |
| Yi-1.5-6B | 61.0 |
| Yi-1.5-9B | 68.4 |
| Yi-1.5-34B | 73.1 |
| **DeepSeek (深度求索) Series** | |
| DeepSeek-67B | 63.1 |
| **Zhipu (智谱)/ChatGLM Series** | |
| ChatGLM3-6B | 50.4 |
| GLM-4-9B | 67.0 |
| **PJ Lab (浦语)/InternLM (书生) Series** | |
| InternLM-2.5-7B | 64.5 |
| InternLM-2.5-20B | 66.7 |


Proprietary Models

| Model | Score (%) |
| --- | --- |
| **OpenAI/GPT Series** | |
| o1-mini | 80.1 |
| o1-preview | 81.6 🥉 |
| GPT-3.5 | 70.4 |
| GPT-4 | 74.5 |
| GPT-4o-Mini | 76.1 |
| GPT-4o | 80.4 |
| **Anthropic/Claude Series** | |
| Claude-2.0 | 75.3 |
| Claude-3.0-Haiku | 77.9 |
| Claude-3.0-Sonnet | 76.7 |
| Claude-3.0-Opus | 82.7 🥈 |
| Claude-3.5-Sonnet | 85.0 🥇 |
| **Google/Gemini Series** | |
| Gemini-1.0-Pro-001 | 71.0 |
| Gemini-1.5-Flash-001 | 73.6 |
| Gemini-1.5-Pro-001 | 77.6 |
| Gemini-1.5-Flash-002 | 76.5 |
| Gemini-1.5-Pro-002 | 78.2 |
| **Mistral AI Series** | |
| Mistral-Large-1 | 76.4 |
| Mistral-Large-2 | 80.8 |
| **xAI Series** | |
| Grok-Beta | 79.5 |
| **Zhipu (智谱)/GLM Series** | |
| GLM-3-Turbo | 64.3 |
| GLM-4-Flash | 67.1 |
| GLM-4-Air | 72.9 |
| GLM-4-AirX | 72.5 |
| GLM-4-0520 | 75.1 |
| GLM-4-Plus | 77.9 |
| **Baidu/ERNIE (文心一言) Series** | |
| ERNIE-3.5 | 72.1 |
| ERNIE-4.0 | 75.1 |
| **DeepSeek (深度求索) Series** | |
| DeepSeek-v2 | 73.6 |
| DeepSeek-v2.5 | 73.9 |
| **Step (阶跃星辰) Series** | |
| Step-1 | 75.2 |
| Step-2 | 80.5 |
| **ByteDance/Doubao (豆包) Series** | |
| Doubao-Lite | 60.5 |
| Doubao-Pro | 70.1 |
| **MiniMax AI Series** | |
| ABAB-5.5 | 69.5 |
| ABAB-6.5 | 72.7 |
| **01/Yi (零一万物) Series** | |
| Yi-Medium | 70.3 |
| Yi-Large | 77.3 |
| Yi-Lightning | 78.0 |
| **Moonshot (月之暗面)/Kimi Series** | |
| Moonshot-v1 | 72.3 |
| **Perplexity Series** | |
| Perplexity-Llama-3.1-Sonar-Small | 72.0 |
| Perplexity-Llama-3.1-Sonar-Large | 76.7 |


Cost and Performance Trade-off in Astronomical Q&A

Cost efficiency is crucial when deploying LLMs as research agents in astronomy. The figure below illustrates the cost-performance relationship across various models.

[Figure: Cost and performance trade-off in astronomical Q&A]

Figure: Our flagship model, AstroSage-LLaMA-3.1-8B, demonstrates exceptional performance in astronomical knowledge recall, achieving 80.9% accuracy on the AstroMLab-1 benchmark. This performance is comparable to OpenAI’s latest GPT-4o model and represents a substantial 7-point improvement over its base LLaMA-3.1-8B model.
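
A plot in this spirit can be sketched as follows. The scores come from the tables above, but the per-token costs are placeholder values (real API prices vary and change frequently), so treat the output as illustrative only:

```python
import matplotlib.pyplot as plt

# Scores are taken from the tables above; the per-million-token costs are
# placeholder values (real API prices vary and change), purely illustrative.
points = {
    "AstroSage-8B":      (0.1, 80.9),
    "LLaMA-3.1-8B":      (0.1, 73.7),
    "GPT-4o":            (5.0, 80.4),
    "Claude-3.5-Sonnet": (6.0, 85.0),
}
for name, (cost, acc) in points.items():
    plt.scatter(cost, acc)
    plt.annotate(name, (cost, acc))
plt.xscale("log")  # API prices span orders of magnitude
plt.xlabel("Cost (USD per million tokens, placeholder values)")
plt.ylabel("AstroMLab-1 score (%)")
plt.title("Cost vs. performance (illustrative sketch)")
plt.tight_layout()
plt.show()
```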