AstroMLab is a dynamic group of astrophysicists and computer scientists developing Large Language Models (LLMs) for astronomy; the full team roster appears below.
Our flagship models, AstroSage-LLaMA-3.1-70B and AstroSage-LLaMA-3.1-8B, achieve 86.2% and 80.9% accuracy, respectively, on the AstroMLab-1 benchmark. The 70B model ties Claude-4-Opus for the highest score, while the 8B model performs comparably to Mistral-Large-v2 at a fraction of the cost (see AstroBench).
| Model | Score (%) |
|---|---|
| AstroSage-LLaMA-3.1-70B (AstroMLab) | 86.2 |
| Claude-4-Opus | 86.3 |
| o3 | 85.4 |
| Claude-4-Sonnet | 85.0 |
| Gemini-2.5-Pro | 84.8 |
| GPT-4.1 | 84.7 |
| o4-Mini | 84.7 |
| Deepseek-R1 | 84.4 |
| Qwen-3-235B | 84.0 |
| LLaMA-4-Maverick | 83.4 |
| Deepseek-v3-2503 | 82.9 |
| Gemini-2.5-Flash-0520 | 82.3 |
| LLaMA-4-Scout | 82.2 |
| Mistral-Medium-v3 | 81.8 |
| Grok-3 | 81.7 |
| AstroSage-LLaMA-3.1-8B (AstroMLab) | 80.9 |
| Mistral-Large-v2 | 80.8 |
| Qwen-3-32B | 79.7 |
| Mistral-Small-v3.1 | 78.6 |
| Gemini-2-Flash-Lite | 78.4 |
| GPT-4.1-Nano | 78.0 |
| Gemma-3-27B | 76.9 |
| Qwen-3-14B | 76.4 |
| AstroLLaMA-2-7B | 44.3 |
All our models are available on Hugging Face.
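For example, the models can be queried with the standard `transformers` API. The sketch below is illustrative only; the repository id is an assumption, so check the AstroMLab organization page on Hugging Face for the exact model names.

```python
# Minimal sketch: querying an AstroSage model via Hugging Face transformers.
# The repository id is a placeholder assumption; verify the exact name on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AstroMLab/AstroSage-8B"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is the Eddington luminosity?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```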
Contact us: astromachinelearninglab@gmail.com
Our team:

- Yuan-Sen Ting (The Ohio State University)
- Tirthankar Ghosal (Oak Ridge National Laboratory)
- Tijmen de Haan (KEK)
- Josh Nguyen (University of Pennsylvania)
- Rui Pan (University of Illinois Urbana-Champaign)
- Hardik Arora (Indian Institutes of Technology)
- Emily Herron (Oak Ridge National Laboratory)
- Yuwei Yang (Australian National University)
- Zechang Sun (Tsinghua University)
- Alberto Accomazzi (NASA Astrophysics Data System)
- Azton Wells (Argonne National Laboratory)
- Nesar Ramachandra (Argonne National Laboratory)
- Sandeep Madireddy (Argonne National Laboratory)
We present AstroSage-LLaMA-3.1-70B, a 70-billion-parameter domain-specialized language model that achieves state-of-the-art performance on astronomical knowledge tasks. Built from Meta-Llama-3.1-70B through extensive continued pre-training on astronomical literature, supervised fine-tuning, and model merging, it demonstrates that domain specialization can push an open-weight model past even the most advanced commercial alternatives.
Key points:
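One ingredient of the recipe above is model merging. As a rough illustration of that general technique only (not the published AstroSage recipe, which may use SLERP or other schemes), the sketch below linearly interpolates the weights of two fine-tuned checkpoints that share an architecture; all paths are placeholders.

```python
# Minimal sketch of linear weight-space model merging (a "model soup").
# Illustrative only: the actual AstroSage merging procedure may differ.
import torch
from transformers import AutoModelForCausalLM

# Two fine-tuned checkpoints with identical architectures (placeholder paths).
model_a = AutoModelForCausalLM.from_pretrained("path/to/checkpoint-a")
model_b = AutoModelForCausalLM.from_pretrained("path/to/checkpoint-b")

alpha = 0.5  # interpolation weight; in practice tuned on a validation set
with torch.no_grad():
    state_b = model_b.state_dict()
    merged = {
        name: alpha * param + (1.0 - alpha) * state_b[name]
        for name, param in model_a.state_dict().items()
    }
model_a.load_state_dict(merged)
model_a.save_pretrained("path/to/merged-model")
```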
We present AstroSage-LLaMA-3.1-8B, a domain-specialized natural-language AI assistant tailored for research in astronomy, astrophysics, and cosmology. Through extensive data curation, massive continued pre-training, and supervised fine-tuning, we demonstrate that proper specialization of a relatively small model can achieve performance comparable to much larger flagship models.
Key points:
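For readers unfamiliar with the pipeline, the sketch below shows the continued pre-training stage in miniature using the Hugging Face `Trainer` with a causal-LM objective. The corpus file, base checkpoint, and hyperparameters are placeholders, not the actual AstroSage configuration, which used far larger corpora and compute.

```python
# Minimal sketch of continued pre-training on a domain corpus (placeholders throughout).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "meta-llama/Llama-3.1-8B"  # base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Plain-text corpus of astronomy literature (placeholder file).
corpus = load_dataset("text", data_files={"train": "astro_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="astro-cpt", per_device_train_batch_size=1,
        gradient_accumulation_steps=16, num_train_epochs=1,
        learning_rate=1e-5, bf16=True,
    ),
    train_dataset=tokenized["train"],
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```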
Rui Pan, Josh Nguyen, et al., 2024
We introduce two new models, AstroLLaMA-3-8B and AstroLLaMA-2-70B, building upon the previous AstroLLaMA series, and quantitatively assess specialized LLMs in astronomy using recently curated, high-quality astronomical MCQs.
Key points:
Yuan-Sen Ting, et al., 2024, arXiv:2407.11194
We present a comprehensive evaluation of proprietary and open-weight large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics.
Key findings:
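For context on how such a multiple-choice benchmark is typically scored, the sketch below ranks the answer letters by model log-likelihood and reports the most likely one. The prompt format and scoring rule are assumptions for illustration, not the exact AstroMLab-1 evaluation harness.

```python
# Minimal sketch of log-likelihood MCQ scoring (assumed format, not the AstroMLab harness).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def pick_answer(question: str, choices: dict[str, str]) -> str:
    """Return the choice letter the model assigns the highest likelihood."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in choices.items())
    prompt += "\nAnswer:"
    scores = {}
    for letter in choices:
        # Assumes " A", " B", ... each tokenize to a single token.
        ids = tokenizer(prompt + " " + letter,
                        return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            logits = model(ids).logits
        # Next-token distribution at the position preceding the answer letter.
        logprobs = torch.log_softmax(logits[0, -2], dim=-1)
        scores[letter] = logprobs[ids[0, -1]].item()
    return max(scores, key=scores.get)

# Benchmark accuracy is then the fraction of questions where
# pick_answer(...) matches the keyed answer.
```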
AstroLLaMA-2-7B and AstroLLaMA-2-7B-Chat: the first open-source conversational AI tools tailored for the astronomy community.