Welcome to AstroMLab, a dynamic group of astronomers and computer scientists who are passionate about pushing the boundaries of Large Language Models (LLMs) in astronomy. Our team comprises leading astronomers, top natural language processing experts from Oak Ridge National Laboratory and Argonne National Laboratory, and frontier arXivists from the NASA Astrophysics Data System who possess deep insight into the vast astronomy literature corpus. We are also fortunate to have a group of enthusiastic young researchers who are fearlessly bridging the gap between astronomy and LLMs.
We are grateful for the support we receive, including access to the massive computing power of the Frontier nodes at Oak Ridge and the backing of Microsoft Research for our project. While the field of LLMs is advancing at a breakneck pace, we firmly believe that real progress in AI-driven scientific research, particularly in astronomy, requires deep domain knowledge. However, such efforts are still largely lacking in astronomy, which has compelled us to unite and tackle this challenge head-on.
Despite being a relatively young group, we have already made significant strides. We published the first LLMs in astronomy, AstroLLaMA-7b and AstroLLaMA-chat-7b (previously part of UniverseTBD; the core AstroLLaMA team has since moved on). Most recently, we released AstroLLaMA-70b; all of these models are based on LLaMA-2. Building on that work, we have also created a complete set of “arena” models by performing full-parameter fine-tuning on Phi-2, Phi-3, Mistral-7b, LLaMA-3-8b, and Qwen-7b.
Our team is rapidly expanding, and we would love to hear from you! Feel free to reach out to us here.
We are also fully committed to open source: all our models are released immediately on Hugging Face, where you can find them here.
Yuan-Sen Ting Australian National University
Rui Pan University of Illinois Urbana-Champaign
Josh Nguyen University of Pennsylvania
Hardik Arora Indian Institute of Technology
Zechang Sun Tsinghua University
Tirthankar Ghosal Oak Ridge National Laboratory
Alberto Accomazzi NASA Astrophysics Data System
Yuwei Yang Australian National University
Azton Wells Argonne National Laboratory
Nesar Ramachandra Argonne National Laboratory
Sandeep Madireddy Argonne National Laboratory
Josh Nguyen, Yuan-Sen Ting et al., 2023, arXiv:2309.06126
Ernest Perkowski, Rui Pan, Josh Nguyen, et al., 2024, arXiv:2401.01916
Our study introduces the AstroLLaMA series: AstroLLaMA-7b, AstroLLaMA-7b-Chat, and AstroLLaMA-70b. These models are the result of full-parameter fine-tuning of LLaMA-2 on the astronomy arXiv corpus. Training incorporates conversational capabilities and diverse, curated astronomy question-answering datasets, enhancing performance on astronomy-focused question-answering tasks.
AstroLLaMA-Chat stands as the first open-source conversational AI tool tailored specifically for the astronomy community. To the best of our knowledge, this also marks the first-ever publicly available specialized LLM in astronomy, setting a new standard in the field.
Even with the 7b models, we demonstrate that AstroLLaMA achieves a remarkable 30% lower perplexity than LLaMA-2. This substantial reduction in perplexity highlights the model’s successful domain adaptation and its potential to generate more insightful and scientifically relevant text completions.
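As a reminder of the metric, perplexity is the exponential of the mean negative log-likelihood per token, so lower values mean the model assigns higher probability to held-out text. Here is a minimal sketch of the computation; the per-token log-probabilities below are purely illustrative stand-ins, not numbers from our evaluation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities for the same passage,
# scored by a base model and by a domain-adapted model.
base_lps = [-2.0, -1.5, -2.5, -2.0]   # stand-in for the base model
tuned_lps = [-1.4, -1.0, -1.8, -1.4]  # stand-in for the fine-tuned model

ppl_base = perplexity(base_lps)
ppl_tuned = perplexity(tuned_lps)
reduction = 1 - ppl_tuned / ppl_base  # fractional perplexity reduction
```

A lower mean negative log-likelihood on astronomy text is exactly what domain adaptation targets, which is why perplexity is a natural headline metric here.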
AstroLLaMA’s embedding space exhibits a higher quality in reflecting semantic similarities among astronomy texts. Compared to other embedding models, our model demonstrates a more granular representation of astronomical content, facilitating enhanced retrieval augmented generation.
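Semantic similarity between texts in an embedding space is typically scored with cosine similarity between their embedding vectors. The sketch below uses toy 3-dimensional vectors and illustrative topic labels (real embedding models produce vectors with hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: two related astronomy topics and one unrelated topic.
quasar = [0.9, 0.1, 0.2]
agn = [0.85, 0.15, 0.25]
cooking = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(quasar, agn)        # expected: close to 1
sim_unrelated = cosine_similarity(quasar, cooking)  # expected: much lower
```

In retrieval-augmented generation, these similarity scores rank candidate passages, so a more granular embedding space directly improves which astronomy texts get retrieved.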
Coming soon.