Benchmarking

One of the most exciting aspects of exploring astronomy LLMs is the unprecedented opportunity they provide for benchmarking capabilities that were previously difficult to quantify.

Progress in astronomy requires a myriad of reasoning skills, from mathematical derivation and physical insight to statistical modeling. The combination of these skills lies somewhere between factual retrieval, competition-style mathematical derivation, and commonsense reasoning, making LLM performance in this domain challenging to assess.

Until now, however, there has been a significant gap in gold-standard question-and-answer datasets specifically tailored to scientific reasoning, which are essential for testing these models. To address this gap, a key focus of our research is to establish a standard for Q&A in astronomy, a non-trivial field that has remained largely unexplored in this respect.

We have recently released our benchmarking datasets here, and the details of how we curated them are currently being documented. Using this dataset, we have compared our specialized models both among themselves and against other open-source models, demonstrating a notable improvement in their abilities. The detailed benchmarking results are listed below, and a sketch of how such metrics can be computed follows the table:

| Model Name | Base Model | BLEU Score | Factual Responses | Perplexity | Embedding Quality |
|---|---|---|---|---|---|
| LLaMa-2-7b | - | | | | |
| AstroLLaMA-7b | LLaMa-2-7b | | | | |
| AstroLLaMA-chat-7b | AstroLLaMA-7b | | | | |
| LLaMa-2-70b | - | | | | |
| AstroLLaMA-70b | LLaMa-2-70b | | | | |
| Phi-2 | - | | | | |
| AstroLLaMA-Phi-2 | Phi-2 | | | | |
| Phi-3 | - | | | | |
| AstroLLaMA-Phi-3 | Phi-3 | | | | |
| Mistral-7b | - | | | | |
| AstroLLaMA-Mistral-7b | Mistral-7b | | | | |
| LLaMa-3-8b | - | | | | |
| AstroLLaMA-LLaMa-3-8b | LLaMa-3-8b | | | | |
| QWen-7b | - | | | | |
| AstroLLaMA-QWen-7b | QWen-7b | | | | |
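
For illustration, here is a minimal sketch of how one of the table's metrics, perplexity, could be computed for a given model on a Q&A benchmark using Hugging Face Transformers. This is not our actual evaluation pipeline: the model identifier, the file name `benchmark.jsonl`, and its `question`/`answer` fields are assumptions made for the example, not the released dataset's schema.

```python
# Minimal sketch (assumed setup, not our evaluation pipeline) of computing
# per-model perplexity on a Q&A benchmark with Hugging Face Transformers.
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed identifier; swap in any causal LM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
model.eval()


def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the causal LM."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


# Average perplexity over question-answer pairs in an assumed JSONL file.
scores = []
with open("benchmark.jsonl") as f:
    for line in f:
        item = json.loads(line)
        scores.append(perplexity(item["question"] + " " + item["answer"]))

print(f"Mean perplexity over {len(scores)} items: {sum(scores) / len(scores):.2f}")
```

Lower perplexity on domain text indicates that a model assigns higher likelihood to astronomy-specific language; the other metrics in the table (e.g., BLEU score, embedding quality) instead compare generated answers against reference answers.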