ALUE (Aerospace Language Understanding Evaluation) is a comprehensive framework for running inference with and evaluating Large Language Models (LLMs) on aerospace-specific datasets. The framework is user-friendly and versatile, supporting custom datasets, preferred models, user-defined prompts, and quantitative performance metrics. Contact Eugene Mangortey (emangortey@mitre.org) with inquiries.
- Multiple Task Types: MCQA, Summarization, RAG, and Extractive QA
- Flexible Backends: Works with OpenAI, vLLM, TGI, Ollama, Transformers, and more
- Advanced Evaluation: Combines traditional metrics with LLM-as-judge evaluation
# Recommended: using uv
uv sync
# Alternative: using pip
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Create a .env file:
cp .env-example .env
Minimal configuration for OpenAI:
ALUE_ENDPOINT_TYPE=openai
ALUE_OPENAI_API_KEY=sk-...
Note: Additional configuration is required for some tasks:
- RAG: requires EMBEDDING_* settings
- RAG & Summarization: require ALUE_LLM_JUDGE_* settings for evaluation
See the Configuration Reference for complete setup.
# Multiple Choice QA example
python -m scripts.mcqa inference \
-i data/aviation_knowledge_exam/3_1_aviation_test.json \
-o runs/mcqa \
-m gpt-4o-mini \
--task_type aviation_exam \
--num_examples 3 \
--num_questions 2 \
--schema_class schemas.aviation_exam.schema.MCQAResponse \
--field_to_extract answer
Results saved to runs/mcqa_<timestamp>/predictions.json
Evaluate the model's ability to answer multiple-choice questions with a single correct option.
python -m scripts.mcqa both \
-i data/mcqa/aviation_exam.json \
-o runs/mcqa \
-m gpt-4o-mini \
--task_type aviation_exam \
--num_examples 3 \
--num_questions 2 \
--schema_class schemas.aviation_exam.schema.MCQAResponse \
--field_to_extract answer
Metrics: Accuracy
python -m alue.rag_utils \
--document-directory ./tests/resources/ \
--database-path ./chroma_db \
--collection-name documents \
--output-path ./artifacts \
--partition-strategy hi_res \
--chunk-hard-max 1200 \
--chunk-soft-max 700 \
--overlap-size 50
python -m scripts.rag both \
-i data/dummy_rag/rag_qa.json \
-o runs/rag \
-m gpt-4o-mini \
--database-path ./chroma_db \
--collection-name documents \
--llm_judge_model_name gpt-4o-mini
Metrics: Recall@k, Context Relevancy, Composite Correctness
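Recall@k here is the standard retrieval measure: the fraction of gold evidence chunks that appear among the top-k retrieved chunks. A minimal illustrative sketch (not ALUE's internal implementation; `gold_ids` and `retrieved_ids` are hypothetical names for gold and retrieved chunk identifiers):

```python
def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int) -> float:
    """Fraction of gold evidence chunks that appear in the top-k retrieved chunks."""
    if not gold_ids:
        return 0.0
    return len(gold_ids & set(retrieved_ids[:k])) / len(gold_ids)

# Example: 1 of 2 gold chunks is retrieved within the top 3 -> 0.5
print(recall_at_k({"chunk_12", "chunk_40"}, ["chunk_7", "chunk_12", "chunk_3"], k=3))
```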
Generate concise summaries of aviation narratives.
python -m scripts.summarization both \
-i data/dummy_summarization/data.jsonl \
-o runs/sum \
-m gpt-4o-mini \
--llm_judge_model_name gpt-4o-mini
Metrics: Precision, Recall, F1 (via claim decomposition)
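Claim decomposition splits the generated and reference summaries into atomic claims and asks an LLM judge whether each claim is supported. As a hedged sketch of the scoring arithmetic only (the counts are assumed to come from the judge; this is one common formulation, not necessarily ALUE's exact one):

```python
def claim_scores(generated_claims: int, supported: int,
                 reference_claims: int, covered: int) -> dict[str, float]:
    """Precision/Recall/F1 over decomposed claims (illustrative formulation).

    generated_claims: claims extracted from the model's summary
    supported:        generated claims the judge finds supported by the source
    reference_claims: claims extracted from the reference summary
    covered:          reference claims the judge finds covered by the model's summary
    """
    precision = supported / generated_claims if generated_claims else 0.0
    recall = covered / reference_claims if reference_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(claim_scores(generated_claims=8, supported=6, reference_claims=5, covered=4))
```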
Extract exact text spans from documents that answer questions.
python -m scripts.extractive_qa both \
-i data/ntsb_tail_extraction/ntsb_tail_extraction_dummy.json \
-o runs/extractive_qa \
-m gpt-4o-mini \
--task_type ntsb_extract_tail_number
Metrics: Exact Match, F1 Score
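Exact Match and F1 for extractive QA follow the usual span-matching convention (SQuAD-style token overlap). A small illustrative sketch; ALUE's actual text normalization rules may differ:

```python
def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized spans are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between the predicted and gold spans."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("N123AB", "n123ab"), token_f1("tail number N123AB", "N123AB"))
```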
Backend | Type | Description |
---|---|---|
OpenAI | API | Direct OpenAI API access |
vLLM | API/Local | Fast inference with OpenAI-compatible API |
TGI | API | HuggingFace Text Generation Inference |
Ollama | API | Local model serving |
Transformers | Local | Direct HuggingFace transformers |
- OpenAI embeddings
- Ollama embeddings
- HuggingFace embeddings
- Local embeddings (default)
- OpenAI-compatible endpoints
ALUE_ENDPOINT_TYPE=openai
ALUE_OPENAI_API_KEY=sk-...
ALUE_LLM_JUDGE_ENDPOINT_TYPE=openai
ALUE_LLM_JUDGE_OPENAI_API_KEY=sk-...
EMBEDDING_ENDPOINT_TYPE=openai
EMBEDDING_API_KEY=sk-...
ALUE_ENDPOINT_TYPE=ollama
ALUE_ENDPOINT_URL=http://localhost:11434
ALUE_LLM_JUDGE_ENDPOINT_TYPE=ollama
ALUE_LLM_JUDGE_ENDPOINT_URL=http://localhost:11434
EMBEDDING_ENDPOINT_TYPE=local
# Fast inference with vLLM
ALUE_ENDPOINT_TYPE=vllm
ALUE_ENDPOINT_URL=http://localhost:8000/v1
# Separate judge to reduce bias
ALUE_LLM_JUDGE_ENDPOINT_TYPE=openai
ALUE_LLM_JUDGE_OPENAI_API_KEY=sk-...
# Local embeddings (no API costs)
EMBEDDING_ENDPOINT_TYPE=local
All tasks support three modes:
- Inference only - Generate predictions
- Evaluation only - Evaluate existing predictions
- Both - Run inference then evaluation
Example:
# Inference only
python -m scripts.mcqa inference -i data.json -o runs -m gpt-4o-mini
# Evaluation only
python -m scripts.mcqa evaluation -i data.json -o runs --predictions_file runs/predictions.json
# Both
python -m scripts.mcqa both -i data.json -o runs -m gpt-4o-mini
For RAG and Summarization tasks, ALUE uses LLM judges for nuanced evaluation:
- Context Relevancy: Assesses whether retrieved chunks are relevant
- Composite Correctness: Evaluates answer factuality via claim decomposition
- Claim-based Scoring: Precision/Recall/F1 for summaries
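As a rough sketch of how a judge-based metric such as Context Relevancy can be computed, each retrieved chunk is shown to the judge model and counted if it is deemed relevant. The `judge` callable and the prompt wording below are purely illustrative stand-ins for the model configured via ALUE_LLM_JUDGE_*, not ALUE's actual templates:

```python
from typing import Callable

def context_relevancy(question: str, chunks: list[str],
                      judge: Callable[[str], str]) -> float:
    """Fraction of retrieved chunks the judge model deems relevant to the question."""
    if not chunks:
        return 0.0
    relevant = 0
    for chunk in chunks:
        prompt = (
            "Does the following passage contain information relevant to the question?\n"
            f"Question: {question}\nPassage: {chunk}\nAnswer YES or NO."
        )
        if judge(prompt).strip().upper().startswith("YES"):
            relevant += 1
    return relevant / len(chunks)
```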
Use Pydantic schemas to enforce structured outputs:
python -m scripts.mcqa inference \
-i data.json \
-o runs \
-m gpt-4o-mini \
--schema_class MCQAResponse \
--field_to_extract answer
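The shipped schemas live under schemas/ (for example, schemas.aviation_exam.schema.MCQAResponse used above). As a hedged sketch of what such a Pydantic response model might look like; the field names are assumptions, and --field_to_extract must name a field defined on the model:

```python
from pydantic import BaseModel, Field

class MCQAResponse(BaseModel):
    """Illustrative structured output for a multiple-choice answer (field names assumed)."""
    answer: str = Field(description="Letter of the selected option, e.g. 'C'")
    rationale: str = Field(default="", description="Optional short justification")
```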
Customize prompts via Jinja2 templates in templates/<task>/:
- system.jinja2 - System message with few-shot examples
- user.jinja2 - User query template
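As a purely illustrative sketch of what a minimal template pair could look like; the context variables (examples, question, choices) are assumptions, so consult the templates shipped in templates/<task>/ for the real ones:

```jinja2
{# system.jinja2 (illustrative) #}
You are an aviation exam assistant. Answer with the letter of the single correct option.
{% for ex in examples %}
Q: {{ ex.question }}
A: {{ ex.answer }}
{% endfor %}

{# user.jinja2 (illustrative) #}
Q: {{ question }}
{% for choice in choices %}{{ loop.index }}. {{ choice }}
{% endfor %}
```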
Build ChromaDB from PDFs:
python -m alue.rag_utils \
--document-directory ./docs_pdfs \
--database-path ./chroma_db \
--collection-name documents \
--partition-strategy hi_res
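After the index is built, you can run a quick smoke test directly with the chromadb Python client (this snippet is not part of ALUE; the path and collection name match the command above):

```python
import chromadb

# Open the persisted database produced by alue.rag_utils.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("documents")

print("chunks indexed:", collection.count())

# Peek at a couple of stored chunks and their metadata.
sample = collection.get(limit=2, include=["documents", "metadatas"])
for doc, meta in zip(sample["documents"], sample["metadatas"]):
    print(meta, doc[:120])
```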
- Getting Started - Installation and setup
- Configuration Reference - Complete config guide
- Models & Backends - Backend comparison
- Creating Datasets - Dataset format specs
- Tasks:
- Python 3.10+ (3.11 partially tested)
- Dependencies managed via uv or pip
- API keys for chosen backend(s)
- Optional: GPU for local model inference
ALUE is an open-science initiative, and community contributions help advance both ALUE and the use of LLMs on aerospace data. We welcome:
- New Tools: Specialized analysis functions and algorithms
- Datasets: Curated aerospace data and knowledge bases
- Software: Integration of existing LLM evaluation software packages
- Benchmarks: Evaluation datasets and performance metrics
- Misc: Tutorials, examples, and use cases
- Updates to existing tools: many current tools, benchmarks, and metrics are not optimized; fixes and replacements are welcome!
Check out the Contributing Guide for details on how to contribute to the ALUE project.
If you have a particular tool, database, or software package in mind that you want added, you can also submit it via this form, and the ALUE team may implement it depending on our resource constraints.
- Documentation: Full docs
Note: ALUE has been tested primarily on Python 3.10. Other versions may work but are not officially supported.
You can cite this work in either of the following ways:
Eugene Mangortey, Satyen Singh, Shuo Chen and Kunal Sarkhel. "Aviation Language Understanding Evaluation (ALUE) - Large Language Model Benchmark with Aviation Datasets," AIAA 2025-3247. AIAA AVIATION FORUM AND ASCEND 2025. July 2025.
It can also be cited using BibTeX.
@inbook{doi:10.2514/6.2025-3247,
author = {Eugene Mangortey and Satyen Singh and Shuo Chen and Kunal Sarkhel},
title = {Aviation Language Understanding Evaluation (ALUE) - Large Language Model Benchmark with Aviation Datasets},
booktitle = {AIAA AVIATION FORUM AND ASCEND 2025},
chapter = {},
pages = {},
doi = {10.2514/6.2025-3247},
URL = {https://arc.aiaa.org/doi/abs/10.2514/6.2025-3247},
eprint = {https://arc.aiaa.org/doi/pdf/10.2514/6.2025-3247},
abstract = { Large Language Models (LLMs) present revolutionary potential for the aviation industry, enabling stakeholders to derive critical intelligence and improve operational efficiencies through automation. However, given the safety-critical nature of aviation, a rigorous domain-specific evaluation of LLMs is paramount before their integration into workflows. General-purpose LLM benchmarks often do not capture the nuanced understanding of aerospace-specific knowledge and the phraseology required for reliable application. This paper introduces the Aerospace Language Understanding Evaluation (ALUE) benchmark, an aviation-specific framework designed for scalable evaluation, assessment, and benchmarking of LLMs against specialized aviation datasets and language tasks. ALUE incorporates diverse datasets and tasks, including binary and multiclass classification for hazard identification, extractive question answering for precise information retrieval (e.g., tail numbers, runways), sentiment analysis, and multiclass token classification for fine-grained analysis of air traffic control communications. ALUE also introduces several metrics for evaluating the correctness of generated responses utilizing LLMs to identify and judge claims made in generated responses. Our findings demonstrate that structured prompts and in-context examples significantly improve model performance, highlighting that general models struggle with aviation tasks without such guidance and often produce verbose or unstructured outputs. ALUE provides a crucial tool for guiding the development and safe deployment of LLMs tailored to the unique demands of the aviation and aerospace domains. }
}
If you'd like to cite this work using a citation manager such as Zotero, please see the CITATION.cff file for more information on citing this work. For more information on the CITATION.cff file format, please see What is a CITATION.cff file?
This is the copyright work of The MITRE Corporation, and was produced for the U. S. Government under Contract Number 693KA8-22-C-00001, and is subject to Federal Aviation Administration Acquisition Management System Clause 3.5-13, Rights In Data-General (Oct. 2014), Alt. III and Alt. IV (Jan. 2009). No other use other than that granted to the U. S. Government, or to those acting on behalf of the U. S. Government, under that Clause is authorized without the express written permission of The MITRE Corporation. For further information, please contact The MITRE Corporation, Contracts Management Office, 7515 Colshire Drive, McLean, VA 22102-7539, (703) 983-6000.
© 2025 The MITRE Corporation. All Rights Reserved.
Approved for Public Release, Distribution Unlimited. PRS Case 25-2078.