
How To Generate High-Quality Ground Truth at Enterprise Scale for QA Evaluation Using LLMs and AWS FMEval

1. Introduction

In the rapidly evolving landscape of artificial intelligence, the efficacy of any AI system, particularly Question Answering (QA) models, hinges critically on robust and accurate evaluation. At the heart of this evaluation lies high-fidelity ground truth data. Without reliable benchmarks, assessing model performance, identifying biases, and driving meaningful improvements become formidable challenges. Enterprises deploying AI at scale often grapple with the immense effort and cost associated with generating accurate QA datasets. This article will explore how the strategic integration of Large Language Models (LLMs) with AWS’s open-source FMEval library can significantly streamline, standardize, and scale the generation of high-quality ground truth for comprehensive QA evaluation.

2. The Role of Ground Truth in QA System Evaluation

Ground truth refers to the factual, verified data used as a benchmark to train and evaluate AI models. In the context of QA systems, ground truth typically comprises a set of questions, their corresponding correct answers, and often the context from which those answers can be derived. QA tasks can vary widely, from simple factual recall to complex multi-hop reasoning or summarization. Context relevance is paramount: an answer is only “correct” if it is directly supported by the provided context.

The impact of poor-quality ground truth on evaluation is profound. Inaccurate or inconsistent ground truth can lead to misleading performance metrics, causing developers to misinterpret model capabilities, make suboptimal fine-tuning decisions, and potentially deploy models that underperform in real-world scenarios. While human-annotated ground truth is often considered the gold standard due to its inherent accuracy and nuanced understanding, it is notoriously slow, expensive, and difficult to scale, especially for diverse enterprise datasets. Automated ground truth generation, particularly with the advent of powerful LLMs, offers a compelling alternative to overcome these scalability limitations while maintaining high quality.

3. Overview of FMEval

AWS FMEval is an open-source Python library designed to simplify and standardize the evaluation of Large Language Models (LLMs) and other foundation models. It provides a flexible framework for evaluating models against various metrics and datasets, making it an invaluable tool for MLOps engineers and data scientists. FMEval supports a range of metrics relevant to QA evaluation, allowing for a comprehensive assessment of model performance (a short sketch of how Exact Match and token-level F1 are computed follows the list):

  • Accuracy: The fraction of predictions judged correct against the ground truth.
  • F1 Score: A harmonic mean of precision and recall, particularly useful for tasks where partial matches or multiple correct answers are possible.
  • Exact Match (EM): A strict metric that requires the model’s answer to be identical to the ground truth answer.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics for evaluating summarization and machine translation, useful for generative QA.
  • Semantic Similarity: Measures how semantically similar the model’s answer is to the ground truth, often using embeddings.
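
To make the accuracy-style metrics concrete, below is a minimal sketch of SQuAD-style Exact Match and token-level F1. It is purely illustrative and is not FMEval’s internal implementation; the normalization rules (lowercasing, stripping punctuation and articles) are common conventions rather than something FMEval necessarily applies verbatim.

import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: partial overlap yields a non-zero F1 even when exact match fails.
print(exact_match("Seattle, WA", "Seattle"))  # 0.0
print(f1_score("Seattle, WA", "Seattle"))     # 0.666...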

Minimal Example to Run FMEval for QA

# A minimal sketch of the fmeval QAAccuracy workflow. Parameter and attribute
# names follow recent releases of the library and may vary slightly by version.
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy

# Describe the dataset: a JSONL file in which each line holds the question,
# the ground truth answer, and a precomputed model answer. The field names
# below are illustrative and must match the keys in your file.
config = DataConfig(
    dataset_name="enterprise_qa_ground_truth",
    dataset_uri="ground_truth.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",       # key holding the question
    target_output_location="answer",       # key holding the reference answer
    model_output_location="model_answer",  # key holding the model's prediction
)

# QAAccuracy computes exact match, quasi-exact match, and token-level F1.
eval_algo = QAAccuracy()

# Because the dataset already contains model outputs, no ModelRunner is passed.
results = eval_algo.evaluate(dataset_config=config, save=True)

# Print the aggregate evaluation results.
print(results)

4. High-Fidelity Ground Truth Generation: Best Practices

Generating high-fidelity ground truth, whether manually or automatically, requires adherence to best practices to ensure its quality and utility.

  • Annotation Guidelines: Clear, unambiguous guidelines are crucial for consistency, especially when multiple annotators (human or LLM-based) are involved. These guidelines should define what constitutes a correct answer, how to handle ambiguity, and the required format.
  • Contextual Diversity and Coverage: The dataset should cover a wide range of topics, question types, and contexts relevant to the enterprise’s domain. This ensures the QA model is evaluated across its intended operational scope.
  • Negative QA Examples for Robustness: Including questions for which the answer is not present in the provided context (unanswerable questions) is vital for training and evaluating a model’s ability to identify when it should abstain from answering, thereby improving its robustness and reducing hallucination.
  • Iterative Dataset Refinement: Ground truth generation is rarely a one-time process. It should be an iterative cycle involving creation, review, evaluation, and refinement based on model performance and evolving requirements.

Table: Ground Truth Components and Quality Criteria

| Component | Requirement                          | Notes                                     |
|-----------|--------------------------------------|-------------------------------------------|
| Question  | Clear, concise, relevant             | Avoid ambiguity; ensure natural language. |
| Answer    | Supported by context, factual        | Format consistency (e.g., exact span).    |
| Context   | Sufficient to derive answer          | Not overly broad; focused and relevant.   |
| Metadata  | Question type, difficulty, source ID | For stratified analysis and debugging.    |
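
To make these components concrete, here are two illustrative ground truth records in JSON Lines form. The field names follow the prompt schema introduced in Section 6 and are an assumption rather than a format mandated by FMEval; the second record is a negative (unanswerable) example, marked here with "type": "negative" and the answer "unanswerable". Span indices are character offsets into the context, with an exclusive end.

{"id": "qa-0001", "question": "When was the corporate data retention policy last updated?", "answer": "2022", "context": "The corporate data retention policy, last updated in 2022, requires ...", "context_id": "policy-doc-17-chunk-03", "span_start": 53, "span_end": 57, "type": "factual", "difficulty": "easy"}
{"id": "qa-0002", "question": "Who approved the corporate data retention policy?", "answer": "unanswerable", "context": "The corporate data retention policy, last updated in 2022, requires ...", "context_id": "policy-doc-17-chunk-03", "type": "negative", "difficulty": "medium"}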

5. Architecture: LLM-Powered Ground Truth Generation

The following architecture outlines a robust pipeline for generating high-quality ground truth at enterprise scale using LLMs and integrating with FMEval for evaluation.

Architecture Diagram

Explanation of Components:

  • Enterprise Data Sources: Diverse internal data repositories (databases, document management systems, knowledge bases).
  • Data Lake / S3: Centralized storage for raw and processed enterprise data.
  • Data Preprocessing / Chunking: Processes raw documents into manageable chunks suitable for LLM context windows, handling text extraction, cleaning, and formatting.
  • SageMaker JumpStart / Endpoint: Provides managed access to powerful foundation models (e.g., Anthropic Claude, Meta Llama, Amazon Titan) for inference, allowing for scalable and secure LLM integration.
  • Prompt Engineering: The art and science of crafting effective prompts to guide the LLM in generating desired QA pairs.
  • Powerful LLM: The chosen Large Language Model responsible for generating questions and answers from the provided context.
  • Raw Ground Truth JSONL: The initial output from the LLM, formatted as a JSON Lines file.
  • Human Review UI (e.g., Amazon SageMaker Ground Truth): An interface for human annotators to review, validate, and refine the LLM-generated ground truth, ensuring accuracy and addressing any LLM hallucinations or formatting errors.
  • FMEval Evaluation Engine: The core FMEval library, configured to run evaluations using the curated ground truth and model predictions.
  • QA Model Endpoint: The deployed QA model (e.g., fine-tuned BERT, RAG-based LLM) whose performance is being evaluated.
  • Evaluation Report: The output from FMEval, detailing metrics, errors, and insights into the QA model’s performance.

6. Automating Ground Truth with LLMs

Automating ground truth generation with LLMs involves several key steps:

  1. Document Ingestion: Pulling relevant documents from enterprise data sources like S3 buckets, internal databases, or content management systems.
  2. Preprocessing & Chunking: Large documents need to be broken down into smaller, coherent chunks that fit within the LLM’s context window. This might involve splitting by paragraphs, sections, or more advanced chunking strategies (a minimal chunking sketch follows this list).
  3. Prompt Engineering for Diverse QA Types: Crafting effective prompts is crucial. The prompt should instruct the LLM to generate various types of questions (factual, summary, comparison, etc.) and their corresponding answers, ensuring they are directly derivable from the provided context. It should also specify the desired output format.
  4. LLM Inference using SageMaker Endpoints: Deploying the chosen LLM (e.g., Claude, Llama, Titan) as a SageMaker endpoint allows for scalable and secure inference. Batches of chunked documents are sent to the endpoint, and the LLM generates QA pairs (see the invocation sketch after the example prompt below).
  5. Metadata Tagging for Dataset Enrichment: As the LLM generates QA pairs, additional metadata can be extracted or inferred (e.g., question type, difficulty, source document ID) to enrich the dataset for more granular analysis.
  6. JSONL Output Formatting for FMEval: The LLM’s output needs to be consistently formatted into JSON Lines (JSONL) files, where each line is a JSON object representing a single QA example, adhering to the structure expected by FMEval.
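
As referenced in step 2, below is a minimal paragraph-based chunking sketch. It is one simple strategy among many (token-aware or section-aware splitters are common alternatives); the 4,000-character budget is an arbitrary illustrative limit, not a property of any particular model.

def chunk_document(text, max_chars=4000, overlap_paragraphs=1):
    """Split a document into paragraph-aligned chunks no longer than max_chars,
    carrying a small paragraph overlap so answers near boundaries survive."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            # Start the next chunk with the trailing paragraphs of the previous one.
            current = current[-overlap_paragraphs:] + [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Example usage:
# chunks = chunk_document(open("policy-doc-17.txt", encoding="utf-8").read())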

Example Prompt for LLM

PROMPT = """
You are a QA dataset generator.
Input text: {context}
Generate:
1. 3-5 factual questions that can be directly answered from the text.
2. Their answers, which must be directly supported by the text.
3. The exact context span (start and end indices) within the input text where the answer is found.
4. The type of question (e.g., factual, summary, comparison, definition).

Format your output as a JSON array, where each object has the following keys:
"id": A unique identifier for the QA pair.
"question": The generated question.
"answer": The answer to the question.
"context": The original input text provided.
"context_id": An identifier for the original document chunk.
"span_start": The starting character index of the answer in the context.
"span_end": The ending character index of the answer in the context.
"type": The type of question.
"""

7. Post-Processing: Validation & Review

While LLMs are powerful, their outputs still require validation and review to ensure high quality.

  • Human-in-the-Loop Curation Strategies: A critical step involves human review of the LLM-generated ground truth. This can be done using services like Amazon SageMaker Ground Truth or custom UIs, where human annotators verify the accuracy, relevance, and formatting of the generated QA pairs. This human feedback loop is essential for catching hallucinations, correcting factual errors, and refining the LLM’s prompt engineering.
  • Quality Assurance Steps: Automated checks can be implemented to enforce format adherence and basic logical consistency:
    • Answer presence in context: Verify that the generated answer string is indeed present within the provided context (a sketch of this check follows the validator below).
    • Question clarity: While subjective, some linguistic checks can flag overly complex or ambiguous questions.
    • Format adherence: Ensure the JSONL output strictly follows the defined schema.

Python Snippet: Format Validator for FMEval

import json

def validate_jsonl_format(filepath):
    """
    Validates if a JSONL file adheres to the expected format for FMEval ground truth.
    Checks for the presence of 'question', 'answer', and 'context' keys in each line.
    """
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            line_num = 0
            for line in f:
                line_num += 1
                try:
                    item = json.loads(line.strip())
                    # Assert that essential keys are present
                    assert "question" in item, f"Missing 'question' key in line {line_num}"
                    assert "answer" in item, f"Missing 'answer' key in line {line_num}"
                    assert "context" in item, f"Missing 'context' key in line {line_num}"
                    # Optionally, add more specific checks, e.g., type checking
                    assert isinstance(item["question"], str), f"'question' is not a string in line {line_num}"
                    assert isinstance(item["answer"], str), f"'answer' is not a string in line {line_num}"
                    assert isinstance(item["context"], str), f"'context' is not a string in line {line_num}"
                except json.JSONDecodeError:
                    print(f"Error: Invalid JSON on line {line_num}: {line.strip()}")
                    return False
                except AssertionError as e:
                    print(f"Validation Error: {e} in file {filepath}")
                    return False
        print(f"Successfully validated format for {filepath}")
        return True
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred during validation: {e}")
        return False

# Example usage:
# if validate_jsonl_format("ground_truth.jsonl"):
#     print("Ground truth file is in valid format.")
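
The validator above only enforces the schema. Below is a sketch of the “answer presence in context” check from the quality-assurance list. The case-insensitive substring match is a simplification (a fuller pipeline might also verify span_start/span_end offsets), and the negative-example conventions ("type": "negative", answer "unanswerable") follow the illustrative records in Section 4 and are an assumption.

import json  # already imported above; repeated so this snippet stands alone

def answer_in_context(item):
    """Return True if the answer appears verbatim (case-insensitive) in the context,
    or if the item is an intentionally unanswerable (negative) example."""
    if item.get("type") == "negative" or item.get("answer", "").lower() == "unanswerable":
        return True
    return item["answer"].strip().lower() in item["context"].lower()

def filter_grounded_examples(filepath):
    """Yield only records whose answers are actually supported by their context."""
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            item = json.loads(line)
            if answer_in_context(item):
                yield item
            else:
                print(f"Dropping ungrounded example: {item.get('id', '<no id>')}")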

8. Running FMEval at Scale

Once the high-quality ground truth is generated and curated, FMEval can be leveraged to evaluate QA models efficiently and at scale.

  • How to configure FMEval against SageMaker-hosted QA models: FMEval can be configured to connect directly to SageMaker endpoints hosting your QA models, allowing for seamless integration into your MLOps pipeline. You provide the endpoint name, and FMEval handles the inference calls.
  • Auto-scaling evaluation jobs: For large datasets, FMEval can be integrated with AWS Batch or SageMaker Processing Jobs to auto-scale the evaluation workload, ensuring timely results without manual resource management (a Processing Job sketch follows the CLI wrapper example below).
  • Parsing evaluation reports: FMEval outputs detailed evaluation reports, typically in JSON format, which can be parsed and visualized to track model performance over time, compare different model versions, and identify areas for improvement.
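
For report parsing, a small sketch that flattens the EvalOutput objects returned by the Python API in Section 3 is shown below. Attribute names (eval_name, dataset_name, dataset_scores) follow recent fmeval releases and may differ slightly by version; the per-record results that fmeval writes to its output path are plain JSONL and can be loaded like any other JSONL file.

def summarize_eval(eval_outputs):
    """Flatten fmeval EvalOutput objects into simple rows for dashboards or tracking."""
    rows = []
    for output in eval_outputs:
        for score in output.dataset_scores:
            rows.append({
                "eval_name": output.eval_name,
                "dataset": output.dataset_name,
                "metric": score.name,
                "value": score.value,
            })
    return rows

# Example usage with the results object from Section 3:
# for row in summarize_eval(results):
#     print(f"{row['metric']}: {row['value']:.3f}")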

CLI Wrapper Example

# FMEval is invoked from Python rather than shipping a standalone CLI, so a thin
# wrapper script is a convenient way to expose the evaluation to CI/CD pipelines.
# The script name (run_fmeval.py) and its flags are hypothetical: the wrapper loads
# the ground truth and predictions files, runs the chosen fmeval algorithms, and
# writes the report to the given output path.
python run_fmeval.py \
--ground_truth_file data/gt.jsonl \
--predictions_file data/preds.jsonl \
--metrics exact_match f1_score semantic_similarity \
--output results/eval_report.json
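
As referenced above, a sketch of running that wrapper at scale as a SageMaker Processing Job follows, using the sagemaker Python SDK's ScriptProcessor. The image URI, IAM role, and S3 paths are placeholders, and the container is assumed to have fmeval and the wrapper's dependencies installed.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/fmeval-runner:latest",  # placeholder image with fmeval installed
    command=["python3"],
    role="arn:aws:iam::<account>:role/<processing-role>",  # placeholder IAM role
    instance_type="ml.m5.xlarge",
    instance_count=2,
)

processor.run(
    code="run_fmeval.py",  # the hypothetical wrapper script shown above
    inputs=[ProcessingInput(source="s3://my-bucket/eval-data/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/eval-reports/")],
    arguments=[
        "--ground_truth_file", "/opt/ml/processing/input/gt.jsonl",
        "--predictions_file", "/opt/ml/processing/input/preds.jsonl",
        "--metrics", "exact_match", "f1_score", "semantic_similarity",
        "--output", "/opt/ml/processing/output/eval_report.json",
    ],
)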

9. Best Practices for Iterative Feedback and Model Improvement

The process of ground truth generation and model evaluation should be iterative, forming a continuous feedback loop for model improvement.

  • Fine-tuning prompt templates based on human review: Insights from human review of LLM-generated ground truth are invaluable. If humans consistently correct certain types of errors, the LLM’s prompt can be refined to mitigate these issues in future generations.
  • Use FMEval metrics to guide QA model refinement: The detailed metrics from FMEval (e.g., low F1 score on specific question types, high EM errors for certain contexts) provide concrete signals for where the QA model needs improvement. This can guide targeted fine-tuning, retrieval augmentation strategies, or architectural changes.
  • Tracking versioned datasets and model comparison: Maintain strict version control for both your ground truth datasets and your QA models. This allows for reproducible evaluations and enables clear comparisons of model performance across different iterations and fine-tuning experiments.

10. Conclusion

In the enterprise AI landscape, the pursuit of high-quality ground truth for QA evaluation is not merely a best practice; it is a non-negotiable requirement for building robust, reliable, and performant AI systems. Combining powerful Large Language Models for scalable ground truth generation with AWS FMEval for standardized, comprehensive evaluation offers a transformative approach. This pipeline enables enterprises to overcome the traditional bottlenecks of manual annotation and fosters a more agile, data-driven approach to QA model development and deployment. By integrating LLM-powered ground truth generation with FMEval-driven evaluation, organizations can establish dependable internal benchmarks, accelerate model iteration, and confidently deploy high-performing QA solutions at scale.