Factual Correctness Evaluator
The Factual Correctness Evaluator assesses whether the information in a generated response is factually accurate with respect to the provided context or reference answer. It flags inconsistencies, fabrications, and hallucinations: claims in the response that contradict or are unsupported by the source material.

Factual Correctness Evaluator component interface and configuration
Evaluation Notice: Low factual correctness scores indicate potential hallucinations or misrepresentations in responses, which can damage user trust and may have serious consequences in critical applications.
Component Inputs
- Generated Output: The response generated by the RAG system that needs to be evaluated
Example: "SpaceX was founded by Elon Musk in 2002."
- Expected Output: The expected or reference response to compare against
Example: "SpaceX was founded in 2002 by Elon Musk with the goal to reduce space transportation costs."
- Evaluation Mode: The method used for evaluation (e.g., token-based, semantic, or hybrid)
Example: "Semantic"
- Atomicity: The level of granularity for fact-checking (sentence, claim, or entity level)
Example: "Claim-level"
- Coverage: Whether to evaluate all claims or only a subset
Example: "All claims"
Component Outputs
- Score: A numeric value between 0.0 and 1.0 indicating the degree of factual consistency (interpreted in the bands below)
Example: 0.95
- Evaluation Result: Qualitative explanation of the factual assessment, highlighting any contradictions found
Example: "The response is factually consistent with the provided context."
Score Interpretation
High Factual Consistency (0.7-1.0)
Response facts align closely with the information in the provided context
Example Score: 0.95
This indicates excellent factual alignment with the context
Moderate Factual Consistency (0.3-0.7)
Response contains some accurate information but may include minor factual errors or unsupported claims
Example Score: 0.55
This indicates partial factual alignment with notable discrepancies
Low Factual Consistency (0.0-0.3)
Response contains significant factual errors or contradictions to the provided context
Example Score: 0.15
This indicates substantial factual inaccuracies or hallucinations
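When acting on scores programmatically, the bands above can be encoded directly. The helper below is a hypothetical illustration of that mapping:
def interpret_factual_score(score: float) -> str:
    """Map a 0.0-1.0 factual correctness score to the bands described above."""
    if score >= 0.7:
        return "high factual consistency"
    if score >= 0.3:
        return "moderate factual consistency"
    return "low factual consistency"

print(interpret_factual_score(0.95))  # high factual consistency
print(interpret_factual_score(0.55))  # moderate factual consistency
print(interpret_factual_score(0.15))  # low factual consistency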
Implementation Example
A minimal sketch using the ragas FactualCorrectness metric; it assumes a ragas version with class-based metrics and an evaluator LLM available to the library (for example via OPENAI_API_KEY when using the ragas defaults):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import FactualCorrectness

# Create the metric
factual = FactualCorrectness()

# Build the evaluation dataset; "ground_truth" supplies the Expected Output
# that the generated answer is checked against (newer ragas versions may
# expect the user_input/response/reference column names instead)
eval_dataset = Dataset.from_dict({
    "question": ["Who founded SpaceX?"],
    "contexts": [["SpaceX was founded in 2002 by Elon Musk with the goal "
                  "to reduce space transportation costs."]],
    "answer": ["SpaceX was founded by Elon Musk in 2002."],
    "ground_truth": ["SpaceX was founded in 2002 by Elon Musk with the goal "
                     "to reduce space transportation costs."],
})

# Run the evaluation
result = evaluate(
    eval_dataset,
    metrics=[factual],
)
print(result)
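Continuing from the example above, per-sample scores can be pulled out of the result for use with the interpretation bands. This assumes the result object's to_pandas() helper; the exact metric column name can vary across ragas versions:
# Per-sample scores as a pandas DataFrame; filter by substring because the
# metric's column name differs between ragas versions.
df = result.to_pandas()
print(df.filter(like="factual").iloc[0])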
Use Cases
- Hallucination Detection: Identify when an AI generates information not supported by the provided context
- Content Verification: Ensure information in high-stakes domains like healthcare or legal advice is accurate
- Model Tuning: Guide fine-tuning of LLMs to improve their factual consistency when used in RAG systems
- Response Quality Control: Implement quality gates that prevent factually incorrect responses from reaching users (see the gating sketch after this list)
- Comparative Analysis: Compare different LLMs or RAG configurations for their factual accuracy
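A hypothetical quality gate built on the score output might look like this; the threshold and fallback message are placeholders to be tuned per application:
FACTUAL_THRESHOLD = 0.7  # assumption: set according to the application's risk profile

def passes_factual_gate(score: float, threshold: float = FACTUAL_THRESHOLD) -> bool:
    """Return True when the response is factually consistent enough to ship."""
    return score >= threshold

def deliver_or_fallback(response: str, score: float) -> str:
    # Fall back to a safe message instead of surfacing a likely hallucination.
    if passes_factual_gate(score):
        return response
    return "I could not verify that answer against the available sources."

print(deliver_or_fallback("SpaceX was founded by Elon Musk in 2002.", 0.95))
print(deliver_or_fallback("SpaceX was founded in 1999 by NASA.", 0.15))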
Best Practices
- Combine Factual Correctness with other metrics like Faithfulness for a comprehensive evaluation (see the combined example after this list)
- Establish minimum factual correctness thresholds based on your application's risk profile
- Implement fallback strategies for responses that fail to meet factual correctness standards
- Use factual correctness evaluation results to continuously improve your RAG system
- Consider domain-specific factual correctness evaluators for specialized applications
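As a sketch of the first practice, both metrics can be passed to a single ragas evaluate() call; the same assumptions as in the implementation example apply (class-based ragas metrics, an evaluator LLM, and legacy column names that newer versions may require renaming):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import FactualCorrectness, Faithfulness

eval_dataset = Dataset.from_dict({
    "question": ["Who founded SpaceX?"],
    "contexts": [["SpaceX was founded in 2002 by Elon Musk with the goal "
                  "to reduce space transportation costs."]],
    "answer": ["SpaceX was founded by Elon Musk in 2002."],
    "ground_truth": ["SpaceX was founded in 2002 by Elon Musk with the goal "
                     "to reduce space transportation costs."],
})

# Faithfulness checks the answer against the retrieved contexts, while
# FactualCorrectness checks it against the ground-truth reference.
result = evaluate(eval_dataset, metrics=[FactualCorrectness(), Faithfulness()])
print(result)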