Factual Correctness Evaluator
The Factual Correctness Evaluator assesses whether the information in a generated response is factually accurate with respect to the provided context or reference answer. It flags inconsistencies, fabrications, and hallucinations: claims in the response that contradict or are unsupported by the source material.

Factual Correctness Evaluator component interface and configuration
Evaluation Notice: Low factual correctness scores indicate potential hallucinations or misrepresentations in responses, which can damage user trust and may have serious consequences in critical applications.
Component Inputs
- Generated Output: The response generated by the RAG system that needs to be evaluated
Example: "SpaceX was founded by Elon Musk in 2002."
- Expected Output: The expected or reference response to compare against
Example: "SpaceX was founded in 2002 by Elon Musk with the goal to reduce space transportation costs."
- Evaluation Mode: The method used for evaluation (e.g., token-based, semantic, or hybrid)
Example: "Semantic"
- Atomicity: The level of granularity for fact-checking (sentence, claim, or entity level)
Example: "Claim-level"
- Coverage: Whether to evaluate all claims or only a subset
Example: "All claims"
Component Outputs
- Score: A numeric value between 0.0 and 1.0 indicating the degree of factual consistency (interpreted in the bands below)
Example: 0.95
- Evaluation Result: Qualitative explanation of the factual assessment, highlighting any contradictions found
Example: "The response is factually consistent with the provided context."
Score Interpretation
High Factual Consistency (0.7-1.0)
Response facts align closely with the information in the provided context
Example Score: 0.95
This indicates excellent factual alignment with the context
Moderate Factual Consistency (0.3-0.7)
Response contains some accurate information but may include minor factual errors or unsupported claims
Example Score: 0.55
This indicates partial factual alignment with notable discrepancies
Low Factual Consistency (0.0-0.3)
Response contains significant factual errors or contradictions to the provided context
Example Score: 0.15
This indicates substantial factual inaccuracies or hallucinations
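When acting on scores programmatically, the bands above can be encoded directly. The helper below is a hypothetical illustration of that mapping:
def interpret_factual_score(score: float) -> str:
    """Map a 0.0-1.0 factual correctness score to the bands described above."""
    if score >= 0.7:
        return "high factual consistency"
    if score >= 0.3:
        return "moderate factual consistency"
    return "low factual consistency"

print(interpret_factual_score(0.95))  # high factual consistency
print(interpret_factual_score(0.55))  # moderate factual consistency
print(interpret_factual_score(0.15))  # low factual consistency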
Implementation Example
A minimal sketch using the ragas FactualCorrectness metric; it assumes a ragas version with class-based metrics and an evaluator LLM available to the library (for example via OPENAI_API_KEY when using the ragas defaults):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import FactualCorrectness

# Create the metric
factual = FactualCorrectness()

# Build the evaluation dataset; "ground_truth" supplies the Expected Output
# that the generated answer is checked against (newer ragas versions may
# expect the user_input/response/reference column names instead)
eval_dataset = Dataset.from_dict({
    "question": ["Who founded SpaceX?"],
    "contexts": [["SpaceX was founded in 2002 by Elon Musk with the goal "
                  "to reduce space transportation costs."]],
    "answer": ["SpaceX was founded by Elon Musk in 2002."],
    "ground_truth": ["SpaceX was founded in 2002 by Elon Musk with the goal "
                     "to reduce space transportation costs."],
})

# Run the evaluation
result = evaluate(
    eval_dataset,
    metrics=[factual],
)
print(result)
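Continuing from the example above, per-sample scores can be pulled out of the result for use with the interpretation bands. This assumes the result object's to_pandas() helper; the exact metric column name can vary across ragas versions:
# Per-sample scores as a pandas DataFrame; filter by substring because the
# metric's column name differs between ragas versions.
df = result.to_pandas()
print(df.filter(like="factual").iloc[0])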
Use Cases
- Hallucination Detection: Identify when an AI generates information not supported by the provided context
- Content Verification: Ensure information in high-stakes domains like healthcare or legal advice is accurate
- Model Tuning: Guide fine-tuning of LLMs to improve their factual consistency when used in RAG systems
- Response Quality Control: Implement quality gates that prevent factually incorrect responses from reaching users (see the gating sketch after this list)
- Comparative Analysis: Compare different LLMs or RAG configurations for their factual accuracy
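A hypothetical quality gate built on the score output might look like this; the threshold and fallback message are placeholders to be tuned per application:
FACTUAL_THRESHOLD = 0.7  # assumption: set according to the application's risk profile

def passes_factual_gate(score: float, threshold: float = FACTUAL_THRESHOLD) -> bool:
    """Return True when the response is factually consistent enough to ship."""
    return score >= threshold

def deliver_or_fallback(response: str, score: float) -> str:
    # Fall back to a safe message instead of surfacing a likely hallucination.
    if passes_factual_gate(score):
        return response
    return "I could not verify that answer against the available sources."

print(deliver_or_fallback("SpaceX was founded by Elon Musk in 2002.", 0.95))
print(deliver_or_fallback("SpaceX was founded in 1999 by NASA.", 0.15))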
Best Practices
- Combine Factual Correctness with other metrics like Faithfulness for a comprehensive evaluation (see the combined example after this list)
- Establish minimum factual correctness thresholds based on your application's risk profile
- Implement fallback strategies for responses that fail to meet factual correctness standards
- Use factual correctness evaluation results to continuously improve your RAG system
- Consider domain-specific factual correctness evaluators for specialized applications
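As a sketch of the first practice, both metrics can be passed to a single ragas evaluate() call; the same assumptions as in the implementation example apply (class-based ragas metrics, an evaluator LLM, and legacy column names that newer versions may require renaming):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import FactualCorrectness, Faithfulness

eval_dataset = Dataset.from_dict({
    "question": ["Who founded SpaceX?"],
    "contexts": [["SpaceX was founded in 2002 by Elon Musk with the goal "
                  "to reduce space transportation costs."]],
    "answer": ["SpaceX was founded by Elon Musk in 2002."],
    "ground_truth": ["SpaceX was founded in 2002 by Elon Musk with the goal "
                     "to reduce space transportation costs."],
})

# Faithfulness checks the answer against the retrieved contexts, while
# FactualCorrectness checks it against the ground-truth reference.
result = evaluate(eval_dataset, metrics=[FactualCorrectness(), Faithfulness()])
print(result)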