
Response Context Precision Evaluator

The Response Context Precision Evaluator measures how well a generated response avoids incorporating information that wasn't present in the retrieved contexts. It helps detect hallucinations and made-up information in RAG system outputs.

Figure: Response Context Precision Evaluator component interface and configuration

Evaluation Notice: Low Response Context Precision scores indicate potential hallucinations or fabrications in the response. This is particularly concerning in applications where factual accuracy is critical, such as healthcare, legal, or financial domains.

Component Inputs

  • User Input: The original question or query from the user

    Example: "What is the Great Barrier Reef?"

  • Generated Output: The response generated by the RAG system

    Example: "The Great Barrier Reef is the world's largest coral reef system. It extends for 2,300 kilometers along Australia's Queensland coast and is visible from space."

  • Expected Output: The reference or ground truth response (if available)

    Example: "The Great Barrier Reef is the world's largest coral reef system, stretching 2,300 kilometers along the coast of Queensland, Australia."

  • Retrieved Contexts: The collection of retrieved passages or documents used to generate the response

    Example: ["The Great Barrier Reef is the world's largest coral reef system, stretching 2,300 kilometers along the coast of Queensland, Australia."]

Component Outputs

  • Evaluation Result: Qualitative assessment of the response's adherence to context information

    Example: "The response is mostly grounded in the provided contexts, but the claim about being visible from space is not supported by the retrieved information."

Score Interpretation

High Precision (0.7-1.0)

Response is well-grounded in the provided contexts with minimal to no hallucinations

Example Score: 0.95. This indicates an excellent response that sticks almost entirely to information provided in the contexts.

Moderate Precision (0.3-0.7)

Response contains some information from contexts but also includes notable unsupported claims

Example Score: 0.50. This indicates a response with significant hallucination issues that need to be addressed.

Low Precision (0.0-0.3)

Response is largely disconnected from the provided contexts

Example Score: 0.15. This indicates a response with severe hallucination issues; most of the information is fabricated.
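
These bands can also be applied programmatically when post-processing scores. The helper below is a minimal sketch that assumes the evaluator returns a float between 0.0 and 1.0; the cutoffs and labels simply restate the bands described in this section.

def precision_band(score: float) -> str:
    # Map a response context precision score to the bands described above.
    if score >= 0.7:
        return "high"      # well-grounded, minimal to no hallucinations
    if score >= 0.3:
        return "moderate"  # notable unsupported claims
    return "low"           # largely disconnected from the contexts

print(precision_band(0.95))  # high
print(precision_band(0.50))  # moderate
print(precision_band(0.15))  # low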

Implementation Example

from ragas.metrics import ResponseContextPrecision
from ragas import evaluate
from datasets import Dataset

# Create the metric
response_precision = ResponseContextPrecision()

# Use in evaluation
eval_dataset = Dataset.from_dict({
    "question": ["What is the Great Barrier Reef?"],
    "contexts": [[
        "The Great Barrier Reef is the world's largest coral reef system, "
        "stretching 2,300 kilometers along the coast of Queensland, Australia."
    ]],
    "answer": [
        "The Great Barrier Reef is the world's largest coral reef system. "
        "It extends for 2,300 kilometers along Australia's Queensland coast "
        "and is visible from space."
    ],
})

result = evaluate(
    eval_dataset,
    metrics=[response_precision],
)
print(result)
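
If evaluate() returns a standard ragas result object, it can usually be converted to a pandas DataFrame for per-sample inspection. The exact column names depend on the ragas version and metric, so treat the snippet below as an assumption to verify against your installation.

# Per-sample scores alongside the original dataset columns
# (column names may differ across ragas versions).
df = result.to_pandas()
print(df.head())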

Use Cases

  • Hallucination Detection: Identify when the LLM is generating information not found in the retrieved contexts
  • Model Tuning: Compare different prompt strategies or models to minimize hallucinations
  • Critical Application Safety: Ensure factual accuracy in domains where misinformation could have serious consequences
  • LLM Performance Analysis: Evaluate different large language models for their tendency to hallucinate
  • Response Filtering: Create automated filters to flag potentially fabricated responses for human review

Best Practices

  • Combine with Factual Correctness Evaluator to get a more comprehensive assessment of response accuracy
  • Implement tiered thresholds for different types of content (e.g., stricter for medical advice); a sketch of this appears after this list
  • Use the specific hallucinated statements to refine your prompting strategy
  • Consider the trade-off between strict precision and natural-sounding responses
  • For critical applications, implement a human review process for responses flagged with low precision
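
To illustrate the tiered-threshold and human-review practices above, the sketch below applies a stricter cutoff to medical content than to general content. The content types, threshold values, and routing logic are assumptions to be tuned for your application.

# Hypothetical tiered thresholds: stricter cutoffs for higher-risk content.
REVIEW_THRESHOLDS = {
    "medical": 0.9,   # assumption: near-perfect grounding required
    "legal": 0.85,
    "general": 0.7,
}

def needs_human_review(score: float, content_type: str) -> bool:
    # Flag the response for review when its precision score falls below
    # the threshold for its content type (unknown types fall back to "general").
    threshold = REVIEW_THRESHOLDS.get(content_type, REVIEW_THRESHOLDS["general"])
    return score < threshold

print(needs_human_review(0.82, "medical"))  # True  -> route to a reviewer
print(needs_human_review(0.82, "general"))  # False -> can pass through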