
Response Context Precision Evaluator

The Response Context Precision Evaluator measures how well a generated response avoids incorporating information that wasn't present in the retrieved contexts. It helps detect hallucinations and made-up information in RAG system outputs.

Figure: Response Context Precision Evaluator component interface and configuration

Evaluation Notice: Low Response Context Precision scores indicate potential hallucinations or fabrications in the response. This is particularly concerning in applications where factual accuracy is critical, such as healthcare, legal, or financial domains.

Component Inputs

  • User Input: The original question or query from the user

    Example: "What is the Great Barrier Reef?"

  • Generated Output: The response generated by the RAG system

    Example: "The Great Barrier Reef is the world's largest coral reef system. It extends for 2,300 kilometers along Australia's Queensland coast and is visible from space."

  • Expected Output: The reference or ground truth response (if available)

    Example: "The Great Barrier Reef is the world's largest coral reef system, stretching 2,300 kilometers along the coast of Queensland, Australia."

  • Retrieved Contexts: The collection of retrieved passages or documents used to generate the response

    Example: ["The Great Barrier Reef is the world's largest coral reef system, stretching 2,300 kilometers along the coast of Queensland, Australia."]

Component Outputs

  • Evaluation Result: Qualitative assessment of the response's adherence to context information

    Example: "The response is mostly grounded in the provided contexts, but the claim about being visible from space is not supported by the retrieved information."

Score Interpretation

High Precision (0.7-1.0)

Response is well-grounded in the provided contexts with minimal to no hallucinations

Example Score: 0.95. This indicates an excellent response that sticks almost entirely to information provided in the contexts.

Moderate Precision (0.3-0.7)

Response contains some information from contexts but also includes notable unsupported claims

Example Score: 0.50. This indicates a response with significant hallucination issues that need to be addressed.

Low Precision (0.0-0.3)

Response is largely disconnected from the provided contexts

Example Score: 0.15. This indicates a response with severe hallucination issues; most of the information is fabricated.
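
These bands can also be applied programmatically when post-processing scores. The helper below is a minimal sketch that assumes the evaluator returns a float between 0.0 and 1.0; the cutoffs and labels simply restate the bands described in this section.

def precision_band(score: float) -> str:
    # Map a response context precision score to the bands described above.
    if score >= 0.7:
        return "high"      # well-grounded, minimal to no hallucinations
    if score >= 0.3:
        return "moderate"  # notable unsupported claims
    return "low"           # largely disconnected from the contexts

print(precision_band(0.95))  # high
print(precision_band(0.50))  # moderate
print(precision_band(0.15))  # low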

Implementation Example

from ragas.metrics import ResponseContextPrecision
from ragas import evaluate
from datasets import Dataset

# Create the metric
response_precision = ResponseContextPrecision()

# Use in evaluation
eval_dataset = Dataset.from_dict({
    "question": ["What is the Great Barrier Reef?"],
    "contexts": [[
        "The Great Barrier Reef is the world's largest coral reef system, "
        "stretching 2,300 kilometers along the coast of Queensland, Australia."
    ]],
    "answer": [
        "The Great Barrier Reef is the world's largest coral reef system. "
        "It extends for 2,300 kilometers along Australia's Queensland coast "
        "and is visible from space."
    ],
})

result = evaluate(
    eval_dataset,
    metrics=[response_precision],
)
print(result)
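
If evaluate() returns a standard ragas result object, it can usually be converted to a pandas DataFrame for per-sample inspection. The exact column names depend on the ragas version and metric, so treat the snippet below as an assumption to verify against your installation.

# Per-sample scores alongside the original dataset columns
# (column names may differ across ragas versions).
df = result.to_pandas()
print(df.head())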

Use Cases

  • Hallucination Detection: Identify when the LLM is generating information not found in the retrieved contexts
  • Model Tuning: Compare different prompt strategies or models to minimize hallucinations
  • Critical Application Safety: Ensure factual accuracy in domains where misinformation could have serious consequences
  • LLM Performance Analysis: Evaluate different large language models for their tendency to hallucinate
  • Response Filtering: Create automated filters to flag potentially fabricated responses for human review

Best Practices

  • Combine with Factual Correctness Evaluator to get a more comprehensive assessment of response accuracy
  • Implement tiered thresholds for different types of content (e.g., stricter for medical advice); a sketch of this appears after this list
  • Use the specific hallucinated statements to refine your prompting strategy
  • Consider the trade-off between strict precision and natural-sounding responses
  • For critical applications, implement a human review process for responses flagged with low precision
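
To illustrate the tiered-threshold and human-review practices above, the sketch below applies a stricter cutoff to medical content than to general content. The content types, threshold values, and routing logic are assumptions to be tuned for your application.

# Hypothetical tiered thresholds: stricter cutoffs for higher-risk content.
REVIEW_THRESHOLDS = {
    "medical": 0.9,   # assumption: near-perfect grounding required
    "legal": 0.85,
    "general": 0.7,
}

def needs_human_review(score: float, content_type: str) -> bool:
    # Flag the response for review when its precision score falls below
    # the threshold for its content type (unknown types fall back to "general").
    threshold = REVIEW_THRESHOLDS.get(content_type, REVIEW_THRESHOLDS["general"])
    return score < threshold

print(needs_human_review(0.82, "medical"))  # True  -> route to a reviewer
print(needs_human_review(0.82, "general"))  # False -> can pass through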