Faithfulness Evaluator

The Faithfulness Evaluator assesses whether the generated response contains statements that contradict or misrepresent information in the retrieved contexts. It detects cases where a response is factually incorrect according to the provided source material.

[Figure: Faithfulness Evaluator component interface and configuration]

Evaluation Notice: Low faithfulness scores indicate misrepresentations or contradictions in the generated response. This can damage user trust and lead to the spread of inaccurate information.

Component Inputs

  • User Input: The question or prompt from the user

    Example: "What is photosynthesis?"

  • Generated Output: The response generated by the RAG system

    Example: "Photosynthesis is a biological process where plants convert light energy into chemical energy. During this process, plants take in water, carbon dioxide, and minerals, and produce oxygen and energy-rich compounds."

  • Retrieved Contexts: The collection of retrieved passages or documents used to generate the response

    Example: ["Photosynthesis is the process by which green plants and certain other organisms transform light energy into chemical energy. During photosynthesis in green plants, light energy is captured and used to convert water, carbon dioxide, and minerals into oxygen and energy-rich organic compounds."]

  • Handling unknown answers: Configuration for how to handle information not present in the contexts

    Example: "Strict - Only allow information explicitly stated in contexts"
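The four inputs above can be pictured as one record per evaluated response. The sketch below is illustrative only: `FaithfulnessInput`, its field names, and `to_dataset_columns` are hypothetical names chosen for this example, not part of the component's actual interface. The `question` / `contexts` / `answer` column names are the ones used in the Implementation Example on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaithfulnessInput:
    """Hypothetical container mirroring the component inputs listed above."""

    user_input: str
    generated_output: str
    retrieved_contexts: List[str] = field(default_factory=list)
    # Assumption: a free-form label; the real component may offer fixed options.
    unknown_answer_handling: str = "Strict"

    def validate(self) -> None:
        """Reject records that cannot be meaningfully evaluated."""
        if not self.user_input.strip():
            raise ValueError("user_input must not be empty")
        if not self.generated_output.strip():
            raise ValueError("generated_output must not be empty")
        if not self.retrieved_contexts:
            raise ValueError("at least one retrieved context is required")


def to_dataset_columns(records: List[FaithfulnessInput]) -> Dict[str, list]:
    """Convert records into column-oriented data (one list per column)."""
    for record in records:
        record.validate()
    return {
        "question": [r.user_input for r in records],
        "contexts": [r.retrieved_contexts for r in records],
        "answer": [r.generated_output for r in records],
    }
```

Validating before conversion keeps empty context lists out of the evaluation, where a faithfulness judgment would have nothing to compare against.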

Component Outputs

  • Evaluation Result: Qualitative assessment of the response's faithfulness to the contexts

    Example: "The response accurately represents the information in the context about photosynthesis."

Score Interpretation

High Faithfulness (0.7-1.0)

Response accurately represents information from the contexts, with minimal to no contradictions.

Example Score: 0.95. This indicates a highly faithful response whose statements closely align with the context information.

Moderate Faithfulness (0.3-0.7)

Response contains some accurate information but includes notable misrepresentations or contradictions.

Example Score: 0.55. This indicates partial faithfulness with some problematic contradictions.

Low Faithfulness (0.0-0.3)

Response significantly misrepresents or contradicts the information in the contexts.

Example Score: 0.15. This indicates a response with substantial contradictions to the source material.
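The three bands can be expressed as a small lookup. This is a minimal sketch, not the component's own logic: `interpret_score` is a hypothetical helper, and because the listed ranges share their endpoints (0.3 and 0.7), the sketch assumes a boundary score belongs to the higher band.

```python
def interpret_score(score: float) -> str:
    """Map a faithfulness score in [0.0, 1.0] to the band names used above.

    Assumption: scores exactly at 0.3 or 0.7 are placed in the higher band,
    since the documented ranges overlap at those values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score!r}")
    if score >= 0.7:
        return "High Faithfulness"
    if score >= 0.3:
        return "Moderate Faithfulness"
    return "Low Faithfulness"


# The example scores from this section:
for example in (0.95, 0.55, 0.15):
    print(example, "->", interpret_score(example))
```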

Implementation Example

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import Faithfulness

    # Create the metric
    faithfulness = Faithfulness()

    # Build a one-row dataset: the question, the retrieved contexts, and the generated answer
    eval_dataset = Dataset.from_dict({
        "question": ["What is photosynthesis?"],
        "contexts": [[
            "Photosynthesis is the process by which green plants and certain other "
            "organisms transform light energy into chemical energy. During "
            "photosynthesis in green plants, light energy is captured and used to "
            "convert water, carbon dioxide, and minerals into oxygen and energy-rich "
            "organic compounds."
        ]],
        "answer": [
            "Photosynthesis is a biological process where plants convert light energy "
            "into chemical energy. During this process, plants take in water, carbon "
            "dioxide, and minerals, and produce oxygen and energy-rich compounds."
        ],
    })

    # Use in evaluation. The metric is LLM-based, so a configured LLM
    # (for example, an API key in the environment) is needed for this call.
    result = evaluate(eval_dataset, metrics=[faithfulness])
    print(result)

Use Cases

  • Factual Verification: Validate that responses don't contradict or misrepresent the source material
  • Model Comparison: Compare different LLMs for their tendencies to stay faithful to provided contexts
  • Critical Domain Applications: Ensure accuracy in high-stakes domains like healthcare, legal, and financial advice
  • Prompt Engineering: Improve prompting strategies to enhance the faithfulness of generated responses
  • Response Filtering: Implement quality gates to block responses that misrepresent source information
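For the Model Comparison use case, one simple approach is to average faithfulness scores per model over the same set of questions. The sketch below assumes the scores have already been produced by the evaluator; `mean_faithfulness_by_model`, `rank_models`, and the model names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_faithfulness_by_model(
    scored: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Average the faithfulness scores recorded for each model name."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for model_name, score in scored:
        grouped[model_name].append(score)
    return {name: mean(values) for name, values in grouped.items()}


def rank_models(scored: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return (model, mean score) pairs, most faithful first."""
    averages = mean_faithfulness_by_model(scored)
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for two models answering the same questions.
scores = [
    ("model-a", 1.0), ("model-a", 0.5),
    ("model-b", 0.5), ("model-b", 0.25),
]
print(rank_models(scores))
```

Comparing averages is only meaningful when every model is evaluated on the same questions and retrieved contexts; otherwise differences in the data can mask differences between the models.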

Best Practices

  • Use Faithfulness in combination with Response Context Precision for a comprehensive assessment
  • Analyze contradictions to identify patterns in how the LLM misinterprets or misrepresents information
  • Consider implementing additional verification steps for responses with moderate faithfulness scores
  • Set domain-specific faithfulness thresholds based on the criticality of accurate information
  • Test faithfulness across various types of content (e.g., factual, numerical, procedural) to identify weak spots
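Domain-specific thresholds and the extra verification step for moderate scores could be combined into a simple routing rule. The following is a sketch under stated assumptions: `DOMAIN_THRESHOLDS` and `route_response` are hypothetical, the "general" values reuse the band boundaries from Score Interpretation, and the "healthcare" values are made-up placeholders that a real deployment would need to choose for itself.

```python
from typing import Dict, Tuple

# (accept_at_or_above, block_below) per domain.
# Assumption: illustrative values only, not recommendations.
DOMAIN_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "general": (0.7, 0.3),
    "healthcare": (0.9, 0.7),
}


def route_response(score: float, domain: str = "general") -> str:
    """Decide what to do with a response given its faithfulness score.

    Returns "accept", "review" (send to an additional verification step),
    or "block". Unknown domains fall back to the "general" thresholds.
    """
    accept_at, block_below = DOMAIN_THRESHOLDS.get(
        domain, DOMAIN_THRESHOLDS["general"]
    )
    if score >= accept_at:
        return "accept"
    if score < block_below:
        return "block"
    return "review"


print(route_response(0.8, "general"))     # above the general accept threshold
print(route_response(0.8, "healthcare"))  # same score, stricter domain
```

Keeping the thresholds in a single table makes it straightforward to tighten them for a high-stakes domain without touching the routing logic.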