
Context F1 Evaluator

The Context F1 Evaluator combines Context Precision and Context Recall into a single balanced metric. It provides a comprehensive assessment of how well the retrieval system balances retrieving all necessary information (recall) against avoiding irrelevant information (precision).

Context F1 Evaluator Component

Image: Context F1 Evaluator component interface and configuration

Evaluation Notice: The F1 score gives precision and recall equal weight. Depending on your application, you may want to prioritize one over the other using alternative metrics (such as F-beta) or custom weights; see Best Practices below.

Component Inputs

  • Retrieved Contexts: The collection of retrieved passages or documents to be evaluated

    Example: ["The 2008 financial crisis was triggered by the subprime mortgage crisis.", "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis."]

  • Reference Contexts: The ground truth contexts that should have been retrieved

    Example: ["The 2008 financial crisis was triggered by the subprime mortgage crisis.", "The housing bubble burst contributed to the financial crisis."]

  • Distance Measure: The method used to calculate similarity between contexts

    Example: "cosine" (for cosine similarity) or "euclidean" (for euclidean distance)

Component Outputs

  • F1 Score: A numerical value between 0 and 1, representing the harmonic mean of precision and recall

    Example: 0.75 (indicating a good balance between precision and recall)

  • Precision Score: The proportion of retrieved contexts that are relevant

    Example: 0.80 (indicating 80% of retrieved contexts are relevant)

  • Recall Score: The proportion of necessary information that was retrieved

    Example: 0.70 (indicating 70% of necessary information was retrieved)

  • Evaluation Result: Qualitative assessment of the balance between precision and recall

    Example: "The retrieval system shows good precision (most retrieved contexts are relevant) but could improve recall (some necessary information is missing)."

Score Interpretation

High F1 Score (0.7-1.0)

Excellent balance between precision and recall, with both scores typically high

Example Score: 0.85. This indicates a retrieval system that surfaces most of the necessary information while avoiding irrelevant content.

Moderate F1 Score (0.3-0.7)

Decent balance, but with room for improvement in either precision, recall, or both

Example Score: 0.55. This may indicate a system with high precision but low recall, high recall but low precision, or moderate values for both.

Low F1 Score (0.0-0.3)

Poor performance in either precision, recall, or both

Example Score: 0.15. This indicates significant issues with the retrieval system's effectiveness.
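
If you want to report these bands programmatically, a small helper that mirrors the thresholds above is enough; the label wording here is illustrative, not part of the component's output.

def interpret_f1(score: float) -> str:
    """Map an F1 score to the qualitative bands described above."""
    if score >= 0.7:
        return "high: excellent balance between precision and recall"
    if score >= 0.3:
        return "moderate: room for improvement in precision, recall, or both"
    return "low: poor precision, recall, or both"

print(interpret_f1(0.85))  # high
print(interpret_f1(0.55))  # moderate
print(interpret_f1(0.15))  # low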

Implementation Example

from ragas.metrics import ContextPrecision, ContextRecall
from datasets import Dataset
from ragas import evaluate

# Create the individual metrics
context_precision = ContextPrecision()
context_recall = ContextRecall()

# Use in evaluation
eval_dataset = Dataset.from_dict({
    "retrieved_contexts": [[
        "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
        "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
        "A discussion of economic trends in the 1990s that are unrelated to the 2008 crisis.",
    ]],
    "reference_contexts": [[
        "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
        "The housing bubble burst contributed to the financial crisis.",
        "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
    ]],
})

# Evaluate both metrics
result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall]
)

# Calculate F1 manually
precision = result["context_precision"]
recall = result["context_recall"]
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1}")

Use Cases

  • Retrieval System Comparison: Compare different retrieval methods using a single balanced metric
  • Balanced Optimization: Optimize retrieval systems without sacrificing either precision or recall
  • RAG Pipeline Evaluation: Assess the overall effectiveness of the retrieval component in a RAG system
  • Retrieval Parameter Tuning: Fine-tune parameters like the number of retrieved documents to maximize F1 (see the sketch after this list)
  • Holistic Improvement Tracking: Track the overall performance improvement of retrieval systems over time
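
For the parameter-tuning use case, a common pattern is to sweep the number of retrieved documents (k) and keep the value that maximizes F1. The sketch below uses a toy word-overlap retriever and exact string matching as placeholders; in practice you would plug in your own retriever and the component's distance measure.

# Toy corpus and retriever: rank documents by word overlap with the query.
# Both the corpus and the scoring are placeholders for a real retrieval system.
CORPUS = [
    "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
    "The housing bubble burst contributed to the financial crisis.",
    "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
    "A discussion of economic trends in the 1990s that are unrelated to the 2008 crisis.",
]

def retrieve(query: str, k: int) -> list[str]:
    query_words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda doc: len(query_words & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def f1_at_k(query: str, reference_contexts: list[str], k: int) -> float:
    retrieved = retrieve(query, k)
    # Exact string matching is a simplification; a real setup would reuse the
    # component's distance measure (e.g. cosine similarity).
    relevant = [c for c in retrieved if c in reference_contexts]
    precision = len(relevant) / len(retrieved) if retrieved else 0.0
    recall = len(set(relevant)) / len(reference_contexts) if reference_contexts else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

query = "What caused the 2008 financial crisis?"
reference_contexts = CORPUS[:3]  # pretend the first three documents are the ground truth
best_k = max(range(1, len(CORPUS) + 1), key=lambda k: f1_at_k(query, reference_contexts, k))
print(f"best k = {best_k}, F1 = {f1_at_k(query, reference_contexts, best_k):.2f}")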

Best Practices

  • Consider whether precision or recall is more important for your specific application
  • Use weighted F1 scores (F-beta) if precision and recall have different priorities (see the example after this list)
  • Examine the individual precision and recall scores to identify specific areas for improvement
  • Segment evaluation by query types to identify where your retrieval system performs best and worst
  • Combine F1 evaluation with qualitative analysis of the most common retrieval failures
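
For the F-beta point above, the general weighted formula is (1 + beta^2) * P * R / (beta^2 * P + R): beta > 1 favors recall, beta < 1 favors precision, and beta = 1 reduces to the standard F1. A minimal sketch using the example scores from earlier on this page:

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.80, 0.70
print(f_beta(precision, recall))            # beta = 1 reduces to the standard F1
print(f_beta(precision, recall, beta=2.0))  # recall-weighted (F2)
print(f_beta(precision, recall, beta=0.5))  # precision-weighted (F0.5)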