
Context F1 Evaluator

The Context F1 Evaluator combines Context Precision and Context Recall into a single balanced metric. It provides a comprehensive assessment of how well the retrieval system balances retrieving all necessary information (recall) against avoiding irrelevant information (precision).

Context F1 Evaluator Component

Image: Context F1 Evaluator component interface and configuration

Evaluation Notice: The F1 score gives precision and recall equal weight. Depending on your application, you may want to prioritize one over the other using alternative metrics (such as F-beta) or custom weights; see Best Practices below.

Component Inputs

  • Retrieved Contexts: The collection of retrieved passages or documents to be evaluated

    Example: ["The 2008 financial crisis was triggered by the subprime mortgage crisis.", "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis."]

  • Reference Contexts: The ground truth contexts that should have been retrieved

    Example: ["The 2008 financial crisis was triggered by the subprime mortgage crisis.", "The housing bubble burst contributed to the financial crisis."]

  • Distance Measure: The method used to calculate similarity between contexts

    Example: "cosine" (for cosine similarity) or "euclidean" (for euclidean distance)

Component Outputs

  • F1 Score: A numerical value between 0 and 1, representing the harmonic mean of precision and recall

    Example: 0.75 (indicating a good balance between precision and recall)

  • Precision Score: The proportion of retrieved contexts that are relevant

    Example: 0.80 (indicating 80% of retrieved contexts are relevant)

  • Recall Score: The proportion of necessary information that was retrieved

    Example: 0.70 (indicating 70% of necessary information was retrieved)

  • Evaluation Result: Qualitative assessment of the balance between precision and recall

    Example: "The retrieval system shows good precision (most retrieved contexts are relevant) but could improve recall (some necessary information is missing)."

Score Interpretation

High F1 Score (0.7-1.0)

Excellent balance between precision and recall, with both scores typically high

Example Score: 0.85. This indicates a retrieval system that surfaces most of the necessary information while avoiding irrelevant content.

Moderate F1 Score (0.3-0.7)

Decent balance, but with room for improvement in either precision, recall, or both

Example Score: 0.55. This may indicate a system with high precision but low recall, high recall but low precision, or moderate values for both.

Low F1 Score (0.0-0.3)

Poor performance in either precision, recall, or both

Example Score: 0.15. This indicates significant issues with the retrieval system's effectiveness.
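
If you want to report these bands programmatically, a small helper that mirrors the thresholds above is enough; the label wording here is illustrative, not part of the component's output.

def interpret_f1(score: float) -> str:
    """Map an F1 score to the qualitative bands described above."""
    if score >= 0.7:
        return "high: excellent balance between precision and recall"
    if score >= 0.3:
        return "moderate: room for improvement in precision, recall, or both"
    return "low: poor precision, recall, or both"

print(interpret_f1(0.85))  # high
print(interpret_f1(0.55))  # moderate
print(interpret_f1(0.15))  # low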

Implementation Example

from ragas.metrics import ContextPrecision, ContextRecall
from datasets import Dataset
from ragas import evaluate

# Create the individual metrics
context_precision = ContextPrecision()
context_recall = ContextRecall()

# Use in evaluation
eval_dataset = Dataset.from_dict({
    "retrieved_contexts": [[
        "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
        "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
        "A discussion of economic trends in the 1990s that are unrelated to the 2008 crisis.",
    ]],
    "reference_contexts": [[
        "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
        "The housing bubble burst contributed to the financial crisis.",
        "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
    ]],
})

# Evaluate both metrics
result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall]
)

# Calculate F1 manually
precision = result["context_precision"]
recall = result["context_recall"]
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1}")

Use Cases

  • Retrieval System Comparison: Compare different retrieval methods using a single balanced metric
  • Balanced Optimization: Optimize retrieval systems without sacrificing either precision or recall
  • RAG Pipeline Evaluation: Assess the overall effectiveness of the retrieval component in a RAG system
  • Retrieval Parameter Tuning: Fine-tune parameters like the number of retrieved documents to maximize F1 (see the sketch after this list)
  • Holistic Improvement Tracking: Track the overall performance improvement of retrieval systems over time
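
For the parameter-tuning use case, a common pattern is to sweep the number of retrieved documents (k) and keep the value that maximizes F1. The sketch below uses a toy word-overlap retriever and exact string matching as placeholders; in practice you would plug in your own retriever and the component's distance measure.

# Toy corpus and retriever: rank documents by word overlap with the query.
# Both the corpus and the scoring are placeholders for a real retrieval system.
CORPUS = [
    "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
    "The housing bubble burst contributed to the financial crisis.",
    "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
    "A discussion of economic trends in the 1990s that are unrelated to the 2008 crisis.",
]

def retrieve(query: str, k: int) -> list[str]:
    query_words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda doc: len(query_words & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def f1_at_k(query: str, reference_contexts: list[str], k: int) -> float:
    retrieved = retrieve(query, k)
    # Exact string matching is a simplification; a real setup would reuse the
    # component's distance measure (e.g. cosine similarity).
    relevant = [c for c in retrieved if c in reference_contexts]
    precision = len(relevant) / len(retrieved) if retrieved else 0.0
    recall = len(set(relevant)) / len(reference_contexts) if reference_contexts else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

query = "What caused the 2008 financial crisis?"
reference_contexts = CORPUS[:3]  # pretend the first three documents are the ground truth
best_k = max(range(1, len(CORPUS) + 1), key=lambda k: f1_at_k(query, reference_contexts, k))
print(f"best k = {best_k}, F1 = {f1_at_k(query, reference_contexts, best_k):.2f}")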

Best Practices

  • Consider whether precision or recall is more important for your specific application
  • Use weighted F1 scores (F-beta) if precision and recall have different priorities (see the example after this list)
  • Examine the individual precision and recall scores to identify specific areas for improvement
  • Segment evaluation by query types to identify where your retrieval system performs best and worst
  • Combine F1 evaluation with qualitative analysis of the most common retrieval failures
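
For the F-beta point above, the general weighted formula is (1 + beta^2) * P * R / (beta^2 * P + R): beta > 1 favors recall, beta < 1 favors precision, and beta = 1 reduces to the standard F1. A minimal sketch using the example scores from earlier on this page:

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.80, 0.70
print(f_beta(precision, recall))            # beta = 1 reduces to the standard F1
print(f_beta(precision, recall, beta=2.0))  # recall-weighted (F2)
print(f_beta(precision, recall, beta=0.5))  # precision-weighted (F0.5)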