Context F1 Evaluator
The Context F1 Evaluator combines Context Precision and Context Recall into a single balanced metric. It assesses how well the retrieval system balances retrieving all necessary information (recall) against avoiding irrelevant information (precision).

Context F1 Evaluator component interface and configuration
Evaluation Notice: The F1 score considers both precision and recall equally important. However, depending on your application, you might want to prioritize one over the other using alternative metrics or custom weights.
Component Inputs
- Retrieved Contexts: The collection of retrieved passages or documents to be evaluated
Example: ["The 2008 financial crisis was triggered by the subprime mortgage crisis.", "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis."]
- Reference Contexts: The ground truth contexts that should have been retrieved
Example: ["The 2008 financial crisis was triggered by the subprime mortgage crisis.", "The housing bubble burst contributed to the financial crisis."]
- Distance Measure: The method used to calculate similarity between contexts
Example: "cosine" (for cosine similarity) or "euclidean" (for euclidean distance)
Component Outputs
- F1 Score: A numerical value between 0 and 1, representing the harmonic mean of precision and recall
Example: 0.75 (indicating a good balance between precision and recall)
- Precision Score: The proportion of retrieved contexts that are relevant
Example: 0.80 (indicating 80% of retrieved contexts are relevant)
- Recall Score: The proportion of necessary information that was retrieved
Example: 0.70 (indicating 70% of necessary information was retrieved)
- Evaluation Result: Qualitative assessment of the balance between precision and recall
Example: "The retrieval system shows good precision (most retrieved contexts are relevant) but could improve recall (some necessary information is missing)."
Score Interpretation
High F1 Score (0.7-1.0)
Excellent balance between precision and recall, with both scores typically high
Example Score: 0.85
This indicates a retrieval system that surfaces most of the necessary information while avoiding irrelevant content
Moderate F1 Score (0.3-0.7)
Decent balance, but with room for improvement in either precision, recall, or both
Example Score: 0.55
This may indicate a system that has high precision but low recall, high recall but low precision, or moderate values for both
Low F1 Score (0.0-0.3)
Poor performance in either precision, recall, or both
Example Score: 0.15
This indicates significant issues with the retrieval system's effectiveness
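If you want to report these qualitative bands programmatically, a minimal helper (hypothetical, not part of the evaluator's API) could mirror the thresholds above:

def interpret_f1(score: float) -> str:
    # Thresholds mirror the bands documented above
    if score >= 0.7:
        return "high: excellent balance between precision and recall"
    if score >= 0.3:
        return "moderate: room for improvement in precision, recall, or both"
    return "low: significant issues with retrieval effectiveness"

print(interpret_f1(0.85))  # high
print(interpret_f1(0.15))  # low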
Implementation Example
from ragas.metrics import ContextPrecision, ContextRecall
# Create the individual metrics
context_precision = ContextPrecision()
context_recall = ContextRecall()
# Use in evaluation
from datasets import Dataset
from ragas import evaluate
eval_dataset = Dataset.from_dict({
    "retrieved_contexts": [
        [
            "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
            "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
            "A discussion of economic trends in the 1990s that are unrelated to the 2008 crisis.",
        ]
    ],
    "reference_contexts": [
        [
            "The 2008 financial crisis was triggered by the subprime mortgage crisis.",
            "The housing bubble burst contributed to the financial crisis.",
            "Banking deregulation and excessive risk-taking by financial institutions played a role in the crisis.",
        ]
    ],
})
# Evaluate both metrics
result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall],
)
# Calculate F1 manually
precision = result["context_precision"]
recall = result["context_recall"]
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1}")
Use Cases
- Retrieval System Comparison: Compare different retrieval methods using a single balanced metric
- Balanced Optimization: Optimize retrieval systems without sacrificing either precision or recall
- RAG Pipeline Evaluation: Assess the overall effectiveness of the retrieval component in a RAG system
- Retrieval Parameter Tuning: Fine-tune parameters like the number of retrieved documents to maximize F1
- Holistic Improvement Tracking: Track the overall performance improvement of retrieval systems over time
Best Practices
- Consider whether precision or recall is more important for your specific application
- Use weighted F1 scores (F-beta) if precision and recall have different priorities (see the sketch after this list)
- Examine the individual precision and recall scores to identify specific areas for improvement
- Segment evaluation by query types to identify where your retrieval system performs best and worst
- Combine F1 evaluation with qualitative analysis of the most common retrieval failures
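When precision and recall carry different priorities, the F-beta score generalizes F1: beta > 1 favors recall, beta < 1 favors precision. A minimal sketch of the standard formula, using the example scores from earlier in this page:

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    # beta > 1 weights recall more heavily; beta < 1 weights precision more heavily
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.80, 0.70
print(round(f_beta(precision, recall, beta=1.0), 2))  # 0.75 -- standard F1
print(round(f_beta(precision, recall, beta=2.0), 2))  # favors recall
print(round(f_beta(precision, recall, beta=0.5), 2))  # favors precision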