Response Relevancy Evaluator
The Response Relevancy Evaluator assesses how well a generated response addresses the user's original query. It helps identify responses that may be factually correct but fail to answer what the user actually asked.

Figure: Response Relevancy Evaluator component interface and configuration
Evaluation Notice: Low relevancy scores indicate that responses are not addressing what users are asking for, which can lead to poor user experience and frustration regardless of factual correctness.
Component Inputs
- Prompt / User Input: The original query or question posed by the user
Example: "What are the health benefits of regular exercise?"
- Generated Output: The response generated by the RAG system
Example: "Regular exercise provides numerous health benefits, including improved cardiovascular health, better weight management, enhanced mental wellbeing, stronger muscles and bones, reduced risk of chronic diseases, and improved sleep quality."
Component Outputs
- Evaluation Result: Qualitative assessment of the response's relevance to the original query
Example: "The response directly addresses the health benefits of regular exercise as requested in the query."
Score Interpretation
High Relevance (0.7-1.0)
Response directly addresses the query and provides the information the user was seeking
Example Score: 0.95
This indicates an excellent response that precisely answers what was asked
Moderate Relevance (0.3-0.7)
Response partially addresses the query but may include tangential information or miss some aspects
Example Score: 0.50
This indicates a response that addresses the query topic but may not fully answer what was asked
Low Relevance (0.0-0.3)
Response fails to address the query or provides information on a different topic
Example Score: 0.15
This indicates a response that does not answer the question asked
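For quick triage, a small helper can map a numeric score onto the bands above. This is a convenience sketch for this documentation, not part of the ragas API; the thresholds simply mirror the ranges listed in this section.

def relevance_band(score: float) -> str:
    """Map a response relevancy score (0.0-1.0) to the bands described above."""
    if score >= 0.7:
        return "high"       # directly addresses the query
    if score >= 0.3:
        return "moderate"   # partially addresses the query or drifts off-topic
    return "low"            # fails to address the query

print(relevance_band(0.95))  # high
print(relevance_band(0.50))  # moderate
print(relevance_band(0.15))  # low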
Implementation Example
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import ResponseRelevancy

# Create the metric
response_relevancy = ResponseRelevancy()

# Build a one-row evaluation dataset.
# Note: newer ragas releases may expect the columns user_input / retrieved_contexts / response
# instead of question / contexts / answer.
eval_dataset = Dataset.from_dict({
    "question": ["What are the health benefits of regular exercise?"],
    "contexts": [[
        "Regular exercise improves cardiovascular health, helps with weight "
        "management, boosts mental health, strengthens muscles and bones, "
        "reduces risk of chronic diseases, and improves sleep quality."
    ]],
    "answer": [
        "Regular exercise provides numerous health benefits, including improved "
        "cardiovascular health, better weight management, enhanced mental wellbeing, "
        "stronger muscles and bones, reduced risk of chronic diseases, and improved "
        "sleep quality."
    ],
})

# Run the evaluation. The metric calls an LLM and an embedding model under the hood,
# so the corresponding provider credentials (e.g. an OpenAI API key) must be configured.
result = evaluate(
    eval_dataset,
    metrics=[response_relevancy],
)
print(result)
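The printed result reports the aggregate metric score. For per-sample inspection, the evaluation result can be converted to a pandas DataFrame; the exact name of the score column (often answer_relevancy, the metric's internal name) can vary between ragas versions, so check the columns before filtering on it.

# Inspect per-sample scores; the score column name may differ by ragas version.
df = result.to_pandas()
print(df.columns)
print(df.head())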
Use Cases
- Query Understanding: Evaluate how well your system interprets and responds to different query types
- Response Quality Assurance: Ensure responses actually answer the questions users are asking
- LLM Comparison: Compare different models' ability to generate relevant responses (see the comparison sketch after this list)
- Prompt Engineering: Refine prompts to improve response relevancy
- User Satisfaction Prediction: Use relevancy scores as a predictor for potential user satisfaction
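For the LLM-comparison use case, a common pattern is to score each model's responses against the same questions and compare the resulting relevancy averages. The sketch below assumes the question, context, and per-model answer lists already exist, and that the score column is named answer_relevancy (the metric's internal name); both are assumptions to adapt to your setup.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import ResponseRelevancy

def mean_relevancy(questions, contexts, answers):
    """Score one model's answers and return the mean response relevancy."""
    ds = Dataset.from_dict({"question": questions, "contexts": contexts, "answer": answers})
    result = evaluate(ds, metrics=[ResponseRelevancy()])
    df = result.to_pandas()
    # Adjust the column name if your ragas version reports it differently.
    return df["answer_relevancy"].mean()

# questions, contexts, answers_model_a, answers_model_b are assumed to be prepared elsewhere
# score_a = mean_relevancy(questions, contexts, answers_model_a)
# score_b = mean_relevancy(questions, contexts, answers_model_b)
# print(f"Model A: {score_a:.2f}  Model B: {score_b:.2f}")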
Best Practices
- Use ResponseRelevancy in conjunction with other metrics such as Faithfulness and Context Precision for comprehensive evaluation (AnswerRelevancy is an earlier name for this same metric, not a separate check)
- Set appropriate thresholds for different types of queries and use cases (see the thresholding sketch at the end of this section)
- Regularly audit responses with low relevancy scores to identify patterns and improve system performance
- Consider the complexity of the original query when interpreting scores
- Incorporate user feedback to verify and calibrate relevancy scores
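The sketch below illustrates the thresholding and auditing practices above on a plain list of scored records. The field names, query types, and threshold values are illustrative assumptions for this documentation, not ragas defaults.

# Illustrative thresholding and audit helper; field names and thresholds are assumptions.
RELEVANCY_THRESHOLDS = {
    "factual": 0.8,       # short factual questions should be answered head-on
    "exploratory": 0.6,   # open-ended questions tolerate broader answers
    "default": 0.7,
}

def flag_low_relevancy(records):
    """Return the records whose relevancy score falls below the threshold for their query type."""
    flagged = []
    for rec in records:  # each rec: {"question", "answer", "query_type", "relevancy"}
        threshold = RELEVANCY_THRESHOLDS.get(rec.get("query_type"), RELEVANCY_THRESHOLDS["default"])
        if rec["relevancy"] < threshold:
            flagged.append(rec)
    return flagged

# Example audit pass over already-scored responses
scored = [
    {"question": "What are the health benefits of regular exercise?",
     "answer": "Exercise equipment can be expensive.",
     "query_type": "factual", "relevancy": 0.15},
]
for rec in flag_low_relevancy(scored):
    print(f"AUDIT ({rec['relevancy']:.2f}): {rec['question']}")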