Response Relevancy Evaluator
The Response Relevancy Evaluator assesses how well a generated response addresses the user's original query. It helps identify responses that may be factually correct but fail to answer what the user actually asked.

Figure: Response Relevancy Evaluator component interface and configuration
Evaluation Notice: Low relevancy scores indicate that responses are not addressing what users are asking for, which can lead to poor user experience and frustration regardless of factual correctness.
Component Inputs
- Prompt / User Input: The original query or question posed by the user
Example: "What are the health benefits of regular exercise?"
- Generated Output: The response generated by the RAG system
Example: "Regular exercise provides numerous health benefits, including improved cardiovascular health, better weight management, enhanced mental wellbeing, stronger muscles and bones, reduced risk of chronic diseases, and improved sleep quality."
Component Outputs
- Evaluation Result: Qualitative assessment of the response's relevance to the original query
Example: "The response directly addresses the health benefits of regular exercise as requested in the query."
Score Interpretation
High Relevance (0.7-1.0)
Response directly addresses the query and provides the information the user was seeking
Example Score: 0.95
This indicates an excellent response that precisely answers what was asked
Moderate Relevance (0.3-0.7)
Response partially addresses the query but may include tangential information or miss some aspects
Example Score: 0.50
This indicates a response that addresses the query topic but may not fully answer what was asked
Low Relevance (0.0-0.3)
Response fails to address the query or provides information on a different topic
Example Score: 0.15
This indicates a response that does not answer the question asked
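For quick triage, a small helper can map a numeric score onto the bands above. This is a convenience sketch for this documentation, not part of the ragas API; the thresholds simply mirror the ranges listed in this section.

def relevance_band(score: float) -> str:
    """Map a response relevancy score (0.0-1.0) to the bands described above."""
    if score >= 0.7:
        return "high"       # directly addresses the query
    if score >= 0.3:
        return "moderate"   # partially addresses the query or drifts off-topic
    return "low"            # fails to address the query

print(relevance_band(0.95))  # high
print(relevance_band(0.50))  # moderate
print(relevance_band(0.15))  # low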
Implementation Example
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import ResponseRelevancy

# Create the metric
response_relevancy = ResponseRelevancy()

# Build a one-row evaluation dataset.
# Note: newer ragas releases may expect the columns user_input / retrieved_contexts / response
# instead of question / contexts / answer.
eval_dataset = Dataset.from_dict({
    "question": ["What are the health benefits of regular exercise?"],
    "contexts": [[
        "Regular exercise improves cardiovascular health, helps with weight "
        "management, boosts mental health, strengthens muscles and bones, "
        "reduces risk of chronic diseases, and improves sleep quality."
    ]],
    "answer": [
        "Regular exercise provides numerous health benefits, including improved "
        "cardiovascular health, better weight management, enhanced mental wellbeing, "
        "stronger muscles and bones, reduced risk of chronic diseases, and improved "
        "sleep quality."
    ],
})

# Run the evaluation. The metric calls an LLM and an embedding model under the hood,
# so the corresponding provider credentials (e.g. an OpenAI API key) must be configured.
result = evaluate(
    eval_dataset,
    metrics=[response_relevancy],
)
print(result)
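The printed result reports the aggregate metric score. For per-sample inspection, the evaluation result can be converted to a pandas DataFrame; the exact name of the score column (often answer_relevancy, the metric's internal name) can vary between ragas versions, so check the columns before filtering on it.

# Inspect per-sample scores; the score column name may differ by ragas version.
df = result.to_pandas()
print(df.columns)
print(df.head())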
Use Cases
- Query Understanding: Evaluate how well your system interprets and responds to different query types
- Response Quality Assurance: Ensure responses actually answer the questions users are asking
- LLM Comparison: Compare different models' ability to generate relevant responses (see the comparison sketch after this list)
- Prompt Engineering: Refine prompts to improve response relevancy
- User Satisfaction Prediction: Use relevancy scores as a predictor for potential user satisfaction
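For the LLM-comparison use case, a common pattern is to score each model's responses against the same questions and compare the resulting relevancy averages. The sketch below assumes the question, context, and per-model answer lists already exist, and that the score column is named answer_relevancy (the metric's internal name); both are assumptions to adapt to your setup.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import ResponseRelevancy

def mean_relevancy(questions, contexts, answers):
    """Score one model's answers and return the mean response relevancy."""
    ds = Dataset.from_dict({"question": questions, "contexts": contexts, "answer": answers})
    result = evaluate(ds, metrics=[ResponseRelevancy()])
    df = result.to_pandas()
    # Adjust the column name if your ragas version reports it differently.
    return df["answer_relevancy"].mean()

# questions, contexts, answers_model_a, answers_model_b are assumed to be prepared elsewhere
# score_a = mean_relevancy(questions, contexts, answers_model_a)
# score_b = mean_relevancy(questions, contexts, answers_model_b)
# print(f"Model A: {score_a:.2f}  Model B: {score_b:.2f}")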
Best Practices
- Use ResponseRelevancy in conjunction with other metrics such as Faithfulness and Context Precision for comprehensive evaluation (AnswerRelevancy is an earlier name for this same metric, not a separate check)
- Set appropriate thresholds for different types of queries and use cases (see the thresholding sketch at the end of this section)
- Regularly audit responses with low relevancy scores to identify patterns and improve system performance
- Consider the complexity of the original query when interpreting scores
- Incorporate user feedback to verify and calibrate relevancy scores
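The sketch below illustrates the thresholding and auditing practices above on a plain list of scored records. The field names, query types, and threshold values are illustrative assumptions for this documentation, not ragas defaults.

# Illustrative thresholding and audit helper; field names and thresholds are assumptions.
RELEVANCY_THRESHOLDS = {
    "factual": 0.8,       # short factual questions should be answered head-on
    "exploratory": 0.6,   # open-ended questions tolerate broader answers
    "default": 0.7,
}

def flag_low_relevancy(records):
    """Return the records whose relevancy score falls below the threshold for their query type."""
    flagged = []
    for rec in records:  # each rec: {"question", "answer", "query_type", "relevancy"}
        threshold = RELEVANCY_THRESHOLDS.get(rec.get("query_type"), RELEVANCY_THRESHOLDS["default"])
        if rec["relevancy"] < threshold:
            flagged.append(rec)
    return flagged

# Example audit pass over already-scored responses
scored = [
    {"question": "What are the health benefits of regular exercise?",
     "answer": "Exercise equipment can be expensive.",
     "query_type": "factual", "relevancy": 0.15},
]
for rec in flag_low_relevancy(scored):
    print(f"AUDIT ({rec['relevancy']:.2f}): {rec['question']}")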