Rubrics Based Scoring Evaluator

The Rubrics Based Scoring evaluator assesses responses against customized qualitative criteria defined as rubrics. Unlike fixed metrics, rubric-based evaluation allows users to define application-specific dimensions that matter for their use case.

Rubrics Based Scoring Component

[Image: Rubrics Based Scoring component interface and configuration]

Evaluation Notice: The effectiveness of rubric-based evaluation depends on carefully crafted criteria that reflect your application's specific quality requirements. Consider involving domain experts in defining your evaluation rubrics.

Component Inputs

  • User Input: The original question or query

    Example: "What are the benefits and drawbacks of remote work?"

  • Generated Output: The response generated by the RAG system

    Example: "Remote work has several benefits including flexibility in schedule, no commute time, and improved work-life balance. On the other hand, it can cause feelings of isolation, make it difficult to separate work from personal life, and potentially reduce team cohesion due to limited face-to-face interaction."

  • Expected Output: The reference or ground truth answer for comparison

    Example: "Remote work offers advantages such as schedule flexibility, elimination of commuting, and better work-life balance. However, it presents challenges including social isolation, blurred boundaries between work and home, and reduced team collaboration."

Component Outputs

  • Score: An aggregated numerical value between 0 and 1, representing the overall assessment across all rubrics

    Example: 0.88 (indicating strong performance across defined rubrics)

  • Evaluation Result: A detailed breakdown showing scores for each individual rubric criterion

    Example: { "overall_score": 0.88, "rubric_scores": { "Is the answer comprehensive...": 0.92, "Is the answer concise...": 0.85, ... } }

Score Interpretation

Excellent Performance (0.7-1.0)

Response strongly meets most or all criteria defined in the rubrics

Example Score: 0.92. This indicates excellent performance against your custom evaluation criteria.

Satisfactory Performance (0.3-0.7)

Response meets some but not all criteria, with room for improvement in certain dimensions

Example Score: 0.55. This indicates adequate performance with some weaknesses in specific rubric criteria.

Poor Performance (0.0-0.3)

Response fails to meet most of the criteria defined in the rubrics

Example Score: 0.15. This indicates poor performance against your custom evaluation criteria.
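
If it helps to apply these bands programmatically, they can be encoded in a small helper. The function name and the handling of the boundary values (0.3 and 0.7) are assumptions for illustration, since the bands above overlap at their edges.

# Hypothetical helper mapping an aggregated score to the bands described above.
# Boundary handling (>= 0.7, >= 0.3) is an assumption, not defined by the component.
def interpret_score(score: float) -> str:
    if score >= 0.7:
        return "Excellent Performance"
    if score >= 0.3:
        return "Satisfactory Performance"
    return "Poor Performance"

print(interpret_score(0.92))  # Excellent Performance
print(interpret_score(0.55))  # Satisfactory Performance
print(interpret_score(0.15))  # Poor Performance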

Implementation Example

from ragas.metrics import RubricScore
from ragas import evaluate
from datasets import Dataset

# Define custom rubrics
rubrics = [
    "Is the answer comprehensive, covering all aspects of the question?",
    "Is the answer concise, without unnecessary information?",
    "Is the answer well-structured and easy to understand?",
    "Does the answer address the core intent of the query?",
]

# Create the metric
rubric_score = RubricScore(rubrics=rubrics)

# Build a one-row evaluation dataset
eval_dataset = Dataset.from_dict({
    "user_input": [
        "What are the benefits and drawbacks of remote work?"
    ],
    "generated_output": [
        "Remote work has several benefits including flexibility in schedule, "
        "no commute time, and improved work-life balance. On the other hand, "
        "it can cause feelings of isolation, make it difficult to separate "
        "work from personal life, and potentially reduce team cohesion due to "
        "limited face-to-face interaction."
    ],
    "expected_output": [
        "Remote work offers advantages such as schedule flexibility, elimination "
        "of commuting, and better work-life balance. However, it presents "
        "challenges including social isolation, blurred boundaries between work "
        "and home, and reduced team collaboration."
    ],
})

# Run the evaluation with the rubric-based metric
result = evaluate(
    eval_dataset,
    metrics=[rubric_score],
)
print(result)

Use Cases

  • Custom Evaluation Criteria: Create domain-specific evaluation frameworks tailored to particular use cases
  • Multi-dimensional Assessment: Evaluate responses across multiple quality dimensions simultaneously
  • Educational Feedback: Provide structured feedback on responses for training or educational purposes
  • Industry-Specific Evaluation: Assess responses against industry standards or regulatory requirements
  • Brand Voice Alignment: Evaluate how well responses adhere to brand communication guidelines
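
For instance, a rubric set tailored to the Brand Voice Alignment use case might look like the following. The wording is purely illustrative and should be replaced with your own brand guidelines.

# Purely illustrative rubric set for a brand-voice evaluation.
brand_voice_rubrics = [
    "Does the response use the brand's preferred terminology?",
    "Is the tone friendly and professional, consistent with brand guidelines?",
    "Does the response avoid jargon that the brand's style guide prohibits?",
]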

Best Practices

  • Keep rubric questions clear, specific, and objectively answerable
  • Balance the number of rubrics to cover important dimensions without overwhelming the evaluation
  • Periodically review and refine your rubrics based on changing requirements
  • Consider weighting certain rubrics higher than others if they're more important for your use case (see the weighted-average sketch after this list)
  • Use the per-rubric scores to identify specific areas for improvement in your RAG system
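
A minimal sketch of the weighting idea mentioned above, assuming you have access to the per-rubric scores. The weights and the normalization are illustrative; weighted aggregation is not described as a built-in feature of the component.

# Illustrative weighted aggregation of per-rubric scores (not a built-in feature).
rubric_scores = {
    "Is the answer comprehensive, covering all aspects of the question?": 0.92,
    "Is the answer concise, without unnecessary information?": 0.85,
    "Does the answer address the core intent of the query?": 0.95,
}
weights = {
    "Is the answer comprehensive, covering all aspects of the question?": 2.0,
    "Is the answer concise, without unnecessary information?": 1.0,
    "Does the answer address the core intent of the query?": 3.0,
}
weighted = sum(rubric_scores[r] * weights[r] for r in weights) / sum(weights.values())
print(f"weighted score: {weighted:.2f}")  # 0.92 for these example values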