Rubrics Based Scoring Evaluator

The Rubrics Based Scoring evaluator assesses responses against customized qualitative criteria defined as rubrics. Unlike fixed metrics, rubric-based evaluation allows users to define application-specific dimensions that matter for their use case.

Rubrics Based Scoring Component

[Image: Rubrics Based Scoring component interface and configuration]

Evaluation Notice: The effectiveness of rubric-based evaluation depends on carefully crafted criteria that reflect your application's specific quality requirements. Consider involving domain experts in defining your evaluation rubrics.

Component Inputs

  • User Input: The original question or query

    Example: "What are the benefits and drawbacks of remote work?"

  • Generated Output: The response generated by the RAG system

    Example: "Remote work has several benefits including flexibility in schedule, no commute time, and improved work-life balance. On the other hand, it can cause feelings of isolation, make it difficult to separate work from personal life, and potentially reduce team cohesion due to limited face-to-face interaction."

  • Expected Output: The reference or ground truth answer for comparison

    Example: "Remote work offers advantages such as schedule flexibility, elimination of commuting, and better work-life balance. However, it presents challenges including social isolation, blurred boundaries between work and home, and reduced team collaboration."

Component Outputs

  • Score: An aggregated numerical value between 0 and 1, representing the overall assessment across all rubrics

    Example: 0.88 (indicating strong performance across defined rubrics)

  • Evaluation Result: A detailed breakdown showing scores for each individual rubric criterion

    Example: { "overall_score": 0.88, "rubric_scores": { "Is the answer comprehensive...": 0.92, "Is the answer concise...": 0.85, ... } }

Score Interpretation

Excellent Performance (0.7-1.0)

Response strongly meets most or all criteria defined in the rubrics

Example Score: 0.92. This indicates excellent performance against your custom evaluation criteria.

Satisfactory Performance (0.3-0.7)

Response meets some but not all criteria, with room for improvement in certain dimensions

Example Score: 0.55. This indicates adequate performance with some weaknesses in specific rubric criteria.

Poor Performance (0.0-0.3)

Response fails to meet most of the criteria defined in the rubrics

Example Score: 0.15. This indicates poor performance against your custom evaluation criteria.
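
If it helps to apply these bands programmatically, they can be encoded in a small helper. The function name and the handling of the boundary values (0.3 and 0.7) are assumptions for illustration, since the bands above overlap at their edges.

# Hypothetical helper mapping an aggregated score to the bands described above.
# Boundary handling (>= 0.7, >= 0.3) is an assumption, not defined by the component.
def interpret_score(score: float) -> str:
    if score >= 0.7:
        return "Excellent Performance"
    if score >= 0.3:
        return "Satisfactory Performance"
    return "Poor Performance"

print(interpret_score(0.92))  # Excellent Performance
print(interpret_score(0.55))  # Satisfactory Performance
print(interpret_score(0.15))  # Poor Performance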

Implementation Example

from ragas.metrics import RubricScore
from ragas import evaluate
from datasets import Dataset

# Define custom rubrics
rubrics = [
    "Is the answer comprehensive, covering all aspects of the question?",
    "Is the answer concise, without unnecessary information?",
    "Is the answer well-structured and easy to understand?",
    "Does the answer address the core intent of the query?",
]

# Create the metric
rubric_score = RubricScore(rubrics=rubrics)

# Build a one-row evaluation dataset
eval_dataset = Dataset.from_dict({
    "user_input": [
        "What are the benefits and drawbacks of remote work?"
    ],
    "generated_output": [
        "Remote work has several benefits including flexibility in schedule, "
        "no commute time, and improved work-life balance. On the other hand, "
        "it can cause feelings of isolation, make it difficult to separate "
        "work from personal life, and potentially reduce team cohesion due to "
        "limited face-to-face interaction."
    ],
    "expected_output": [
        "Remote work offers advantages such as schedule flexibility, elimination "
        "of commuting, and better work-life balance. However, it presents "
        "challenges including social isolation, blurred boundaries between work "
        "and home, and reduced team collaboration."
    ],
})

# Run the evaluation with the rubric-based metric
result = evaluate(
    eval_dataset,
    metrics=[rubric_score],
)
print(result)

Use Cases

  • Custom Evaluation Criteria: Create domain-specific evaluation frameworks tailored to particular use cases
  • Multi-dimensional Assessment: Evaluate responses across multiple quality dimensions simultaneously
  • Educational Feedback: Provide structured feedback on responses for training or educational purposes
  • Industry-Specific Evaluation: Assess responses against industry standards or regulatory requirements
  • Brand Voice Alignment: Evaluate how well responses adhere to brand communication guidelines
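
For instance, a rubric set tailored to the Brand Voice Alignment use case might look like the following. The wording is purely illustrative and should be replaced with your own brand guidelines.

# Purely illustrative rubric set for a brand-voice evaluation.
brand_voice_rubrics = [
    "Does the response use the brand's preferred terminology?",
    "Is the tone friendly and professional, consistent with brand guidelines?",
    "Does the response avoid jargon that the brand's style guide prohibits?",
]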

Best Practices

  • Keep rubric questions clear, specific, and objectively answerable
  • Balance the number of rubrics to cover important dimensions without overwhelming the evaluation
  • Periodically review and refine your rubrics based on changing requirements
  • Consider weighting certain rubrics higher than others if they're more important for your use case (see the weighted-average sketch after this list)
  • Use the per-rubric scores to identify specific areas for improvement in your RAG system
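
A minimal sketch of the weighting idea mentioned above, assuming you have access to the per-rubric scores. The weights and the normalization are illustrative; weighted aggregation is not described as a built-in feature of the component.

# Illustrative weighted aggregation of per-rubric scores (not a built-in feature).
rubric_scores = {
    "Is the answer comprehensive, covering all aspects of the question?": 0.92,
    "Is the answer concise, without unnecessary information?": 0.85,
    "Does the answer address the core intent of the query?": 0.95,
}
weights = {
    "Is the answer comprehensive, covering all aspects of the question?": 2.0,
    "Is the answer concise, without unnecessary information?": 1.0,
    "Does the answer address the core intent of the query?": 3.0,
}
weighted = sum(rubric_scores[r] * weights[r] for r in weights) / sum(weights.values())
print(f"weighted score: {weighted:.2f}")  # 0.92 for these example values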