
ROUGE Score Evaluator

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score Evaluator measures the overlap between system-generated responses and reference texts. Unlike BLEU, which focuses on precision, ROUGE emphasizes recall—how much of the reference text is captured in the generated output.
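
To make the recall-versus-precision distinction concrete, the toy sketch below counts overlapping unigrams by hand. The sentences and the set-based counting are simplifications invented for this illustration; real ROUGE implementations clip repeated tokens rather than using sets.

    # Simplified ROUGE-1 style counting (set-based; real ROUGE clips repeated tokens)
    reference = "excessive thirst and frequent urination".split()
    generated = "frequent urination and fatigue".split()

    overlap = set(reference) & set(generated)       # unigrams shared by both texts
    recall = len(overlap) / len(set(reference))     # how much of the reference is covered
    precision = len(overlap) / len(set(generated))  # how much of the output is supported

    print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.60, precision=0.75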

ROUGE Score Evaluator Component

[Image: ROUGE Score Evaluator component interface and configuration]

Evaluation Notice: ROUGE evaluates from a recall perspective, meaning it measures how well the generated text captures content from the reference. This is particularly important for summarization tasks where coverage of key information is critical.

Component Inputs

  • Generated Output: The response generated by the RAG system

    Example: "Diabetes symptoms include increased thirst, frequent urination, extreme hunger, weight loss, fatigue, irritability, blurred vision, and slow-healing sores."

  • Expected Output: The ground truth or expected answer for comparison

    Example: "The symptoms of diabetes typically include excessive thirst, frequent urination, increased hunger, unexplained weight loss, fatigue, irritability, blurry vision, and wounds that heal slowly."

  • ROUGE Type: The specific ROUGE metric variant to use for evaluation

    Example: "rouge1" for unigram matching, "rouge2" for bigram matching, "rougeL" for longest common subsequence

  • Measure Type: The specific measurement approach (precision, recall, or F1-score)

    Example: "precision", "recall", or "f1"

Component Outputs

  • Score: An aggregated numerical value between 0 and 1, typically an average of the different ROUGE metrics

    Example: 0.82 (indicating high recall of reference content)

  • Evaluation Result: A detailed breakdown of various ROUGE metrics:
    • rouge1: Unigram (single word) overlap
    • rouge2: Bigram (two-word sequence) overlap
    • rougeL: Longest Common Subsequence overlap
    • rougeLsum: Longest Common Subsequence computed per sentence and aggregated over the whole text (summary-level overlap)

    Example: { "rouge1": 0.85, "rouge2": 0.76, "rougeL": 0.78, "rougeLsum": 0.80 }
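
Assuming the aggregated Score is a plain average of the per-variant values in the Evaluation Result (the exact aggregation is an assumption made for this sketch), it can be reproduced like this:

    # Hypothetical aggregation: mean of the per-variant ROUGE values
    evaluation_result = {"rouge1": 0.85, "rouge2": 0.76, "rougeL": 0.78, "rougeLsum": 0.80}
    score = sum(evaluation_result.values()) / len(evaluation_result)
    print(round(score, 2))  # 0.8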

Score Interpretation

High Recall (0.7-1.0)

Generated text successfully captures most or all of the important content from the reference text

Example Score: 0.85. This indicates excellent recall, where the generated text contains most of the information from the reference.

Moderate Recall (0.3-0.7)

Generated text captures some but not all of the key content from the reference text

Example Score: 0.50. This indicates partial recall, where the generated text misses some key information.

Low Recall (0.0-0.3)

Generated text captures little of the reference content or key information

Example Score: 0.15. This indicates poor recall, where the generated text fails to capture most of the reference content.
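
A small helper like the one below (a hypothetical convenience function, not part of the component) makes these bands easy to apply in monitoring code:

    def interpret_rouge_recall(score: float) -> str:
        """Map a ROUGE recall score in [0, 1] onto the bands described above."""
        if score >= 0.7:
            return "High recall: most of the reference content is captured"
        if score >= 0.3:
            return "Moderate recall: some key information is missing"
        return "Low recall: most of the reference content is missing"

    print(interpret_rouge_recall(0.85))  # High recall band
    print(interpret_rouge_recall(0.15))  # Low recall band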

Implementation Example

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import RougeScore

    # Create the metric
    rouge = RougeScore()

    # Use in evaluation: ragas' RougeScore compares the "response" column
    # against the "reference" column
    eval_dataset = Dataset.from_dict({
        "response": [
            "Diabetes symptoms include increased thirst, frequent urination, "
            "extreme hunger, weight loss, fatigue, irritability, blurred vision, "
            "and slow-healing sores."
        ],
        "reference": [
            "The symptoms of diabetes typically include excessive thirst, "
            "frequent urination, increased hunger, unexplained weight loss, "
            "fatigue, irritability, blurry vision, and wounds that heal slowly."
        ],
    })

    result = evaluate(eval_dataset, metrics=[rouge])
    print(result)
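
For scoring a single response/reference pair without building a dataset, recent ragas releases also provide an async per-sample API; the rouge_type and mode arguments shown here follow those releases and may differ across versions, so treat this as a sketch to adapt rather than a guaranteed signature.

    import asyncio

    from ragas.dataset_schema import SingleTurnSample
    from ragas.metrics import RougeScore

    # Constructor arguments are assumptions based on recent ragas releases
    scorer = RougeScore(rouge_type="rouge1", mode="recall")

    sample = SingleTurnSample(
        response="Diabetes symptoms include increased thirst and frequent urination.",
        reference="Typical diabetes symptoms include excessive thirst and frequent urination.",
    )

    print(asyncio.run(scorer.single_turn_ascore(sample)))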

Use Cases

  • Summarization Evaluation: Assess how well generated summaries capture key information from source texts
  • Reference-based Assessment: Benchmark RAG outputs against gold-standard answers or expert responses
  • Model Comparison: Compare different RAG implementations to determine which produces content closest to reference material
  • Content Coverage Evaluation: Measure how comprehensively generated content covers key points from a reference
  • Response Quality Monitoring: Track ROUGE scores over time to monitor system performance and detect degradation

Best Practices

  • Use ROUGE in combination with BLEU for a more comprehensive evaluation of text similarity (see the sketch after this list)
  • Consider different ROUGE variants (ROUGE-1, ROUGE-2, ROUGE-L) for different aspects of text quality
  • Create multiple reference answers when possible to account for linguistic variation
  • Interpret ROUGE scores in context of the specific task and domain requirements
  • Combine ROUGE with semantic metrics like Faithfulness for evaluating both lexical and semantic accuracy
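
As suggested in the first practice above, ROUGE and BLEU results can be reported side by side. The sketch below pairs ragas' RougeScore with its non-LLM BleuScore metric; LLM-based metrics such as Faithfulness need extra columns (the question and retrieved contexts) plus an evaluator LLM, so they are left out of this minimal example.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import BleuScore, RougeScore

    eval_dataset = Dataset.from_dict({
        "response": ["Diabetes symptoms include increased thirst and frequent urination."],
        "reference": ["Typical diabetes symptoms include excessive thirst and frequent urination."],
    })

    # Report lexical overlap from both a recall-oriented (ROUGE) and a
    # precision-oriented (BLEU) perspective in one run
    result = evaluate(eval_dataset, metrics=[RougeScore(), BleuScore()])
    print(result)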