
ROUGE Score Evaluator

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score Evaluator measures the overlap between system-generated responses and reference texts. Unlike BLEU, which focuses on precision, ROUGE emphasizes recall—how much of the reference text is captured in the generated output.
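
To make the recall-versus-precision distinction concrete, the toy sketch below counts overlapping unigrams by hand. The sentences and the set-based counting are simplifications invented for this illustration; real ROUGE implementations clip repeated tokens rather than using sets.

    # Simplified ROUGE-1 style counting (set-based; real ROUGE clips repeated tokens)
    reference = "excessive thirst and frequent urination".split()
    generated = "frequent urination and fatigue".split()

    overlap = set(reference) & set(generated)       # unigrams shared by both texts
    recall = len(overlap) / len(set(reference))     # how much of the reference is covered
    precision = len(overlap) / len(set(generated))  # how much of the output is supported

    print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.60, precision=0.75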

ROUGE Score Evaluator Component

[Image: ROUGE Score Evaluator component interface and configuration]

Evaluation Notice: ROUGE evaluates from a recall perspective, meaning it measures how well the generated text captures content from the reference. This is particularly important for summarization tasks where coverage of key information is critical.

Component Inputs

  • Generated Output: The response generated by the RAG system

    Example: "Diabetes symptoms include increased thirst, frequent urination, extreme hunger, weight loss, fatigue, irritability, blurred vision, and slow-healing sores."

  • Expected Output: The ground truth or expected answer for comparison

    Example: "The symptoms of diabetes typically include excessive thirst, frequent urination, increased hunger, unexplained weight loss, fatigue, irritability, blurry vision, and wounds that heal slowly."

  • ROUGE Type: The specific ROUGE metric variant to use for evaluation

    Example: "rouge1" for unigram matching, "rouge2" for bigram matching, "rougeL" for longest common subsequence

  • Measure Type: The specific measurement approach (precision, recall, or F1-score)

    Example: "precision", "recall", or "f1"

Component Outputs

  • Score: An aggregated numerical value between 0 and 1, typically an average of the different ROUGE metrics

    Example: 0.82 (indicating high recall of reference content)

  • Evaluation Result: A detailed breakdown of various ROUGE metrics:
    • rouge1: Unigram (single word) overlap
    • rouge2: Bigram (two-word sequence) overlap
    • rougeL: Longest Common Subsequence overlap
    • rougeLsum: Longest Common Subsequence computed per sentence and aggregated over the whole text (summary-level overlap)

    Example: { "rouge1": 0.85, "rouge2": 0.76, "rougeL": 0.78, "rougeLsum": 0.80 }
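
Assuming the aggregated Score is a plain average of the per-variant values in the Evaluation Result (the exact aggregation is an assumption made for this sketch), it can be reproduced like this:

    # Hypothetical aggregation: mean of the per-variant ROUGE values
    evaluation_result = {"rouge1": 0.85, "rouge2": 0.76, "rougeL": 0.78, "rougeLsum": 0.80}
    score = sum(evaluation_result.values()) / len(evaluation_result)
    print(round(score, 2))  # 0.8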

Score Interpretation

High Recall (0.7-1.0)

Generated text successfully captures most or all of the important content from the reference text

Example Score: 0.85. This indicates excellent recall, where the generated text contains most of the information from the reference.

Moderate Recall (0.3-0.7)

Generated text captures some but not all of the key content from the reference text

Example Score: 0.50. This indicates partial recall, where the generated text misses some key information.

Low Recall (0.0-0.3)

Generated text captures little of the reference content or key information

Example Score: 0.15. This indicates poor recall, where the generated text fails to capture most of the reference content.
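
A small helper like the one below (a hypothetical convenience function, not part of the component) makes these bands easy to apply in monitoring code:

    def interpret_rouge_recall(score: float) -> str:
        """Map a ROUGE recall score in [0, 1] onto the bands described above."""
        if score >= 0.7:
            return "High recall: most of the reference content is captured"
        if score >= 0.3:
            return "Moderate recall: some key information is missing"
        return "Low recall: most of the reference content is missing"

    print(interpret_rouge_recall(0.85))  # High recall band
    print(interpret_rouge_recall(0.15))  # Low recall band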

Implementation Example

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import RougeScore

    # Create the metric
    rouge = RougeScore()

    # Use in evaluation: ragas' RougeScore compares the "response" column
    # against the "reference" column
    eval_dataset = Dataset.from_dict({
        "response": [
            "Diabetes symptoms include increased thirst, frequent urination, "
            "extreme hunger, weight loss, fatigue, irritability, blurred vision, "
            "and slow-healing sores."
        ],
        "reference": [
            "The symptoms of diabetes typically include excessive thirst, "
            "frequent urination, increased hunger, unexplained weight loss, "
            "fatigue, irritability, blurry vision, and wounds that heal slowly."
        ],
    })

    result = evaluate(eval_dataset, metrics=[rouge])
    print(result)
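
For scoring a single response/reference pair without building a dataset, recent ragas releases also provide an async per-sample API; the rouge_type and mode arguments shown here follow those releases and may differ across versions, so treat this as a sketch to adapt rather than a guaranteed signature.

    import asyncio

    from ragas.dataset_schema import SingleTurnSample
    from ragas.metrics import RougeScore

    # Constructor arguments are assumptions based on recent ragas releases
    scorer = RougeScore(rouge_type="rouge1", mode="recall")

    sample = SingleTurnSample(
        response="Diabetes symptoms include increased thirst and frequent urination.",
        reference="Typical diabetes symptoms include excessive thirst and frequent urination.",
    )

    print(asyncio.run(scorer.single_turn_ascore(sample)))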

Use Cases

  • Summarization Evaluation: Assess how well generated summaries capture key information from source texts
  • Reference-based Assessment: Benchmark RAG outputs against gold-standard answers or expert responses
  • Model Comparison: Compare different RAG implementations to determine which produces content closest to reference material
  • Content Coverage Evaluation: Measure how comprehensively generated content covers key points from a reference
  • Response Quality Monitoring: Track ROUGE scores over time to monitor system performance and detect degradation

Best Practices

  • Use ROUGE in combination with BLEU for a more comprehensive evaluation of text similarity (see the sketch after this list)
  • Consider different ROUGE variants (ROUGE-1, ROUGE-2, ROUGE-L) for different aspects of text quality
  • Create multiple reference answers when possible to account for linguistic variation
  • Interpret ROUGE scores in context of the specific task and domain requirements
  • Combine ROUGE with semantic metrics like Faithfulness for evaluating both lexical and semantic accuracy
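
As suggested in the first practice above, ROUGE and BLEU results can be reported side by side. The sketch below pairs ragas' RougeScore with its non-LLM BleuScore metric; LLM-based metrics such as Faithfulness need extra columns (the question and retrieved contexts) plus an evaluator LLM, so they are left out of this minimal example.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import BleuScore, RougeScore

    eval_dataset = Dataset.from_dict({
        "response": ["Diabetes symptoms include increased thirst and frequent urination."],
        "reference": ["Typical diabetes symptoms include excessive thirst and frequent urination."],
    })

    # Report lexical overlap from both a recall-oriented (ROUGE) and a
    # precision-oriented (BLEU) perspective in one run
    result = evaluate(eval_dataset, metrics=[RougeScore(), BleuScore()])
    print(result)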