LangWatch Evaluator Agent

The LangWatch Evaluator Agent monitors and evaluates LLM-generated content in real time. It helps ensure output quality, detects potential issues, and provides comprehensive analytics on model performance across multiple dimensions, including factuality, toxicity, and relevance.

LangWatch Evaluator Component

[Image: LangWatch Evaluator Agent interface and configuration]

Implementation Notice: LangWatch evaluations may increase overall latency and cost for high-volume applications. Consider configuring sampling rates or event-based triggers for production deployments to optimize resource usage.
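
As a rough sketch of that advice, the configuration objects below show how sampling or event-based triggers might be expressed. The option names mirror the implementation example later on this page; triggerOn is a hypothetical name used only for illustration.

// Illustrative configuration only; the option names mirror the
// implementation example later on this page, and `triggerOn` is a
// hypothetical name shown purely for illustration.
const sampledConfig = {
  evaluationMetrics: ["factuality", "toxicity"],
  sampling: 0.1 // evaluate a random 10% of production traffic
};

const eventDrivenConfig = {
  evaluationMetrics: ["toxicity", "hallucination"],
  triggerOn: ["user_reported_issue", "low_model_confidence"] // hypothetical triggers
};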

Component Inputs

  • Input Text: The original prompt or query sent to the LLM

    Example: "Explain quantum computing in simple terms."

  • Output Text: The LLM-generated response to be evaluated

    Example: "Quantum computing uses quantum bits or qubits that can be both 0 and 1 simultaneously..."

  • Evaluation Metrics: List of metrics to evaluate in the LLM response

    Options: "factuality,toxicity,relevance,coherence,bias,hallucination,conciseness,completeness,all"

  • Reference Sources: Optional reference materials for factual verification

    Example: URLs, document IDs, or text snippets containing factual information
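
Putting these inputs together, the following sketch shows how a single evaluation request might be assembled, including optional reference sources for factual verification. The field names follow the descriptions above and are illustrative rather than a definitive API.

// Hypothetical shape of a single evaluation request built from the
// component inputs described above.
const evaluationRequest = {
  inputText: "Explain quantum computing in simple terms.",
  outputText: "Quantum computing uses quantum bits or qubits that can be " +
    "both 0 and 1 simultaneously...",
  evaluationMetrics: ["factuality", "relevance", "coherence"],
  // Optional: URLs, document IDs, or text snippets used for fact checking.
  referenceSources: [
    "https://example.com/intro-to-quantum-computing",
    "doc-id-12345"
  ]
};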

Component Outputs

  • Overall Quality Score: Aggregate score representing overall quality

    Scale: 0.0 (low quality) to 1.0 (high quality)

  • Metric Scores: Individual scores for each evaluated metric

    Example: factuality: 0.92, relevance: 0.87, toxicity: 0.03

  • Issue Flags: Specific issues identified in the content

    Example: ["potential_hallucination", "missing_key_information"]

  • Evaluation Insights: Detailed explanation of evaluation results

    Includes specific content segments that influenced scoring
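
To make these output fields concrete, the following minimal sketch shows how downstream code might consume a result object shaped as described above. The field names follow this page; the threshold value is arbitrary.

// Assumes `result` has the shape documented above: overallQualityScore,
// metricScores, issueFlags, and evaluationInsights.
function reviewEvaluation(result, qualityThreshold = 0.8) {
  const problems = [];

  if (result.overallQualityScore < qualityThreshold) {
    problems.push(`overall score ${result.overallQualityScore} is below ${qualityThreshold}`);
  }
  for (const flag of result.issueFlags ?? []) {
    problems.push(`flagged: ${flag}`);
  }
  // Surface the detailed insights, including the content segments
  // that influenced scoring.
  for (const insight of result.evaluationInsights ?? []) {
    problems.push(`${insight.issue}: "${insight.segment}" (${insight.explanation})`);
  }

  return { needsReview: problems.length > 0, problems };
}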

Evaluation Metrics

Quality Metrics

  • Factuality: Accuracy of factual claims

    Verifies information against reference sources

  • Relevance: Alignment with the original query

    Measures how well the response addresses the question

  • Coherence: Logical flow and structure

    Evaluates organization and clarity of ideas

  • Conciseness: Efficiency of communication

    Checks for wordiness and redundancy

  • Completeness: Coverage of necessary information

    Identifies missing key elements

Safety Metrics

  • Toxicity: Harmful, offensive, or inappropriate content

    Detects various forms of harmful language

  • Bias: Unfair or prejudiced perspectives

    Identifies demographic or ideological bias

  • Hallucination: Fabricated or unsupported claims

    Detects made-up facts or misleading information

  • Instruction Following: Adherence to provided guidelines

    Measures how well directions were followed

  • Prompt Injection: Attempts to manipulate model behavior

    Identifies potential security vulnerabilities
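
The metric names above map onto the Evaluation Metrics input. The short sketch below groups them so a caller can request, for example, only the safety metrics; the exact string identifiers for instruction following and prompt injection are assumptions based on the naming pattern of the other options.

// Metric names grouped as on this page. The snake_case identifiers for
// instruction following and prompt injection are assumptions based on the
// naming pattern of the other options.
const QUALITY_METRICS = [
  "factuality", "relevance", "coherence", "conciseness", "completeness"
];
const SAFETY_METRICS = [
  "toxicity", "bias", "hallucination", "instruction_following", "prompt_injection"
];

// Request only safety checks, or combine both groups for a full evaluation.
const safetyOnlyConfig = { evaluationMetrics: SAFETY_METRICS };
const fullConfig = { evaluationMetrics: [...QUALITY_METRICS, ...SAFETY_METRICS] };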

How It Works

The LangWatch Evaluator Agent utilizes specialized language models and rule-based systems to analyze LLM outputs across multiple dimensions. It compares responses against reference data, identifies patterns associated with quality issues, and provides comprehensive analysis of content safety and effectiveness.

Evaluation Process

  1. Capturing input-output pairs from LLM interactions
  2. Applying selected evaluation metrics to analyze content
  3. Comparing factual claims against reference sources (if provided)
  4. Detecting potential issues like hallucination or bias
  5. Generating detailed scores and insights for each metric
  6. Aggregating results into a comprehensive evaluation report
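
For intuition, the self-contained sketch below mirrors these six steps with stubbed scoring functions. It is a simplified model of the process described above, not the actual LangWatch implementation.

// Simplified, self-contained model of the six-step process above.
// Scorers are stubs, and every score here is framed as "higher is better"
// for simplicity (unlike the real toxicity score, where lower is better).
function evaluateInteraction(inputText, outputText, metrics, referenceSources = []) {
  // 1. Capture the input-output pair from the LLM interaction.
  const interaction = { inputText, outputText };

  const metricScores = {};
  const issueFlags = [];
  for (const metric of metrics) {
    // 2-3. Apply each selected metric; factuality is checked against
    //      reference sources when they are provided.
    const score = metric === "factuality"
      ? scoreFactuality(interaction, referenceSources)
      : scoreStub(interaction, metric);
    metricScores[metric] = score;

    // 4. Flag potential issues such as hallucination or bias.
    if (score < 0.5) issueFlags.push(`potential_${metric}_issue`);
  }

  // 5-6. Generate per-metric scores and aggregate them into a report.
  const scores = Object.values(metricScores);
  const overallQualityScore = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return { overallQualityScore, metricScores, issueFlags };
}

// Placeholder scorers standing in for the model- and rule-based evaluators.
function scoreFactuality(interaction, referenceSources) {
  return referenceSources.length > 0 ? 0.9 : 0.6;
}
function scoreStub(interaction, metric) {
  return 0.8;
}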

Use Cases

  • Quality Assurance: Monitor and maintain output quality in production systems
  • Model Comparison: Evaluate different LLM models against consistent metrics
  • Safety Monitoring: Identify and address potential harmful outputs
  • Prompt Engineering: Refine prompts based on output quality metrics
  • Compliance: Ensure outputs meet regulatory or policy requirements

Implementation Example

const langwatchEvaluator = new LangWatchEvaluatorAgent({
  evaluationMetrics: ["factuality", "relevance", "toxicity", "hallucination"],
  confidenceThreshold: 0.7,
  sampling: 0.25 // Evaluate 25% of all interactions
});

// Example evaluation of an LLM interaction
const userQuery = "What are the health benefits of drinking water?";
const llmResponse = "Drinking water offers numerous health benefits, including " +
  "better kidney function, maintained blood pressure, and it can cure cancer.";

const result = langwatchEvaluator.evaluate(userQuery, llmResponse);

// Output:
// {
//   overallQualityScore: 0.68,
//   metricScores: {
//     "factuality": 0.45,
//     "relevance": 0.95,
//     "toxicity": 0.02,
//     "hallucination": 0.65
//   },
//   issueFlags: ["factual_error", "potential_hallucination"],
//   evaluationInsights: [
//     {
//       segment: "it can cure cancer",
//       issue: "factual_error",
//       explanation: "Unsupported medical claim. Water consumption has health benefits but is not proven to cure cancer.",
//       confidence: 0.92
//     }
//   ]
// }

Best Practices

  • Select evaluation metrics relevant to your specific use case and requirements
  • Use sampling for high-volume applications to balance performance and cost
  • Provide reference sources when evaluating factuality to improve accuracy
  • Establish baseline quality thresholds based on your application's needs
  • Regularly review evaluation reports to identify systemic issues in your LLM outputs
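
The sketch below ties several of these practices together: sampling, reference sources, and a baseline quality threshold with regular review of flagged outputs. It reuses the constructor and evaluate() shapes from the implementation example above; the optional referenceSources argument and the alerting hook are assumptions.

// Illustrative only; the constructor and evaluate() shapes follow the
// implementation example above, and the optional referenceSources argument
// and alerting hook are assumptions.
const evaluator = new LangWatchEvaluatorAgent({
  evaluationMetrics: ["factuality", "relevance", "toxicity"],
  confidenceThreshold: 0.7,
  sampling: 0.1 // evaluate 10% of traffic in production
});

const BASELINE_QUALITY = 0.75; // baseline threshold chosen for this application

function monitorInteraction(userQuery, llmResponse, referenceSources) {
  // Providing reference sources improves the accuracy of factuality checks.
  const result = evaluator.evaluate(userQuery, llmResponse, { referenceSources });

  if (result.overallQualityScore < BASELINE_QUALITY || result.issueFlags.length > 0) {
    // Route low-scoring or flagged outputs for regular review; logging
    // stands in for a real alerting or ticketing pipeline here.
    console.warn("LLM output flagged for review", {
      score: result.overallQualityScore,
      flags: result.issueFlags
    });
  }
  return result;
}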