LangWatch Evaluator Agent

The LangWatch Evaluator Agent monitors and evaluates LLM-generated content in real time. It helps ensure output quality, detects potential issues, and provides comprehensive analytics on model performance across multiple dimensions, including factuality, toxicity, and relevance.

LangWatch Evaluator Component

[Image: LangWatch Evaluator Agent interface and configuration]

Implementation Notice: LangWatch evaluations may increase overall latency and cost for high-volume applications. Consider configuring sampling rates or event-based triggers for production deployments to optimize resource usage.
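
As a rough sketch of that advice, the configuration objects below show how sampling or event-based triggers might be expressed. The option names mirror the implementation example later on this page; triggerOn is a hypothetical name used only for illustration.

// Illustrative configuration only; the option names mirror the
// implementation example later on this page, and `triggerOn` is a
// hypothetical name shown purely for illustration.
const sampledConfig = {
  evaluationMetrics: ["factuality", "toxicity"],
  sampling: 0.1 // evaluate a random 10% of production traffic
};

const eventDrivenConfig = {
  evaluationMetrics: ["toxicity", "hallucination"],
  triggerOn: ["user_reported_issue", "low_model_confidence"] // hypothetical triggers
};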

Component Inputs

  • Input Text: The original prompt or query sent to the LLM

    Example: "Explain quantum computing in simple terms."

  • Output Text: The LLM-generated response to be evaluated

    Example: "Quantum computing uses quantum bits or qubits that can be both 0 and 1 simultaneously..."

  • Evaluation Metrics: List of metrics to evaluate in the LLM response

    Options: "factuality,toxicity,relevance,coherence,bias,hallucination,conciseness,completeness,all"

  • Reference Sources: Optional reference materials for factual verification

    Example: URLs, document IDs, or text snippets containing factual information
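
Putting these inputs together, the following sketch shows how a single evaluation request might be assembled, including optional reference sources for factual verification. The field names follow the descriptions above and are illustrative rather than a definitive API.

// Hypothetical shape of a single evaluation request built from the
// component inputs described above.
const evaluationRequest = {
  inputText: "Explain quantum computing in simple terms.",
  outputText: "Quantum computing uses quantum bits or qubits that can be " +
    "both 0 and 1 simultaneously...",
  evaluationMetrics: ["factuality", "relevance", "coherence"],
  // Optional: URLs, document IDs, or text snippets used for fact checking.
  referenceSources: [
    "https://example.com/intro-to-quantum-computing",
    "doc-id-12345"
  ]
};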

Component Outputs

  • Overall Quality Score: Aggregate score representing overall quality

    Scale: 0.0 (low quality) to 1.0 (high quality)

  • Metric Scores: Individual scores for each evaluated metric

    Example: factuality: 0.92, relevance: 0.87, toxicity: 0.03

  • Issue Flags: Specific issues identified in the content

    Example: ["potential_hallucination", "missing_key_information"]

  • Evaluation Insights: Detailed explanation of evaluation results

    Includes specific content segments that influenced scoring
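
To make these output fields concrete, the following minimal sketch shows how downstream code might consume a result object shaped as described above. The field names follow this page; the threshold value is arbitrary.

// Assumes `result` has the shape documented above: overallQualityScore,
// metricScores, issueFlags, and evaluationInsights.
function reviewEvaluation(result, qualityThreshold = 0.8) {
  const problems = [];

  if (result.overallQualityScore < qualityThreshold) {
    problems.push(`overall score ${result.overallQualityScore} is below ${qualityThreshold}`);
  }
  for (const flag of result.issueFlags ?? []) {
    problems.push(`flagged: ${flag}`);
  }
  // Surface the detailed insights, including the content segments
  // that influenced scoring.
  for (const insight of result.evaluationInsights ?? []) {
    problems.push(`${insight.issue}: "${insight.segment}" (${insight.explanation})`);
  }

  return { needsReview: problems.length > 0, problems };
}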

Evaluation Metrics

Quality Metrics

  • Factuality: Accuracy of factual claims

    Verifies information against reference sources

  • Relevance: Alignment with the original query

    Measures how well the response addresses the question

  • Coherence: Logical flow and structure

    Evaluates organization and clarity of ideas

  • Conciseness: Efficiency of communication

    Checks for wordiness and redundancy

  • Completeness: Coverage of necessary information

    Identifies missing key elements

Safety Metrics

  • Toxicity: Harmful, offensive, or inappropriate content

    Detects various forms of harmful language

  • Bias: Unfair or prejudiced perspectives

    Identifies demographic or ideological bias

  • Hallucination: Fabricated or unsupported claims

    Detects made-up facts or misleading information

  • Instruction Following: Adherence to provided guidelines

    Measures how well directions were followed

  • Prompt Injection: Attempts to manipulate model behavior

    Identifies potential security vulnerabilities
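
The metric names above map onto the Evaluation Metrics input. The short sketch below groups them so a caller can request, for example, only the safety metrics; the exact string identifiers for instruction following and prompt injection are assumptions based on the naming pattern of the other options.

// Metric names grouped as on this page. The snake_case identifiers for
// instruction following and prompt injection are assumptions based on the
// naming pattern of the other options.
const QUALITY_METRICS = [
  "factuality", "relevance", "coherence", "conciseness", "completeness"
];
const SAFETY_METRICS = [
  "toxicity", "bias", "hallucination", "instruction_following", "prompt_injection"
];

// Request only safety checks, or combine both groups for a full evaluation.
const safetyOnlyConfig = { evaluationMetrics: SAFETY_METRICS };
const fullConfig = { evaluationMetrics: [...QUALITY_METRICS, ...SAFETY_METRICS] };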

How It Works

The LangWatch Evaluator Agent utilizes specialized language models and rule-based systems to analyze LLM outputs across multiple dimensions. It compares responses against reference data, identifies patterns associated with quality issues, and provides comprehensive analysis of content safety and effectiveness.

Evaluation Process

  1. Capturing input-output pairs from LLM interactions
  2. Applying selected evaluation metrics to analyze content
  3. Comparing factual claims against reference sources (if provided)
  4. Detecting potential issues like hallucination or bias
  5. Generating detailed scores and insights for each metric
  6. Aggregating results into a comprehensive evaluation report
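
For intuition, the self-contained sketch below mirrors these six steps with stubbed scoring functions. It is a simplified model of the process described above, not the actual LangWatch implementation.

// Simplified, self-contained model of the six-step process above.
// Scorers are stubs, and every score here is framed as "higher is better"
// for simplicity (unlike the real toxicity score, where lower is better).
function evaluateInteraction(inputText, outputText, metrics, referenceSources = []) {
  // 1. Capture the input-output pair from the LLM interaction.
  const interaction = { inputText, outputText };

  const metricScores = {};
  const issueFlags = [];
  for (const metric of metrics) {
    // 2-3. Apply each selected metric; factuality is checked against
    //      reference sources when they are provided.
    const score = metric === "factuality"
      ? scoreFactuality(interaction, referenceSources)
      : scoreStub(interaction, metric);
    metricScores[metric] = score;

    // 4. Flag potential issues such as hallucination or bias.
    if (score < 0.5) issueFlags.push(`potential_${metric}_issue`);
  }

  // 5-6. Generate per-metric scores and aggregate them into a report.
  const scores = Object.values(metricScores);
  const overallQualityScore = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return { overallQualityScore, metricScores, issueFlags };
}

// Placeholder scorers standing in for the model- and rule-based evaluators.
function scoreFactuality(interaction, referenceSources) {
  return referenceSources.length > 0 ? 0.9 : 0.6;
}
function scoreStub(interaction, metric) {
  return 0.8;
}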

Use Cases

  • Quality Assurance: Monitor and maintain output quality in production systems
  • Model Comparison: Evaluate different LLM models against consistent metrics
  • Safety Monitoring: Identify and address potential harmful outputs
  • Prompt Engineering: Refine prompts based on output quality metrics
  • Compliance: Ensure outputs meet regulatory or policy requirements

Implementation Example

const langwatchEvaluator = new LangWatchEvaluatorAgent({
  evaluationMetrics: ["factuality", "relevance", "toxicity", "hallucination"],
  confidenceThreshold: 0.7,
  sampling: 0.25 // Evaluate 25% of all interactions
});

// Example evaluation of an LLM interaction
const userQuery = "What are the health benefits of drinking water?";
const llmResponse = "Drinking water offers numerous health benefits, including " +
  "better kidney function, maintained blood pressure, and it can cure cancer.";

const result = langwatchEvaluator.evaluate(userQuery, llmResponse);

// Output:
// {
//   overallQualityScore: 0.68,
//   metricScores: {
//     "factuality": 0.45,
//     "relevance": 0.95,
//     "toxicity": 0.02,
//     "hallucination": 0.65
//   },
//   issueFlags: ["factual_error", "potential_hallucination"],
//   evaluationInsights: [
//     {
//       segment: "it can cure cancer",
//       issue: "factual_error",
//       explanation: "Unsupported medical claim. Water consumption has health benefits but is not proven to cure cancer.",
//       confidence: 0.92
//     }
//   ]
// }

Best Practices

  • Select evaluation metrics relevant to your specific use case and requirements
  • Use sampling for high-volume applications to balance performance and cost
  • Provide reference sources when evaluating factuality to improve accuracy
  • Establish baseline quality thresholds based on your application's needs
  • Regularly review evaluation reports to identify systemic issues in your LLM outputs
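
The sketch below ties several of these practices together: sampling, reference sources, and a baseline quality threshold with regular review of flagged outputs. It reuses the constructor and evaluate() shapes from the implementation example above; the optional referenceSources argument and the alerting hook are assumptions.

// Illustrative only; the constructor and evaluate() shapes follow the
// implementation example above, and the optional referenceSources argument
// and alerting hook are assumptions.
const evaluator = new LangWatchEvaluatorAgent({
  evaluationMetrics: ["factuality", "relevance", "toxicity"],
  confidenceThreshold: 0.7,
  sampling: 0.1 // evaluate 10% of traffic in production
});

const BASELINE_QUALITY = 0.75; // baseline threshold chosen for this application

function monitorInteraction(userQuery, llmResponse, referenceSources) {
  // Providing reference sources improves the accuracy of factuality checks.
  const result = evaluator.evaluate(userQuery, llmResponse, { referenceSources });

  if (result.overallQualityScore < BASELINE_QUALITY || result.issueFlags.length > 0) {
    // Route low-scoring or flagged outputs for regular review; logging
    // stands in for a real alerting or ticketing pipeline here.
    console.warn("LLM output flagged for review", {
      score: result.overallQualityScore,
      flags: result.issueFlags
    });
  }
  return result;
}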