LangWatch Evaluator Agent
The LangWatch Evaluator Agent monitors and evaluates LLM-generated content in real time. It helps ensure output quality, detects potential issues, and provides comprehensive analytics on model performance across multiple dimensions, including factuality, toxicity, and relevance.

[Image: LangWatch Evaluator Agent interface and configuration]
Implementation Notice: LangWatch evaluations may increase overall latency and cost for high-volume applications. Consider configuring sampling rates or event-based triggers for production deployments to optimize resource usage.
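As a rough illustration of request-level sampling, evaluation can be skipped for most interactions before the agent is called. The wrapper below is a sketch, not part of the component API; it only assumes the evaluate(inputText, outputText) call shown in the Implementation Example further down, and the built-in sampling option shown there achieves the same effect at the component level.
// Minimal sketch of client-side sampling (illustrative; not part of the component API).
// The evaluate(inputText, outputText) call matches the Implementation Example below.
function maybeEvaluate(evaluator, inputText, outputText, sampleRate = 0.1) {
  if (Math.random() >= sampleRate) return null; // skip most requests to limit latency and cost
  return evaluator.evaluate(inputText, outputText);
}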
Component Inputs
- Input Text: The original prompt or query sent to the LLM
Example: "Explain quantum computing in simple terms."
- Output Text: The LLM-generated response to be evaluated
Example: "Quantum computing uses quantum bits or qubits that can be both 0 and 1 simultaneously..."
- Evaluation Metrics: List of metrics to evaluate in the LLM response
Options: "factuality,toxicity,relevance,coherence,bias,hallucination,conciseness,completeness,all"
- Reference Sources: Optional reference materials for factual verification
Example: URLs, document IDs, or text snippets containing factual information
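Taken together, these inputs might be assembled as in the sketch below. The object shape and field names are illustrative assumptions, not a documented request format.
// Illustrative shape of an evaluation request (field names are assumptions)
const evaluationRequest = {
  inputText: "Explain quantum computing in simple terms.",
  outputText: "Quantum computing uses quantum bits or qubits that can be both 0 and 1 simultaneously...",
  evaluationMetrics: ["factuality", "relevance", "coherence"],
  referenceSources: ["https://example.com/quantum-primer", "doc-1234"]
};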
Component Outputs
- Overall Quality Score: Aggregate score summarizing quality across all evaluated metrics
Scale: 0.0 (low quality) to 1.0 (high quality)
- Metric Scores: Individual scores for each evaluated metric
Example: factuality: 0.92, relevance: 0.87, toxicity: 0.03
- Issue Flags: Specific issues identified in the content
Example: ["potential_hallucination", "missing_key_information"]
- Evaluation Insights: Detailed explanation of evaluation results
Includes specific content segments that influenced scoring
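A sketch of how these outputs might be consumed downstream is shown below; the field names mirror the Implementation Example later in this page, and the threshold values and flag name are illustrative, not defaults.
// Illustrative check on the component outputs (threshold values are examples, not defaults)
function needsHumanReview(result) {
  if (result.overallQualityScore < 0.7) return true;            // low aggregate quality
  if ((result.metricScores.toxicity ?? 0) > 0.5) return true;   // safety concern
  return result.issueFlags.includes("potential_hallucination"); // flagged issue
}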
Evaluation Metrics
Quality Metrics
- Factuality: Accuracy of factual claims
Verifies information against reference sources
- Relevance: Alignment with the original query
Measures how well the response addresses the question
- Coherence: Logical flow and structure
Evaluates organization and clarity of ideas
- Conciseness: Efficiency of communication
Checks for wordiness and redundancy
- Completeness: Coverage of necessary information
Identifies missing key elements
Safety Metrics
- Toxicity: Harmful, offensive, or inappropriate content
Detects various forms of harmful language
- Bias: Unfair or prejudiced perspectives
Identifies demographic or ideological bias
- Hallucination: Fabricated or unsupported claims
Detects made-up facts or misleading information
- Instruction Following: Adherence to provided guidelines
Measures how well directions were followed
- Prompt Injection: Attempts to manipulate model behavior
Identifies potential security vulnerabilities
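When configuring the Evaluation Metrics input, the two groups above can be kept as named presets. The sketch below is illustrative and uses only the metric names listed under Evaluation Metrics in Component Inputs; the preset names are assumptions.
// Illustrative metric presets built from the groups above, using the
// metric names listed under Evaluation Metrics in Component Inputs
const QUALITY_METRICS = ["factuality", "relevance", "coherence", "conciseness", "completeness"];
const SAFETY_METRICS = ["toxicity", "bias", "hallucination"];

const evaluator = new LangWatchEvaluatorAgent({
  evaluationMetrics: [...QUALITY_METRICS, ...SAFETY_METRICS]
});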
How It Works
The LangWatch Evaluator Agent utilizes specialized language models and rule-based systems to analyze LLM outputs across multiple dimensions. It compares responses against reference data, identifies patterns associated with quality issues, and provides comprehensive analysis of content safety and effectiveness.
Evaluation Process
- Capturing input-output pairs from LLM interactions
- Applying selected evaluation metrics to analyze content
- Comparing factual claims against reference sources (if provided)
- Detecting potential issues like hallucination or bias
- Generating detailed scores and insights for each metric
- Aggregating results into a comprehensive evaluation report
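A minimal sketch of this flow is shown below. The helper functions are hypothetical placeholders for the steps above, not the component's internal implementation.
// Hypothetical outline of the evaluation flow described above (not the component's internals)
function scoreMetric(metric, inputText, outputText, referenceSources) {
  // Placeholder for the model- and rule-based checks; a real scorer would
  // compare claims against references, detect bias, and so on
  return { score: 1.0, flags: [], insights: [] };
}

function runEvaluation(inputText, outputText, metrics, referenceSources) {
  const report = { metricScores: {}, issueFlags: [], evaluationInsights: [] };
  for (const metric of metrics) {
    const { score, flags, insights } = scoreMetric(metric, inputText, outputText, referenceSources);
    report.metricScores[metric] = score;
    report.issueFlags.push(...flags);
    report.evaluationInsights.push(...insights);
  }
  // Aggregate per-metric scores into a single overall quality score
  const scores = Object.values(report.metricScores);
  report.overallQualityScore = scores.reduce((a, b) => a + b, 0) / scores.length;
  return report;
}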
Use Cases
- Quality Assurance: Monitor and maintain output quality in production systems
- Model Comparison: Evaluate different LLM models against consistent metrics
- Safety Monitoring: Identify and address potential harmful outputs
- Prompt Engineering: Refine prompts based on output quality metrics
- Compliance: Ensure outputs meet regulatory or policy requirements
Implementation Example
const langwatchEvaluator = new LangWatchEvaluatorAgent({
  evaluationMetrics: ["factuality", "relevance", "toxicity", "hallucination"],
  confidenceThreshold: 0.7,
  sampling: 0.25 // Evaluate 25% of all interactions
});
// Example evaluation of an LLM interaction
const userQuery = "What are the health benefits of drinking water?";
const llmResponse = "Drinking water offers numerous health benefits, including " +
  "better kidney function and maintained blood pressure, and it can cure cancer.";
const result = langwatchEvaluator.evaluate(userQuery, llmResponse);
// Output:
// {
//   overallQualityScore: 0.68,
//   metricScores: {
//     "factuality": 0.45,
//     "relevance": 0.95,
//     "toxicity": 0.02,
//     "hallucination": 0.65
//   },
//   issueFlags: ["factual_error", "potential_hallucination"],
//   evaluationInsights: [
//     {
//       segment: "it can cure cancer",
//       issue: "factual_error",
//       explanation: "Unsupported medical claim. Water consumption has health benefits but is not proven to cure cancer.",
//       confidence: 0.92
//     }
//   ]
// }
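Building on the example above, the returned scores and flags can drive follow-up actions. The handling below is illustrative, and the 0.7 threshold is an assumption rather than a recommended default.
// Illustrative follow-up on the result above: surface low-scoring metrics for review
const lowScoringMetrics = Object.entries(result.metricScores)
  .filter(([, score]) => score < 0.7) // example threshold
  .map(([metric]) => metric);
if (lowScoringMetrics.length > 0) {
  console.warn("Metrics below threshold:", lowScoringMetrics, "Flags:", result.issueFlags);
}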
Best Practices
- Select evaluation metrics relevant to your specific use case and requirements
- Use sampling for high-volume applications to balance performance and cost
- Provide reference sources when evaluating factuality to improve accuracy
- Establish baseline quality thresholds based on your application's needs
- Regularly review evaluation reports to identify systemic issues in your LLM outputs