Language Scanner

The Language Scanner detects the language of a prompt and checks it against a list of allowed languages, protecting against language-based manipulation attempts. It uses the XLM-RoBERTa model to identify the language of the text.

Figure: Language Scanner architecture (language detection workflow using XLM-RoBERTa)

Attack Prevention

Common Attack Vectors

  • Multilingual Jailbreaks: Attempts to bypass security controls by exploiting language-specific features
  • Character Overloading: Flooding the input with special characters or mixed-language text
  • Language Confusion: Mixing multiple languages in one prompt to confuse the model

Supported Languages

  • Arabic (ar)
  • Bulgarian (bg)
  • German (de)
  • Greek (el)
  • English (en)
  • Spanish (es)
  • French (fr)
  • Hindi (hi)
  • Italian (it)
  • Japanese (ja)
  • Dutch (nl)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)
  • Swahili (sw)
  • Thai (th)
  • Turkish (tr)
  • Urdu (ur)
  • Vietnamese (vi)
  • Chinese (zh)
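
Before configuring the scanner, it can help to check that every requested code is in the list above. The following is a minimal sketch only: the names SUPPORTED_LANGUAGES and unsupported_codes are hypothetical helpers for illustration, not part of the scanner's API.

```python
# Language codes from the "Supported Languages" list above.
SUPPORTED_LANGUAGES = frozenset({
    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
})


def unsupported_codes(valid_languages):
    """Return the requested codes that are not in the supported list, sorted.

    Hypothetical helper: codes are trimmed and lowercased before comparison.
    """
    normalized = {code.strip().lower() for code in valid_languages}
    return sorted(normalized - SUPPORTED_LANGUAGES)


if __name__ == "__main__":
    # "xx" is not a supported code, so it is reported back to the caller.
    print(unsupported_codes(["en", "FR", "xx"]))
```

Rejecting unknown codes early avoids a configuration in which an allowed language can never be matched by the detector.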

Configuration Options

  • valid_languages: List of allowed language codes (ISO 639-1)
  • match_type: Analysis mode
    • FULL: Complete text analysis
    • SENTENCE: Sentence-by-sentence scanning
  • model: Hugging Face model identifier used for detection (papluca/xlm-roberta-base-language-detection)
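
The options above can be pictured as a small configuration object. This is a sketch under stated assumptions, not the scanner's actual interface: the class LanguageScannerConfig, the enum MatchType, and the function split_for_analysis are hypothetical names, and the regex sentence splitter is a deliberately naive stand-in that only recognizes ".", "!" and "?" as sentence endings.

```python
import re
from dataclasses import dataclass
from enum import Enum


class MatchType(Enum):
    FULL = "full"          # analyze the whole text as one unit
    SENTENCE = "sentence"  # analyze each sentence separately


@dataclass(frozen=True)
class LanguageScannerConfig:
    """Hypothetical container mirroring the documented options."""
    valid_languages: tuple
    match_type: MatchType = MatchType.FULL
    model: str = "papluca/xlm-roberta-base-language-detection"


def split_for_analysis(text, match_type):
    """Return the chunks that would be passed to the language detector."""
    if match_type is MatchType.FULL:
        return [text]
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    config = LanguageScannerConfig(valid_languages=("en",),
                                   match_type=MatchType.SENTENCE)
    print(split_for_analysis("Hello there. Bonjour tout le monde!",
                             config.match_type))
```

Sentence-level scanning is the mode that can surface a single foreign-language sentence embedded in otherwise valid text, which full-text analysis may average away.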

Output Format

  • sanitized_prompt: The analyzed text
  • is_valid: Boolean indicating whether the detected language is allowed
  • risk_score: Confidence score for language detection

Note: If no language is detected with a confidence above the threshold, the scanner returns is_valid=True and risk_score=0. This prevents false positives in edge cases.
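
The output triple and the no-detection rule in the note can be sketched as follows. Several parts are assumptions made for illustration: the function name evaluate, the default threshold of 0.6, the input shape (one dictionary of language-to-confidence scores per analyzed chunk), and the choice to report the highest confidence among disallowed languages as risk_score. The scanner's real scoring may differ.

```python
def evaluate(prompt, scores_per_chunk, valid_languages, threshold=0.6):
    """Return (sanitized_prompt, is_valid, risk_score).

    scores_per_chunk: list of {language_code: confidence} dicts, one per
    analyzed chunk (assumed shape). threshold=0.6 is an assumed default.
    """
    allowed = set(valid_languages)

    # Keep only languages detected above the threshold, with their best score.
    detected = {}
    for scores in scores_per_chunk:
        for lang, score in scores.items():
            if score > threshold:
                detected[lang] = max(score, detected.get(lang, 0.0))

    # Documented rule: nothing above the threshold means valid with zero risk.
    if not detected:
        return prompt, True, 0.0

    disallowed = {lang: s for lang, s in detected.items() if lang not in allowed}
    if not disallowed:
        return prompt, True, 0.0

    # Assumption: risk is the strongest disallowed detection.
    return prompt, False, round(max(disallowed.values()), 2)


if __name__ == "__main__":
    print(evaluate("bonjour", [{"fr": 0.95, "en": 0.02}], ["en"]))
```

The prompt is returned unchanged in this sketch, since the scanner is described as analyzing the text rather than rewriting it.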

Tip: For multilingual applications, consider implementing language-specific validation rules and thresholds. Regular model updates help maintain detection accuracy.
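
One way to express language-specific thresholds is a lookup table with a default fallback. The sketch below is illustrative only: the names DEFAULT_THRESHOLD, LANGUAGE_THRESHOLDS, threshold_for and passes are hypothetical, and the numeric values are placeholders rather than recommended settings.

```python
# Assumed default; tune against your own traffic.
DEFAULT_THRESHOLD = 0.6

# Placeholder per-language overrides (illustrative values only).
LANGUAGE_THRESHOLDS = {
    "en": 0.5,
    "sw": 0.8,
    "ur": 0.8,
}


def threshold_for(lang):
    """Return the override for a language, or the default if none is set."""
    return LANGUAGE_THRESHOLDS.get(lang, DEFAULT_THRESHOLD)


def passes(lang, score):
    """True if a detection for `lang` is confident enough to count."""
    return score > threshold_for(lang)


if __name__ == "__main__":
    for lang, score in [("en", 0.55), ("sw", 0.55), ("de", 0.55)]:
        print(lang, score, passes(lang, score))
```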