Invisible Text Scanner

The Invisible Text Scanner identifies and removes non-printable Unicode characters from text inputs, protecting against steganography-based attacks and maintaining text integrity in LLM applications.

Invisible Text Scanner Architecture

Figure: Invisible text detection and sanitization workflow

Unicode Detection Zones

Private Use Areas (PUA)

  • Basic Multilingual Plane: U+E000 to U+F8FF
  • Supplementary PUA-A: U+F0000 to U+FFFFD
  • Supplementary PUA-B: U+100000 to U+10FFFD
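
The three ranges above can be checked with a plain code point comparison. The following is a minimal sketch, not the scanner's actual implementation; the names `PUA_RANGES`, `is_private_use`, and `find_private_use` are hypothetical and chosen only for illustration.

```python
# Private Use Area ranges copied from the list above (inclusive bounds).
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary PUA-A
    (0x100000, 0x10FFFD),  # Supplementary PUA-B
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character falls inside any PUA range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') pairs for every PUA character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]
```

For example, `find_private_use("a\U000F0000b")` reports a single hit at index 1.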

Character Categories

  • Cf: Format characters
  • Cc: Control characters
  • Co: Private use characters
  • Cn: Unassigned characters
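
In Python, the general category of a character is available through the standard `unicodedata` module, so the four categories above can be tested directly. This is a sketch under the assumption that the scanner flags every character in these categories; `INVISIBLE_CATEGORIES` and `find_by_category` are hypothetical names.

```python
import unicodedata

# The four general categories listed above.
INVISIBLE_CATEGORIES = frozenset({"Cf", "Cc", "Co", "Cn"})


def find_by_category(text: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', category) for each flagged character."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in INVISIBLE_CATEGORIES:
            hits.append((i, f"U+{ord(ch):04X}", cat))
    return hits
```

Note that Cc also covers tab, carriage return, and line feed, so a scanner that preserves formatting would likely need to exempt ordinary whitespace. The source does not say how this is handled.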

Attack Vectors

  • Hidden instructions in online reviews
  • Steganographic content in emails
  • Concealed prompts in website content
  • Masked instructions in security logs
  • Clipboard-based payload injection

Scanner Features

  • Non-printable character detection
  • Unicode PUA scanning
  • Automated text sanitization
  • Risk scoring system
  • Format preservation

Output Format

  • sanitized_prompt: Text with invisible characters removed
  • is_valid: Boolean indicating if invisible text was detected
  • risk_score: Proportion of the input's characters that are invisible
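
The three output fields can be illustrated with a small end-to-end sketch. It rests on several assumptions that the text above does not confirm: `is_valid` is taken to be true when no invisible characters were found, `risk_score` is the removed count divided by the input length, and tab, carriage return, and line feed are kept so that formatting survives. `ScanResult`, `scan`, and `ALLOWED_WHITESPACE` are hypothetical names.

```python
import unicodedata
from dataclasses import dataclass

INVISIBLE_CATEGORIES = frozenset({"Cf", "Cc", "Co", "Cn"})
# Assumption: ordinary whitespace controls are kept to preserve formatting.
ALLOWED_WHITESPACE = frozenset({"\n", "\r", "\t"})


@dataclass
class ScanResult:
    sanitized_prompt: str
    is_valid: bool      # assumption: True means nothing invisible was found
    risk_score: float   # removed characters / total characters


def is_invisible(ch: str) -> bool:
    """True for characters in the flagged categories, minus allowed whitespace."""
    if ch in ALLOWED_WHITESPACE:
        return False
    return unicodedata.category(ch) in INVISIBLE_CATEGORIES


def scan(prompt: str) -> ScanResult:
    """Strip invisible characters and report how many were present."""
    kept = [ch for ch in prompt if not is_invisible(ch)]
    removed = len(prompt) - len(kept)
    # Guard against division by zero on empty input.
    risk = removed / len(prompt) if prompt else 0.0
    return ScanResult("".join(kept), removed == 0, risk)
```

With these assumptions, `scan("ab\u200bcd")` yields the sanitized text `"abcd"`, marks the input as not valid, and reports a risk score of one in five.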

Note: Invisible characters are valid Unicode and have legitimate uses (for example, the zero-width joiner in emoji sequences), but they are uncommon in ordinary text, so an unusual concentration of them can be a sign of a steganographic attack. Regularly monitoring character distributions can help identify such patterns.
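
One simple way to monitor character distributions is to tally the share of each Unicode general category in an input and compare it against what is typical for your traffic. The sketch below only computes the tally; the function name `category_distribution` is hypothetical, and any alerting threshold would be an assumption of your own deployment.

```python
import unicodedata
from collections import Counter


def category_distribution(text: str) -> dict[str, float]:
    """Return the fraction of characters in each Unicode general category."""
    if not text:
        return {}
    counts = Counter(unicodedata.category(ch) for ch in text)
    total = len(text)
    return {cat: n / total for cat, n in counts.items()}
```

An input whose Cf or Co share is well above the usual baseline would be a candidate for closer inspection.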

Tip: Use the scanner as one layer of a broader security strategy that also includes input validation and sanitization. Consider logging detected invisible text patterns for later security analysis.
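
As a rough illustration of the logging suggestion, the sketch below records detected characters as code points rather than raw text, so the log itself stays readable. It uses Python's standard `logging` module; the logger name, the `source` parameter, and the function `log_invisible` are hypothetical.

```python
import logging
import unicodedata

logger = logging.getLogger("invisible_text")  # hypothetical logger name
INVISIBLE_CATEGORIES = frozenset({"Cf", "Cc", "Co", "Cn"})


def log_invisible(text: str, source: str = "unknown") -> list[str]:
    """Log and return the code points of flagged characters in text."""
    hits = [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES
    ]
    if hits:
        # Code points only: never write the raw invisible characters to the log.
        logger.warning(
            "invisible characters from %s: %s", source, ",".join(hits)
        )
    return hits
```

Calling `log_invisible("a\u200bb", source="review-form")` would emit one warning that lists `U+200B`.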