Invisible Text Scanner
The Invisible Text Scanner identifies and removes non-printable Unicode characters from text inputs, protecting against steganography-based attacks and maintaining text integrity in LLM applications.

Figure: Invisible text detection and sanitization workflow
Unicode Detection Zones
Private Use Areas (PUA)
- Basic Multilingual Plane: U+E000 to U+F8FF
- Supplementary PUA-A: U+F0000 to U+FFFFD
- Supplementary PUA-B: U+100000 to U+10FFFD
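The three ranges above can be checked directly against a character's code point. The following is a minimal sketch, not the scanner's actual implementation; the names `PUA_RANGES`, `is_private_use`, and `find_private_use` are hypothetical and chosen only for illustration.

```python
# Private Use Area ranges listed above, as inclusive (start, end) code points.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Basic Multilingual Plane PUA
    (0xF0000, 0xFFFFD),    # Supplementary PUA-A
    (0x100000, 0x10FFFD),  # Supplementary PUA-B
]


def is_private_use(ch: str) -> bool:
    """Return True if the single character falls inside any PUA range."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in PUA_RANGES)


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') pairs for every PUA character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


print(find_private_use("ab\ue000c"))  # [(2, 'U+E000')]
```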
Character Categories
- Cf: Format characters
- Cc: Control characters
- Co: Private use characters
- Cn: Unassigned characters
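Python's standard `unicodedata.category` returns these two-letter general categories, so the four categories above can be tested with a set lookup. This is a sketch under that assumption; `FLAGGED_CATEGORIES` and `flagged_chars` are illustrative names, not part of any documented API. Note that ordinary newlines and tabs are also category Cc, so a real scanner may need to exempt them.

```python
import unicodedata

# The four general categories listed above.
FLAGGED_CATEGORIES = {"Cf", "Cc", "Co", "Cn"}


def flagged_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, category) for each flagged character."""
    result = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in FLAGGED_CATEGORIES:
            result.append((i, f"U+{ord(ch):04X}", category))
    return result


# U+200B ZERO WIDTH SPACE is a format character (Cf);
# U+0007 BELL is a control character (Cc).
print(flagged_chars("a\u200bb\x07"))
# [(1, 'U+200B', 'Cf'), (3, 'U+0007', 'Cc')]
```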
Attack Vectors
- Hidden instructions in online reviews
- Steganographic content in emails
- Concealed prompts in website content
- Masked instructions in security logs
- Clipboard-based payload injection
Scanner Features
- Non-printable character detection
- Unicode PUA scanning
- Automated text sanitization
- Risk scoring system
- Format preservation
Output Format
- sanitized_prompt: Text with invisible characters removed
- is_valid: Boolean validity flag; false when invisible text was detected
- risk_score: Proportion of characters in the input that were invisible
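To make the three output fields concrete, here is a minimal sketch of a function that produces them. It is an illustration under stated assumptions rather than the scanner's real code: the function name `scan` and the `ALLOWED_WHITESPACE` exemption are hypothetical, `is_valid` is taken to be `True` when nothing was removed, and `risk_score` is computed as removed characters divided by input length.

```python
import unicodedata

FLAGGED_CATEGORIES = {"Cf", "Cc", "Co", "Cn"}

# Assumption: keep ordinary layout whitespace even though it is category Cc,
# so that line breaks and indentation survive sanitization.
ALLOWED_WHITESPACE = {"\n", "\r", "\t"}


def scan(prompt: str) -> tuple[str, bool, float]:
    """Return (sanitized_prompt, is_valid, risk_score) for the given text."""
    kept = []
    removed = 0
    for ch in prompt:
        is_flagged = (
            ch not in ALLOWED_WHITESPACE
            and unicodedata.category(ch) in FLAGGED_CATEGORIES
        )
        if is_flagged:
            removed += 1
        else:
            kept.append(ch)
    sanitized_prompt = "".join(kept)
    is_valid = removed == 0
    # Guard against division by zero on empty input.
    risk_score = removed / len(prompt) if prompt else 0.0
    return sanitized_prompt, is_valid, risk_score


print(scan("hi\u200b!"))       # ('hi!', False, 0.25)
print(scan("line1\nline2"))    # ('line1\nline2', True, 0.0)
```

Returning the sanitized text alongside the flag lets a caller choose between rejecting the input outright and continuing with the cleaned version.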
Note: Invisible characters are valid Unicode, and some have legitimate uses (for example, zero-width joiners in emoji sequences), but they are rare in most plain text, so an unusual concentration can be a sign of a steganographic attack. Regularly monitoring character distributions can help identify such patterns.
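One simple way to monitor character distributions is to tally the share of each Unicode general category in a piece of text and compare it against what is normal for your inputs. The sketch below assumes that approach; `category_distribution` is a hypothetical helper, and any alerting threshold would need to be tuned on real traffic.

```python
from collections import Counter
import unicodedata


def category_distribution(text: str) -> dict[str, float]:
    """Return the fraction of characters in each Unicode general category."""
    if not text:
        return {}
    counts = Counter(unicodedata.category(ch) for ch in text)
    total = len(text)
    return {category: n / total for category, n in counts.items()}


# Two lowercase letters (Ll) and two zero-width spaces (Cf).
print(category_distribution("ab\u200b\u200b"))  # {'Ll': 0.5, 'Cf': 0.5}
```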
Tip: Implement the scanner as part of a broader security strategy, including input validation and sanitization. Consider logging detected invisible text patterns for security analysis.
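Logging can be sketched with Python's standard `logging` module. The example below records code points rather than the raw characters, since the raw characters would be invisible in most log viewers. The logger name, the `log_invisible` function, and the `source` label are assumptions made for illustration.

```python
import logging
import unicodedata

logger = logging.getLogger("invisible_text")  # hypothetical logger name

FLAGGED_CATEGORIES = {"Cf", "Cc", "Co", "Cn"}
ALLOWED_WHITESPACE = {"\n", "\r", "\t"}  # assumption: ignore layout whitespace


def log_invisible(text: str, source: str) -> list[str]:
    """Log and return the code points of flagged characters found in text."""
    found = [
        f"U+{ord(ch):04X}"
        for ch in text
        if ch not in ALLOWED_WHITESPACE
        and unicodedata.category(ch) in FLAGGED_CATEGORIES
    ]
    if found:
        # Record code points only, so the log entry itself stays readable.
        logger.warning(
            "invisible characters from %s: %s", source, " ".join(found)
        )
    return found


# U+E0041 is a Unicode tag character, which belongs to category Cf.
log_invisible("a\u200bb\U000E0041", "review-form")
```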