This tool demonstrates encoding handling in Perl, detecting and repairing invalid UTF-8 sequences that silently break data pipelines.
Silent UTF-8 corruption breaks:
- JSON parsing (invalid byte = parse failure)
- Database imports (encoding mismatch)
- Log aggregation (search/index corruption)
- Data exports (downstream processing fails)
Most tools either crash or silently mangle data. utf8-doctor gives you visibility and control.
utf8-doctor validates UTF-8 byte sequences according to RFC 3629:
Valid UTF-8 patterns:
ASCII: 0xxxxxxx (U+0000..U+007F)
2-byte: 110xxxxx 10xxxxxx (U+0080..U+07FF)
3-byte: 1110xxxx 10xxxxxx 10xxxxxx (U+0800..U+FFFF)
4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xx (U+10000..U+10FFFF)
Invalid sequences detected:
- Continuation bytes without start (0x80-0xBF alone)
- Overlong encodings (using more bytes than necessary)
- Invalid start bytes (0xC0, 0xC1, 0xF5-0xFF)
- Truncated sequences (incomplete multi-byte)
- Surrogate halves (UTF-16 only, not valid UTF-8)
cd utf8-doctor
perl -Ilib bin/utf8-doctor --help# Check single file
utf8-doctor data.txt
# Check multiple files
utf8-doctor *.log /var/exports/*.csv
# Strict mode - fail on first error
utf8-doctor --strict config.json
# Quiet mode for scripts
if ! utf8-doctor -q incoming.json; then
echo "Encoding problem detected"
fi# Repair to new file
utf8-doctor --repair -o fixed.txt broken.txt
# Repair in place
utf8-doctor --repair broken.txt
# Strip invalid bytes instead of replacing
utf8-doctor --repair --strip -o clean.txt dirty.txt
# Custom replacement character
utf8-doctor --repair -r '?' -o fixed.csv broken.csv# Detect encoding
utf8-doctor --detect mystery.txt
# UTF-8 (95% confidence)
# Verbose detection with indicators
utf8-doctor --detect -v export.csv
# export.csv: UTF-8 (95% confidence)
# - Valid UTF-8 with multi-byte sequencesuse UTF8Doctor;
my $doctor = UTF8Doctor->new(
verbose => 1,
max_errors => 100,
on_error => sub {
my $err = shift;
print "Error at line $err->{line}: $err->{description}\n";
},
);
# Validate
my $result = $doctor->validate_file('data.txt');
if (!$result->{valid}) {
print "Found $result->{stats}{errors_found} errors\n";
for my $err (@{$result->{errors}}) {
print " Line $err->{line}, offset $err->{offset_in_line}: $err->{hex_value}\n";
}
}
# Repair
my $repair = $doctor->repair_file('broken.txt', 'fixed.txt');
print "Made $repair->{repairs_made} repairs\n";
# Detect
my $enc = $doctor->detect_encoding('file.txt');
print "Encoding: $enc->{encoding} ($enc->{confidence}% confidence)\n";| Option | Description |
|---|---|
-V, --validate |
Validate mode (default) |
-R, --repair |
Repair mode |
-D, --detect |
Detect encoding mode |
-s, --strict |
Exit on first error |
-o, --output=FILE |
Output file for repair |
-r, --replacement=CHAR |
Replacement character (default: U+FFFD) |
-S, --strip |
Remove invalid bytes instead of replacing |
-v, --verbose |
Show byte context |
-q, --quiet |
Minimal output |
-m, --max-errors=N |
Stop after N errors |
-c, --context=N |
Bytes of context to show (default: 20) |
| Code | Meaning |
|---|---|
| 0 | Valid / repaired successfully |
| 1 | Encoding errors found |
| 2 | File access error |
Included is generate-test-data for creating test files with controlled encoding errors:
# Generate 1000 lines with 10% encoding errors
bin/generate-test-data -l 1000 -r 0.1 -o test.txt
# Specific error types
bin/generate-test-data -t latin1 -l 100 -o latin1_mix.txt
bin/generate-test-data -t overlong -l 100 -o overlong.txt
# Reproducible output
bin/generate-test-data --seed 42 -o deterministic.txt| Type | Description |
|---|---|
continuation |
Bare continuation bytes (0x80-0xBF) |
overlong |
Overlong UTF-8 encodings |
invalid_start |
Invalid start bytes (0xC0, 0xC1, 0xF5+) |
truncated |
Incomplete multi-byte sequences |
surrogate |
UTF-16 surrogate halves |
latin1 |
Latin-1 high bytes |
windows |
Windows-1252 special chars |
mixed |
Random mix (default) |
prove -l t/Tests cover:
- Valid UTF-8 validation
- Invalid byte detection (continuation, overlong, invalid start, truncated)
- Error location tracking (line, offset)
- Strict mode
- Repair functionality
- Strip mode
- Custom replacement
- In-place repair
- Encoding detection (BOM, heuristics)
- Byte-level validation: Validates actual UTF-8 byte sequences, not Perl's internal representation
- Raw I/O: Uses
:rawlayer to avoid Perl's encoding transformations - Atomic repair: In-place repairs use temp file + rename pattern
- RFC 3629 strict: Rejects overlong encodings and surrogates (security-relevant)
- Context preservation: Shows surrounding bytes to help diagnose source of corruption
| Symptom | Likely Cause | Solution |
|---|---|---|
0x80-0x9F bytes |
Windows-1252 "smart quotes" | Repair with replacement |
0xE9 alone |
Latin-1 text labeled as UTF-8 | Convert from ISO-8859-1 |
0xC0 0xAF |
Overlong encoding (security issue) | Repair/reject |
0xED 0xA0-0xBF |
UTF-16 surrogate in UTF-8 | Repair/reject |
Ed Bates — TECHBLIP LLC
Licensed under the Apache License, Version 2.0.