utf8-doctor

This tool demonstrates encoding handling in Perl, detecting and repairing invalid UTF-8 sequences that silently break data pipelines.

Problem It Solves

Silent UTF-8 corruption breaks:

JSON parsing (invalid byte = parse failure)
Database imports (encoding mismatch)
Log aggregation (search/index corruption)
Data exports (downstream processing fails)

Most tools either crash or silently mangle data. utf8-doctor gives you visibility and control.

How It Works

utf8-doctor validates UTF-8 byte sequences according to RFC 3629:

Valid UTF-8 patterns:
  ASCII:      0xxxxxxx                         (U+0000..U+007F)
  2-byte:     110xxxxx 10xxxxxx                (U+0080..U+07FF)
  3-byte:     1110xxxx 10xxxxxx 10xxxxxx       (U+0800..U+FFFF)
  4-byte:     11110xxx 10xxxxxx 10xxxxxx 10xx  (U+10000..U+10FFFF)

Invalid sequences detected:
  - Continuation bytes without start (0x80-0xBF alone)
  - Overlong encodings (using more bytes than necessary)
  - Invalid start bytes (0xC0, 0xC1, 0xF5-0xFF)
  - Truncated sequences (incomplete multi-byte)
  - Surrogate halves (UTF-16 only, not valid UTF-8)

Installation

cd utf8-doctor
perl -Ilib bin/utf8-doctor --help

Usage

Validate Files

# Check single file
utf8-doctor data.txt

# Check multiple files
utf8-doctor *.log /var/exports/*.csv

# Strict mode - fail on first error
utf8-doctor --strict config.json

# Quiet mode for scripts
if ! utf8-doctor -q incoming.json; then
    echo "Encoding problem detected"
fi

Repair Files

# Repair to new file
utf8-doctor --repair -o fixed.txt broken.txt

# Repair in place
utf8-doctor --repair broken.txt

# Strip invalid bytes instead of replacing
utf8-doctor --repair --strip -o clean.txt dirty.txt

# Custom replacement character
utf8-doctor --repair -r '?' -o fixed.csv broken.csv

Detect Encoding

# Detect encoding
utf8-doctor --detect mystery.txt
# UTF-8 (95% confidence)

# Verbose detection with indicators
utf8-doctor --detect -v export.csv
# export.csv: UTF-8 (95% confidence)
#   - Valid UTF-8 with multi-byte sequences

Perl API

use UTF8Doctor;

my $doctor = UTF8Doctor->new(
    verbose    => 1,
    max_errors => 100,
    on_error   => sub {
        my $err = shift;
        print "Error at line $err->{line}: $err->{description}\n";
    },
);

# Validate
my $result = $doctor->validate_file('data.txt');
if (!$result->{valid}) {
    print "Found $result->{stats}{errors_found} errors\n";
    for my $err (@{$result->{errors}}) {
        print "  Line $err->{line}, offset $err->{offset_in_line}: $err->{hex_value}\n";
    }
}

# Repair
my $repair = $doctor->repair_file('broken.txt', 'fixed.txt');
print "Made $repair->{repairs_made} repairs\n";

# Detect
my $enc = $doctor->detect_encoding('file.txt');
print "Encoding: $enc->{encoding} ($enc->{confidence}% confidence)\n";

Options

Option	Description
`-V, --validate`	Validate mode (default)
`-R, --repair`	Repair mode
`-D, --detect`	Detect encoding mode
`-s, --strict`	Exit on first error
`-o, --output=FILE`	Output file for repair
`-r, --replacement=CHAR`	Replacement character (default: U+FFFD)
`-S, --strip`	Remove invalid bytes instead of replacing
`-v, --verbose`	Show byte context
`-q, --quiet`	Minimal output
`-m, --max-errors=N`	Stop after N errors
`-c, --context=N`	Bytes of context to show (default: 20)

Exit Codes

Code	Meaning
0	Valid / repaired successfully
1	Encoding errors found
2	File access error

Synthetic Data Generator

Included is generate-test-data for creating test files with controlled encoding errors:

# Generate 1000 lines with 10% encoding errors
bin/generate-test-data -l 1000 -r 0.1 -o test.txt

# Specific error types
bin/generate-test-data -t latin1 -l 100 -o latin1_mix.txt
bin/generate-test-data -t overlong -l 100 -o overlong.txt

# Reproducible output
bin/generate-test-data --seed 42 -o deterministic.txt

Error Types

Type	Description
`continuation`	Bare continuation bytes (0x80-0xBF)
`overlong`	Overlong UTF-8 encodings
`invalid_start`	Invalid start bytes (0xC0, 0xC1, 0xF5+)
`truncated`	Incomplete multi-byte sequences
`surrogate`	UTF-16 surrogate halves
`latin1`	Latin-1 high bytes
`windows`	Windows-1252 special chars
`mixed`	Random mix (default)

Running Tests

prove -l t/

Tests cover:

Valid UTF-8 validation
Invalid byte detection (continuation, overlong, invalid start, truncated)
Error location tracking (line, offset)
Strict mode
Repair functionality
Strip mode
Custom replacement
In-place repair
Encoding detection (BOM, heuristics)

Design Decisions

Byte-level validation: Validates actual UTF-8 byte sequences, not Perl's internal representation
Raw I/O: Uses :raw layer to avoid Perl's encoding transformations
Atomic repair: In-place repairs use temp file + rename pattern
RFC 3629 strict: Rejects overlong encodings and surrogates (security-relevant)
Context preservation: Shows surrounding bytes to help diagnose source of corruption

Common Encoding Problems

Symptom	Likely Cause	Solution
`0x80-0x9F` bytes	Windows-1252 "smart quotes"	Repair with replacement
`0xE9` alone	Latin-1 text labeled as UTF-8	Convert from ISO-8859-1
`0xC0 0xAF`	Overlong encoding (security issue)	Repair/reject
`0xED 0xA0-0xBF`	UTF-16 surrogate in UTF-8	Repair/reject

Author

Ed Bates — TECHBLIP LLC

License

Licensed under the Apache License, Version 2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
bin		bin
lib		lib
t		t
.gitignore		.gitignore
AUTHORS		AUTHORS
LICENSE		LICENSE
README.md		README.md
curl_headers.txt		curl_headers.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

utf8-doctor

Problem It Solves

How It Works

Installation

Usage

Validate Files

Repair Files

Detect Encoding

Perl API

Options

Exit Codes

Synthetic Data Generator

Error Types

Running Tests

Design Decisions

Common Encoding Problems

See Also

Author

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

utf8-doctor

Problem It Solves

How It Works

Installation

Usage

Validate Files

Repair Files

Detect Encoding

Perl API

Options

Exit Codes

Synthetic Data Generator

Error Types

Running Tests

Design Decisions

Common Encoding Problems

See Also

Author

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages