Skip to content

edbzed/utf8-doctor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

utf8-doctor

License Perl CLI Platform

This tool demonstrates encoding handling in Perl, detecting and repairing invalid UTF-8 sequences that silently break data pipelines.

Problem It Solves

Silent UTF-8 corruption breaks:

  • JSON parsing (invalid byte = parse failure)
  • Database imports (encoding mismatch)
  • Log aggregation (search/index corruption)
  • Data exports (downstream processing fails)

Most tools either crash or silently mangle data. utf8-doctor gives you visibility and control.

How It Works

utf8-doctor validates UTF-8 byte sequences according to RFC 3629:

Valid UTF-8 patterns:
  ASCII:      0xxxxxxx                         (U+0000..U+007F)
  2-byte:     110xxxxx 10xxxxxx                (U+0080..U+07FF)
  3-byte:     1110xxxx 10xxxxxx 10xxxxxx       (U+0800..U+FFFF)
  4-byte:     11110xxx 10xxxxxx 10xxxxxx 10xx  (U+10000..U+10FFFF)

Invalid sequences detected:
  - Continuation bytes without start (0x80-0xBF alone)
  - Overlong encodings (using more bytes than necessary)
  - Invalid start bytes (0xC0, 0xC1, 0xF5-0xFF)
  - Truncated sequences (incomplete multi-byte)
  - Surrogate halves (UTF-16 only, not valid UTF-8)

Installation

cd utf8-doctor
perl -Ilib bin/utf8-doctor --help

Usage

Validate Files

# Check single file
utf8-doctor data.txt

# Check multiple files
utf8-doctor *.log /var/exports/*.csv

# Strict mode - fail on first error
utf8-doctor --strict config.json

# Quiet mode for scripts
if ! utf8-doctor -q incoming.json; then
    echo "Encoding problem detected"
fi

Repair Files

# Repair to new file
utf8-doctor --repair -o fixed.txt broken.txt

# Repair in place
utf8-doctor --repair broken.txt

# Strip invalid bytes instead of replacing
utf8-doctor --repair --strip -o clean.txt dirty.txt

# Custom replacement character
utf8-doctor --repair -r '?' -o fixed.csv broken.csv

Detect Encoding

# Detect encoding
utf8-doctor --detect mystery.txt
# UTF-8 (95% confidence)

# Verbose detection with indicators
utf8-doctor --detect -v export.csv
# export.csv: UTF-8 (95% confidence)
#   - Valid UTF-8 with multi-byte sequences

Perl API

use UTF8Doctor;

my $doctor = UTF8Doctor->new(
    verbose    => 1,
    max_errors => 100,
    on_error   => sub {
        my $err = shift;
        print "Error at line $err->{line}: $err->{description}\n";
    },
);

# Validate
my $result = $doctor->validate_file('data.txt');
if (!$result->{valid}) {
    print "Found $result->{stats}{errors_found} errors\n";
    for my $err (@{$result->{errors}}) {
        print "  Line $err->{line}, offset $err->{offset_in_line}: $err->{hex_value}\n";
    }
}

# Repair
my $repair = $doctor->repair_file('broken.txt', 'fixed.txt');
print "Made $repair->{repairs_made} repairs\n";

# Detect
my $enc = $doctor->detect_encoding('file.txt');
print "Encoding: $enc->{encoding} ($enc->{confidence}% confidence)\n";

Options

Option Description
-V, --validate Validate mode (default)
-R, --repair Repair mode
-D, --detect Detect encoding mode
-s, --strict Exit on first error
-o, --output=FILE Output file for repair
-r, --replacement=CHAR Replacement character (default: U+FFFD)
-S, --strip Remove invalid bytes instead of replacing
-v, --verbose Show byte context
-q, --quiet Minimal output
-m, --max-errors=N Stop after N errors
-c, --context=N Bytes of context to show (default: 20)

Exit Codes

Code Meaning
0 Valid / repaired successfully
1 Encoding errors found
2 File access error

Synthetic Data Generator

Included is generate-test-data for creating test files with controlled encoding errors:

# Generate 1000 lines with 10% encoding errors
bin/generate-test-data -l 1000 -r 0.1 -o test.txt

# Specific error types
bin/generate-test-data -t latin1 -l 100 -o latin1_mix.txt
bin/generate-test-data -t overlong -l 100 -o overlong.txt

# Reproducible output
bin/generate-test-data --seed 42 -o deterministic.txt

Error Types

Type Description
continuation Bare continuation bytes (0x80-0xBF)
overlong Overlong UTF-8 encodings
invalid_start Invalid start bytes (0xC0, 0xC1, 0xF5+)
truncated Incomplete multi-byte sequences
surrogate UTF-16 surrogate halves
latin1 Latin-1 high bytes
windows Windows-1252 special chars
mixed Random mix (default)

Running Tests

prove -l t/

Tests cover:

  • Valid UTF-8 validation
  • Invalid byte detection (continuation, overlong, invalid start, truncated)
  • Error location tracking (line, offset)
  • Strict mode
  • Repair functionality
  • Strip mode
  • Custom replacement
  • In-place repair
  • Encoding detection (BOM, heuristics)

Design Decisions

  1. Byte-level validation: Validates actual UTF-8 byte sequences, not Perl's internal representation
  2. Raw I/O: Uses :raw layer to avoid Perl's encoding transformations
  3. Atomic repair: In-place repairs use temp file + rename pattern
  4. RFC 3629 strict: Rejects overlong encodings and surrogates (security-relevant)
  5. Context preservation: Shows surrounding bytes to help diagnose source of corruption

Common Encoding Problems

Symptom Likely Cause Solution
0x80-0x9F bytes Windows-1252 "smart quotes" Repair with replacement
0xE9 alone Latin-1 text labeled as UTF-8 Convert from ISO-8859-1
0xC0 0xAF Overlong encoding (security issue) Repair/reject
0xED 0xA0-0xBF UTF-16 surrogate in UTF-8 Repair/reject

See Also

Author

Ed Bates — TECHBLIP LLC

License

Licensed under the Apache License, Version 2.0.

About

UTF-8 validation and repair tool that detects and fixes encoding corruption

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages