GoPDF - High-Performance PDF Processing Library

GoPDF is a powerful PDF processing library written in Go, focused on efficient text extraction, content analysis, and multilingual support. Built with a modular architecture, it provides high-performance concurrent processing capabilities.

✨ Key Features

📖 Text Extraction & Analysis

  • Intelligent Text Extraction: Supports plain text and styled text extraction
  • Semantic Classification: Automatic identification of titles, paragraphs, lists, tables, and other content types
  • Multilingual Support: Built-in English, French, German, and Spanish language detection and processing
  • Layout Analysis: Smart handling of multi-column layouts and complex page structures

🚀 Performance Optimization

  • Memory Optimization (NEW): Targeted allocation reduction for high-volume processing
    • Pre-allocated slices with capacity estimation (30-40% allocation reduction)
    • Eliminated unnecessary copies in hot paths (50% memory reduction in sorting)
    • Precise capacity calculation in merge operations (100+ allocations → 3)
    • Optimized string builder growth (40-50% reduction in string operations)
  • Sharded Caching: 256-shard cache with lock-free statistics (70-80% lock contention reduction)
  • Font Prefetching: Intelligent pattern-based font preloading with priority queuing
  • Zero-Copy Strings: Unsafe pointer optimization reducing memory allocation by 30-50%
  • Pool Warmup: Startup memory pool pre-warming reducing first-access latency by 60-80%
  • Enhanced Parallel Processing: Adaptive worker pools with batch processing (50% scheduling overhead reduction)
  • Memory Management: Advanced object pooling and resource management
  • Spatial Indexing: R-tree spatial indexing for optimized layout analysis
  • Asynchronous I/O: Streaming support for large files
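
The "precise capacity calculation in merge operations" item can be illustrated with a minimal sketch. It is not the library's code: `mergeRuns` is a hypothetical helper that only shows the general technique of summing the input lengths first, so the destination slice is allocated once instead of growing through repeated `append` calls.

```go
package main

import "fmt"

// mergeRuns concatenates several text runs into one slice.
// The total length is computed first so the destination is allocated
// exactly once instead of being regrown by append.
func mergeRuns(runs [][]string) []string {
	total := 0
	for _, r := range runs {
		total += len(r)
	}
	merged := make([]string, 0, total) // single allocation with exact capacity
	for _, r := range runs {
		merged = append(merged, r...)
	}
	return merged
}

func main() {
	runs := [][]string{{"Go", "PDF"}, {"text"}, {"extraction", "demo"}}
	merged := mergeRuns(runs)
	fmt.Println(len(merged), cap(merged), merged)
}
```

Because the capacity equals the final length, none of the `append` calls inside the loop has to reallocate.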

🔧 Technical Features

  • Encoding Support: UTF-16, PDFDocEncoding, WinAnsi, MacRoman, and more
  • Compression Formats: Flate, LZW, ASCII85, RunLength
  • Encryption Support: RC4, AES encrypted PDFs
  • PDF Compatibility: Comprehensive PDF version and feature compatibility checking
  • PDF Recovery: Automatic recovery from malformed or corrupted PDF files
  • Thread Safety: Fully concurrent-safe operations
  • Robust Error Handling: Graceful degradation for malformed PDFs
    • Library never panics on invalid input (errors returned instead)
    • Tolerates missing PDF structure elements (endobj, endstream, etc.)
    • Handles malformed hex strings, names, and escape sequences
    • Graceful handling of truncated or corrupted files
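
To make the filter list concrete, here is a small standard-library sketch of a stream that is Flate-compressed and then ASCII85-encoded, and of how such a chain is undone. It is an illustration, not the library's decoder: `encodeStream` and `decodeStream` are hypothetical names, and the sketch assumes the ASCII85 data has already had its `~>` end marker stripped. Every failure is returned as an error rather than causing a panic, in the spirit of the error-handling bullets above.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/ascii85"
	"fmt"
	"io"
)

// encodeStream compresses data with zlib (the format behind FlateDecode)
// and wraps the result in ASCII85, producing sample input for decodeStream.
func encodeStream(data []byte) ([]byte, error) {
	var compressed bytes.Buffer
	zw := zlib.NewWriter(&compressed)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	var encoded bytes.Buffer
	aw := ascii85.NewEncoder(&encoded)
	if _, err := aw.Write(compressed.Bytes()); err != nil {
		return nil, err
	}
	if err := aw.Close(); err != nil {
		return nil, err
	}
	return encoded.Bytes(), nil
}

// decodeStream reverses the chain: ASCII85 first, then Flate.
// Invalid input yields an error instead of a panic.
func decodeStream(encoded []byte) ([]byte, error) {
	ar := ascii85.NewDecoder(bytes.NewReader(encoded))
	zr, err := zlib.NewReader(ar)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	encoded, err := encodeStream([]byte("BT /F1 12 Tf (Hello) Tj ET"))
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	decoded, err := decodeStream(encoded)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s\n", decoded)
}
```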

📦 Installation

go get -u github.com/Geek0x0/pdf

🚀 Quick Start

Basic Text Extraction

package main

import (
    "io"
    "log"
    "os"

    "github.com/Geek0x0/pdf"
)

func main() {
    // Open PDF file
    file, reader, err := pdf.Open("example.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // Extract plain text
    textReader, err := reader.GetPlainText()
    if err != nil {
        log.Fatal(err)
    }

    // Stream the extracted text to standard output
    if _, err := io.Copy(os.Stdout, textReader); err != nil {
        log.Fatal(err)
    }
}

⚑ Performance Quick Start

For high-performance PDF processing, follow these optimization steps:

1. Optimized Application Startup

import (
    "log"

    "github.com/Geek0x0/pdf"
)

func init() {
    // Pre-warm memory pools and optimize GC settings
    config := pdf.DefaultStartupConfig()
    config.WarmupPools = true
    config.GCPercent = 200  // Reduce GC frequency
    
    if err := pdf.OptimizedStartup(config); err != nil {
        log.Fatalf("Startup optimization failed: %v", err)
    }
}

2. Use Parallel Extraction for Large Documents

import (
    "context"
    "time"

    "github.com/Geek0x0/pdf"
)

func extractLargeDocument(filename string) ([]string, error) {
    f, r, err := pdf.Open(filename)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()
    
    // Passing 0 as the worker count uses runtime.NumCPU() workers
    return r.ExtractAllPagesParallel(ctx, 0)
}

3. Enable Caching for Repeated Operations

// Create global cache
var globalCache = pdf.NewShardedCache(100000, 1*time.Hour)

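func getPageText(reader *pdf.Reader, pageNum int) (string, error) {
    // The key identifies only the page number; when one cache serves several
    // documents, add a document identifier to the key to avoid collisions.
    cacheKey := fmt.Sprintf("page_%d", pageNum)
    
    // Check cache first (the checked assertion avoids a panic on a non-string entry)
    if cached, ok := globalCache.Get(cacheKey); ok {
        if text, isString := cached.(string); isString {
            return text, nil
        }
    }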
    
    // Extract and cache
    page := reader.Page(pageNum)
    text, err := page.GetPlainText(nil)
    if err == nil {
        globalCache.Set(cacheKey, text, int64(len(text)))
    }
    
    return text, err
}
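
The idea behind a sharded cache can be sketched in a few lines. The code below is an assumption-laden illustration rather than the library's `ShardedCache`: `ShardedMap` is a hypothetical type, it has no TTL or eviction, and FNV-1a is simply one reasonable hash for picking a shard. Each shard has its own lock, and the hit and miss counters are atomic so that reading statistics takes no lock at all.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"sync/atomic"
)

const shardCount = 256

type shard struct {
	mu    sync.RWMutex
	items map[string]string
}

// ShardedMap spreads keys over independently locked shards so that
// goroutines touching different keys rarely contend on the same mutex.
type ShardedMap struct {
	shards [shardCount]shard
	hits   atomic.Int64 // counters are updated without taking a lock
	misses atomic.Int64
}

func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i].items = make(map[string]string)
	}
	return m
}

// shardFor hashes the key and selects one of the 256 shards.
func (m *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &m.shards[h.Sum32()%shardCount]
}

func (m *ShardedMap) Set(key, value string) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.items[key] = value
	s.mu.Unlock()
}

func (m *ShardedMap) Get(key string) (string, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	value, ok := s.items[key]
	s.mu.RUnlock()
	if ok {
		m.hits.Add(1)
	} else {
		m.misses.Add(1)
	}
	return value, ok
}

func (m *ShardedMap) Stats() (hits, misses int64) {
	return m.hits.Load(), m.misses.Load()
}

func main() {
	cache := NewShardedMap()
	cache.Set("doc-a/page_1", "first page text")
	if text, ok := cache.Get("doc-a/page_1"); ok {
		fmt.Println(text)
	}
	cache.Get("doc-a/page_2")
	hits, misses := cache.Stats()
	fmt.Println("hits:", hits, "misses:", misses)
}
```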

4. Use Zero-Copy for String Operations

func processTexts(texts []string) string {
    // Fast zero-copy string operations
    builder := pdf.NewStringBuffer(10240)
    
    for _, text := range texts {
        trimmed := pdf.TrimSpaceZeroCopy(text)
        builder.WriteString(trimmed)
        builder.WriteByte('\n')
    }
    
    return builder.StringCopy()
}
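
What "zero-copy" typically means in Go can be shown with the standard `unsafe` package (Go 1.20 or later). The helpers below are hypothetical stand-ins, not the library's `BytesToString` or `TrimSpaceZeroCopy`: the first reinterprets a byte slice as a string without copying, and the second trims ASCII whitespace by slicing, which never copies string data.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying.
// The caller must not modify b afterwards, because the string shares its memory.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(&b[0], len(b))
}

// trimSpaceASCII returns a substring of s with leading and trailing ASCII
// whitespace removed. Slicing a string never copies its bytes.
func trimSpaceASCII(s string) string {
	start, end := 0, len(s)
	for start < end && isSpace(s[start]) {
		start++
	}
	for end > start && isSpace(s[end-1]) {
		end--
	}
	return s[start:end]
}

func isSpace(c byte) bool {
	return c == ' ' || c == '\t' || c == '\n' || c == '\r'
}

func main() {
	raw := []byte("  extracted text \n")
	fmt.Printf("%q\n", trimSpaceASCII(bytesToString(raw)))
}
```

The price of skipping the copy is aliasing: if the underlying byte slice is reused or modified, the string changes with it. That is presumably why the example above finishes with `StringCopy()`.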

PDF Compatibility Checking

// Check PDF compatibility and features
data, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}

compat, err := pdf.CheckPDFCompatibility(data)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("PDF Version: %s\n", compat.Version)
fmt.Printf("Is Linearized: %v\n", compat.IsLinearized)
fmt.Printf("Has Transparency: %v\n", compat.HasTransparency)
fmt.Printf("Has Forms: %v\n", compat.HasForms)

if len(compat.Warnings) > 0 {
    fmt.Println("Warnings:")
    for _, warning := range compat.Warnings {
        fmt.Printf("  - %s\n", warning)
    }
}

// Validate PDF/A compliance
issues, err := pdf.ValidatePDFA(data)
if err != nil {
    log.Fatal(err)
}

if len(issues) == 0 {
    fmt.Println("PDF/A validation passed")
} else {
    fmt.Println("PDF/A validation issues:")
    for _, issue := range issues {
        fmt.Printf("  - %s\n", issue)
    }
}
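
The most basic compatibility fact, the PDF version, comes from the `%PDF-M.m` header at the start of the file. The sketch below shows one way to read it with the standard library. `parseHeaderVersion` is a hypothetical helper, and the 1024-byte search window is an assumption made to tolerate leading junk; nothing here is claimed to match `CheckPDFCompatibility`.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

func isDigit(c byte) bool { return c >= '0' && c <= '9' }

// parseHeaderVersion returns the version from a "%PDF-M.m" header.
// It searches only the first 1024 bytes (an assumed window) so that
// files with a little leading junk are still recognised.
func parseHeaderVersion(data []byte) (string, error) {
	const prefix = "%PDF-"
	window := data
	if len(window) > 1024 {
		window = window[:1024]
	}
	i := bytes.Index(window, []byte(prefix))
	if i < 0 {
		return "", errors.New("no %PDF- header found")
	}
	rest := window[i+len(prefix):]
	if len(rest) < 3 || !isDigit(rest[0]) || rest[1] != '.' || !isDigit(rest[2]) {
		return "", errors.New("malformed PDF version")
	}
	return string(rest[:3]), nil
}

func main() {
	version, err := parseHeaderVersion([]byte("%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
	if err != nil {
		fmt.Println("not a PDF:", err)
		return
	}
	fmt.Println("PDF version:", version)
}
```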

PDF Integrity Checking and Recovery

// Check PDF integrity before processing
f, err := os.Open("potentially_corrupted.pdf")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

stat, err := f.Stat()
if err != nil {
    log.Fatal(err)
}

integrity := pdf.CheckIntegrity(f, stat.Size())
fmt.Printf("PDF Valid: %v\n", integrity.IsValid)
fmt.Printf("Is Truncated: %v\n", integrity.IsTruncated)
fmt.Printf("Estimated Objects: %d\n", integrity.EstimatedObjects)

if len(integrity.Issues) > 0 {
    fmt.Println("Issues found:")
    for _, issue := range integrity.Issues {
        fmt.Printf("  - %s\n", issue)
    }
}

// Attempt to recover corrupted PDF
data, err := os.ReadFile("corrupted.pdf")
if err != nil {
    log.Fatal(err)
}

recovered, err := pdf.RecoverPDF(data)
if err != nil {
    log.Printf("Recovery failed: %v", err)
} else {
    fmt.Printf("Recovered PDF size: %d bytes\n", len(recovered))
    // Save recovered PDF
    err = os.WriteFile("recovered.pdf", recovered, 0644)
    if err != nil {
        log.Fatal(err)
    }
}
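
As a rough picture of what an integrity check can look for, the sketch below inspects a file's tail for the `%%EOF` and `startxref` markers and counts object headers with a regular expression. It is a simplified stand-in and not the library's `CheckIntegrity`: `inspect`, `quickCheck`, and the 1024-byte tail window are assumptions chosen for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// objHeader matches object headers such as "12 0 obj".
var objHeader = regexp.MustCompile(`\b\d+\s+\d+\s+obj\b`)

type quickCheck struct {
	HasEOF       bool // "%%EOF" found near the end of the file
	HasStartXRef bool // "startxref" found near the end of the file
	ObjectCount  int  // number of "N G obj" headers in the whole file
}

// inspect looks at the last 1024 bytes for the trailer markers and counts
// object headers. A missing %%EOF is a common sign of truncation.
func inspect(data []byte) quickCheck {
	tail := data
	if len(tail) > 1024 {
		tail = tail[len(tail)-1024:]
	}
	return quickCheck{
		HasEOF:       bytes.Contains(tail, []byte("%%EOF")),
		HasStartXRef: bytes.Contains(tail, []byte("startxref")),
		ObjectCount:  len(objHeader.FindAllIndex(data, -1)),
	}
}

func main() {
	complete := []byte("%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\nstartxref\n9\n%%EOF\n")
	truncated := complete[:40]
	fmt.Printf("complete:  %+v\n", inspect(complete))
	fmt.Printf("truncated: %+v\n", inspect(truncated))
}
```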

High-Performance Parallel Extraction

import "context"

// Extract all pages in parallel with all optimizations
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()

// Automatically uses runtime.NumCPU() workers when workers=0
pages, err := reader.ExtractAllPagesParallel(ctx, 0)
if err != nil {
    log.Fatal(err)
}

for i, text := range pages {
    fmt.Printf("Page %d: %d characters\n", i+1, len(text))
}

Using ParallelExtractor Directly

// Create parallel extractor with custom worker count
extractor := pdf.NewParallelExtractor(4)
defer extractor.Close()

// Collect pages
numPages := reader.NumPage()
pages := make([]pdf.Page, numPages)
for i := 0; i < numPages; i++ {
    pages[i] = reader.Page(i + 1)
    pages[i].SetFontCacheInterface(extractor.GetCache())
}

// Extract with context
results, err := extractor.ExtractAllPages(ctx, pages)
if err != nil {
    log.Fatal(err)
}

// Get performance stats
cacheStats := extractor.GetCacheStats()
fmt.Printf("Cache hits: %d, misses: %d\n", cacheStats.Hits, cacheStats.Misses)

Multilingual Text Processing

// Create multilingual processor
processor := pdf.NewMultiLangProcessor()

// Detect text language
result := processor.DetectLanguage("Hello world! Bonjour le monde!")
fmt.Printf("Detected language: %s (confidence: %.2f)\n", result.Language, result.Confidence)

// Extract text by language
extractor := pdf.NewLanguageTextExtractor()
textsByLang, err := extractor.ExtractTextByLanguage(reader)
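
One simple way to detect a language and attach a confidence score is stop-word counting. The sketch below is a deliberately small illustration of that idea, not the algorithm used by `DetectLanguage`: the word lists are tiny hypothetical samples, and the confidence is just the share of words that matched the winning language.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

type langProfile struct {
	code  string
	words map[string]bool
}

func set(words ...string) map[string]bool {
	m := make(map[string]bool, len(words))
	for _, w := range words {
		m[w] = true
	}
	return m
}

// profiles holds a few common stop words per language (illustrative only).
var profiles = []langProfile{
	{"en", set("the", "and", "is", "of", "to")},
	{"fr", set("le", "la", "et", "est", "les")},
	{"de", set("der", "die", "und", "ist", "das")},
	{"es", set("el", "los", "y", "es", "una")},
}

// detectLanguage returns the language whose stop words occur most often,
// plus the fraction of all words that matched. It returns "" when nothing matches.
func detectLanguage(text string) (string, float64) {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	if len(words) == 0 {
		return "", 0
	}
	best, bestHits := "", 0
	for _, p := range profiles {
		hits := 0
		for _, w := range words {
			if p.words[w] {
				hits++
			}
		}
		if hits > bestHits {
			best, bestHits = p.code, hits
		}
	}
	return best, float64(bestHits) / float64(len(words))
}

func main() {
	for _, sample := range []string{
		"The cat is on the roof and the dog is of no help",
		"Le chat est sur la table et les enfants",
		"Der Hund und die Katze",
	} {
		lang, conf := detectLanguage(sample)
		fmt.Printf("%-50q %s (%.2f)\n", sample, lang, conf)
	}
}
```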

Performance Optimization Features

// 1. Optimized Startup with Pool Warmup
err := pdf.OptimizedStartup(pdf.DefaultStartupConfig())
if err != nil {
    log.Fatal(err)
}

// 2. Sharded Cache (256 shards, lock-free statistics)
cache := pdf.NewShardedCache(10000, 30*time.Minute)
cache.Set("key", value, 100)
if val, ok := cache.Get("key"); ok {
    // Use cached value
}
stats := cache.GetStats()
fmt.Printf("Hits: %d, Misses: %d, Evictions: %d\n", 
    stats.Hits, stats.Misses, stats.Evictions)

// 3. Font Prefetching (intelligent pattern-based)
fontCache := pdf.NewOptimizedFontCache(1000)
prefetcher := pdf.NewFontPrefetcher(fontCache)
defer prefetcher.Close()
prefetcher.RecordAccess("Arial", []string{"Helvetica", "Times"})

// 4. Zero-Copy String Operations
builder := pdf.NewStringBuffer(1024)
builder.WriteString("Hello")
builder.WriteByte(' ')
builder.WriteString("World")
result := builder.StringCopy()  // Safe copy

// Fast string operations
trimmed := pdf.TrimSpaceZeroCopy("  text  ")
parts := pdf.SplitZeroCopy("a,b,c", ',')
joined := pdf.JoinZeroCopy([]string{"a", "b", "c"}, ",")

🏗️ Architecture

GoPDF uses a modular architecture with clear component responsibilities:

gopdf/
├── lex.go                       # PDF lexical analysis and tokenization
├── read.go                      # PDF file reading and parsing
├── text.go                      # Core text extraction logic
├── page.go                      # Page structure analysis
├── metadata.go                  # Metadata processing
├── compatibility.go             # PDF format compatibility checking
├── recovery.go                  # PDF recovery for malformed files
├── errors.go                    # Error handling and wrapping
├── caching.go                   # Caching strategy implementation
├── spatial_index.go             # Spatial indexing (R-tree)
├── text_classifier.go           # Text classifier
├── multilang.go                 # Multilingual support
├── parallel_processing.go       # Parallel processing
├── performance.go               # Performance optimization
├── async_io.go                  # Asynchronous I/O
│
├── Performance Optimizations (2024)
├── sharded_cache.go             # 256-shard high-performance cache
├── font_prefetch.go             # Intelligent font prefetching
├── zero_copy_strings.go         # Zero-copy string operations
├── pool_warmup.go               # Memory pool pre-warming
├── enhanced_parallel.go         # Enhanced parallel processing
├── optimizations_advanced.go    # Advanced optimizations
└── memory_pools.go              # Advanced memory pool management

Core Components

  • Reader: Main PDF reading interface with encryption support
  • Text Extractor: Intelligent text extraction engine with smart ordering
  • Classifier: ML-based text classification for semantic analysis
  • Compatibility Checker: PDF version and feature compatibility validation
  • Recovery Engine: Automatic repair and recovery for damaged PDFs
  • Sharded Cache: 256-shard cache with lock-free statistics
  • Font Prefetcher: Pattern-based predictive font loading
  • Parallel Extractor: Adaptive worker pool with batch processing
  • Spatial Index: R-tree spatial query optimization
  • Language Processor: Multilingual detection and processing
  • Zero-Copy Optimizer: Memory allocation reduction utilities

📊 Performance Benchmarks

Performance metrics based on standard test datasets (Intel i7-14700K):

Overall Performance

  • Text Extraction Speed: Average 50-100 pages/second
  • Memory Usage: Smart object pooling, 40% reduction in memory footprint
  • Concurrent Processing: Multi-core support, 3-5x performance improvement with parallel extractor

Optimization Benchmarks

Sharded Cache Performance

  • Set Operations: ~118 ns/op (256 shards)
  • Get Operations: ~112 ns/op
  • Concurrent Access: ~31 ns/op (70-80% lock contention reduction)
  • Cache Hit Rate: Up to 85% with LRU policies

Zero-Copy String Operations

  • BytesToString: 0.14 ns/op (97x faster than standard)
  • String Concat: 10.12 ns/op (3.1x faster)
  • TrimSpace: 2.67 ns/op (1.2x faster)
  • Split: 59.62 ns/op (1.3x faster)

Memory Pool Warmup

  • Light Warmup: ~37 µs (development)
  • Default Warmup: ~96 µs (production)
  • Aggressive Warmup: ~358 µs (high-performance)
  • Concurrent vs Sequential: 35% faster with concurrent warmup

Parallel Extraction

  • 2 Workers: 1.8x speedup
  • 4 Workers: 3.1x speedup
  • 8 Workers: 5.0x speedup
  • Auto (CPU cores): 4.2x average speedup
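
The fixed-worker figures can also be read as parallel efficiency, that is, speedup divided by worker count: 1.8/2 = 0.90, 3.1/4 = 0.775, and 5.0/8 = 0.625. The short sketch below only redoes this arithmetic on the numbers listed above (`efficiency` is a hypothetical helper); it shows that each added worker contributes less than the previous one.

```go
package main

import "fmt"

// efficiency is the fraction of ideal linear scaling that was achieved.
func efficiency(speedup float64, workers int) float64 {
	return speedup / float64(workers)
}

func main() {
	// Speedups as reported in the list above.
	measured := []struct {
		workers int
		speedup float64
	}{{2, 1.8}, {4, 3.1}, {8, 5.0}}

	for _, m := range measured {
		fmt.Printf("%d workers: speedup %.1fx, efficiency %.1f%%\n",
			m.workers, m.speedup, 100*efficiency(m.speedup, m.workers))
	}
}
```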

🧪 Testing

The main package currently has 67.6% test coverage. The test suite includes:

  • Unit tests covering all core functionality
  • Integration tests for end-to-end PDF processing
  • Performance tests with benchmarks and memory profiling
  • Concurrency tests for thread safety validation
  • Optimization-specific tests for new features

# Run all tests
go test ./...

# Run coverage tests
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run performance benchmarks
go test -bench=. -benchmem -benchtime=500ms

# Run specific optimization benchmarks
go test -bench=BenchmarkShardedCache -run=^$
go test -bench=BenchmarkStringOperations -run=^$
go test -bench=BenchmarkParallelExtractor -run=^$
go test -bench=BenchmarkWarmup -run=^$

Benchmark Examples

# Compare parallel vs sequential extraction
go test -bench=BenchmarkParallelExtractorVsSequential -run=^$ -benchtime=500ms

# Zero-copy string operations performance
go test -bench=BenchmarkStringOperations -run=^$ -benchtime=500ms

# Cache performance under different loads
go test -bench=BenchmarkShardedCache -run=^$ -benchtime=1s

📋 Command Line Tool

GoPDF includes a command-line tool for quick PDF text extraction:

# Build the CLI tool
go build -o pdfcli ./cmd/pdfcli

# Extract plain text from all pages
./pdfcli document.pdf

# Extract text from specific page
./pdfcli -page 1 document.pdf

# Extract styled text with formatting
./pdfcli -mode styled document.pdf

# Extract text organized by rows
./pdfcli -mode rows -page 1 document.pdf

# Extract text organized by columns
./pdfcli -mode columns -page 1 document.pdf
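
A command line like the ones above could be parsed with Go's standard `flag` package roughly as follows. This is a sketch, not the actual `cmd/pdfcli` source: the `options` struct, the `parseArgs` helper, the `plain` default mode name, and the convention that page 0 means "all pages" are assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

type options struct {
	Page int    // 0 is assumed to mean "all pages"
	Mode string // plain (assumed default), styled, rows, or columns
	File string
}

// parseArgs parses arguments of the form: [-page N] [-mode M] file.pdf
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("pdfcli", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output quiet; errors are returned
	page := fs.Int("page", 0, "page number to extract (0 = all pages)")
	mode := fs.String("mode", "plain", "output mode: plain, styled, rows, columns")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if fs.NArg() != 1 {
		return options{}, errors.New("expected exactly one PDF file")
	}
	switch *mode {
	case "plain", "styled", "rows", "columns":
	default:
		return options{}, fmt.Errorf("unknown mode %q", *mode)
	}
	if *page < 0 {
		return options{}, errors.New("page must not be negative")
	}
	return options{Page: *page, Mode: *mode, File: fs.Arg(0)}, nil
}

func main() {
	opts, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("file=%s page=%d mode=%s\n", opts.File, opts.Page, opts.Mode)
}
```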

Core Interfaces

// PDF file operations
Open(filename string) (*os.File, *Reader, error)
NewReader(r io.ReaderAt, size int64) (*Reader, error)

// PDF compatibility checking
CheckPDFCompatibility(data []byte) (*PDFCompatibilityInfo, error)
ValidatePDFA(data []byte) ([]string, error)
ValidatePDFX(data []byte) ([]string, error)

// PDF integrity and recovery
CheckIntegrity(r io.ReaderAt, size int64) *IntegrityStatus
RecoverPDF(data []byte) ([]byte, error)

// Text extraction
(reader *Reader) GetPlainText() (io.Reader, error)
(reader *Reader) ExtractWithContext(ctx context.Context, opts ExtractOptions) (io.Reader, error)
(reader *Reader) ExtractAllPagesParallel(ctx context.Context, workers int) ([]string, error)

// Page operations
(reader *Reader) Page(num int) *Page
(page *Page) Content() *Content
(page *Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)

// High-Performance Parallel Extraction
NewParallelExtractor(workers int) *ParallelExtractor
(pe *ParallelExtractor) ExtractAllPages(ctx context.Context, pages []Page) ([][]Text, error)
(pe *ParallelExtractor) GetCacheStats() ShardedCacheStats
(pe *ParallelExtractor) GetPrefetchStats() PrefetchStats
(pe *ParallelExtractor) Close()

// Sharded Cache
NewShardedCache(maxSize int, ttl time.Duration) *ShardedCache
(sc *ShardedCache) Get(key string) (interface{}, bool)
(sc *ShardedCache) Set(key string, value interface{}, size int64)
(sc *ShardedCache) GetStats() ShardedCacheStats
(sc *ShardedCache) Clear()

// Font Prefetching
NewFontPrefetcher(cache *OptimizedFontCache) *FontPrefetcher
(fp *FontPrefetcher) RecordAccess(fontKey string, relatedKeys []string)
(fp *FontPrefetcher) GetStats() PrefetchStats
(fp *FontPrefetcher) Close()

// Zero-Copy String Operations
BytesToString(b []byte) string
StringToBytes(s string) []byte
NewStringBuffer(capacity int) *StringBuffer
FastStringConcatZC(parts ...string) string
TrimSpaceZeroCopy(s string) string
SplitZeroCopy(s string, sep byte) []string
JoinZeroCopy(parts []string, sep string) string

// Pool Warmup
OptimizedStartup(config *StartupConfig) error
WarmupGlobal(config *WarmupConfig) error
DefaultWarmupConfig() *WarmupConfig

Performance Optimization APIs

// Optimized startup (recommended at application start)
config := pdf.DefaultStartupConfig()
config.WarmupPools = true
config.PreallocateCaches = true
err := pdf.OptimizedStartup(config)

// Create optimized font cache
fontCache := pdf.NewOptimizedFontCache(1000)

// Use string pool for repeated strings
pool := pdf.NewStringPool()
fontName := pool.Intern("Arial")
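
String interning, as used for repeated values such as font names, can be sketched as a mutex-protected map that returns one canonical copy per distinct string. The type below is a hypothetical illustration and is not the library's `StringPool`; `strings.Clone` (Go 1.18 or later) is used so the stored copy does not keep a larger source buffer alive.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// internPool keeps one canonical copy of each distinct string.
type internPool struct {
	mu   sync.Mutex
	strs map[string]string
}

func newInternPool() *internPool {
	return &internPool{strs: make(map[string]string)}
}

// Intern returns the canonical copy of s, storing it on first use.
func (p *internPool) Intern(s string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if canonical, ok := p.strs[s]; ok {
		return canonical
	}
	canonical := strings.Clone(s) // detach from any larger backing buffer
	p.strs[canonical] = canonical
	return canonical
}

// Len reports how many distinct strings are stored.
func (p *internPool) Len() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.strs)
}

func main() {
	pool := newInternPool()
	page := "Arial Helvetica Arial Times Arial"
	for _, name := range strings.Fields(page) {
		pool.Intern(name)
	}
	fmt.Println("distinct font names:", pool.Len())
}
```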

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Create a Pull Request

Development Setup

# Clone repository
git clone https://github.com/Geek0x0/pdf.git
cd pdf

# Install dependencies
go mod download

# Run tests
go test ./...

# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run performance benchmarks
go test -bench=. -benchmem -benchtime=500ms

# Build examples
go build ./examples/...

# Run specific example
go run ./examples/extract/main.go sample.pdf

Code Quality

  • Linting: Use golangci-lint for code quality checks
  • Formatting: Follow standard Go formatting with gofmt
  • Testing: Maintain or improve test coverage with new features
  • Documentation: Update README and code comments for API changes

Performance Contributions

When contributing performance optimizations:

  1. Include benchmark tests for the optimization
  2. Measure memory usage impact with go test -benchmem
  3. Test under concurrent load scenarios
  4. Document the performance improvement metrics
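
As a starting point for steps 1 and 2, here is a small allocation-focused benchmark sketch. It uses `testing.Benchmark`, which runs a benchmark function outside `go test`, purely so the example is a single runnable file; in a real contribution the same loop would live in a `_test.go` file as a `BenchmarkXxx` function. The two functions being compared are hypothetical placeholders for "before" and "after" versions of an optimization.

```go
package main

import (
	"fmt"
	"testing"
)

// buildGrowing lets append grow the slice as needed ("before").
func buildGrowing(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// buildPrealloc allocates the final capacity up front ("after").
func buildPrealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// sink keeps results alive so the compiler cannot discard the work.
var sink []int

func main() {
	cases := []struct {
		name string
		fn   func(int) []int
	}{
		{"growing", buildGrowing},
		{"prealloc", buildPrealloc},
	}
	for _, c := range cases {
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs() // record allocations, like -benchmem does
			for i := 0; i < b.N; i++ {
				sink = c.fn(1024)
			}
		})
		fmt.Printf("%-9s %s %s\n", c.name, res.String(), res.MemString())
	}
}
```

Reporting both time and allocations per operation makes it easier to document the improvement in the pull request, as step 4 asks.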

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Based on unidoc/unipdf PDF parsing technology
  • Valuable feedback and suggestions from community contributors
  • Excellent language and toolchain provided by the Go team

📞 Contact


⭐ If this project helps you, please give us a star!
