GoPDF is a powerful PDF processing library written in Go, focused on efficient text extraction, content analysis, and multilingual support. Built with a modular architecture, it provides high-performance concurrent processing capabilities.
- Intelligent Text Extraction: Supports plain text and styled text extraction
- Semantic Classification: Automatic identification of titles, paragraphs, lists, tables, and other content types
- Multilingual Support: Built-in English, French, German, and Spanish language detection and processing
- Layout Analysis: Smart handling of multi-column layouts and complex page structures
- Memory Optimization (NEW): Targeted allocation reduction for high-volume processing
  - Pre-allocated slices with capacity estimation (30-40% allocation reduction)
  - Eliminated unnecessary copies in hot paths (50% memory reduction in sorting)
  - Precise capacity calculation in merge operations (100+ allocations → 3)
  - Optimized string builder growth (40-50% reduction in string operations)
- Sharded Caching: 256-shard cache with lock-free statistics (70-80% lock contention reduction)
- Font Prefetching: Intelligent pattern-based font preloading with priority queuing
- Zero-Copy Strings: Unsafe pointer optimization reducing memory allocation by 30-50%
- Pool Warmup: Startup memory pool pre-warming reducing first-access latency by 60-80%
- Enhanced Parallel Processing: Adaptive worker pools with batch processing (50% scheduling overhead reduction)
- Memory Management: Advanced object pooling and resource management
- Spatial Indexing: R-tree spatial indexing for optimized layout analysis
- Asynchronous I/O: Streaming support for large files
- Encoding Support: UTF-16, PDFDocEncoding, WinAnsi, MacRoman, and more
- Compression Formats: Flate, LZW, ASCII85, RunLength
- Encryption Support: RC4, AES encrypted PDFs
- PDF Compatibility: Comprehensive PDF version and feature compatibility checking
- PDF Recovery: Automatic recovery from malformed or corrupted PDF files
- Thread Safety: Fully concurrent-safe operations
- Robust Error Handling: Graceful degradation for malformed PDFs
  - Library never panics on invalid input (errors returned instead)
  - Tolerates missing PDF structure elements (endobj, endstream, etc.)
  - Handles malformed hex strings, names, and escape sequences
  - Graceful handling of truncated or corrupted files
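
As an illustration of the "errors returned instead of panics" behavior for malformed hex strings, here is a minimal sketch of a lenient decoder. It is not GoPDF's actual code: `decodeHexLenient`, `hexVal`, and `isPDFSpace` are hypothetical names used only for this example. It follows the PDF rule that white space inside a hex string is ignored and that a missing final digit is treated as 0; invalid bytes are skipped and reported through the returned error while the best-effort result is still returned.

```go
package main

import "fmt"

// hexVal returns the numeric value of an ASCII hex digit, or -1 if c is not one.
func hexVal(c byte) int {
	switch {
	case c >= '0' && c <= '9':
		return int(c - '0')
	case c >= 'a' && c <= 'f':
		return int(c-'a') + 10
	case c >= 'A' && c <= 'F':
		return int(c-'A') + 10
	}
	return -1
}

// isPDFSpace reports whether c is one of the PDF white-space characters.
func isPDFSpace(c byte) bool {
	switch c {
	case 0, '\t', '\n', '\f', '\r', ' ':
		return true
	}
	return false
}

// decodeHexLenient (hypothetical helper) decodes the body of a PDF hex string,
// i.e. the text between '<' and '>'. It never panics: invalid bytes are
// skipped and counted, and the decoded prefix is returned alongside the error.
func decodeHexLenient(s string) ([]byte, error) {
	out := make([]byte, 0, (len(s)+1)/2)
	hi := -1 // pending high nibble, or -1 when none is pending
	invalid := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		v := hexVal(c)
		if v < 0 {
			if !isPDFSpace(c) {
				invalid++
			}
			continue
		}
		if hi < 0 {
			hi = v
			continue
		}
		out = append(out, byte(hi<<4|v))
		hi = -1
	}
	if hi >= 0 {
		// Odd number of digits: the missing final digit is assumed to be 0.
		out = append(out, byte(hi<<4))
	}
	if invalid > 0 {
		return out, fmt.Errorf("hex string: skipped %d invalid byte(s)", invalid)
	}
	return out, nil
}

func main() {
	for _, in := range []string{"48656C6C6F", "48 65 6c\n6C 6F", "901FA", "4X8"} {
		b, err := decodeHexLenient(in)
		fmt.Printf("%q -> % x (err: %v)\n", in, b, err)
	}
}
```

Returning both the partial result and an error lets a caller choose between strict rejection and graceful degradation.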
go get -u github.com/Geek0x0/pdf

package main
import (
"fmt"
"io"
"log"
"github.com/Geek0x0/pdf"
)
func main() {
// Open PDF file
file, reader, err := pdf.Open("example.pdf")
if err != nil {
log.Fatal(err)
}
defer file.Close()
// Extract plain text
textReader, err := reader.GetPlainText()
if err != nil {
log.Fatal(err)
}
// Read the text content and print it
text, err := io.ReadAll(textReader)
if err != nil {
log.Fatal(err)
}
fmt.Print(string(text))
}

For high-performance PDF processing, follow these optimization steps:
import (
"log"
"github.com/Geek0x0/pdf"
)
func init() {
// Pre-warm memory pools and optimize GC settings
config := pdf.DefaultStartupConfig()
config.WarmupPools = true
config.GCPercent = 200 // Reduce GC frequency
if err := pdf.OptimizedStartup(config); err != nil {
log.Fatalf("Startup optimization failed: %v", err)
}
}

func extractLargeDocument(filename string) ([]string, error) {
f, r, err := pdf.Open(filename)
if err != nil {
return nil, err
}
defer f.Close()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
// Automatically uses all CPU cores
return r.ExtractAllPagesParallel(ctx, 0)
}

// Create global cache
var globalCache = pdf.NewShardedCache(100000, 1*time.Hour)
func getPageText(reader *pdf.Reader, pageNum int) (string, error) {
cacheKey := fmt.Sprintf("page_%d", pageNum)
// Check cache first
if cached, ok := globalCache.Get(cacheKey); ok {
return cached.(string), nil
}
// Extract and cache
page := reader.Page(pageNum)
text, err := page.GetPlainText(nil)
if err == nil {
globalCache.Set(cacheKey, text, int64(len(text)))
}
return text, err
}

func processTexts(texts []string) string {
// Fast zero-copy string operations
builder := pdf.NewStringBuffer(10240)
for _, text := range texts {
trimmed := pdf.TrimSpaceZeroCopy(text)
builder.WriteString(trimmed)
builder.WriteByte('\n')
}
return builder.StringCopy()
}

// Check PDF compatibility and features
data, err := os.ReadFile("document.pdf")
if err != nil {
log.Fatal(err)
}
compat, err := pdf.CheckPDFCompatibility(data)
if err != nil {
log.Fatal(err)
}
fmt.Printf("PDF Version: %s\n", compat.Version)
fmt.Printf("Is Linearized: %v\n", compat.IsLinearized)
fmt.Printf("Has Transparency: %v\n", compat.HasTransparency)
fmt.Printf("Has Forms: %v\n", compat.HasForms)
if len(compat.Warnings) > 0 {
fmt.Println("Warnings:")
for _, warning := range compat.Warnings {
fmt.Printf(" - %s\n", warning)
}
}
// Validate PDF/A compliance
issues, err := pdf.ValidatePDFA(data)
if err != nil {
log.Fatal(err)
}
if len(issues) == 0 {
fmt.Println("PDF/A validation passed")
} else {
fmt.Println("PDF/A validation issues:")
for _, issue := range issues {
fmt.Printf(" - %s\n", issue)
}
}

// Check PDF integrity before processing
f, err := os.Open("potentially_corrupted.pdf")
if err != nil {
log.Fatal(err)
}
defer f.Close()
stat, err := f.Stat()
if err != nil {
log.Fatal(err)
}
integrity := pdf.CheckIntegrity(f, stat.Size())
fmt.Printf("PDF Valid: %v\n", integrity.IsValid)
fmt.Printf("Is Truncated: %v\n", integrity.IsTruncated)
fmt.Printf("Estimated Objects: %d\n", integrity.EstimatedObjects)
if len(integrity.Issues) > 0 {
fmt.Println("Issues found:")
for _, issue := range integrity.Issues {
fmt.Printf(" - %s\n", issue)
}
}
// Attempt to recover corrupted PDF
data, err := os.ReadFile("corrupted.pdf")
if err != nil {
log.Fatal(err)
}
recovered, err := pdf.RecoverPDF(data)
if err != nil {
log.Printf("Recovery failed: %v", err)
} else {
fmt.Printf("Recovered PDF size: %d bytes\n", len(recovered))
// Save recovered PDF
err = os.WriteFile("recovered.pdf", recovered, 0644)
if err != nil {
log.Fatal(err)
}
}

import "context"
// Extract all pages in parallel with all optimizations
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()
// Automatically uses runtime.NumCPU() workers when workers=0
pages, err := reader.ExtractAllPagesParallel(ctx, 0)
if err != nil {
log.Fatal(err)
}
for i, text := range pages {
fmt.Printf("Page %d: %d characters\n", i+1, len(text))
}

// Create parallel extractor with custom worker count
extractor := pdf.NewParallelExtractor(4)
defer extractor.Close()
// Collect pages
numPages := reader.NumPage()
pages := make([]pdf.Page, numPages)
for i := 0; i < numPages; i++ {
pages[i] = reader.Page(i + 1)
pages[i].SetFontCacheInterface(extractor.GetCache())
}
// Extract with context
results, err := extractor.ExtractAllPages(ctx, pages)
if err != nil {
log.Fatal(err)
}
// Get performance stats
cacheStats := extractor.GetCacheStats()
fmt.Printf("Cache hits: %d, misses: %d\n", cacheStats.Hits, cacheStats.Misses)

// Create multilingual processor
processor := pdf.NewMultiLangProcessor()
// Detect text language
result := processor.DetectLanguage("Hello world! Bonjour le monde!")
fmt.Printf("Detected language: %s (confidence: %.2f)\n", result.Language, result.Confidence)
// Extract text by language
extractor := pdf.NewLanguageTextExtractor()
textsByLang, err := extractor.ExtractTextByLanguage(reader)

// 1. Optimized Startup with Pool Warmup
err := pdf.OptimizedStartup(pdf.DefaultStartupConfig())
if err != nil {
log.Fatal(err)
}
// 2. Sharded Cache (256 shards, lock-free)
cache := pdf.NewShardedCache(10000, 30*time.Minute)
cache.Set("key", value, 100)
if val, ok := cache.Get("key"); ok {
// Use cached value
}
stats := cache.GetStats()
fmt.Printf("Hits: %d, Misses: %d, Evictions: %d\n",
stats.Hits, stats.Misses, stats.Evictions)
// 3. Font Prefetching (intelligent pattern-based)
fontCache := pdf.NewOptimizedFontCache(1000)
prefetcher := pdf.NewFontPrefetcher(fontCache)
defer prefetcher.Close()
prefetcher.RecordAccess("Arial", []string{"Helvetica", "Times"})
// 4. Zero-Copy String Operations
builder := pdf.NewStringBuffer(1024)
builder.WriteString("Hello")
builder.WriteByte(' ')
builder.WriteString("World")
result := builder.StringCopy() // Safe copy
// Fast string operations
trimmed := pdf.TrimSpaceZeroCopy(" text ")
parts := pdf.SplitZeroCopy("a,b,c", ',')
joined := pdf.JoinZeroCopy([]string{"a", "b", "c"}, ",")

GoPDF uses a modular architecture with clear component responsibilities:
gopdf/
├── lex.go                     # PDF lexical analysis and tokenization
├── read.go                    # PDF file reading and parsing
├── text.go                    # Core text extraction logic
├── page.go                    # Page structure analysis
├── metadata.go                # Metadata processing
├── compatibility.go           # PDF format compatibility checking
├── recovery.go                # PDF recovery for malformed files
├── errors.go                  # Error handling and wrapping
├── caching.go                 # Caching strategy implementation
├── spatial_index.go           # Spatial indexing (R-tree)
├── text_classifier.go         # Text classifier
├── multilang.go               # Multilingual support
├── parallel_processing.go     # Parallel processing
├── performance.go             # Performance optimization
├── async_io.go                # Asynchronous I/O
│
└── Performance Optimizations (2024)
    ├── sharded_cache.go           # 256-shard high-performance cache
    ├── font_prefetch.go           # Intelligent font prefetching
    ├── zero_copy_strings.go       # Zero-copy string operations
    ├── pool_warmup.go             # Memory pool pre-warming
    ├── enhanced_parallel.go       # Enhanced parallel processing
    ├── optimizations_advanced.go  # Advanced optimizations
    └── memory_pools.go            # Advanced memory pool management
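
To make the idea behind `sharded_cache.go` concrete, the following is a minimal sketch of key sharding, assuming a fixed array of 256 mutex-protected maps selected by an FNV-1a hash, with hit and miss counters kept in atomics so that reading statistics takes no lock. `ShardedMap` and its methods are hypothetical names for this example; the sketch omits the TTL, size accounting, and eviction that the library's `ShardedCache` exposes, and it is not the library's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const shardCount = 256

// shard is one independently locked partition of the key space.
type shard struct {
	mu    sync.RWMutex
	items map[string]any
}

// ShardedMap (hypothetical) spreads keys over shardCount shards so that
// goroutines touching different keys rarely contend on the same lock.
type ShardedMap struct {
	shards [shardCount]shard
	hits   atomic.Int64 // updated without taking any shard lock
	misses atomic.Int64
}

func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i].items = make(map[string]any)
	}
	return m
}

// fnv32a is the 32-bit FNV-1a hash, written inline to avoid allocations.
func fnv32a(s string) uint32 {
	h := uint32(2166136261)
	for i := 0; i < len(s); i++ {
		h ^= uint32(s[i])
		h *= 16777619
	}
	return h
}

func (m *ShardedMap) shardFor(key string) *shard {
	return &m.shards[fnv32a(key)%shardCount]
}

func (m *ShardedMap) Set(key string, v any) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.items[key] = v
	s.mu.Unlock()
}

func (m *ShardedMap) Get(key string) (any, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.items[key]
	s.mu.RUnlock()
	if ok {
		m.hits.Add(1)
	} else {
		m.misses.Add(1)
	}
	return v, ok
}

// Stats returns the hit and miss counters.
func (m *ShardedMap) Stats() (hits, misses int64) {
	return m.hits.Load(), m.misses.Load()
}

func main() {
	m := NewShardedMap()
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				m.Set(fmt.Sprintf("page_%d_%d", w, i), i)
			}
		}(w)
	}
	wg.Wait()
	_, found := m.Get("page_3_42")
	_, foundMissing := m.Get("page_99_0")
	hits, misses := m.Stats()
	fmt.Println(found, foundMissing, hits, misses)
}
```

The shard count is a trade-off: more shards reduce contention but add fixed memory overhead per cache.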
- Reader: Main PDF reading interface with encryption support
- Text Extractor: Intelligent text extraction engine with smart ordering
- Classifier: ML-based text classification for semantic analysis
- Compatibility Checker: PDF version and feature compatibility validation
- Recovery Engine: Automatic repair and recovery for damaged PDFs
- Sharded Cache: 256-shard lock-free cache system
- Font Prefetcher: Pattern-based predictive font loading
- Parallel Extractor: Adaptive worker pool with batch processing
- Spatial Index: R-tree spatial query optimization
- Language Processor: Multilingual detection and processing
- Zero-Copy Optimizer: Memory allocation reduction utilities
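
The zero-copy conversions can be sketched with the standard `unsafe` helpers available since Go 1.20. This is an assumption about the general technique, not a copy of the library's `BytesToString` or `StringToBytes`; the lowercase function names and `sharesMemory` are hypothetical. The conversion avoids the allocation and copy that `string(b)` performs, at the cost of a contract: the byte slice must not be modified while the string is in use, and the bytes obtained from a string must never be written to.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying.
// The caller must not modify b for as long as the result is in use.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// stringToBytes returns a read-only view of s.
// Writing to the returned slice is undefined behavior.
func stringToBytes(s string) []byte {
	if s == "" {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// sharesMemory reports whether s points at the same backing array as b.
func sharesMemory(b []byte, s string) bool {
	if len(b) == 0 || len(s) == 0 {
		return false
	}
	return unsafe.SliceData(b) == unsafe.StringData(s)
}

func main() {
	raw := []byte("BT /F1 12 Tf (Hello) Tj ET")

	zero := bytesToString(raw) // no allocation, shares raw's memory
	copied := string(raw)      // standard conversion, allocates a copy

	fmt.Println(zero == copied)
	fmt.Println("zero-copy shares memory:", sharesMemory(raw, zero))
	fmt.Println("copy shares memory:", sharesMemory(raw, copied))
}
```

When a string has to outlive its source buffer, a copying call such as the `StringCopy()` method used in the examples above is the safer choice.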
Performance metrics based on standard test datasets (Intel i7-14700K):
- Text Extraction Speed: Average 50-100 pages/second
- Memory Usage: Smart object pooling, 40% reduction in memory footprint
- Concurrent Processing: Multi-core support, 3-5x performance improvement with parallel extractor
- Set Operations: ~118 ns/op (256 shards)
- Get Operations: ~112 ns/op
- Concurrent Access: ~31 ns/op (70-80% lock contention reduction)
- Cache Hit Rate: Up to 85% with LRU policies
- BytesToString: 0.14 ns/op (97x faster than standard)
- String Concat: 10.12 ns/op (3.1x faster)
- TrimSpace: 2.67 ns/op (1.2x faster)
- Split: 59.62 ns/op (1.3x faster)
- Light Warmup: ~37 µs (development)
- Default Warmup: ~96 µs (production)
- Aggressive Warmup: ~358 µs (high-performance)
- Concurrent vs Sequential: 35% faster with concurrent warmup
- 2 Workers: 1.8x speedup
- 4 Workers: 3.1x speedup
- 8 Workers: 5.0x speedup
- Auto (CPU cores): 4.2x average speedup
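
The worker-count figures above come from running page extraction on a pool of goroutines. The sketch below shows one common way to structure such a pool, assuming a fixed set of workers fed from a channel, results written by page index so the output stays in page order, and cancellation on the first error. `extractParallel`, `extractFunc`, and `runDemo` are hypothetical names; the per-page function is a stand-in for real extraction, and this is not the library's `ExtractAllPagesParallel`.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"
)

// extractFunc stands in for per-page text extraction (hypothetical).
type extractFunc func(page int) (string, error)

// extractParallel runs fn for pages 1..numPages on a fixed pool of workers
// and returns the texts in page order. workers <= 0 means runtime.NumCPU().
func extractParallel(ctx context.Context, numPages, workers int, fn extractFunc) ([]string, error) {
	if workers <= 0 {
		workers = runtime.NumCPU()
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make([]string, numPages)
	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for page := range jobs {
				text, err := fn(page)
				if err != nil {
					// Remember the first error and stop feeding new pages.
					once.Do(func() {
						firstErr = err
						cancel()
					})
					continue
				}
				results[page-1] = text // each index is written by exactly one worker
			}
		}()
	}
	// Feed page numbers until done, cancelled, or timed out.
	for page := 1; page <= numPages && ctx.Err() == nil; page++ {
		select {
		case jobs <- page:
		case <-ctx.Done():
		}
	}
	close(jobs)
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return results, nil
}

// runDemo extracts fake page text under a one-minute timeout.
func runDemo(numPages, workers int) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	return extractParallel(ctx, numPages, workers, func(page int) (string, error) {
		return fmt.Sprintf("text of page %d", page), nil
	})
}

func main() {
	pages, err := runDemo(10, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, text := range pages {
		fmt.Printf("Page %d: %d characters\n", i+1, len(text))
	}
}
```

Writing each result into its own slice index avoids a mutex around the results and keeps the output ordered regardless of which worker finishes first.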
Test coverage for the main package is 67.6%. The suite includes:
- Unit tests covering all core functionality
- Integration tests for end-to-end PDF processing
- Performance tests with benchmarks and memory profiling
- Concurrency tests for thread safety validation
- Optimization-specific tests for new features
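
As a sketch of what an allocation-focused check for one of these optimizations might look like, the example below uses `testing.AllocsPerRun` from the standard library to compare a slice that grows by repeated `append` with one pre-allocated to its final capacity. The function names are hypothetical and unrelated to the library's internals, and the exact allocation counts depend on the Go version, so the program prints them rather than assuming fixed values.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps results reachable so the compiler cannot optimize the work away.
var sink []int

// collectGrowing appends into a nil slice, reallocating as it grows.
func collectGrowing(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

// collectPresized allocates the final capacity once, up front.
func collectPresized(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

// measure returns the average number of heap allocations per call.
func measure(n int) (growing, presized float64) {
	growing = testing.AllocsPerRun(100, func() { sink = collectGrowing(n) })
	presized = testing.AllocsPerRun(100, func() { sink = collectPresized(n) })
	return growing, presized
}

func main() {
	g, p := measure(1000)
	fmt.Printf("growing: %.0f allocs/run, presized: %.0f allocs/run\n", g, p)
}
```

Inside a real test file the same comparison would typically be an assertion in a `TestXxx` function rather than a printed line.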
# Run all tests
go test ./...
# Run coverage tests
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Run performance benchmarks
go test -bench=. -benchmem -benchtime=500ms
# Run specific optimization benchmarks
go test -bench=BenchmarkShardedCache -run=^$
go test -bench=BenchmarkStringOperations -run=^$
go test -bench=BenchmarkParallelExtractor -run=^$
go test -bench=BenchmarkWarmup -run=^$

# Compare parallel vs sequential extraction
go test -bench=BenchmarkParallelExtractorVsSequential -run=^$ -benchtime=500ms
# Zero-copy string operations performance
go test -bench=BenchmarkStringOperations -run=^$ -benchtime=500ms
# Cache performance under different loads
go test -bench=BenchmarkShardedCache -run=^$ -benchtime=1s

GoPDF includes a command-line tool for quick PDF text extraction:
# Build the CLI tool
go build -o pdfcli ./cmd/pdfcli
# Extract plain text from all pages
./pdfcli document.pdf
# Extract text from specific page
./pdfcli -page 1 document.pdf
# Extract styled text with formatting
./pdfcli -mode styled document.pdf
# Extract text organized by rows
./pdfcli -mode rows -page 1 document.pdf
# Extract text organized by columns
./pdfcli -mode columns -page 1 document.pdf

// PDF file operations
Open(filename string) (*os.File, *Reader, error)
NewReader(r io.ReaderAt, size int64) (*Reader, error)
// PDF compatibility checking
CheckPDFCompatibility(data []byte) (*PDFCompatibilityInfo, error)
ValidatePDFA(data []byte) ([]string, error)
ValidatePDFX(data []byte) ([]string, error)
// PDF integrity and recovery
CheckIntegrity(r io.ReaderAt, size int64) *IntegrityStatus
RecoverPDF(data []byte) ([]byte, error)
// Text extraction
(reader *Reader) GetPlainText() (io.Reader, error)
(reader *Reader) ExtractWithContext(ctx context.Context, opts ExtractOptions) (io.Reader, error)
(reader *Reader) ExtractAllPagesParallel(ctx context.Context, workers int) ([]string, error)
// Page operations
(reader *Reader) Page(num int) *Page
(page *Page) Content() *Content
(page *Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)
// High-Performance Parallel Extraction
NewParallelExtractor(workers int) *ParallelExtractor
(pe *ParallelExtractor) ExtractAllPages(ctx context.Context, pages []Page) ([][]Text, error)
(pe *ParallelExtractor) GetCacheStats() ShardedCacheStats
(pe *ParallelExtractor) GetPrefetchStats() PrefetchStats
(pe *ParallelExtractor) Close()
// Sharded Cache
NewShardedCache(maxSize int, ttl time.Duration) *ShardedCache
(sc *ShardedCache) Get(key string) (interface{}, bool)
(sc *ShardedCache) Set(key string, value interface{}, size int64)
(sc *ShardedCache) GetStats() ShardedCacheStats
(sc *ShardedCache) Clear()
// Font Prefetching
NewFontPrefetcher(cache *OptimizedFontCache) *FontPrefetcher
(fp *FontPrefetcher) RecordAccess(fontKey string, relatedKeys []string)
(fp *FontPrefetcher) GetStats() PrefetchStats
(fp *FontPrefetcher) Close()
// Zero-Copy String Operations
BytesToString(b []byte) string
StringToBytes(s string) []byte
NewStringBuffer(capacity int) *StringBuffer
FastStringConcatZC(parts ...string) string
TrimSpaceZeroCopy(s string) string
SplitZeroCopy(s string, sep byte) []string
JoinZeroCopy(parts []string, sep string) string
// Pool Warmup
OptimizedStartup(config *StartupConfig) error
WarmupGlobal(config *WarmupConfig) error
DefaultWarmupConfig() *WarmupConfig

// Optimized startup (recommended at application start)
config := pdf.DefaultStartupConfig()
config.WarmupPools = true
config.PreallocateCaches = true
err := pdf.OptimizedStartup(config)
// Create optimized font cache
fontCache := pdf.NewOptimizedFontCache(1000)
// Use string pool for repeated strings
pool := pdf.NewStringPool()
fontName := pool.Intern("Arial")

Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Create a Pull Request
# Clone repository
git clone https://github.com/Geek0x0/pdf.git
cd pdf
# Install dependencies
go mod download
# Run tests
go test ./...
# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Run performance benchmarks
go test -bench=. -benchmem -benchtime=500ms
# Build examples
go build ./examples/...
# Run specific example
go run ./examples/extract/main.go sample.pdf

- Linting: Use golangci-lint for code quality checks
- Formatting: Follow standard Go formatting with gofmt
- Testing: Maintain or improve test coverage with new features
- Documentation: Update README and code comments for API changes
When contributing performance optimizations:
- Include benchmark tests for the optimization
- Measure memory usage impact with go test -benchmem
- Test under concurrent load scenarios
- Document the performance improvement metrics
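
A benchmark accompanying a proposed optimization might look like the following sketch, which compares naive string concatenation with a `strings.Builder` grown once to the final size. In the repository this would normally be a `func BenchmarkXxx(b *testing.B)` in a `_test.go` file run with `go test -bench=... -benchmem`; `testing.Benchmark` is used here only so the example runs as a standalone program. `joinNaive` and `joinPresized` are hypothetical functions, not part of the library.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinNaive concatenates with +=, reallocating as the string grows.
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinPresized computes the final length first and grows the builder once.
func joinPresized(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	var b strings.Builder
	b.Grow(total)
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// sink keeps the compiler from discarding the benchmarked result.
var sink string

func main() {
	parts := make([]string, 200)
	for i := range parts {
		parts[i] = "chunk of page text "
	}
	candidates := []struct {
		name string
		fn   func([]string) string
	}{
		{"naive", joinNaive},
		{"presized", joinPresized},
	}
	for _, c := range candidates {
		fn := c.fn
		r := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs() // include B/op and allocs/op in the result
			for i := 0; i < b.N; i++ {
				sink = fn(parts)
			}
		})
		fmt.Printf("%-9s %s %s\n", c.name, r.String(), r.MemString())
	}
}
```

Reporting time and allocations side by side makes it easier to document the improvement and to notice regressions in either dimension.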
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on unidoc/unipdf PDF parsing technology
- Valuable feedback and suggestions from community contributors
- Excellent language and toolchain provided by the Go team
- Project Home: https://github.com/Geek0x0/pdf
- Issue Tracker: https://github.com/Geek0x0/pdf/issues
⭐ If this project helps you, please give us a star!