Go DOCX Text Processor

A powerful Go library for extracting and manipulating text in Microsoft Word (DOCX) files, with zero external dependencies. Transform your documents using Go templates or custom replacers while maintaining document structure.

✨ Features

Text extraction from DOCX files with two modes:
- ExtractText() - Extract text with changes accepted (insertions included, deletions ignored)
- ExtractOriginalText() - Extract original text with changes rejected (insertions ignored, deletions restored)
Word-level diff analysis with readable output:
- Diff() - Compare original vs accepted text with semantic word-level differences
- PrettyPrint() - Generate LLM-friendly diff output with <delete> and <insert> tags
- Built on custom LCS (Longest Common Subsequence) algorithm for optimal performance
Text modification using Go templates or custom replacers
Full document support:
- Main document body
- Headers and footers
- Tables and cells
- Bullet points and numbered lists
Track changes handling (insertions/deletions) for both extraction and modification
Memory support with byte array functions (ExtractTextBytes, ExtractOriginalTextBytes)
Zero external dependencies - completely self-contained library
Unicode support - handles all Unicode characters, emojis, and multibyte text correctly
Efficient (single pass processing)
OOXML standard compliant
MIT License

🚀 Installation

go get github.com/xavier268/mydocx

📖 Quick Start

Text Extraction

import "github.com/xavier268/mydocx"

func main() {
    // Extract text from all document parts (main body, headers, footers)
    // This extracts text as if all track changes were ACCEPTED
    content, err := mydocx.ExtractText("document.docx")
    if err != nil {
        log.Fatal(err)
    }

    // Extract original text as if all track changes were REJECTED
    original, err := mydocx.ExtractOriginalText("document.docx")
    if err != nil {
        log.Fatal(err)
    }

    // Both return a map[string][]string where:
    // - key is the container name (e.g., "word/document.xml", "word/footer1.xml")
    // - value is a slice of strings, one for each paragraph
    for container, paragraphs := range content {
        fmt.Printf("Content from %s (changes accepted):\n", container)
        for _, para := range paragraphs {
            fmt.Println(para)
        }
    }
    
    for container, paragraphs := range original {
        fmt.Printf("Original content from %s (changes rejected):\n", container)
        for _, para := range paragraphs {
            fmt.Println(para)
        }
    }
}

Document Diff Analysis

Simple One-Line Analysis

import "github.com/xavier268/mydocx"

func main() {
    // Get LLM-friendly diff analysis in one line
    analysis, err := mydocx.DiffAnalyse("document.docx")
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Print(analysis) // Ready for LLM processing
}

Detailed Diff Processing

import "github.com/xavier268/mydocx"

func main() {
    // Extract both original and accepted versions
    original, err := mydocx.ExtractOriginalText("document.docx")
    if err != nil {
        log.Fatal(err)
    }
    
    accepted, err := mydocx.ExtractText("document.docx")
    if err != nil {
        log.Fatal(err)
    }
    
    // Generate word-level diff analysis
    diffResult := mydocx.Diff(original, accepted)
    
    // Get readable diff output for LLM analysis
    prettyDiff := diffResult.PrettyPrint()
    fmt.Println(prettyDiff)
    
    // Access structured diff data
    fmt.Printf("Total containers: %d\n", diffResult.Summary.TotalContainers)
    fmt.Printf("Changed containers: %d\n", diffResult.Summary.ChangedContainers)
    fmt.Printf("Insertions: %d, Deletions: %d\n", 
        diffResult.Summary.TotalInsertions, 
        diffResult.Summary.TotalDeletions)
        
    // Process individual container diffs
    for containerName, containerDiff := range diffResult.ContainerDiffs {
        fmt.Printf("Changes in %s:\n", containerName)
        for _, op := range containerDiff.Operations {
            fmt.Printf("  %s: %q\n", op.Type, op.Text)
        }
    }
}

Example diff output:

=== DIFF SUMMARY ===
Total containers: 3
Changed containers: 1
Insertions: 2, Deletions: 1, Equal: 5

=== CONTAINER: word/document.xml ===
The document contains <delete>old content</delete><insert>new updated content</insert> here.

Using Go Templates

import "github.com/xavier268/mydocx"

func main() {
    // Define template data
    data := struct {
        Name    string
        Company string
        Date    string
    }{
        Name:    "John Doe",
        Company: "ACME Corp",
        Date:    time.Now().Format("2006-01-02"),
    }

    // Create a template-based replacer
    replacer := mydocx.NewTplReplacer(data)

    // Modify the document
    err := mydocx.ModifyText("template.docx", replacer, "output.docx")
    if err != nil {
        log.Fatal(err)
    }
}

Custom Replacer

// Define your custom replacer
func myReplacer(container, text string) []string {
    // container: "word/document.xml", "word/footer1.xml", etc.
    // text: original paragraph text
    // Return:
    // - empty slice to remove the paragraph
    // - slice with multiple strings to create multiple paragraphs
    // - slice with one string to replace paragraph content
    
    switch {
    case strings.Contains(text, "DELETE"):
        return []string{} // Remove paragraph
    case strings.Contains(text, "DUPLICATE"):
        return []string{text, text} // Duplicate paragraph
    default:
        return []string{strings.ToUpper(text)} // Convert to uppercase
    }
}

// Use your replacer
err := mydocx.ModifyText("input.docx", myReplacer, "output.docx")

📝 Track Changes Support

This library provides comprehensive support for Microsoft Word track changes (revisions) with different extraction modes:

Text Extraction Options

1. Extract with Changes Accepted (`ExtractText`)

Deletions: Text marked for deletion is excluded from the extracted content
Insertions: Text marked as inserted is included in the extracted content
Result: The extracted text represents the "accepted changes" version

2. Extract Original Text (`ExtractOriginalText`)

Deletions: Text marked for deletion is included to restore original content
Insertions: Text marked as inserted is excluded from the extracted content
Result: The extracted text represents the original document before any changes

3. Byte Array Support

Both extraction modes also support byte array input:

ExtractTextBytes([]byte) - Extract with changes accepted
ExtractOriginalTextBytes([]byte) - Extract original text with changes rejected

Example:

Document with track changes: "Hello [deleted: old] [inserted: new] world"

ExtractText result:         "Hello new world"      (changes accepted)
ExtractOriginalText result: "Hello old world"      (changes rejected)

Text Modification

During text modification (ModifyText), track changes are handled differently:

Deletions: Deletion markup is preserved unchanged in the output document
Insertions: Insertion markup is preserved unchanged in the output document
Templates/Replacers: Only operate on the "clean" text (like extraction), but track changes markup is maintained in the final document

This means:

Your templates and replacers work with clean text (as if changes were accepted)
The original track changes markup is preserved in the template document
The output document maintains the same revision history as the input template

Important Notes

Track changes from the template document are preserved during modification
If you need to work with documents without track changes, accept all changes in Word before using them as templates
The extraction function gives you a preview of what text your templates will process

🔧 Diff Algorithm

Internal LCS Implementation

Starting with v0.5.0, mydocx includes a custom implementation of the Longest Common Subsequence (LCS) algorithm that powers the word-level diff functionality. This removes all external dependencies while providing excellent performance and Unicode support.

Features

Zero external dependencies - completely self-contained
Full Unicode support - handles emojis, CJK characters, mathematical symbols, and mixed scripts
Word-level precision - optimized for document text comparison
O(m×n) time complexity - efficient for typical document sizes
Compatible API - drop-in replacement for previous difflib-based implementation

Unicode Support Examples

The diff algorithm correctly handles:

// Accented characters
original := []string{"café", "naïve", "résumé"}
modified := []string{"coffee", "simple", "resume"}

// Emojis and complex sequences  
original := []string{"Hello", "🌍", "world", "👨‍👩‍👧‍👦"}
modified := []string{"Hello", "🌎", "world", "👨‍👩‍👧‍👧"}

// Mixed scripts
original := []string{"English", "中文", "العربية", "Привет"}
modified := []string{"English", "日本語", "العربية", "Добро"}

// All generate accurate word-level diffs with proper Unicode handling

Performance

Benchmarks on modern hardware:

Small sequences (100 words): ~47μs
Large sequences (1000 words): ~5.7ms
Worst case (completely different): ~62μs per 100 words

The algorithm maintains consistent performance regardless of character encoding complexity.

🔧 Advanced Features

Template Functions

Built-in Functions

All go template functions are available. In addition, the following built-in functions are always available :

{{nl}} - Inserts a new paragraph
{{version}} - Returns version information
{{copyright}} - Returns copyright text
{{date}}- Returns the current date, as 2006-02-10
{{join}} - Expects an array of strings and a delimiter string, returns a single concatenated string with the delimiter (see go function strings.Join)
{{keepEmpty}} - From this point, will never remove a paragraph that becomes empty after modification.
{{removeEmpty}}- From this point, non empty paragraphs that become empty after Replacer is applied are removed. This is the default.

Register Custom Functions

// Register a custom function
mydocx.RegisterTplFunction("upper", strings.ToUpper)

// Use in template
// {{upper .Name}}

Template Guidelines

Each paragraph is an independent template
Templates cannot span across paragraphs

Valid example:

Hello {{.Name}}!
Your order #{{.OrderID}} has been processed.

Invalid example:

Hello {{if .Premium}}
Premium customer {{.Name}}!
{{else}}
Valued customer {{.Name}}!
{{end}}

🔄 Paragraph Management

With Custom Replacer

The Replacer function controls paragraph creation and removal through its return value:

type Replacer func(container string, text string) []string

Remove Paragraph

// Return empty slice to remove the paragraph (unless {{keepEmpty}} was called earlier)
func myReplacer(container, text string) []string {
    if strings.Contains(text, "DELETE") {
        return []string{} // Paragraph will be removed
    }
    return []string{text}
}

Create Multiple Paragraphs

// Return multiple strings to create multiple paragraphs
func myReplacer(container, text string) []string {
    if strings.Contains(text, "DUPLICATE") {
        // Creates three identical paragraphs with the same formatting
        return []string{text, text, text}
    }
    return []string{text}
}

Each new paragraph inherits the formatting of the original paragraph.

With Go Templates

When using the template-based replacer (NewTplReplacer), paragraph management is controlled by newlines in the template output:

Remove Paragraph
```
{{if .ShouldDelete}}{{else}}Original content{{end}}
```
If .ShouldDelete is true, the empty output will remove the paragraph (unless {{keepEmpty}} was called before).

Create Multiple Paragraphs

{{.Title}}
Items:
{{range .Items}} - {{.}}{{nl}}{{end}}
Contact: {{.Contact}}

The {{nl}} function inserts a newline, and the template output is split on newlines to create new paragraphs. Each resulting paragraph inherits the formatting of the original paragraph. Notice how {{range}} ... {{end}} fits within a single source paragraph but will create multiple paragraphs !

Paragraph Creation Rules

Initially empty paragraphs
- Initially empty paragraphs are always left unchanged
Empty Result
- If the Replacer returns an empty slice → paragraph is removed
- If a template produces empty output → paragraph is removed
  Note : this is the default, it can be changed with {{keepEmpty}}
Multiple Paragraphs
- Custom Replacer: Each string in the returned slice becomes a new paragraph
- Template: Output is split on newlines (\n), each line becomes a new paragraph
- All new paragraphs inherit formatting from the original paragraph

Examples with Templates

// Template in document
Dear {{.Name}},
{{if .Premium}}Thank you for being a premium member!{{end}}
Your balance is ${{.Balance}}.

This template could produce:

Dear John Doe,
Thank you for being a premium member!
Your balance is $100.

Or (if not premium):

Dear John Doe,
Your balance is $100.

⚙️ Technical Details

Word Run Management

Microsoft Word splits text into "runs" - segments sharing the same formatting. This creates challenges for text replacement:

Example: "Hello {{.Name}}!" might be split into:
Run 1: "Hello "
Run 2: "{{.Name"
Run 3: "}}!"

Our solution:

Consolidates all runs in a paragraph into the first run
Processes the complete text with a Replacer
Creates new paragraphs for each line in the result

⚠️ Important: Due to this approach, the entire paragraph will inherit the formatting from its beginning.

Tables and Lists

Tables and lists are fully supported
Each cell must contain at least one paragraph (even if empty)
Word will show an error when opening files with empty cells but can recover

🚨 Limitations

Formatting
- Paragraph formatting is unified based on the first run
- In-paragraph formatting variations are lost
Template Boundaries
- Templates must be contained within a single paragraph
- Cross-paragraph templates are not supported
Table Cells
- Avoid creating completely empty cells
- Always include at least one (empty) paragraph in cells

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
diff		diff
testFiles		testFiles
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
diff.go		diff.go
diff_test.go		diff_test.go
docx_test.go		docx_test.go
extract.go		extract.go
go.mod		go.mod
modify.go		modify.go
tpl.go		tpl.go
util.go		util.go
version.go		version.go

Folders and files

Latest commit

History

Repository files navigation

Go DOCX Text Processor

✨ Features

🚀 Installation

📖 Quick Start

Text Extraction

Document Diff Analysis

Simple One-Line Analysis

Detailed Diff Processing

Using Go Templates

Custom Replacer

📝 Track Changes Support

Text Extraction Options

1. Extract with Changes Accepted (ExtractText)

2. Extract Original Text (ExtractOriginalText)

3. Byte Array Support

Text Modification

Important Notes

🔧 Diff Algorithm

Internal LCS Implementation

Features

Unicode Support Examples

Performance

🔧 Advanced Features

Template Functions

Built-in Functions

Register Custom Functions

Template Guidelines

🔄 Paragraph Management

With Custom Replacer

With Go Templates

Paragraph Creation Rules

⚙️ Technical Details

Word Run Management

Tables and Lists

🚨 Limitations

📚 Resources

🤝 Contributing

📝 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 8

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

1. Extract with Changes Accepted (`ExtractText`)

2. Extract Original Text (`ExtractOriginalText`)

Packages