chardet: Go character encoding detector

Introduction

This is a Go port of Python's chardet library. Much respect and appreciation to the original authors for their excellent work.

chardet is a character encoding detector library written in Go. It helps you automatically detect the character encoding of text content.

Installation

To install chardet, use go get:

go get github.com/wlynxg/chardet

Supported Encodings & Languages

Supported encodings:

  • US-ASCII
  • UTF-8
  • UTF-8-SIG
  • UTF-16
  • UTF-16LE
  • UTF-16BE
  • UTF-32
  • UTF-32BE
  • UTF-32LE
  • GB2312
  • HZ-GB-2312
  • Shift_JIS
  • Big5
  • KS_C_5601-1987 (Johab)
  • KOI8-R
  • TIS-620
  • x-mac-cyrillic (MacCyrillic)
  • macintosh (MacRoman)
  • EUC-TW
  • EUC-KR
  • EUC-JP
  • CP932
  • CP949
  • Windows-1250
  • Windows-1251
  • Windows-1252
  • Windows-1253
  • Windows-1254
  • Windows-1255
  • Windows-1256
  • Windows-1257
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • ISO-8859-13
  • ISO-2022-CN
  • ISO-2022-JP
  • ISO-2022-KR
  • X-ISO-10646-UCS-4-3412
  • X-ISO-10646-UCS-4-2143
  • IBM855
  • IBM866

Supported languages:

  • Chinese
  • Japanese
  • Korean
  • Hebrew
  • Russian
  • Greek
  • Bulgarian
  • Thai
  • Turkish

Usage

Basic Usage

Detecting Text Encoding

The simplest way to use chardet is with the Detect function:

package main

import (
	"fmt"
	"github.com/wlynxg/chardet"
)

func main() {
	data := []byte("Your text data here...")
	result := chardet.Detect(data)
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Charset:US-ASCII Encoding:Ascii Confidence:1 Language:}
}

Result.Encoding continues to expose the legacy value (e.g. Ascii, SHIFT_JIS). For new applications use Result.Charset, which follows IANA naming.
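Detection is a statistical guess, so callers may want to act on Result.Charset only when Result.Confidence is high enough. The following is a minimal sketch of that idea using only the standard library. The helper name chooseCharset, the 0.5 threshold, and the UTF-8 fallback are illustrative assumptions rather than part of chardet, and it assumes the confidence value can be converted to float64.

```go
package main

import "fmt"

// chooseCharset returns charset when confidence reaches minConfidence,
// and fallback otherwise. It is a hypothetical helper, not chardet API:
// pass it result.Charset and the confidence reported by the detector.
func chooseCharset(charset string, confidence, minConfidence float64, fallback string) string {
	if charset == "" || confidence < minConfidence {
		return fallback
	}
	return charset
}

func main() {
	// Confident detection: keep the detected charset.
	fmt.Println(chooseCharset("Shift_JIS", 0.99, 0.5, "UTF-8"))
	// Weak detection: use the fallback instead.
	fmt.Println(chooseCharset("Windows-1252", 0.2, 0.5, "UTF-8"))
	// Nothing detected: use the fallback.
	fmt.Println(chooseCharset("", 0, 0.5, "UTF-8"))
}
```

The right threshold depends on the input; short or mostly ASCII samples tend to give the detector less to work with.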

Decoding text

Use the optional github.com/wlynxg/chardet/lookup helper to map Result.Charset to an encoding from golang.org/x/text/encoding, then decode the data with it:

package main

import (
	"fmt"

	"github.com/wlynxg/chardet"
	"github.com/wlynxg/chardet/lookup"
)

func main() {
	data := []byte("Your text data here...")
	result := chardet.Detect(data)

	enc, err := lookup.LookupEncoding(result.Charset)
	if err != nil {
		panic(err)
	}
	if enc == nil {
		fmt.Printf("no decoder for %s\n", result.Charset)
		return
	}

	decoded, err := enc.NewDecoder().String(string(data))
	if err != nil {
		panic(err)
	}

	fmt.Println(decoded)
}
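Input detected as UTF-8-SIG, UTF-16, or UTF-32 usually starts with a byte order mark. Whether the decoder returned by the lookup helper removes it is not stated here, so the sketch below assumes it might survive decoding and strips a leading U+FEFF from the decoded string. stripBOM is a hypothetical helper built on the standard library, not part of chardet.

```go
package main

import (
	"fmt"
	"strings"
)

// stripBOM removes one leading U+FEFF (byte order mark) from text that has
// already been decoded to UTF-8. Text without a leading BOM is returned as-is.
func stripBOM(s string) string {
	return strings.TrimPrefix(s, "\uFEFF")
}

func main() {
	decoded := "\uFEFFhello" // stand-in for the output of a decoder
	fmt.Printf("%q\n", stripBOM(decoded))
	fmt.Printf("%q\n", stripBOM("hello"))
}
```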

Advanced Usage

For handling large amounts of text, you can use the detector incrementally. This allows the detector to stop as soon as it reaches sufficient confidence in its result.

package main

import (
	"fmt"
	"github.com/wlynxg/chardet"
)

func main() {
	// Create a detector instance
	detector := chardet.NewUniversalDetector(0)
	// Process text in chunks
	chunk1 := []byte("First chunk of text...")
	chunk2 := []byte("Second chunk of text...")
	detector.Feed(chunk1)
	detector.Feed(chunk2)
	// Get the result
	result := detector.GetResult()
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Charset:US-ASCII Encoding:Ascii Confidence:1 Language:}
}
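In practice the chunks often come from a file or network stream rather than from literals. The sketch below shows one way to drive a detector from an io.Reader. To stay self-contained it takes a feed callback instead of importing chardet; with the detector above, the callback body would call detector.Feed(chunk). The names feedChunks and feedString, the chunk size, and the stop flag returned by the callback are assumptions made for illustration. In particular, how the detector signals that it has seen enough is not shown in this README, so the callback simply returns false when no such signal is available.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// feedChunks reads r in chunks of at most chunkSize bytes and hands each
// chunk to feed. It stops at EOF, on a read error, or as soon as feed
// returns true. It returns the number of bytes passed to feed.
// The buffer is reused between calls, so feed must not retain the slice.
func feedChunks(r io.Reader, chunkSize int, feed func(chunk []byte) (stop bool)) (int64, error) {
	if chunkSize <= 0 {
		return 0, fmt.Errorf("chunkSize must be positive, got %d", chunkSize)
	}
	buf := make([]byte, chunkSize)
	var total int64
	for {
		n, err := r.Read(buf)
		if n > 0 {
			total += int64(n)
			if feed(buf[:n]) {
				return total, nil
			}
		}
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// feedString is a small convenience wrapper for in-memory text.
func feedString(s string, chunkSize int, feed func(chunk []byte) (stop bool)) (int64, error) {
	return feedChunks(strings.NewReader(s), chunkSize, feed)
}

func main() {
	chunks := 0
	total, err := feedString("First chunk of text... Second chunk of text...", 16,
		func(chunk []byte) bool {
			chunks++
			// With chardet: detector.Feed(chunk)
			return false // no early-stop signal in this sketch
		})
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("fed %d bytes in %d chunks\n", total, chunks)
}
```

After the loop returns, detector.GetResult() would be called exactly as in the example above.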

Processing Multiple Files

You can reuse the same detector instance for multiple files by using the Reset() method:

package main

import (
	"fmt"
	"os"
	"github.com/wlynxg/chardet"
)

func main() {
	detector := chardet.NewUniversalDetector(0)
	files := []string{"file1.txt", "file2.txt"}
	for _, file := range files {
		detector.Reset()
		data, err := os.ReadFile(file)
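		if err != nil {
			// Report the failure and move on to the next file
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", file, err)
			continue
		}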
		detector.Feed(data)
		result := detector.GetResult()
		fmt.Printf("File %s encoding: %+v\n", file, result)
	}
}

License

chardet is licensed under the MIT License, 100% free and open-source, forever.
