chardet: Go character encoding detector

Introduction

This is a Go port of Python's chardet library. Much respect and appreciation to the original authors for their excellent work.

chardet is a character encoding detector library written in Go. It helps you automatically detect the character encoding of text content.

Installation

To install chardet, use go get:

go get github.com/wlynxg/chardet

Supported Encodings & Languages

Supported Encodings:

US-ASCII
UTF-8
UTF-8-SIG
UTF-16
UTF-16LE
UTF-16BE
UTF-32
UTF-32BE
UTF-32LE
GB2312
HZ-GB-2312
Shift_JIS
Big5
KS_C_5601-1987 (Johab)
KOI8-R
TIS-620
x-mac-cyrillic (MacCyrillic)
macintosh (MacRoman)
EUC-TW
EUC-KR
EUC-JP
CP932
CP949
Windows-1250
Windows-1251
Windows-1252
Windows-1253
Windows-1254
Windows-1255
Windows-1256
Windows-1257
ISO-8859-1
ISO-8859-2
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-8859-13
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
X-ISO-10646-UCS-4-3412
X-ISO-10646-UCS-4-2143
IBM855
IBM866

Supported Languages:

- Chinese
- Japanese
- Korean
- Hebrew
- Russian
- Greek
- Bulgarian
- Thai
- Turkish

Usage

Basic Usage

Detecting Text Encoding

The simplest way to use chardet is with the Detect function:

package main

import (
	"fmt"
	"github.com/wlynxg/chardet"
)

func main() {
	data := []byte("Your text data here...")
	result := chardet.Detect(data)
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Charset:US-ASCII Encoding:Ascii Confidence:1 Language:}
}

Result.Encoding continues to expose the legacy value (e.g. Ascii, SHIFT_JIS). For new applications use Result.Charset, which follows IANA naming.
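
Detection is heuristic, so it can be worth checking Confidence before acting on Result.Charset. The following is a minimal sketch of a threshold check with a fallback, using only the standard library so it runs on its own. The detection type is a hypothetical stand-in for chardet's Result: its field names are taken from the printed output above, and the float64 type for Confidence is an assumption. The pickCharset helper and the 0.8 threshold are illustrative and not part of the library.

```go
package main

import "fmt"

// detection is a hypothetical stand-in for chardet's Result, reduced to the
// two fields this sketch reads. The float64 type for Confidence is assumed.
type detection struct {
	Charset    string
	Confidence float64
}

// pickCharset returns the detected charset when it is non-empty and the
// confidence is at least minConfidence; otherwise it returns fallback.
func pickCharset(d detection, minConfidence float64, fallback string) string {
	if d.Charset == "" || d.Confidence < minConfidence {
		return fallback
	}
	return d.Charset
}

func main() {
	confident := detection{Charset: "Shift_JIS", Confidence: 0.99}
	unsure := detection{Charset: "Windows-1252", Confidence: 0.31}

	fmt.Println(pickCharset(confident, 0.8, "UTF-8")) // Shift_JIS
	fmt.Println(pickCharset(unsure, 0.8, "UTF-8"))    // UTF-8
}
```

With the real library, the same check could be applied to the Charset and Confidence fields of the value returned by chardet.Detect. A suitable threshold depends on the data and is best chosen by testing on representative input.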

Decoding text

Use the optional github.com/wlynxg/chardet/lookup helper to map Result.Charset to the corresponding golang.org/x/text/encoding implementation, then decode the bytes to UTF-8:

package main

import (
	"fmt"

	"github.com/wlynxg/chardet"
	"github.com/wlynxg/chardet/lookup"
)

func main() {
	data := []byte("Your text data here...")
	result := chardet.Detect(data)

	enc, err := lookup.LookupEncoding(result.Charset)
	if err != nil {
		panic(err)
	}
	if enc == nil {
		fmt.Printf("no decoder for %s\n", result.Charset)
		return
	}

	decoded, err := enc.NewDecoder().String(string(data))
	if err != nil {
		panic(err)
	}

	fmt.Println(decoded)
}
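
A wrong guess does not always surface as an error: golang.org/x/text decoders generally substitute U+FFFD (the Unicode replacement character) for bytes they cannot map. As a rough sanity check on decoded output, the sketch below measures how much of a string consists of replacement characters. It uses only the standard library; replacementRatio is a hypothetical helper, not part of chardet or the lookup package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replacementRatio reports the fraction of runes in s that are U+FFFD.
// Ranging over a string also yields U+FFFD for invalid UTF-8 bytes, so
// malformed input counts as damaged too. An empty string yields 0.
func replacementRatio(s string) float64 {
	total, bad := 0, 0
	for _, r := range s {
		total++
		if r == utf8.RuneError {
			bad++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(bad) / float64(total)
}

func main() {
	clean := "héllo"
	damaged := "h\uFFFDll\uFFFD"

	fmt.Printf("%.2f\n", replacementRatio(clean))   // 0.00
	fmt.Printf("%.2f\n", replacementRatio(damaged)) // 0.40
}
```

A ratio well above zero suggests the detected charset did not fit the data, in which case retrying with another encoding or flagging the input for review may be appropriate.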

Advanced Usage

For handling large amounts of text, you can use the detector incrementally. This allows the detector to stop as soon as it reaches sufficient confidence in its result.

package main

import (
	"fmt"
	"github.com/wlynxg/chardet"
)

func main() {
	// Create a detector instance
	detector := chardet.NewUniversalDetector(0)
	// Process text in chunks
	chunk1 := []byte("First chunk of text...")
	chunk2 := []byte("Second chunk of text...")
	detector.Feed(chunk1)
	detector.Feed(chunk2)
	// Get the result
	result := detector.GetResult()
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Charset:US-ASCII Encoding:Ascii Confidence:1 Language:}
}

Processing Multiple Files

You can reuse the same detector instance for multiple files by using the Reset() method:

package main

import (
	"fmt"
	"os"
	"github.com/wlynxg/chardet"
)

func main() {
	detector := chardet.NewUniversalDetector(0)
	files := []string{"file1.txt", "file2.txt"}
	for _, file := range files {
		detector.Reset()
		data, err := os.ReadFile(file)
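		if err != nil {
			// Report the failure and move on to the next file.
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", file, err)
			continue
		}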
		detector.Feed(data)
		result := detector.GetResult()
		fmt.Printf("File %s encoding: %+v\n", file, result)
	}
}

License

chardet is licensed under the MIT License, 100% free and open-source, forever.
