HTML encoding is not autodetected properly #777

@Dinver

Hi! When I try to detect the encoding on sites that use windows-1251, the text comes out garbled:
2023/08/23 21:45:10 ÑÄÎ «Ïðîìåòåé» | ÎÎÎ «Âèðòóàëüíûå òåõíîëîãèè â îáðàçîâàíèè»
2023/08/23 21:45:10 Ýëåêòðîííûå êóðñû
2023/08/23 21:45:10 Ïðîäóêòû

Example:

package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.DetectCharset(),
		colly.Async(true),
	)
	c.OnHTML("title", func(e *colly.HTMLElement) {
		title := e.Text
		log.Println(title)
	})
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		title := e.Text
		log.Println(title)
	})

	c.OnHTML("img", func(e *colly.HTMLElement) {
		title := e.Attr("alt")
		log.Println(title)
	})

	c.Visit("https://prometeus.ru/")
	c.Wait()
}

Neither colly.DetectCharset() nor c.DetectCharset = true works.
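
For diagnosis, the log output above looks consistent with windows-1251 bytes being read as Latin-1 (one byte per character) instead of being decoded as Cyrillic. The sketch below reverses that under this assumption, using only the standard library. The helper name undoCP1251Mojibake is hypothetical and not part of colly, and the mapping is deliberately partial: it covers only the letters А..я plus Ё and ё, and passes everything else through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// undoCP1251Mojibake is a hypothetical helper (not part of colly).
// It assumes each rune in s is a windows-1251 byte that was wrongly
// decoded as Latin-1, and maps it back to the Cyrillic letter.
// Only 0xC0..0xFF (А..я), 0xA8 (Ё) and 0xB8 (ё) are handled.
func undoCP1251Mojibake(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch {
		case r >= 0xC0 && r <= 0xFF:
			// In windows-1251, bytes 0xC0..0xFF are U+0410..U+044F in order.
			b.WriteRune(0x0410 + (r - 0xC0))
		case r == 0xA8:
			b.WriteRune('Ё')
		case r == 0xB8:
			b.WriteRune('ё')
		default:
			// ASCII, «, », and anything unmapped are kept as-is.
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Strings copied from the log output in this issue.
	fmt.Println(undoCP1251Mojibake("Ïðîäóêòû"))          // prints "Продукты"
	fmt.Println(undoCP1251Mojibake("Ýëåêòðîííûå êóðñû")) // prints "Электронные курсы"
}
```

If the strings decode to readable Russian like this, the page bytes are windows-1251 and were not converted before parsing. This is only a way to confirm the symptom. In a real program a complete decoder such as charmap.Windows1251 from golang.org/x/text/encoding/charmap, applied to the raw response body, would be a more robust choice than this partial table.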
