Skip to content

Conversation

@WGH-
Copy link
Collaborator

@WGH- WGH- commented Jan 28, 2022

There are several commits, which are largely independent, but they touch the same code, so I thought it would be better to batch them together into single pull request instead of doing 2 extra rebases. Of course, if some of them are more questionable than others, I can split them to separate PRs. Please see the commit messages themselves for the description of what and why I attempted to do.

WGH- added 3 commits February 1, 2022 04:39
http.NewRequest already does exactly what setRequestBody did.
In fact, scrape and setRequestBody used to contain parts of
http.NewRequest implementation.

119df7a implies that NewRequest is slower than assembling
http.Request object manually, likely because it avoids URL
reparsing. Indeed, this change brings some slowdown:

    name           old time/op  new time/op  delta
    OnHTML-24       129µs ± 0%   131µs ± 1%  +1.69%  (p=0.029 n=4+4)
    OnXML-24        130µs ± 0%   132µs ± 1%  +1.69%  (p=0.029 n=4+4)
    OnResponse-24  93.4µs ± 0%  95.5µs ± 1%  +2.30%  (p=0.029 n=4+4)

However, the benchmarks are somewhat unrealistic as the HTML page is
extremely simple, the web server runs on localhost, and the data is
discarded (rather than e.g. stored somewhere). In practice,
impact of this is going to be neglible, and code complexity and
code duplication is not really worth it.
streamToByte has an ugly type switch that tries to rewind the body so it
could be read again.

However, http.NewRequest already does something similar: when Body type
allows, GetBody is populated so request can be retried. So just use it
instead of trying to reimplement the body rewinding, and remove
streamToByte function altogether.
The algorithm - specific hash function and what data is hashed - was
duplicated in several locations. Move it to separate function.
// replace this with http.NewRequestWithContext
req = req.WithContext(c.Context)
setRequestBody(req, requestData)
if err := c.requestCheck(u, parsedURL, method, req.GetBody, depth, checkRevisit); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I'm missing something, but I can't find where the req.GetBody function is implemented. Pls help. =]

Copy link
Collaborator Author

@WGH- WGH- Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetBody is set by NewRequest: https://cs.opensource.google/go/go/+/refs/tags/go1.17.8:src/net/http/request.go;l=889-930;drc=refs%2Ftags%2Fgo1.17.8

This code fragment used to be copy-pasted in Colly (setRequestBody) and was removed in my previous commit.

Copy link
Member

@asciimoo asciimoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@asciimoo asciimoo merged commit b151a08 into gocolly:master Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants