enh: crawler #269
Signed-off-by: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>
…ADME with workflow details
because loguru takes care of the tracing
|
Update: pytest working ( |
|
I would recommend a default |
|
But the delay applies to all URLs, no? I mean, it makes sense not to burst on a single domain, but across all domains it's fine, right? Also, I initially set it to 0.0 to make it "opt in" because the delay was my idea and I didn't want to trouble the owner too much. |
|
The delay should not be necessary since we only use a single URL from each domain in each batch of URLs. |
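To illustrate the point about batches, here is a minimal sketch of grouping URLs so that each batch holds at most one URL per domain. This is not taken from the mwmbl code base: the function name `batches_one_url_per_domain` and the use of `urlparse(...).netloc` as the domain key are assumptions made for the example.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def batches_one_url_per_domain(urls):
    """Yield batches in which every URL belongs to a different domain.

    Hypothetical helper for illustration only; the real crawler may
    build its batches differently.
    """
    # One FIFO queue of pending URLs per domain (netloc is an assumed key).
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc].append(url)

    while queues:
        # Take exactly one URL from each domain that still has work left.
        batch = [queue.popleft() for queue in queues.values()]
        # Forget domains whose queue is now empty.
        for domain in [d for d, q in queues.items() if not q]:
            del queues[domain]
        yield batch


if __name__ == "__main__":
    example = ["https://a.com/1", "https://a.com/2", "https://b.org/x"]
    for batch in batches_one_url_per_domain(example):
        print(batch)
```

With batches built like this, no single domain receives more than one request per batch, which is why a per-request delay adds little protection for the remote servers.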
|
If we're upgrading Python, is there a reason to use 3.11 instead of 3.13? |
Keeping it at 0 by default then. But the idea was to put a cap on the strain on my end, not on the remote server.
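For reference, a rough sketch of what an opt-in delay like this could look like on the crawler side. The class name `RequestThrottle` and the `delay_seconds` parameter are made up for the example and are not the names used in this PR; the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class RequestThrottle:
    """Enforce a minimum gap between consecutive requests (hypothetical helper).

    A delay of 0.0 disables throttling, so the feature stays opt-in.
    """

    def __init__(self, delay_seconds=0.0, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least delay_seconds have passed since the last call."""
        if self.delay_seconds > 0 and self._last is not None:
            remaining = self.delay_seconds - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


if __name__ == "__main__":
    throttle = RequestThrottle(delay_seconds=0.2)
    for url in ["https://a.com/1", "https://b.org/x"]:
        throttle.wait()  # the second iteration pauses for roughly 0.2 s
        print("fetching", url)
```

Because the gap is measured locally and is independent of the target domain, it limits the load on the machine running the crawler rather than on any remote server.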
Iirc it was because I had trouble installing a dep on later versions but let me revisit that |
|
I tried with 3.13.3 but have issues with pybloomfilter (!), numpy, pandas etc. I'm sure it's possible without too much trouble but I'll leave it up to you as I'm not familiar with the tradeoffs involved. |
|
Ok that's fair enough! Good to know, thanks for checking. I will need to check that the main site works before we merge this, but it looks like great work. |
|
There are a few things that could be improved here - I'll make some fixes on a branch off your PR and we can merge from there. |
|
Alright! |
|
Added a commit in #274 - will test out over the next few days, including deploying to beta.mwmbl.org and running against that, then hopefully we can merge 🎉 |
CRAWLER_WORKERS documentation to specify OS process spawning
process_batch and run_indexing, update README with workflow details
DOMAIN_GROUPS purpose in README for domain authority scoring
Okay so I sort of branched from the previous PR (#242), then did a first visual pass to fix what I botched during the merges (forgot a few lines here and there), then ran poetry.lock, fixed a few things here and there, added a configurable contact info and made the PR.
I don't have the time yet to give it a go, so there might be fatal errors somewhere, but I preferred to open the draft asap to give you more time to review. If there are issues, they should be small.
The main difference with #242 in the end is the lack of LMDB, not using black, not using loguru.
Feedback welcome