
enh: crawler #269

Closed
thiswillbeyourgithub wants to merge 46 commits into mwmbl:main from thiswillbeyourgithub:enh_crawler_chunk_1

Conversation

@thiswillbeyourgithub

Contributor
  • fix: import HTTPException from ninja to resolve undefined name errors
  • refactor: move crawler configuration to env_vars with environment variable support
  • docs: add initial README for mwmbl crawler
  • feat: add configurable rate limiting for crawler to reduce request bursting
  • feat: import CRAWLER_WORKERS from env_vars and simplify worker configuration
  • docs: add CRAWLER_WORKERS env var description to crawler README
  • docs: Add comprehensive crawler architecture details to README
  • remove burst limiting
  • enh: print error reason in the logger
  • feat: add configurable threads for crawl_batch with env var
  • docs: clarify CRAWLER_WORKERS documentation to specify OS process spawning
  • enh: add check for crawler worker value
  • docs: clarify crawler process and thread architecture in README
  • minor: use logger instead of print for the results
  • feat: add rate limiting with configurable delay and random fuzz
  • docs: Add docstrings to process_batch and run_indexing, update README with workflow details
  • docs: clarify DOMAIN_GROUPS purpose in README for domain authority scoring
  • feat: add crash tracking for index process with 5-crash-per-hour threshold
  • chore: replace pip with uv for faster Python package management
  • feat: add crawler version tracking to batch uploads
  • refactor: move hardcoded constants from env vars to respective files
  • minor: more compact log line for crawling
  • feat: import REDIS_URL from env_vars and use in Redis connection
  • minor: var rename
  • feat: add retry mechanism with exponential backoff for API requests
  • bump python version to 3.11.12
  • upgrade version of pybloomfiltermmap3
  • fix wrong requirements.txt
  • bump python version to 3.11.12
  • enh: add check for redis connection very early
  • fix: botched merge forgot some lines
  • ignore aider files
  • ignore redis_data folder
  • minor: better presentation of the env var in the docker compose
  • ran poetry.lock
  • fix: duplicate crawler version + better declaration
  • feat: Add configurable contact info to user-agent for responsible crawling
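For readers skimming the commit list: the "5-crash-per-hour threshold" item can be pictured as a sliding-window counter. The sketch below is illustrative only and is not taken from this PR; `CrashTracker`, its method names, and the choice to trip when the fifth crash is recorded (rather than the sixth) are all assumptions.

```python
import time
from collections import deque

MAX_CRASHES = 5        # threshold named in the commit list
WINDOW_SECONDS = 3600  # one hour


class CrashTracker:
    """Sliding-window crash counter (hypothetical helper, not the PR's code)."""

    def __init__(self, max_crashes=MAX_CRASHES, window=WINDOW_SECONDS):
        self.max_crashes = max_crashes
        self.window = window
        self._crashes = deque()  # timestamps of recent crashes, oldest first

    def record(self, now=None):
        """Record one crash; return True once the threshold is reached."""
        now = time.monotonic() if now is None else now
        self._crashes.append(now)
        # Forget crashes that are older than the window.
        while self._crashes and now - self._crashes[0] > self.window:
            self._crashes.popleft()
        return len(self._crashes) >= self.max_crashes
```

A supervising loop could call `record()` each time the index process dies and stop restarting it when the call returns True.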

Okay, so I branched from the previous PR (#242), did a first visual pass to fix what I botched during the merges (I had forgotten a few lines here and there), regenerated poetry.lock, fixed a few other things, added configurable contact info, and opened the PR.

I don't have the time yet to give it a go, so there might be fatal errors somewhere, but I preferred to open the draft ASAP to give you more time to review. If there are issues, they are small.

In the end, the main differences from #242 are the lack of LMDB, not using black, and not using loguru.

Feedback welcome

Signed-off-by: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>
because loguru takes care of the tracing

@thiswillbeyourgithub

Contributor Author

I included what is in #264 and addressed #265.

@thiswillbeyourgithub

Contributor Author

Update: pytest is working (DJANGO_SETTINGS_MODULE="mwmbl.settings_dev" pytest), the container builds fine, and up is fine so far. I'm marking it as "ready for review" but will let it run overnight.

@thiswillbeyourgithub thiswillbeyourgithub marked this pull request as ready for review June 18, 2025 17:30
@TechnologyClassroom


I would recommend a default CRAWL_DELAY_SECONDS value of 2.0 instead of 0.0.

@thiswillbeyourgithub

Contributor Author

But the delay is for all URLs, no? I mean, it makes sense not to burst on a single domain, but across all domains it's fine, right?

Also I initially set it to 0.0 to make it "opt in" because that was my idea and I didn't want to trouble the owner too much.
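To make the "opt in" behaviour concrete, here is a rough sketch of a delay with random fuzz that stays disabled at the 0.0 default. It is not the code in this PR: only `CRAWL_DELAY_SECONDS` appears in this thread, while `CRAWL_DELAY_FUZZ_SECONDS` and `next_delay` are hypothetical names made up for the example.

```python
import os
import random

# CRAWL_DELAY_SECONDS is the variable discussed above; 0.0 means "no delay".
CRAWL_DELAY_SECONDS = float(os.environ.get("CRAWL_DELAY_SECONDS", "0.0"))
# Hypothetical companion variable for the random fuzz (an assumption).
CRAWL_DELAY_FUZZ_SECONDS = float(os.environ.get("CRAWL_DELAY_FUZZ_SECONDS", "0.0"))


def next_delay(base=CRAWL_DELAY_SECONDS, fuzz=CRAWL_DELAY_FUZZ_SECONDS, rng=random):
    """Return how long to sleep before the next request, in seconds."""
    if base <= 0:
        return 0.0  # opt-in: the default of 0.0 disables the delay entirely
    # Add a random amount in [0, fuzz] so requests are not evenly spaced.
    return base + rng.uniform(0.0, fuzz)
```

A worker would then call `time.sleep(next_delay())` between requests; with the default of 0.0 that sleep is a no-op.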

@daoudclarke

Contributor

The delay should not be necessary since we only use a single URL from each domain in each batch of URLs.
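As an illustration of that property (not the actual mwmbl batching code, which is not shown in this thread), a batch containing at most one URL per domain could be built as follows. `one_url_per_domain` is a hypothetical helper, and treating the URL's network location as the "domain" is an assumption.

```python
from urllib.parse import urlparse


def one_url_per_domain(urls):
    """Keep the first URL seen for each domain, preserving input order."""
    seen = set()
    batch = []
    for url in urls:
        # Assumption: the "domain" is the network location (host[:port]).
        domain = urlparse(url).netloc.lower()
        if domain and domain not in seen:
            seen.add(domain)
            batch.append(url)
    return batch
```

With a batch built like this, no single host receives more than one request per batch, which is why a per-request delay adds little protection for the remote servers.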

@daoudclarke

Contributor

If we're upgrading Python, is there a reason to use 3.11 instead of 3.13?

@thiswillbeyourgithub

Contributor Author

> The delay should not be necessary since we only use a single URL from each domain in each batch of URLs.

Keeping it at 0 by default then. But the idea was to put a cap on the strain on my end, not on the remote server.

> If we're upgrading Python, is there a reason to use 3.11 instead of 3.13?

IIRC it was because I had trouble installing a dep on later versions, but let me revisit that.

@thiswillbeyourgithub

Contributor Author

I tried with 3.13.3 but have issues with pybloomfilter (!), numpy, pandas etc. I'm sure it's possible without too much trouble but I'll leave it up to you as I'm not familiar with the tradeoffs involved.

@daoudclarke

Contributor

Ok that's fair enough! Good to know, thanks for checking.

I will need to check that the main site works before we merge this, but it looks like great work.

@daoudclarke

Contributor

There are a few things that could be improved here - I'll make some fixes on a branch off your PR and we can merge from there.

@thiswillbeyourgithub

Contributor Author

Alright!

@daoudclarke daoudclarke mentioned this pull request Jun 23, 2025
@daoudclarke

Contributor

Added a commit in #274 - will test out over the next few days, including deploying to beta.mwmbl.org and running against that, then hopefully we can merge 🎉
