chore: revert README to previous version #4
Pull Request Overview
This pull request adds a new Python script that fetches the latest posts from the MySQL Taobao monthly archives and updates README.md with new database-related articles. The script parses the HTML content, categorizes the articles, and appends the new entries to the README in markdown table format.
- Adds a complete web scraping script for MySQL monthly archives
- Implements HTML parsing to extract articles by database categories
- Provides README.md update functionality with new article entries
| handler = urllib.request.ProxyHandler({}) # disable env proxies | ||
| opener = urllib.request.build_opener(handler) | ||
| try: | ||
| with opener.open(req, timeout=10) as resp: # pragma: no cover - network I/O |
The hardcoded timeout value of 10 seconds should be configurable. Consider adding a TIMEOUT environment variable or making it a module-level constant to allow customization for different network conditions.
| with opener.open(req, timeout=10) as resp: # pragma: no cover - network I/O | |
| with opener.open(req, timeout=TIMEOUT) as resp: # pragma: no cover - network I/O |
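The suggestion above references TIMEOUT without defining it. A minimal sketch of one way to define it follows, assuming an environment variable named TIMEOUT as the comment proposes; the helper read_timeout and the constant DEFAULT_TIMEOUT are hypothetical names, not part of the script under review.

```python
import os

# Default matches the value that was hardcoded in the original call.
DEFAULT_TIMEOUT = 10.0  # seconds


def read_timeout(env=None, name="TIMEOUT", default=DEFAULT_TIMEOUT):
    """Return the timeout in seconds, falling back to `default` on bad input.

    `env` is any mapping of strings; it defaults to os.environ so the
    function can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        # Non-numeric value: ignore it rather than crash at import time.
        return default
    # Zero, negative, or NaN timeouts are not meaningful here.
    return value if value > 0 else default


# Module-level constant, evaluated once at import.
TIMEOUT = read_timeout()
```

With this in place, running `TIMEOUT=30 python scripts/update_readme.py` would raise the limit for slow networks, while an unset or malformed value keeps the 10-second default.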
| # skip categories not present | ||
| continue | ||
| lines = [f"| {typ} | [{title}]({link}) |" for typ, title, link in items] | ||
| content = content.replace(table_header, table_header + "\n" + "\n".join(lines)) |
Using string replace for content insertion is fragile and could cause incorrect replacements if the table header appears multiple times in the README. Consider using regex with word boundaries or a more precise matching approach to ensure only the intended table is updated.
| content = content.replace(table_header, table_header + "\n" + "\n".join(lines)) | |
| # Use regex to match the table header and following table rows | |
| pattern = re.compile( | |
| re.escape(table_header) + r"(?:\n(?:\|.*\|))*", | |
| re.MULTILINE | |
| ) | |
| replacement = table_header + "\n" + "\n".join(lines) | |
| content, count = pattern.subn(replacement, content, count=1) | |
| if count == 0: | |
| # fallback: skip if not matched (should not happen due to earlier check) | |
| continue |
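One caveat on the suggested pattern: its trailing group also matches the table rows that follow the header, so the substitution appears to replace existing rows instead of inserting above them, which differs from what the original `replace` call does. If the only goal is to touch a single table, a simpler sketch is to limit `str.replace` to the first occurrence. The function name insert_rows_after_header is hypothetical, and the sketch assumes the intended table is the first one whose header matches.

```python
def insert_rows_after_header(content, table_header, lines):
    """Insert `lines` directly after the first occurrence of `table_header`.

    Returns (new_content, inserted). Existing rows under the header are
    kept; later tables with the same header are left untouched.
    """
    if not lines or table_header not in content:
        return content, False
    block = table_header + "\n" + "\n".join(lines)
    # The third argument limits the replacement to the first match only.
    return content.replace(table_header, block, 1), True
```

Because no regex replacement string is involved, titles containing backslashes or group references such as `\1` are inserted literally.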
| updates = defaultdict(list) | ||
| for y, m in months: | ||
| url = f"{BASE_URL}/{y:04d}/{m:02d}/" | ||
| if url in existing: |
Using string containment check on the entire README content for each URL is inefficient. Consider extracting existing URLs once at the beginning using regex and storing them in a set for O(1) lookup instead of O(n) string searches.
| if url in existing: | |
| # Extract all URLs from the README content and store in a set for O(1) lookup | |
| existing_urls = set(re.findall(r"https?://[^\s\)]+", existing)) | |
| updates = defaultdict(list) | |
| for y, m in months: | |
| url = f"{BASE_URL}/{y:04d}/{m:02d}/" | |
| if url in existing_urls: |
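Note that exact set membership is stricter than the original substring check: if the README stores full article links rather than month URLs, a month URL would be a substring of the content but not a member of the set. A hedged sketch that keeps the O(1) lookup while preserving month-level matching is below. It assumes BASE_URL is https://mysql.taobao.org/monthly and that stored links begin with BASE_URL/YYYY/MM/; both are assumptions, and existing_month_urls is a hypothetical helper.

```python
import re

# Assumed value; the script's real BASE_URL is not shown in this review.
BASE_URL = "https://mysql.taobao.org/monthly"


def existing_month_urls(readme_text, base_url=BASE_URL):
    """Collect the month-level URLs already referenced in the README.

    Any link starting with base_url/YYYY/MM/ counts, whether it points
    at the month index or at an article underneath it.
    """
    pattern = re.compile(re.escape(base_url) + r"/(\d{4})/(\d{2})/")
    return {f"{base_url}/{y}/{m}/" for y, m in pattern.findall(readme_text)}
```

The loop can then build the set once before iterating and test `url in seen`, where `seen = existing_month_urls(existing)`.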
Summary
Testing
python scripts/update_readme.py (fails: RuntimeError: failed to fetch https://mysql.taobao.org/monthly/: <urlopen error [Errno 101] Network is unreachable>)
https://chatgpt.com/codex/tasks/task_e_689ed80fd5808330b5cd142dc02a4729