Tags: ArkNill/markgrab
feat: add extract_batch() for thread-safe concurrent URL extraction

Playwright browser fallback deadlocks when called from ThreadPoolExecutor. extract_batch() replaces threading with asyncio.gather + Semaphore, eliminating the deadlock entirely.

- extract_batch(): async batch extraction with per-domain rate limiting, per-URL timeout, and configurable concurrency (no threads)
- MCP extract_multiple(): sequential → parallel via extract_batch()
- publish.yml: add GitHub Release creation on publish
- 124 tests (10 new batch tests), ruff clean
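
The asyncio.gather + Semaphore pattern named in this commit can be sketched roughly as follows. This is not markgrab's actual implementation: `extract_batch_sketch`, `fetch_stub`, and every parameter name are hypothetical, and "per-domain rate limiting" is modelled here as a simple per-domain concurrency cap, which is an assumption.

```python
import asyncio
from urllib.parse import urlparse


async def fetch_stub(url: str) -> str:
    """Hypothetical stand-in for a real single-URL extractor."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def extract_batch_sketch(
    urls,
    *,
    max_concurrency: int = 5,   # assumed name: global cap on in-flight URLs
    per_domain: int = 2,        # assumed name: cap on in-flight URLs per host
    timeout: float = 10.0,      # assumed name: per-URL timeout in seconds
    extract=fetch_stub,
):
    """Run `extract` over all URLs on one event loop, with no threads."""
    global_sem = asyncio.Semaphore(max_concurrency)
    domain_sems = {}

    async def one(url: str):
        domain = urlparse(url).netloc
        dom_sem = domain_sems.setdefault(domain, asyncio.Semaphore(per_domain))
        # Take the per-domain slot first so a busy host does not hold
        # global slots while its requests queue up.
        async with dom_sem, global_sem:
            try:
                return await asyncio.wait_for(extract(url), timeout)
            except Exception as exc:
                # Return the failure (including timeouts) in place of the
                # result, so one bad URL does not abort the whole batch.
                return exc

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(one(u) for u in urls))
```

From synchronous code this would be driven with `asyncio.run(extract_batch_sketch(urls))`. Because everything runs as coroutines on a single event loop, no worker threads are involved.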
fix: preserve text after mixed <br> and <br /> tags

Workaround for markdownify #244 / #58: Python's html.parser treats mixed <br> and <br /> as an opening tag, swallowing subsequent text as children. The upstream convert_br() discards child text entirely. Override MarkdownConverter.convert_br() to append the text parameter after the newline, preventing silent content loss in HTML-to-markdown conversion.

- Add _BrFixedConverter with convert_br override
- Replace markdownify() call with custom converter
- Bump version to 0.1.1
- Add 2 regression tests (English + Korean)
- 114 tests passing
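
As background for the parsing issue described above, the following sketch logs the raw events that the standard library's `html.parser.HTMLParser` emits for the two `<br>` spellings. It only illustrates the parser-level difference and is not the markgrab fix itself. `EventLogger` and `log_events` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser callbacks in order, without building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for "<br>": a start tag with no matching end event.
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for "<br />": explicitly self-closing.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html: str):
    parser = EventLogger()
    parser.feed(html)
    parser.close()  # flush any trailing text still buffered
    return parser.events
```

For the input `"one<br>two<br />three"`, the bare `<br>` arrives as a start event only, while `<br />` arrives as a start-end event. A tree builder layered on these events has to decide for itself that `<br>` is a void element, and that is the point at which following text can end up attached as a child.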