From dc4ae73221780b6654db32c3c041502bf9002710 Mon Sep 17 00:00:00 2001
From: unclecode
Date: Sun, 1 Feb 2026 05:41:33 +0000
Subject: [PATCH] Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline

- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
---
 .context/PR-TODOLIST.md      | 149 +++++++++++++++++++++++++++++++++++
 CONTRIBUTORS.md              |   1 +
 crawl4ai/async_webcrawler.py |  11 ++-
 3 files changed, 159 insertions(+), 2 deletions(-)
 create mode 100644 .context/PR-TODOLIST.md

diff --git a/.context/PR-TODOLIST.md b/.context/PR-TODOLIST.md
new file mode 100644
index 00000000..690f062e
--- /dev/null
+++ b/.context/PR-TODOLIST.md
@@ -0,0 +1,149 @@
+# PR Review Todolist
+
+> Last updated: 2026-02-01 | Total open PRs: 85
+
+---
+
+## Solid Bug Fixes
+
+| PR | Author | Description | Status |
+|----|--------|-------------|--------|
+| ~~#1746~~ | ~~ChiragBellara~~ | ~~Fix: sitemap-only seeding was initializing Common Crawl unnecessarily~~ | **merged** |
+| ~~#1721~~ | ~~YuriNachos~~ | ~~Fix `<base>` tag ignored in html2text — relative links resolve wrong. (#1680)~~ | **merged** |
+| ~~#1720~~ | ~~YuriNachos~~ | ~~Fix LLM schema generation fails when LLM wraps JSON in markdown code blocks. (#1663)~~ | **closed (already fixed)** |
+| ~~#1719~~ | ~~YuriNachos~~ | ~~Fix GoogleSearchCrawler `script.js` missing from package distribution. (#1711)~~ | **merged** |
+| ~~#1717~~ | ~~YuriNachos~~ | ~~Fix local sentence-transformers embeddings blocked by OpenAI fallback. (#1658)~~ | **merged** |
+| ~~#1714~~ | ~~YuriNachos~~ | ~~Fix: Replace `tf-playwright-stealth` with `playwright-stealth` dependency. (#1553)~~ | **merged** |
+| #1667 | christian-oudard | Fix `crwl --deep-crawl` only outputting first page. Real CLI bug with tests. | pending |
+| #1640 | Martichou | Fix memory leak — unused browser contexts never cleaned up under continuous load. (#943) | pending |
+| #1622 | zhaoyun006 | Fix redirect target verification in AsyncUrlSeeder and enhance tests. | pending |
+| #1592 | jzmiller1 | Fix CDP page leaks and race conditions in concurrent crawling. (#1563) | pending |
+| #1572 | yuexuan-chen | Fix CDP setting with managed browser. | pending |
+| #1450 | prlz77 | Fix LLM extraction fails when content is in alternative response fields. | pending |
+| #1364 | nnxiong | Fix `