- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
PR Review Todolist
Last updated: 2026-02-06 | Total open PRs: ~63
Solid Bug Fixes
| PR | Author | Description | Status |
|---|---|---|---|
| | | | merged |
| #1721 | YuriNachos | `<base>` tag ignored in html2text; relative links resolve wrong. (#1680) | merged |
| | | | closed (already fixed) |
| #1719 | YuriNachos | script.js missing from package distribution. (#1711) | merged |
| | | | merged |
| #1714 | YuriNachos | Replace tf-playwright-stealth with playwright-stealth dependency. (#1553) | merged |
| #1667 | christian-oudard | Fix crwl --deep-crawl only outputting first page. Real CLI bug with tests. | merged |
| | | | closed |
| #1622 | Ahmed-Tawfik94 | Fix redirect target verification in AsyncUrlSeeder and enhance tests. | pending |
| #1592 | Ahmed-Tawfik94 | Fix CDP page leaks and race conditions in concurrent crawling. (#1563) | pending |
| #1572 | Ahmed-Tawfik94 | Fix CDP setting with managed browser. | pending |
| #1450 | rbushri | Fix LLM extraction failing when content is in alternative response fields. | closed (litellm handles response field normalization) |
| #1364 | nnxiong | Fix `<script>` tag removal losing adjacent text in cleaned_html. | merged |
| #1308 | dominicx | Fix css_selector variable type error (assigned to list). | pending |
| #1296 | vladmandic | Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var. 1-line fix. | merged |
| #1281 | garyluky | Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS. Fixes #993, #974, #1109. | merged |
| #1234 | AdarsHH30 | Fix TypeError when keep_data_attributes=False by ensuring list concat. | pending |
| #1211 | Praneeth1-O-1 | Fix: safely create new page if no page exists in persistent context. | pending |
| #1207 | moncapitaine | Fix streaming error handling. | pending |
| #1200 | fischerdr | Bugfix browser manager session handling. | pending |
| #1179 | phamngocquy | Fix leak token when input url as raw html. | pending |
| #1150 | scris | LLM extraction: response variable not overridden, causing 'str' has no attribute 'choices'. | closed (already fixed) |
| | | | closed (already fixed) |
| #1106 | devxpain | Fix: Adapt to CrawlerMonitor constructor change. | pending |
| #1081 | Joorrit | Fix inverted deep crawl scorer logic (high-distance paths scored higher). | needs work (commented) |
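| #1077 | RoyLeviLangware | Fix bs4 deprecation warning (text -> string). 1 line. | merged |
| #1073 | saipavanmeruga7797 | Local HTML capture_console bug with capture_console_messages=False. | closed (already fixed) |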
| #1065 | mccullya | Fix: Update deprecated Groq models to recommended replacements. | pending |
| #1059 | Aaron2516 | Fix wrong proxy config type in proxy demo example. | pending |
| #1058 | Aaron2516 | Fix dict-type proxy_config not handled properly. (#1057) | pending |
| #983 | umerkhan95 | Fix memory leak and empty responses in streaming mode. (#980) | pending |
| #973 | danyQe | Fix temperature typo in async_configs.py. 1 line. | closed (already fixed) |
| #948 | GeorgeVince | Fix summarize_page.py example. | pending |
| | | | closed (already fixed) |
| #462 | jtanningbed | Fix: Add newline before pre codeblock start in html2text. 1-line fix. | pending |
Good Features
| PR | Author | Description | Status |
|---|---|---|---|
| #1730 | hoi | Add configurable TTL for Redis task data. Prevents unbounded memory growth. | pending |
| #1729 | hoi | Add support for external Redis with embedded Redis disable option. | pending |
| #1707 | dillonledoux | Add Crawl-delay directive support from robots.txt. Good compliance feature. | pending |
| #1706 | vikas-gits-good | Fix arun_many not working with DeepCrawlStrategy. (#1277) | pending |
| #1702 | YxmMyth | Add CSS background image extraction. (#1691) | pending |
| #1689 | mzyfree | Docker: optimize concurrency performance and memory management. | pending |
| #1683 | Vaccarini-Lorenzo | Implement double config for AdaptiveCrawler. | pending |
| #1674 | blentz | Add output pagination/control for MCP endpoints. Useful for LLM context windows. | pending |
| #1668 | microHoffman | Add --json-ensure-ascii CLI flag for Unicode handling. Clean, small. | pending |
| #1650 | KennyStryker | Add support for Vertex AI in LLM Extraction Strategy. | pending |
| #1580 | arpagon | Add Azure OpenAI configuration support to crwl config. | pending |
| #1463 | TristanDonze | Add configurable device_scale_factor for screenshot quality. 3 files, clean. | merged |
| #1435 | charlaie | Add redirected_status_code to CrawlResult. 3 files, clean. | merged |
| #1425 | denrusio | Add OpenRouter API support. | pending |
| #1417 | NickMandylas | Add CDP headers support for remote browser auth (AWS Bedrock etc). | pending |
| #1290 | 130347665 | Support type-list pipeline in JsonElementExtraction (multi-step extract). | pending |
| #1255 | itsskofficial | Fix JsonCssSelector to handle adjacent sibling CSS selectors (+ tr). | pending |
| #1245 | mukul-atomicwork | Feature: GitHub releases integration. | pending |
| #1238 | yerik515 | Fix ManagedBrowser constructor and Windows encoding issues. | pending |
| #1220 | dcieslak19973 | Allow OPENAI_BASE_URL to be used to control the base_url for the LLM. | pending |
| #1180 | kunalmanelkar | Add CallbackURLFilter for custom URL filtering in deep crawling. | pending |
| #999 | loliw | Add filters that filter based on regular expressions in deep crawling. | pending |
| #901 | gbe3hunna | CrawlResult model: add pydantic fields and descriptions. | pending |
| #800 | atomlong | ensure_ascii=False for json.dumps to support non-ASCII characters. | pending |
| #799 | atomlong | Allow setting base_url for LLM extraction strategy in CLI. | pending |
| #741 | atomlong | Add config option to control Content-Security-Policy header. | pending |
| #723 | alexandreolives | Optional close page after screenshot. | pending |
| #681 | ksallee | JS execution should happen after waiting (reorder in strategy). | pending |
| #416 | dar0xt | Add keep-aria-label-attribute option. 6 files. | pending |
| #332 | nelzomal | Add remove_invisible_texts method to crawler strategy. | pending |
| #312 | AndreaFrancis | Add save to HuggingFace support for async webcrawler. 367 additions, 9 files. | pending |
Quick Doc/Maintenance Merges
| PR | Author | Description | Status |
|---|---|---|---|
| #1734 | pgoslatara | Update outdated GitHub Actions versions (v4->v6). 2 files. | pending |
| #1722 | YuriNachos | Add missing docstring to MCP md endpoint. | pending |
| #1716 | YuriNachos | Fix wrong return types in arun/arun_many docs. | pending |
| #1715 | YuriNachos | Add missing CacheMode import in quickstart docs. | pending |
| #1655 | daviddl9 | Chinese docstring. | closed (keeping intentionally) |
| #1494 | AkosLukacs | Fix wrong param name in arun() docstring. | pending |
| #1488 | AkosLukacs | Fix syntax error in README JSON example. | pending |
| #1483 | NiclasLindqvist | Update README.md with latest docker image. | pending |
| #1416 | adityaagre | Fix missing bracket in README code block. | pending |
| #1272 | zhenjunMa | Fix get title bug in amazon example. | pending |
| #1263 | vvanglro | Fix: consistent with sdk behavior. | pending |
| #1225 | albertkim | Fix docker deployment guide URL. | pending |
| #1223 | dowithless | Docs: add links to other language versions of README. | pending |
| #1159 | lbeziaud | Fix cleanup warning when no process on debug port. 1 line. | pending |
| #1098 | B-X-Y | Docs: fix outdated links to Docker guide and release notes. | pending |
| #1093 | Aaron2516 | Docs: Fixed incorrect elapsed calculation and output format. | pending |
| #948 | GeorgeVince | Fix summarize_page.py example. | pending |
| #931 | stevenaldinger | Duplicate PROMPT_EXTRACT_BLOCKS removed. | closed (fixed ourselves) |
| #967 | prajjwalnag | Update README.md. | pending |
| #671 | SteveAlphaVantage | Update README.md. | pending |
| #605 | mochamadsatria | Fix typo in docker-deployment.md filename. | pending |
| #335 | amanagarwal042 | Add Documentation for Monitoring with OpenTelemetry. | pending |
Duplicates (Close These)
| PR | Duplicate Of | Description |
|---|---|---|
| #1703 | #1721 | `<base>` tag fix |
| #1698 | #1721 | `<base>` tag fix |
| #800 | #1668 | Overlaps with --json-ensure-ascii feature |
| #475 | #1296 | Same CRAWL4_AI_BASE_DIRECTORY fix for VersionManager, DocsManager, migrations. #1296 already merged. |
Skip / Close
| PR | Author | Why | Status |
|---|---|---|---|
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| #1533 | unclecode | Add Claude Code GitHub Workflow — CI workflow, not core. | skipped (owner's PR) |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
| #1124 | unclecode | VNC streaming support — 98 additions, niche. | skipped (owner's PR) |
| | | | closed |
| | | | closed |
| | | | closed |
| | | | closed |
Resolved This Session
| PR | Author | Description | Merged Date |
|---|---|---|---|
| #1694 | theredrad | feat: add force viewport screenshot | 2026-02-01 |
| #1746 | ChiragBellara | fix: avoid Common Crawl calls for sitemap-only URL seeding | 2026-02-01 |
| #1714 | YuriNachos | fix: replace tf-playwright-stealth with playwright-stealth | 2026-02-01 |
| #1721 | YuriNachos | fix: respect `<base>` tag for relative link resolution in html2text | 2026-02-01 |
| #1719 | YuriNachos | fix: include GoogleSearchCrawler script.js in package distribution | 2026-02-01 |
| #1717 | YuriNachos | fix: allow local embeddings by removing OpenAI fallback | 2026-02-01 |
| #1720 | YuriNachos | closed: LLM schema markdown fences (already fixed on develop) | 2026-02-01 |
| #1703 | — | closed: duplicate of #1721 | 2026-02-01 |
| #1698 | — | closed: duplicate of #1721 | 2026-02-01 |
| #1697 | — | closed: duplicate of #1717 | 2026-02-01 |
| #1710 | — | closed: duplicate of #1719 | 2026-02-01 |
| #1667 | christian-oudard | fix: deep-crawl CLI outputting only the first page | 2026-02-01 |
| #1296 | vladmandic | fix: VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var | 2026-02-01 |
| #1364 | nnxiong | fix: script tag removal losing adjacent text in cleaned_html | 2026-02-01 |
| #1150 | scris | closed: LLM extraction response variable (already fixed on develop) | 2026-02-01 |
| #1077 | RoyLeviLangware | fix: bs4 deprecation warning (text -> string) | 2026-02-01 |
| #1281 | garyluky | fix: proxy auth ERR_INVALID_AUTH_CREDENTIALS | 2026-02-01 |
| #973 | danyQe | closed: temperature typo (already fixed on develop) | 2026-02-02 |
| #1073 | saipavanmeruga7797 | closed: local HTML capture_console bug (already fixed on develop) | 2026-02-02 |
| #931 | stevenaldinger | closed: duplicate PROMPT_EXTRACT_BLOCKS removed (fixed ourselves) | 2026-02-02 |
| #1655 | daviddl9 | closed: Chinese docstring kept intentionally | 2026-02-02 |
| #1133 | chrizzly2309 | closed: JWT auth bypass (already fixed on develop) | 2026-02-02 |
| #729 | complete-dope | closed: console logging error (already fixed on develop) | 2026-02-02 |
| #1600 | cbwinslow | closed: accidental dump (ASDF) | 2026-02-02 |
| #1100 | xerexesx | closed: empty PR | 2026-02-02 |
| #1110 | lwsinclair | closed: marketing badge spam | 2026-02-02 |
| #1724 | git-pranavbabu | closed: template PR title, trivial | 2026-02-02 |
| #1569 | Ahmed-Tawfik94 | closed: too large (17k+ additions) | 2026-02-02 |
| #1630 | Daniel21b | closed: too large, unsolicited JWT auth | 2026-02-02 |
| #1700 | chansearrington | closed: too large, niche LLM provider | 2026-02-02 |
| #1525 | leoric-crown | closed: too large, MCP rewrite | 2026-02-02 |
| #1420 | ntohidi | closed: too large, telemetry system | 2026-02-02 |
| #1497 | Akeemkabiru | closed: niche Firecrawl integration | 2026-02-02 |
| #1518 | YorelN | closed: Docker PDF strategy | 2026-02-02 |
| #1274 | Fiser12 | closed: devcontainer support | 2026-02-02 |
| #1413 | GarfieldTheOldCat | closed: unclear scope | 2026-02-02 |
| #1373 | ywatanabe1989 | closed: too large, MCP fixes | 2026-02-02 |
| #1212 | ACakshay | closed: MCP transport | 2026-02-02 |
| #1157 | yesidc | closed: overlaps existing cache freshness | 2026-02-02 |
| #1140 | tmocky1134 | closed: not core | 2026-02-02 |
| #1068 | jeremygiberson | closed: playground feature | 2026-02-02 |
| #865 | janbuchar | closed: external Apify integration | 2026-02-02 |
| #680 | lassedrud | closed: 80k additions, Jupyter notebook | 2026-02-02 |
| #1547 | mziv | closed: 100-file lockfile update | 2026-02-02 |
| #1496 | Ahmed-Tawfik94 | closed: too large normalize_url refactor | 2026-02-02 |
| #1565 | TrungLee2020 | closed: not core (Vietnamese crawler scripts) | 2026-02-02 |
| #1083 | Sacristaan | closed: overlaps with #1220 | 2026-02-02 |
| #1395 | granolacowboy | closed: no description | 2026-02-02 |
| #1408 | PATAKAMURIVENKATAGANESH | closed: no description | 2026-02-02 |
| #1696 | majiayu000 | closed: duplicate of #1722 | 2026-02-02 |
| #1478 | e1codes | closed: duplicate of #1715 | 2026-02-02 |
| #1465 | fardhanrasya | closed: duplicate of #1715 | 2026-02-02 |
| #1450 | rbushri | closed: litellm handles response field normalization | 2026-02-06 |
| #1463 | TristanDonze | feat: add configurable device_scale_factor for screenshot quality | 2026-02-06 |
| #1435 | charlaie | feat: add redirected_status_code to CrawlResult | 2026-02-06 |