Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately overwritten by the second, simpler definition, so it was pure dead code. Removes 61 lines of unused prompt text. Inspired by PR #931 (stevenaldinger).
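To illustrate why the first definition was dead code: in Python, a second module-level assignment to the same name silently replaces the first, so nothing can ever read the earlier value. The following is a minimal sketch, using only the standard-library `ast` module, of how such duplicates could be spotted. The function name `find_duplicate_assignments` and the sample source are hypothetical and are not part of this commit or of the project.

```python
import ast
from collections import defaultdict
from typing import Dict, List


def find_duplicate_assignments(source: str) -> Dict[str, List[int]]:
    """Return {name: [line numbers]} for module-level names assigned more than once."""
    tree = ast.parse(source)
    seen = defaultdict(list)
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    seen[target.id].append(node.lineno)
    # Keep only names that were assigned at least twice.
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


# Hypothetical stand-in for a prompts module with a shadowed constant.
sample = '''PROMPT = """first version with tags and questions"""
PROMPT = """second, simpler version"""
OTHER = "unrelated"
'''

duplicates = find_duplicate_assignments(sample)
print(duplicates)  # {'PROMPT': [1, 2]}
```

The sketch only looks at plain `name = value` statements at module level. It does not follow conditional assignments or imports, so a hit is a hint to review, not proof that the earlier definition is unused.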
@@ -1,6 +1,6 @@
# PR Review Todolist

> Last updated: 2026-02-01 | Total open PRs: 79
> Last updated: 2026-02-02 | Total open PRs: 95

---

@@ -15,29 +15,33 @@
| ~~#1717~~ | ~~YuriNachos~~ | ~~Fix local sentence-transformers embeddings blocked by OpenAI fallback. (#1658)~~ | **merged** |
| ~~#1714~~ | ~~YuriNachos~~ | ~~Fix: Replace `tf-playwright-stealth` with `playwright-stealth` dependency. (#1553)~~ | **merged** |
| ~~#1667~~ | ~~christian-oudard~~ | ~~Fix `crwl --deep-crawl` only outputting first page. Real CLI bug with tests.~~ | **merged** |
| #1640 | Martichou | Fix memory leak — unused browser contexts never cleaned up under continuous load. (#943) | pending |
| #1622 | zhaoyun006 | Fix redirect target verification in AsyncUrlSeeder and enhance tests. | pending |
| #1592 | jzmiller1 | Fix CDP page leaks and race conditions in concurrent crawling. (#1563) | pending |
| ~~#1640~~ | ~~Martichou~~ | ~~Fix memory leak — unused browser contexts never cleaned up under continuous load. (#943)~~ | **closed** |
| #1622 | Ahmed-Tawfik94 | Fix redirect target verification in AsyncUrlSeeder and enhance tests. | pending |
| #1592 | Ahmed-Tawfik94 | Fix CDP page leaks and race conditions in concurrent crawling. (#1563) | pending |
| #1572 | yuexuan-chen | Fix CDP setting with managed browser. | pending |
| #1450 | prlz77 | Fix LLM extraction fails when content is in alternative response fields. | pending |
| ~~#1364~~ | ~~nnxiong~~ | ~~Fix `<script>` tag removal losing adjacent text in `cleaned_html`.~~ | **merged** |
| #1308 | cjh-GITHUB | Fix css_selector variable type error (assigned to list). | pending |
| #1308 | dominicx | Fix css_selector variable type error (assigned to list). | pending |
| ~~#1296~~ | ~~vladmandic~~ | ~~Fix `VersionManager` ignoring `CRAWL4_AI_BASE_DIRECTORY` env var. 1-line fix.~~ | **merged** |
| ~~#1281~~ | ~~garyluky~~ | ~~Fix proxy auth `ERR_INVALID_AUTH_CREDENTIALS`. Fixes #993, #974, #1109.~~ | **merged** |
| #1234 | hellokayas | Fix TypeError when `keep_data_attributes=False` by ensuring list concat. | pending |
| #1211 | zhangbo-tj | Fix: safely create new page if no page exists in persistent context. | pending |
| #1207 | ninjapanzer | Fix streaming error handling. | pending |
| #1200 | Gyscos | Bugfix browser manager session handling. | pending |
| #1179 | Nuo-55 | Fix leak token when input url as raw html. | pending |
| #1234 | AdarsHH30 | Fix TypeError when `keep_data_attributes=False` by ensuring list concat. | pending |
| #1211 | Praneeth1-O-1 | Fix: safely create new page if no page exists in persistent context. | pending |
| #1207 | moncapitaine | Fix streaming error handling. | pending |
| #1200 | fischerdr | Bugfix browser manager session handling. | pending |
| #1179 | phamngocquy | Fix leak token when input url as raw html. | pending |
| ~~#1150~~ | ~~scris~~ | ~~Fix LLM extraction `response` variable not overridden causing `'str' has no attribute 'choices'`.~~ | **closed (already fixed)** |
| #1133 | Daniel21b | Enforce auth when JWT is enabled. 1-line fix. | pending |
| #1106 | ruoyuGao | Fix: Adapt to CrawlerMonitor constructor change. | pending |
| #1133 | chrizzly2309 | Enforce auth when JWT is enabled. 1-line fix. | pending |
| #1106 | devxpain | Fix: Adapt to CrawlerMonitor constructor change. | pending |
| #1081 | Joorrit | Fix deep crawl scorer logic was inverted — high-distance paths scored higher. | **needs work (commented)** |
| ~~#1077~~ | ~~RoyLeviLangware~~ | ~~Fix bs4 deprecation warning (`text` -> `string`). 1 line.~~ | **merged** |
| #1073 | saipavanmeruga7797 | Fix local HTML file crawling broken when `capture_console_messages=False`. | pending |
| #1065 | dzhao-gearset | Fix: Update deprecated Groq models to recommended replacements. | pending |
| #1059 | wangs1024 | Fix wrong proxy config type in proxy demo example. | pending |
| #1065 | mccullya | Fix: Update deprecated Groq models to recommended replacements. | pending |
| #1059 | Aaron2516 | Fix wrong proxy config type in proxy demo example. | pending |
| #1058 | Aaron2516 | Fix dict-type `proxy_config` not handled properly. (#1057) | pending |
| #983 | umerkhan95 | Fix memory leak and empty responses in streaming mode. (#980) | pending |
| #973 | danyQe | Fix typo of `temperature` in async_configs.py. 1 line. | pending |
| #948 | GeorgeVince | Fix `summarize_page.py` example. | pending |
| #729 | complete-dope | Fix: Logging for Error. 1-line fix. | pending |

## Good Features

@@ -48,44 +52,55 @@
| #1707 | dillonledoux | Add `Crawl-delay` directive support from robots.txt. Good compliance feature. | pending |
| #1706 | vikas-gits-good | Fix `arun_many` not working with `DeepCrawlStrategy`. (#1277) | pending |
| #1702 | YxmMyth | Add CSS background image extraction. (#1691) | pending |
| #1689 | dillonledoux | Docker: optimize concurrency performance and memory management. | pending |
| #1683 | unknown | Implement double config for AdaptiveCrawler. | pending |
| #1689 | mzyfree | Docker: optimize concurrency performance and memory management. | pending |
| #1683 | Vaccarini-Lorenzo | Implement double config for AdaptiveCrawler. | pending |
| #1674 | blentz | Add output pagination/control for MCP endpoints. Useful for LLM context windows. | pending |
| #1668 | microHoffman | Add `--json-ensure-ascii` CLI flag for Unicode handling. Clean, small. | pending |
| #1650 | sathyanarays | Add support for Vertex AI in LLM Extraction Strategy. | pending |
| #1580 | GrumpyLion | Add Azure OpenAI configuration support to crwl config. | pending |
| #1650 | KennyStryker | Add support for Vertex AI in LLM Extraction Strategy. | pending |
| #1580 | arpagon | Add Azure OpenAI configuration support to crwl config. | pending |
| #1463 | TristanDonze | Add configurable `device_scale_factor` for screenshot quality. 3 files, clean. | pending |
| #1450 | prlz77 | Fix LLM extraction fails with alternative response fields. | pending |
| #1435 | charlaie | Add `redirected_status_code` to CrawlResult. 3 files, clean. | pending |
| #1425 | Nisarg38 | Add OpenRouter API support. | pending |
| #1425 | denrusio | Add OpenRouter API support. | pending |
| #1417 | NickMandylas | Add CDP headers support for remote browser auth (AWS Bedrock etc). | pending |
| #1290 | 130347665 | Support type-list pipeline in JsonElementExtraction (multi-step extract). | pending |
| #1255 | itsskofficial | Fix JsonCssSelector to handle adjacent sibling CSS selectors (`+ tr`). | pending |
| #1238 | IgorLeno | Fix ManagedBrowser constructor and Windows encoding issues. | pending |
| #1220 | chineidu | Allow `OPENAI_BASE_URL` to be used to control the base_url for the LLM. | pending |
| #1180 | aravindkarnam | Add CallbackURLFilter for custom URL filtering in deep crawling. | pending |
| #999 | Morriz | Add filters that filter based on regular expressions in deep crawling. | pending |
| #1245 | mukul-atomicwork | Feature: GitHub releases integration. | pending |
| #1238 | yerik515 | Fix ManagedBrowser constructor and Windows encoding issues. | pending |
| #1220 | dcieslak19973 | Allow `OPENAI_BASE_URL` to be used to control the base_url for the LLM. | pending |
| #1180 | kunalmanelkar | Add CallbackURLFilter for custom URL filtering in deep crawling. | pending |
| #999 | loliw | Add filters that filter based on regular expressions in deep crawling. | pending |
| #901 | gbe3hunna | CrawlResult model: add pydantic fields and descriptions. | pending |
| #800 | atomlong | `ensure_ascii=False` for json.dumps to support non-ASCII characters. | pending |
| #799 | atomlong | Allow setting `base_url` for LLM extraction strategy in CLI. | pending |
| #741 | atomlong | Add config option to control Content-Security-Policy header. | pending |
| #723 | alexandreolives | Optional close page after screenshot. | pending |
| #681 | ksallee | JS execution should happen after waiting (reorder in strategy). | pending |

## Quick Doc/Maintenance Merges

| PR | Author | Description | Status |
|----|--------|-------------|--------|
| #1734 | pgoslatara | Update outdated GitHub Actions versions (v4->v6). 2 files. | pending |
| #1722 | YuriNachos | Add missing docstring to MCP `md` endpoint. | pending |
| #1716 | YuriNachos | Fix wrong return types in arun/arun_many docs. | pending |
| #1715 | YuriNachos | Add missing `CacheMode` import in quickstart docs. | pending |
| #1722 | YuriNachos | Add missing docstring to MCP `md` endpoint. | pending |
| #1655 | unknown | Replace Chinese comment with English in nullcontext method. 1 line. | pending |
| #1655 | daviddl9 | Replace Chinese comment with English in nullcontext method. 1 line. | pending |
| #1494 | AkosLukacs | Fix wrong param name in `arun()` docstring. | pending |
| #1488 | AkosLukacs | Fix syntax error in README JSON example. | pending |
| #1483 | unknown | Update README.md with latest docker image. | pending |
| #1416 | unknown | Fix missing bracket in README code block. | pending |
| #1272 | unknown | Fix get title bug in amazon example. | pending |
| #1263 | unknown | Fix: consistent with sdk behavior. | pending |
| #1225 | unknown | Fix docker deployment guide URL. | pending |
| #1223 | unknown | Docs: add links to other language versions of README. | pending |
| #1483 | NiclasLindqvist | Update README.md with latest docker image. | pending |
| #1416 | adityaagre | Fix missing bracket in README code block. | pending |
| #1272 | zhenjunMa | Fix get title bug in amazon example. | pending |
| #1263 | vvanglro | Fix: consistent with sdk behavior. | pending |
| #1225 | albertkim | Fix docker deployment guide URL. | pending |
| #1223 | dowithless | Docs: add links to other language versions of README. | pending |
| #1159 | lbeziaud | Fix cleanup warning when no process on debug port. 1 line. | pending |
| #1098 | unknown | Docs: fix outdated links to Docker guide and release notes. | pending |
| #1093 | unknown | Docs: Fixed incorrect elapsed calculation and output format. | pending |
| #1098 | B-X-Y | Docs: fix outdated links to Docker guide and release notes. | pending |
| #1093 | Aaron2516 | Docs: Fixed incorrect elapsed calculation and output format. | pending |
| #948 | GeorgeVince | Fix `summarize_page.py` example. | pending |
| #931 | stevenaldinger | Remove duplicate variable definition dead code in prompts.py. | pending |
| #967 | prajjwalnag | Update README.md. | pending |
| #671 | SteveAlphaVantage | Update README.md. | pending |
| #605 | mochamadsatria | Fix typo in docker-deployment.md filename. | pending |

## Duplicates (Close These)

@@ -98,6 +113,7 @@
| ~~#1710~~ | ~~#1719~~ | ~~Same script.js packaging fix~~ **closed** |
| #1478 | #1715 | Same quickstart CacheMode fix |
| #1465 | #1715 | Same quickstart example fix |
| #800 | #1668 | Overlaps with `--json-ensure-ascii` feature |

## Skip / Close

@@ -115,20 +131,22 @@
| #1547 | mziv | lxml update — touches 100 files (lockfile). Needs careful review. |
| #1395 | granolacowboy | "Feature/interactive wizard" — no description. |
| #1408 | PATAKAMURIVENKATAGANESH | "Basic Health Check Endpoint" — no description filled. |
| #1533 | unknown | Add Claude Code GitHub Workflow — CI workflow, not core. |
| #1274 | unknown | Devcontainer support — 913 additions, dev tooling. |
| #1420 | unknown | Opt-in telemetry system — 3,208 additions. Too large/sensitive. |
| #1497 | unknown | Firecrawl backend support — 191 additions, niche integration. |
| #1496 | unknown | normalize_url refactor — 869 additions, too large for URL normalization. |
| #1518 | unknown | Docker PDF strategy — 324 additions, Docker-specific. |
| #1413 | unknown | Full scan update — 290 additions, unclear scope. |
| #1373 | unknown | MCP server endpoint fixes — 753 additions, large. |
| #1212 | unknown | Stateless streamable_http transport for MCP — 154 additions. |
| #1157 | unknown | Content change detection — 229 additions, feature scope unclear. |
| #1140 | unknown | Prompt-driven recursive crawler script — 268 additions, not core. |
| #1124 | unknown | VNC streaming support — 98 additions, niche. |
| #1068 | unknown | Playground enhancement — 158 additions, separate feature. |
| #1083 | unknown | Provider base url feature — 40 additions, overlaps with #1220. |
| #1533 | unclecode | Add Claude Code GitHub Workflow — CI workflow, not core. |
| #1274 | Fiser12 | Devcontainer support — 913 additions, dev tooling. |
| #1420 | ntohidi | Opt-in telemetry system — 3,208 additions. Too large/sensitive. |
| #1497 | Akeemkabiru | Firecrawl backend support — 191 additions, niche integration. |
| #1496 | Ahmed-Tawfik94 | normalize_url refactor — 869 additions, too large for URL normalization. |
| #1518 | YorelN | Docker PDF strategy — 324 additions, Docker-specific. |
| #1413 | GarfieldTheOldCat | Full scan update — 290 additions, unclear scope. |
| #1373 | ywatanabe1989 | MCP server endpoint fixes — 753 additions, large. |
| #1212 | ACakshay | Stateless streamable_http transport for MCP — 154 additions. |
| #1157 | yesidc | Content change detection — 229 additions, feature scope unclear. |
| #1140 | tmocky1134 | Prompt-driven recursive crawler script — 268 additions, not core. |
| #1124 | unclecode | VNC streaming support — 98 additions, niche. |
| #1068 | jeremygiberson | Playground enhancement — 158 additions, separate feature. |
| #1083 | Sacristaan | Provider base url feature — 40 additions, overlaps with #1220. |
| #865 | janbuchar | Apify Actor sponsorship — 4,384 additions, external integration. |
| #680 | lassedrud | 79,791 additions, Jupyter notebook for Legat4me. Not core. |

---

@@ -8,67 +8,6 @@ And here is the cleaned HTML content of that webpage:

Your task is to break down this HTML content into semantically relevant blocks, and for each block, generate a JSON object with the following keys:

- index: an integer representing the index of the block in the content
- tags: a list of semantic tags that are relevant to the content of the block
- content: a list of strings containing the text content of the block
- questions: a list of 3 questions that a user may ask about the content in this block

To generate the JSON objects:

1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it into separate blocks.

2. For each block:
a. Assign it an index based on its order in the content.
b. Analyze the content and generate a list of relevant semantic tags that describe what the block is about.
c. Extract the text content, clean it up if needed, and store it as a list of strings in the "content" field.
d. Come up with 3 questions that a user might ask about this specific block of content, based on the tags and content. The questions should be relevant and answerable by the content in the block.

3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

4. Double-check that each JSON object includes all required keys (index, tags, content, questions) and that the values are in the expected format (integer, list of strings, etc.).

5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

6. Make sure to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.

Please provide your output within <blocks> tags, like this:

<blocks>
[{
"index": 0,
"tags": ["introduction", "overview"],
"content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."],
"questions": [
"What is the main topic of this article?",
"What can I expect to learn from reading this article?",
"Is this article suitable for beginners or experts in the field?"
]
},
{
"index": 1,
"tags": ["history", "background"],
"content": ["This is the second paragraph, which delves into the history and background of the topic.",
"It provides context and sets the stage for the rest of the article."],
"questions": [
"What historical events led to the development of this topic?",
"How has the understanding of this topic evolved over time?",
"What are some key milestones in the history of this topic?"
]
}]
</blocks>

Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""

PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
<url>{URL}</url>

And here is the cleaned HTML content of that webpage:
<html>
{HTML}
</html>

Your task is to break down this HTML content into semantically relevant blocks, and for each block, generate a JSON object with the following keys:

- index: an integer representing the index of the block in the content
- content: a list of strings containing the text content of the block
