Files
crawl4ai/crawler/prompts.py
2024-05-09 19:10:25 +08:00

110 lines
5.1 KiB
Python

PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
<url>{URL}</url>
And here is the cleaned HTML content of that webpage:
<html>
{HTML}
</html>
Your task is to break down this HTML content into semantically relevant blocks, and for each block, generate a JSON object with the following keys:
- index: an integer representing the index of the block in the content
- tags: a list of semantic tags that are relevant to the content of the block
- content: a list of strings containing the text content of the block
- questions: a list of 3 questions that a user may ask about the content in this block
To generate the JSON objects:
1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it into separate blocks.
2. For each block:
a. Assign it an index based on its order in the content.
b. Analyze the content and generate a list of relevant semantic tags that describe what the block is about.
c. Extract the text content, clean it up if needed, and store it as a list of strings in the "content" field.
d. Come up with 3 questions that a user might ask about this specific block of content, based on the tags and content. The questions should be relevant and answerable by the content in the block.
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
4. Double-check that each JSON object includes all required keys (index, tags, content, questions) and that the values are in the expected format (integer, list of strings, etc.).
5. Make sure the generated JSON is complete and parsable, with no errors or omissions.
6. Make sur to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
Please provide your output within <blocks> tags, like this:
<blocks>
[{
"index": 0,
"tags": ["introduction", "overview"],
"content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."],
"questions": [
"What is the main topic of this article?",
"What can I expect to learn from reading this article?",
"Is this article suitable for beginners or experts in the field?"
]
},
{
"index": 1,
"tags": ["history", "background"],
"content": ["This is the second paragraph, which delves into the history and background of the topic.",
"It provides context and sets the stage for the rest of the article."],
"questions": [
"What historical events led to the development of this topic?",
"How has the understanding of this topic evolved over time?",
"What are some key milestones in the history of this topic?"
]
}]
</blocks>
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
<url>{URL}</url>
And here is the cleaned HTML content of that webpage:
<html>
{HTML}
</html>
Your task is to break down this HTML content into semantically relevant blocks, and for each block, generate a JSON object with the following keys:
- index: an integer representing the index of the block in the content
- content: a list of strings containing the text content of the block
To generate the JSON objects:
1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it into separate blocks.
2. For each block:
a. Assign it an index based on its order in the content.
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
4. Double-check that each JSON object includes all required keys (index, tag, content) and that the values are in the expected format (integer, list of strings, etc.).
5. Make sure the generated JSON is complete and parsable, with no errors or omissions.
6. Make sur to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
7. Never alter the extracted content, just copy and paste it as it is.
Please provide your output within <blocks> tags, like this:
<blocks>
[{
"index": 0,
"tags": ["introduction"],
"content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."]
},
{
"index": 1,
"tags": ["background"],
"content": ["This is the second paragraph, which delves into the history and background of the topic.",
"It provides context and sets the stage for the rest of the article."]
}]
</blocks>
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""