feat: add comprehensive type definitions and improve test coverage

Add new type definitions file with extensive Union type aliases for all core components including AsyncUrlSeeder, SeedingConfig, and various crawler strategies. Enhance test coverage with improved bot detection tests, Docker-based testing, and extended features validation. The changes provide better type safety and more robust testing infrastructure for the crawling framework.
AHMET YILMAZ
2025-10-13 18:49:01 +08:00
parent 201843a204
commit 8cca9704eb
21 changed files with 2626 additions and 704 deletions

@@ -779,6 +779,144 @@ async def test_stream_crawl(token: str = None): # Made token optional
# asyncio.run(test_stream_crawl())
```
#### LLM Job with Chunking Strategy
```python
import requests
import time

# Example: LLM extraction with RegexChunking strategy
# This breaks large documents into smaller chunks before LLM processing
llm_job_payload = {
    "url": "https://example.com/long-article",
    "q": "Extract all key points and main ideas from this article",
    "chunking_strategy": {
        "type": "RegexChunking",
        "params": {
            "patterns": ["\\n\\n"],  # Split on double newlines (paragraphs)
            "overlap": 50
        }
    }
}

# Submit LLM job
response = requests.post(
    "http://localhost:11235/llm/job",
    json=llm_job_payload
)

if response.ok:
    job_data = response.json()
    job_id = job_data["task_id"]
    print(f"Job submitted successfully. Job ID: {job_id}")

    # Poll for completion
    while True:
        status_response = requests.get(f"http://localhost:11235/llm/job/{job_id}")
        if status_response.ok:
            status_data = status_response.json()
            if status_data["status"] == "completed":
                print("Job completed!")
                print("Extracted content:", status_data["result"])
                break
            elif status_data["status"] == "failed":
                print("Job failed:", status_data.get("error"))
                break
            else:
                print(f"Job status: {status_data['status']}")
                time.sleep(2)  # Wait 2 seconds before checking again
        else:
            print(f"Error checking job status: {status_response.text}")
            break
else:
    print(f"Error submitting job: {response.text}")
```
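
The polling loop above runs until the job reports `completed` or `failed`, so a stalled job would block the caller indefinitely. A minimal sketch of a bounded variant follows. The names `wait_for_job` and `fetch_status`, and the `timeout` and `interval` parameters, are hypothetical and introduced only for illustration; they are not part of the Crawl4AI API. The status fetcher is passed in as a callable so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout: float = 120.0,
    interval: float = 2.0,
) -> dict:
    """Poll fetch_status() until the job finishes or the timeout elapses.

    fetch_status is a hypothetical callable that returns the decoded JSON
    body of the job-status endpoint (a dict with at least a "status" key).
    """
    deadline = time.monotonic() + timeout
    while True:
        status_data = fetch_status()
        # "completed" and "failed" are the terminal states used in the example above
        if status_data.get("status") in ("completed", "failed"):
            return status_data
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still '{status_data.get('status')}' after {timeout}s"
            )
        time.sleep(interval)
```

With `requests`, the callable could be `lambda: requests.get(f"http://localhost:11235/llm/job/{job_id}").json()`, reusing the endpoint from the example above.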
**Available Chunking Strategies:**
- **IdentityChunking**: Returns the entire content as a single chunk (no splitting)
  ```json
  {
    "type": "IdentityChunking",
    "params": {}
  }
  ```
- **RegexChunking**: Split content using regular expression patterns
  ```json
  {
    "type": "RegexChunking",
    "params": {
      "patterns": ["\\n\\n"]
    }
  }
  ```
- **NlpSentenceChunking**: Split content into sentences using NLP (requires NLTK)
  ```json
  {
    "type": "NlpSentenceChunking",
    "params": {}
  }
  ```
- **TopicSegmentationChunking**: Segment content into topics using TextTiling (requires NLTK)
  ```json
  {
    "type": "TopicSegmentationChunking",
    "params": {
      "num_keywords": 3
    }
  }
  ```
- **FixedLengthWordChunking**: Split into fixed-length word chunks
  ```json
  {
    "type": "FixedLengthWordChunking",
    "params": {
      "chunk_size": 100
    }
  }
  ```
- **SlidingWindowChunking**: Overlapping word chunks with configurable step size
  ```json
  {
    "type": "SlidingWindowChunking",
    "params": {
      "window_size": 100,
      "step": 50
    }
  }
  ```
- **OverlappingWindowChunking**: Fixed-size chunks with word overlap
  ```json
  {
    "type": "OverlappingWindowChunking",
    "params": {
      "window_size": 1000,
      "overlap": 100
    }
  }
  ```
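
To make the window parameters concrete, the sketch below shows one plausible reading of `window_size`/`step` and `window_size`/`overlap` applied to a list of words. It is an illustration written for this document, not the library's implementation; the function names are hypothetical, and the real strategies may handle edge cases differently.

```python
def sliding_window(words: list[str], window_size: int, step: int) -> list[str]:
    """Emit a window of window_size words, advancing `step` words each time."""
    if len(words) <= window_size:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[start:start + window_size]))
    return chunks


def overlapping_window(words: list[str], window_size: int, overlap: int) -> list[str]:
    """Emit consecutive windows that share `overlap` words with the previous one."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # the last window already reached the end of the text
        start += window_size - overlap
    return chunks


words = "a b c d e f g h".split()
print(sliding_window(words, window_size=4, step=2))
# ['a b c d', 'c d e f', 'e f g h']
print(overlapping_window(words, window_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h']
```

In this reading, `step` controls how far each window advances, while `overlap` controls how many words adjacent windows share; a `step` of `window_size - overlap` would make the two formulations advance by the same amount.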
**Notes:**
- `chunking_strategy` is optional - if omitted, default token-based chunking is used
- Chunking is applied at the API level without modifying the core SDK
- Results from all chunks are merged into a single response
- Each chunk is processed independently with the same LLM instruction
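
The last two notes can be pictured as a simple map-and-concatenate loop. The sketch below is an assumption about the general shape of that flow, not the server's code: `extract_over_chunks` and `call_llm` are hypothetical names, the merge is assumed to be plain concatenation, and the `chunk_index` field is added here only to show where each item came from.

```python
from typing import Callable


def extract_over_chunks(
    chunks: list[str],
    instruction: str,
    call_llm: Callable[[str, str], list[dict]],
) -> list[dict]:
    """Apply the same instruction to every chunk and concatenate the results."""
    merged: list[dict] = []
    for index, chunk in enumerate(chunks):
        # Each chunk is sent on its own; no context is shared between calls
        for item in call_llm(instruction, chunk):
            merged.append({**item, "chunk_index": index})
    return merged


def fake_llm(instruction: str, chunk: str) -> list[dict]:
    # Stand-in for a real model call: returns one "key point" per chunk
    return [{"point": chunk.upper()}]


result = extract_over_chunks(
    ["first part", "second part"], "Extract key points", fake_llm
)
print(result)
# [{'point': 'FIRST PART', 'chunk_index': 0}, {'point': 'SECOND PART', 'chunk_index': 1}]
```

Because chunks are processed independently, information that spans a chunk boundary may be missed, which is one reason to prefer a strategy with overlap for long documents.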
---
## Metrics & Monitoring