feat(scraping): add LXML-based scraping mode for improved performance
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None
This commit is contained in:
52
scraper_evaluation.json
Normal file
52
scraper_evaluation.json
Normal file
@@ -0,0 +1,52 @@
|
||||
{
|
||||
"original": {
|
||||
"performance": [],
|
||||
"differences": []
|
||||
},
|
||||
"batch": {
|
||||
"performance": [
|
||||
{
|
||||
"case": "basic",
|
||||
"metrics": {
|
||||
"time": 0.8874530792236328,
|
||||
"memory": 98.328125
|
||||
}
|
||||
}
|
||||
],
|
||||
"differences": [
|
||||
{
|
||||
"case": "basic",
|
||||
"differences": {
|
||||
"images_count": {
|
||||
"old": 50,
|
||||
"new": 0,
|
||||
"diff": -50
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
"lxml": {
|
||||
"performance": [
|
||||
{
|
||||
"case": "basic",
|
||||
"metrics": {
|
||||
"time": 1.210719108581543,
|
||||
"memory": 99.921875
|
||||
}
|
||||
}
|
||||
],
|
||||
"differences": [
|
||||
{
|
||||
"case": "basic",
|
||||
"differences": {
|
||||
"images_count": {
|
||||
"old": 50,
|
||||
"new": 0,
|
||||
"diff": -50
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
Reference in New Issue
Block a user