feat(scraping): add LXML-based scraping mode for improved performance

Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents.

Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency

BREAKING CHANGE: None
2025-01-12 20:46:23 +08:00
parent 825c78a048
commit f3ae5a657c
12 changed files with 1366 additions and 509 deletions
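
For orientation, the mode switch described in the commit message could look roughly like the sketch below. Only the enum name `ScrapingMode` and its members `BEAUTIFULSOUP` and `LXML` come from the commit message. The member values, the `resolve_mode` helper, and the two placeholder scrape functions are assumptions made for illustration and are not taken from the actual change.

```python
from enum import Enum
from typing import Callable, Dict, Union


class ScrapingMode(Enum):
    # Member names come from the commit message; the string values are assumed.
    BEAUTIFULSOUP = "beautifulsoup"
    LXML = "lxml"


def resolve_mode(mode: Union[str, ScrapingMode]) -> ScrapingMode:
    """Hypothetical helper: accept an enum member or a case-insensitive string."""
    if isinstance(mode, ScrapingMode):
        return mode
    return ScrapingMode(mode.strip().lower())


def _scrape_with_bs4(html: str) -> str:
    # Placeholder standing in for the BeautifulSoup-based strategy.
    return f"bs4:{len(html)}"


def _scrape_with_lxml(html: str) -> str:
    # Placeholder standing in for the LXML-based strategy.
    return f"lxml:{len(html)}"


# Dispatch table: one callable per mode, so adding a mode means adding one entry.
_STRATEGIES: Dict[ScrapingMode, Callable[[str], str]] = {
    ScrapingMode.BEAUTIFULSOUP: _scrape_with_bs4,
    ScrapingMode.LXML: _scrape_with_lxml,
}


def scrape(html: str, mode: Union[str, ScrapingMode] = ScrapingMode.BEAUTIFULSOUP) -> str:
    """Route the HTML to the strategy selected by `mode`."""
    return _STRATEGIES[resolve_mode(mode)](html)
```

Keeping the selection in a dispatch table means callers depend only on the enum, not on which parser library backs each mode.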
--- a/scraper_equivalence_results.json
+++ b/scraper_equivalence_results.json
@@ -0,0 +1,16 @@
+{
+  "tests": [
+    {
+      "case": "complicated_exclude_all_links",
+      "lxml_mode": {
+        "differences": {},
+        "execution_time": 0.0019578933715820312
+      },
+      "original_time": 0.0059909820556640625
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0
+  }
+}
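
As a rough way to read the results file above, the following sketch computes the per-case speedup of LXML mode over the original scraper. The field names match the JSON in this diff. The `speedups` function name is hypothetical, and it assumes both timings are wall-clock durations in the same unit.

```python
import json

# Inline copy of the results added in this diff, so the sketch is self-contained.
RESULTS_JSON = """
{
  "tests": [
    {
      "case": "complicated_exclude_all_links",
      "lxml_mode": {
        "differences": {},
        "execution_time": 0.0019578933715820312
      },
      "original_time": 0.0059909820556640625
    }
  ],
  "summary": {"passed": 1, "failed": 0}
}
"""


def speedups(results: dict) -> dict:
    """Map each test case to original_time / lxml execution_time."""
    return {
        test["case"]: test["original_time"] / test["lxml_mode"]["execution_time"]
        for test in results["tests"]
    }


if __name__ == "__main__":
    for case, ratio in speedups(json.loads(RESULTS_JSON)).items():
        print(f"{case}: {ratio:.2f}x")
```

For the single recorded case the ratio comes out at roughly 3x on a sub-10-millisecond run, so this file alone does not exercise the large-document range mentioned in the commit message.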