feat(schema): improve HTML preprocessing for schema generation

Add a preprocess_html_for_schema utility function that cleans HTML before
schema generation. It replaces the previous optimize_html function in
GoogleSearchCrawler and adds smarter attribute handling and pattern detection.
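
As a rough illustration of the kind of cleaning described above, here is a minimal
standard-library sketch. It is not the code added in this commit: the name
preprocess_html_sketch, the tag and attribute sets, and the text_limit parameter
are all assumptions chosen for the example, and it does not attempt the pattern
detection mentioned in the message.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical settings; the real preprocess_html_for_schema may use different ones.
DROP_TAGS = {"script", "style", "noscript", "svg"}   # removed with their content
KEEP_ATTRS = {"id", "class", "href", "src", "name", "type"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _SchemaCleaner(HTMLParser):
    def __init__(self, text_limit):
        super().__init__(convert_charrefs=True)
        self.text_limit = text_limit
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Keep only attributes that help locate elements; drop styling and handlers.
        kept = "".join(
            f' {k}="{escape(v or "", quote=True)}"'
            for k, v in attrs
            if k in KEEP_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if len(text) > self.text_limit:
            text = text[: self.text_limit] + "..."
        if text:
            self.out.append(escape(text, quote=False))


def preprocess_html_sketch(html, text_limit=100):
    """Return a reduced copy of `html` (illustrative only)."""
    cleaner = _SchemaCleaner(text_limit)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


sample = (
    '<div class="item" style="color:red" data-x="1">'
    "<script>var a=1;</script>"
    '<a href="/p" onclick="x()">Hello   world</a></div>'
)
print(preprocess_html_sketch(sample))
# prints: <div class="item"><a href="/p">Hello world</a></div>
```

The sketch keeps structural attributes such as class and href because those are
what a generated schema would typically select on, and it discards scripts, inline
styles, and event handlers as noise.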

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
UncleCode
2025-03-12 22:40:46 +08:00
parent 1630fbdafe
commit dc36997a08
8 changed files with 134 additions and 12 deletions


@@ -42,7 +42,9 @@ dependencies = [
     "pyperclip>=1.8.2",
     "faust-cchardet>=2.1.19",
     "aiohttp>=3.11.11",
-    "humanize>=4.10.0"
+    "humanize>=4.10.0",
+    "zstandard>=0.23.0",
+    "msgpack>=1.1.0"
 ]
 classifiers = [
     "Development Status :: 4 - Beta",