This update adds automatic filtering of utility URLs, such as robots.txt and sitemap.xml, to the URL seeding process, since these URLs are not useful for content crawling. The class has been enhanced with a new parameter, , which is enabled by default. Skipping these URLs reduces the number of irrelevant URLs the crawler has to process.
Significant modifications include:
- Added parameter to in .
- Implemented logic in to detect and filter out utility ("nonsense") URLs during the seeding process in .
- Updated documentation to reflect the new filtering feature and provide examples of its usage in .
With this change, the URL seeder identifies and excludes non-content URLs during seeding, before they are passed on for crawling.
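
For illustration only, the filtering idea can be sketched as below. This is a minimal sketch, not the code in this change: the names `UTILITY_BASENAMES`, `is_utility_url`, `filter_seed_urls`, and `filter_utility` are hypothetical, and the set of utility file names is assumed from the two examples mentioned above.

```python
from urllib.parse import urlparse

# Assumed set of utility file names; the actual change may cover more patterns.
UTILITY_BASENAMES = {"robots.txt", "sitemap.xml"}


def is_utility_url(url: str) -> bool:
    """Return True if the last path segment of the URL is a known utility file."""
    path = urlparse(url).path  # query string and fragment are not part of the path
    basename = path.rstrip("/").rsplit("/", 1)[-1].lower()
    return basename in UTILITY_BASENAMES


def filter_seed_urls(urls, filter_utility=True):
    """Drop utility URLs from the seed list unless filtering is turned off.

    `filter_utility` is a hypothetical stand-in for the new parameter,
    mirroring its enabled-by-default behavior.
    """
    if not filter_utility:
        return list(urls)
    return [u for u in urls if not is_utility_url(u)]


seeds = [
    "https://example.com/robots.txt",
    "https://example.com/blog/post",
    "https://example.com/sitemap.xml?page=2",
    "https://example.com/",
]
content_urls = filter_seed_urls(seeds)
all_urls = filter_seed_urls(seeds, filter_utility=False)
```

With the default setting, only the blog post and the site root remain in `content_urls`; passing `filter_utility=False` returns the seed list unchanged, which corresponds to the previous behavior.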
BREAKING CHANGE: The now filters utility URLs by default. Callers that rely on the previous unfiltered behavior must explicitly set the parameter to disable filtering.
Related issues: #123