Enhanced Scraping Reliability
This release improves scraping accuracy and SDK functionality with fixes for key handling, content extraction, and media processing. Users will experience more reliable results when working with complex websites and lazy-loaded content.
Improvements
SDK configuration validation
The SDK now validates section filter parameters before API requests, catching typos and type errors in min_content_blocks, exclude_content_types, and content_only options. This prevents failed requests due to configuration mistakes and provides immediate feedback during development.
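As a rough illustration of this kind of client-side check, the sketch below validates a section filter before any request is sent. The three key names come from this release; the expected value types and the function name validate_section_filter are assumptions made for the example, not the SDK's actual implementation.

```python
# Key names are taken from the release notes; the expected types
# are assumptions made for this illustration.
ALLOWED_SECTION_FILTER_KEYS = {
    "min_content_blocks": int,
    "exclude_content_types": list,
    "content_only": bool,
}


def validate_section_filter(section_filter):
    """Raise ValueError on unknown keys or wrongly typed values."""
    if not isinstance(section_filter, dict):
        raise ValueError("section_filter must be a dict")
    for key, value in section_filter.items():
        if key not in ALLOWED_SECTION_FILTER_KEYS:
            allowed = ", ".join(sorted(ALLOWED_SECTION_FILTER_KEYS))
            raise ValueError(f"unknown section_filter key {key!r}; allowed: {allowed}")
        expected = ALLOWED_SECTION_FILTER_KEYS[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if expected is int and isinstance(value, bool):
            raise ValueError(f"{key} must be int, got bool")
        if not isinstance(value, expected):
            raise ValueError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )
    return section_filter
```

With a check like this, a typo such as min_content_block fails immediately with a message listing the allowed keys, instead of surfacing later as a failed request.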
General bug fixes and improvements
Plus 3 internal improvements for better reliability and performance.
Better Shopify carousel detection
Carousel detection on Shopify sites now uses a more reliable two-pass strategy. The system first looks for data-section-type attributes, which Shopify themes use to annotate major sections like slideshows, providing higher confidence in identifying the primary carousel.
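A two-pass lookup of this shape might be sketched as follows. The first pass mirrors the data-section-type check described above. The second pass, a class-name heuristic, is an assumption, since these notes do not describe the fallback. The function name find_primary_carousel is hypothetical.

```python
from html.parser import HTMLParser


class _ElementCollector(HTMLParser):
    """Collects the attributes of every start tag in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; normalize them to "".
        self.elements.append({name: (value or "") for name, value in attrs})


def find_primary_carousel(html):
    """Return (attrs, strategy) for the first matching element, or (None, None)."""
    collector = _ElementCollector()
    collector.feed(html)
    collector.close()

    # Pass 1: explicit theme annotation, the higher-confidence signal.
    for attrs in collector.elements:
        if "slideshow" in attrs.get("data-section-type", "").lower():
            return attrs, "data-section-type"

    # Pass 2 (assumed fallback): class-name hints.
    for attrs in collector.elements:
        classes = attrs.get("class", "").lower()
        if any(hint in classes for hint in ("carousel", "slideshow", "slider")):
            return attrs, "class-heuristic"

    return None, None
```

Running the annotation pass over the whole document first means a product carousel that appears earlier in the markup does not win over an annotated slideshow section.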
Bug Fixes
Carousel hero classification
Hero image classification now correctly identifies only the primary carousel container instead of tagging every image in all carousels. This prevents excessive hero tags on pages with multiple carousels and improves image categorization accuracy.
Text extraction cleanup
Text extraction now automatically removes navigation, header, footer, and aside elements after Readability processing. This eliminates residual boilerplate content that previously appeared in extracted text, delivering cleaner article content and more accurate text analysis.
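The post-Readability cleanup could look roughly like the sketch below, which uses only Python's standard html.parser. It is an illustration under assumptions, not the production extractor: the function name extract_clean_text is hypothetical, and the input is assumed to be the HTML that Readability has already cleaned.

```python
from html.parser import HTMLParser

# Element names listed in the release notes.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside"}


class _TextWithoutBoilerplate(HTMLParser):
    """Collects text while skipping anything nested inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every boilerplate element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_clean_text(cleaned_html):
    """Return the text of cleaned_html with boilerplate elements dropped."""
    parser = _TextWithoutBoilerplate()
    parser.feed(cleaned_html)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counter handles nesting, so a list of links inside a nav element is dropped along with the nav itself.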
Playground button responsiveness
The playground interface now resets properly after API failures, so the Run button no longer gets stuck in a disabled state. The system tracks consecutive poll failures and stops polling after 3 in a row, keeping the interface responsive for subsequent operations.
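The failure-tracking logic can be sketched as a small polling loop. This is illustrative only: the playground's own code is not shown in these notes, and fetch_status, max_polls, and the returned status strings are hypothetical. Only the threshold of 3 consecutive failures comes from the entry above.

```python
MAX_CONSECUTIVE_FAILURES = 3  # threshold stated in the release notes


def poll_until_done(fetch_status, max_polls=50):
    """Poll until the job is done, giving up after 3 consecutive failures.

    fetch_status is a hypothetical callable that returns "done" or
    "pending" and raises an exception when the API call fails.
    """
    consecutive_failures = 0
    for _ in range(max_polls):
        try:
            status = fetch_status()
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                return "failed"
            continue
        consecutive_failures = 0  # a successful poll resets the streak
        if status == "done":
            return "done"
    return "timeout"
```

Because every path returns a terminal result, the caller can re-enable the Run button unconditionally once the function returns.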
Fixed key fetch fallback
The playground now handles keys that have no cached plaintext, such as seeded keys or keys created in prior sessions, preventing silent failures during crawl operations.
Fixed crawl button stuck state
The Run button in the playground no longer remains disabled after consecutive API failures, as the system now properly tracks and resets polling states after multiple errors.
Improved content extraction accuracy
Login and signup patterns no longer strip legitimate standalone content from paywalled sites, preserving important text that was previously removed during extraction.
Optimized scroll parameter handling
The scroll_count parameter is now only included when scroll_to_load is enabled, reducing unnecessary data in API responses and improving request efficiency.
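In practice this is a conditional include when the parameters are assembled. The sketch below assumes they are built into a dictionary payload. The helper name build_scrape_payload and the url field are hypothetical; scroll_to_load and scroll_count are the parameter names from this release.

```python
def build_scrape_payload(url, scroll_to_load=False, scroll_count=None):
    """Build a parameter dict, adding scroll_count only when scrolling is on."""
    payload = {"url": url, "scroll_to_load": scroll_to_load}
    # scroll_count has no meaning without scroll_to_load, so leave it out.
    if scroll_to_load and scroll_count is not None:
        payload["scroll_count"] = scroll_count
    return payload
```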
Enhanced section filter validation
SDKs now validate section_filter parameters against allowed keys and value types, preventing typos and invalid configurations from causing unexpected behavior during scraping operations.
Refined hero image classification
Hero image detection now only applies to the primary carousel container, preventing every image in nested carousels from being incorrectly tagged as hero content.
Removed boilerplate from text extraction
Navigation, header, footer, and aside elements are now stripped from extracted text after Readability cleaning, resulting in cleaner content without residual boilerplate elements.
Improved srcset URL parsing
Commas inside URLs in srcset attributes are now preserved correctly, handling Cloudinary transforms and Shopify query parameters that were previously split incorrectly.
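To see why a plain split on commas breaks these URLs, consider the simplified parser below. It treats a comma as a separator only when it ends a whitespace-delimited URL token or follows a descriptor, which is a reduced form of the approach in the HTML standard's srcset parsing algorithm. It is a sketch, not the scraper's actual code: the function name parse_srcset is hypothetical and the Cloudinary-style URLs in the usage example are illustrative.

```python
import re

_SEPARATORS = re.compile(r"[\s,]*")   # whitespace and commas between candidates
_URL = re.compile(r"\S+")             # a URL is a run of non-whitespace
_DESCRIPTOR = re.compile(r"[^,]*")    # descriptor text runs up to the next comma


def parse_srcset(srcset):
    """Return a list of (url, descriptor) pairs from a srcset string."""
    candidates = []
    pos, end = 0, len(srcset)
    while pos < end:
        pos = _SEPARATORS.match(srcset, pos).end()
        if pos >= end:
            break
        url_match = _URL.match(srcset, pos)
        url, pos = url_match.group(), url_match.end()
        descriptor = ""
        if url.endswith(","):
            # A trailing comma ends the candidate; commas inside the URL stay.
            url = url.rstrip(",")
        else:
            desc_match = _DESCRIPTOR.match(srcset, pos)
            descriptor, pos = desc_match.group().strip(), desc_match.end()
        candidates.append((url, descriptor))
    return candidates
```

For example, parsing "https://res.cloudinary.com/demo/image/upload/w_400,c_fill/sample.jpg 400w, https://res.cloudinary.com/demo/image/upload/w_800,c_fill/sample.jpg 800w" yields two candidates with the w_400,c_fill and w_800,c_fill transform segments intact, where a naive split on commas would produce four fragments.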
Enabled scroll with auto rendering
The scroll_to_load option now works with render_js set to auto, providing more flexibility for scraping sites with dynamic content loading.
Enhanced lazy-loaded image handling
Image source resolution now handles common lazy-loading patterns including data attributes and srcset, improving image extraction from sites using various lazy-loading techniques.
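One way to picture this resolution is as a fallback chain over an image element's attributes. The sketch below is an assumption-laden illustration: the specific attribute names (data-src, data-lazy-src, data-original, data-srcset), their priority order, and the function name resolve_image_src are not confirmed by these notes.

```python
# Attribute names commonly used by lazy-loading scripts (assumed list).
LAZY_SRC_ATTRS = ("data-src", "data-lazy-src", "data-original")
SRCSET_ATTRS = ("srcset", "data-srcset")


def resolve_image_src(attrs):
    """Pick the most likely real image URL from an <img> attribute dict."""
    src = (attrs.get("src") or "").strip()
    # Inline data: URIs are typically placeholders on lazy-loaded images.
    if src and not src.startswith("data:"):
        return src
    for name in LAZY_SRC_ATTRS:
        value = (attrs.get(name) or "").strip()
        if value:
            return value
    for name in SRCSET_ATTRS:
        value = (attrs.get(name) or "").strip()
        if value:
            # First candidate URL: first whitespace-delimited token,
            # minus any trailing separator comma.
            return value.split()[0].rstrip(",")
    return None
```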
Added section filter to SDKs
Python and Node SDKs now support the section_filter parameter, keeping SDK functionality in sync with the API's SectionFilter model for more precise content extraction.
Plus 5 internal changes for stability and performance.