Enhanced Scraping Reliability

This release improves scraping accuracy and SDK functionality, with fixes for key handling, content extraction, and media processing. Results should be more reliable on complex websites and on pages with lazy-loaded content.

Highlights

Carousel hero classification

Hero image classification now correctly identifies only the primary carousel container instead of tagging every image in all carousels. This prevents excessive hero tags on pages with multiple carousels and improves image categorization accuracy.

SDK configuration validation

The SDK now validates section filter parameters before API requests, catching typos and type errors in min_content_blocks, exclude_content_types, and content_only options. This prevents failed requests due to configuration mistakes and provides immediate feedback during development.
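As a rough illustration of this kind of pre-request check, here is a minimal Python sketch. Only the three option names come from these notes; the function name validate_section_filter, the expected value types, and the error messages are assumptions made for the example and may not match the SDK's actual behavior.

```python
# Hypothetical schema: option names are from the release notes,
# the value types are assumed for illustration.
ALLOWED_SECTION_FILTER_KEYS = {
    "min_content_blocks": int,
    "exclude_content_types": list,
    "content_only": bool,
}


def validate_section_filter(section_filter: dict) -> dict:
    """Raise before any request is sent if a key or value type is invalid."""
    if not isinstance(section_filter, dict):
        raise TypeError("section_filter must be a dict")
    for key, value in section_filter.items():
        if key not in ALLOWED_SECTION_FILTER_KEYS:
            allowed = ", ".join(sorted(ALLOWED_SECTION_FILTER_KEYS))
            raise ValueError(f"Unknown section_filter key {key!r}; allowed: {allowed}")
        expected = ALLOWED_SECTION_FILTER_KEYS[key]
        # bool is a subclass of int, so reject it explicitly for int options.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{key} must be int, got bool")
        if not isinstance(value, expected):
            raise TypeError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )
    return section_filter
```

With a check like this, a typo such as min_content_block raises a ValueError locally instead of producing a failed API request.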

Better Shopify carousel detection

Carousel detection on Shopify sites now uses a more reliable two-pass strategy. The first pass looks for data-section-type attributes, which Shopify themes use to annotate major sections such as slideshows; a match gives higher confidence that the element is the primary carousel.
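The two-pass idea can be sketched in Python as follows. This is not the actual detector: the function name find_primary_carousel, the "slideshow" match value, and the whole second pass (a class-name heuristic) are assumptions, since these notes only describe the first pass.

```python
from html.parser import HTMLParser


class _ElementCollector(HTMLParser):
    """Collect the attributes of every start tag in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; normalise them to "".
        self.elements.append({name: (value or "") for name, value in attrs})


def find_primary_carousel(html: str):
    """Return the attributes of the most likely primary carousel, or None."""
    collector = _ElementCollector()
    collector.feed(html)

    # Pass 1: explicit section annotation (the signal named in the notes).
    for attrs in collector.elements:
        if "slideshow" in attrs.get("data-section-type", "").lower():
            return attrs

    # Pass 2: assumed fallback, not described in the notes.
    for attrs in collector.elements:
        classes = attrs.get("class", "").lower()
        if "carousel" in classes or "slideshow" in classes:
            return attrs
    return None
```

Running the annotated pass first means a generic class-name match elsewhere on the page cannot win over an explicitly labelled section.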

Improvements

SDK configuration validation

General bug fixes and improvements

Plus 3 internal improvements for better reliability and performance.

Better Shopify carousel detection

Bug Fixes

Carousel hero classification

Text extraction cleanup

Text extraction now automatically removes navigation, header, footer, and aside elements after Readability processing. This eliminates residual boilerplate content that previously appeared in extracted text, delivering cleaner article content and more accurate text analysis.
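The post-processing step described above might look roughly like this in Python. The sketch uses only the standard library and assumes the input is HTML that has already been through a Readability-style cleaner; the name strip_boilerplate_text is hypothetical and the real pipeline may use a different parser.

```python
from html.parser import HTMLParser

# Element names taken from the release notes.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside"}


class _TextWithoutBoilerplate(HTMLParser):
    """Collect text, skipping anything nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def strip_boilerplate_text(cleaned_html: str) -> str:
    parser = _TextWithoutBoilerplate()
    parser.feed(cleaned_html)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.chunks)
```

A depth counter is used rather than a boolean flag so that nested boilerplate elements, such as a nav inside a header, are skipped in full.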

Playground button responsiveness

The playground interface now resets properly after API failures, so the Run button no longer gets stuck in a disabled state. The playground tracks consecutive poll failures and stops polling after 3 in a row, keeping the interface responsive for subsequent runs.
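The failure-tracking pattern can be sketched as below. The playground's real implementation and language are not stated here, so this is a Python illustration of the idea only; fetch_status, the status values, and the shape of the returned dict are hypothetical.

```python
import time

MAX_CONSECUTIVE_POLL_FAILURES = 3  # threshold stated in the notes


def poll_until_done(fetch_status, interval_seconds=1.0, sleep=time.sleep):
    """Poll until a terminal status arrives or three polls in a row fail.

    fetch_status is a hypothetical callable that returns a dict with a
    "status" key and raises on transport or API errors.
    """
    consecutive_failures = 0
    while True:
        try:
            result = fetch_status()
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_POLL_FAILURES:
                # Give up so the caller can re-enable its controls.
                return {"status": "failed", "reason": "polling gave up"}
        else:
            consecutive_failures = 0  # any successful poll resets the streak
            if result.get("status") in ("completed", "failed"):
                return result
        sleep(interval_seconds)
```

Because the function always returns, a caller can re-enable the Run button unconditionally once it completes, whatever the outcome.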

Fixed key fetch fallback

The playground now handles keys that have no cached plaintext, such as seeded keys or keys created in a prior session, preventing silent failures during crawl operations.

Fixed crawl button stuck state

The Run button in the playground no longer stays disabled after consecutive API failures: polling state is now tracked and reset after repeated errors.

Improved content extraction accuracy

Login and signup pattern matching no longer strips legitimate standalone content from paywalled sites, preserving important text that was previously removed during extraction.

Optimized scroll parameter handling

The scroll_count parameter is now only included when scroll_to_load is enabled, reducing unnecessary data in API responses and improving request efficiency.
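The conditional inclusion amounts to something like the following Python sketch. The helper name build_scroll_options is hypothetical, and the notes do not say exactly where in the request or response pipeline the check happens; the sketch only shows the rule itself.

```python
def build_scroll_options(scroll_to_load: bool, scroll_count=None) -> dict:
    """Include scroll_count only when scroll_to_load is enabled."""
    options = {"scroll_to_load": scroll_to_load}
    if scroll_to_load and scroll_count is not None:
        options["scroll_count"] = scroll_count
    return options
```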

Enhanced section filter validation

SDKs now validate section_filter parameters against allowed keys and value types, preventing typos and invalid configurations from causing unexpected behavior during scraping operations.

Refined hero image classification

Hero image detection now only applies to the primary carousel container, preventing every image in nested carousels from being incorrectly tagged as hero content.

Removed boilerplate from text extraction

Navigation, header, footer, and aside elements are now stripped from extracted text after Readability cleaning, resulting in cleaner content without residual boilerplate elements.

Improved srcset URL parsing

Commas inside URLs in srcset attributes are now preserved correctly, handling Cloudinary transforms and Shopify query parameters that were previously split incorrectly.
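To show why a naive split on commas breaks such URLs, here is a small Python sketch of a comma-safe parser. It loosely follows the HTML Standard's rule that a srcset URL is a run of non-whitespace characters, with simplifications (for example, it ignores parentheses in descriptors); parse_srcset is a hypothetical name and this is not the scraper's actual code.

```python
def parse_srcset(srcset: str):
    """Return (url, descriptor) pairs, keeping commas that are part of a URL."""
    candidates = []
    pos, n = 0, len(srcset)
    while pos < n:
        # Skip whitespace and commas separating candidates.
        while pos < n and (srcset[pos].isspace() or srcset[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # The URL runs to the next whitespace; commas inside it are kept.
        start = pos
        while pos < n and not srcset[pos].isspace():
            pos += 1
        url = srcset[start:pos]
        descriptor = ""
        if url.endswith(","):
            # A trailing comma ends the candidate and is not part of the URL.
            url = url.rstrip(",")
        else:
            # Otherwise the descriptor (e.g. "400w" or "2x") runs to the next comma.
            end = srcset.find(",", pos)
            if end == -1:
                end = n
            descriptor = srcset[pos:end].strip()
            pos = end
        candidates.append((url, descriptor))
    return candidates
```

For a Cloudinary-style value such as ".../upload/w_400,c_fill/a.jpg 400w, .../upload/w_800,c_fill/a.jpg 800w", this yields two candidates with their transform segments intact, where splitting on every comma would produce four fragments.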

Enabled scroll with auto rendering

The scroll_to_load option now works when render_js is set to auto, providing more flexibility for scraping sites with dynamic content loading.

Enhanced lazy-loaded image handling

Image source resolution now handles common lazy-loading patterns including data attributes and srcset, improving image extraction from sites using various lazy-loading techniques.
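A resolution order of this kind can be sketched in Python as follows. The specific attribute names (data-src, data-lazy-src, data-original, data-srcset) are common lazy-loading conventions chosen for illustration; the notes do not list which attributes the scraper actually checks, and resolve_image_src is a hypothetical name.

```python
# Assumed attribute names; the real list is not given in the notes.
LAZY_SRC_ATTRS = ("data-src", "data-lazy-src", "data-original")
LAZY_SRCSET_ATTRS = ("data-srcset", "srcset")


def resolve_image_src(attrs: dict):
    """Pick the most likely real image URL from an img element's attributes."""
    # Lazy-loading data attributes usually hold the real URL.
    for name in LAZY_SRC_ATTRS:
        value = (attrs.get(name) or "").strip()
        if value:
            return value
    src = (attrs.get("src") or "").strip()
    # Inline data: URIs are typically tiny placeholders swapped out on load.
    if src and not src.startswith("data:"):
        return src
    for name in LAZY_SRCSET_ATTRS:
        value = (attrs.get(name) or "").strip()
        if value:
            # Use the URL of the first candidate (text before the first whitespace).
            return value.split()[0].rstrip(",")
    # Fall back to the placeholder if nothing better exists.
    return src or None
```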

Added section filter to SDKs

Python and Node SDKs now support the section_filter parameter, keeping SDK functionality in sync with the API's SectionFilter model for more precise content extraction.

Plus 5 internal changes for stability and performance.