Data Lake

A data lake is a centralised storage repository that holds raw scraped data in its native format at any scale, deferring schema definition and transformation to query time.

A data lake stores structured, semi-structured, and unstructured data in its original format (raw HTML, JSON responses, CSV files, images) in object storage such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. Unlike a data warehouse, a data lake imposes no schema at write time: data is written quickly and cheaply, and a schema is applied only when the data is queried (schema-on-read).
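As a rough illustration of schema-on-read, the sketch below keeps raw JSON records exactly as they arrived and applies a schema only when they are read. The record contents, the `read_with_schema` helper, and the `price_schema` mapping are hypothetical names chosen for this example, not part of any particular lake tooling.

```python
import json

# Raw records stored as scraped: fields vary and nothing is validated on write.
raw_lines = [
    '{"url": "https://example.com/a", "price": "19.99", "title": "Widget"}',
    '{"url": "https://example.com/b", "title": "Gadget", "extra": {"sku": "G-1"}}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only declared fields and cast them."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        yield row


# The schema belongs to the query, not to the stored data (hypothetical example).
price_schema = {"url": str, "price": float}
rows = list(read_with_schema(raw_lines, price_schema))
```

A different query could read the same lines with a different schema, for example `{"title": str}`, without rewriting anything that was stored.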

For scraping pipelines, a data lake serves as the archive layer: every scraped page or API response is written to the lake before transformation, preserving the original data for reprocessing if extraction logic changes. The lake also serves as the source for downstream transformation jobs (Spark, dbt, BigQuery) that produce structured tables.
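The archive-then-reprocess pattern can be sketched as follows. This is a minimal example under stated assumptions: a local temporary directory stands in for object storage, and the key layout (`raw/host=.../date=.../<hash>.html`) and the helpers `raw_key`, `archive`, and `reprocess` are hypothetical conventions chosen for illustration.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def raw_key(url, fetched_at):
    """Build a key for a raw page (hypothetical layout: host, fetch date, URL hash)."""
    host = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"raw/host={host}/date={fetched_at:%Y-%m-%d}/{digest}.html"


def archive(lake_root, url, body, fetched_at):
    """Write the untouched response body to the lake before any transformation."""
    path = Path(lake_root) / raw_key(url, fetched_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path


def reprocess(lake_root, extractor):
    """Re-run an extractor over every archived page, e.g. after logic changes."""
    return [extractor(p.read_bytes()) for p in sorted(Path(lake_root).rglob("*.html"))]


# Local directory standing in for a bucket (assumption for this sketch).
lake = tempfile.mkdtemp()
fetched = datetime(2024, 1, 15, tzinfo=timezone.utc)
page = b"<html><title>Widget</title></html>"
saved = archive(lake, "https://example.com/item/1", page, fetched)

# A deliberately naive extractor; a real pipeline would use an HTML parser.
titles = reprocess(lake, lambda b: b.split(b"<title>")[1].split(b"</title>")[0].decode())
```

Because the stored bytes are never modified, a new extractor can be passed to `reprocess` later without fetching the pages again.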

Data lakes pair well with scraping because scraping is high-volume and produces heterogeneous data. Storing raw HTML in S3 at about $0.023 per GB per month (the S3 Standard list price in US East for the first 50 TB at the time of writing) is far cheaper than parsing it and storing it in a relational database, and the raw data can be re-mined with different extraction logic months later.
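To put the storage price in perspective, here is a back-of-the-envelope calculation. The page count and average page size are hypothetical inputs, and the sketch assumes decimal gigabytes, no compression, and no request or transfer charges, so treat the result as an order-of-magnitude estimate only.

```python
PRICE_PER_GB_MONTH = 0.023  # USD, the S3 figure quoted in the text


def monthly_storage_cost(pages, avg_page_bytes, price_per_gb=PRICE_PER_GB_MONTH):
    """Estimate monthly storage cost for raw pages (decimal GB, uncompressed)."""
    total_gb = pages * avg_page_bytes / 1e9
    return total_gb * price_per_gb


# Hypothetical workload: 10 million pages averaging 100 KB each.
cost = monthly_storage_cost(pages=10_000_000, avg_page_bytes=100_000)
print(f"Estimated storage cost: ${cost:.2f} per month")
```

Under these assumptions, 10 million pages amount to about 1 TB of raw HTML, or roughly $23 per month. Compressing pages before upload would typically lower that further.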


    Data Lake — Web Scraping Glossary | AlterLab