
Output Format

Output format is the structured file or data format in which scraped results are delivered, such as JSON, CSV, Parquet, NDJSON, or database rows. Each is suited to different downstream consumers.

The choice of output format determines how easily downstream systems can consume scraped data. JSON is the most versatile: it supports nested structures, is human-readable, and can be parsed by the standard library or a widely used package in virtually every modern language. CSV is simpler and works well for flat tabular data that must be opened in spreadsheets or ingested by legacy systems, but it has no native way to express nesting or data types. NDJSON (newline-delimited JSON) suits streaming large datasets because each line is a self-contained JSON object that can be parsed independently of the others.
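To make the trade-offs concrete, here is a minimal sketch that serializes the same records as JSON, NDJSON, and CSV using only the Python standard library. The sample records, the field names, and the dotted-column flattening convention (`price.amount`) are assumptions made for illustration; they are not a prescribed schema.

```python
import csv
import io
import json

# Hypothetical sample records; field names are illustrative only.
records = [
    {"url": "https://example.com/a", "title": "A",
     "price": {"amount": 9.5, "currency": "USD"}},
    {"url": "https://example.com/b", "title": "B",
     "price": {"amount": 12.0, "currency": "EUR"}},
]


def to_json(rows):
    """Serialize all rows as one JSON array; nesting is preserved."""
    return json.dumps(rows, indent=2)


def to_ndjson(rows):
    """Serialize one compact JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def flatten(row, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. price.amount."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_csv(rows):
    """Serialize rows as CSV; nested fields are flattened first."""
    flat_rows = [flatten(row) for row in rows]
    # Union of column names, kept in first-seen order.
    fieldnames = list(dict.fromkeys(k for r in flat_rows for k in r))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(flat_rows)
    return buffer.getvalue()
```

JSON and NDJSON round-trip the nested `price` object unchanged, while CSV needs the flattening step and loses type information: every value is read back as a string.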

For analytics warehouses (BigQuery, Snowflake), Parquet is often the preferred format: its columnar layout and efficient compression produce smaller files than JSON or CSV, and engines that query the files directly can read only the columns a query references, which reduces the data scanned. Apache Avro is used in streaming pipelines (Kafka) where schema evolution and compact binary encoding matter.
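The benefit of a columnar layout can be illustrated without any Parquet tooling. The sketch below is a pure-Python analogy, not actual Parquet: it pivots row-oriented records into one list per column, so an aggregate over a single field touches only that field's values. The rows and field names are hypothetical; writing real Parquet files requires a library such as pyarrow.

```python
# Hypothetical rows; in practice these would be scraped records.
rows = [
    {"sku": "a1", "price": 9.5, "in_stock": True},
    {"sku": "b2", "price": 12.0, "in_stock": False},
    {"sku": "c3", "price": 3.25, "in_stock": True},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of equally long column lists."""
    columns = {}
    for index, row in enumerate(rows):
        for name, value in row.items():
            # A column first seen at a later row is padded with None
            # so that it stays aligned with the other columns.
            columns.setdefault(name, [None] * index).append(value)
        for values in columns.values():
            if len(values) <= index:
                values.append(None)  # field missing from this row
    return columns


columns = to_columns(rows)

# Only the "price" column is read; "sku" and "in_stock" are never touched.
total_price = sum(columns["price"])
```

Storing similar values next to each other is also what lets columnar formats compress well, since a column of prices or booleans is far more repetitive than a sequence of mixed-type rows.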

AlterLab returns scraped data as JSON by default. For bulk exports, NDJSON streaming allows the caller to begin processing records before the full dataset is transmitted, reducing end-to-end latency for large crawl jobs.
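The streaming benefit comes from parsing each line as soon as it arrives. The following is a minimal sketch of a lazy NDJSON consumer; it does not use or describe the AlterLab client API. `iter_ndjson` is a hypothetical helper, and the `io.StringIO` object is a stand-in for whatever line iterator a real HTTP client would expose for a streamed response.

```python
import io
import json
from typing import Any, Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed record per non-blank line, without buffering the rest."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a streamed response body (assumption for illustration).
stream = io.StringIO(
    '{"url": "https://example.com/a", "status": 200}\n'
    "\n"
    '{"url": "https://example.com/b", "status": 404}\n'
)

ok_urls = []
for record in iter_ndjson(stream):
    # Each record can be handled before later lines have been read.
    if record["status"] == 200:
        ok_urls.append(record["url"])
```

Because the generator holds only one line at a time, memory use stays flat regardless of how many records the export contains.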

Examples

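# Writing scraped results in NDJSON format (one JSON object per line)
import json

# Placeholder data; replace with the records your scraper produced.
scraped_records = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]

with open("output.ndjson", "w", encoding="utf-8") as f:
    for record in scraped_records:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading NDJSON: parse each non-blank line as its own JSON document
with open("output.ndjson", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]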