Medium Data API: Extract Structured JSON in 2026

Learn how to extract structured Medium data via API using AlterLab's Extract API to get JSON fields like title, author, date, tags, and URL with zero parsing.

TL;DR

To get structured Medium data via API, define a JSON schema for the fields you need (title, author, published_date, tags, url) and POST it to AlterLab's Extract API endpoint. The service returns validated JSON in a single request, handling anti‑bot measures and delivering typed output without any HTML parsing.

Why use Medium data?

Medium hosts a vast repository of technical articles, making it a valuable source for several engineering workflows. Teams building large language models often scrape public tech blogs to diversify training data with real‑world explanations and code snippets. Product analysts use Medium feeds to monitor competitor announcements, emerging frameworks, and developer sentiment for strategic planning. Data engineers also create pipelines that enrich internal knowledge bases with curated external content, improving search relevance and recommendation quality.

What data can you extract?

All article metadata visible on a public Medium page is accessible through structured extraction. The most commonly requested fields for tech‑focused pipelines include:

  • title: The headline of the article as displayed.
  • author: The display name of the writer or publication.
  • published_date: The ISO‑8601 timestamp when the story was posted.
  • tags: Topic tags attached by the author (e.g., "Python", "AI", "Startup").
  • url: The canonical URL of the article, useful for deduplication and linking.

These fields are sufficient for indexing, citation tracking, and trend analysis without needing to process full‑text HTML.
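
As a rough sketch of how these five fields might be carried through a downstream pipeline, the snippet below models one record as a Python dataclass and parses the timestamp into a `datetime`. The `MediumArticle` class and `from_payload` helper are illustrative names chosen here, not part of any AlterLab SDK, and the sample payload is made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MediumArticle:
    """Hypothetical container for the five metadata fields listed above."""
    title: str
    author: str
    published_date: datetime
    url: str
    tags: tuple = ()


def from_payload(payload: dict) -> MediumArticle:
    """Build a MediumArticle from a dict shaped like the extracted JSON."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11
    # onward, so normalise it to an explicit UTC offset first.
    raw_date = payload["published_date"].replace("Z", "+00:00")
    return MediumArticle(
        title=payload["title"],
        author=payload["author"],
        published_date=datetime.fromisoformat(raw_date),
        url=payload["url"],
        tags=tuple(payload.get("tags", [])),  # tags are optional
    )


# Made-up sample record for illustration.
sample = {
    "title": "Introduction to LLMs in 2026",
    "author": "Jane Doe",
    "published_date": "2026-02-14T08:30:00Z",
    "tags": ["Python", "AI"],
    "url": "https://medium.com/@example/introduction-to-llms-2026",
}
article = from_payload(sample)
print(article.title, article.published_date.isoformat())
```

Because the dataclass is frozen, records are hashable, so a set keyed on `url` is one simple way to deduplicate them.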

The extraction approach

Attempting to pull Medium data with raw HTTP requests and HTML parsers leads to brittle pipelines. Medium’s page structure changes frequently, its class names are obfuscated, and anti‑bot mechanisms challenge simple scrapers. Maintaining selectors, handling pagination, and dealing with intermittent blocks consumes engineering effort that could be spent on downstream analysis. A data API abstracts these concerns: you specify the schema you want, the service retrieves the page, applies AI‑guided extraction, validates the output, and returns clean JSON. This approach treats the web as a database, letting you focus on what data means rather than how to get it.

Quick start with AlterLab Extract API

AlterLab’s Extract API accepts a target URL and a JSON schema, then returns the matched data. Below is a minimal Python example that pulls the title, author, and published date from a sample Medium post. See the Extract API docs for full parameter details.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Headline of the article as displayed on the page"
    },
    "author": {
      "type": "string",
      "description": "Display name of the writer or publication"
    },
    "published_date": {
      "type": "string",
      "description": "ISO-8601 timestamp when the story was posted"
    }
  }
}

result = client.extract(
    url="https://medium.com/@example/introduction-to-llms-2026",
    schema=schema,
)
print(result.data)

The equivalent cURL request looks like this:

Bash
curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://medium.com/@example/introduction-to-llms-2026",
    "schema": {"type": "object", "properties": {"title": {"type": "string"}, "author": {"type": "string"}, "published_date": {"type": "string"}}}
  }'

Both snippets produce a JSON payload similar to:

JSON
{
  "title": "Introduction to LLMs in 2026",
  "author": "Jane Doe",
  "published_date": "2026-02-14T08:30:00Z"
}
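
If the SDK is not available, the same request can be assembled with Python's standard library. The sketch below mirrors the cURL call above (endpoint, `X-API-Key` header, and JSON body are taken from it); `build_extract_request` is a helper name invented for this example, and the shape of the response is an assumption, so the sending step is left commented out.

```python
import json
import urllib.request

# Endpoint and header name are copied from the cURL example above.
EXTRACT_ENDPOINT = "https://api.alterlab.io/v1/extract"


def build_extract_request(url: str, schema: dict, api_key: str) -> urllib.request.Request:
    """Assemble (but do not send) a POST request for the extract endpoint."""
    body = json.dumps({"url": url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
    },
}
request = build_extract_request(
    "https://medium.com/@example/introduction-to-llms-2026",
    schema,
    "YOUR_API_KEY",
)
print(request.get_method(), request.full_url)

# Sending is commented out so the sketch runs without network access or a key.
# The response is assumed, not confirmed, to be a JSON document:
# with urllib.request.urlopen(request, timeout=30) as response:
#     payload = json.load(response)
```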

Define your schema

The schema parameter drives the entire extraction process. You declare each desired field with a type (string, number, boolean, array) and an optional description that helps the underlying model locate the correct element on the page. AlterLab validates the returned data against this schema, so every field listed under "required" is present and every returned value conforms to its declared type. If a required field cannot be found, the API returns an error rather than a guessed value, preventing silent data corruption. For the Medium use case, a typical schema might look like:

JSON
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "published_date": {"type": "string", "format": "date-time"},
    "tags": {"type": "array", "items": {"type": "string"}},
    "url": {"type": "string", "format": "uri"}
  },
  "required": ["title", "author", "published_date", "url"]
}

By supplying this schema to the extract endpoint, you receive a typed JSON object ready for direct insertion into a data warehouse or feature store.
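
Before loading results into a warehouse, it can be worth re-checking each payload on the client side. The sketch below is a deliberately small checker covering only the schema features used above (`required`, `type`, array `items`, and the `date-time` format). It is not a full JSON Schema implementation and is independent of whatever validation AlterLab performs server-side; `check_payload` and `matches_type` are names invented for this example.

```python
from datetime import datetime

# JSON Schema type names mapped to the Python types json.loads() produces.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def matches_type(value, type_name):
    """True if value has the Python type for the given JSON Schema type name."""
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True  # unknown or absent type: nothing to check
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool subclasses int, so reject it for "number"
    return isinstance(value, expected)


def check_payload(payload, schema):
    """Return a list of problems; an empty list means the payload passed."""
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in payload
    ]
    for name, rules in schema.get("properties", {}).items():
        if name not in payload:
            continue
        value = payload[name]
        if not matches_type(value, rules.get("type")):
            problems.append(f"{name}: expected {rules.get('type')}")
        elif rules.get("type") == "array" and "items" in rules:
            item_type = rules["items"].get("type")
            if not all(matches_type(item, item_type) for item in value):
                problems.append(f"{name}: expected items of type {item_type}")
        elif rules.get("format") == "date-time":
            try:
                datetime.fromisoformat(value.replace("Z", "+00:00"))
            except ValueError:
                problems.append(f"{name}: not an ISO-8601 date-time")
    return problems


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string", "format": "date-time"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "url": {"type": "string", "format": "uri"},
    },
    "required": ["title", "author", "published_date", "url"],
}
good = {
    "title": "Introduction to LLMs in 2026",
    "author": "Jane Doe",
    "published_date": "2026-02-14T08:30:00Z",
    "tags": ["Python", "AI"],
    "url": "https://medium.com/@example/introduction-to-llms-2026",
}
bad = {"title": "Untitled", "published_date": "yesterday", "tags": "Python", "url": "x"}

print(check_payload(good, schema))
print(check_payload(bad, schema))
```

For production use, a complete validator library is the safer choice; the point here is only to show what "conforms to the schema" means for these fields.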

Handle pagination and scale

When extracting dozens or thousands of Medium articles, efficiency matters. AlterLab supports high‑volume workloads through asynchronous job submission and built‑in rate‑limit handling. You can batch many extract requests into a single API call using the jobs endpoint, or parallelize calls with asyncio in Python. The following example demonstrates fetching a list of article URLs concurrently:

Python
import asyncio
import alterlab

async def extract_one(client, url, schema):
    return await client.extract(url=url, schema=schema)

async def main():
    client = alterlab.AsyncClient("YOUR_API_KEY")
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "published_date": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
            "url": {"type": "string"}
        }
    }
    urls = [
        "https://medium.com/@example/introduction-to-llms-2026",
        "https://medium.com/@example/second-post",
        "https://medium.com/@example/third-post",
    ]  # Placeholder article URLs; in practice, generate this list from a sitemap or search API
    tasks = [extract_one(client, u, schema) for u in urls]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r.data)

asyncio.run(main())

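This pattern scales to thousands of URLs, provided you cap the number of in‑flight requests so that you stay within AlterLab’s concurrency limits; an unbounded asyncio.gather starts every request at once. For cost estimates, visit the pricing page; you pay only for successful extractions, with volume discounts available at higher tiers.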
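
One common way to bound concurrency is an `asyncio.Semaphore`. The sketch below shows the idea with a stand-in coroutine, `fake_extract`, that sleeps instead of calling any API, so it runs without a key or network access. The limit of 5 is an arbitrary illustration, not a documented AlterLab value, and `extract_all` is a name invented for this example.

```python
import asyncio


async def fake_extract(url: str) -> dict:
    """Stand-in for a real extract call; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return {"url": url}


async def extract_all(urls, extract, max_in_flight=5):
    """Run extract(url) for every URL with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        # Each task waits here until one of the max_in_flight slots is free.
        async with semaphore:
            return await extract(url)

    # return_exceptions=True keeps one failed URL from discarding the other results;
    # results come back in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)


# Placeholder URLs for illustration only.
urls = [f"https://medium.com/@example/post-{i}" for i in range(20)]
results = asyncio.run(extract_all(urls, fake_extract, max_in_flight=5))
failures = [r for r in results if isinstance(r, Exception)]
print(f"{len(results) - len(failures)} succeeded, {len(failures)} failed")
```

In a real pipeline, `fake_extract` would be replaced by a coroutine that wraps the client call, and failed URLs could be collected from `failures` for a retry pass.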

Key takeaways

  • Structured data extraction replaces fragile HTML parsing with a schema‑driven, AI‑powered API.
  • Medium’s public article metadata (title, author, date, tags, URL) maps cleanly to JSON fields.
  • AlterLab’s Extract API handles anti‑bot measures, validation, and scaling so you can focus on analytics.
  • Start with a simple schema, test on a single URL, then expand to batch or async workflows for production pipelines.
  • Always review Medium’s robots.txt and Terms of Service before scraping public data.
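
The robots.txt review in the last point can be partly automated with Python's standard `urllib.robotparser`. The sketch below parses a rules string and asks whether a given URL may be fetched. The rules shown are made up for illustration and are not Medium's actual robots.txt, and the `my-pipeline` user agent and `is_allowed` helper are likewise hypothetical. A programmatic check does not replace reading the Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; these are NOT Medium's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-pipeline", "https://medium.com/@example/some-post"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-pipeline", "https://medium.com/private/drafts"))
```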

99.2% extraction accuracy · 1.4 s average response time · 100% typed JSON output

Frequently Asked Questions

Does Medium provide an official API for article metadata?
Medium offers limited official APIs focused on user actions and publishing; they do not provide unrestricted access to article metadata for third‑party pipelines. AlterLab fills this gap by enabling structured JSON extraction from publicly available Medium pages while respecting robots.txt and rate limits.

What data can you extract from Medium?
You can extract any publicly visible field such as title, author, published date, tags, and URL by defining a JSON schema. AlterLab returns typed, validated JSON that matches your schema, eliminating the need for custom parsers.

How is AlterLab priced?
AlterLab uses a pay‑as‑you‑go model with no minimums; you pay only for successful extractions. Credits never expire, and detailed pricing is available on the pricing page.