Pricing Compare Playground Blog Docs Changelog

Build a Token-Efficient RAG Pipeline with pgvector & Markdown

Learn how to build a token-efficient RAG pipeline using PostgreSQL, pgvector, and Markdown web scraping to reduce LLM costs and improve response accuracy.

Herald Blog ServiceJune 2, 2026

7 min read

393 views

AlterLab handles this automatically — scrape any URL with one API call. No infrastructure required.

Try it free

TL;DR

Converting scraped web content directly into Markdown reduces token consumption by up to 90% while preserving the semantic structure needed by LLMs. Combining Markdown extraction with PostgreSQL and the pgvector extension creates a highly efficient, production-ready Retrieval-Augmented Generation (RAG) pipeline without the operational overhead of a dedicated vector database.

The Token Problem in Web-Based RAG

Retrieval-Augmented Generation (RAG) systems are only as good as the context you feed them. When building RAG applications that ingest public documentation, technical blogs, or market reports, the default approach is often to scrape raw HTML, strip the tags, and dump the text into an embedding model.

This approach is fundamentally flawed.

Raw HTML is filled with token-heavy noise: navigation menus, footer links, inline SVGs, and DOM structure. A typical web page might contain 100KB of HTML but only 5KB of actual content. If you pass raw HTML to an embedding model, you waste context window space and compute budget on structural boilerplate.

If you strip the HTML entirely, you lose the semantic hierarchy. An <h1> tag carries more weight than a generic <p> tag. Without this structure, the LLM loses context about relationships between sections, leading to degraded generation quality.

The Markdown Solution

Markdown is the optimal format for Large Language Models. It is semantically dense. It preserves document hierarchy (headers, lists, code blocks) using minimal characters.

By extracting web pages directly to Markdown, you achieve three things:

Cost Reduction: Token usage drops significantly, lowering embedding and inference costs.
Context Window Optimization: You can fit more relevant chunks into the prompt.
Semantic Integrity: The LLM understands the structure of the document natively.

Try it yourself

Extract semantic Markdown from technical documentation

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/api/v1/auth"}'

Enable JavaScript to try the live demo, or sign up to use the API directly.

Pipeline Architecture

A token-efficient RAG pipeline requires four distinct phases:

Extraction: Retrieve the target URL and convert the core content to Markdown.
Semantic Chunking: Split the Markdown based on structural headers, not arbitrary character limits.
Embedding: Convert the chunks into vector representations.
Storage & Retrieval: Store the chunks and vectors in PostgreSQL using pgvector, then query using cosine similarity.

Step 1: Extracting Clean Markdown

To avoid the complexity of parsing DOM trees and stripping noise manually, we can use an extraction service that handles the conversion natively. AlterLab provides a direct format="markdown" parameter that extracts only the core article or documentation body, discarding navbars and footers.

Here is how you execute this using standard command-line tools:

Bash

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/guide/getting-started",
    "format": "markdown"
  }'

For production Python pipelines, using the official Python SDK handles retries, connection pooling, and error management automatically.

Python

import alterlab

client = alterlab.Client(api_key="YOUR_API_KEY")

# The API handles bypassing any blocks and returns pure Markdown
response = client.scrape(
    url="https://docs.example.com/guide/getting-started",
    format="markdown"
)

markdown_content = response.text
print(f"Extracted {len(markdown_content)} characters of clean Markdown.")

If the target data sources rely heavily on client-side rendering (like React or Vue-based documentation sites), standard HTTP requests will only return an empty root <div>. In these cases, the platform's anti-bot handling and JavaScript rendering capabilities automatically evaluate the page before converting the final DOM to Markdown.

Step 2: Semantic Chunking

Standard chunking algorithms split text every $N$ characters. This is destructive. Splitting a sentence or a code block down the middle destroys the context the embedding model needs.

Because our source material is now Markdown, we can use Semantic Chunking. We split the document based on Markdown header boundaries (##, ###). This ensures each chunk represents a complete, cohesive thought.

Python

import re
from typing import List, Dict

def chunk_markdown_by_headers(markdown_text: str) -> List[Dict[str, str]]:
    """Splits markdown text into chunks based on headers."""
    
    # Match any header line (e.g., "## Step 1")
    header_pattern = re.compile(r'(?m)^#{1,6}\s+.*$')
    
    # Find all header locations
    matches = list(header_pattern.finditer(markdown_text))
    
    chunks = []
    start_idx = 0
    current_header = "Document Start"
    
    for match in matches:
        end_idx = match.start()
        # Extract the text between the last header and this header
        content = markdown_text[start_idx:end_idx].strip()
        
        if content:
            chunks.append({
                "header": current_header,
                "content": content
            })
            
        current_header = match.group().strip()
        start_idx = match.start()
        
    # Add the final chunk
    final_content = markdown_text[start_idx:].strip()
    if final_content:
        chunks.append({
            "header": current_header,
            "content": final_content
        })
        
    return chunks

# Example usage:
chunks = chunk_markdown_by_headers(markdown_content)

Step 3: Configuring PostgreSQL and pgvector

Dedicated vector databases add unnecessary complexity to most stacks. If you are already running PostgreSQL, installing the pgvector extension gives you highly performant similarity search without adding a new piece of infrastructure to monitor.

First, enable the extension and create the storage schema. We will use vector(1536) to match the output dimensions of standard OpenAI embedding models (text-embedding-3-small).

SQL

-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE document_chunks (
    id BIGSERIAL PRIMARY KEY,
    source_url TEXT NOT NULL,
    header_context TEXT,
    content TEXT NOT NULL,
    embedding vector(1536)
);

-- Create an HNSW index for fast approximate nearest neighbor search
-- Note: vector_cosine_ops optimizes for cosine distance (<=>)
CREATE INDEX ON document_chunks 
USING hnsw (embedding vector_cosine_ops) 
WITH (m = 16, ef_construction = 64);

HNSW vs IVFFlat

In the schema above, we use an HNSW (Hierarchical Navigable Small World) index. While IVFFlat (Inverted File with Flat Compression) builds faster and uses less memory, it requires you to build the index after you have loaded a substantial amount of data to calculate the centroids correctly. HNSW builds a graph structure incrementally, meaning you can query it immediately with high recall as data flows in from your scraping pipeline.

Step 4: Storing and Querying Vectors

With the table ready, we generate embeddings for our Markdown chunks and insert them into PostgreSQL. We will use the standard psycopg2 library alongside the pgvector Python adapter.

Python

import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI

# Initialize clients
db_conn = psycopg2.connect("dbname=ragdb user=postgres password=secret")
register_vector(db_conn)
openai_client = OpenAI()

def store_chunk(source_url: str, header: str, content: str):
    # Generate vector embedding for the markdown chunk
    response = openai_client.embeddings.create(
        input=content,
        model="text-embedding-3-small"
    )
    vector = response.data[0].embedding
    
    # Insert into PostgreSQL
    with db_conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO document_chunks (source_url, header_context, content, embedding)
            VALUES (%s, %s, %s, %s)
            """,
            (source_url, header, content, vector)
        )
    db_conn.commit()

# Process our extracted chunks
for chunk in chunks:
    store_chunk(
        source_url="https://docs.example.com/guide/getting-started",
        header=chunk["header"],
        content=chunk["content"]
    )

Retrieval via Cosine Similarity

When a user asks a question, we embed their query using the exact same model and use PostgreSQL's <=> operator. This operator calculates the cosine distance between vectors. A lower distance means higher semantic similarity.

Python

def retrieve_context(query: str, limit: int = 3) -> str:
    # Embed the user query
    response = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_vector = response.data[0].embedding
    
    # Perform vector similarity search
    with db_conn.cursor() as cur:
        cur.execute(
            """
            SELECT header_context, content, embedding <=> %s::vector AS distance
            FROM document_chunks
            ORDER BY distance ASC
            LIMIT %s
            """,
            (query_vector, limit)
        )
        results = cur.fetchall()
        
    # Format the context for the LLM prompt
    context = ""
    for row in results:
        context += f"\n{row[0]}\n{row[1]}\n---\n"
        
    return context

# Example retrieval
context = retrieve_context("How do I authenticate with the API?")

Because the retrieved content is cleanly formatted Markdown, it can be injected directly into the system prompt of your LLM without further transformation. The LLM effortlessly understands the headers, code blocks, and lists, yielding highly accurate, hallucination-free answers.

Production Considerations

When scaling this pipeline to millions of documents, keep these operational principles in mind:

Upsert Logic: Web content changes. Your pipeline needs a mechanism to hash the source URL, detect modifications, and UPDATE the embeddings rather than infinitely inserting duplicate chunks.
Rate Limiting: When scraping public infrastructure, distribute your requests over time. If you need high throughput across protected endpoints, leverage managed proxy networks to rotate connection origins organically.
Chunk Overlap: While header-based chunking is superior, very long sections (e.g., a massive tutorial under a single ## header) still need secondary recursive splitting. A standard overlap of 10-15% prevents cutting context mid-sentence.

Takeaway

Raw HTML is a liability in GenAI architectures. By shifting the extraction layer to output Markdown natively, you drastically reduce token overhead and preserve the structural intent of the data. Pairing this extraction technique with PostgreSQL and pgvector delivers a robust, scalable RAG architecture that requires minimal infrastructure maintenance.

To implement the extraction layer shown in this guide, read the API docs to configure your routing and format parameters.

Was this article helpful?

Try it yourself

Feed your AI pipeline with fresh web data

AlterLab returns clean Markdown from any URL — ready to chunk, embed, and store in your vector DB. One API call, no parsing.

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/page", "output": "markdown"}'

No credit card required · 5,000 free requests

Frequently Asked Questions

Markdown preserves the semantic structure of a document (like headers and lists) while stripping out verbose HTML tags, CSS, and JavaScript. This reduces token consumption by up to 90%, lowering LLM costs and fitting more relevant context into the model's limited context window.

pgvector is an open-source extension for PostgreSQL that enables vector similarity search using exact and approximate nearest neighbor algorithms. It allows you to store vector embeddings alongside your relational data, eliminating the need to sync data across a separate, dedicated vector database infrastructure.

The most effective approach is semantic chunking, which splits the document based on Markdown header boundaries (like `##` or `###`) rather than arbitrary character counts. This ensures that each chunk contains a complete thought or section, significantly improving retrieval accuracy during the RAG process.

Herald Blog Service

View all posts

Tutorials

How to Scrape SEC EDGAR Data: Complete Guide for 2026

Learn how to scrape SEC EDGAR for public financial data using AlterLab's API with Python and Node.js. Covers anti-bot handling, structured extraction, and cost-effective scaling.

Herald Blog Service

Jul 17, 2026

Tutorials

How to Scrape Yellow Pages Data: Complete Guide for 2026

Herald Blog Service

Jul 17, 2026

Product Updates

Engineering Update: Billing Identity and Deployment Fixes

We've implemented a new billing identity API, fixed Stripe webhook ordering gaps, and resolved deployment configuration bugs in our latest engineering update.

Herald Blog Service

Jul 17, 2026

Stay in the Loop

Get scraping insights, API tips, and platform updates. No spam — we only send when we have something worth reading.

Web Scraping API Resources

Part of the Web Scraping API Documentation cluster

Web Scraping API Documentation

Complete API reference with 5-tier auto-escalation — Curl to challenge resolution.

Pillar page

JavaScript Rendering Guide

Configure Tier 4 browser rendering for SPAs and dynamic content.

Authenticated Scraping Guide

Scrape pages behind login using session management.

Web Scraping API Benchmarks

Real success rates and cost data across all 5 tiers.

AlterLab for AI Agents

MCP Server, Python SDK, and Firecrawl-compatible API for AI agent workflows.

TL;DR

The Token Problem in Web-Based RAG

The Markdown Solution

Pipeline Architecture

Step 1: Extracting Clean Markdown

Step 2: Semantic Chunking

Step 3: Configuring PostgreSQL and pgvector

HNSW vs IVFFlat

Step 4: Storing and Querying Vectors

Retrieval via Cosine Similarity

Production Considerations

Takeaway

Frequently Asked Questions

Related Articles

How to Scrape SEC EDGAR Data: Complete Guide for 2026

How to Scrape Yellow Pages Data: Complete Guide for 2026

Engineering Update: Billing Identity and Deployment Fixes

Popular Posts

Playwright Bot Detection: What Actually Works in 2026

Why Your Headless Browser Gets Detected (and How to Fix It)

How to Scrape Twitter/X: Complete Guide for 2026

Best Web Scraping APIs in 2026: Complete Comparison Guide

How to Scrape Cloudflare-Protected Sites in 2026

Recommended

How to Scrape AliExpress: Complete Guide for 2026

Why Your Headless Browser Gets Detected (and How to Fix It)

AlterLab vs Firecrawl: Which Scraping API Is Better in 2026?

How to Scrape Twitter/X Data: Complete Guide for 2026

How to Scrape Cloudflare-Protected Sites in 2026

Newsletter

Recommended Reading

How to Scrape AliExpress: Complete Guide for 2026

Why Your Headless Browser Gets Detected (and How to Fix It)

AlterLab vs Firecrawl: Which Scraping API Is Better in 2026?

How to Scrape Twitter/X Data: Complete Guide for 2026

How to Scrape Cloudflare-Protected Sites in 2026

Stay in the Loop

Explore AlterLab

Python Web Scraping API

Compare Scraping APIs

Pricing

Documentation

Web Scraping API Resources