
Build a Token-Efficient RAG Pipeline with pgvector & Markdown
Learn how to build a token-efficient RAG pipeline using PostgreSQL, pgvector, and Markdown web scraping to reduce LLM costs and improve response accuracy.
June 2, 2026
TL;DR
Converting scraped web content directly into Markdown reduces token consumption by up to 90% while preserving the semantic structure needed by LLMs. Combining Markdown extraction with PostgreSQL and the pgvector extension creates a highly efficient, production-ready Retrieval-Augmented Generation (RAG) pipeline without the operational overhead of a dedicated vector database.
The Token Problem in Web-Based RAG
Retrieval-Augmented Generation (RAG) systems are only as good as the context you feed them. When building RAG applications that ingest public documentation, technical blogs, or market reports, the default approach is often to scrape raw HTML, strip the tags, and dump the text into an embedding model.
This approach is fundamentally flawed.
Raw HTML is filled with token-heavy noise: navigation menus, footer links, inline SVGs, and DOM structure. A typical web page might contain 100KB of HTML but only 5KB of actual content. If you pass raw HTML to an embedding model, you waste context window space and compute budget on structural boilerplate.
If you strip the HTML entirely, you lose the semantic hierarchy. An <h1> tag carries more weight than a generic <p> tag. Without this structure, the LLM loses context about relationships between sections, leading to degraded generation quality.
The Markdown Solution
Markdown is the optimal format for Large Language Models. It is semantically dense. It preserves document hierarchy (headers, lists, code blocks) using minimal characters.
By extracting web pages directly to Markdown, you achieve three things:
- Cost Reduction: Token usage drops significantly, lowering embedding and inference costs.
- Context Window Optimization: You can fit more relevant chunks into the prompt.
- Semantic Integrity: The LLM understands the structure of the document natively.
Extract semantic Markdown from technical documentation
Pipeline Architecture
A token-efficient RAG pipeline requires four distinct phases:
- Extraction: Retrieve the target URL and convert the core content to Markdown.
- Semantic Chunking: Split the Markdown based on structural headers, not arbitrary character limits.
- Embedding: Convert the chunks into vector representations.
- Storage & Retrieval: Store the chunks and vectors in PostgreSQL using pgvector, then query using cosine similarity.
Step 1: Extracting Clean Markdown
To avoid the complexity of parsing DOM trees and stripping noise manually, we can use an extraction service that handles the conversion natively. AlterLab provides a direct format="markdown" parameter that extracts only the core article or documentation body, discarding navbars and footers.
Here is how you execute this using standard command-line tools:
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com/guide/getting-started",
"format": "markdown"
}'For production Python pipelines, using the official Python SDK handles retries, connection pooling, and error management automatically.
import alterlab
client = alterlab.Client(api_key="YOUR_API_KEY")
# The API handles bypassing any blocks and returns pure Markdown
response = client.scrape(
url="https://docs.example.com/guide/getting-started",
format="markdown"
)
markdown_content = response.text
print(f"Extracted {len(markdown_content)} characters of clean Markdown.")If the target data sources rely heavily on client-side rendering (like React or Vue-based documentation sites), standard HTTP requests will only return an empty root <div>. In these cases, the platform's anti-bot handling and JavaScript rendering capabilities automatically evaluate the page before converting the final DOM to Markdown.
Step 2: Semantic Chunking
Standard chunking algorithms split text every $N$ characters. This is destructive. Splitting a sentence or a code block down the middle destroys the context the embedding model needs.
Because our source material is now Markdown, we can use Semantic Chunking. We split the document based on Markdown header boundaries (##, ###). This ensures each chunk represents a complete, cohesive thought.
import re
from typing import List, Dict
def chunk_markdown_by_headers(markdown_text: str) -> List[Dict[str, str]]:
"""Splits markdown text into chunks based on headers."""
# Match any header line (e.g., "## Step 1")
header_pattern = re.compile(r'(?m)^#{1,6}\s+.*$')
# Find all header locations
matches = list(header_pattern.finditer(markdown_text))
chunks = []
start_idx = 0
current_header = "Document Start"
for match in matches:
end_idx = match.start()
# Extract the text between the last header and this header
content = markdown_text[start_idx:end_idx].strip()
if content:
chunks.append({
"header": current_header,
"content": content
})
current_header = match.group().strip()
start_idx = match.start()
# Add the final chunk
final_content = markdown_text[start_idx:].strip()
if final_content:
chunks.append({
"header": current_header,
"content": final_content
})
return chunks
# Example usage:
chunks = chunk_markdown_by_headers(markdown_content)Step 3: Configuring PostgreSQL and pgvector
Dedicated vector databases add unnecessary complexity to most stacks. If you are already running PostgreSQL, installing the pgvector extension gives you highly performant similarity search without adding a new piece of infrastructure to monitor.
First, enable the extension and create the storage schema. We will use vector(1536) to match the output dimensions of standard OpenAI embedding models (text-embedding-3-small).
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create the documents table
CREATE TABLE document_chunks (
id BIGSERIAL PRIMARY KEY,
source_url TEXT NOT NULL,
header_context TEXT,
content TEXT NOT NULL,
embedding vector(1536)
);
-- Create an HNSW index for fast approximate nearest neighbor search
-- Note: vector_cosine_ops optimizes for cosine distance (<=>)
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);HNSW vs IVFFlat
In the schema above, we use an HNSW (Hierarchical Navigable Small World) index. While IVFFlat (Inverted File with Flat Compression) builds faster and uses less memory, it requires you to build the index after you have loaded a substantial amount of data to calculate the centroids correctly. HNSW builds a graph structure incrementally, meaning you can query it immediately with high recall as data flows in from your scraping pipeline.
Step 4: Storing and Querying Vectors
With the table ready, we generate embeddings for our Markdown chunks and insert them into PostgreSQL. We will use the standard psycopg2 library alongside the pgvector Python adapter.
import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI
# Initialize clients
db_conn = psycopg2.connect("dbname=ragdb user=postgres password=secret")
register_vector(db_conn)
openai_client = OpenAI()
def store_chunk(source_url: str, header: str, content: str):
# Generate vector embedding for the markdown chunk
response = openai_client.embeddings.create(
input=content,
model="text-embedding-3-small"
)
vector = response.data[0].embedding
# Insert into PostgreSQL
with db_conn.cursor() as cur:
cur.execute(
"""
INSERT INTO document_chunks (source_url, header_context, content, embedding)
VALUES (%s, %s, %s, %s)
""",
(source_url, header, content, vector)
)
db_conn.commit()
# Process our extracted chunks
for chunk in chunks:
store_chunk(
source_url="https://docs.example.com/guide/getting-started",
header=chunk["header"],
content=chunk["content"]
)Retrieval via Cosine Similarity
When a user asks a question, we embed their query using the exact same model and use PostgreSQL's <=> operator. This operator calculates the cosine distance between vectors. A lower distance means higher semantic similarity.
def retrieve_context(query: str, limit: int = 3) -> str:
# Embed the user query
response = openai_client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_vector = response.data[0].embedding
# Perform vector similarity search
with db_conn.cursor() as cur:
cur.execute(
"""
SELECT header_context, content, embedding <=> %s::vector AS distance
FROM document_chunks
ORDER BY distance ASC
LIMIT %s
""",
(query_vector, limit)
)
results = cur.fetchall()
# Format the context for the LLM prompt
context = ""
for row in results:
context += f"\n{row[0]}\n{row[1]}\n---\n"
return context
# Example retrieval
context = retrieve_context("How do I authenticate with the API?")Because the retrieved content is cleanly formatted Markdown, it can be injected directly into the system prompt of your LLM without further transformation. The LLM effortlessly understands the headers, code blocks, and lists, yielding highly accurate, hallucination-free answers.
Production Considerations
When scaling this pipeline to millions of documents, keep these operational principles in mind:
- Upsert Logic: Web content changes. Your pipeline needs a mechanism to hash the source URL, detect modifications, and
UPDATEthe embeddings rather than infinitely inserting duplicate chunks. - Rate Limiting: When scraping public infrastructure, distribute your requests over time. If you need high throughput across protected endpoints, leverage managed proxy networks to rotate connection origins organically.
- Chunk Overlap: While header-based chunking is superior, very long sections (e.g., a massive tutorial under a single
##header) still need secondary recursive splitting. A standard overlap of 10-15% prevents cutting context mid-sentence.
Takeaway
Raw HTML is a liability in GenAI architectures. By shifting the extraction layer to output Markdown natively, you drastically reduce token overhead and preserve the structural intent of the data. Pairing this extraction technique with PostgreSQL and pgvector delivers a robust, scalable RAG architecture that requires minimal infrastructure maintenance.
To implement the extraction layer shown in this guide, read the API docs to configure your routing and format parameters.
Was this article helpful?
Frequently Asked Questions
Related Articles
Popular Posts
Recommended
Newsletter
Scraping insights and API tips. No spam.
Recommended Reading

How to Scrape Amazon in 2026: Engineering Guide

How to Scrape AliExpress: Complete Guide for 2026

Why Your Headless Browser Gets Detected (and How to Fix It)

How to Scrape Indeed: Complete Guide for 2026

How to Scrape Twitter/X Data: Complete Guide for 2026
Stay in the Loop
Get scraping insights, API tips, and platform updates. No spam — we only send when we have something worth reading.


