protocol

Character Encoding

Character encoding defines how text characters are represented as bytes; mismatched encoding between server and parser causes garbled (mojibake) text extraction.

Every web page's text must be encoded as bytes for transmission. The encoding specifies which byte sequence represents each character. UTF-8, which can represent every Unicode character, is the dominant encoding on the modern web, but legacy sites still use ISO-8859-1 (Latin-1), Windows-1252, or region-specific encodings (GBK for Chinese, Shift-JIS for Japanese).
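To make the byte-level difference concrete, here is a small sketch that uses only Python's built-in codecs. The helper name `encode_hex` and the sample characters are illustrative choices, not part of any library.

```python
def encode_hex(char: str, encoding: str) -> str:
    """Return the bytes for `char` in `encoding` as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in char.encode(encoding))


# (character, encodings to compare); samples chosen for illustration only
samples = [
    ("é", ["utf-8", "iso-8859-1", "cp1252"]),
    ("中", ["utf-8", "gbk"]),
    ("あ", ["utf-8", "shift_jis"]),
]

for char, encodings in samples:
    for encoding in encodings:
        print(f"{char!r} in {encoding}: {encode_hex(char, encoding)}")

# ISO-8859-1 and Windows-1252 overlap heavily but are not identical:
# the euro sign has a byte in Windows-1252 and none in ISO-8859-1.
print(f"'€' in cp1252: {encode_hex('€', 'cp1252')}")
try:
    "€".encode("iso-8859-1")
except UnicodeEncodeError:
    print("'€' cannot be encoded in iso-8859-1")
```

The same character maps to different bytes depending on the encoding, so a parser that assumes the wrong encoding reads those bytes as different characters.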

The encoding can be declared in three places: the HTTP `Content-Type` header (`Content-Type: text/html; charset=utf-8`), the HTML `<meta charset='utf-8'>` tag, or the XML declaration (`<?xml version="1.0" encoding="utf-8"?>`). Scrapers must detect and apply the correct encoding before parsing text. Decoding with the wrong one produces mojibake: garbled sequences such as `Ã©` in place of `é`, which is what the UTF-8 bytes for `é` look like when read as Windows-1252.
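As a rough sketch of how a scraper might look up a declared encoding, the following uses only the standard library. The function names `charset_from_header` and `declared_encoding`, the regular expressions, and the 1024-byte scan window are assumptions made for illustration. The order shown (HTTP header first, then in-document declarations) is one reasonable choice, and real HTML parsers apply more detailed rules.

```python
import re
from email.message import Message

# Hypothetical patterns for this sketch; they cover common cases only.
XML_DECL_RE = re.compile(
    rb'^\s*<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9_.:-]+)["\']', re.IGNORECASE
)
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def declared_encoding(content_type, body):
    """Return the declared encoding name, or None if nothing is declared."""
    header_charset = charset_from_header(content_type)
    if header_charset:
        return header_charset
    head = body[:1024]  # declarations normally appear near the start
    for pattern in (XML_DECL_RE, META_CHARSET_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


body = "<meta charset='utf-8'><p>café</p>".encode("utf-8")
encoding = declared_encoding("text/html", body) or "utf-8"
print(body.decode(encoding))   # decoded with the declared encoding
print(body.decode("cp1252"))   # decoded with the wrong encoding: mojibake
```

The last two lines decode the same bytes twice: once with the declared encoding, and once with Windows-1252 to show the garbled result.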

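Python's `chardet` and `charset-normalizer` libraries can detect encoding heuristically when it is not declared. The `requests` library takes the encoding from the `Content-Type` header. When a `text/*` response carries no charset, it assumes ISO-8859-1 rather than inspecting the body, which is a common source of mojibake on UTF-8 pages. Only when no encoding can be derived from the headers does `response.text` fall back to `apparent_encoding`, the guess produced by chardet or charset-normalizer. BeautifulSoup also handles encoding detection internally when given raw bytes.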
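The detection libraries use statistical heuristics. The sketch below is a much simpler stand-in, not their algorithm: it relies on the fact that most non-UTF-8 byte sequences fail strict UTF-8 decoding, so it tries UTF-8 first and falls back to Windows-1252. The function name `decode_best_effort` and the candidate list are assumptions for illustration.

```python
def decode_best_effort(data: bytes, candidates=("utf-8-sig", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used).

    "utf-8-sig" decodes UTF-8 and also strips a leading byte order mark.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going, marking bad bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8"


print(decode_best_effort("café".encode("utf-8")))
print(decode_best_effort("café".encode("cp1252")))
```

A fallback chain like this cannot tell single-byte encodings apart (almost any byte sequence is valid Windows-1252), which is the gap the heuristic libraries aim to fill.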

Examples

import requests

response = requests.get("https://example.com")
# By default, requests takes the encoding from the Content-Type header.
# Override it when you know the header is wrong or missing:
response.encoding = "utf-8"
text = response.text  # decoded as UTF-8

# Or use the encoding guessed from the response body:
response.encoding = response.apparent_encoding
text = response.text
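
A common refinement is to override the encoding only when the server did not declare one, since `requests` assumes ISO-8859-1 for `text/*` responses that lack a charset. The sketch below shows that check. The helper name `fix_encoding` is hypothetical, and a `SimpleNamespace` stands in for a `requests.Response` so the example runs without network access.

```python
from types import SimpleNamespace


def fix_encoding(response):
    """Prefer the body-based guess when the header declares no charset."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower() and response.apparent_encoding:
        response.encoding = response.apparent_encoding
    return response.encoding


# Stand-in for a requests.Response (assumed attributes: headers, encoding,
# apparent_encoding), mimicking a text/html reply with no declared charset.
fake = SimpleNamespace(
    headers={"Content-Type": "text/html"},
    encoding="ISO-8859-1",      # the default assumed for text/* without charset
    apparent_encoding="utf-8",  # the guess made from the response bytes
)
print(fix_encoding(fake))
```

With a real response, call `fix_encoding(response)` before reading `response.text`; a charset declared by the server is left untouched.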


    Character Encoding — Web Scraping Glossary | AlterLab