Every web page's text must be encoded as bytes for transmission. The encoding specifies which byte sequence represents each character. UTF-8, which can represent every Unicode character, is the dominant encoding on the modern web, but legacy sites still use ISO-8859-1 (Latin-1), Windows-1252, or region-specific encodings such as GBK for Chinese and Shift-JIS for Japanese.
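
To make the byte-level difference concrete, the short sketch below encodes a few sample characters under several of the encodings named above and prints the resulting bytes in hex. The helper name `encoded_hex` and the sample characters are illustrative choices, not part of any library.

```python
# Character/encoding pairs chosen for illustration only.
samples = [
    ("é", "utf-8"),
    ("é", "iso-8859-1"),
    ("é", "cp1252"),      # Python's codec name for Windows-1252
    ("€", "cp1252"),
    ("中", "utf-8"),
    ("中", "gbk"),
    ("日", "shift_jis"),
]


def encoded_hex(char: str, encoding: str) -> str:
    """Return the bytes that represent `char` under `encoding`, as hex."""
    return char.encode(encoding).hex(" ")


for char, encoding in samples:
    print(f"{char!r} in {encoding:<10} -> {encoded_hex(char, encoding)}")
```

The same character takes one byte in a single-byte legacy encoding and two or three bytes in UTF-8, which is why a decoder has to know the encoding before it can recover the text.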

The encoding can be declared in three places: the HTTP `Content-Type` header (`Content-Type: text/html; charset=utf-8`), the HTML `<meta charset="utf-8">` tag, or the XML declaration (`<?xml version="1.0" encoding="utf-8"?>`). Scrapers must detect and apply the correct encoding before parsing text. Decoding with the wrong encoding produces mojibake: garbled sequences such as `Ã©` where `é` was intended.
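
As a rough illustration of where a declared encoding can be found, the sketch below checks the `Content-Type` header first, then an XML declaration, then a `<meta>` tag in the first kilobyte of the body. The function name `declared_charset`, the regular expressions, and the 1024-byte scan window are assumptions made for this example; a real HTML parser handles many more edge cases. The last lines reproduce the mojibake case by decoding UTF-8 bytes as Windows-1252.

```python
import re
from email.message import Message
from typing import Optional

# Simplified patterns (assumptions for this sketch, not a full HTML/XML parser).
XML_DECL = re.compile(rb"<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.I)
META_TAG = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def declared_charset(content_type: str, body: bytes) -> Optional[str]:
    """Return the first charset declared in the header, XML declaration, or <meta> tag."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # None when no charset parameter is present
        if charset:
            return charset
    head = body[:1024]  # declarations are expected near the start of the document
    for pattern in (XML_DECL, META_TAG):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


print(declared_charset("text/html; charset=utf-8", b""))
print(declared_charset("text/html", b"<head><meta charset='gbk'></head>"))

# Mojibake: UTF-8 bytes for "é" decoded with the wrong encoding.
raw = "é".encode("utf-8")
print(raw.decode("utf-8"))   # é
print(raw.decode("cp1252"))  # Ã©
```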

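Python's `chardet` and `charset-normalizer` libraries detect the encoding heuristically when it is not declared. The `requests` library reads the charset from the `Content-Type` header. When the header names a `text/*` type without a charset, `requests` falls back to ISO-8859-1; only when it cannot determine any encoding (`response.encoding` is `None`) does `response.text` use `apparent_encoding`, the guess produced by `chardet` or `charset-normalizer`. BeautifulSoup also runs its own encoding detection when given raw bytes.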
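
A minimal sketch of heuristic detection on undeclared bytes might look like the following. It assumes the third-party `charset-normalizer` package and uses its `from_bytes(...).best()` call; the function name `detect_and_decode` and the UTF-8 then Windows-1252 fallback order are choices made for this example, not a recommended standard.

```python
from typing import Tuple

try:
    # Third-party dependency: pip install charset-normalizer
    from charset_normalizer import from_bytes
except ImportError:
    # Keeps the sketch runnable without the package installed.
    from_bytes = None


def detect_and_decode(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for bytes whose encoding is not declared."""
    if from_bytes is not None:
        best = from_bytes(raw).best()  # best guess, or None if nothing fits
        if best is not None:
            return str(best), best.encoding
    # Fallback when the detector is unavailable or finds no match:
    # strict UTF-8 first, then Windows-1252 with replacement characters.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace"), "cp1252"


text, encoding = detect_and_decode("Crème brûlée, déjà vu, façade.".encode("utf-8"))
print(encoding, text)
```

Detection is a statistical guess, so short or ambiguous inputs can be misidentified. With BeautifulSoup, passing raw bytes such as `response.content` to the constructor triggers its own detection, and the encoding it settled on is available afterwards as `soup.original_encoding`.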

Examples

```python
import requests

response = requests.get("https://example.com", timeout=10)

# requests takes the encoding from the Content-Type header by default.
# Override it if the declared or default encoding is wrong:
response.encoding = "utf-8"
text = response.text  # decoded using the encoding set above

# Or use the heuristic guess from chardet / charset-normalizer:
response.encoding = response.apparent_encoding
text = response.text
```
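
One way to combine the two approaches is to trust the header only when it explicitly names a charset and to fall back to detection otherwise, which avoids the ISO-8859-1 default that `requests` applies to `text/*` responses without a charset. The helper below is a hypothetical sketch: `decode_body` is not part of `requests`, and it only assumes an object with the `headers`, `content`, `encoding`, and `apparent_encoding` attributes that a `requests.Response` provides.

```python
def decode_body(response) -> str:
    """Decode response.content, trusting only an explicitly declared charset."""
    content_type = response.headers.get("Content-Type", "")
    declared = "charset=" in content_type.lower()

    # Use the header's charset when one was declared; otherwise use the
    # heuristic guess, and finally UTF-8 if no guess is available.
    encoding = (response.encoding if declared else None) \
        or response.apparent_encoding \
        or "utf-8"

    try:
        return response.content.decode(encoding, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows; fall back to UTF-8.
        return response.content.decode("utf-8", errors="replace")
```

Called as `decode_body(requests.get(url, timeout=10))`, it returns the decoded page text; bytes that are invalid in the chosen encoding become U+FFFD replacement characters rather than raising an error.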
