HTML Character Sets

compiled by Stanislav Sýkora, Extra Byte, Via R.Sanzio 22C, Castano Primo, Italy 20022
in Stan's Library, Ed.S.Sykora, Vol.I. First release March 29, 2006.
Permalink via DOI: 10.3247/SL1Web06.001


The table below lists the scripting names of the non-Unicode (ASCII-like) character sets currently in use, together with their code pages.

Of course, we should all be using Unicode, so that the one and only charset specification we should ever need is utf-8, the 8-bit Unicode Transformation Format. Which, just as of course, most of us know very little about. I, for one, am quite ready for a triple 6-month shift of onion peeling in a submarine, the punishment invented by Joel Spolsky for those who are not up to current standards.

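The charset specification is used, or should be used, by anybody writing HTML documents. It appears as a <META ... > entry in the <HEAD>...</HEAD> section of the HTML script. I usually place it as the first entry in the header. The order relative to other <META> entries should be irrelevant, but the declaration must come early, before the title and any other non-ASCII text. Here is an example of such an entry:

<META http-equiv="Content-Type" content="text/html; charset=windows-1252">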
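As a rough illustration of how such an entry is consumed, the following Python sketch pulls the declared charset out of a page and then uses it to decode the bytes. The sample markup, the class name CharsetSniffer and the helper declared_charset are made up for this example; real browsers follow a considerably more elaborate detection procedure.

```python
from html.parser import HTMLParser


class CharsetSniffer(HTMLParser):
    """Remembers the first charset declared in a <META> entry (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
            return
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=..."
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def declared_charset(raw: bytes):
    """Return the charset named in the document, or None if there is none."""
    sniffer = CharsetSniffer()
    # The declaration itself is plain ASCII, so latin-1 is a safe first pass.
    sniffer.feed(raw.decode("latin-1"))
    sniffer.close()
    return sniffer.charset


# Hypothetical sample page; the byte 0xE9 is "é" in windows-1252.
SAMPLE = (
    b"<HTML><HEAD>"
    b'<META http-equiv="Content-Type" content="text/html; charset=Windows-1252">'
    b"<TITLE>Test</TITLE></HEAD><BODY>caf\xe9</BODY></HTML>"
)

charset = declared_charset(SAMPLE)
print(charset)
print(SAMPLE.decode(charset))
```

The two-pass approach (decode loosely, find the declaration, decode again properly) works only because the declaration itself consists of plain ASCII characters.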

The case (upper or lower) of the charset name in the content string should also be irrelevant, so that writing Windows-1252 or windows-1252 should be equivalent in all browsers (I use the conditional because I have not had the opportunity to test this in all possible browsers). I consistently use lower case.
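I cannot test every browser, but the same convention can at least be observed outside browsers. The sketch below uses Python's standard codec registry, which is only an analogy and says nothing definite about any particular browser; the helper name canonical is invented for the example.

```python
import codecs


def canonical(name: str) -> str:
    """Return the name under which Python registers the given charset."""
    return codecs.lookup(name).name


# The three spellings differ only in letter case.
for spelling in ("Windows-1252", "windows-1252", "WINDOWS-1252"):
    print(spelling, "->", canonical(spelling))
```

All three spellings resolve to the same codec, which Python happens to call cp1252.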

When you do not specify the charset within an HTML document, the browser will fall back on the current default of the computer on which it runs, which of course need not be the same all over the world. The result is that everything looks fine when you display the document on your own computer, but when somebody displays it in another country it may look like complete garbage!
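The effect is easy to reproduce. In this small Python sketch a Czech word is stored as Windows-1250 bytes and then read by a program that assumes Windows-1252; the word chosen and the variable names are just for illustration.

```python
text = "český"

# The author's machine saves the page using the Central European charset.
raw = text.encode("windows-1250")

# A reader whose default is Western European interprets the same bytes.
garbled = raw.decode("windows-1252")
print(garbled)  # èeský

# No information is lost: reversing the wrong step recovers the text.
repaired = garbled.encode("windows-1252").decode("windows-1250")
print(repaired)  # český
```

The bytes are identical in both cases; only the assumption about their meaning differs, which is exactly what the charset entry is there to settle.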

Extended CharSet name	Scripting name	Code-Page	Note
Arabic (ISO)	ISO-8859-6	1256
Arabic (Windows)	Windows-1256	1256
Baltic (ISO)	ISO-8859-4	1257
Baltic (Windows)	Windows-1257	1257
Central European (ISO)	ISO-8859-2	1250	Czechs like this for Czech
Central European (Windows)	Windows-1250	1250	I prefer this for Czech texts
Chinese Simplified (GB2312)	GB2312	936
Chinese Simplified (HZ)	HZ-GB-2312	936
Chinese Traditional (Big5)	Big5	950
Cyrillic (ISO)	ISO-8859-5	1251
Cyrillic (Windows)	Windows-1251	1251
Greek (ISO)	ISO-8859-7	1253
Greek (Windows)	Windows-1253	1253
Hebrew (ISO-Logical)	ISO-8859-8-I	1255
Hebrew (Windows)	Windows-1255	1255
Japanese (EUC)	EUC-JP	932
Japanese (JIS)	ISO-2022-JP	932
Japanese (Shift-JIS)	Shift_JIS	932
Korean	KS_C_5601-1987	949
Korean (EUC)	EUC-KR	949
Latin 3 (ISO)	ISO-8859-3	1252
Latin 9 (ISO)	ISO-8859-15	1252
Thai (Windows)	Windows-874	874
Turkish (ISO)	ISO-8859-9	1254
Turkish (Windows)	Windows-1254	1254
Vietnamese (Windows)	Windows-1258	1258
Western European (ISO)	ISO-8859-1	1252	ISO for English texts
Western European (Windows)	Windows-1252	1252	I prefer this for English/Italian
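Note that the ISO and Windows rows for the same region name different byte layouts, even where they share a code-page entry above. A minimal Python sketch, assuming Python's built-in codecs match the charsets of the same name (the helper encode_or_none is invented here):

```python
def encode_or_none(text: str, charset: str):
    """Encode text in the given charset, or return None if it has no such character."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


samples = [
    ("š", "ISO-8859-2", "Windows-1250"),  # same letter, different byte
    ("€", "ISO-8859-1", "Windows-1252"),  # present only in the Windows set
]

for char, iso_name, win_name in samples:
    print(char, iso_name, encode_or_none(char, iso_name),
          win_name, encode_or_none(char, win_name))
```

So a page declared as ISO-8859-2 but actually saved as Windows-1250 will show most letters correctly and a few wrongly, which is harder to notice than complete garbage.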

The Code-Page identifiers are rarely used by anybody but those programmers who work directly with operating systems and/or write boot-up scripts for DOS, UNIX, LINUX and the like. Most of us commoners do not have any use for them.

The following Table is for those who would nevertheless like to know more about them. It shows the code-pages supported by the Windows API functions GetACP, GetOEMCP and GetCPInfo.

CodePage	Meaning
Identifiers of ANSI character pages
874	Thai
932	Japan
936	Chinese (PRC, Singapore)
949	Korean
950	Chinese (Taiwan, Hong Kong)
1200	Unicode (BMP of ISO 10646)
1250	Windows Eastern European
1251	Windows Cyrillic
1252	Windows Latin 1 (US, Western Europe)
1253	Windows Greek
1254	Windows Turkish
1255	Hebrew
1256	Arabic
1257	Baltic
Identifiers of OEM character pages
437	MS-DOS United States
708	Arabic (ASMO 708)
709	Arabic (ASMO 449+, BCON V4)
710	Arabic (Transparent Arabic)
720	Arabic (Transparent ASMO)
737	Greek (formerly 437G)
775	Baltic
850	MS-DOS Multilingual (Latin I)
852	MS-DOS Slavic (Latin II)
855	IBM Cyrillic (primarily Russian)
857	IBM Turkish
860	MS-DOS Portuguese
861	MS-DOS Icelandic
862	Hebrew
863	MS-DOS Canadian-French
864	Arabic
865	MS-DOS Nordic
866	MS-DOS Russian
869	IBM Modern Greek
1361	Korean (Johab)
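The practical difference between the ANSI and OEM pages is that the same character sits at different byte values. The Python sketch below does not call the Windows API functions mentioned above; it relies on Python's own codecs, which are named cp followed by the identifier, and the helper name byte_for is made up.

```python
def byte_for(char: str, code_page: int) -> bytes:
    """Encode one character using Python's codec for the given code page."""
    return char.encode(f"cp{code_page}")


# "é" in two OEM pages and one ANSI page.
for page in (437, 850, 1252):
    print(page, byte_for("é", page))

# The OEM byte for "é" means something else entirely in the ANSI page.
print(repr(b"\x82".decode("cp1252")))
```

This is why text typed in a DOS box and then opened in a Windows editor tends to lose its accented letters.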
