The Table below lists the codes of currently used non-Unicode (ASCII-like) Character Sets and their Code Pages.
Of course, we should all be using Unicode, so that the one and only charset specification we should need is utf-8, the 8-bit Unicode Transformation Format. Which, just as surely, most of us know very little about. I, for one, am quite ready for a triple 6-month shift of onion peeling in a submarine, the punishment invented by Joel Spolsky for those who are not up to current standards.
The charset specification is used, or should be used, by anybody writing HTML documents. It appears as a <META ... > entry in the <HEAD>...</HEAD> section of the HTML document. I usually place it as the first entry in the header; the exact order should not matter much, but the browser needs to find it early, before it meets any non-ASCII text. Here is an example of such an entry:
<META http-equiv="Content-Type" content="text/html; charset=windows-1252">
The case (upper or lower) of the charset name should also be irrelevant, so that writing Windows-1252 or windows-1252 should be equivalent in all browsers (I use the conditional because I have not had the opportunity to test this in every possible browser). I consistently use lower case.
When you do not specify the charset within an HTML document, the browser falls back on the current default of the computer it runs on. Which, of course, need not be the same all over the world. The result is that everything looks fine when you display the document on your own computer, but when somebody displays it in another country it may look like complete garbage!
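For those who want to check what a page actually declares, here is a minimal Python sketch that pulls the charset out of a <META http-equiv="Content-Type" ...> entry and compares names without regard to case. The names CharsetFinder and find_charset are made up for this illustration, and real browsers do considerably more than this (HTTP headers, byte-order marks and so on), so take it as a sketch only.

```python
import codecs
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remembers the charset of the first Content-Type META entry it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # The content string looks like "text/html; charset=windows-1252".
        for part in attributes.get("content", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset":
                self.charset = value.strip().lower()


def find_charset(html):
    """Return the declared charset in lower case, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset


page = ('<HEAD><META http-equiv="Content-Type" '
        'content="text/html; charset=Windows-1252"></HEAD>')
print(find_charset(page))

# Python's own codec registry also ignores case when looking up a name.
print(codecs.lookup("Windows-1252").name == codecs.lookup("windows-1252").name)
```

Lower-casing the name before comparing it is what makes Windows-1252 and windows-1252 come out the same here; whether every browser does likewise is exactly the part I could not test.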
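The garbage effect is easy to reproduce. The following Python sketch writes a Czech name as Windows-1250 bytes and then reads the same bytes back as Windows-1252, which is roughly what happens when a page without a charset entry is opened on a Western European computer. The helper name misread is invented for this example.

```python
def misread(text, written_as, read_as):
    """Encode text in one charset and decode the bytes in another."""
    raw = text.encode(written_as)
    # errors="replace" keeps the demo from failing on bytes that are
    # undefined in the charset used for reading.
    return raw.decode(read_as, errors="replace")


original = "Dvořák"
garbled = misread(original, "windows-1250", "windows-1252")
print(original, "->", garbled)  # Dvořák -> Dvoøák
```

Only the letter ř is damaged here, because á happens to sit at the same position in both charsets. With longer texts and less closely related charsets the damage is much worse.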
Extended CharSet name | Scripting name | Code-Page | Note
Arabic (ISO) | ISO-8859-6 | 1256 |
Arabic (Windows) | Windows-1256 | 1256 |
Baltic (ISO) | ISO-8859-4 | 1257 |
Baltic (Windows) | Windows-1257 | 1257 |
Central European (ISO) | ISO-8859-2 | 1250 | Czechs like this for Czech
Central European (Windows) | Windows-1250 | 1250 | I prefer this for Czech texts
Chinese Simplified (GB2312) | GB2312 | 936 |
Chinese Simplified (HZ) | HZ-GB-2312 | 936 |
Chinese Traditional (Big5) | Big5 | 950 |
Cyrillic (ISO) | ISO-8859-5 | 1251 |
Cyrillic (Windows) | Windows-1251 | 1251 |
Greek (ISO) | ISO-8859-7 | 1253 |
Greek (Windows) | Windows-1253 | 1253 |
Hebrew (ISO-Logical) | ISO-8859-8-I | 1255 |
Hebrew (Windows) | Windows-1255 | 1255 |
Japanese (EUC) | EUC-JP | 932 |
Japanese (JIS) | ISO-2022-JP | 932 |
Japanese (Shift-JIS) | Shift_JIS | 932 |
Korean | KS_C_5601-1987 | 949 |
Korean (EUC) | EUC-KR | 949 |
Latin 3 (ISO) | ISO-8859-3 | 1252 |
Latin 9 (ISO) | ISO-8859-15 | 1252 |
Thai (Windows) | Windows-874 | 874 |
Turkish (ISO) | ISO-8859-9 | 1254 |
Turkish (Windows) | Windows-1254 | 1254 |
Vietnamese (Windows) | Windows-1258 | 1258 |
Western European (ISO) | ISO-8859-1 | 1252 | ISO for English texts
Western European (Windows) | Windows-1252 | 1252 | I prefer this for English/Italian
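Note that an ISO charset and the Windows charset sharing a Code-Page number in the Table are related, but they are not the same encoding. A small Python sketch, using two Czech letters as an arbitrary example, shows that ISO-8859-2 and Windows-1250 store them as different bytes:

```python
# Two Czech letters that sit at different positions in the two charsets.
text = "š ž"

iso_bytes = text.encode("iso-8859-2")
win_bytes = text.encode("windows-1250")

print("ISO-8859-2:  ", iso_bytes)
print("Windows-1250:", win_bytes)
print("Same bytes?  ", iso_bytes == win_bytes)

# Reading the ISO bytes with the Windows charset silently gives other letters.
print(iso_bytes.decode("windows-1250"))
```

So the charset entry has to name the encoding the file was really saved in; picking the other member of the pair corrupts some of the accented letters.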
The Code-Page identifiers are rarely used by anybody but those programmers who deal with Operating Systems and/or write boot-up scripts for DOS, UNIX, Linux and the like. Most of us commoners do not have any use for them.
The following Table is for those who would nevertheless like to know more about them. It shows the code-pages supported by the Windows API functions GetACP, GetOEMCP and GetCPInfo.
CodePage | Meaning
Identifiers of ANSI character pages
874 | Thai |
932 | Japan |
936 | Chinese (PRC, Singapore) |
949 | Korean |
950 | Chinese (Taiwan, Hong Kong) |
1200 | Unicode (BMP of ISO 10646) |
1250 | Windows Eastern European |
1251 | Windows Cyrillic |
1252 | Windows Latin 1 (US, Western Europe) |
1253 | Windows Greek |
1254 | Windows Turkish |
1255 | Hebrew |
1256 | Arabic |
1257 | Baltic |
Identifiers of OEM character pages
437 | MS-DOS United States |
708 | Arabic (ASMO 708) |
709 | Arabic (ASMO 449+, BCON V4) |
710 | Arabic (Transparent Arabic) |
720 | Arabic (Transparent ASMO) |
737 | Greek (formerly 437G) |
775 | Baltic |
850 | MS-DOS Multilingual (Latin I) |
852 | MS-DOS Slavic (Latin II) |
855 | IBM Cyrillic (primarily Russian)
857 | IBM Turkish |
860 | MS-DOS Portuguese |
861 | MS-DOS Icelandic |
862 | Hebrew |
863 | MS-DOS Canadian-French |
864 | Arabic |
865 | MS-DOS Nordic |
866 | MS-DOS Russian |
869 | IBM Modern Greek |
1361 | Korean (Johab) |
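If you are curious which of these code pages your own machine uses, the GetACP and GetOEMCP functions mentioned above can be called from Python through ctypes. This is only a sketch: the function name windows_code_pages is my own, and since the two API functions exist only on Windows, the sketch returns None everywhere else.

```python
import ctypes
import sys


def windows_code_pages():
    """Return (ANSI, OEM) code-page numbers, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    # GetACP gives the current ANSI code page, GetOEMCP the OEM one.
    return kernel32.GetACP(), kernel32.GetOEMCP()


print(windows_code_pages())
```

On a Windows machine the two numbers printed should be among the identifiers listed in the Table above.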