HTML Character Sets
compiled by Stanislav Sýkora, Extra Byte, Via R.Sanzio 22C, Castano Primo, Italy 20022
in Stan's Library, Ed.S.Sykora, Vol.I. First release March 29, 2006.
Permalink via DOI:  10.3247/SL1Web06.001

The Table below lists the codes of currently used non-Unicode (ASCII-like) Character Sets and their Code Pages.

The Unicode Standard, Version 4.0

Of course, we should all be using Unicode, so that the one and only charset specification we should use is utf-8, the 8-bit Unicode Transformation Format. Which, just as of course, most of us know very little about. I, for one, am quite ready for a triple 6-month shift of onion peeling in a submarine, the punishment invented by Joel Spolsky for those who are not up to current standards.

The charset specification is used - or should be used - by anybody writing HTML documents. It appears as a <META ... > entry in the <HEAD>...</HEAD> section of the HTML document. I usually place it as the first entry in the header, but the order should actually be irrelevant. Here is an example of such an entry:

<META http-equiv="Content-Type" content="text/html; charset=windows-1252">

The case (upper/lower characters) of the charset name in the content string should also be irrelevant, so that writing Windows-1252 or windows-1252 should be equivalent in all browsers (the reason why I use the conditional is that I did not have the opportunity to test this aspect in all possible browsers). I consistently use lower case.
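For those who want to check such an entry programmatically, here is a minimal Python sketch. It pulls the declared charset out of a <META> entry and compares two spellings of a name through Python's own codec registry. The names CharsetSniffer, declared_charset and same_charset are made up for this illustration, and Python's registry is only a stand-in: it says nothing about how any particular browser treats the name.

```python
import codecs
import re
from html.parser import HTMLParser


class CharsetSniffer(HTMLParser):
    """Remembers the first charset declared in a <meta> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, but not values.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=..."
            match = re.search(r"charset\s*=\s*([^\s;]+)",
                              attributes.get("content") or "", re.IGNORECASE)
            if match:
                self.charset = match.group(1)


def declared_charset(html_text):
    """Return the charset name exactly as written in the document, or None."""
    sniffer = CharsetSniffer()
    sniffer.feed(html_text)
    return sniffer.charset


def same_charset(name_a, name_b):
    # codecs.lookup() ignores case, so both spellings resolve to one codec.
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


page = ('<HEAD><META http-equiv="Content-Type" '
        'content="text/html; charset=Windows-1252"></HEAD>')
print(declared_charset(page))
print(same_charset(declared_charset(page), "windows-1252"))
```

The sniffer keeps the name as written and leaves the case folding to the comparison, so the original spelling stays available for inspection.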

When you do not specify the charset within an HTML document, the browser will use the current default of the computer on which it runs. Which, of course, need not be the same in the whole world. The result is that everything looks fine when you display the document on your computer, but when somebody displays it in another country it may look like complete garbage!!!
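The effect is easy to reproduce. The following Python sketch writes a Czech name as windows-1250 bytes and then reads the same bytes back under two other defaults. The helper name view_as is invented for this example, and the Python codecs only approximate what a browser would do with an undeclared charset.

```python
def view_as(text, written_in, displayed_as):
    """Encode text with one charset and decode the same bytes with another."""
    return text.encode(written_in).decode(displayed_as, errors="replace")


original = "Dvořák"

# Same charset on both sides: the text survives unchanged.
print(view_as(original, "windows-1250", "windows-1250"))

# Reader's default is Western European: the byte for "ř" becomes another letter.
print(view_as(original, "windows-1250", "windows-1252"))

# Reader's default is Cyrillic: the accented letters turn into Cyrillic ones.
print(view_as(original, "windows-1250", "windows-1251"))
```

Only the characters above the ASCII range are affected, which is why a mostly English page can look almost right and still be wrong.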

 

Extended CharSet name        Scripting name  Code-Page  Note
---------------------------  --------------  ---------  ---------------------------------
Arabic (ISO)                 ISO-8859-6      1256
Arabic (Windows)             Windows-1256    1256
Baltic (ISO)                 ISO-8859-4      1257
Baltic (Windows)             Windows-1257    1257
Central European (ISO)       ISO-8859-2      1250       Czechs like this for Czech
Central European (Windows)   Windows-1250    1250       I prefer this for Czech texts
Chinese Simplified (GB2312)  GB2312          936
Chinese Simplified (HZ)      HZ-GB-2312      936
Chinese Traditional (Big5)   Big5            950
Cyrillic (ISO)               ISO-8859-5      1251
Cyrillic (Windows)           Windows-1251    1251
Greek (ISO)                  ISO-8859-7      1253
Greek (Windows)              Windows-1253    1253
Hebrew (ISO-Logical)         ISO-8859-8-I    1255
Hebrew (Windows)             Windows-1255    1255
Japanese (EUC)               EUC-JP          932
Japanese (JIS)               ISO-2022-JP     932
Japanese (Shift-JIS)         Shift_JIS       932
Korean                       KS_C_5601-1987  949
Korean (EUC)                 EUC-KR          949
Latin 3 (ISO)                ISO-8859-3      1252
Latin 9 (ISO)                ISO-8859-15     1252
Thai (Windows)               Windows-874     874
Turkish (ISO)                ISO-8859-9      1254
Turkish (Windows)            Windows-1254    1254
Vietnamese (Windows)         Windows-1258    1258
Western European (ISO)       ISO-8859-1      1252       ISO for English texts
Western European (Windows)   Windows-1252    1252       I prefer this for English/Italian
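Note that an ISO set and a Windows set listed under the same Code-Page number are related but not byte-for-byte identical. The Python sketch below lists the byte values that two charsets decode differently. The helper names decode_byte and differing_bytes are my own invention, and the result reflects Python's codec tables rather than any browser.

```python
def decode_byte(value, encoding):
    """Return the character for one byte value, or None if it is unassigned."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_bytes(encoding_a, encoding_b):
    """List the byte values (0..255) that the two charsets treat differently."""
    return [value for value in range(256)
            if decode_byte(value, encoding_a) != decode_byte(value, encoding_b)]


diff = differing_bytes("iso-8859-1", "windows-1252")
print([hex(value) for value in diff])

# Byte 0x80 is a control character in ISO-8859-1 but the euro sign in Windows-1252.
print(repr(decode_byte(0x80, "iso-8859-1")), repr(decode_byte(0x80, "windows-1252")))
```

The same function can be pointed at any other pair in the Table, for example "iso-8859-2" and "windows-1250".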


The Code-Page identifiers are rarely used by anybody but those programmers who interfere with Operating Systems and/or write boot-up scripts for DOS, UNIX, LINUX and the like. Most of us commoners do not have any use for them.

The following Table is for those who would nevertheless like to know more about them. It shows the code-pages supported by the Windows API functions GetACP, GetOEMCP and GetCPInfo.
 

 

CodePage  Meaning
--------  ------------------------------------

Identifiers of ANSI character pages

874       Thai
932       Japan
936       Chinese (PRC, Singapore)
949       Korean
950       Chinese (Taiwan, Hong Kong)
1200      Unicode (BMP of ISO 10646)
1250      Windows Eastern European
1251      Windows Cyrillic
1252      Windows Latin 1 (US, Western Europe)
1253      Windows Greek
1254      Windows Turkish
1255      Hebrew
1256      Arabic
1257      Baltic

Identifiers of OEM character pages

437       MS-DOS United States
708       Arabic (ASMO 708)
709       Arabic (ASMO 449+, BCON V4)
710       Arabic (Transparent Arabic)
720       Arabic (Transparent ASMO)
737       Greek (formerly 437G)
775       Baltic
850       MS-DOS Multilingual (Latin I)
852       MS-DOS Slavic (Latin II)
855       IBM Cyrillic (primarily Russian)
857       IBM Turkish
860       MS-DOS Portuguese
861       MS-DOS Icelandic
862       Hebrew
863       MS-DOS Canadian-French
864       Arabic
865       MS-DOS Nordic
866       MS-DOS Russian
869       IBM Modern Greek
1361      Korean (Johab)
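For the curious, here is a small Python sketch that asks Windows for its active ANSI and OEM code pages through GetACP and GetOEMCP, and then looks up whether Python ships a codec for a given code-page number. The function names active_code_pages and codec_for_code_page are invented for this example. The "cp" + number naming is Python's convention, not part of the Windows API, and Python does not cover every page in the Table.

```python
import codecs
import sys


def active_code_pages():
    """Return (ANSI, OEM) code-page numbers on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported here because ctypes.windll exists only on Windows
    kernel32 = ctypes.windll.kernel32
    return kernel32.GetACP(), kernel32.GetOEMCP()


def codec_for_code_page(number):
    """Return the name of Python's codec for a code page, or None if it has none."""
    try:
        return codecs.lookup(f"cp{number}").name
    except LookupError:
        return None


pages = active_code_pages()
if pages is None:
    print("Not running on Windows; GetACP and GetOEMCP are unavailable.")
else:
    ansi, oem = pages
    print("ANSI code page:", ansi, "->", codec_for_code_page(ansi))
    print("OEM code page:", oem, "->", codec_for_code_page(oem))
```

On a non-Windows system the sketch just reports that the API is unavailable, so it can be run anywhere without failing.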
 

   
Copyright ©2006 Stanislav Sýkora    DOI: 10.3247/SL1Web06.001 Designed by Stan Sýkora