902 lines
28 KiB
Plaintext
902 lines
28 KiB
Plaintext
=head1 NAME
|
|
|
|
Encode::Supported -- Encodings supported by Encode
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
=head2 Encoding Names
|
|
|
|
Encoding names are case insensitive. White space in names
|
|
is ignored. In addition, an encoding may have aliases.
|
|
Each encoding has one "canonical" name. The "canonical"
|
|
name is chosen from the names of the encoding by picking
|
|
the first in the following sequence (with a few exceptions).
|
|
|
|
=over 2
|
|
|
|
=item *
|
|
|
|
The name used by the Perl community. That includes 'utf8' and 'ascii'.
|
|
Unlike aliases, canonical names directly reach the method so such
|
|
frequently used words like 'utf8' don't need to do alias lookups.
|
|
|
|
=item *
|
|
|
|
The MIME name as defined in IETF RFCs. This includes all "iso-"s.
|
|
|
|
=item *
|
|
|
|
The name in the IANA registry.
|
|
|
|
=item *
|
|
|
|
The name used by the organization that defined it.
|
|
|
|
=back
|
|
|
|
In case I<de jure> canonical names differ from that of the Encode
|
|
module, they are always aliased if it ever be implemented. So you can
|
|
safely tell if a given encoding is implemented or not just by passing
|
|
the canonical name.
|
|
|
|
Because of all the alias issues, and because in the general case
|
|
encodings have state, "Encode" uses an encoding object internally
|
|
once an operation is in progress.
|
|
|
|
=head1 Supported Encodings
|
|
|
|
As of Perl 5.8.0, at least the following encodings are recognized.
|
|
Note that unless otherwise specified, they are all case insensitive
|
|
(via alias) and all occurrence of spaces are replaced with '-'.
|
|
In other words, "ISO 8859 1" and "iso-8859-1" are identical.
|
|
|
|
Encodings are categorized and implemented in several different modules
|
|
but you don't have to C<use Encode::XX> to make them available for
|
|
most cases. Encode.pm will automatically load those modules on demand.
|
|
|
|
=head2 Built-in Encodings
|
|
|
|
The following encodings are always available.
|
|
|
|
Canonical Aliases Comments & References
|
|
----------------------------------------------------------------
|
|
ascii US-ascii ISO-646-US [ECMA]
|
|
ascii-ctrl Special Encoding
|
|
iso-8859-1 latin1 [ISO]
|
|
null Special Encoding
|
|
utf8 UTF-8 [RFC2279]
|
|
----------------------------------------------------------------
|
|
|
|
I<null> and I<ascii-ctrl> are special. "null" fails for all character
|
|
so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
|
|
CHARACTERS will fall back to character references. Ditto for
|
|
"ascii-ctrl" except for control characters. For fallback modes, see
|
|
L<Encode>.
|
|
|
|
=head2 Encode::Unicode -- other Unicode encodings
|
|
|
|
Unicode coding schemes other than native utf8 are supported by
|
|
Encode::Unicode, which will be autoloaded on demand.
|
|
|
|
----------------------------------------------------------------
|
|
UCS-2BE UCS-2, iso-10646-1 [IANA, UC]
|
|
UCS-2LE [UC]
|
|
UTF-16 [UC]
|
|
UTF-16BE [UC]
|
|
UTF-16LE [UC]
|
|
UTF-32 [UC]
|
|
UTF-32BE UCS-4 [UC]
|
|
UTF-32LE [UC]
|
|
UTF-7 [RFC2152]
|
|
----------------------------------------------------------------
|
|
|
|
To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
|
|
see L<Encode::Unicode>.
|
|
|
|
UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
|
|
encoding. It is implemented separately by Encode::Unicode::UTF7.
|
|
|
|
=head2 Encode::Byte -- Extended ASCII
|
|
|
|
Encode::Byte implements most single-byte encodings except for
|
|
Symbols and EBCDIC. The following encodings are based on single-byte
|
|
encodings implemented as extended ASCII. Most of them map
|
|
\x80-\xff (upper half) to non-ASCII characters.
|
|
|
|
=over 2
|
|
|
|
=item ISO-8859 and corresponding vendor mappings
|
|
|
|
Since there are so many, they are presented in table format with
|
|
languages and corresponding encoding names by vendors. Note that
|
|
the table is sorted in order of ISO-8859 and the corresponding vendor
|
|
mappings are slightly different from that of ISO. See
|
|
L<http://czyborra.com/charsets/iso8859.html> for details.
|
|
|
|
Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
|
|
----------------------------------------------------------------
|
|
N. America (ASCII) cp437 AdobeStandardEncoding
|
|
cp863 (DOSCanadaF)
|
|
W. Europe iso-8859-1 cp850 cp1252 MacRoman nextstep
|
|
hp-roman8
|
|
cp860 (DOSPortuguese)
|
|
Cntrl. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
|
|
MacCroatian
|
|
MacRomanian
|
|
MacRumanian
|
|
Latin3[1] iso-8859-3
|
|
Latin4[2] iso-8859-4
|
|
Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
|
|
(See also next section) cp866 MacUkrainian
|
|
Arabic iso-8859-6 cp864 cp1256 MacArabic
|
|
cp1006 MacFarsi
|
|
Greek iso-8859-7 cp737 cp1253 MacGreek
|
|
cp869 (DOSGreek2)
|
|
Hebrew iso-8859-8 cp862 cp1255 MacHebrew
|
|
Turkish iso-8859-9 cp857 cp1254 MacTurkish
|
|
Nordics iso-8859-10 cp865
|
|
cp861 MacIcelandic
|
|
MacSami
|
|
Thai iso-8859-11[3] cp874 MacThai
|
|
(iso-8859-12 is nonexistent. Reserved for Indics?)
|
|
Baltics iso-8859-13 cp775 cp1257
|
|
Celtics iso-8859-14
|
|
Latin9 [4] iso-8859-15
|
|
Latin10 iso-8859-16
|
|
Vietnamese viscii cp1258 MacVietnamese
|
|
----------------------------------------------------------------
|
|
|
|
[1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
|
|
[2] Baltics. Now on 8859-10, except for Latvian.
|
|
[3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
|
|
[4] Nicknamed Latin0; the Euro sign as well as French and Finnish
|
|
letters that are missing from 8859-1 were added.
|
|
|
|
All cp* are also available as ibm-*, ms-*, and windows-* . See also
|
|
L<http://czyborra.com/charsets/codepages.html>.
|
|
|
|
Macintosh encodings don't seem to be registered in such entities as
|
|
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
|
|
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
|
|
for details.
|
|
|
|
=item KOI8 - De Facto Standard for the Cyrillic world
|
|
|
|
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
|
|
popular in the Net. L<Encode> comes with the following KOI charsets.
|
|
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>
|
|
|
|
----------------------------------------------------------------
|
|
koi8-f
|
|
koi8-r cp878 [RFC1489]
|
|
koi8-u [RFC2319]
|
|
----------------------------------------------------------------
|
|
|
|
=back
|
|
|
|
=head2 gsm0338 - Hentai Latin 1
|
|
|
|
GSM0338 is for GSM handsets. Though it shares alphanumerals with
|
|
ASCII, control character ranges and other parts are mapped very
|
|
differently, mainly to store Greek characters. There are also escape
|
|
sequences (starting with 0x1B) to cover e.g. the Euro sign.
|
|
|
|
This was once handled by L<Encode::Bytes> but because of all those
|
|
unusual specifications, Encode 2.20 has relocated the support to
|
|
L<Encode::GSM0338>. See L<Encode::GSM0338> for details.
|
|
|
|
=over 2
|
|
|
|
=item gsm0338 support before 2.19
|
|
|
|
Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
|
|
well-defined and decode() will return an empty string for them.
|
|
One possible workaround is
|
|
|
|
$gsm =~ s/\x00\z/\x00\x00/;
|
|
$uni = decode("gsm0338", $gsm);
|
|
$uni .= "\xA0" if $gsm =~ /\x1B\z/;
|
|
|
|
Note that the Encode implementation of GSM0338 does not implement the
|
|
reuse of Latin capital letters as Greek capital letters (for example,
|
|
the 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
|
|
LETTER ZETA).
|
|
|
|
The GSM0338 is also covered in Encode::Byte even though it is not
|
|
an "extended ASCII" encoding.
|
|
|
|
=back
|
|
|
|
=head2 CJK: Chinese, Japanese, Korean (Multibyte)
|
|
|
|
Note that Vietnamese is listed above. Also read "Encoding vs Charset"
|
|
below. Also note that these are implemented in distinct modules by
|
|
countries, due to the size concerns (simplified Chinese is mapped
|
|
to 'CN', continental China, while traditional Chinese is mapped to
|
|
'TW', Taiwan). Please refer to their respective documentation pages.
|
|
|
|
=over 2
|
|
|
|
=item Encode::CN -- Continental China
|
|
|
|
Standard DOS/Win Macintosh Comment/Reference
|
|
----------------------------------------------------------------
|
|
euc-cn [1] MacChineseSimp
|
|
(gbk) cp936 [2]
|
|
gb12345-raw { GB12345 without CES }
|
|
gb2312-raw { GB2312 without CES }
|
|
hz
|
|
iso-ir-165
|
|
----------------------------------------------------------------
|
|
|
|
[1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
|
|
[2] gbk is aliased to this. See L<Microsoft-related naming mess>
|
|
|
|
=item Encode::JP -- Japan
|
|
|
|
Standard DOS/Win Macintosh Comment/Reference
|
|
----------------------------------------------------------------
|
|
euc-jp
|
|
shiftjis cp932 macJapanese
|
|
7bit-jis
|
|
iso-2022-jp [RFC1468]
|
|
iso-2022-jp-1 [RFC2237]
|
|
jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
|
|
jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
|
|
jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
|
|
----------------------------------------------------------------
|
|
|
|
=item Encode::KR -- Korea
|
|
|
|
Standard DOS/Win Macintosh Comment/Reference
|
|
----------------------------------------------------------------
|
|
euc-kr MacKorean [RFC1557]
|
|
cp949 [1]
|
|
iso-2022-kr [RFC1557]
|
|
johab [KS X 1001:1998, Annex 3]
|
|
ksc5601-raw { KSC5601 without CES }
|
|
----------------------------------------------------------------
|
|
|
|
[1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
|
|
See below.
|
|
|
|
=item Encode::TW -- Taiwan
|
|
|
|
Standard DOS/Win Macintosh Comment/Reference
|
|
----------------------------------------------------------------
|
|
big5-eten cp950 MacChineseTrad {big5 aliased to big5-eten}
|
|
big5-hkscs
|
|
----------------------------------------------------------------
|
|
|
|
=item Encode::HanExtra -- More Chinese via CPAN
|
|
|
|
Due to the size concerns, additional Chinese encodings below are
|
|
distributed separately on CPAN, under the name Encode::HanExtra.
|
|
|
|
Standard DOS/Win Macintosh Comment/Reference
|
|
----------------------------------------------------------------
|
|
big5ext CMEX's Big5e Extension
|
|
big5plus CMEX's Big5+ Extension
|
|
cccii Chinese Character Code for Information Interchange
|
|
euc-tw EUC (Extended Unix Character)
|
|
gb18030 GBK with Traditional Characters
|
|
----------------------------------------------------------------
|
|
|
|
=item Encode::JIS2K -- JIS X 0213 encodings via CPAN
|
|
|
|
Due to size concerns, additional Japanese encodings below are
|
|
distributed separately on CPAN, under the name Encode::JIS2K.
|
|
|
|
Standard DOS/Win Macintosh Comment/Reference
|
|
----------------------------------------------------------------
|
|
euc-jisx0213
|
|
shiftjisx0123
|
|
iso-2022-jp-3
|
|
jis0213-1-raw
|
|
jis0213-2-raw
|
|
----------------------------------------------------------------
|
|
|
|
=back
|
|
|
|
=head2 Miscellaneous encodings
|
|
|
|
=over 2
|
|
|
|
=item Encode::EBCDIC
|
|
|
|
See L<perlebcdic> for details.
|
|
|
|
----------------------------------------------------------------
|
|
cp37
|
|
cp500
|
|
cp875
|
|
cp1026
|
|
cp1047
|
|
posix-bc
|
|
----------------------------------------------------------------
|
|
|
|
=item Encode::Symbols
|
|
|
|
For symbols and dingbats.
|
|
|
|
----------------------------------------------------------------
|
|
symbol
|
|
dingbats
|
|
MacDingbats
|
|
AdobeZdingbat
|
|
AdobeSymbol
|
|
----------------------------------------------------------------
|
|
|
|
=item Encode::MIME::Header
|
|
|
|
Strictly speaking, MIME header encoding documented in RFC 2047 is more
|
|
of encapsulation than encoding. However, their support in modern
|
|
world is imperative so they are supported.
|
|
|
|
----------------------------------------------------------------
|
|
MIME-Header [RFC2047]
|
|
MIME-B [RFC2047]
|
|
MIME-Q [RFC2047]
|
|
----------------------------------------------------------------
|
|
|
|
=item Encode::Guess
|
|
|
|
This one is not a name of encoding but a utility that lets you pick up
|
|
the most appropriate encoding for a data out of given I<suspects>. See
|
|
L<Encode::Guess> for details.
|
|
|
|
=back
|
|
|
|
=head1 Unsupported encodings
|
|
|
|
The following encodings are not supported as yet; some because they
|
|
are rarely used, some because of technical difficulties. They may
|
|
be supported by external modules via CPAN in the future, however.
|
|
|
|
=over 2
|
|
|
|
=item ISO-2022-JP-2 [RFC1554]
|
|
|
|
Not very popular yet. Needs Unicode Database or equivalent to
|
|
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
|
|
GB2312 simultaneously, whose code points in Unicode overlap. So you
|
|
need to lookup the database to determine to what character set a given
|
|
Unicode character should belong).
|
|
|
|
=item ISO-2022-CN [RFC1922]
|
|
|
|
Not very popular. Needs CNS 11643-1 and -2 which are not available in
|
|
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
|
|
Audrey Tang may add support for this encoding in her module in future.
|
|
|
|
=item Various HP-UX encodings
|
|
|
|
The following are unsupported due to the lack of mapping data.
|
|
|
|
'8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
|
|
'15' - japanese15, korean15, and roi15
|
|
|
|
=item Cyrillic encoding ISO-IR-111
|
|
|
|
Anton Tagunov doubts its usefulness.
|
|
|
|
=item ISO-8859-8-1 [Hebrew]
|
|
|
|
None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and
|
|
MacHebrew are supported because and just because there were mappings
|
|
available at L<http://www.unicode.org/>). Contributions welcome.
|
|
|
|
=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
|
|
|
|
Ditto.
|
|
|
|
=item Thai encoding TCVN
|
|
|
|
Ditto.
|
|
|
|
=item Vietnamese encodings VPS
|
|
|
|
Though Jungshik Shin has reported that Mozilla supports this encoding,
|
|
it was too late before 5.8.0 for us to add it. In the future, it
|
|
may be available via a separate module. See
|
|
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
|
|
and
|
|
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
|
|
if you are interested in helping us.
|
|
|
|
=item Various Mac encodings
|
|
|
|
The following are unsupported due to the lack of mapping data.
|
|
|
|
MacArmenian, MacBengali, MacBurmese, MacEthiopic
|
|
MacExtArabic, MacGeorgian, MacKannada, MacKhmer
|
|
MacLaotian, MacMalayalam, MacMongolian, MacOriya
|
|
MacSinhalese, MacTamil, MacTelugu, MacTibetan
|
|
MacVietnamese
|
|
|
|
The rest which are already available are based upon the vendor mappings
|
|
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
|
|
|
|
=item (Mac) Indic encodings
|
|
|
|
The maps for the following are available at L<http://www.unicode.org/>
|
|
but remain unsupported because those encodings need an algorithmical
|
|
approach, currently unsupported by F<enc2xs>:
|
|
|
|
MacDevanagari
|
|
MacGurmukhi
|
|
MacGujarati
|
|
|
|
For details, please see C<Unicode mapping issues and notes:> at
|
|
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
|
|
|
|
I believe this issue is prevalent not only for Mac Indics but also in
|
|
other Indic encodings, but the above were the only Indic encodings
|
|
maps that I could find at L<http://www.unicode.org/> .
|
|
|
|
=back
|
|
|
|
=head1 Encoding vs. Charset -- terminology
|
|
|
|
We are used to using the term (character) I<encoding> and I<character
|
|
set> interchangeably. But just as confusing the terms byte and
|
|
character is dangerous and the terms should be differentiated when
|
|
needed, we need to differentiate I<encoding> and I<character set>.
|
|
|
|
To understand that, here is a description of how we make computers
|
|
grok our characters.
|
|
|
|
=over 2
|
|
|
|
=item *
|
|
|
|
First we start with which characters to include. We call this
|
|
collection of characters I<character repertoire>.
|
|
|
|
=item *
|
|
|
|
Then we have to give each character a unique ID so your computer can
|
|
tell the difference between 'a' and 'A'. This itemized character
|
|
repertoire is now a I<character set>.
|
|
|
|
=item *
|
|
|
|
If your computer can grow the character set without further
|
|
processing, you can go ahead and use it. This is called a I<coded
|
|
character set> (CCS) or I<raw character encoding>. ASCII is used this
|
|
way for most cases.
|
|
|
|
=item *
|
|
|
|
But in many cases, especially multi-byte CJK encodings, you have to
|
|
tweak a little more. Your network connection may not accept any data
|
|
with the Most Significant Bit set, and your computer may not be able to
|
|
tell if a given byte is a whole character or just half of it. So you
|
|
have to I<encode> the character set to use it.
|
|
|
|
A I<character encoding scheme> (CES) determines how to encode a given
|
|
character set, or a set of multiple character sets. 7bit ISO-2022 is
|
|
an example of a CES. You switch between character sets via I<escape
|
|
sequences>.
|
|
|
|
=back
|
|
|
|
Technically, or mathematically, speaking, a character set encoded in
|
|
such a CES that maps character by character may form a CCS. EUC is such
|
|
an example. The CES of EUC is as follows:
|
|
|
|
=over 2
|
|
|
|
=item *
|
|
|
|
Map ASCII unchanged.
|
|
|
|
=item *
|
|
|
|
Map such a character set that consists of 94 or 96 powered by N
|
|
members by adding 0x80 to each byte.
|
|
|
|
=item *
|
|
|
|
You can also use 0x8e and 0x8f to indicate that the following sequence of
|
|
characters belongs to yet another character set. To each following byte
|
|
is added the value 0x80.
|
|
|
|
=back
|
|
|
|
By carefully looking at the encoded byte sequence, you can find that the
|
|
byte sequence conforms a unique number. In that sense, EUC is a CCS
|
|
generated by a CES above from up to four CCS (complicated?). UTF-8
|
|
falls into this category. See L<perlUnicode/"UTF-8"> to find out how
|
|
UTF-8 maps Unicode to a byte sequence.
|
|
|
|
You may also have found out by now why 7bit ISO-2022 cannot comprise
|
|
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
|
|
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
|
|
so you have no trouble differentiating between "!!". and S<" ">.
|
|
|
|
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
|
|
|
|
This section tries to classify the supported encodings by their
|
|
applicability for information exchange over the Internet and to
|
|
choose the most suitable aliases to name them in the context of
|
|
such communication.
|
|
|
|
=over 2
|
|
|
|
=item *
|
|
|
|
To (en|de)code encodings marked by C<(**)>, you need
|
|
C<Encode::HanExtra>, available from CPAN.
|
|
|
|
=back
|
|
|
|
Encoding names
|
|
|
|
US-ASCII UTF-8 ISO-8859-* KOI8-R
|
|
Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
|
|
EUC-KR Big5 GB2312
|
|
|
|
are registered with IANA as preferred MIME names and may
|
|
be used over the Internet.
|
|
|
|
C<Shift_JIS> has been officialized by JIS X 0208:1997.
|
|
L<Microsoft-related naming mess> gives details.
|
|
|
|
C<GB2312> is the IANA name for C<EUC-CN>.
|
|
See L<Microsoft-related naming mess> for details.
|
|
|
|
C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
|
|
with Encode. See L<Encode::CN> for details.
|
|
|
|
EUC-CN
|
|
KOI8-U [RFC2319]
|
|
|
|
have not been registered with IANA (as of March 2002) but
|
|
seem to be supported by major web browsers.
|
|
The IANA name for C<EUC-CN> is C<GB2312>.
|
|
|
|
KS_C_5601-1987
|
|
|
|
is heavily misused.
|
|
See L<Microsoft-related naming mess> for details.
|
|
|
|
C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
|
|
with Encode. See L<Encode::KR> for details.
|
|
|
|
UTF-16 UTF-16BE UTF-16LE
|
|
|
|
are IANA-registered C<charset>s. See [RFC 2781] for details.
|
|
Jungshik Shin reports that UTF-16 with a BOM is well accepted
|
|
by MS IE 5/6 and NS 4/6. Beware however that
|
|
|
|
=over 2
|
|
|
|
=item *
|
|
|
|
C<UTF-16> support in any software you're going to be
|
|
using/interoperating with has probably been less tested
|
|
then C<UTF-8> support
|
|
|
|
=item *
|
|
|
|
C<UTF-8> coded data seamlessly passes traditional
|
|
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
|
|
data is likely to cause confusion (with its zero bytes,
|
|
for example)
|
|
|
|
=item *
|
|
|
|
it is beyond the power of words to describe the way HTML browsers
|
|
encode non-C<ASCII> form data. To get a general impression, visit
|
|
L<http://www.alanflavell.org.uk/charset/form-i18n.html>.
|
|
While encoding of form data has stabilized for C<UTF-8> encoded pages
|
|
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
|
|
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
|
|
pages!
|
|
|
|
=back
|
|
|
|
The rule of thumb is to use C<UTF-8> unless you know what
|
|
you're doing and unless you really benefit from using C<UTF-16>.
|
|
|
|
ISO-IR-165 [RFC1345]
|
|
VISCII
|
|
GB 12345
|
|
GB 18030 (**) (see links below)
|
|
EUC-TW (**)
|
|
|
|
are totally valid encodings but not registered at IANA.
|
|
The names under which they are listed here are probably the
|
|
most widely-known names for these encodings and are recommended
|
|
names.
|
|
|
|
BIG5PLUS (**)
|
|
|
|
is a proprietary name.
|
|
|
|
=head2 Microsoft-related naming mess
|
|
|
|
Microsoft products misuse the following names:
|
|
|
|
=over 2
|
|
|
|
=item KS_C_5601-1987
|
|
|
|
Microsoft extension to C<EUC-KR>.
|
|
|
|
Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
|
|
|
|
See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
|
|
for details.
|
|
|
|
Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
|
|
misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
|
|
C<kcs5601-raw>.
|
|
|
|
See L<Encode::KR> for details.
|
|
|
|
=item GB2312
|
|
|
|
Microsoft extension to C<EUC-CN>.
|
|
|
|
Proper names: C<CP936>, C<GBK>.
|
|
|
|
C<GB2312> has been registered in the C<EUC-CN> meaning at
|
|
IANA. This has partially repaired the situation: Microsoft's
|
|
C<GB2312> has become a superset of the official C<GB2312>.
|
|
|
|
Encode aliases C<GB2312> to C<euc-cn> in full agreement with
|
|
IANA registration. C<cp936> is supported separately.
|
|
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
|
|
|
|
See L<Encode::CN> for details.
|
|
|
|
=item Big5
|
|
|
|
Microsoft extension to C<Big5>.
|
|
|
|
Proper name: C<CP950>.
|
|
|
|
Encode separately supports C<Big5> and C<cp950>.
|
|
|
|
=item Shift_JIS
|
|
|
|
Microsoft's understanding of C<Shift_JIS>.
|
|
|
|
JIS has not endorsed the full Microsoft standard however.
|
|
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
|
|
character sets, while Microsoft has always used C<Shift_JIS>
|
|
to encode a wider character repertoire. See C<IANA> registration for
|
|
C<Windows-31J>.
|
|
|
|
As a historical predecessor, Microsoft's variant
|
|
probably has more rights for the name, though it may be objected
|
|
that Microsoft shouldn't have used JIS as part of the name
|
|
in the first place.
|
|
|
|
Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
|
|
provided as an alias by Encode): C<Windows-31J>.
|
|
|
|
Encode separately supports C<Shift_JIS> and C<cp932>.
|
|
|
|
=back
|
|
|
|
=head1 Glossary
|
|
|
|
=over 2
|
|
|
|
=item character repertoire
|
|
|
|
A collection of unique characters. A I<character> set in the strictest
|
|
sense. At this stage, characters are not numbered.
|
|
|
|
=item coded character set (CCS)
|
|
|
|
A character set that is mapped in a way computers can use directly.
|
|
Many character encodings, including EUC, fall in this category.
|
|
|
|
=item character encoding scheme (CES)
|
|
|
|
An algorithm to map a character set to a byte sequence. You don't
|
|
have to be able to tell which character set a given byte sequence
|
|
belongs. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
|
|
example of being both a CCS and CES.
|
|
|
|
=item charset (in MIME context)
|
|
|
|
has long been used in the meaning of C<encoding>, CES.
|
|
|
|
While the word combination C<character set> has lost this meaning
|
|
in MIME context since [RFC 2130], the C<charset> abbreviation has
|
|
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
|
|
|
|
This document uses the term "charset" to mean a set of rules for
|
|
mapping from a sequence of octets to a sequence of characters, such
|
|
as the combination of a coded character set and a character encoding
|
|
scheme; this is also what is used as an identifier in MIME "charset="
|
|
parameters, and registered in the IANA charset registry ... (Note
|
|
that this is NOT a term used by other standards bodies, such as ISO).
|
|
[RFC 2277]
|
|
|
|
=item EUC
|
|
|
|
Extended Unix Character. See ISO-2022.
|
|
|
|
=item ISO-2022
|
|
|
|
A CES that was carefully designed to coexist with ASCII. There are a 7
|
|
bit version and an 8 bit version.
|
|
|
|
The 7 bit version switches character set via escape sequence so it
|
|
cannot form a CCS. Since this is more difficult to handle in programs
|
|
than the 8 bit version, the 7 bit version is not very popular except for
|
|
iso-2022-jp, the I<de facto> standard CES for e-mails.
|
|
|
|
The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
|
|
thereof. Pre-5.6 perl could use them as string literals.
|
|
|
|
=item UCS
|
|
|
|
Short for I<Universal Character Set>. When you say just UCS, it means
|
|
I<Unicode>.
|
|
|
|
=item UCS-2
|
|
|
|
ISO/IEC 10646 encoding form: Universal Character Set coded in two
|
|
octets.
|
|
|
|
=item Unicode
|
|
|
|
A character set that aims to include all character repertoires of the
|
|
world. Many character sets in various national as well as industrial
|
|
standards have become, in a way, just subsets of Unicode.
|
|
|
|
=item UTF
|
|
|
|
Short for I<Unicode Transformation Format>. Determines how to map a
|
|
Unicode character into a byte sequence.
|
|
|
|
=item UTF-16
|
|
|
|
A UTF in 16-bit encoding. Can either be in big endian or little
|
|
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
|
|
surrogate support) and the little endian version is called UTF-16LE.
|
|
|
|
=back
|
|
|
|
=head1 See Also
|
|
|
|
L<Encode>,
|
|
L<Encode::Byte>,
|
|
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
|
|
L<Encode::EBCDIC>, L<Encode::Symbol>
|
|
L<Encode::MIME::Header>, L<Encode::Guess>
|
|
|
|
=head1 References
|
|
|
|
=over 2
|
|
|
|
=item ECMA
|
|
|
|
European Computer Manufacturers Association
|
|
L<http://www.ecma.ch>
|
|
|
|
=over 2
|
|
|
|
=item ECMA-035 (eq C<ISO-2022>)
|
|
|
|
L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
|
|
|
|
The specification of ISO-2022 is available from the link above.
|
|
|
|
=back
|
|
|
|
=item IANA
|
|
|
|
Internet Assigned Numbers Authority
|
|
L<http://www.iana.org/>
|
|
|
|
=over 2
|
|
|
|
=item Assigned Charset Names by IANA
|
|
|
|
L<http://www.iana.org/assignments/character-sets>
|
|
|
|
Most of the C<canonical names> in Encode derive from this list
|
|
so you can directly apply the string you have extracted from MIME
|
|
header of mails and web pages.
|
|
|
|
=back
|
|
|
|
=item ISO
|
|
|
|
International Organization for Standardization
|
|
L<http://www.iso.ch/>
|
|
|
|
=item RFC
|
|
|
|
Request For Comments -- need I say more?
|
|
L<http://www.rfc-editor.org/>, L<http://www.ietf.org/rfc.html>,
|
|
L<http://www.faqs.org/rfcs/>
|
|
|
|
=item UC
|
|
|
|
Unicode Consortium
|
|
L<http://www.unicode.org/>
|
|
|
|
=over 2
|
|
|
|
=item Unicode Glossary
|
|
|
|
L<http://www.unicode.org/glossary/>
|
|
|
|
The glossary of this document is based upon this site.
|
|
|
|
=back
|
|
|
|
=back
|
|
|
|
=head2 Other Notable Sites
|
|
|
|
=over 2
|
|
|
|
=item czyborra.com
|
|
|
|
L<http://czyborra.com/>
|
|
|
|
Contains a lot of useful information, especially gory details of ISO
|
|
vs. vendor mappings.
|
|
|
|
=item CJK.inf
|
|
|
|
L<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>
|
|
|
|
Somewhat obsolete (last update in 1996), but still useful. Also try
|
|
|
|
L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
|
|
|
|
You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
|
|
|
|
=item Jungshik Shin's Hangul FAQ
|
|
|
|
L<http://jshin.net/faq>
|
|
|
|
And especially its subject 8.
|
|
|
|
L<http://jshin.net/faq/qa8.html>
|
|
|
|
A comprehensive overview of the Korean (C<KS *>) standards.
|
|
|
|
=item debian.org: "Introduction to i18n"
|
|
|
|
A brief description for most of the mentioned CJK encodings is
|
|
contained in
|
|
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
|
|
|
|
=back
|
|
|
|
=head2 Offline sources
|
|
|
|
=over 2
|
|
|
|
=item C<CJKV Information Processing> by Ken Lunde
|
|
|
|
CJKV Information Processing
|
|
1999 O'Reilly & Associates, ISBN : 1-56592-224-7
|
|
|
|
The modern successor of C<CJK.inf>.
|
|
|
|
Features a comprehensive coverage of CJKV character sets and
|
|
encodings along with many other issues faced by anyone trying
|
|
to better support CJKV languages/scripts in all the areas of
|
|
information processing.
|
|
|
|
To purchase this book, visit
|
|
L<http://oreilly.com/catalog/9780596514471/>
|
|
or your favourite bookstore.
|
|
|
|
=back
|
|
|
|
=cut
|