Diffstat (limited to 'manual/charset.texi')
-rw-r--r-- manual/charset.texi | 136
1 files changed, 75 insertions, 61 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
index deae7af08a..89a54d8e13 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -15,7 +15,7 @@ limitations of this approach became more apparent as more people
grappled with non-Roman character sets, where not all the characters
that make up a language's character set can be represented by @math{2^8}
choices. This chapter shows the functionality which was added to the C
-library to correctly support multiple character sets.
+library to support multiple character sets.
@menu
* Extended Char Intro:: Introduction to Extended Characters.
@@ -46,13 +46,13 @@ through whatever communication channel. Examples of external
representations include files lying in a directory that are going to be
read and parsed.
-Traditionally there was no difference between the two representations.
-It was equally comfortable and useful to use the same one-byte
+Traditionally there has been no difference between the two representations.
+It was equally comfortable and useful to use the same single-byte
representation internally and externally. This changes with more and
larger character sets.
One of the problems to overcome with the internal representation is
-handling text which is externally encoded using different character
+handling text that is externally encoded using different character
sets. Assume a program which reads two texts and compares them using
some metric. The comparison can be usefully done only if the texts are
internally kept in a common format.
@@ -69,14 +69,28 @@ than four bytes seem not to be necessary).
As shown in some other part of this manual,
@c !!! Ahem, wide char string functions are not yet covered -- drepper
there exists a completely new family of functions which can handle texts
-of this kind in memory. The most commonly used character set for such
-internal wide character representations are Unicode and @w{ISO 10646}.
-The former is a subset of the latter and used when wide characters are
-chosen to by 2 bytes (@math{= 16} bits) wide. The standard names of the
-@cindex UCS2
-@cindex UCS4
-encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
-(@math{= 32} bits).
+of this kind in memory. The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}
+(also known as UCS for Universal Character Set). Unicode was originally
+planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
+be a 31-bit large code space. The two standards are practically identical.
+They have the same character repertoire and code table, but Unicode specifies
+added semantics. At the moment, only characters in the first @code{0x10000}
+code positions (the so-called Basic Multilingual Plane, BMP) have been
+assigned, but the assignment of more specialized characters outside this
+16-bit space is already in progress. A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
To represent wide characters the @code{char} type is not suitable. For
this reason the @w{ISO C} standard introduces a new type which is
@@ -93,18 +107,18 @@ for multibyte character strings. The type is defined in @file{stddef.h}.
The @w{ISO C90} standard, where this type was introduced, does not say
anything specific about the representation. It only requires that this
-type is capable to store all elements of the basic character set.
+type is capable of storing all elements of the basic character set.
Therefore it would be legitimate to define @code{wchar_t} as
@code{char}. This might make sense for embedded systems.
But for GNU systems this type is always 32 bits wide. It is therefore
-capable to represent all UCS4 value therefore covering all of @w{ISO
-10646}. Some Unix systems define @code{wchar_t} as a 16 bit type and
+capable of representing all UCS-4 values and therefore covering all of
+@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type and
thereby follow Unicode very strictly. This is perfectly fine with the
standard but it also means that to represent all characters from Unicode
-and @w{ISO 10646} one has to use surrogate character which is in fact a
-multi-wide-character encoding. But this contradicts the purpose of the
-@code{wchar_t} type.
+and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in
+fact a multi-wide-character encoding. But this contradicts the purpose
+of the @code{wchar_t} type.
@end deftp
@comment wchar.h
@@ -119,8 +133,8 @@ defined as @code{char} the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.
@pindex wchar.h
-This type is defined in @file{wchar.h} and got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h} and got introduced in
+@w{Amendment 1} to @w{ISO C90}.
@end deftp
As there are for the @code{char} data type there also exist macros
@@ -133,7 +147,7 @@ type @code{wchar_t}.
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wint_t}.
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr
@comment wchar.h
@@ -142,7 +156,7 @@ This macro got introduced in the second amendment to @w{ISO C90}.
The macro @code{WCHAR_MIN} evaluates to the maximum value representable
by an object of type @code{wint_t}.
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr
Another special wide character value is the equivalent to @code{EOF}.
@@ -180,7 +194,7 @@ are used.
@end smallexample
@pindex wchar.h
-This macro was introduced in the second amendment to @w{ISO C90} and is
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
defined in @file{wchar.h}.
@end deftypevr
@@ -198,7 +212,7 @@ oriented character set.
@cindex multibyte character
@cindex EBCDIC
For all the above reasons, an external encoding which is different
-from the internal encoding is often used if the latter is UCS2 or UCS4.
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
The external encoding is byte-based and can be chosen appropriately for
the environment and for the texts to be handled. There exist a variety
of different character sets which can be used for this external
@@ -215,7 +229,7 @@ system calls have to be converted first anyhow.
@itemize @bullet
@item
-The simplest character sets are one-byte character sets. There can be
+The simplest character sets are single-byte character sets. There can be
only up to 256 characters (for @w{8 bit} character sets) which is not
sufficient to cover all languages but might be sufficient to handle a
specific text. Another reason to choose this is because of constraints
@@ -240,7 +254,7 @@ big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly. Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or SJIS (Shift JIS, a Japanese encoding).
+or SJIS (Shift-JIS, a Japanese encoding).
But there are also character sets using a state which is valid for more
than one character and has to be changed by another byte sequence.
@@ -257,23 +271,23 @@ acute accent, following by lower-case `a') to get the ``small a with
acute'' character. To get the acute accent character on its on one has
to write @code{0xc2 0x20} (the non-spacing acute followed by a space).
-This type of characters sets is quite frequently used in embedded
-systems such as video text.
+This type of character set is used in some embedded systems such as
+teletex.
@item
@cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
it is often also sufficient to simply use an encoding different than
-UCS2/UCS4. The Unicode and @w{ISO 10646} standards even specify such an
+UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8. This encoding is able to represent all of @w{ISO
-10464} 31 bits in a byte string of length one to seven.
+10646} 31 bits in a byte string of length one to six.
@cindex UTF-7
There were a few other attempts to encode @w{ISO 10646} such as UTF-7
but UTF-8 is today the only encoding which should be used. In fact,
-UTF-8 will hopefully soon be the only external which has to be
+UTF-8 will hopefully soon be the only external encoding that has to be
supported. It proves to be universally usable and the only disadvantage
-is that it favor Roman languages very much by making the byte string
+is that it favors Roman languages by making the byte string
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
than necessary if using a specific character set for these scripts.
Methods like the Unicode compression scheme can alleviate these
@@ -324,7 +338,7 @@ developing libraries (as opposed to applications).
The second family of functions got introduced in the early Unix standards
(XPG2) and is still part of the latest and greatest Unix standard:
@w{Unix 98}. It is also the most powerful and useful set of functions.
-But we will start with the functions defined in the second amendment to
+But we will start with the functions defined in @w{Amendment 1} to
@w{ISO C90}.
@node Restartable multibyte conversion
@@ -377,7 +391,7 @@ We already said above that the currently selected locale for the
by the functions we are about to describe. Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding. The wide character
-character set always is UCS4, at least on GNU systems.
+character set always is UCS-4, at least on GNU systems.
A characteristic of each multibyte character set is the maximum number
of bytes which can be necessary to represent one character. This
@@ -456,8 +470,8 @@ about the @dfn{shift state} needed from one call to a conversion
function to another.
@pindex wchar.h
-This type is defined in @file{wchar.h}. It got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h}. It got introduced in
+@w{Amendment 1} to @w{ISO C90}.
@end deftp
To use objects of this type the programmer has to define such objects
@@ -495,7 +509,7 @@ object is in the initial state the return value is nonzero. Otherwise
it is zero.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -559,7 +573,7 @@ which the state information is taken and the function also does not use
any static state.
@pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -608,7 +622,7 @@ value of this function is this character. Otherwise the return value is
@code{EOF}.
@pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -655,7 +669,7 @@ a valid multibyte character also no value is stored, the global variable
@code{(size_t) -1}. The conversion state is afterwards undefined.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -733,7 +747,7 @@ object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
object local to @code{mbrlen} is used.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -839,7 +853,7 @@ character. So the caller has to make sure that there is enough space
available, otherwise buffer overruns can occur.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun
@@ -977,7 +991,7 @@ byte in the input string was reached) or the address of the byte
following the last converted multibyte character.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun
@@ -1058,7 +1072,7 @@ the initial shift state in case the terminating NUL wide character was
converted.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun
@@ -1231,8 +1245,8 @@ file_mbsrtowcs (int input, int output)
@node Non-reentrant Conversion
@section Non-reentrant Conversion Function
-The functions described in the last chapter are defined in the second
-amendment to @w{ISO C90}. But the original @w{ISO C90} standard also
+The functions described in the last chapter are defined in
+@w{Amendment 1} to @w{ISO C90}. But the original @w{ISO C90} standard also
contained functions for character set conversion. The reason that they
are not described in the first place is that they are almost entirely
useless.
@@ -1369,8 +1383,8 @@ The function @code{mblen} is declared in @file{stdlib.h}.
For convenience reasons the @w{ISO C90} standard defines also functions
to convert entire strings instead of single characters. These functions
-suffer from the same problems as their reentrant counterparts from the
-second amendment to @w{ISO C90}; see @ref{Converting Strings}.
+suffer from the same problems as their reentrant counterparts from
+@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
@comment stdlib.h
@comment ISO
@@ -1513,7 +1527,7 @@ common that they operate on character sets which are not directly
specified by the functions. The multibyte encoding used is specified by
the currently selected locale for the @code{LC_CTYPE} category. The
wide character set is fixed by the implementation (in the case of GNU C
-library it always is UCS4 encoded @w{ISO 10646}.
+library it always is UCS-4 encoded @w{ISO 10646}).
This has of course several problems when it comes to general character
conversion:
@@ -1806,12 +1820,12 @@ file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
int result = 0;
iconv_t cd;
- cd = iconv_open ("UCS4", charset);
+ cd = iconv_open ("UCS-4", charset);
if (cd == (iconv_t) -1)
@{
/* @r{Something went wrong.} */
if (errno == EINVAL)
- error (0, 0, "conversion from `%s' to `UCS4' no available",
+ error (0, 0, "conversion from '%s' to 'UCS-4' not available",
charset);
else
perror ("iconv_open");
@@ -2024,7 +2038,7 @@ will succeed but how to find @math{@cal{B}}?
Unfortunately, the answer is: there is no general solution. On some
systems guessing might help. On those systems most character sets can
-convert to and from UTF8 encoded @w{ISO 10646} or Unicode text.
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.
Beside this only some very system-specific methods can help. Since the
conversion functions come from loadable modules and these modules must
be stored somewhere in the filesystem, one @emph{could} try to find them
@@ -2082,7 +2096,7 @@ wanted conversion.
@cindex triangulation
This is achieved by providing for each character set a conversion from
-and to UCS4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
+and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
intermediate representation it is possible to @dfn{triangulate}, i.e.,
converting with an intermediate representation.
@@ -2210,15 +2224,15 @@ ending with @code{//}. There often is a character set named
@code{INTERNAL} mentioned. From the discussion above and the chosen
name it should have become clear that this is the name for the
representation used in the intermediate step of the triangulation. We
-have said that this is UCS4 but actually it is not quite right. The
-UCS4 specification also includes the specification of the byte ordering
-used. Since a UCS4 value consists of four bytes a stored value is
+have said that this is UCS-4 but actually it is not quite right. The
+UCS-4 specification also includes the specification of the byte ordering
+used. Since a UCS-4 value consists of four bytes a stored value is
effected by byte ordering. The internal representation is @emph{not}
-the same as UCS4 in case the byte ordering of the processor (or at least
-the running process) is not the same as the one required for UCS4. This
+the same as UCS-4 in case the byte ordering of the processor (or at least
+the running process) is not the same as the one required for UCS-4. This
is done for performance reasons as one does not want to perform
unnecessary byte-swapping operations if one is not interested in actually
-seeing the result in UCS4. To avoid trouble with endianess the internal
+seeing the result in UCS-4. To avoid trouble with endianness the internal
representation consistently is named @code{INTERNAL} even on big-endian
systems where the representations are identical.
@@ -2570,7 +2584,7 @@ One interesting thing is the initialization of the @code{__min_} and
character can consist of one to four bytes. Therefore the
@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
this way. The output is always the @code{INTERNAL} character set (aka
-UCS4) and therefore each character consists of exactly four bytes. For
+UCS-4) and therefore each character consists of exactly four bytes. For
the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the @code{__max_needed_to} element for this direction