Diffstat (limited to 'manual')
-rw-r--r-- | manual/charset.texi | 5786 |
1 files changed, 2895 insertions, 2891 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
index b7b2f734a8..bae2910236 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -1,2892 +1,2896 @@
-@node Character Set Handling, Locales, String and Array Utilities, Top
-@c %MENU% Support for extended character sets
-@chapter Character Set Handling
-
-@ifnottex
-@macro cal{text}
-\text\
-@end macro
-@end ifnottex
-
-Character sets used in the early days of computing had only six, seven,
-or eight bits for each character: there was never a case where more than
-eight bits (one byte) were used to represent a single character. The
-limitations of this approach became more apparent as more people
-grappled with non-Roman character sets, where not all the characters
-that make up a language's character set can be represented by @math{2^8}
-choices. This chapter shows the functionality that was added to the C
-library to support multiple character sets.
-
-@menu
-* Extended Char Intro:: Introduction to Extended Characters.
-* Charset Function Overview:: Overview about Character Handling
- Functions.
-* Restartable multibyte conversion:: Restartable multibyte conversion
- Functions.
-* Non-reentrant Conversion:: Non-reentrant Conversion Function.
-* Generic Charset Conversion:: Generic Charset Conversion.
-@end menu
-
-
-@node Extended Char Intro
-@section Introduction to Extended Characters
-
-A variety of solutions is available to overcome the differences between
-character sets with a 1:1 relation between bytes and characters and
-character sets with ratios of 2:1 or 4:1. The remainder of this
-section gives a few examples to help understand the design decisions
-made while developing the functionality of the @w{C library}.
-
-@cindex internal representation
-A distinction we have to make right away is between internal and
-external representation. @dfn{Internal representation} means the
-representation used by a program while keeping the text in memory.
-External representations are used when text is stored or transmitted
-through some communication channel. Examples of external
-representations include files waiting in a directory to be
-read and parsed.
-
-Traditionally there has been no difference between the two representations.
-It was equally comfortable and useful to use the same single-byte
-representation internally and externally. This comfort level decreases
-with more and larger character sets.
-
-One of the problems to overcome with the internal representation is
-handling text that is externally encoded using different character
-sets. Assume a program that reads two texts and compares them using
-some metric. The comparison can be usefully done only if the texts are
-internally kept in a common format.
-
-@cindex wide character
-For such a common format (@math{=} character set) eight bits are certainly
-no longer enough. So the smallest entity will have to grow: @dfn{wide
-characters} will now be used. Instead of one byte per character, two or
-four will be used. (Three bytes are awkward to address in memory, and
-more than four bytes do not seem to be necessary.)
-
-@cindex Unicode
-@cindex ISO 10646
-As shown in some other part of this manual,
-@c !!! Ahem, wide char string functions are not yet covered -- drepper
-a completely new family of functions has been created that can handle wide
-character texts in memory. The most commonly used character sets for such
-internal wide character representations are Unicode and @w{ISO 10646}
-(also known as UCS for Universal Character Set). Unicode was originally
-planned as a 16-bit character set; whereas, @w{ISO 10646} was designed to
-be a 31-bit large code space. The two standards are practically identical.
-They have the same character repertoire and code table, but Unicode specifies
-added semantics. At the moment, only characters in the first @code{0x10000}
-code positions (the so-called Basic Multilingual Plane, BMP) have been
-assigned, but the assignment of more specialized characters outside this
-16-bit space is already in progress. A number of encodings have been
-defined for Unicode and @w{ISO 10646} characters:
-@cindex UCS-2
-@cindex UCS-4
-@cindex UTF-8
-@cindex UTF-16
-UCS-2 is a 16-bit word that can only represent characters
-from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
-and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
-ASCII characters are represented by ASCII bytes and non-ASCII characters
-by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
-of UCS-2 in which pairs of certain UCS-2 words can be used to encode
-non-BMP characters up to @code{0x10ffff}.
-
-To represent wide characters the @code{char} type is not suitable. For
-this reason the @w{ISO C} standard introduces a new type that is
-designed to keep one character of a wide character string. To maintain
-the similarity there is also a type corresponding to @code{int} for
-those functions that take a single wide character.
-
-@comment stddef.h
-@comment ISO
-@deftp {Data type} wchar_t
-This data type is used as the base type for wide character strings.
-I.e., arrays of objects of this type are the equivalent of @code{char[]}
-for multibyte character strings. The type is defined in @file{stddef.h}.
-
-The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
-say anything specific about the representation. It only requires that
-this type is capable of storing all elements of the basic character set.
-Therefore it would be legitimate to define @code{wchar_t} as @code{char},
-which might make sense for embedded systems.
-
-But for GNU systems @code{wchar_t} is always 32 bits wide and, therefore,
-capable of representing all UCS-4 values and, therefore, covering all of
-@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
-and thereby follow Unicode very strictly. This definition is perfectly
-fine with the standard, but it also means that to represent all
-characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
-characters, which is in fact a multi-wide-character encoding. But
-resorting to multi-wide-character encoding contradicts the purpose of the
-@code{wchar_t} type.
-@end deftp
-
-@comment wchar.h
-@comment ISO
-@deftp {Data type} wint_t
-@code{wint_t} is a data type used for parameters and variables that
-contain a single wide character. As the name suggests this type is the
-equivalent of @code{int} when using the normal @code{char} strings. The
-types @code{wchar_t} and @code{wint_t} often have the same
-representation if their size is 32 bits wide but if @code{wchar_t} is
-defined as @code{char} the type @code{wint_t} must be defined as
-@code{int} due to the parameter promotion.
-
-@pindex wchar.h
-This type is defined in @file{wchar.h} and was introduced in
-@w{Amendment 1} to @w{ISO C90}.
-@end deftp
-
-As with the @code{char} data type, macros are available for
-specifying the minimum and maximum value representable in an object of
-type @code{wchar_t}.
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WCHAR_MIN
-The macro @code{WCHAR_MIN} evaluates to the minimum value representable
-by an object of type @code{wchar_t}.
-
-This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
-@end deftypevr
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WCHAR_MAX
-The macro @code{WCHAR_MAX} evaluates to the maximum value representable
-by an object of type @code{wchar_t}.
-
-This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
-@end deftypevr
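
The two limits can be inspected directly in a program. The following is a minimal sketch rather than part of the manual; the helper names `wchar_bits` and `wchar_holds_ucs4` are made up for illustration.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: storage width of wchar_t in bits.  */
unsigned int
wchar_bits (void)
{
  return (unsigned int) (sizeof (wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if every code point up to 0x10ffff
   fits into a single wchar_t object.  */
int
wchar_holds_ucs4 (void)
{
  return (unsigned long int) WCHAR_MAX >= 0x10fffful;
}
```

On a system with a 16-bit `wchar_t` the second helper would be expected to return zero, which is the situation where surrogate pairs become necessary.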
-
-Another special wide character value is the equivalent to @code{EOF}.
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WEOF
-The macro @code{WEOF} evaluates to a constant expression of type
-@code{wint_t} whose value is different from any member of the extended
-character set.
-
-@code{WEOF} need not be the same value as @code{EOF} and unlike
-@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
-
-@smallexample
-@{
- int c;
- ...
- while ((c = getc (fp)) >= 0)
- ...
-@}
-@end smallexample
-
-@noindent
-has to be rewritten to use @code{WEOF} explicitly when wide characters
-are used:
-
-@smallexample
-@{
- wint_t c;
- ...
- while ((c = getwc (fp)) != WEOF)
- ...
-@}
-@end smallexample
-
-@pindex wchar.h
-This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
-defined in @file{wchar.h}.
-@end deftypevr
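
A complete version of such a loop might look as follows. This is only a sketch: the function name `count_wide_chars` is hypothetical, and the stream is assumed to be wide-oriented (or not yet oriented) when it is passed in.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters remaining on FP.
   The result of getwc is kept in a wint_t so that WEOF can be
   told apart from every valid wide character.  */
size_t
count_wide_chars (FILE *fp)
{
  size_t count = 0;
  wint_t c;

  while ((c = getwc (fp)) != WEOF)
    ++count;
  return count;
}
```

Comparing against `WEOF` explicitly keeps the loop correct whether or not `WEOF` happens to be negative on a given system.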
-
-
-These internal representations present problems when it comes to storage
-and transmittal. Because each single wide character consists of more
-than one byte, they are affected by byte-ordering. Thus, machines with
-different endianness would see different values when accessing the same
-data. This byte ordering concern also applies to communication protocols
-that are all byte-based and, therefore, require the sender to
-decide how to split the wide character into bytes. A last (but not least
-important) point is that wide characters often require more storage space
-than a customized byte-oriented character set.
-
-@cindex multibyte character
-@cindex EBCDIC
- For all the above reasons, an external encoding that is different
-from the internal encoding is often used if the latter is UCS-2 or UCS-4.
-The external encoding is byte-based and can be chosen appropriately for
-the environment and for the texts to be handled. A variety of different
-character sets can be used for this external encoding (information that
-will not be exhaustively presented here--instead, a description of the
-major groups will suffice). All of the ASCII-based character sets
-[_bkoz_: do you mean Roman character sets? If not, what do you mean
-here?] fulfill one requirement: they are "filesystem safe." This means
-that the character @code{'/'} is used in the encoding @emph{only} to
-represent itself. Things are a bit different for character sets like
-EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
-family used by IBM), but if the operating system does not understand
-EBCDIC directly the parameters to system calls have to be converted first
-anyhow.
-
-@itemize @bullet
-@item
-The simplest character sets are single-byte character sets. There can
-be only up to 256 characters (for @w{8 bit} character sets), which is
-not sufficient to cover all languages but might be sufficient to handle
-a specific text. Handling of @w{8 bit} character sets is simple. This
-is not true for other kinds presented later, and therefore, the
-application one uses might require the use of @w{8 bit} character sets.
-
-@cindex ISO 2022
-@item
-The @w{ISO 2022} standard defines a mechanism for extended character
-sets where one character @emph{can} be represented by more than one
-byte. This is achieved by associating a state with the text.
-Characters that can be used to change the state can be embedded in the
-text. Each byte in the text might have a different interpretation in each
-state. The state might even influence whether a given byte stands for a
-character on its own or whether it has to be combined with some more
-bytes.
-
-@cindex EUC
-@cindex Shift_JIS
-@cindex SJIS
-In most uses of @w{ISO 2022} the defined character sets do not allow
-state changes which cover more than the next character. This has the
-big advantage that whenever one can identify the beginning of the byte
-sequence of a character one can interpret a text correctly. Examples of
-character sets using this policy are the various EUC character sets
-(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or Shift_JIS (SJIS, a Japanese encoding).
-
-But there are also character sets using a state which is valid for more
-than one character and has to be changed by another byte sequence.
-Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
-
-@item
-@cindex ISO 6937
-Early attempts to fix 8 bit character sets for other languages using the
-Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
-representing characters like the acute accent do not produce output
-themselves: one has to combine them with other characters to get the
-desired result. For example, the byte sequence @code{0xc2 0x61}
-(non-spacing acute accent, followed by lower-case `a') produces the ``small
-a with acute'' character. To get the acute accent character on its own,
-one has to write @code{0xc2 0x20} (the non-spacing acute followed by a
-space).
-
-Character sets like @w{ISO 6937} are used in some embedded systems such
-as teletex.
-
-@item
-@cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally,
-it is often also sufficient to simply use an encoding different from
-UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
-encoding: UTF-8. This encoding is able to represent all of @w{ISO
-10646} 31 bits in a byte string of length one to six.
-
-@cindex UTF-7
-There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
-but UTF-8 is today the only encoding which should be used. In fact, with
-any luck UTF-8 will soon be the only external encoding that has to be
-supported. It proves to be universally usable and its only disadvantage
-is that it favors Roman languages by making the byte string
-representation of other scripts (Cyrillic, Greek, Asian scripts) longer
-than necessary if using a specific character set for these scripts.
-Methods like the Unicode compression scheme can alleviate these
-problems.
-@end itemize
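
To make the UTF-8 byte layout mentioned in the list above concrete, here is a minimal encoder sketch. It is not taken from the C library; the name `utf8_encode` is hypothetical, and the sketch assumes the range up to `0x10ffff` that UTF-16 can also reach, so it never emits the five- and six-byte forms of the original 31-bit definition.

```c
#include <stddef.h>

/* Hypothetical helper: encode the code point CP as UTF-8 into OUT.
   Returns the number of bytes written, or 0 if CP is a surrogate
   or lies above 0x10ffff.  */
size_t
utf8_encode (unsigned long int cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;               /* Surrogates are not characters.  */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}
```

Every byte of a multi-byte sequence has its high bit set, which is why the `'/'` byte can never appear inside the encoding of another character.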
-
-The question remaining is: how to select the character set or encoding
-to use. The answer: you cannot decide about it yourself; it is decided
-by the developers of the system or the majority of the users. Since the
-goal is interoperability one has to use whatever the other people one
-works with use. If there are no constraints, the selection is based on
-the requirements the expected circle of users will have. In other words,
-if a project is expected to be used in only, say, Russia it is fine to use
-KOI8-R or a similar character set. But if at the same time people from,
-say, Greece are participating one should use a character set which allows
-all people to collaborate.
-
-The most widely useful solution seems to be: go with the most general
-character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
-and problems about users not being able to use their own language
-adequately are a thing of the past.
-
-One final comment about the choice of the wide character representation
-is necessary at this point. We have said above that the natural choice
-is using Unicode or @w{ISO 10646}. This is not required, but at least
-encouraged, by the @w{ISO C} standard. The standard defines at least a
-macro @code{__STDC_ISO_10646__} that is only defined on systems where
-the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
-symbol is not defined one should avoid making assumptions about the wide
-character representation. If the programmer uses only the functions
-provided by the C library to handle wide character strings there should
-be no compatibility problems with other systems.
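
A program that does want to rely on the wide character representation can test for the macro first. The following is a sketch only; `wchar_is_iso10646` is a hypothetical name.

```c
#include <wchar.h>

/* Hypothetical helper: nonzero if the implementation promises that
   wchar_t values are ISO 10646 code points.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  /* No promise is made; avoid assumptions about wchar_t values.  */
  return 0;
#endif
}
```

Code that needs specific code point values can take a fast path when the function returns nonzero and fall back to the library conversion functions otherwise.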
-
-@node Charset Function Overview
-@section Overview about Character Handling Functions
-
-A Unix @w{C library} contains three different sets of functions in two
-families to handle character set conversion. One of the function families
-(the most commonly used) is specified in the @w{ISO C90} standard and,
-therefore, is portable even beyond the Unix world. Unfortunately this
-family is the least useful one. These functions should be avoided
-whenever possible, especially when developing libraries (as opposed to
-applications).
-
-The second family of functions was introduced in the early Unix standards
-(XPG2) and is still part of the latest and greatest Unix standard:
-@w{Unix 98}. It is also the most powerful and useful set of functions.
-But we will start with the functions defined in @w{Amendment 1} to
-@w{ISO C90}.
-
-@node Restartable multibyte conversion
-@section Restartable Multibyte Conversion Functions
-
-The @w{ISO C} standard defines functions to convert strings from a
-multibyte representation to wide character strings. There are a number
-of peculiarities:
-
-@itemize @bullet
-@item
-The character set assumed for the multibyte encoding is not specified
-as an argument to the functions. Instead the character set specified by
-the @code{LC_CTYPE} category of the current locale is used; see
-@ref{Locale Categories}.
-
-@item
-The functions handling more than one character at a time require NUL
-terminated strings as the argument. I.e., converting blocks of text
-does not work unless one can add a NUL byte at an appropriate place.
-The GNU C library contains some extensions to the standard that allow
-specifying a size, but basically they also expect terminated strings.
-@end itemize
-
-Despite these limitations the @w{ISO C} functions can be used in many
-contexts. In graphical user interfaces, for instance, it is not
-uncommon to have functions that require text to be displayed in a wide
-character string if the text is not simple ASCII. The text itself might come
-from a file with translations and the user should decide about the
-current locale which determines the translation and therefore also the
-external encoding used. In such a situation (and many others) the
-functions described here are perfect. If more freedom while performing
-the conversion is necessary take a look at the @code{iconv} functions
-(@pxref{Generic Charset Conversion}).
-
-@menu
-* Selecting the Conversion:: Selecting the conversion and its properties.
-* Keeping the state:: Representing the state of the conversion.
-* Converting a Character:: Converting Single Characters.
-* Converting Strings:: Converting Multibyte and Wide Character
- Strings.
-* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
-@end menu
-
-@node Selecting the Conversion
-@subsection Selecting the conversion and its properties
-
-We already said above that the currently selected locale for the
-@code{LC_CTYPE} category decides about the conversion which is performed
-by the functions we are about to describe. Each locale uses its own
-character set (given as an argument to @code{localedef}) and this is the
-one assumed as the external multibyte encoding. The wide character
-set is always UCS-4, at least on GNU systems.
-
-A characteristic of each multibyte character set is the maximum number
-of bytes that can be necessary to represent one character. This
-information is quite important when writing code that uses the
-conversion functions (as shown in the examples below).
-The @w{ISO C} standard defines two macros which provide this information.
-
-
-@comment limits.h
-@comment ISO
-@deftypevr Macro int MB_LEN_MAX
-@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
-sequence for a single character in any of the supported locales. It is
-a compile-time constant and is defined in @file{limits.h}.
-@pindex limits.h
-@end deftypevr
-
-@comment stdlib.h
-@comment ISO
-@deftypevr Macro int MB_CUR_MAX
-@code{MB_CUR_MAX} expands into a positive integer expression that is the
-maximum number of bytes in a multibyte character in the current locale.
-The value is never greater than @code{MB_LEN_MAX}. Unlike
-@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
-the GNU C library it is not.
-
-@pindex stdlib.h
-@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
-@end deftypevr
-
-Two different macros are necessary since strictly @w{ISO C90} compilers
-do not allow variable length array definitions, but still it is desirable
-to avoid dynamic allocation. This incomplete piece of code shows the
-problem:
-
-@smallexample
-@{
- char buf[MB_LEN_MAX];
- ssize_t len = 0;
-
- while (! feof (fp))
- @{
- fread (&buf[len], 1, MB_CUR_MAX - len, fp);
- /* @r{... process} buf */
- len -= used;
- @}
-@}
-@end smallexample
-
-The code in the inner loop is expected to have always enough bytes in
-the array @var{buf} to convert one multibyte character. The array
-@var{buf} has to be sized statically since many compilers do not allow a
-variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
-bytes are always available in @var{buf}. Note that it isn't
-a problem if @code{MB_CUR_MAX} is not a compile-time constant.
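
The same sizing rule applies in the other direction. The sketch below assumes the standard function `wcrtomb` from `wchar.h`; the helper name `wc_to_bytes` is hypothetical. The buffer is sized with the compile-time constant `MB_LEN_MAX`, which is sufficient because `MB_CUR_MAX` is never larger.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one wide character into BUF using the
   current locale, starting from the initial shift state.  Returns the
   number of bytes stored, or (size_t) -1 if WC cannot be represented.  */
size_t
wc_to_bytes (wchar_t wc, char buf[MB_LEN_MAX])
{
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  return wcrtomb (buf, wc, &state);
}
```

A caller can declare `char buf[MB_LEN_MAX];` as an ordinary array, with no dynamic allocation and no variable length array.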
-
-
-@node Keeping the state
-@subsection Representing the state of the conversion
-
-@cindex stateful
-In the introduction of this chapter it was said that certain character
-sets use a @dfn{stateful} encoding. That is, the encoded values depend
-in some way on the previous bytes in the text.
-
-Since the conversion functions allow converting a text in more than one
-step we must have a way to pass this information from one call of the
-functions to another.
-
-@comment wchar.h
-@comment ISO
-@deftp {Data type} mbstate_t
-@cindex shift state
-A variable of type @code{mbstate_t} can contain all the information
-about the @dfn{shift state} needed from one call to a conversion
-function to another.
-
-@pindex wchar.h
-@code{mbstate_t} is defined in @file{wchar.h}. It was introduced in
-@w{Amendment 1} to @w{ISO C90}.
-@end deftp
-
-To use objects of type @code{mbstate_t} the programmer has to define such
-objects (normally as local variables on the stack) and pass a pointer to
-the object to the conversion functions. This way the conversion function
-can update the object if the current multibyte character set is stateful.
-
-There is no specific function or initializer to put the state object in
-any specific state. The rules are that the object should always
-represent the initial state before the first use, and this is achieved by
-clearing the whole variable with code such as follows:
-
-@smallexample
-@{
- mbstate_t state;
- memset (&state, '\0', sizeof (state));
- /* @r{from now on @var{state} can be used.} */
- ...
-@}
-@end smallexample
-
-When using the conversion functions to generate output it is often
-necessary to test whether the current state corresponds to the initial
-state. This is necessary, for example, to decide whether to emit
-escape sequences to set the state to the initial state at certain
-sequence points. Communication protocols often require this.
-
-@comment wchar.h
-@comment ISO
-@deftypefun int mbsinit (const mbstate_t *@var{ps})
-The @code{mbsinit} function determines whether the state object pointed
-to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
-the object is in the initial state the return value is nonzero. Otherwise
-it is zero.
-
-@pindex wchar.h
-@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
-declared in @file{wchar.h}.
-@end deftypefun
-
-Code using @code{mbsinit} often looks similar to this:
-
-@c Fix the example to explicitly say how to generate the escape sequence
-@c to restore the initial state.
-@smallexample
-@{
- mbstate_t state;
- memset (&state, '\0', sizeof (state));
- /* @r{Use @var{state}.} */
- ...
- if (! mbsinit (&state))
- @{
- /* @r{Emit code to return to initial state.} */
- const wchar_t empty[] = L"";
- const wchar_t *srcp = empty;
- wcsrtombs (outbuf, &srcp, outbuflen, &state);
- @}
- ...
-@}
-@end smallexample
-
-The code to emit the escape sequence to get back to the initial state is
-interesting. The @code{wcsrtombs} function can be used to determine the
-necessary output code (@pxref{Converting Strings}). Please note that on
-GNU systems it is not necessary to perform this extra action for the
-conversion from multibyte text to wide character text since the wide
-character encoding is not stateful. But there is nothing mentioned in
-any standard which prohibits making @code{wchar_t} using a stateful
-encoding.
-
-@node Converting a Character
-@subsection Converting Single Characters
-
-The most fundamental of the conversion functions are those dealing with
-single characters. Please note that this does not always mean single
-bytes. But since there is very often a subset of the multibyte
-character set which consists of single byte sequences there are
-functions to help with converting bytes. Frequently, ASCII is a subpart
-of the multibyte character set. In such a scenario, each ASCII character
-stands for itself, and all other characters have at least a first byte
-that is beyond the range @math{0} to @math{127}.
-
-@comment wchar.h
-@comment ISO
-@deftypefun wint_t btowc (int @var{c})
-The @code{btowc} function (``byte to wide character'') converts a valid
-single byte character @var{c} in the initial shift state into the wide
-character equivalent using the conversion rules from the currently
-selected locale of the @code{LC_CTYPE} category.
-
-If @code{(unsigned char) @var{c}} is no valid single byte multibyte
-character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
-
-Please note the restriction of @var{c} being tested for validity only in
-the initial shift state. No @code{mbstate_t} object is used from
-which the state information is taken, and the function also does not use
-any static state.
-
-@pindex wchar.h
-The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
-and is declared in @file{wchar.h}.
-@end deftypefun
-
-Despite the limitation that the single byte value always is interpreted
-in the initial state this function is actually useful most of the time.
-Most character sets are either entirely single-byte character sets or they
-are extensions to ASCII. But then it is possible to write code like this
-(not that this specific example is very useful):
-
-@smallexample
-wchar_t *
-itow (unsigned long int val)
-@{
- static wchar_t buf[30];
- wchar_t *wcp = &buf[29];
- *wcp = L'\0';
- while (val != 0)
- @{
- *--wcp = btowc ('0' + val % 10);
- val /= 10;
- @}
- if (wcp == &buf[29])
- *--wcp = L'0';
- return wcp;
-@}
-@end smallexample
-
-Why is it necessary to use such a complicated implementation and not
-simply cast @code{'0' + val % 10} to a wide character? The answer is
-that there is no guarantee that one can perform this kind of arithmetic
-on the characters of the character set used for @code{wchar_t}
-representation. In other situations the bytes are not constant at
-compile time and so the compiler cannot do the work. In situations like
-this it is necessary to use @code{btowc}.
-
-@noindent
-There also is a function for the conversion in the other direction.
-
-@comment wchar.h
-@comment ISO
-@deftypefun int wctob (wint_t @var{c})
-The @code{wctob} function (``wide character to byte'') takes as the
-parameter a valid wide character. If the multibyte representation for
-this character in the initial state is exactly one byte long the return
-value of this function is this character. Otherwise the return value is
-@code{EOF}.
-
-@pindex wchar.h
-@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
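
The two functions are inverses of each other for single byte characters, which the following sketch illustrates. The helper name `byte_round_trips` is hypothetical, and the result depends on the locale selected for `LC_CTYPE`.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if the byte C is a complete character
   in the initial shift state and converts back to the same byte.  */
int
byte_round_trips (unsigned char c)
{
  wint_t wc = btowc (c);

  if (wc == WEOF)
    /* C is not a valid single byte character in this locale.  */
    return 0;
  return wctob (wc) == (int) c;
}
```

For the characters of the basic execution character set, such as letters and digits, the round trip is expected to succeed in every locale.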
-
-There are more general functions to convert a single character from
-multibyte representation to wide characters and vice versa. These
-functions pose no limit on the length of the multibyte representation
-and they also do not require it to be in the initial state.
-
-@comment wchar.h
-@comment ISO
-@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
-@cindex stateful
-The @code{mbrtowc} function (``multibyte restartable to wide
-character'') converts the next multibyte character in the string pointed
-to by @var{s} into a wide character and stores it in the wide character
-string pointed to by @var{pwc}. The conversion is performed according
-to the locale currently selected for the @code{LC_CTYPE} category. If
-the conversion for the character set used in the locale requires a state,
-the multibyte string is interpreted in the state represented by the
-object pointed to by @var{ps}. If @var{ps} is a null pointer, a static,
-internal state variable used only by the @code{mbrtowc} function is
-used.
-
-If the next multibyte character corresponds to the NUL wide character,
-the return value of the function is @math{0} and the state object is
-afterwards in the initial state. If the next @var{n} or fewer bytes
-form a correct multibyte character, the return value is the number of
-bytes starting from @var{s} that form the multibyte character. The
-conversion state is updated according to the bytes consumed in the
-conversion. In both cases the wide character (either the @code{L'\0'}
-or the one found in the conversion) is stored in the string pointed to
-by @var{pwc} if @var{pwc} is not null.
-
-If the first @var{n} bytes of the multibyte string possibly form a valid
-multibyte character but there are more than @var{n} bytes needed to
-complete it, the return value of the function is @code{(size_t) -2} and
-no value is stored. Please note that this can happen even if @var{n}
-has a value greater than or equal to @code{MB_CUR_MAX} since the input
-might contain redundant shift sequences.
-
-If the first @var{n} bytes of the multibyte string cannot possibly form
-a valid multibyte character, no value is stored, the global variable
-@code{errno} is set to the value @code{EILSEQ}, and the function returns
-@code{(size_t) -1}. The conversion state is afterwards undefined.
-
-@pindex wchar.h
-@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
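
The return values described above can be handled as in the following sketch, which feeds the input to `mbrtowc` in small pieces to exercise the restartable behavior. The function name `count_chars_chunked` and its `chunk` parameter are hypothetical and exist only for illustration.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the LEN bytes at S in pieces of at most
   CHUNK bytes and return the number of wide characters found before
   the end of the input or a NUL character.  Returns (size_t) -1 for
   an invalid sequence or a zero CHUNK.  */
size_t
count_chars_chunked (const char *s, size_t len, size_t chunk)
{
  mbstate_t state;
  size_t count = 0;

  if (chunk == 0)
    return (size_t) -1;
  memset (&state, '\0', sizeof (state));
  while (len > 0)
    {
      size_t avail = len < chunk ? len : chunk;
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, avail, &state);

      if (n == (size_t) -1)
        return (size_t) -1;     /* Invalid sequence, errno is EILSEQ.  */
      if (n == (size_t) -2)
        {
          /* All AVAIL bytes were absorbed into STATE; the character
             is still incomplete, so continue with the next piece.  */
          s += avail;
          len -= avail;
          continue;
        }
      if (n == 0)
        break;                  /* NUL character, stop counting.  */
      s += n;
      len -= n;
      ++count;
    }
  return count;
}
```

Because the partial character is remembered in `state`, the result should not depend on where the pieces are cut, as long as the same state object is passed to every call.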
-
-Use of @code{mbrtowc} is straightforward. A function which copies a
-multibyte string into a wide character string while at the same time
-converting all lowercase characters into uppercase could look like this
-(this is not the final version, just an example; it has no error
-checking, and sometimes leaks memory):
-
-@smallexample
-wchar_t *
-mbstouwcs (const char *s)
-@{
- /* Include the terminating NUL byte in the conversion. */
- size_t len = strlen (s) + 1;
- wchar_t *result = malloc (len * sizeof (wchar_t));
- wchar_t *wcp = result;
- wchar_t tmp[1];
- mbstate_t state;
- size_t nbytes;
-
- memset (&state, '\0', sizeof (state));
- while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
- @{
- if (nbytes >= (size_t) -2)
- /* Invalid input string. */
- return NULL;
- *wcp++ = towupper (tmp[0]);
- len -= nbytes;
- s += nbytes;
- @}
- /* The loop ends at the NUL character; terminate the result. */
- *wcp = L'\0';
- return result;
-@}
-@end smallexample
-
-The use of @code{mbrtowc} should be clear. A single wide character is
-stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
-in the variable @var{nbytes}. If the conversion is successful, the
-uppercase variant of the wide character is stored in the @var{result}
-array and the pointer to the input string and the number of available
-bytes are adjusted.
-
-The only non-obvious thing about @code{mbstouwcs} might be the way memory
-is allocated for the result. The above code uses the fact that there
-can never be more wide characters in the converted results than there are
-bytes in the multibyte input string. This method yields a pessimistic
-guess about the size of the result, and if many wide character strings
-have to be constructed this way or if the strings are long, the extra
-memory required to be allocated because the input string contains
-multibyte characters might be significant. The allocated memory block can
-be resized to the correct size before returning it, but a better solution
-might be to allocate just the right amount of space for the result right
-away. Unfortunately there is no function to compute the length of the wide
-character string directly from the multibyte string. There is, however, a
-function which does part of the work.
-
-@comment wchar.h
-@comment ISO
-@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
-The @code{mbrlen} function (``multibyte restartable length'') computes
-the number of at most @var{n} bytes starting at @var{s} which form the
-next valid and complete multibyte character.
-
-If the next multibyte character corresponds to the NUL wide character,
-the return value is @math{0}. If the next @var{n} bytes form a valid
-multibyte character, the number of bytes belonging to this multibyte
-character byte sequence is returned.
-
-If the first @var{n} bytes possibly form a valid multibyte
-character but the character is incomplete, the return value is
-@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
-and the return value is @code{(size_t) -1}.
-
-The multibyte sequence is interpreted in the state represented by the
-object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
-object local to @code{mbrlen} is used.
-
-@pindex wchar.h
-@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
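-
-As a small illustration of the return values just described, the
-following sketch wraps @code{mbrlen} in a helper. The name
-@code{first_char_length} and its @code{int} return convention are
-assumptions made for this example only; they are not part of any
-library.
-

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a library function.  Examine at most N
   bytes starting at S and return
     the byte length of the first multibyte character,
     0 if that character is the NUL character,
     -1 if the bytes form an invalid sequence,
     -2 if they form a valid but incomplete prefix.  */
int
first_char_length (const char *s, size_t n)
{
  mbstate_t state;
  size_t r;

  /* Start from the initial conversion state.  */
  memset (&state, '\0', sizeof (state));

  r = mbrlen (s, n, &state);
  if (r == (size_t) -1)
    return -1;
  if (r == (size_t) -2)
    return -2;
  return (int) r;
}
```

-
-In the default @code{"C"} locale, @code{first_char_length ("abc", 3)}
-returns 1, since every ASCII character occupies a single byte.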
-
-The attentive reader will now note that @code{mbrlen} can be implemented
-as
-
-@smallexample
-mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
-@end smallexample
-
-This is true and in fact is mentioned in the official specification.
-How can this function be used to determine the length of the wide
-character string created from a multibyte character string? It is not
-directly usable, but we can define a function @code{mbslen} using it:
-
-@smallexample
-size_t
-mbslen (const char *s)
-@{
- mbstate_t state;
- size_t result = 0;
- size_t nbytes;
- memset (&state, '\0', sizeof (state));
- while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
- @{
- if (nbytes >= (size_t) -2)
- /* @r{Something is wrong.} */
- return (size_t) -1;
- s += nbytes;
- ++result;
- @}
- return result;
-@}
-@end smallexample
-
-This function simply calls @code{mbrlen} for each multibyte character
-in the string and counts the number of function calls. Please note that
-we use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
-call. This is acceptable since a) this value is at least as large as the
-length of the longest multibyte character sequence and b) we know that
-the string @var{s} ends with a NUL byte, which cannot be part of any
-other multibyte character sequence but the one representing the NUL wide
-character. Therefore, the @code{mbrlen} function will never read invalid
-memory.
-
-Now that this function is available (just to make this clear, this
-function is @emph{not} part of the GNU C library) we can compute the
-number of bytes required to store the wide character string converted
-from the multibyte character string @var{s} using
-
-@smallexample
-wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
-@end smallexample
-
-Please note that the @code{mbslen} function is quite inefficient. The
-implementation of @code{mbstouwcs} with @code{mbslen} would have to
-perform the conversion of the multibyte character input string twice, and
-this conversion might be quite expensive. So it is necessary to think
-about the consequences of using the easier but imprecise method before
-doing the work twice.
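-
-Where the double conversion is acceptable, the counting loop need not be
-written by hand. As a hedged alternative to @code{mbslen},
-@code{mbsrtowcs} (described later in this chapter) can be called with a
-null destination; it then returns the number of wide characters the
-string would produce without storing anything. The helper name
-@code{mbs_wide_length} is hypothetical, and the sketch still scans the
-whole input once.
-

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of wide characters needed for the
   NUL-terminated multibyte string S, not counting the terminating
   NUL wide character.  Returns (size_t) -1 for invalid input.  */
size_t
mbs_wide_length (const char *s)
{
  mbstate_t state;
  const char *p = s;

  memset (&state, '\0', sizeof (state));
  /* With a null destination nothing is stored and the length
     argument is ignored.  */
  return mbsrtowcs (NULL, &p, 0, &state);
}
```
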
-
-@comment wchar.h
-@comment ISO
-@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
-The @code{wcrtomb} function (``wide character restartable to
-multibyte'') converts a single wide character into a multibyte string
-corresponding to that wide character.
-
-If @var{s} is a null pointer, the function resets the state stored in
-the object pointed to by @var{ps} (or the internal @code{mbstate_t}
-object) to the initial state. This can also be achieved by a call like
-this:
-
-@smallexample
-wcrtomb (temp_buf, L'\0', ps)
-@end smallexample
-
-@noindent
-since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
-writes into an internal buffer, which is guaranteed to be large enough.
-
-If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
-necessary, a shift sequence to get the state @var{ps} into the initial
-state followed by a single NUL byte, which is stored in the string
-@var{s}.
-
-Otherwise a byte sequence (possibly including shift sequences) is written
-into the string @var{s}. This only happens if @var{wc} is a valid wide
-character (i.e., it has a multibyte representation in the character set
-selected by the locale of the @code{LC_CTYPE} category). If @var{wc} is
-not a valid wide character, nothing is stored in the string @var{s},
-@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
-is undefined and the return value is @code{(size_t) -1}.
-
-If no error occurred the function returns the number of bytes stored in
-the string @var{s}. This includes all bytes representing shift
-sequences.
-
-One word about the interface of the function: there is no parameter
-specifying the length of the array @var{s}. Instead the function
-assumes that there are at least @code{MB_CUR_MAX} bytes available since
-this is the maximum length of any byte sequence representing a single
-character. So the caller has to make sure that there is enough space
-available, otherwise buffer overruns can occur.
-
-@pindex wchar.h
-@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
-declared in @file{wchar.h}.
-@end deftypefun
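-
-Because @code{wcrtomb} takes no length argument, the simplest safe
-pattern is to hand it a buffer of @code{MB_LEN_MAX} bytes, which is
-never smaller than @code{MB_CUR_MAX}. The following is a minimal sketch
-of that pattern; the helper name @code{encode_one} is an assumption of
-this example.
-

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: store the multibyte form of WC in BUF, which
   must provide room for MB_LEN_MAX bytes.  Returns the number of
   bytes stored, or (size_t) -1 if WC cannot be represented in the
   current locale.  */
size_t
encode_one (char buf[MB_LEN_MAX], wchar_t wc)
{
  mbstate_t state;

  /* Convert starting from the initial conversion state.  */
  memset (&state, '\0', sizeof (state));
  return wcrtomb (buf, wc, &state);
}
```
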
-
-Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following
-example appends a wide character string to a multibyte character string.
-Again, the code is not really useful (or correct), it is simply here to
-demonstrate the use and some problems.
-
-@smallexample
-char *
-mbscatwcs (char *s, size_t len, const wchar_t *ws)
-@{
- mbstate_t state;
- /* @r{Find the end of the existing string.} */
- char *wp = strchr (s, '\0');
- len -= wp - s;
- memset (&state, '\0', sizeof (state));
- do
- @{
- size_t nbytes;
- if (len < MB_CUR_MAX)
- @{
- /* @r{We cannot guarantee that the next}
- @r{character fits into the buffer, so}
- @r{return an error.} */
- errno = E2BIG;
- return NULL;
- @}
- nbytes = wcrtomb (wp, *ws, &state);
- if (nbytes == (size_t) -1)
- /* @r{Error in the conversion.} */
- return NULL;
- len -= nbytes;
- wp += nbytes;
- @}
- while (*ws++ != L'\0');
- return s;
-@}
-@end smallexample
-
-First the function has to find the end of the string currently in the
-array @var{s}. The @code{strchr} call does this very efficiently since a
-requirement for multibyte character representations is that the NUL byte
-is never used except to represent itself (and in this context, the end
-of the string).
-
-After initializing the state object the loop is entered where the first
-task is to make sure there is enough room in the array @var{s}. We
-abort if there are not at least @code{MB_CUR_MAX} bytes available. This
-is not always optimal but we have no other choice. We might have fewer
-than @code{MB_CUR_MAX} bytes available but the next multibyte character
-might also be only one byte long. At the time the @code{wcrtomb} call
-returns it is too late to decide whether the buffer was large enough. If
-this solution is unsuitable, there is a very slow but more accurate
-solution.
-
-@smallexample
- ...
- if (len < MB_CUR_MAX)
- @{
- char temp_buf[MB_LEN_MAX];
- mbstate_t temp_state;
- memcpy (&temp_state, &state, sizeof (state));
- if (wcrtomb (temp_buf, *ws, &temp_state) > len)
- @{
- /* @r{We cannot guarantee that the next}
- @r{character fits into the buffer, so}
- @r{return an error.} */
- errno = E2BIG;
- return NULL;
- @}
- @}
- ...
-@end smallexample
-
-Here we first perform the conversion that might overflow the buffer into
-a scratch buffer, so that afterwards we are in the position to make an
-exact decision about the buffer size. Please note that the trial call
-cannot pass a null pointer as the destination: with a null @var{s},
-@code{wcrtomb} ignores @var{wc} and behaves as if it were converting the
-NUL wide character. The most unusual thing about this piece of code
-certainly is the duplication of the conversion state object, but if a
-change of the state is necessary to emit the next multibyte character,
-we want to have the same shift state change performed in the real
-conversion. Therefore, we have to preserve the initial shift state
-information.
-
-There are certainly many more and even better solutions to this problem.
-This example is only provided for educational purposes.
-
-@node Converting Strings
-@subsection Converting Multibyte and Wide Character Strings
-
-The functions described in the previous section only convert a single
-character at a time. Most operations to be performed in real-world
-programs include strings and therefore the @w{ISO C} standard also
-defines conversions on entire strings. However, the defined set of
-functions is quite limited; therefore, the GNU C library contains a few
-extensions which can help in some important situations.
-
-@comment wchar.h
-@comment ISO
-@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
-The @code{mbsrtowcs} function (``multibyte string restartable to wide
-character string'') converts a NUL-terminated multibyte character
-string at @code{*@var{src}} into an equivalent wide character string,
-including the NUL wide character at the end. The conversion is started
-using the state information from the object pointed to by @var{ps} or
-from an internal object of @code{mbsrtowcs} if @var{ps} is a null
-pointer. Before returning, the state object is updated to match the state
-after the last converted character. The state is the initial state if the
-terminating NUL byte is reached and converted.
-
-If @var{dst} is not a null pointer, the result is stored in the array
-pointed to by @var{dst}; otherwise, the conversion result is not
-available since it is stored in an internal buffer.
-
-If @var{len} wide characters are stored in the array @var{dst} before
-reaching the end of the input string, the conversion stops and @var{len}
-is returned. If @var{dst} is a null pointer, @var{len} is never checked.
-
-Another reason for a premature return from the function call is if the
-input string contains an invalid multibyte sequence. In this case the
-global variable @code{errno} is set to @code{EILSEQ} and the function
-returns @code{(size_t) -1}.
-
-@c XXX The ISO C9x draft seems to have a problem here. It says that PS
-@c is not updated if DST is NULL. This is not said straightforward and
-@c none of the other functions is described like this. It would make sense
-@c to define the function this way but I don't think it is meant like this.
-
-In all other cases the function returns the number of wide characters
-converted during this call. If @var{dst} is not null, @code{mbsrtowcs}
-stores in the pointer pointed to by @var{src} either a null pointer (if
-the NUL byte in the input string was reached) or the address of the byte
-following the last converted multibyte character.
-
-@pindex wchar.h
-@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
-declared in @file{wchar.h}.
-@end deftypefun
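-
-The way @code{mbsrtowcs} updates @code{*@var{src}} makes it possible to
-convert a long string through a small fixed-size buffer. The sketch
-below, which assumes a hypothetical helper named
-@code{convert_in_chunks} and an arbitrary chunk size of four wide
-characters, simply counts the wide characters produced; a real caller
-would process each chunk where indicated.
-

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL-terminated multibyte string S
   piece by piece and return the total number of wide characters
   produced (not counting the terminator), or (size_t) -1 if S
   contains an invalid sequence.  */
size_t
convert_in_chunks (const char *s)
{
  enum { CHUNK = 4 };
  wchar_t piece[CHUNK];
  mbstate_t state;
  size_t total = 0;

  memset (&state, '\0', sizeof (state));

  /* mbsrtowcs sets S to a null pointer once the terminating NUL
     byte has been converted.  */
  while (s != NULL)
    {
      size_t n = mbsrtowcs (piece, &s, CHUNK, &state);
      if (n == (size_t) -1)
        return (size_t) -1;
      /* piece[0] .. piece[n - 1] are available for use here.  */
      total += n;
    }
  return total;
}
```
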
-
-The definition of the @code{mbsrtowcs} function has one important
-limitation. The requirement that the input at @code{*@var{src}} be a
-NUL-terminated string causes problems if one wants to convert buffers
-with text. A buffer is normally not a collection of NUL-terminated
-strings but instead a continuous collection of lines, separated by
-newline characters. Now
-assume that a function to convert one line from a buffer is needed. Since
-the line is not NUL-terminated the source pointer cannot directly point
-into the unmodified text buffer. This means, either one inserts the NUL
-byte at the appropriate place for the time of the @code{mbsrtowcs}
-function call (which is not doable for a read-only buffer or in a
-multi-threaded application) or one copies the line in an extra buffer
-where it can be terminated by a NUL byte. Note that it is not in general
-possible to limit the number of characters to convert by setting the
-parameter @var{len} to any specific value. Since it is not known how
-many bytes each multibyte character sequence is in length, one can only
-guess.
-
-@cindex stateful
-There is still a problem with the method of NUL-terminating a line right
-after the newline character which could lead to very strange results.
-As said in the description of the @code{mbsrtowcs} function above the
-conversion state is guaranteed to be in the initial shift state after
-processing the NUL byte at the end of the input string. But this NUL
-byte is not really part of the text. I.e., the conversion state after
-the newline in the original text could be something different than the
-initial shift state and therefore the first character of the next line
-is encoded using this state. But the state in question is never
-accessible to the user since the conversion stops after the NUL byte
-(which resets the state). Most stateful character sets in use today
-require that the shift state after a newline be the initial state--but
-this is not a strict guarantee. Therefore, simply NUL-terminating a
-piece of running text is not always an adequate solution and should
-never be used in general-purpose code.
-
-The generic conversion interface (@pxref{Generic Charset Conversion})
-does not have this limitation (it simply works on buffers, not
-strings), and the GNU C library contains a set of functions which take
-additional parameters specifying the maximal number of bytes which are
-consumed from the input string. This way the problem with
-@code{mbsrtowcs} described above can be solved by determining the line
-length and passing this length to the function.
-
-@comment wchar.h
-@comment ISO
-@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
-The @code{wcsrtombs} function (``wide character string restartable to
-multibyte string'') converts the NUL-terminated wide character string at
-@code{*@var{src}} into an equivalent multibyte character string and
-stores the result in the array pointed to by @var{dst}. The NUL wide
-character is also converted. The conversion starts in the state
-described in the object pointed to by @var{ps} or by a state object
-local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
-@var{dst} is a null pointer, the conversion is performed as usual but the
-result is not available. If all characters of the input string were
-successfully converted and if @var{dst} is not a null pointer, the
-pointer pointed to by @var{src} gets assigned a null pointer.
-
-If one of the wide characters in the input string has no valid multibyte
-character equivalent, the conversion stops early, sets the global
-variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
-
-Another reason for a premature stop is if @var{dst} is not a null
-pointer and the next converted character would require more than
-@var{len} bytes in total to be stored in the array @var{dst}. In this
-case the pointer pointed to by @var{src} is assigned a value pointing to
-the wide character right after the last one successfully converted.
-
-Except in the case of an encoding error the return value of the
-@code{wcsrtombs} function is the number of bytes in all the multibyte
-character sequences stored in @var{dst}. Before returning the state in
-the object pointed to by @var{ps} (or the internal object in case
-@var{ps} is a null pointer) is updated to reflect the state after the
-last conversion. The state is the initial shift state in case the
-terminating NUL wide character was converted.
-
-@pindex wchar.h
-The @code{wcsrtombs} function was introduced in @w{Amendment 1} to
-@w{ISO C90} and is declared in @file{wchar.h}.
-@end deftypefun
-
-The restriction mentioned above for the @code{mbsrtowcs} function applies
-here also. There is no possibility of directly controlling the number of
-input characters. One has to place the NUL wide character at the correct
-place or control the consumed input indirectly via the available output
-array size (the @var{len} parameter).
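-
-When a whole NUL-terminated wide character string is to be converted,
-one common pattern is to call @code{wcsrtombs} twice: first with a null
-@var{dst} to learn the required size, then again to do the conversion.
-The following is a sketch under that assumption; the name
-@code{wcs_to_mbs_alloc} is hypothetical, and plain @code{malloc} is used
-so that the example is self-contained.
-

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated multibyte copy of
   the wide character string WS, or a null pointer on error.  The
   caller frees the result.  */
char *
wcs_to_mbs_alloc (const wchar_t *ws)
{
  mbstate_t state;
  const wchar_t *p = ws;
  size_t nbytes;
  char *buf;

  /* First pass: measure.  Nothing is stored.  */
  memset (&state, '\0', sizeof (state));
  nbytes = wcsrtombs (NULL, &p, 0, &state);
  if (nbytes == (size_t) -1)
    return NULL;

  buf = malloc (nbytes + 1);
  if (buf == NULL)
    return NULL;

  /* Second pass: convert, restarting from the initial state.  */
  p = ws;
  memset (&state, '\0', sizeof (state));
  if (wcsrtombs (buf, &p, nbytes + 1, &state) == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  return buf;
}
```

-
-As with @code{mbslen}, the price of the exact allocation is that the
-input is converted twice.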
-
-@comment wchar.h
-@comment GNU
-@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
-The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
-function. All the parameters are the same except for @var{nmc} which is
-new. The return value is the same as for @code{mbsrtowcs}.
-
-This new parameter specifies how many bytes at most can be used from the
-multibyte character string. In other words, the multibyte character
-string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte is
-found within the @var{nmc} first bytes of the string, the conversion
-stops here.
-
-This function is a GNU extension. It is meant to work around the
-problems mentioned above. Now it is possible to convert a buffer with
-multibyte character text piece for piece without having to care about
-inserting NUL bytes and the effect of NUL bytes on the conversion state.
-@end deftypefun
-
-A function to convert a multibyte string into a wide character string
-and display it could be written like this (this is not a really useful
-example):
-
-@smallexample
-void
-showmbs (const char *src, FILE *fp)
-@{
- mbstate_t state;
- int cnt = 0;
- memset (&state, '\0', sizeof (state));
- while (1)
- @{
- wchar_t linebuf[100];
- const char *endp = strchr (src, '\n');
- size_t n;
-
- /* @r{Exit if there is no more line.} */
- if (endp == NULL)
- break;
-
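- n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
- if (n == (size_t) -1)
- break;
- linebuf[n] = L'\0';
- fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf);
- /* @r{Skip the rest of the line and the newline.} */
- src = endp + 1;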
- @}
-@}
-@end smallexample
-
-There is no problem with the state after a call to @code{mbsnrtowcs}.
-Since we don't insert characters in the strings which were not in there
-right from the beginning and we use @var{state} only for the conversion
-of the given buffer, there is no problem with altering the state.
-
-@comment wchar.h
-@comment GNU
-@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
-The @code{wcsnrtombs} function implements the conversion from wide
-character strings to multibyte character strings. It is similar to
-@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra
-parameter, which specifies the length of the input string.
-
-No more than @var{nwc} wide characters from the input string
-@code{*@var{src}} are converted. If the input string contains a NUL
-wide character in the first @var{nwc} characters, the conversion stops at
-this place.
-
-The @code{wcsnrtombs} function is a GNU extension and just like
-@code{mbsnrtowcs} helps in situations where no NUL-terminated input
-strings are available.
-@end deftypefun
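-
-To illustrate the extra parameter, the following sketch converts a
-prefix of a wide character array that is not NUL-terminated. The helper
-name @code{wcs_prefix_to_mbs} and its buffer convention are assumptions
-of this example. @code{_GNU_SOURCE} is defined because the function is
-described above as a GNU extension.
-

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1
#endif
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most NWC wide characters from WS,
   which need not be NUL-terminated, into DST, which has room for
   DSTLEN bytes.  The result is always NUL-terminated.  Returns the
   number of bytes stored before the terminator, or (size_t) -1 on
   error.  */
size_t
wcs_prefix_to_mbs (char *dst, size_t dstlen,
                   const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  size_t n;

  if (dstlen == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Reserve one byte of DST for the terminator added below.  */
  n = wcsnrtombs (dst, &ws, nwc, dstlen - 1, &state);
  if (n == (size_t) -1)
    return n;
  dst[n] = '\0';
  return n;
}
```
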
-
-
-@node Multibyte Conversion Example
-@subsection A Complete Multibyte Conversion Example
-
-The example programs given in the last sections are only brief and do
-not contain all the error checking etc. Presented here is a complete
-and documented example. It features the @code{mbrtowc} function but it
-should be easy to derive versions using the other functions.
-
-@smallexample
-int
-file_mbsrtowcs (int input, int output)
-@{
- /* @r{Note the use of @code{MB_LEN_MAX}.}
- @r{@code{MB_CUR_MAX} cannot portably be used here.} */
- char buffer[BUFSIZ + MB_LEN_MAX];
- mbstate_t state;
- int filled = 0;
- int eof = 0;
-
- /* @r{Initialize the state.} */
- memset (&state, '\0', sizeof (state));
-
- while (!eof)
- @{
- ssize_t nread;
- ssize_t nwrite;
- char *inp = buffer;
- wchar_t outbuf[BUFSIZ];
- wchar_t *outp = outbuf;
-
- /* @r{Fill up the buffer from the input file.} */
- nread = read (input, buffer + filled, BUFSIZ);
- if (nread < 0)
- @{
- perror ("read");
- return 0;
- @}
- /* @r{If we reach end of file, make a note to read no more.} */
- if (nread == 0)
- eof = 1;
-
- /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
- filled += nread;
-
- /* @r{Convert those bytes to wide characters--as many as we can.} */
- while (1)
- @{
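- size_t thislen;
- /* @r{Stop when the buffer is used up.} */
- if (filled == 0)
- break;
- thislen = mbrtowc (outp, inp, filled, &state);
- /* @r{Stop converting at invalid character.} */
- if (thislen == (size_t) -1)
- break;
- /* @r{An incomplete character at the end of the buffer has}
- @r{been absorbed into @code{state}; nothing is carried over.} */
- if (thislen == (size_t) -2)
- @{
- filled = 0;
- break;
- @}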
- /* @r{We want to handle embedded NUL bytes}
- @r{but the return value is 0. Correct this.} */
- if (thislen == 0)
- thislen = 1;
- /* @r{Advance past this character.} */
- inp += thislen;
- filled -= thislen;
- ++outp;
- @}
-
- /* @r{Write the wide characters we just made.} */
- nwrite = write (output, outbuf,
- (outp - outbuf) * sizeof (wchar_t));
- if (nwrite < 0)
- @{
- perror ("write");
- return 0;
- @}
-
- /* @r{See if we have a @emph{real} invalid character.} */
- if ((eof && filled > 0) || filled >= MB_CUR_MAX)
- @{
- error (0, 0, "invalid multibyte character");
- return 0;
- @}
-
- /* @r{If any characters must be carried forward,}
- @r{put them at the beginning of @code{buffer}.} */
- if (filled > 0)
- memmove (buffer, inp, filled);
- @}
-
- return 1;
-@}
-@end smallexample
-
-
-@node Non-reentrant Conversion
-@section Non-reentrant Conversion Function
-
-The functions described in the previous section are defined in
-@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
-also contained functions for character set conversion. The reason that
-these original functions are not described first is that they are almost
-entirely useless.
-
-The problem is that all the conversion functions described in the
-original @w{ISO C90} use a local state. Using a local state implies that
-multiple conversions at the same time (not only when using threads)
-cannot be done, and that you cannot first convert single characters and
-then strings since you cannot tell the conversion functions which state
-to use.
-
-These original functions are therefore usable only in a very limited set
-of situations. One must complete converting the entire string before
-starting a new one, and each string/text must be converted with the same
-function (there is no problem with the library itself; it is guaranteed
-that no library function changes the state of any of these functions).
-@strong{For the above reasons it is strongly recommended that the functions
-described in the previous section be used in place of non-reentrant
-conversion functions.}
-
-@menu
-* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single
- Characters.
-* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings.
-* Shift State:: States in Non-reentrant Functions.
-@end menu
-
-@node Non-reentrant Character Conversion
-@subsection Non-reentrant Conversion of Single Characters
-
-@comment stdlib.h
-@comment ISO
-@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})
-The @code{mbtowc} (``multibyte to wide character'') function when called
-with non-null @var{string} converts the first multibyte character
-beginning at @var{string} to its corresponding wide character code. It
-stores the result in @code{*@var{result}}.
-
-@code{mbtowc} never examines more than @var{size} bytes. (The idea is
-to supply for @var{size} the number of bytes of data you have in hand.)
-
-@code{mbtowc} with non-null @var{string} distinguishes three
-possibilities: the first @var{size} bytes at @var{string} start with
-valid multibyte characters, they start with an invalid byte sequence or
-just part of a character, or @var{string} points to an empty string (a
-null character).
-
-For a valid multibyte character, @code{mbtowc} converts it to a wide
-character and stores that in @code{*@var{result}}, and returns the
-number of bytes in that character (always at least @math{1} and never
-more than @var{size}).
-
-For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an
-empty string, it returns @math{0}, also storing @code{'\0'} in
-@code{*@var{result}}.
-
-If the multibyte character code uses shift characters, then
-@code{mbtowc} maintains and updates a shift state as it scans. If you
-call @code{mbtowc} with a null pointer for @var{string}, that
-initializes the shift state to its standard initial value. It also
-returns nonzero if the multibyte character code in use actually has a
-shift state. @xref{Shift State}.
-@end deftypefun
-
-@comment stdlib.h
-@comment ISO
-@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
-The @code{wctomb} (``wide character to multibyte'') function converts
-the wide character code @var{wchar} to its corresponding multibyte
-character sequence, and stores the result in bytes starting at
-@var{string}. At most @code{MB_CUR_MAX} bytes are stored.
-
-@code{wctomb} with non-null @var{string} distinguishes three
-possibilities for @var{wchar}: a valid wide character code (one that can
-be translated to a multibyte character), an invalid code, and @code{L'\0'}.
-
-Given a valid code, @code{wctomb} converts it to a multibyte character,
-storing the bytes starting at @var{string}. Then it returns the number
-of bytes in that character (always at least @math{1} and never more
-than @code{MB_CUR_MAX}).
-
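-If @var{wchar} is an invalid wide character code, @code{wctomb} returns
-@math{-1}. If @var{wchar} is @code{L'\0'}, it stores @code{'\0'} in
-@var{string}, preceded by any shift sequence needed to return to the
-initial shift state, and returns the number of bytes stored (at least
-@math{1}).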
-
-If the multibyte character code uses shift characters, then
-@code{wctomb} maintains and updates a shift state as it scans. If you
-call @code{wctomb} with a null pointer for @var{string}, that
-initializes the shift state to its standard initial value. It also
-returns nonzero if the multibyte character code in use actually has a
-shift state. @xref{Shift State}.
-
-Calling this function with a @var{wchar} argument of zero when
-@var{string} is not null has the side-effect of reinitializing the
-stored shift state @emph{as well as} storing the multibyte character
-@code{'\0'} and returning the number of bytes stored.
-@end deftypefun
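-
-As a brief illustration of how @code{wctomb} and @code{mbtowc} fit
-together, the following sketch converts a wide character to its
-multibyte form and back again. The helper name @code{round_trip_ok} is
-hypothetical, and the sketch is meant for characters other than
-@code{L'\0'}, for which @code{mbtowc} returns @math{0}.
-

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: return 1 if the non-NUL wide character WC
   survives a conversion to multibyte form and back, 0 otherwise.  */
int
round_trip_ok (wchar_t wc)
{
  char buf[MB_LEN_MAX];
  wchar_t back;
  int n;

  /* Reset the hidden shift state of both functions.  */
  wctomb (NULL, L'\0');
  mbtowc (NULL, NULL, 0);

  n = wctomb (buf, wc);
  if (n < 0)
    return 0;

  /* Only the N bytes just written are offered for decoding.  */
  if (mbtowc (&back, buf, (size_t) n) != n)
    return 0;
  return back == wc;
}
```
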
-
-Similar to @code{mbrlen} there is also a non-reentrant function which
-computes the length of a multibyte character. It can be defined in
-terms of @code{mbtowc}.
-
-@comment stdlib.h
-@comment ISO
-@deftypefun int mblen (const char *@var{string}, size_t @var{size})
-The @code{mblen} function with a non-null @var{string} argument returns
-the number of bytes that make up the multibyte character beginning at
-@var{string}, never examining more than @var{size} bytes. (The idea is
-to supply for @var{size} the number of bytes of data you have in hand.)
-
-The return value of @code{mblen} distinguishes three possibilities: the
-first @var{size} bytes at @var{string} start with valid multibyte
-characters, they start with an invalid byte sequence or just part of a
-character, or @var{string} points to an empty string (a null character).
-
-For a valid multibyte character, @code{mblen} returns the number of
-bytes in that character (always at least @code{1} and never more than
-@var{size}). For an invalid byte sequence, @code{mblen} returns
-@math{-1}. For an empty string, it returns @math{0}.
-
-If the multibyte character code uses shift characters, then @code{mblen}
-maintains and updates a shift state as it scans. If you call
-@code{mblen} with a null pointer for @var{string}, that initializes the
-shift state to its standard initial value. It also returns a nonzero
-value if the multibyte character code in use actually has a shift state.
-@xref{Shift State}.
-
-@pindex stdlib.h
-The function @code{mblen} is declared in @file{stdlib.h}.
-@end deftypefun
-
-
-@node Non-reentrant String Conversion
-@subsection Non-reentrant Conversion of Strings
-
-For convenience the @w{ISO C90} standard also defines functions to
-convert entire strings instead of single characters. These functions
-suffer from the same problems as their reentrant counterparts from
-@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
-
-@comment stdlib.h
-@comment ISO
-@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
-The @code{mbstowcs} (``multibyte string to wide character string'')
-function converts the null-terminated string of multibyte characters
-@var{string} to an array of wide character codes, storing not more than
-@var{size} wide characters into the array beginning at @var{wstring}.
-The terminating null character counts towards the size, so if @var{size}
-is less than the actual number of wide characters resulting from
-@var{string}, no terminating null character is stored.
-
-The conversion of characters from @var{string} begins in the initial
-shift state.
-
-If an invalid multibyte character sequence is found, the @code{mbstowcs}
-function returns a value of @math{-1}. Otherwise, it returns the number
-of wide characters stored in the array @var{wstring}. This number does
-not include the terminating null character, which is present if the
-number is less than @var{size}.
-
-Here is an example showing how to convert a string of multibyte
-characters, allocating enough space for the result.
-
-@smallexample
-wchar_t *
-mbstowcs_alloc (const char *string)
-@{
- size_t size = strlen (string) + 1;
- wchar_t *buf = xmalloc (size * sizeof (wchar_t));
-
- size = mbstowcs (buf, string, size);
- if (size == (size_t) -1)
- return NULL;
- buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
- return buf;
-@}
-@end smallexample
-
-@end deftypefun
-
-@comment stdlib.h
-@comment ISO
-@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
-The @code{wcstombs} (``wide character string to multibyte string'')
-function converts the null-terminated wide character array @var{wstring}
-into a string containing multibyte characters, storing not more than
-@var{size} bytes starting at @var{string}, followed by a terminating
-null character if there is room. The conversion of characters begins in
-the initial shift state.
-
-The terminating null character counts towards the size, so if @var{size}
-is less than or equal to the number of bytes needed for @var{wstring}, no
-terminating null character is stored.
-
-If a code that does not correspond to a valid multibyte character is
-found, the @code{wcstombs} function returns a value of @math{-1}.
-Otherwise, the return value is the number of bytes stored in the array
-@var{string}. This number does not include the terminating null character,
-which is present if the number is less than @var{size}.
-@end deftypefun
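-
-By analogy with @code{mbstowcs_alloc} above, a wide character string can
-be converted into a freshly allocated multibyte string by reserving
-@code{MB_CUR_MAX} bytes per wide character. This is a sketch only: the
-name @code{wcstombs_alloc} is hypothetical, plain @code{malloc} is used
-in place of @code{xmalloc}, and the buffer is deliberately oversized
-rather than trimmed afterwards.
-

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated multibyte copy of
   WSTRING, or a null pointer on error.  The caller frees it.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  /* Pessimistic bound: MB_CUR_MAX bytes for every wide character,
     including the terminating one.  */
  size_t size = (wcslen (wstring) + 1) * MB_CUR_MAX;
  char *buf = malloc (size);
  size_t n;

  if (buf == NULL)
    return NULL;

  n = wcstombs (buf, wstring, size);
  if (n == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  /* N is less than SIZE here, so the terminator was stored.  */
  return buf;
}
```
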
-
-@node Shift State
-@subsection States in Non-reentrant Functions
-
-In some multibyte character codes, the @emph{meaning} of any particular
-byte sequence is not fixed; it depends on what other sequences have come
-earlier in the same string. Typically there are just a few sequences that
-can change the meaning of other sequences; these few are called
-@dfn{shift sequences} and we say that they set the @dfn{shift state} for
-other sequences that follow.
-
-To illustrate shift state and shift sequences, suppose we decide that
-the sequence @code{0200} (just one byte) enters Japanese mode, in which
-pairs of bytes in the range from @code{0240} to @code{0377} are single
-characters, while @code{0201} enters Latin-1 mode, in which single bytes
-in the range from @code{0240} to @code{0377} are characters, and
-interpreted according to the ISO Latin-1 character set. This is a
-multibyte code which has two alternative shift states (``Japanese mode''
-and ``Latin-1 mode''), and two shift sequences that specify particular
-shift states.
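-
-The made-up encoding above can be turned into a small decoder to show
-that the same bytes count differently depending on the shift state.
-This is a sketch only. It assumes that the initial state is Latin-1
-mode and that every byte other than the two shift bytes and the paired
-bytes of Japanese mode is a single character; neither assumption is
-stated in the description above. The name @code{count_toy_chars} is
-hypothetical.
-

```c
#include <stddef.h>

enum shift_mode { MODE_LATIN1, MODE_JAPANESE };

/* Hypothetical helper: count the characters in the NUL-terminated
   byte string S under the toy encoding.  Returns (size_t) -1 if a
   two-byte character is cut short.  */
size_t
count_toy_chars (const unsigned char *s)
{
  enum shift_mode mode = MODE_LATIN1;   /* Assumed initial state.  */
  size_t count = 0;

  while (*s != '\0')
    {
      if (*s == 0200)
        {
          /* Shift sequence: enter Japanese mode.  */
          mode = MODE_JAPANESE;
          ++s;
        }
      else if (*s == 0201)
        {
          /* Shift sequence: enter Latin-1 mode.  */
          mode = MODE_LATIN1;
          ++s;
        }
      else if (mode == MODE_JAPANESE && *s >= 0240)
        {
          /* Two bytes in the range 0240..0377 form one character.  */
          if (s[1] < 0240)
            return (size_t) -1;
          s += 2;
          ++count;
        }
      else
        {
          /* Any other byte is a character of its own.  */
          ++s;
          ++count;
        }
    }
  return count;
}
```

-
-With this decoder the byte pair @code{0240 0241} counts as two
-characters in Latin-1 mode but as a single character after the shift
-byte @code{0200}.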
-
-When the multibyte character code in use has shift states, then
-@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update
-the current shift state as they scan the string. To make this work
-properly, you must follow these rules:
-
-@itemize @bullet
-@item
-Before starting to scan a string, call the function with a null pointer
-for the multibyte character address---for example, @code{mblen (NULL,
-0)}. This initializes the shift state to its standard initial value.
-
-@item
-Scan the string one character at a time, in order. Do not ``back up''
-and rescan characters already scanned, and do not intersperse the
-processing of different strings.
-@end itemize
-
-Here is an example of using @code{mblen} following these rules:
-
-@smallexample
-void
-scan_string (char *s)
-@{
-  int length = strlen (s);
-
-  /* @r{Initialize shift state.} */
-  mblen (NULL, 0);
-
-  while (1)
-    @{
-      int thischar = mblen (s, length);
-      /* @r{Deal with end of string and invalid characters.} */
-      if (thischar == 0)
-        break;
-      if (thischar == -1)
-        @{
-          error ("invalid multibyte character");
-          break;
-        @}
-      /* @r{Advance past this character.} */
-      s += thischar;
-      length -= thischar;
-    @}
-@}
-@end smallexample
-
-The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
-reentrant when using a multibyte code that uses a shift state. However,
-no other library functions call these functions, so you don't have to
-worry that the shift state will be changed mysteriously.
-
-
-@node Generic Charset Conversion
-@section Generic Charset Conversion
-
-The conversion functions mentioned so far in this chapter all had in
-common that they operate on character sets that are not directly
-specified by the functions. The multibyte encoding used is specified by
-the currently selected locale for the @code{LC_CTYPE} category. The
-wide character set is fixed by the implementation (in the case of the GNU C
-library it is always UCS-4 encoded @w{ISO 10646}).
-
-This of course causes several problems when it comes to general character
-conversion:
-
-@itemize @bullet
-@item
-For every conversion where neither the source nor the destination
-character set is the character set of the locale for the @code{LC_CTYPE}
-category, one has to change the @code{LC_CTYPE} locale using
-@code{setlocale}.
-
-Changing the @code{LC_CTYPE} locale introduces major problems for the rest
-of the program since several more functions (e.g., the character
-classification functions, @pxref{Classification of Characters}) use the
-@code{LC_CTYPE} category.
-
-@item
-Parallel conversions to and from different character sets are not
-possible since the @code{LC_CTYPE} selection is global and shared by all
-threads.
-
-@item
-If neither the source nor the destination character set is the character
-set used for @code{wchar_t} representation, there is at least a two-step
-process necessary to convert a text using the functions above. One would
-have to select the source character set as the multibyte encoding,
-convert the text into a @code{wchar_t} text, select the destination
-character set as the multibyte encoding, and convert the wide character
-text to the multibyte (@math{=} destination) character set.
-
-Even if this is possible (which is not guaranteed) it is very tedious
-work. It also suffers from the other two problems even more, because the
-locale has to be changed constantly.
-@end itemize
-
-The XPG2 standard defines a completely new set of functions which has
-none of these limitations. They are not at all coupled to the selected
-locales, and they have no constraints on the character sets selected for
-source and destination. Only the set of available conversions limits
-them. The standard does not specify that any conversion at all must be
-available. Such availability is a measure of the quality of the
-implementation.
-
-The following text first describes the interface to @code{iconv} and
-then the conversion function itself. Comparisons with other
-implementations will show what obstacles stand in the way of portable
-applications. Finally, the implementation is described in so far as might
-interest the advanced user who wants to extend conversion capabilities.
-
-@menu
-* Generic Conversion Interface:: Generic Character Set Conversion Interface.
-* iconv Examples:: A complete @code{iconv} example.
-* Other iconv Implementations:: Some Details about other @code{iconv}
- Implementations.
-* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C
- library.
-@end menu
-
-@node Generic Conversion Interface
-@subsection Generic Character Set Conversion Interface
-
-This set of functions follows the traditional cycle of using a resource:
-open--use--close. The interface consists of three functions, each of
-which implements one step.
-
-Before the interfaces are described it is necessary to introduce a
-data type. Just like other open--use--close interfaces the functions
-introduced here work using handles and the @file{iconv.h} header
-defines a special type for the handles used.
-
-@comment iconv.h
-@comment XPG2
-@deftp {Data Type} iconv_t
-This data type is an abstract type defined in @file{iconv.h}. The user
-must not assume anything about the definition of this type; it must be
-completely opaque.
-
-Objects of this type can be assigned handles for the conversions using
-the @code{iconv} functions. The objects themselves need not be freed, but
-the conversions for which the handles stand have to be.
-@end deftp
-
-@noindent
-The first step is the function to create a handle.
-
-@comment iconv.h
-@comment XPG2
-@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
-The @code{iconv_open} function has to be used before starting a
-conversion. The two parameters this function takes determine the
-source and destination character set for the conversion, and if the
-implementation has the possibility to perform such a conversion, the
-function returns a handle.
-
-If the wanted conversion is not available, the @code{iconv_open} function
-returns @code{(iconv_t) -1}. In this case the global variable
-@code{errno} can have the following values:
-
-@table @code
-@item EMFILE
-The process already has @code{OPEN_MAX} file descriptors open.
-@item ENFILE
-The system limit of open files is reached.
-@item ENOMEM
-Not enough memory to carry out the operation.
-@item EINVAL
-The conversion from @var{fromcode} to @var{tocode} is not supported.
-@end table
-
-It is not possible to use the same descriptor in different threads to
-perform independent conversions. The data structures associated
-with the descriptor include information about the conversion state.
-This must not be messed up by using it in different conversions.
-
-An @code{iconv} descriptor is like a file descriptor in that a new
-descriptor must be created for every use. The descriptor does not stand
-for all of the conversions from @var{fromcode} to @var{tocode}.
-
-The GNU C library implementation of @code{iconv_open} has one
-significant extension to other implementations. To ease the extension
-of the set of available conversions, the implementation allows storing
-the necessary files with data and code in an arbitrary number of
-directories. How this extension must be written will be explained below
-(@pxref{glibc iconv Implementation}). Here it is only important to say
-that all directories mentioned in the @code{GCONV_PATH} environment
-variable are considered only if they contain a file @file{gconv-modules}.
-These directories need not necessarily be created by the system
-administrator. In fact, this extension is introduced to help users
-writing and using their own, new conversions. Of course, this does not
-work for security reasons in SUID binaries; in this case only the system
-directory is considered and this normally is
-@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment variable
-is examined exactly once at the first call of the @code{iconv_open}
-function. Later modifications of the variable have no effect.
-
-@pindex iconv.h
-The @code{iconv_open} function was introduced early in the X/Open
-Portability Guide, @w{version 2}. It is supported by all commercial
-Unices as it is required for the Unix branding. However, the quality and
-completeness of the implementation varies widely. The @code{iconv_open}
-function is declared in @file{iconv.h}.
-@end deftypefun
-
-The @code{iconv} implementation can associate large data structures with
-the handle returned by @code{iconv_open}. Therefore, it is crucial to
-free all the resources once all conversions are carried out and the
-conversion is not needed anymore.
-
-@comment iconv.h
-@comment XPG2
-@deftypefun int iconv_close (iconv_t @var{cd})
-The @code{iconv_close} function frees all resources associated with the
-handle @var{cd}, which must have been returned by a successful call to
-the @code{iconv_open} function.
-
-If the function call was successful the return value is @math{0}.
-Otherwise it is @math{-1} and @code{errno} is set appropriately.
-Defined errors are:
-
-@table @code
-@item EBADF
-The conversion descriptor is invalid.
-@end table
-
-@pindex iconv.h
-The @code{iconv_close} function was introduced together with the rest
-of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.
-@end deftypefun
-
-The standard defines only one actual conversion function. This has,
-therefore, the most general interface: it allows conversion from one
-buffer to another. Conversion from a file to a buffer, vice versa, or
-even file to file can be implemented on top of it.
-
-@comment iconv.h
-@comment XPG2
-@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
-@cindex stateful
-The @code{iconv} function converts the text in the input buffer
-according to the rules associated with the descriptor @var{cd} and
-stores the result in the output buffer. It is possible to call the
-function for the same text several times in a row since for stateful
-character sets the necessary state information is kept in the data
-structures associated with the descriptor.
-
-The input buffer is specified by @code{*@var{inbuf}} and it contains
-@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for
-communicating the used input back to the caller (see below). It is
-important to note that the buffer pointer is of type @code{char *} and the
-length is measured in bytes even if the input text is encoded in wide
-characters.
-
-The output buffer is specified in a similar way. @code{*@var{outbuf}}
-points to the beginning of the buffer with at least
-@code{*@var{outbytesleft}} bytes room for the result. The buffer
-pointer again is of type @code{char *} and the length is measured in
-bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the
-conversion is performed but no output is available.
-
-If @var{inbuf} is a null pointer, the @code{iconv} function performs the
-necessary action to put the state of the conversion into the initial
-state. This is obviously a no-op for non-stateful encodings, but if the
-encoding has a state, such a function call might put some byte sequences
-in the output buffer, which perform the necessary state changes. The
-next call with @var{inbuf} not being a null pointer then simply goes on
-from the initial state. It is important that the programmer never makes
-any assumption as to whether the conversion has to deal with states. Even
-if the input and output character sets are not stateful, the
-implementation might still have to keep states. This is due to the
-implementation chosen for the GNU C library as it is described below.
-Therefore an @code{iconv} call to reset the state should always be
-performed if some protocol requires this for the output text.
-
-The conversion stops for one of three reasons. The first is that all
-characters from the input buffer are converted. This actually can mean
-two things: either all bytes from the input buffer are consumed or
-there are some bytes at the end of the buffer that possibly can form a
-complete character but the input is incomplete. The second reason for a
-stop is that the output buffer is full. And the third reason is that
-the input contains invalid characters.
-
-In all of these cases the buffer pointers after the last successful
-conversion, for input and output buffer, are stored in @var{inbuf} and
-@var{outbuf}, and the available room in each buffer is stored in
-@var{inbytesleft} and @var{outbytesleft}.
-
-Since the character sets selected in the @code{iconv_open} call can be
-almost arbitrary, there can be situations where the input buffer contains
-valid characters, which have no identical representation in the output
-character set. The behavior in this situation is undefined. The
-@emph{current} behavior of the GNU C library in this situation is to
-return with an error immediately. This certainly is not the most
-desirable solution; therefore, future versions will provide better ones,
-but they are not yet finished.
-
-If all input from the input buffer is successfully converted and stored
-in the output buffer, the function returns the number of non-reversible
-conversions performed. In all other cases the return value is
-@code{(size_t) -1} and @code{errno} is set appropriately. In such cases
-the value pointed to by @var{inbytesleft} is nonzero.
-
-@table @code
-@item EILSEQ
-The conversion stopped because of an invalid byte sequence in the input.
-After the call, @code{*@var{inbuf}} points at the first byte of the
-invalid byte sequence.
-
-@item E2BIG
-The conversion stopped because it ran out of space in the output buffer.
-
-@item EINVAL
-The conversion stopped because of an incomplete byte sequence at the end
-of the input buffer.
-
-@item EBADF
-The @var{cd} argument is invalid.
-@end table
-
-@pindex iconv.h
-The @code{iconv} function was introduced in the XPG2 standard and is
-declared in the @file{iconv.h} header.
-@end deftypefun
-
-The definition of the @code{iconv} function is quite good overall. It
-provides quite flexible functionality. The only problems lie in the
-boundary cases, which are incomplete byte sequences at the end of the
-input buffer and invalid input. A third problem, which is not really
-a design problem, is the way conversions are selected. The standard
-does not say anything about the legitimate names or about a minimal set of
-available conversions. We will see how this negatively impacts other
-implementations, as demonstrated below.
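-
-The open--use--close cycle described in this section can be sketched as
-follows. This is a minimal sketch, not a complete program: the helper
-name @code{convert_buffer} and its parameters are hypothetical, and it
-assumes that the whole input and the whole result fit in the buffers
-supplied by the caller. Only @code{iconv_open}, @code{iconv}, and
-@code{iconv_close} are real interfaces.
-
```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN from FROMCODE to TOCODE,
   storing at most OUTSIZE bytes in OUT.  Returns the number of bytes
   stored, or (size_t) -1 with errno set by the failing iconv call.  */
size_t
convert_buffer (const char *tocode, const char *fromcode,
                const char *in, size_t inlen, char *out, size_t outsize)
{
  assert (in != NULL && out != NULL);

  /* Open: create the conversion handle.  */
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* Use: iconv advances the pointers and decrements the byte counts.  */
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);

  /* Emit any byte sequence needed to return to the initial state.  */
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);

  /* Close: release the resources, keeping errno from a failed iconv.  */
  int saved_errno = errno;
  iconv_close (cd);
  errno = saved_errno;

  return rc == (size_t) -1 ? (size_t) -1 : outsize - outleft;
}
```
-
-A caller could, for instance, pass @code{"UCS-4LE"} and @code{"UTF-8"} as
-the first two arguments, provided the implementation knows these names;
-as noted above, the standard does not guarantee any particular name.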
-
-@node iconv Examples
-@subsection A complete @code{iconv} example
-
-The example below features a solution for a common problem. Given that
-one knows the internal encoding used by the system for @code{wchar_t}
-strings, one often is in the position to read text from a file and store
-it in wide character buffers. One can do this using @code{mbsrtowcs},
-but then we run into the problems discussed above.
-
-@smallexample
-int
-file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
-@{
-  char inbuf[BUFSIZ];
-  size_t insize = 0;
-  char *wrptr = (char *) outbuf;
-  int result = 0;
-  iconv_t cd;
-
-  cd = iconv_open ("WCHAR_T", charset);
-  if (cd == (iconv_t) -1)
-    @{
-      /* @r{Something went wrong.} */
-      if (errno == EINVAL)
-        error (0, 0, "conversion from '%s' to wchar_t not available",
-               charset);
-      else
-        perror ("iconv_open");
-
-      /* @r{Terminate the output string.} */
-      *outbuf = L'\0';
-
-      return -1;
-    @}
-
-  while (avail > 0)
-    @{
-      ssize_t nread;
-      size_t nconv;
-      char *inptr = inbuf;
-
-      /* @r{Read more input.} */
-      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
-      if (nread == -1)
-        @{
-          /* @r{The read itself failed; report this to the caller.} */
-          result = -1;
-          break;
-        @}
-      if (nread == 0)
-        @{
-          /* @r{When we come here the file is completely read.}
-             @r{This still could mean there are some unused}
-             @r{characters in the @code{inbuf}. Put them back.} */
-          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
-            result = -1;
-
-          /* @r{Now write out the byte sequence to get into the}
-             @r{initial state if this is necessary.} */
-          iconv (cd, NULL, NULL, &wrptr, &avail);
-
-          break;
-        @}
-      insize += nread;
-
-      /* @r{Do the conversion.} */
-      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
-      if (nconv == (size_t) -1)
-        @{
-          /* @r{Not everything went right. It might only be}
-             @r{an unfinished byte sequence at the end of the}
-             @r{buffer. Or it is a real problem.} */
-          if (errno == EINVAL)
-            /* @r{This is harmless. Simply move the unused}
-               @r{bytes to the beginning of the buffer so that}
-               @r{they can be used in the next round.} */
-            memmove (inbuf, inptr, insize);
-          else
-            @{
-              /* @r{It is a real problem. Maybe we ran out of}
-                 @r{space in the output buffer or we have invalid}
-                 @r{input. In any case back the file pointer to}
-                 @r{the position of the last processed byte.} */
-              lseek (fd, -(off_t) insize, SEEK_CUR);
-              result = -1;
-              break;
-            @}
-        @}
-    @}
-
-  /* @r{Terminate the output string.} */
-  if (avail >= sizeof (wchar_t))
-    *((wchar_t *) wrptr) = L'\0';
-
-  if (iconv_close (cd) != 0)
-    perror ("iconv_close");
-
-  /* @r{Report a failure recorded above; otherwise return the number}
-     @r{of wide characters stored.} */
-  return result == -1 ? -1 : (wchar_t *) wrptr - outbuf;
-@}
-@end smallexample
-
-@cindex stateful
-This example shows the most important aspects of using the @code{iconv}
-functions. It shows how successive calls to @code{iconv} can be used to
-convert large amounts of text. The user does not have to care about
-stateful encodings as the functions take care of everything.
-
-An interesting point is the case where @code{iconv} returns an error and
-@code{errno} is set to @code{EINVAL}. This is not really an error in the
-transformation. It can happen whenever the input character set contains
-byte sequences of more than one byte for some character and texts are not
-processed in one piece. In this case there is a chance that a multibyte
-sequence is cut. The caller can then simply read the remainder of the
-text and feed the offending bytes together with new characters from the
-input to @code{iconv} and continue the work. The internal state kept in
-the descriptor is @emph{not} unspecified after such an event as is the
-case with the conversion functions from the @w{ISO C} standard.
-
-The example also shows the problem of using wide character strings with
-@code{iconv}. As explained in the description of the @code{iconv}
-function above, the function always takes a pointer to a @code{char}
-array and the available space is measured in bytes. In the example, the
-output buffer is a wide character buffer; therefore, we use a local
-variable @var{wrptr} of type @code{char *}, which is used in the
-@code{iconv} calls.
-
-This looks rather innocent but can lead to problems on platforms that
-have tight restrictions on alignment. Therefore the caller of @code{iconv}
-has to make sure that the pointers passed are suitable for access of
-characters from the appropriate character set. Since, in the
-above case, the input parameter to the function is a @code{wchar_t}
-pointer, this is the case (unless the user violates alignment when
-computing the parameter). But in other situations, especially when
-writing generic functions where one does not know what type of character
-set one uses and, therefore, treats text as a sequence of bytes, it might
-become tricky.
-
-@node Other iconv Implementations
-@subsection Some Details about other @code{iconv} Implementations
-
-This is not really the place to discuss the @code{iconv} implementation
-of other systems but it is necessary to know a bit about them to write
-portable programs. The above-mentioned problems with the specification
-of the @code{iconv} functions can lead to portability issues.
-
-The first thing to notice is that, due to the large number of character
-sets in use, it is certainly not practical to encode the conversions
-directly in the C library. Therefore, the conversion information must
-come from files outside the C library. This is usually done in one or
-both of the following ways:
-
-@itemize @bullet
-@item
-The C library contains a set of generic conversion functions which can
-read the needed conversion tables and other information from data files.
-These files get loaded when necessary.
-
-This solution is problematic as it requires a great deal of effort to
-apply to all character sets (potentially an infinite set). The
-differences in the structure of the different character sets are so large
-that many different variants of the table-processing functions must be
-developed. In addition, the generic nature of these functions makes them
-slower than specifically implemented functions.
-
-@item
-The C library only contains a framework which can dynamically load
-object files and execute the conversion functions contained therein.
-
-This solution provides much more flexibility. The C library itself
-contains only very little code and therefore reduces the general memory
-footprint. Also, with a documented interface between the C library and
-the loadable modules it is possible for third parties to extend the set
-of available conversion modules. A drawback of this solution is that
-dynamic loading must be available.
-@end itemize
-
-Some implementations in commercial Unices implement a mixture of these
-possibilities; the majority implement only the second solution. Using
-loadable modules moves the code out of the library itself and keeps
-the door open for extensions and improvements, but this design is also
-limiting on some platforms since not many platforms support dynamic
-loading in statically linked programs. On platforms without this
-capability it is therefore not possible to use this interface in
-statically linked programs. The GNU C library has, on ELF platforms, no
-problems with dynamic loading in these situations; therefore, this
-point is moot. The danger is that one gets accustomed to this situation
-and forgets about the restrictions on other systems.
-
-A second thing to know about other @code{iconv} implementations is that
-the number of available conversions is often very limited. Some
-implementations provide, in the standard release (not special
-international or developer releases), at most 100 to 200 conversion
-possibilities. This does not mean 200 different character sets are
-supported; for example, conversions from one character set to a set of 10
-others might count as 10 conversions. Together with the other direction
-this makes 20 conversion possibilities used up by one character set. One
-can imagine the thin coverage these platforms provide. Some Unix vendors
-even provide only a handful of conversions which renders them useless for
-almost all uses.
-
-This directly leads to a third and probably the most problematic point.
-Given the way the @code{iconv} conversion functions are implemented on all
-known Unix systems, the availability of conversions from
-character set @math{@cal{A}} to @math{@cal{B}} and from
-@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
-conversion from @math{@cal{A}} to @math{@cal{C}} is available.
-
-This might not seem unreasonable or problematic at first, but it is
-quite a big problem, as one will notice shortly after hitting it. To show
-the problem, assume we write a program that has to convert from
-@math{@cal{A}} to @math{@cal{C}}. A call like
-
-@smallexample
-cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
-@end smallexample
-
-@noindent
-fails according to the assumption above. But what does the program
-do now? The conversion is necessary; therefore, simply giving up is not
-an option.
-
-This is a nuisance. The @code{iconv} function should take care of this.
-But how should the program proceed from here on? If it first tries to
-convert to character set @math{@cal{B}}, the two @code{iconv_open}
-calls
-
-@smallexample
-cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
-@end smallexample
-
-@noindent
-and
-
-@smallexample
-cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
-@end smallexample
-
-@noindent
-will succeed, but how to find @math{@cal{B}}?
-
-Unfortunately, the answer is: there is no general solution. On some
-systems guessing might help. On those systems most character sets can
-convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
-this, only some very system-specific methods can help. Since the
-conversion functions come from loadable modules and these modules must
-be stored somewhere in the filesystem, one @emph{could} try to find them
-and determine from the available files which conversions are available
-and whether there is an indirect route from @math{@cal{A}} to
-@math{@cal{C}}.
-
-This example shows one of the design errors of @code{iconv} mentioned
-above. It should at least be possible to determine the list of available
-conversions programmatically so that if @code{iconv_open} says there is no
-such conversion, one could make sure this also is true for indirect
-routes.
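-
-The guessing strategy mentioned above can be sketched as follows. The
-helper name @code{open_with_pivot} is hypothetical, and the choice of
-pivot character set (for example @code{"UTF-8"}) is exactly the kind of
-guess the text warns about: nothing guarantees that both character sets
-can be converted to and from it.
-
```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper.  Try a direct conversion first; if that is not
   available, fall back to two descriptors that go through PIVOT.
   Returns the number of descriptors opened (1 or 2), or -1 on failure.
   Unused descriptors are set to (iconv_t) -1.  */
int
open_with_pivot (const char *tocode, const char *fromcode,
                 const char *pivot, iconv_t *first, iconv_t *second)
{
  assert (first != NULL && second != NULL);
  *second = (iconv_t) -1;

  *first = iconv_open (tocode, fromcode);
  if (*first != (iconv_t) -1)
    return 1;
  if (errno != EINVAL)
    return -1;

  /* No direct route: guess that both sets can reach the pivot.  */
  *first = iconv_open (pivot, fromcode);
  if (*first == (iconv_t) -1)
    return -1;

  *second = iconv_open (tocode, pivot);
  if (*second == (iconv_t) -1)
    {
      /* Undo the first step, keeping errno from the failed open.  */
      int saved_errno = errno;
      iconv_close (*first);
      *first = (iconv_t) -1;
      errno = saved_errno;
      return -1;
    }
  return 2;
}
```
-
-When two descriptors are returned, the caller has to run @code{iconv}
-twice, passing the output of the first conversion through an intermediate
-buffer into the second, and eventually close both descriptors.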
-
-@node glibc iconv Implementation
-@subsection The @code{iconv} Implementation in the GNU C library
-
-After reading about the problems of @code{iconv} implementations in the
-last section it is certainly good to note that the implementation in
-the GNU C library has none of the problems mentioned above. What
-follows is a step-by-step analysis of the points raised above. The
-evaluation is based on the current state of the development (as of
-January 1999). The development of the @code{iconv} functions is not
-complete, but basic functionality has solidified.
-
-The GNU C library's @code{iconv} implementation uses shared loadable
-modules to implement the conversions. A very small number of
-conversions are built into the library itself but these are only rather
-trivial conversions.
-
-All the benefits of loadable modules are available in the GNU C library
-implementation. This is especially appealing since the interface is
-well documented (see below), and it, therefore, is easy to write new
-conversion modules. The drawback of using loadable objects is not a
-problem in the GNU C library, at least on ELF systems. Since the
-library is able to load shared objects even in statically linked
-binaries, static linking need not be forbidden in case one wants to use
-@code{iconv}.
-
-The second mentioned problem is the number of supported conversions.
-Currently, the GNU C library supports more than 150 character sets. The
-way the implementation is designed the number of supported conversions
-is greater than 22350 (@math{150} times @math{149}). If any conversion
-from or to a character set is missing, it can be added easily.
-
-Impressive as it may be, this high number is due to the
-fact that the GNU C library implementation of @code{iconv} does not have
-the third problem mentioned above (i.e., whenever there is a conversion
-from a character set @math{@cal{A}} to @math{@cal{B}} and from
-@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
-@math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open}
-returns an error and sets @code{errno} to @code{EINVAL}, there is no
-known way, directly or indirectly, to perform the wanted conversion.
-
-@cindex triangulation
-Triangulation is achieved by providing for each character set a
-conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646}
-as an intermediate representation it is possible to @dfn{triangulate}
-(i.e., convert with an intermediate representation).
-
-There is no inherent requirement to provide a conversion to @w{ISO
-10646} for a new character set, and it is also possible to provide other
-conversions where neither source nor destination character set is @w{ISO
-10646}. The existing set of conversions is simply meant to cover all
-conversions that might be of interest.
-
-@cindex ISO-2022-JP
-@cindex EUC-JP
-All currently available conversions use the triangulation method above,
-which makes some conversions run unnecessarily slowly. If, for example,
-somebody often needs the conversion from ISO-2022-JP to EUC-JP, a quicker
-solution would involve direct conversion between the two character sets,
-skipping the intermediate conversion to @w{ISO 10646}. The two character
-sets of interest are much more similar to each other than to @w{ISO 10646}.
-
-In such a situation one easily can write a new conversion and provide it
-as a better alternative. The GNU C library @code{iconv} implementation
-would automatically use the module implementing the conversion if it is
-specified to be more efficient.
-
-@subsubsection Format of @file{gconv-modules} files
-
-All information about the available conversions comes from a file named
-@file{gconv-modules} which can be found in any of the directories along
-the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented
-text files, where each of the lines has one of the following formats:
-
-@itemize @bullet
-@item
-If the first non-whitespace character is a @kbd{#} the line contains only
-comments and is ignored.
-
-@item
-Lines starting with @code{alias} define an alias name for a character
-set. Two more words are expected on the line. The first word
-defines the alias name, and the second defines the original name of the
-character set. The effect is that it is possible to use the alias name
-in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open} and
-achieve the same result as when using the real character set name.
-
-This is quite important as a character set often has many different
-names. There is normally an official name but this need not correspond to
-the most popular name. Besides this, many character sets have special
-names that are somehow constructed. For example, all character sets
-specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
-where @var{nnn} is the registration number. This allows programs which
-know about the registration number to construct character set names and
-use them in @code{iconv_open} calls. More on the available names and
-aliases follows below.
-
-@item
-Lines starting with @code{module} introduce an available conversion
-module. These lines must contain three or four more words.
-
-The first word specifies the source character set, the second word the
-destination character set of the conversion implemented in this module, and
-the third word is the name of the loadable module. The filename is
-constructed by appending the usual shared object suffix (normally
-@file{.so}) and this file is then supposed to be found in the same
-directory the @file{gconv-modules} file is in. The last word on the line,
-which is optional, is a numeric value representing the cost of the
-conversion. If this word is missing, a cost of @math{1} is assumed. The
-numeric value itself does not matter that much; what counts are the
-relative values of the sums of costs for all possible conversion paths.
-Below is a more precise description of the use of the cost value.
-@end itemize
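-
-To make the @code{module} line format concrete, here is a small sketch of
-how such a line could be split into its fields. It is not the parser used
-by the GNU C library: the function name @code{parse_module_line}, the
-@code{struct module_line} type, and the fixed field width of 63 characters
-are assumptions made for this example, and @code{alias} lines are simply
-rejected.
-
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one "module" line.  */
struct module_line
{
  char from[64];   /* Source character set.  */
  char to[64];     /* Destination character set.  */
  char name[64];   /* Module name, without the shared object suffix.  */
  int cost;        /* Cost of the conversion.  */
};

/* Parse one "module" line into *M.  Returns 1 on success and 0 if LINE
   is a comment, an alias line, or otherwise not a module line.  */
int
parse_module_line (const char *line, struct module_line *m)
{
  char keyword[8];

  assert (line != NULL && m != NULL);

  int fields = sscanf (line, " %7s %63s %63s %63s %d",
                       keyword, m->from, m->to, m->name, &m->cost);
  if (fields < 4 || strcmp (keyword, "module") != 0)
    return 0;
  if (fields == 4)
    m->cost = 1;   /* A missing cost defaults to 1.  */
  return 1;
}
```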
-
-Return to the example above, where one has written a module to convert
-directly from ISO-2022-JP to EUC-JP and back. All that has to be done is
-to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a
-directory and add a file @file{gconv-modules} with the following content
-in the same directory:
-
-@smallexample
-module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
-module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
-@end smallexample
-
-To see why this is sufficient, it is necessary to understand how the
-conversion used by @code{iconv} (and described in the descriptor) is
-selected. The approach to this problem is quite simple.
-
-At the first call of the @code{iconv_open} function the program reads
-all available @file{gconv-modules} files and builds up two tables: one
-containing all the known aliases and another that contains the
-information about the conversions and which shared object implements
-them.
-
-@subsubsection Finding the conversion path in @code{iconv}
-
-The set of available conversions forms a directed graph with weighted
-edges. The weights on the edges are the costs specified in the
-@file{gconv-modules} files. The @code{iconv_open} function uses an
-algorithm suitable for searching for the best path in such a graph and so
-constructs a list of conversions which must be performed in succession
-to get the transformation from the source to the destination character
-set.
-
-Explaining why the above @file{gconv-modules} files allows the
-@code{iconv} implementation to resolve the specific ISO-2022-JP to
-EUC-JP conversion module instead of the conversion coming with the
-library itself is straightforward. Since the latter conversion takes two
-steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
-EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules}
-file, however, specifies that the new conversion modules can perform this
-conversion with only the cost of @math{1}.
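
The cost comparison above can be sketched with a tiny shortest-path
search. This is only an illustration under stated assumptions: the node
numbers, the edge tables, and the function @code{cheapest_cost} are
hypothetical, and the relaxation loop (Bellman-Ford style) is a stand-in,
not necessarily the search the library performs.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical node numbers for the character sets in the example.  */
enum { ISO2022JP, INTERNAL, EUCJP, NNODES };

/* One available conversion with the cost from a gconv-modules line.  */
struct edge { int from, to, cost; };

/* Return the lowest total cost of a path from SRC to DST, or -1 if no
   path exists.  Assumes non-negative costs; uses repeated relaxation.  */
int
cheapest_cost (const struct edge *edges, size_t nedges, int src, int dst)
{
  int dist[NNODES];

  for (int i = 0; i < NNODES; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  for (int round = 0; round < NNODES - 1; ++round)
    for (size_t e = 0; e < nedges; ++e)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}

/* Only the two-step route through the intermediate representation.  */
const struct edge builtin_edges[] = {
  { ISO2022JP, INTERNAL, 1 }, { INTERNAL, EUCJP, 1 },
};

/* The same routes plus the new direct module with cost 1.  */
const struct edge with_direct_edges[] = {
  { ISO2022JP, INTERNAL, 1 }, { INTERNAL, EUCJP, 1 },
  { ISO2022JP, EUCJP, 1 },
};
```

With only the built-in edges the cheapest route costs 2; once the direct
edge is present it costs 1, so the direct module is preferred.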
-
-A mysterious aspect of the @file{gconv-modules} file above (and also
-the file coming with the GNU C library) is the naming of the character
-sets specified in the @code{module} lines. Why do almost all the names
-end in @code{//}? And this is not all: the names can actually be
-regular expressions. At this point in time this mystery should not be
-revealed, unless you have the relevant spell-casting materials: ashes
-from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
-blessed by St.@: Emacs, assorted herbal roots from Central America, sand
-from Cebu, etc. Sorry! @strong{The part of the implementation where
-this is used is not yet finished. For now please simply follow the
-existing examples. It'll become clearer once it is. --drepper}
-
-A last remark about the @file{gconv-modules} file is about the names not
-ending with @code{//}. A character set named @code{INTERNAL} is often
-mentioned. From the discussion above and the chosen name it should have
-become clear that this is the name for the representation used in the
-intermediate step of the triangulation. We have said that this is UCS-4
-but actually that is not quite right. The UCS-4 specification also
-includes the specification of the byte ordering used. Since a UCS-4 value
-consists of four bytes, a stored value is affected by byte ordering. The
-internal representation is @emph{not} the same as UCS-4 in case the byte
-ordering of the processor (or at least the running process) is not the
-same as the one required for UCS-4. This is done for performance reasons
-as one does not want to perform unnecessary byte-swapping operations if
-one is not interested in actually seeing the result in UCS-4. To avoid
-trouble with endianness, the internal representation consistently is named
-@code{INTERNAL} even on big-endian systems where the representations are
-identical.
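
The difference between the two representations can be sketched as
follows. The function names are hypothetical and the code assumes, as the
paragraph above implies, that UCS-4 stores the most significant byte
first while @code{INTERNAL} uses whatever order the host uses.

```c
#include <stdint.h>
#include <string.h>

/* Store CP as UCS-4: most significant byte first, on every host.  */
void
store_ucs4 (uint32_t cp, unsigned char out[4])
{
  out[0] = (unsigned char) (cp >> 24);
  out[1] = (unsigned char) (cp >> 16);
  out[2] = (unsigned char) (cp >> 8);
  out[3] = (unsigned char) cp;
}

/* Store CP in host byte order (the assumed INTERNAL layout).
   No byte swapping is ever needed.  */
void
store_internal (uint32_t cp, unsigned char out[4])
{
  memcpy (out, &cp, 4);
}

/* Read a host-order value back.  */
uint32_t
load_internal (const unsigned char in[4])
{
  uint32_t cp;
  memcpy (&cp, in, 4);
  return cp;
}

/* Nonzero if both layouts coincide for CP on this machine.  For values
   with distinct bytes this holds only on big-endian hosts.  */
int
internal_equals_ucs4 (uint32_t cp)
{
  unsigned char a[4], b[4];
  store_ucs4 (cp, a);
  store_internal (cp, b);
  return memcmp (a, b, 4) == 0;
}
```

On a little-endian host @code{internal_equals_ucs4 (0x20AC)} would be
zero, which is exactly the case in which a separate name for the internal
representation matters.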
-
-@subsubsection @code{iconv} module data structures
-
-So far this section has described how modules are located and considered
-to be used. What remains to be described is the interface of the modules
-so that one can write new ones. This section describes the interface as
-it is in use in January 1999. The interface will change a bit in the
-future but, with luck, only in an upwardly compatible way.
-
-The definitions necessary to write new modules are publicly available
-in the non-standard header @file{gconv.h}. The following text,
-therefore, describes the definitions from this header file. First,
-however, it is necessary to get an overview.
-
-From the perspective of the user of @code{iconv} the interface is quite
-simple: the @code{iconv_open} function returns a handle that can be used
-in calls to @code{iconv}, and finally the handle is freed with a call to
-@code{iconv_close}. The problem is that the handle has to be able to
-represent the possibly long sequences of conversion steps and also the
-state of each conversion since the handle is all that is passed to the
-@code{iconv} function. Therefore, the data structures are really the
-elements necessary to understand the implementation.
-
-We need two different kinds of data structures. The first describes the
-conversion and the second describes the state etc. There are really two
-type definitions like this in @file{gconv.h}.
-@pindex gconv.h
-
-@comment gconv.h
-@comment GNU
-@deftp {Data type} {struct __gconv_step}
-This data structure describes one conversion a module can perform. For
-each function in a loaded module with conversion functions there is
-exactly one object of this type. This object is shared by all users of
-the conversion (i.e., this object does not contain any information
-corresponding to an actual conversion; it only describes the conversion
-itself).
-
-@table @code
-@item struct __gconv_loaded_object *__shlib_handle
-@itemx const char *__modname
-@itemx int __counter
-All these elements of the structure are used internally in the C library
-to coordinate loading and unloading the shared object. One must not expect any
-of the other elements to be available or initialized.
-
-@item const char *__from_name
-@itemx const char *__to_name
-@code{__from_name} and @code{__to_name} contain the names of the source and
-destination character sets. They can be used to identify the actual
-conversion to be carried out since one module might implement conversions
-for more than one character set and/or direction.
-
-@item __gconv_fct __fct
-@itemx __gconv_init_fct __init_fct
-@itemx __gconv_end_fct __end_fct
-These elements contain pointers to the functions in the loadable module.
-The interface will be explained below.
-
-@item int __min_needed_from
-@itemx int __max_needed_from
-@itemx int __min_needed_to
-@itemx int __max_needed_to
-These values have to be supplied in the init function of the module. The
-@code{__min_needed_from} value specifies how many bytes a character of
-the source character set at least needs. The @code{__max_needed_from}
-specifies the maximum value that also includes possible shift sequences.
-
-The @code{__min_needed_to} and @code{__max_needed_to} values serve the
-same purpose as @code{__min_needed_from} and @code{__max_needed_from} but
-this time for the destination character set.
-
-It is crucial that these values be accurate since otherwise the
-conversion functions will have problems or not work at all.
-
-@item int __stateful
-This element must also be initialized by the init function.
-@code{__stateful} is nonzero if the source character set is stateful.
-Otherwise it is zero.
-
-@item void *__data
-This element can be used freely by the conversion functions in the
-module. @code{__data} can be used to communicate extra information
-from one call to another. @code{__data} need not be initialized if
-not needed at all. If the @code{__data} element is assigned a pointer
-to dynamically allocated memory (presumably in the init function) it has
-to be made sure that the end function deallocates the memory. Otherwise
-the application will leak memory.
-
-It is important to be aware that this data structure is shared by all
-users of this specific conversion and therefore the @code{__data}
-element must not contain data specific to one specific use of the
-conversion function.
-@end table
-@end deftp
-
-@comment gconv.h
-@comment GNU
-@deftp {Data type} {struct __gconv_step_data}
-This is the data structure that contains the information specific to
-each use of the conversion functions.
-
-
-@table @code
-@item char *__outbuf
-@itemx char *__outbufend
-These elements specify the output buffer for the conversion step. The
-@code{__outbuf} element points to the beginning of the buffer, and
-@code{__outbufend} points to the byte following the last byte in the
-buffer. The conversion function must not assume anything about the size
-of the buffer but it can be safely assumed that there is room for at
-least one complete character in the output buffer.
-
-Once the conversion is finished, if the conversion is the last step, the
-@code{__outbuf} element must be modified to point after the last byte
-written into the buffer to signal how much output is available. If this
-conversion step is not the last one, the element must not be modified.
-The @code{__outbufend} element must not be modified.
-
-@item int __is_last
-This element is nonzero if this conversion step is the last one. This
-information is necessary for the recursion. See the description of the
-conversion function internals below. This element must never be
-modified.
-
-@item int __invocation_counter
-The conversion function can use this element to see how many calls of
-the conversion function already happened. Some character sets require a
-certain prolog when generating output, and by comparing this value with
-zero, one can find out whether it is the first call and whether,
-therefore, the prolog should be emitted. This element must never be
-modified.
-
-@item int __internal_use
-This element is another one rarely used but needed in certain
-situations. It is assigned a nonzero value in case the conversion
-functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
-function is not used directly through the @code{iconv} interface).
-
-This sometimes makes a difference as it is expected that the
-@code{iconv} functions are used to translate entire texts while the
-@code{mbsrtowcs} functions are normally used only to convert single
-strings and might be used multiple times to convert entire texts.
-
-But in this situation we would have problems complying with some rules of
-the character set specification. Some character sets require a prolog
-which must appear exactly once for an entire text. If a number of
-@code{mbsrtowcs} calls are used to convert the text, only the first call
-must add the prolog. However, because there is no communication between the
-different calls of @code{mbsrtowcs}, the conversion functions have no
-possibility to find this out. The situation is different for sequences
-of @code{iconv} calls since the handle allows access to the needed
-information.
-
-The @code{__internal_use} element is mostly used together with
-@code{__invocation_counter} as follows:
-
-@smallexample
-if (!data->__internal_use
-    && data->__invocation_counter == 0)
-  /* @r{Emit prolog.} */
-  ...
-@end smallexample
-
-This element must never be modified.
-
-@item mbstate_t *__statep
-The @code{__statep} element points to an object of type @code{mbstate_t}
-(@pxref{Keeping the state}). The conversion of a stateful character
-set must use the object pointed to by @code{__statep} to store
-information about the conversion state. The @code{__statep} element
-itself must never be modified.
-
-@item mbstate_t __state
-This element must @emph{never} be used directly. It is only part of
-this structure to have the needed space allocated.
-@end table
-@end deftp
-
-@subsubsection @code{iconv} module interfaces
-
-With the knowledge about the data structures we now can describe the
-conversion function itself. To understand the interface a bit of
-knowledge is necessary about the functionality in the C library that
-loads the objects with the conversions.
-
-It is often the case that one conversion is used more than once (i.e.,
-there are several @code{iconv_open} calls for the same set of character
-sets during one program run). The @code{mbsrtowcs} et al.@: functions in
-the GNU C library also use the @code{iconv} functionality, which
-increases the number of uses of the same functions even more.
-
-Because of this multiple use of conversions, the modules do not get
-loaded exclusively for one conversion. Instead a module once loaded can
-be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls
-at the same time. The splitting of the information between
-conversion-function-specific information and conversion data makes this
-possible.
-The last section showed the two data structures used to do this.
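
The split can be pictured with a much reduced model. The types
@code{struct step_info} and @code{struct use_data} and the function
@code{use_step} below are hypothetical stand-ins for the real structures;
the point is only that the shared description is read-only during a
conversion while each use carries its own mutable record.

```c
#include <stddef.h>

/* Shared by all users: describes the conversion, never a single use.  */
struct step_info
{
  const char *from_name;
  const char *to_name;
  int min_needed_from;
};

/* One record per use (per handle): holds everything that changes.  */
struct use_data
{
  size_t invocation_counter;
  int shift_state;
};

/* A toy "conversion call": it may read STEP but only writes to DATA,
   so any number of uses can share one STEP object.  */
void
use_step (const struct step_info *step, struct use_data *data,
          int new_shift_state)
{
  (void) step;                  /* Shared part is left untouched.  */
  data->shift_state = new_shift_state;
  ++data->invocation_counter;
}
```

Two handles that share one @code{struct step_info} object but own
separate @code{struct use_data} objects do not disturb each other.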
-
-This is of course also reflected in the interface and semantics of the
-functions that the modules must provide. There are three functions that
-must have the following names:
-
-@table @code
-@item gconv_init
-The @code{gconv_init} function initializes the conversion function
-specific data structure. This very same object is shared by all
-conversions that use this conversion and, therefore, no state information
-about the conversion itself must be stored in here. If a module
-implements more than one conversion, the @code{gconv_init} function will
-be called multiple times.
-
-@item gconv_end
-The @code{gconv_end} function is responsible for freeing all resources
-allocated by the @code{gconv_init} function. If there is nothing to do,
-this function can be missing. Special care must be taken if the module
-implements more than one conversion and the @code{gconv_init} function
-does not allocate the same resources for all conversions.
-
-@item gconv
-This is the actual conversion function. It is called to convert one
-block of text. It gets passed the conversion step information
-initialized by @code{gconv_init} and the conversion data, specific to
-this use of the conversion functions.
-@end table
-
-There are three data types defined for the three module interface
-functions and these define the interface.
-
-@comment gconv.h
-@comment GNU
-@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)
-This specifies the interface of the initialization function of the
-module. It is called exactly once for each conversion the module
-implements.
-
-As explained in the description of the @code{struct __gconv_step} data
-structure above, the initialization function has to initialize parts of
-it.
-
-@table @code
-@item __min_needed_from
-@itemx __max_needed_from
-@itemx __min_needed_to
-@itemx __max_needed_to
-These elements must be initialized to the exact numbers of the minimum
-and maximum number of bytes used by one character in the source and
-destination character sets, respectively. If the characters all have the
-same size, the minimum and maximum values are the same.
-
-@item __stateful
-This element must be initialized to a nonzero value if the source
-character set is stateful. Otherwise it must be zero.
-@end table
-
-If the initialization function needs to communicate some information
-to the conversion function, this communication can happen using the
-@code{__data} element of the @code{__gconv_step} structure. But since
-this data is shared by all the conversions, it must not be modified by
-the conversion function. The example below shows how this can be used.
-
-@smallexample
-#define MIN_NEEDED_FROM 1
-#define MAX_NEEDED_FROM 4
-#define MIN_NEEDED_TO 4
-#define MAX_NEEDED_TO 4
-
-int
-gconv_init (struct __gconv_step *step)
-@{
-  /* @r{Determine which direction.} */
-  struct iso2022jp_data *new_data;
-  enum direction dir = illegal_dir;
-  enum variant var = illegal_var;
-  int result;
-
-  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
-    @{
-      dir = from_iso2022jp;
-      var = iso2022jp;
-    @}
-  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
-    @{
-      dir = to_iso2022jp;
-      var = iso2022jp;
-    @}
-  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
-    @{
-      dir = from_iso2022jp;
-      var = iso2022jp2;
-    @}
-  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
-    @{
-      dir = to_iso2022jp;
-      var = iso2022jp2;
-    @}
-
-  result = __GCONV_NOCONV;
-  if (dir != illegal_dir)
-    @{
-      new_data = (struct iso2022jp_data *)
-        malloc (sizeof (struct iso2022jp_data));
-
-      result = __GCONV_NOMEM;
-      if (new_data != NULL)
-        @{
-          new_data->dir = dir;
-          new_data->var = var;
-          step->__data = new_data;
-
-          if (dir == from_iso2022jp)
-            @{
-              step->__min_needed_from = MIN_NEEDED_FROM;
-              step->__max_needed_from = MAX_NEEDED_FROM;
-              step->__min_needed_to = MIN_NEEDED_TO;
-              step->__max_needed_to = MAX_NEEDED_TO;
-            @}
-          else
-            @{
-              step->__min_needed_from = MIN_NEEDED_TO;
-              step->__max_needed_from = MAX_NEEDED_TO;
-              step->__min_needed_to = MIN_NEEDED_FROM;
-              step->__max_needed_to = MAX_NEEDED_FROM + 2;
-            @}
-
-          /* @r{Yes, this is a stateful encoding.} */
-          step->__stateful = 1;
-
-          result = __GCONV_OK;
-        @}
-    @}
-
-  return result;
-@}
-@end smallexample
-
-The function first checks which conversion is wanted. The module from
-which this function is taken implements four different conversions;
-which one is selected can be determined by comparing the names. The
-comparison should always be done without paying attention to the case.
-
-Next, a data structure, which contains the necessary information about
-which conversion is selected, is allocated. The data structure
-@code{struct iso2022jp_data} is locally defined since, outside the
-module, this data is not used at all. Please note that if all four
-conversions this module supports are requested there are four data
-blocks.
-
-One interesting thing is the initialization of the @code{__min_} and
-@code{__max_} elements of the step data object. A single ISO-2022-JP
-character can consist of one to four bytes. Therefore the
-@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
-this way. The output is always the @code{INTERNAL} character set (aka
-UCS-4) and therefore each character consists of exactly four bytes. For
-the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
-account that escape sequences might be necessary to switch the character
-sets. Therefore the @code{__max_needed_to} element for this direction
-gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
-two bytes needed for the escape sequences to signal the switching. The
-asymmetry in the maximum values for the two directions can be explained
-easily: when reading ISO-2022-JP text, escape sequences can be handled
-alone (i.e., it is not necessary to process a real character since the
-effect of the escape sequence can be recorded in the state information).
-The situation is different for the other direction. Since it is in
-general not known which character comes next, one cannot emit escape
-sequences to change the state in advance. This means that the escape
-sequences have to be emitted together with the next character.
-Therefore one needs more room than only for the character itself.
-
-The possible return values of the initialization function are:
-
-@table @code
-@item __GCONV_OK
-The initialization succeeded.
-@item __GCONV_NOCONV
-The requested conversion is not supported in the module. This can
-happen if the @file{gconv-modules} file has errors.
-@item __GCONV_NOMEM
-Memory required to store additional information could not be allocated.
-@end table
-@end deftypevr
-
-The function called before the module is unloaded is significantly
-simpler. It often has nothing at all to do, in which case it can be left
-out completely.
-
-@comment gconv.h
-@comment GNU
-@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
-The task of this function is to free all resources allocated in the
-initialization function. Therefore only the @code{__data} element of
-the object pointed to by the argument is of interest. Continuing the
-example from the initialization function, the finalization function
-looks like this:
-
-@smallexample
-void
-gconv_end (struct __gconv_step *data)
-@{
-  free (data->__data);
-@}
-@end smallexample
-@end deftypevr
-
-The most important function is the conversion function itself, which can
-get quite complicated for complex character sets. But since this is not
-of interest here, we will only describe a possible skeleton for the
-conversion function.
-
-@comment gconv.h
-@comment GNU
-@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
-The conversion function can be called for two basic reasons: to convert
-text or to reset the state. From the description of the @code{iconv}
-function it can be seen why the flushing mode is necessary. What mode
-is selected is determined by the sixth argument, an integer. This
-argument being nonzero means that flushing is selected.
-
-Common to both modes is where the output buffer can be found. The
-information about this buffer is stored in the conversion step data. A
-pointer to this information is passed as the second argument to this
-function. The description of the @code{struct __gconv_step_data}
-structure has more information on the conversion step data.
-
-@cindex stateful
-What has to be done for flushing depends on the source character set.
-If the source character set is not stateful, nothing has to be done.
-Otherwise the function has to emit a byte sequence to bring the state
-object into the initial state. Once this all happened the other
-conversion modules in the chain of conversions have to get the same
-chance. Whether another step follows can be determined from the
-@code{__is_last} element of the step data structure to which the second
-parameter points.
-
-The more interesting mode is when actual text has to be converted. The
-first step in this case is to convert as much text as possible from the
-input buffer and store the result in the output buffer. The start of the
-input buffer is determined by the third argument which is a pointer to a
-pointer variable referencing the beginning of the buffer. The fourth
-argument is a pointer to the byte right after the last byte in the buffer.
-
-The conversion has to be performed according to the current state if the
-character set is stateful. The state is stored in an object pointed to
-by the @code{__statep} element of the step data (second argument). Once
-either the input buffer is empty or the output buffer is full the
-conversion stops. At this point, the pointer variable referenced by the
-third parameter must point to the byte following the last processed
-byte (i.e., if all of the input is consumed, this pointer and the fourth
-parameter have the same value).
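
The stopping rule and the pointer update described above can be sketched
with a deliberately simple, stateless conversion. Everything here is
hypothetical: the status enumeration stands in for the
@code{__GCONV_*} values, the function signature is not the module
interface, and the conversion (ISO 8859-1 bytes to four-byte host-order
units) was chosen only because it needs no state.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes standing in for the __GCONV_* values.  */
enum sketch_status { SKETCH_EMPTY_INPUT, SKETCH_FULL_OUTPUT };

/* Convert bytes from *INBUF up to INBUFEND into four-byte units at
   *OUTBUF.  Stop when the input is used up or the output has no room
   for one more character.  On return *INBUF points just past the last
   processed input byte and *OUTBUF just past the last written byte.  */
enum sketch_status
latin1_to_internal (const unsigned char **inbuf,
                    const unsigned char *inbufend,
                    unsigned char **outbuf, unsigned char *outbufend)
{
  const unsigned char *in = *inbuf;
  unsigned char *out = *outbuf;
  enum sketch_status status = SKETCH_EMPTY_INPUT;

  while (in != inbufend)
    {
      if ((size_t) (outbufend - out) < 4)
        {
          status = SKETCH_FULL_OUTPUT;
          break;
        }

      /* ISO 8859-1 code points equal the byte values.  */
      uint32_t cp = *in++;
      memcpy (out, &cp, 4);
      out += 4;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

If all input is consumed, @code{*inbuf} equals @code{inbufend} on return,
matching the rule stated above for the third and fourth parameters.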
-
-What now happens depends on whether this step is the last one. If it is
-the last step, the only thing that has to be done is to update the
-@code{__outbuf} element of the step data structure to point after the
-last written byte. This update gives the caller the information on how
-much text is available in the output buffer. In addition, the variable
-pointed to by the fifth parameter, which is of type @code{size_t}, must
-be incremented by the number of characters (@emph{not bytes}) that were
-converted in a non-reversible way. Then, the function can return.
-
-In case the step is not the last one, the later conversion functions have
-to get a chance to do their work. Therefore, the appropriate conversion
-function has to be called. The information about the functions is
-stored in the conversion data structures, passed as the first parameter.
-This information and the step data are stored in arrays, so the next
-element in both cases can be found by simple pointer arithmetic:
-
-@smallexample
-int
-gconv (struct __gconv_step *step, struct __gconv_step_data *data,
-       const char **inbuf, const char *inbufend, size_t *written,
-       int do_flush)
-@{
-  struct __gconv_step *next_step = step + 1;
-  struct __gconv_step_data *next_data = data + 1;
-  ...
-@end smallexample
-
-The @code{next_step} pointer references the next step information and
-@code{next_data} the next data record. The call of the next function
-therefore will look similar to this:
-
-@smallexample
-  next_step->__fct (next_step, next_data, &outerr, outbuf,
-                    written, 0)
-@end smallexample
-
-But this is not yet all. Once the function call returns, the conversion
-function might have some more to do. If the return value of the function
-is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
-buffer. Unless the input buffer is empty, the conversion function starts
-all over again and processes the rest of the input buffer. If the return
-value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
-to recover from this.
-
-A requirement for the conversion function is that the input buffer
-pointer (the third argument) always point to the last character that
-was put in converted form into the output buffer. This is trivially
-true after the conversion performed in the current step, but if the
-conversion functions deeper downstream stop prematurely, not all
-characters from the output buffer are consumed and, therefore, the input
-buffer pointers must be backed off to the right position.
-
-Correcting the input buffers is easy to do if the input and output
-character sets have a fixed width for all characters. In this situation
-we can compute how many characters are left in the output buffer and,
-therefore, can correct the input buffer pointer appropriately with a
-similar computation. Things get tricky if either character set
-has characters represented with variable length byte sequences, and it
-gets even more complicated if the conversion has to take care of the
-state. In these cases the conversion has to be performed once again, from
-the known state before the initial conversion (i.e., if necessary the
-state of the conversion has to be reset and the conversion loop has to be
-executed again). The difference now is that it is known how much output
-must be created, and the conversion can stop before converting the first
-unused character. Once this is done the input buffer pointers must be
-updated again and the function can return.
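
For the easy fixed-width case, the correction is plain arithmetic. The
helper below is a hypothetical illustration, not part of the module
interface; it assumes both character sets use a constant number of bytes
per character and that the unconsumed output is a whole number of
characters.

```c
#include <stddef.h>

/* IN_AFTER points just past the input consumed by this step.
   UNCONSUMED_OUT_BYTES is how much of this step's output the later
   steps left unprocessed.  Return where the input pointer must be
   backed off to, so it matches the output that was actually used.  */
const char *
back_off_input (const char *in_after, size_t unconsumed_out_bytes,
                size_t in_width, size_t out_width)
{
  size_t unconsumed_chars = unconsumed_out_bytes / out_width;
  return in_after - unconsumed_chars * in_width;
}
```

For example, with one-byte input characters and four-byte output
characters, twelve unconsumed output bytes correspond to three
characters, so the input pointer moves back by three bytes.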
-
-One final thing should be mentioned. If it is necessary for the
-conversion to know whether it is the first invocation (in case a prolog
-has to be emitted), the conversion function should increment the
-@code{__invocation_counter} element of the step data structure just
-before returning to the caller. See the description of the @code{struct
-__gconv_step_data} structure above for more information on how this can
-be used.
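
The interplay between the counter and the prolog test can be modelled in
a few lines. The structure and function names are hypothetical; only the
two fields mirror the @code{__invocation_counter} and
@code{__internal_use} elements described earlier.

```c
/* Reduced, hypothetical model of the per-use step data.  */
struct sketch_step_data
{
  int invocation_counter;
  int internal_use;
};

/* A prolog is due only on the first call, and only when the conversion
   is driven through the iconv interface rather than used internally.  */
int
needs_prolog (const struct sketch_step_data *data)
{
  return !data->internal_use && data->invocation_counter == 0;
}

/* To be done just before the conversion function returns.  */
void
finish_call (struct sketch_step_data *data)
{
  ++data->invocation_counter;
}
```

After the first call has finished, @code{needs_prolog} reports zero for
the same data record, so the prolog is not emitted twice.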
-
-The return value must be one of the following values:
-
-@table @code
-@item __GCONV_EMPTY_INPUT
-All input was consumed and there is room left in the output buffer.
-@item __GCONV_FULL_OUTPUT
-No more room in the output buffer. In case this is not the last step
-this value is propagated down from the call of the next conversion
-function in the chain.
-@item __GCONV_INCOMPLETE_INPUT
-The input buffer is not entirely empty since it contains an incomplete
-character sequence.
-@end table
-
-The following example provides a framework for a conversion function.
-In case a new conversion has to be written the holes in this
-implementation have to be filled and that is it.
-
-@smallexample
-int
-gconv (struct __gconv_step *step, struct __gconv_step_data *data,
-       const char **inbuf, const char *inbufend, size_t *written,
-       int do_flush)
-@{
-  struct __gconv_step *next_step = step + 1;
-  struct __gconv_step_data *next_data = data + 1;
-  __gconv_fct fct = next_step->__fct;
-  int status;
-
-  /* @r{If the function is called with no input this means we have}
-     @r{to reset to the initial state. The possibly partly}
-     @r{converted input is dropped.} */
-  if (do_flush)
-    @{
-      status = __GCONV_OK;
-
-      /* @r{Possibly emit a byte sequence which puts the state object}
-         @r{into the initial state.} */
-
-      /* @r{Call the steps down the chain if there are any but only}
-         @r{if we successfully emitted the escape sequence.} */
-      if (status == __GCONV_OK && ! data->__is_last)
-        status = fct (next_step, next_data, NULL, NULL,
-                      written, 1);
-    @}
-  else
-    @{
-      /* @r{We preserve the initial values of the pointer variables.} */
-      const char *inptr = *inbuf;
-      char *outbuf = data->__outbuf;
-      char *outend = data->__outbufend;
-      char *outptr;
-
-      do
-        @{
-          /* @r{Remember the start value for this round.} */
-          inptr = *inbuf;
-          /* @r{The outbuf buffer is empty.} */
-          outptr = outbuf;
-
-          /* @r{For stateful encodings the state must be saved here.} */
-
-          /* @r{Run the conversion loop. @code{status} is set}
-             @r{appropriately afterwards.} */
-
-          /* @r{If this is the last step, leave the loop. There is}
-             @r{nothing we can do.} */
-          if (data->__is_last)
-            @{
-              /* @r{Store information about how many bytes are}
-                 @r{available.} */
-              data->__outbuf = outbuf;
-
-              /* @r{If any non-reversible conversions were performed,}
-                 @r{add the number to @code{*written}.} */
-
-              break;
-            @}
-
-          /* @r{Write out all output which was produced.} */
-          if (outbuf > outptr)
-            @{
-              const char *outerr = data->__outbuf;
-              int result;
-
-              result = fct (next_step, next_data, &outerr,
-                            outbuf, written, 0);
-
-              if (result != __GCONV_EMPTY_INPUT)
-                @{
-                  if (outerr != outbuf)
-                    @{
-                      /* @r{Reset the input buffer pointer. We}
-                         @r{document here the complex case.} */
-                      size_t nstatus;
-
-                      /* @r{Reload the pointers.} */
-                      *inbuf = inptr;
-                      outbuf = outptr;
-
-                      /* @r{Possibly reset the state.} */
-
-                      /* @r{Redo the conversion, but this time}
-                         @r{the end of the output buffer is at}
-                         @r{@code{outerr}.} */
-                    @}
-
-                  /* @r{Change the status.} */
-                  status = result;
-                @}
-              else
-                /* @r{All the output is consumed, we can make}
-                   @r{another run if everything was ok.} */
-                if (status == __GCONV_FULL_OUTPUT)
-                  status = __GCONV_OK;
-            @}
-        @}
-      while (status == __GCONV_OK);
-
-      /* @r{We finished one use of this step.} */
-      ++data->__invocation_counter;
-    @}
-
-  return status;
-@}
-@end smallexample
-@end deftypevr
-
-This information should be sufficient to write new modules. Anybody
-doing so should also take a look at the available source code in the GNU
-C library sources. It contains many examples of working and optimized
-modules.
-
+@node Character Set Handling, Locales, String and Array Utilities, Top +@c %MENU% Support for extended character sets +@chapter Character Set Handling + +@ifnottex +@macro cal{text} +\text\ +@end macro +@end ifnottex + +Character sets used in the early days of computing had only six, seven, +or eight bits for each character: there was never a case where more than +eight bits (one byte) were used to represent a single character. The +limitations of this approach became more apparent as more people +grappled with non-Roman character sets, where not all the characters +that make up a language's character set can be represented by @math{2^8} +choices. This chapter shows the functionality that was added to the C +library to support multiple character sets. + +@menu +* Extended Char Intro:: Introduction to Extended Characters. +* Charset Function Overview:: Overview about Character Handling + Functions. +* Restartable multibyte conversion:: Restartable multibyte conversion + Functions. +* Non-reentrant Conversion:: Non-reentrant Conversion Function. +* Generic Charset Conversion:: Generic Charset Conversion. +@end menu + + +@node Extended Char Intro +@section Introduction to Extended Characters + +A variety of solutions is available to overcome the differences between +character sets with a 1:1 relation between bytes and characters and +character sets with ratios of 2:1 or 4:1. The remainder of this +section gives a few examples to help understand the design decisions +made while developing the functionality of the @w{C library}. + +@cindex internal representation +A distinction we have to make right away is between internal and +external representation. @dfn{Internal representation} means the +representation used by a program while keeping the text in memory. +External representations are used when text is stored or transmitted +through some communication channel. Examples of external +representations include files waiting in a directory to be +read and parsed. 
+ +Traditionally there has been no difference between the two representations. +It was equally comfortable and useful to use the same single-byte +representation internally and externally. This comfort level decreases +with more and larger character sets. + +One of the problems to overcome with the internal representation is +handling text that is externally encoded using different character +sets. Assume a program that reads two texts and compares them using +some metric. The comparison can be usefully done only if the texts are +internally kept in a common format. + +@cindex wide character +For such a common format (@math{=} character set) eight bits are certainly +no longer enough. So the smallest entity will have to grow: @dfn{wide +characters} will now be used. Instead of one byte per character, two or +four will be used instead. (Three are not good to address in memory and +more than four bytes seem not to be necessary). + +@cindex Unicode +@cindex ISO 10646 +As shown in some other part of this manual, +@c !!! Ahem, wide char string functions are not yet covered -- drepper +a completely new family has been created of functions that can handle wide +character texts in memory. The most commonly used character sets for such +internal wide character representations are Unicode and @w{ISO 10646} +(also known as UCS for Universal Character Set). Unicode was originally +planned as a 16-bit character set; whereas, @w{ISO 10646} was designed to +be a 31-bit large code space. The two standards are practically identical. +They have the same character repertoire and code table, but Unicode specifies +added semantics. At the moment, only characters in the first @code{0x10000} +code positions (the so-called Basic Multilingual Plane, BMP) have been +assigned, but the assignment of more specialized characters outside this +16-bit space is already in progress. 
A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
+
+To represent wide characters the @code{char} type is not suitable. For
+this reason the @w{ISO C} standard introduces a new type that is
+designed to keep one character of a wide character string. To maintain
+the similarity there is also a type corresponding to @code{int} for
+those functions that take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+In other words, arrays of objects of this type are the equivalent of
+@code{char[]} for multibyte character strings. The type is defined in
+@file{stddef.h}.
+
+The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
+say anything specific about the representation. It only requires that
+this type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as @code{char},
+which might make sense for embedded systems.
+
+But for GNU systems @code{wchar_t} is always 32 bits wide and, therefore,
+capable of representing all UCS-4 values and, therefore, covering all of
+@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
+and thereby follow Unicode very strictly.
This definition is perfectly
+fine with the standard, but it also means that to represent all
+characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
+characters, which is in fact a multi-wide-character encoding. But
+resorting to multi-wide-character encoding contradicts the purpose of the
+@code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables that
+contain a single wide character. As the name suggests this type is the
+equivalent of @code{int} when using the normal @code{char} strings. The
+types @code{wchar_t} and @code{wint_t} often have the same
+representation if they are 32 bits wide, but if @code{wchar_t} is
+defined as @code{char} the type @code{wint_t} must be defined as
+@code{int} due to parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in
+@w{Amendment 1} to @w{ISO C90}.
+@end deftp
+
+As with the @code{char} data type, macros are available for
+specifying the minimum and maximum value representable in an object of
+type @code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
+@end deftypevr
+
+Another special wide character value is the equivalent to @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative. In other words, sloppy
+code like
+
+@smallexample
+@{
+  int c;
+  ...
+  while ((c = getc (fp)) >= 0)
+    ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to use @code{WEOF} explicitly when wide characters
+are used:
+
+@smallexample
+@{
+  wint_t c;
+  ...
+  while ((c = getwc (fp)) != WEOF)
+    ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to storage
+and transmission. Because each single wide character consists of more
+than one byte, it is affected by byte ordering. Thus, machines with
+different endiannesses would see different values when accessing the same
+data. This byte ordering concern also applies to communication protocols
+that are all byte-based and, therefore, require the sender to
+decide how to split the wide character into bytes. A last (but not least
+important) point is that wide characters often require more storage space
+than a customized byte-oriented character set.
+
+@cindex multibyte character
+@cindex EBCDIC
+For all the above reasons, an external encoding that is different
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
+The external encoding is byte-based and can be chosen appropriately for
+the environment and for the texts to be handled. A variety of different
+character sets can be used for this external encoding (information that
+will not be exhaustively presented here--instead, a description of the
+major groups will suffice). All of the ASCII-based character sets
+fulfill one requirement: they are "filesystem safe." This means
+that the character @code{'/'} is used in the encoding @emph{only} to
+represent itself.
Things are a bit different for character sets like
+EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
+family used by IBM), but if the operating system does not understand
+EBCDIC directly the parameters to system calls have to be converted first
+anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are single-byte character sets. There can
+be only up to 256 characters (for @w{8 bit} character sets), which is
+not sufficient to cover all languages but might be sufficient to handle
+a specific text. Handling of @w{8 bit} character sets is simple. This
+is not true for other kinds presented later, and therefore, the
+application one uses might require the use of @w{8 bit} character sets.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte. This is achieved by associating a state with the text.
+Characters that can be used to change the state can be embedded in the
+text. Each byte in the text might have a different interpretation in each
+state. The state might even influence whether a given byte stands for a
+character on its own or whether it has to be combined with some more
+bytes.
+
+@cindex EUC
+@cindex Shift_JIS
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes that cover more than the next character. This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly. Examples of
+character sets using this policy are the various EUC character sets
+(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or Shift_JIS (SJIS, a Japanese encoding).
+
+But there are also character sets using a state that is valid for more
+than one character and has to be changed by another byte sequence.
+Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using the
+Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
+representing characters like the acute accent do not produce output
+themselves: one has to combine them with other characters to get the
+desired result. For example, one writes the byte sequence @code{0xc2 0x61}
+(non-spacing acute accent, followed by lower-case `a') to get the ``small
+a with acute'' character. To get the acute accent character on its own,
+one has to write @code{0xc2 0x20} (the non-spacing acute followed by a
+space).
+
+Character sets like @w{ISO 6937} are used in some embedded systems such
+as teletex.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
+it is often also sufficient to simply use an encoding different from
+UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8. This encoding is able to represent all 31 bits of
+@w{ISO 10646} in a byte string of length one to six.
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
+but UTF-8 is today the only encoding that should be used. In fact, with
+any luck UTF-8 will soon be the only external encoding that has to be
+supported. It proves to be universally usable and its only disadvantage
+is that it favors Roman languages by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary if using a specific character set for these scripts.
+Methods like the Unicode compression scheme can alleviate these
+problems.
+@end itemize
+
+The question remaining is: how to select the character set or encoding
+to use. The answer: you cannot decide about it yourself, it is decided
+by the developers of the system or the majority of the users. Since the
+goal is interoperability one has to use whatever the other people one
+works with use.
If there are no constraints, the selection is based on +the requirements the expected circle of users will have. In other words, +if a project is expected to be used in only, say, Russia it is fine to use +KOI8-R or a similar character set. But if at the same time people from, +say, Greece are participating one should use a character set that allows +all people to collaborate. + +The most widely useful solution seems to be: go with the most general +character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding +and problems about users not being able to use their own language +adequately are a thing of the past. + +One final comment about the choice of the wide character representation +is necessary at this point. We have said above that the natural choice +is using Unicode or @w{ISO 10646}. This is not required, but at least +encouraged, by the @w{ISO C} standard. The standard defines at least a +macro @code{__STDC_ISO_10646__} that is only defined on systems where +the @code{wchar_t} type encodes @w{ISO 10646} characters. If this +symbol is not defined one should avoid making assumptions about the wide +character representation. If the programmer uses only the functions +provided by the C library to handle wide character strings there should +be no compatibility problems with other systems. + +@node Charset Function Overview +@section Overview about Character Handling Functions + +A Unix @w{C library} contains three different sets of functions in two +families to handle character set conversion. One of the function families +(the most commonly used) is specified in the @w{ISO C90} standard and, +therefore, is portable even beyond the Unix world. Unfortunately this +family is the least useful one. These functions should be avoided +whenever possible, especially when developing libraries (as opposed to +applications). 
+ +The second family of functions got introduced in the early Unix standards +(XPG2) and is still part of the latest and greatest Unix standard: +@w{Unix 98}. It is also the most powerful and useful set of functions. +But we will start with the functions defined in @w{Amendment 1} to +@w{ISO C90}. + +@node Restartable multibyte conversion +@section Restartable Multibyte Conversion Functions + +The @w{ISO C} standard defines functions to convert strings from a +multibyte representation to wide character strings. There are a number +of peculiarities: + +@itemize @bullet +@item +The character set assumed for the multibyte encoding is not specified +as an argument to the functions. Instead the character set specified by +the @code{LC_CTYPE} category of the current locale is used; see +@ref{Locale Categories}. + +@item +The functions handling more than one character at a time require NUL +terminated strings as the argument (i.e., converting blocks of text +does not work unless one can add a NUL byte at an appropriate place). +The GNU C library contains some extensions to the standard that allow +specifying a size, but basically they also expect terminated strings. +@end itemize + +Despite these limitations the @w{ISO C} functions can be used in many +contexts. In graphical user interfaces, for instance, it is not +uncommon to have functions that require text to be displayed in a wide +character string if the text is not simple ASCII. The text itself might +come from a file with translations and the user should decide about the +current locale, which determines the translation and therefore also the +external encoding used. In such a situation (and many others) the +functions described here are perfect. If more freedom while performing +the conversion is necessary take a look at the @code{iconv} functions +(@pxref{Generic Charset Conversion}). + +@menu +* Selecting the Conversion:: Selecting the conversion and its properties. 
+* Keeping the state:: Representing the state of the conversion. +* Converting a Character:: Converting Single Characters. +* Converting Strings:: Converting Multibyte and Wide Character + Strings. +* Multibyte Conversion Example:: A Complete Multibyte Conversion Example. +@end menu + +@node Selecting the Conversion +@subsection Selecting the conversion and its properties + +We already said above that the currently selected locale for the +@code{LC_CTYPE} category decides about the conversion that is performed +by the functions we are about to describe. Each locale uses its own +character set (given as an argument to @code{localedef}) and this is the +one assumed as the external multibyte encoding. The wide character +character set always is UCS-4, at least on GNU systems. + +A characteristic of each multibyte character set is the maximum number +of bytes that can be necessary to represent one character. This +information is quite important when writing code that uses the +conversion functions (as shown in the examples below). +The @w{ISO C} standard defines two macros that provide this information. + + +@comment limits.h +@comment ISO +@deftypevr Macro int MB_LEN_MAX +@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte +sequence for a single character in any of the supported locales. It is +a compile-time constant and is defined in @file{limits.h}. +@pindex limits.h +@end deftypevr + +@comment stdlib.h +@comment ISO +@deftypevr Macro int MB_CUR_MAX +@code{MB_CUR_MAX} expands into a positive integer expression that is the +maximum number of bytes in a multibyte character in the current locale. +The value is never greater than @code{MB_LEN_MAX}. Unlike +@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in +the GNU C library it is not. + +@pindex stdlib.h +@code{MB_CUR_MAX} is defined in @file{stdlib.h}. 
+@end deftypevr + +Two different macros are necessary since strictly @w{ISO C90} compilers +do not allow variable length array definitions, but still it is desirable +to avoid dynamic allocation. This incomplete piece of code shows the +problem: + +@smallexample +@{ + char buf[MB_LEN_MAX]; + ssize_t len = 0; + + while (! feof (fp)) + @{ + fread (&buf[len], 1, MB_CUR_MAX - len, fp); + /* @r{... process} buf */ + len -= used; + @} +@} +@end smallexample + +The code in the inner loop is expected to have always enough bytes in +the array @var{buf} to convert one multibyte character. The array +@var{buf} has to be sized statically since many compilers do not allow a +variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX} +bytes are always available in @var{buf}. Note that it isn't +a problem if @code{MB_CUR_MAX} is not a compile-time constant. + + +@node Keeping the state +@subsection Representing the state of the conversion + +@cindex stateful +In the introduction of this chapter it was said that certain character +sets use a @dfn{stateful} encoding. That is, the encoded values depend +in some way on the previous bytes in the text. + +Since the conversion functions allow converting a text in more than one +step we must have a way to pass this information from one call of the +functions to another. + +@comment wchar.h +@comment ISO +@deftp {Data type} mbstate_t +@cindex shift state +A variable of type @code{mbstate_t} can contain all the information +about the @dfn{shift state} needed from one call to a conversion +function to another. + +@pindex wchar.h +@code{mbstate_t} is defined in @file{wchar.h}. It was introduced in +@w{Amendment 1} to @w{ISO C90}. +@end deftp + +To use objects of type @code{mbstate_t} the programmer has to define such +objects (normally as local variables on the stack) and pass a pointer to +the object to the conversion functions. 
This way the conversion function
+can update the object if the current multibyte character set is stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state. The rules are that the object should always
+represent the initial state before the first use, and this is achieved by
+clearing the whole variable with code such as the following:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{from now on @var{state} can be used.} */
+  ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state. This is necessary, for example, to decide whether to emit
+escape sequences to set the state to the initial state at certain
+sequence points. Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+The @code{mbsinit} function determines whether the state object pointed
+to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
+the object is in the initial state the return value is nonzero. Otherwise
+it is zero.
+
+@pindex wchar.h
+@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Code using @code{mbsinit} often looks similar to this:
+
+@c Fix the example to explicitly say how to generate the escape sequence
+@c to restore the initial state.
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{Use @var{state}.} */
+  ...
+  if (! mbsinit (&state))
+    @{
+      /* @r{Emit code to return to initial state.} */
+      const wchar_t empty[] = L"";
+      const wchar_t *srcp = empty;
+      wcsrtombs (outbuf, &srcp, outbuflen, &state);
+    @}
+  ...
+@}
+@end smallexample
+
+The code to emit the escape sequence to get back to the initial state is
+interesting.
The @code{wcsrtombs} function can be used to determine the +necessary output code (@pxref{Converting Strings}). Please note that on +GNU systems it is not necessary to perform this extra action for the +conversion from multibyte text to wide character text since the wide +character encoding is not stateful. But there is nothing mentioned in +any standard that prohibits making @code{wchar_t} using a stateful +encoding. + +@node Converting a Character +@subsection Converting Single Characters + +The most fundamental of the conversion functions are those dealing with +single characters. Please note that this does not always mean single +bytes. But since there is very often a subset of the multibyte +character set that consists of single byte sequences, there are +functions to help with converting bytes. Frequently, ASCII is a subpart +of the multibyte character set. In such a scenario, each ASCII character +stands for itself, and all other characters have at least a first byte +that is beyond the range @math{0} to @math{127}. + +@comment wchar.h +@comment ISO +@deftypefun wint_t btowc (int @var{c}) +The @code{btowc} function (``byte to wide character'') converts a valid +single byte character @var{c} in the initial shift state into the wide +character equivalent using the conversion rules from the currently +selected locale of the @code{LC_CTYPE} category. + +If @code{(unsigned char) @var{c}} is no valid single byte multibyte +character or if @var{c} is @code{EOF}, the function returns @code{WEOF}. + +Please note the restriction of @var{c} being tested for validity only in +the initial shift state. No @code{mbstate_t} object is used from +which the state information is taken, and the function also does not use +any static state. + +@pindex wchar.h +The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90} +and is declared in @file{wchar.h}. 
+@end deftypefun
+
+Despite the limitation that the single byte value always is interpreted
+in the initial state this function is actually useful most of the time.
+Most character sets are either entirely single-byte character sets or they
+are extensions to ASCII. It is then possible to write code like this
+(not that this specific example is very useful):
+
+@smallexample
+wchar_t *
+itow (unsigned long int val)
+@{
+  static wchar_t buf[30];
+  wchar_t *wcp = &buf[29];
+  *wcp = L'\0';
+  while (val != 0)
+    @{
+      *--wcp = btowc ('0' + val % 10);
+      val /= 10;
+    @}
+  if (wcp == &buf[29])
+    *--wcp = L'0';
+  return wcp;
+@}
+@end smallexample
+
+Why is it necessary to use such a complicated implementation and not
+simply cast @code{'0' + val % 10} to a wide character? The answer is
+that there is no guarantee that one can perform this kind of arithmetic
+on the characters of the character set used for @code{wchar_t}
+representation. In other situations the bytes are not constant at
+compile time and so the compiler cannot do the work. In situations like
+this it is necessary to use @code{btowc}.
+
+@noindent
+There also is a function for the conversion in the other direction.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int wctob (wint_t @var{c})
+The @code{wctob} function (``wide character to byte'') takes as the
+parameter a valid wide character. If the multibyte representation for
+this character in the initial state is exactly one byte long, the return
+value of this function is this character. Otherwise the return value is
+@code{EOF}.
+
+@pindex wchar.h
+@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+There are more general functions to convert single characters from
+multibyte representation to wide characters and vice versa. These
+functions impose no limit on the length of the multibyte representation
+and they also do not require it to be in the initial state.
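Before turning to the general functions, the two single-byte functions above, @code{btowc} and @code{wctob}, can be combined into a small round-trip check. The following is only a sketch: the helper name @code{byte_round_trips} is hypothetical and not part of any library, and the result depends on the locale currently selected for @code{LC_CTYPE}.

```c
#include <wchar.h>   /* btowc, wctob, wint_t, WEOF */

/* Hypothetical helper, not a library function: return 1 if the byte C
   is a complete single-byte character in the initial shift state of
   the current LC_CTYPE locale and converts back to the same byte,
   and 0 otherwise.  */
int
byte_round_trips (unsigned char c)
{
  /* btowc returns WEOF if C is not a valid single-byte character.  */
  wint_t wc = btowc (c);
  if (wc == WEOF)
    return 0;

  /* wctob returns EOF if WC needs more than one byte.  */
  return wctob (wc) == (int) c;
}
```

A check like this is one way to find out which bytes can be handled with the single-byte functions alone; every byte for which it returns 0 has to go through the general functions.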
+ +@comment wchar.h +@comment ISO +@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps}) +@cindex stateful +The @code{mbrtowc} function (``multibyte restartable to wide +character'') converts the next multibyte character in the string pointed +to by @var{s} into a wide character and stores it in the wide character +string pointed to by @var{pwc}. The conversion is performed according +to the locale currently selected for the @code{LC_CTYPE} category. If +the conversion for the character set used in the locale requires a state, +the multibyte string is interpreted in the state represented by the +object pointed to by @var{ps}. If @var{ps} is a null pointer, a static, +internal state variable used only by the @code{mbrtowc} function is +used. + +If the next multibyte character corresponds to the NUL wide character, +the return value of the function is @math{0} and the state object is +afterwards in the initial state. If the next @var{n} or fewer bytes +form a correct multibyte character, the return value is the number of +bytes starting from @var{s} that form the multibyte character. The +conversion state is updated according to the bytes consumed in the +conversion. In both cases the wide character (either the @code{L'\0'} +or the one found in the conversion) is stored in the string pointed to +by @var{pwc} if @var{pwc} is not null. + +If the first @var{n} bytes of the multibyte string possibly form a valid +multibyte character but there are more than @var{n} bytes needed to +complete it, the return value of the function is @code{(size_t) -2} and +no value is stored. Please note that this can happen even if @var{n} +has a value greater than or equal to @code{MB_CUR_MAX} since the input +might contain redundant shift sequences. 
+
+If the first @var{n} bytes of the multibyte string cannot possibly form
+a valid multibyte character, no value is stored, the global variable
+@code{errno} is set to the value @code{EILSEQ}, and the function returns
+@code{(size_t) -1}. The conversion state is afterwards undefined.
+
+@pindex wchar.h
+@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Use of @code{mbrtowc} is straightforward. A function that copies a
+multibyte string into a wide character string while at the same time
+converting all lowercase characters into uppercase could look like this
+(this is not the final version, just an example; it has no error
+checking, and sometimes leaks memory):
+
+@smallexample
+wchar_t *
+mbstouwcs (const char *s)
+@{
+  /* @r{Include the terminating NUL byte in the conversion.} */
+  size_t len = strlen (s) + 1;
+  wchar_t *result = malloc (len * sizeof (wchar_t));
+  wchar_t *wcp = result;
+  wchar_t tmp[1];
+  mbstate_t state;
+  size_t nbytes;
+
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* Invalid input string. */
+        return NULL;
+      *wcp++ = towupper (tmp[0]);
+      len -= nbytes;
+      s += nbytes;
+    @}
+  /* @r{Terminate the result string.} */
+  *wcp = L'\0';
+  return result;
+@}
+@end smallexample
+
+The use of @code{mbrtowc} should be clear. A single wide character is
+stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
+in the variable @var{nbytes}. If the conversion is successful, the
+uppercase variant of the wide character is stored in the @var{result}
+array and the pointer to the input string and the number of available
+bytes are adjusted.
+
+The only non-obvious thing about @code{mbrtowc} might be the way memory
+is allocated for the result. The above code uses the fact that there
+can never be more wide characters in the converted result than there are
+bytes in the multibyte input string.
This method yields a pessimistic
+guess about the size of the result, and if many wide character strings
+have to be constructed this way or if the strings are long, the extra
+memory required to be allocated because the input string contains
+multibyte characters might be significant. The allocated memory block can
+be resized to the correct size before returning it, but a better solution
+might be to allocate just the right amount of space for the result right
+away. Unfortunately there is no function to compute the length of the wide
+character string directly from the multibyte string. There is, however, a
+function that does part of the work.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
+The @code{mbrlen} function (``multibyte restartable length'') computes
+the number of at most @var{n} bytes starting at @var{s}, which form the
+next valid and complete multibyte character.
+
+If the next multibyte character corresponds to the NUL wide character,
+the return value is @math{0}. If the next @var{n} bytes form a valid
+multibyte character, the number of bytes belonging to this multibyte
+character byte sequence is returned.
+
+If the first @var{n} bytes possibly form a valid multibyte
+character but the character is incomplete, the return value is
+@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
+and the return value is @code{(size_t) -1}.
+
+The multibyte sequence is interpreted in the state represented by the
+object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
+object local to @code{mbrlen} is used.
+
+@pindex wchar.h
+@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+The attentive reader now will note that @code{mbrlen} can be implemented
+as
+
+@smallexample
+mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
+@end smallexample
+
+This is true and in fact is mentioned in the official specification.
+How can this function be used to determine the length of the wide
+character string created from a multibyte character string? It is not
+directly usable, but we can define a function @code{mbslen} using it:
+
+@smallexample
+size_t
+mbslen (const char *s)
+@{
+  mbstate_t state;
+  size_t result = 0;
+  size_t nbytes;
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* @r{Something is wrong.} */
+        return (size_t) -1;
+      s += nbytes;
+      ++result;
+    @}
+  return result;
+@}
+@end smallexample
+
+This function simply calls @code{mbrlen} for each multibyte character
+in the string and counts the number of function calls. Please note that
+we use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
+call here. This is acceptable since a) this value is at least as large as
+the length of the longest multibyte character sequence and b) we know that
+the string @var{s} ends with a NUL byte, which cannot be part of any other
+multibyte character sequence but the one representing the NUL wide character.
+Therefore, the @code{mbrlen} function will never read invalid memory.
+
+Now that this function is available (just to make this clear, this
+function is @emph{not} part of the GNU C library) we can compute the
+number of bytes required to store the wide character string converted
+from the multibyte character string @var{s} using
+
+@smallexample
+wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
+@end smallexample
+
+Please note that the @code{mbslen} function is quite inefficient. The
+implementation of @code{mbstouwcs} with @code{mbslen} would have to
+perform the conversion of the multibyte character input string twice, and
+this conversion might be quite expensive. So it is necessary to think
+about the consequences of using the easier but imprecise method before
+doing the work twice.
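To make the trade-off concrete, here is a minimal sketch of the two-pass approach just described. It repeats the @code{mbslen} helper from above so that the example is self-contained; the name @code{mbstouwcs_exact} is hypothetical. The allocation is exact, but every input character is decoded twice.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrlen, mbrtowc, mbstate_t */
#include <wctype.h>   /* towupper */

/* The helper from the text: number of multibyte characters in S, or
   (size_t) -1 if S is not valid in the current LC_CTYPE locale.  */
static size_t
mbslen (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}

/* Hypothetical two-pass variant of mbstouwcs: count first, then
   allocate exactly the space needed and convert.  Returns NULL on
   invalid input or if memory is exhausted; the caller frees the
   result.  */
wchar_t *
mbstouwcs_exact (const char *s)
{
  size_t nchars = mbslen (s);   /* First pass over the input.  */
  wchar_t *result, *wcp;
  mbstate_t state;
  size_t nbytes;

  if (nchars == (size_t) -1)
    return NULL;
  result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert each character in place and uppercase it.  */
  wcp = result;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (wcp, s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          free (result);
          return NULL;
        }
      *wcp = towupper (*wcp);
      ++wcp;
      s += nbytes;
    }
  *wcp = L'\0';
  return result;
}
```

Whether the saved memory is worth the second decoding pass depends on the application; for short strings the pessimistic single-pass allocation is usually the simpler choice.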
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
+The @code{wcrtomb} function (``wide character restartable to
+multibyte'') converts a single wide character into a multibyte string
+corresponding to that wide character.
+
+If @var{s} is a null pointer, the function resets the state stored in
+the object pointed to by @var{ps} (or the internal @code{mbstate_t}
+object) to the initial state. This can also be achieved by a call like
+this:
+
+@smallexample
+wcrtomb (temp_buf, L'\0', ps)
+@end smallexample
+
+@noindent
+since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
+writes into an internal buffer, which is guaranteed to be large enough.
+
+If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
+necessary, a shift sequence to get the state @var{ps} into the initial
+state followed by a single NUL byte, which is stored in the string
+@var{s}.
+
+Otherwise a byte sequence (possibly including shift sequences) is written
+into the string @var{s}. This only happens if @var{wc} is a valid wide
+character (i.e., it has a multibyte representation in the character set
+selected by the locale of the @code{LC_CTYPE} category). If @var{wc} is not
+a valid wide character, nothing is stored in the string @var{s},
+@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
+is undefined and the return value is @code{(size_t) -1}.
+
+If no error occurred the function returns the number of bytes stored in
+the string @var{s}. This includes all bytes representing shift
+sequences.
+
+One word about the interface of the function: there is no parameter
+specifying the length of the array @var{s}. Instead the function
+assumes that there are at least @code{MB_CUR_MAX} bytes available since
+this is the maximum length of any byte sequence representing a single
+character.
So the caller has to make sure that there is enough space +available; otherwise buffer overruns can occur. + +@pindex wchar.h +@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is +declared in @file{wchar.h}. +@end deftypefun + +Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following +example appends a wide character string to a multibyte character string. +Again, the code is not really useful (or correct); it is simply here to +demonstrate the use and some problems. + +@smallexample +char * +mbscatwcs (char *s, size_t len, const wchar_t *ws) +@{ + mbstate_t state; + /* @r{Find the end of the existing string.} */ + char *wp = strchr (s, '\0'); + len -= wp - s; + memset (&state, '\0', sizeof (state)); + do + @{ + size_t nbytes; + if (len < MB_CUR_MAX) + @{ + /* @r{We cannot guarantee that the next} + @r{character fits into the buffer, so} + @r{return an error.} */ + errno = E2BIG; + return NULL; + @} + nbytes = wcrtomb (wp, *ws, &state); + if (nbytes == (size_t) -1) + /* @r{Error in the conversion.} */ + return NULL; + len -= nbytes; + wp += nbytes; + @} + while (*ws++ != L'\0'); + return s; +@} +@end smallexample + +First the function has to find the end of the string currently in the +array @var{s}. The @code{strchr} call does this very efficiently since a +requirement for multibyte character representations is that the NUL byte +is never used except to represent itself (and in this context, the end +of the string). + +After initializing the state object the loop is entered where the first +task is to make sure there is enough room in the array @var{s}. We +abort if there are not at least @code{MB_CUR_MAX} bytes available. This +is not always optimal but we have no other choice. We might have fewer +than @code{MB_CUR_MAX} bytes available but the next multibyte character +might also be only one byte long. At the time the @code{wcrtomb} call +returns it is too late to decide whether the buffer was large enough. 
If +this solution is unsuitable, there is a very slow but more accurate +solution. + +@smallexample + ... + if (len < MB_CUR_MAX) + @{ + mbstate_t temp_state; + char temp_buf[MB_LEN_MAX]; + memcpy (&temp_state, &state, sizeof (state)); + if (wcrtomb (temp_buf, *ws, &temp_state) > len) + @{ + /* @r{We cannot guarantee that the next} + @r{character fits into the buffer, so} + @r{return an error.} */ + errno = E2BIG; + return NULL; + @} + @} + ... +@end smallexample + +Here we first perform the conversion that might overflow the buffer +into a scratch buffer, so that we are afterwards in the position to make +an exact decision about the buffer size. Please note that a null pointer +cannot be used as the destination in this trial @code{wcrtomb} call: as +described above, with a null destination the function ignores the wide +character and only resets the state. The +most unusual thing about this piece of code certainly is the duplication +of the conversion state object, but if a change of the state is necessary +to emit the next multibyte character, we want to have the same shift state +change performed in the real conversion. Therefore, we have to preserve +the initial shift state information. + +There are certainly many more and even better solutions to this problem. +This example is only provided for educational purposes. + +@node Converting Strings +@subsection Converting Multibyte and Wide Character Strings + +The functions described in the previous section only convert a single +character at a time. Most operations to be performed in real-world +programs involve strings and therefore the @w{ISO C} standard also +defines conversions on entire strings. However, the defined set of +functions is quite limited; therefore, the GNU C library contains a few +extensions that can help in some important situations. 
+ +@comment wchar.h +@comment ISO +@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{mbsrtowcs} function (``multibyte string restartable to wide +character string'') converts a NUL-terminated multibyte character +string at @code{*@var{src}} into an equivalent wide character string, +including the NUL wide character at the end. The conversion is started +using the state information from the object pointed to by @var{ps} or +from an internal object of @code{mbsrtowcs} if @var{ps} is a null +pointer. Before returning, the state object is updated to match the state +after the last converted character. The state is the initial state if the +terminating NUL byte is reached and converted. + +If @var{dst} is not a null pointer, the result is stored in the array +pointed to by @var{dst}; otherwise, the conversion result is not +available since it is stored in an internal buffer. + +If @var{len} wide characters are stored in the array @var{dst} before +reaching the end of the input string, the conversion stops and @var{len} +is returned. If @var{dst} is a null pointer, @var{len} is never checked. + +Another reason for a premature return from the function call is if the +input string contains an invalid multibyte sequence. In this case the +global variable @code{errno} is set to @code{EILSEQ} and the function +returns @code{(size_t) -1}. + +@c XXX The ISO C9x draft seems to have a problem here. It says that PS +@c is not updated if DST is NULL. This is not said straightforward and +@c none of the other functions is described like this. It would make sense +@c to define the function this way but I don't think it is meant like this. + +In all other cases the function returns the number of wide characters +converted during this call. 
If @var{dst} is not null, @code{mbsrtowcs} +stores in the pointer pointed to by @var{src} either a null pointer (if +the NUL byte in the input string was reached) or the address of the byte +following the last converted multibyte character. + +@pindex wchar.h +@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is +declared in @file{wchar.h}. +@end deftypefun + +The definition of the @code{mbsrtowcs} function has one important +limitation. The requirement that the input at @code{*@var{src}} has to be +a NUL-terminated string causes problems if one wants to convert buffers with text. A +buffer is normally not a collection of NUL-terminated strings but instead a +continuous collection of lines, separated by newline characters. Now +assume that a function to convert one line from a buffer is needed. Since +the line is not NUL-terminated, the source pointer cannot directly point +into the unmodified text buffer. This means that either one inserts the NUL +byte at the appropriate place for the time of the @code{mbsrtowcs} +function call (which is not doable for a read-only buffer or in a +multi-threaded application) or one copies the line into an extra buffer +where it can be terminated by a NUL byte. Note that it is not in general +possible to limit the number of characters to convert by setting the +parameter @var{len} to any specific value. Since it is not known how +many bytes each multibyte character sequence is in length, one can only +guess. + +@cindex stateful +There is still a problem with the method of NUL-terminating a line right +after the newline character, which could lead to very strange results. +As said in the description of the @code{mbsrtowcs} function above the +conversion state is guaranteed to be in the initial shift state after +processing the NUL byte at the end of the input string. 
But this NUL +byte is not really part of the text (i.e., the conversion state after +the newline in the original text could be something different from the +initial shift state and therefore the first character of the next line +is encoded using this state). But the state in question is never +accessible to the user since the conversion stops after the NUL byte +(which resets the state). Most stateful character sets in use today +require that the shift state after a newline be the initial state--but +this is not a strict guarantee. Therefore, simply NUL-terminating a +piece of a running text is not always an adequate solution and +should never be used in general-purpose code. + +The generic conversion interface (@pxref{Generic Charset Conversion}) +does not have this limitation (it simply works on buffers, not +strings), and the GNU C library contains a set of functions that take +additional parameters specifying the maximal number of bytes that are +consumed from the input string. This way the problem described above for +@code{mbsrtowcs} could be solved by determining the line +length and passing this length to the function. + +@comment wchar.h +@comment ISO +@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{wcsrtombs} function (``wide character string restartable to +multibyte string'') converts the NUL-terminated wide character string at +@code{*@var{src}} into an equivalent multibyte character string and +stores the result in the array pointed to by @var{dst}. The NUL wide +character is also converted. The conversion starts in the state +described in the object pointed to by @var{ps} or by a state object +local to @code{wcsrtombs} in case @var{ps} is a null pointer. If +@var{dst} is a null pointer, the conversion is performed as usual but the +result is not available. 
If all characters of the input string were +successfully converted and if @var{dst} is not a null pointer, the +pointer pointed to by @var{src} gets assigned a null pointer. + +If one of the wide characters in the input string has no valid multibyte +character equivalent, the conversion stops early, sets the global +variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. + +Another reason for a premature stop is if @var{dst} is not a null +pointer and the next converted character would cause more than +@var{len} bytes in total to be stored in the array @var{dst}. In this case (and if +@var{dst} is not a null pointer) the pointer pointed to by @var{src} is +assigned a value pointing to the wide character right after the last one +successfully converted. + +Except in the case of an encoding error the return value of the +@code{wcsrtombs} function is the number of bytes in all the multibyte +character sequences stored in @var{dst}. Before returning the state in +the object pointed to by @var{ps} (or the internal object in case +@var{ps} is a null pointer) is updated to reflect the state after the +last conversion. The state is the initial shift state in case the +terminating NUL wide character was converted. + +@pindex wchar.h +The @code{wcsrtombs} function was introduced in @w{Amendment 1} to +@w{ISO C90} and is declared in @file{wchar.h}. +@end deftypefun + +The restriction mentioned above for the @code{mbsrtowcs} function applies +here also. There is no possibility of directly controlling the number of +input characters. One has to place the NUL wide character at the correct +place or control the consumed input indirectly via the available output +array size (the @var{len} parameter). 
+ +@comment wchar.h +@comment GNU +@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} +function. 
All the parameters are the same except for @var{nmc}, which is +new. The return value is the same as for @code{mbsrtowcs}. + +This new parameter specifies how many bytes at most can be used from the +multibyte character string. In other words, the multibyte character +string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte +is found within the @var{nmc} first bytes of the string, the conversion +stops there. + +This function is a GNU extension. It is meant to work around the +problems mentioned above. Now it is possible to convert a buffer with +multibyte character text piece by piece without having to care about +inserting NUL bytes and the effect of NUL bytes on the conversion state. +@end deftypefun + +A function to convert a multibyte string into a wide character string +and display it could be written like this (this is not a really useful +example): + +@smallexample +void +showmbs (const char *src, FILE *fp) +@{ + mbstate_t state; + int cnt = 0; + memset (&state, '\0', sizeof (state)); + while (1) + @{ + wchar_t linebuf[100]; + const char *endp = strchr (src, '\n'); + size_t n; + + /* @r{Exit if there are no more lines.} */ + if (endp == NULL) + break; + + n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); + if (n == (size_t) -1) + /* @r{Invalid multibyte sequence.} */ + break; + linebuf[n] = L'\0'; + fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf); + /* @r{Skip the newline once the whole line is consumed.} */ + if (src == endp) + ++src; + @} +@} +@end smallexample + +There is no problem with the state after a call to @code{mbsnrtowcs}. +Since we don't insert characters in the strings that were not in there +right from the beginning and we use @var{state} only for the conversion +of the given buffer, there is no problem with altering the state. 
+ +@comment wchar.h +@comment GNU +@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{wcsnrtombs} function implements the conversion from wide +character strings to multibyte character strings. 
It is similar to +@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra +parameter, which specifies the length of the input string. + +No more than @var{nwc} wide characters from the input string +@code{*@var{src}} are converted. If the input string contains a NUL +wide character in the first @var{nwc} characters, the conversion stops at +this place. + +The @code{wcsnrtombs} function is a GNU extension and just like +@code{mbsnrtowcs} helps in situations where no NUL-terminated input +strings are available. +@end deftypefun + + +@node Multibyte Conversion Example +@subsection A Complete Multibyte Conversion Example + +The example programs given in the last sections are only brief and do +not contain all the error checking, etc. Presented here is a complete +and documented example. It features the @code{mbrtowc} function but it +should be easy to derive versions using the other functions. + +@smallexample +int +file_mbsrtowcs (int input, int output) +@{ + /* @r{Note the use of @code{MB_LEN_MAX}.} + @r{@code{MB_CUR_MAX} cannot portably be used here.} */ + char buffer[BUFSIZ + MB_LEN_MAX]; + mbstate_t state; + int filled = 0; + int eof = 0; + + /* @r{Initialize the state.} */ + memset (&state, '\0', sizeof (state)); + + while (!eof) + @{ + ssize_t nread; + ssize_t nwrite; + char *inp = buffer; + /* @r{Each input byte yields at most one wide character.} */ + wchar_t outbuf[BUFSIZ + MB_LEN_MAX]; + wchar_t *outp = outbuf; + + /* @r{Fill up the buffer from the input file.} */ + nread = read (input, buffer + filled, BUFSIZ); + if (nread < 0) + @{ + perror ("read"); + return 0; + @} + /* @r{If we reach end of file, make a note to read no more.} */ + if (nread == 0) + eof = 1; + + /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */ + filled += nread; + + /* @r{Convert those bytes to wide characters--as many as we can.} */ + while (1) + @{ + /* @r{Remember the state so that the bytes of an incomplete} + @r{character can be converted again later.} */ + mbstate_t saved = state; + size_t thislen = mbrtowc (outp, inp, filled, &state); + /* @r{Stop converting at an invalid or incomplete} + @r{character; the latter can mean we have read just the} + @r{first part of a valid 
character.} */ + if (thislen == (size_t) -1 || thislen == (size_t) -2) + @{ + state = saved; + break; + @} + /* @r{We want to handle embedded NUL bytes} + @r{but the return value is 0. Correct this.} */ + if (thislen == 0) + thislen = 1; + /* @r{Advance past this character.} */ + inp += thislen; + filled -= thislen; + ++outp; + @} + + /* @r{Write the wide characters we just made.} */ + nwrite = write (output, outbuf, + (outp - outbuf) * sizeof (wchar_t)); + if (nwrite < 0) + @{ + perror ("write"); + return 0; + @} + + /* @r{See if we have a @emph{real} invalid character.} */ + if ((eof && filled > 0) || filled >= MB_CUR_MAX) + @{ + error (0, 0, "invalid multibyte character"); + return 0; + @} + + /* @r{If any characters must be carried forward,} + @r{put them at the beginning of @code{buffer}.} */ + if (filled > 0) + memmove (buffer, inp, filled); + @} + + return 1; +@} +@end smallexample + + +@node Non-reentrant Conversion +@section Non-reentrant Conversion Function + +The functions described in the previous section are defined in +@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard +also contained functions for character set conversion. The reason that +these original functions are not described first is that they are almost +entirely useless. + +The problem is that all the conversion functions described in the +original @w{ISO C90} use a local state. Using a local state implies that +multiple conversions at the same time (not only when using threads) +cannot be done, and that you cannot first convert single characters and +then strings since you cannot tell the conversion functions which state +to use. + +These original functions are therefore usable only in a very limited set +of situations. One must complete converting the entire string before +starting a new one, and each string/text must be converted with the same +function (there is no problem with the library itself; it is guaranteed +that no library function changes the state of any of these functions). 
+@strong{For the above reasons it is strongly recommended that the functions +described in the previous section be used in place of non-reentrant +conversion functions.} + +@menu +* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single + Characters. +* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. +* Shift State:: States in Non-reentrant Functions. +@end menu + +@node Non-reentrant Character Conversion +@subsection Non-reentrant Conversion of Single Characters + +@comment stdlib.h +@comment ISO +@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size}) +The @code{mbtowc} (``multibyte to wide character'') function when called +with non-null @var{string} converts the first multibyte character +beginning at @var{string} to its corresponding wide character code. It +stores the result in @code{*@var{result}}. + +@code{mbtowc} never examines more than @var{size} bytes. (The idea is +to supply for @var{size} the number of bytes of data you have in hand.) + +@code{mbtowc} with non-null @var{string} distinguishes three +possibilities: the first @var{size} bytes at @var{string} start with +a valid multibyte character, they start with an invalid byte sequence or +just part of a character, or @var{string} points to an empty string (a +null character). + +For a valid multibyte character, @code{mbtowc} converts it to a wide +character and stores that in @code{*@var{result}}, and returns the +number of bytes in that character (always at least @math{1} and never +more than @var{size}). + +For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an +empty string, it returns @math{0}, also storing @code{'\0'} in +@code{*@var{result}}. + +If the multibyte character code uses shift characters, then +@code{mbtowc} maintains and updates a shift state as it scans. 
If you +call @code{mbtowc} with a null pointer for @var{string}, that +initializes the shift state to its standard initial value. It also +returns nonzero if the multibyte character code in use actually has a +shift state. @xref{Shift State}. +@end deftypefun + +@comment stdlib.h +@comment ISO +@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) +The @code{wctomb} (``wide character to multibyte'') function converts +the wide character code @var{wchar} to its corresponding multibyte +character sequence, and stores the result in bytes starting at +@var{string}. At most @code{MB_CUR_MAX} bytes are stored. + +@code{wctomb} with non-null @var{string} distinguishes three +possibilities for @var{wchar}: a valid wide character code (one that can +be translated to a multibyte character), an invalid code, and +@code{L'\0'}. + +Given a valid code, @code{wctomb} converts it to a multibyte character, +storing the bytes starting at @var{string}. Then it returns the number +of bytes in that character (always at least @math{1} and never more +than @code{MB_CUR_MAX}). + +If @var{wchar} is an invalid wide character code, @code{wctomb} returns +@math{-1}. If @var{wchar} is @code{L'\0'}, it stores a null byte, +preceded by any shift sequence needed to return to the initial shift +state, and returns the number of bytes stored (@math{1} when no shift +sequence is needed). + +If the multibyte character code uses shift characters, then +@code{wctomb} maintains and updates a shift state as it scans. If you +call @code{wctomb} with a null pointer for @var{string}, that +initializes the shift state to its standard initial value. It also +returns nonzero if the multibyte character code in use actually has a +shift state. @xref{Shift State}. 
+ +Calling this function with a @var{wchar} argument of zero when +@var{string} is not null has the side-effect of reinitializing the +stored shift state @emph{as well as} storing the multibyte character +@code{'\0'} and returning the number of bytes stored. 
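As a small illustration of these rules, the sketch below converts a whole wide character string one character at a time with @code{wctomb}. The function name @code{wcs_to_mbs_simple} is hypothetical. The sketch also assumes an encoding without shift sequences, because it writes the terminating null byte itself instead of asking @code{wctomb} for it.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a library function: convert WS into the
   SIZE-byte array BUF using wctomb.  Returns the number of bytes
   stored, not counting the terminating null byte, or -1 on error.
   Assumes an encoding without shift sequences.  */
int
wcs_to_mbs_simple (char *buf, size_t size, const wchar_t *ws)
{
  char tmp[MB_LEN_MAX];
  size_t used = 0;

  if (size == 0)
    return -1;

  /* Put the internal shift state into its initial value.  */
  wctomb (NULL, L'\0');

  for (; *ws != L'\0'; ++ws)
    {
      /* Convert into a scratch buffer first, since wctomb has no
         length parameter.  */
      int n = wctomb (tmp, *ws);
      /* Fail on an invalid character or if the bytes plus the
         terminating null byte would not fit.  */
      if (n < 0 || (size_t) n >= size - used)
        return -1;
      memcpy (buf + used, tmp, n);
      used += n;
    }
  buf[used] = '\0';
  return (int) used;
}
```

Converting into the scratch array @code{tmp} first is one way to cope with the missing length parameter; it costs an extra copy per character.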
+@end deftypefun + +Similar to @code{mbrlen} there is also a non-reentrant function that +computes the length of a multibyte character. It can be defined in +terms of @code{mbtowc}. + +@comment stdlib.h +@comment ISO +@deftypefun int mblen (const char *@var{string}, size_t @var{size}) +The @code{mblen} function with a non-null @var{string} argument returns +the number of bytes that make up the multibyte character beginning at +@var{string}, never examining more than @var{size} bytes. (The idea is +to supply for @var{size} the number of bytes of data you have in hand.) + +The return value of @code{mblen} distinguishes three possibilities: the +first @var{size} bytes at @var{string} start with valid multibyte +characters, they start with an invalid byte sequence or just part of a +character, or @var{string} points to an empty string (a null character). + +For a valid multibyte character, @code{mblen} returns the number of +bytes in that character (always at least @code{1} and never more than +@var{size}). For an invalid byte sequence, @code{mblen} returns +@math{-1}. For an empty string, it returns @math{0}. + +If the multibyte character code uses shift characters, then @code{mblen} +maintains and updates a shift state as it scans. If you call +@code{mblen} with a null pointer for @var{string}, that initializes the +shift state to its standard initial value. It also returns a nonzero +value if the multibyte character code in use actually has a shift state. +@xref{Shift State}. + +@pindex stdlib.h +The function @code{mblen} is declared in @file{stdlib.h}. +@end deftypefun + + +@node Non-reentrant String Conversion +@subsection Non-reentrant Conversion of Strings + +For convenience the @w{ISO C90} standard also defines functions to +convert entire strings instead of single characters. These functions +suffer from the same problems as their reentrant counterparts from +@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}. 
+ +@comment stdlib.h +@comment ISO +@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size}) +The @code{mbstowcs} (``multibyte string to wide character string'') +function converts the null-terminated string of multibyte characters +@var{string} to an array of wide character codes, storing not more than +@var{size} wide characters into the array beginning at @var{wstring}. +The terminating null character counts towards the size, so if @var{size} +is less than the actual number of wide characters resulting from +@var{string}, no terminating null character is stored. + +The conversion of characters from @var{string} begins in the initial +shift state. + +If an invalid multibyte character sequence is found, the @code{mbstowcs} +function returns a value of @math{-1}. Otherwise, it returns the number +of wide characters stored in the array @var{wstring}. This number does +not include the terminating null character, which is present if the +number is less than @var{size}. + +Here is an example showing how to convert a string of multibyte +characters, allocating enough space for the result. + +@smallexample +wchar_t * +mbstowcs_alloc (const char *string) +@{ + size_t size = strlen (string) + 1; + wchar_t *buf = xmalloc (size * sizeof (wchar_t)); + + size = mbstowcs (buf, string, size); + if (size == (size_t) -1) + @{ + /* @r{Do not leak the buffer on error.} */ + free (buf); + return NULL; + @} + buf = xrealloc (buf, (size + 1) * sizeof (wchar_t)); + return buf; +@} +@end smallexample + +@end deftypefun + +@comment stdlib.h +@comment ISO +@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size}) +The @code{wcstombs} (``wide character string to multibyte string'') +function converts the null-terminated wide character array @var{wstring} +into a string containing multibyte characters, storing not more than +@var{size} bytes starting at @var{string}, followed by a terminating +null character if there is room. 
The conversion of characters begins in +the initial shift state. + +The terminating null character counts towards the size, so if @var{size} +is less than or equal to the number of bytes needed in @var{wstring}, no +terminating null character is stored. + +If a code that does not correspond to a valid multibyte character is +found, the @code{wcstombs} function returns a value of @math{-1}. +Otherwise, the return value is the number of bytes stored in the array +@var{string}. This number does not include the terminating null character, +which is present if the number is less than @var{size}. +@end deftypefun + +@node Shift State +@subsection States in Non-reentrant Functions + +In some multibyte character codes, the @emph{meaning} of any particular +byte sequence is not fixed; it depends on what other sequences have come +earlier in the same string. Typically there are just a few sequences that +can change the meaning of other sequences; these few are called +@dfn{shift sequences} and we say that they set the @dfn{shift state} for +other sequences that follow. + +To illustrate shift state and shift sequences, suppose we decide that +the sequence @code{0200} (just one byte) enters Japanese mode, in which +pairs of bytes in the range from @code{0240} to @code{0377} are single +characters, while @code{0201} enters Latin-1 mode, in which single bytes +in the range from @code{0240} to @code{0377} are characters, and +interpreted according to the ISO Latin-1 character set. This is a +multibyte code that has two alternative shift states (``Japanese mode'' +and ``Latin-1 mode''), and two shift sequences that specify particular +shift states. + +When the multibyte character code in use has shift states, then +@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update +the current shift state as they scan the string. 
To make this work +properly, you must follow these rules: + +@itemize @bullet +@item +Before starting to scan a string, call the function with a null pointer +for the multibyte character address---for example, @code{mblen (NULL, +0)}. This initializes the shift state to its standard initial value. + +@item +Scan the string one character at a time, in order. Do not ``back up'' +and rescan characters already scanned, and do not intersperse the +processing of different strings. +@end itemize + +Here is an example of using @code{mblen} following these rules: + +@smallexample +void +scan_string (char *s) +@{ + int length = strlen (s); + + /* @r{Initialize shift state.} */ + mblen (NULL, 0); + + while (1) + @{ + int thischar = mblen (s, length); + /* @r{Deal with end of string and invalid characters.} */ + if (thischar == 0) + break; + if (thischar == -1) + @{ + error (0, 0, "invalid multibyte character"); + break; + @} + /* @r{Advance past this character.} */ + s += thischar; + length -= thischar; + @} +@} +@end smallexample + +The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not +reentrant when using a multibyte code that uses a shift state. However, +no other library functions call these functions, so you don't have to +worry that the shift state will be changed mysteriously. + + +@node Generic Charset Conversion +@section Generic Charset Conversion + +The conversion functions mentioned so far in this chapter all had in +common that they operate on character sets that are not directly +specified by the functions. The multibyte encoding used is specified by +the currently selected locale for the @code{LC_CTYPE} category. The +wide character set is fixed by the implementation (in the case of the GNU C +library it is always UCS-4 encoded @w{ISO 10646}). 
+ +This of course causes several problems when it comes to general character +conversion: + +@itemize @bullet +@item +For every conversion where neither the source nor the destination +character set is the character set of the locale for the @code{LC_CTYPE} +category, one has to change the @code{LC_CTYPE} locale using +@code{setlocale}. + +Changing the @code{LC_CTYPE} locale introduces major problems for the rest +of the program since several more functions (e.g., the character +classification functions, @pxref{Classification of Characters}) use the +@code{LC_CTYPE} category. + +@item +Parallel conversions to and from different character sets are not +possible since the @code{LC_CTYPE} selection is global and shared by all +threads. + +@item +If neither the source nor the destination character set is the character +set used for @code{wchar_t} representation, there is at least a two-step +process necessary to convert a text using the functions above. One would +have to select the source character set as the multibyte encoding, +convert the text into a @code{wchar_t} text, select the destination +character set as the multibyte encoding, and convert the wide character +text to the multibyte (@math{=} destination) character set. + +Even if this is possible (which is not guaranteed) it is very tedious +work. It also suffers from the other two problems raised above even more, due to +the constant changing of the locale. +@end itemize + +The XPG2 standard defines a completely new set of functions, which has +none of these limitations. They are not at all coupled to the selected +locales, and they have no constraints on the character sets selected for +source and destination. Only the set of available conversions limits +them. The standard does not specify that any conversion at all must be +available. Such availability is a measure of the quality of the +implementation. 
+ +In the following text, first the interface to @code{iconv} and then the +conversion function itself will be described. 
Comparisons with other +implementations will show what obstacles stand in the way of portable +applications. Finally, the implementation is described insofar as it might +interest the advanced user who wants to extend conversion capabilities. + +@menu +* Generic Conversion Interface:: Generic Character Set Conversion Interface. +* iconv Examples:: A complete @code{iconv} example. +* Other iconv Implementations:: Some Details about other @code{iconv} + Implementations. +* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C + library. +@end menu + +@node Generic Conversion Interface +@subsection Generic Character Set Conversion Interface + +This set of functions follows the traditional cycle of using a resource: +open--use--close. The interface consists of three functions, each of +which implements one step. + +Before the interfaces are described it is necessary to introduce a +data type. Just like other open--use--close interfaces the functions +introduced here work using handles and the @file{iconv.h} header +defines a special type for the handles used. + +@comment iconv.h +@comment XPG2 +@deftp {Data Type} iconv_t +This data type is an abstract type defined in @file{iconv.h}. The user +must not assume anything about the definition of this type; it must be +completely opaque. + +Objects of this type can get assigned handles for the conversions using +the @code{iconv} functions. The objects themselves need not be freed, but +the conversions that the handles stand for must be. +@end deftp + +@noindent +The first step is the function to create a handle. + +@comment iconv.h +@comment XPG2 +@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode}) +The @code{iconv_open} function has to be used before starting a +conversion. 
The two parameters this function takes determine the +source and destination character set for the conversion, and if the +implementation is able to perform such a conversion, the +function returns a handle. + +If the requested conversion is not available, the @code{iconv_open} function +returns @code{(iconv_t) -1}. In this case the global variable +@code{errno} can have the following values: + +@table @code +@item EMFILE +The process already has @code{OPEN_MAX} file descriptors open. +@item ENFILE +The system limit of open files is reached. +@item ENOMEM +Not enough memory to carry out the operation. +@item EINVAL +The conversion from @var{fromcode} to @var{tocode} is not supported. +@end table + +It is not possible to use the same descriptor in different threads to +perform independent conversions. The data structures associated +with the descriptor include information about the conversion state. +This must not be messed up by using it in different conversions. + +An @code{iconv} descriptor is like a file descriptor in that a new +descriptor must be created for every use. The descriptor does not stand +for all of the conversions from @var{fromcode} to @var{tocode}. + +The GNU C library implementation of @code{iconv_open} has one +significant extension to other implementations. To ease the extension +of the set of available conversions, the implementation allows storing +the necessary files with data and code in an arbitrary number of +directories. How this extension must be written will be explained below +(@pxref{glibc iconv Implementation}). Here it is only important to say +that all directories mentioned in the @code{GCONV_PATH} environment +variable are considered only if they contain a file @file{gconv-modules}. +These directories need not necessarily be created by the system +administrator. In fact, this extension is introduced to help users +write and use their own, new conversions. 
Of course, for security reasons this does not +work in SUID binaries; in this case only the system +directory is considered and this normally is +@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment +variable is examined exactly once at the first call of the +@code{iconv_open} function. Later modifications of the variable have no +effect. + +@pindex iconv.h +The @code{iconv_open} function was introduced early in the X/Open +Portability Guide, @w{version 2}. It is supported by all commercial +Unices as it is required for the Unix branding. However, the quality and +completeness of the implementation vary widely. The @code{iconv_open} +function is declared in @file{iconv.h}. +@end deftypefun + +The @code{iconv} implementation can associate large data structures with +the handle returned by @code{iconv_open}. Therefore, it is crucial to +free all the resources once all conversions are carried out and the +conversion is not needed anymore. + +@comment iconv.h +@comment XPG2 +@deftypefun int iconv_close (iconv_t @var{cd}) +The @code{iconv_close} function frees all resources associated with the +handle @var{cd}, which must have been returned by a successful call to +the @code{iconv_open} function. + +If the function call was successful the return value is @math{0}. +Otherwise it is @math{-1} and @code{errno} is set appropriately. +Defined errors are: + +@table @code +@item EBADF +The conversion descriptor is invalid. +@end table + +@pindex iconv.h +The @code{iconv_close} function was introduced together with the rest +of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}. +@end deftypefun + +The standard defines only one actual conversion function. This has, +therefore, the most general interface: it allows conversion from one +buffer to another. Conversion from a file to a buffer, vice versa, or +even file to file can be implemented on top of it. 
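The open--use--close cycle can be sketched for a single in-memory buffer as follows. This is only a minimal illustration, not part of the manual's own example: the function name `latin1_to_utf8` is hypothetical, and the charset names `"ISO-8859-1"` and `"UTF-8"` are assumptions that the GNU C library accepts but that the standard does not guarantee on other systems.

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated ISO-8859-1 string IN
   to UTF-8 and store the NUL-terminated result in OUT, which has room
   for OUTSIZE bytes.  Returns 0 on success and -1 on any error.  */
int
latin1_to_utf8 (const char *in, char *out, size_t outsize)
{
  if (outsize == 0)
    return -1;

  /* Open: note the argument order, destination first.  The charset
     names are an assumption; the standard mandates none.  */
  iconv_t cd = iconv_open ("UTF-8", "ISO-8859-1");
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return -1;
    }

  /* Use: iconv advances the pointers and decrements the counts.  */
  char *inptr = (char *) in;       /* The input bytes are not modified.  */
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft = outsize - 1;    /* Reserve one byte for the NUL.  */
  int result = 0;

  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    {
      perror ("iconv");            /* E2BIG, EILSEQ, or EINVAL.  */
      result = -1;
    }
  *outptr = '\0';

  /* Close: release the resources associated with the handle.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      result = -1;
    }

  return result;
}
```

A caller would pass a sufficiently large output buffer; a single ISO-8859-1 byte can expand to two bytes in UTF-8, so twice the input length plus one is enough for this particular pair of character sets.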
+ +@comment iconv.h +@comment XPG2 +@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) +@cindex stateful +The @code{iconv} function converts the text in the input buffer +according to the rules associated with the descriptor @var{cd} and +stores the result in the output buffer. It is possible to call the +function for the same text several times in a row since for stateful +character sets the necessary state information is kept in the data +structures associated with the descriptor. + +The input buffer is specified by @code{*@var{inbuf}} and it contains +@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for +communicating the used input back to the caller (see below). It is +important to note that the buffer pointer is of type @code{char} and the +length is measured in bytes even if the input text is encoded in wide +characters. + +The output buffer is specified in a similar way. @code{*@var{outbuf}} +points to the beginning of the buffer with at least +@code{*@var{outbytesleft}} bytes room for the result. The buffer +pointer again is of type @code{char} and the length is measured in +bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the +conversion is performed but no output is available. + +If @var{inbuf} is a null pointer, the @code{iconv} function performs the +necessary action to put the state of the conversion into the initial +state. This is obviously a no-op for non-stateful encodings, but if the +encoding has a state, such a function call might put some byte sequences +in the output buffer, which perform the necessary state changes. The +next call with @var{inbuf} not being a null pointer then simply goes on +from the initial state. It is important that the programmer never makes +any assumption as to whether the conversion has to deal with states. 
+Even if the input and output character sets are not stateful, the +implementation might still have to keep states. This is due to the +implementation chosen for the GNU C library as it is described below. +Therefore an @code{iconv} call to reset the state should always be +performed if some protocol requires this for the output text. + +The conversion stops for one of three reasons. The first is that all +characters from the input buffer are converted. This actually can mean +two things: either all bytes from the input buffer are consumed or +there are some bytes at the end of the buffer that possibly can form a +complete character but the input is incomplete. The second reason for a +stop is that the output buffer is full. And the third reason is that +the input contains invalid characters. + +In all of these cases the buffer pointers after the last successful +conversion, for input and output buffer, are stored in @var{inbuf} and +@var{outbuf}, and the available room in each buffer is stored in +@var{inbytesleft} and @var{outbytesleft}. + +Since the character sets selected in the @code{iconv_open} call can be +almost arbitrary, there can be situations where the input buffer contains +valid characters, which have no identical representation in the output +character set. The behavior in this situation is undefined. The +@emph{current} behavior of the GNU C library in this situation is to +return with an error immediately. This certainly is not the most +desirable solution; therefore, future versions will provide better ones, +but they are not yet finished. + +If all input from the input buffer is successfully converted and stored +in the output buffer, the function returns the number of non-reversible +conversions performed. In all other cases the return value is +@code{(size_t) -1} and @code{errno} is set appropriately. In such cases +the value pointed to by @var{inbytesleft} is nonzero. 
+ +@table @code +@item EILSEQ +The conversion stopped because of an invalid byte sequence in the input. +After the call, @code{*@var{inbuf}} points at the first byte of the +invalid byte sequence. + +@item E2BIG +The conversion stopped because it ran out of space in the output buffer. + +@item EINVAL +The conversion stopped because of an incomplete byte sequence at the end +of the input buffer. + +@item EBADF +The @var{cd} argument is invalid. +@end table + +@pindex iconv.h +The @code{iconv} function was introduced in the XPG2 standard and is +declared in the @file{iconv.h} header. +@end deftypefun + +The definition of the @code{iconv} function is quite good overall. It +provides quite flexible functionality. The only problems lie in the +boundary cases, which are incomplete byte sequences at the end of the +input buffer and invalid input. A third problem, which is not really +a design problem, is the way conversions are selected. The standard +does not say anything about the legitimate names or about a minimal set of +available conversions. We will see below how this negatively impacts +other implementations. + +@node iconv Examples +@subsection A complete @code{iconv} example + +The example below features a solution for a common problem. Given that +one knows the internal encoding used by the system for @code{wchar_t} +strings, one often needs to read text from a file and store +it in wide character buffers. One can do this using @code{mbsrtowcs}, +but then we run into the problems discussed above. 
+ +@smallexample +int +file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail) +@{ + char inbuf[BUFSIZ]; + size_t insize = 0; + char *wrptr = (char *) outbuf; + int result = 0; + iconv_t cd; + + cd = iconv_open ("WCHAR_T", charset); + if (cd == (iconv_t) -1) + @{ + /* @r{Something went wrong.} */ + if (errno == EINVAL) + error (0, 0, "conversion from '%s' to wchar_t not available", + charset); + else + perror ("iconv_open"); + + /* @r{Terminate the output string.} */ + *outbuf = L'\0'; + + return -1; + @} + + while (avail > 0) + @{ + size_t nread; + size_t nconv; + char *inptr = inbuf; + + /* @r{Read more input.} */ + nread = read (fd, inbuf + insize, sizeof (inbuf) - insize); + if (nread == 0) + @{ + /* @r{When we come here the file is completely read.} + @r{This still could mean there are some unused} + @r{characters in the @code{inbuf}. Put them back.} */ + if (lseek (fd, -insize, SEEK_CUR) == -1) + result = -1; + + /* @r{Now write out the byte sequence to get into the} + @r{initial state if this is necessary.} */ + iconv (cd, NULL, NULL, &wrptr, &avail); + + break; + @} + insize += nread; + + /* @r{Do the conversion.} */ + nconv = iconv (cd, &inptr, &insize, &wrptr, &avail); + if (nconv == (size_t) -1) + @{ + /* @r{Not everything went right. It might only be} + @r{an unfinished byte sequence at the end of the} + @r{buffer. Or it is a real problem.} */ + if (errno == EINVAL) + /* @r{This is harmless. Simply move the unused} + @r{bytes to the beginning of the buffer so that} + @r{they can be used in the next round.} */ + memmove (inbuf, inptr, insize); + else + @{ + /* @r{It is a real problem. Maybe we ran out of} + @r{space in the output buffer or we have invalid} + @r{input. 
In any case move the file pointer back to} + @r{the position of the last processed byte.} */ + lseek (fd, -insize, SEEK_CUR); + result = -1; + break; + @} + @} + @} + + /* @r{Terminate the output string.} */ + if (avail >= sizeof (wchar_t)) + *((wchar_t *) wrptr) = L'\0'; + + if (iconv_close (cd) != 0) + perror ("iconv_close"); + + return (wchar_t *) wrptr - outbuf; +@} +@end smallexample + +@cindex stateful +This example shows the most important aspects of using the @code{iconv} +functions. It shows how successive calls to @code{iconv} can be used to +convert large amounts of text. The user does not have to care about +stateful encodings as the functions take care of everything. + +An interesting point is the case where @code{iconv} returns an error and +@code{errno} is set to @code{EINVAL}. This is not really an error in the +transformation. It can happen whenever the input character set contains +byte sequences of more than one byte for some character and texts are not +processed in one piece. In this case there is a chance that a multibyte +sequence is cut. The caller can then simply read the remainder of the +text and feed the offending bytes together with new characters from the +input to @code{iconv} and continue the work. The internal state kept in +the descriptor is @emph{not} unspecified after such an event as is the +case with the conversion functions from the @w{ISO C} standard. + +The example also shows the problem of using wide character strings with +@code{iconv}. As explained in the description of the @code{iconv} +function above, the function always takes a pointer to a @code{char} +array and the available space is measured in bytes. In the example, the +output buffer is a wide character buffer; therefore, we use a local +variable @var{wrptr} of type @code{char *}, which is used in the +@code{iconv} calls. + +This looks rather innocent but can lead to problems on platforms that +have tight restrictions on alignment. 
Therefore the caller of @code{iconv} +has to make sure that the pointers passed are suitable for access of +characters from the appropriate character set. Since, in the +above case, the input parameter to the function is a @code{wchar_t} +pointer, this is the case (unless the user violates alignment when +computing the parameter). But in other situations, especially when +writing generic functions where one does not know what type of character +set one uses and, therefore, treats text as a sequence of bytes, it might +become tricky. + +@node Other iconv Implementations +@subsection Some Details about other @code{iconv} Implementations + +This is not really the place to discuss the @code{iconv} implementation +of other systems but it is necessary to know a bit about them to write +portable programs. The above-mentioned problems with the specification +of the @code{iconv} functions can lead to portability issues. + +The first thing to notice is that, due to the large number of character +sets in use, it is certainly not practical to encode the conversions +directly in the C library. Therefore, the conversion information must +come from files outside the C library. This is usually done in one or +both of the following ways: + +@itemize @bullet +@item +The C library contains a set of generic conversion functions that can +read the needed conversion tables and other information from data files. +These files get loaded when necessary. + +This solution is problematic as it requires a great deal of effort to +apply to all character sets (potentially an infinite set). The +differences in the structure of the different character sets are so large +that many different variants of the table-processing functions must be +developed. In addition, the generic nature of these functions makes them +slower than specifically implemented functions. 
+ +@item +The C library only contains a framework that can dynamically load +object files and execute the conversion functions contained therein. + +This solution provides much more flexibility. The C library itself +contains only very little code and therefore reduces the general memory +footprint. Also, with a documented interface between the C library and +the loadable modules it is possible for third parties to extend the set +of available conversion modules. A drawback of this solution is that +dynamic loading must be available. +@end itemize + +Some implementations in commercial Unices implement a mixture of these +possibilities; the majority implement only the second solution. Using +loadable modules moves the code out of the library itself and keeps +the door open for extensions and improvements, but this design is also +limiting on some platforms since not many platforms support dynamic +loading in statically linked programs. On platforms without this +capability it is therefore not possible to use this interface in +statically linked programs. The GNU C library has, on ELF platforms, no +problems with dynamic loading in these situations; therefore, this +point is moot. The danger is that one gets accustomed to this +situation and forgets about the restrictions on other systems. + +A second thing to know about other @code{iconv} implementations is that +the number of available conversions is often very limited. Some +implementations provide, in the standard release (not special +international or developer releases), at most 100 to 200 conversion +possibilities. This does not mean 200 different character sets are +supported; for example, conversions from one character set to a set of 10 +others might count as 10 conversions. Together with the other direction +this makes 20 conversion possibilities used up by one character set. One +can imagine the thin coverage these platforms provide. 
Some Unix vendors +even provide only a handful of conversions, which renders them useless for +almost all uses. + +This directly leads to a third and probably the most problematic point. +With the way the @code{iconv} conversion functions are implemented on all +known Unix systems, the availability of the conversion from +character set @math{@cal{A}} to @math{@cal{B}} and of the conversion from +@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the +conversion from @math{@cal{A}} to @math{@cal{C}} is available. + +This might not seem unreasonable or problematic at first, but it is +quite a big problem, as one will notice shortly after hitting it. To show +the problem, assume we write a program that has to convert from +@math{@cal{A}} to @math{@cal{C}}. A call like + +@smallexample +cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}"); +@end smallexample + +@noindent +fails according to the assumption above. But what does the program +do now? The conversion is necessary; therefore, simply giving up is not +an option. + +This is a nuisance. The @code{iconv} function should take care of this. +But how should the program proceed from here on? If it tries to convert +to character set @math{@cal{B}} first, the two @code{iconv_open} +calls + +@smallexample +cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}"); +@end smallexample + +@noindent +and + +@smallexample +cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}"); +@end smallexample + +@noindent +will succeed, but how to find @math{@cal{B}}? + +Unfortunately, the answer is: there is no general solution. On some +systems guessing might help. On those systems most character sets can +convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides +this only some very system-specific methods can help. 
Since the +conversion functions come from loadable modules and these modules must +be stored somewhere in the filesystem, one @emph{could} try to find them +and determine from the available files which conversions are available +and whether there is an indirect route from @math{@cal{A}} to +@math{@cal{C}}. + +This example shows one of the design errors of @code{iconv} mentioned +above. It should at least be possible to determine the list of available +conversions programmatically so that if @code{iconv_open} says there is no +such conversion, one could make sure this also is true for indirect +routes. + +@node glibc iconv Implementation +@subsection The @code{iconv} Implementation in the GNU C library + +After reading about the problems of @code{iconv} implementations in the +last section it is certainly good to note that the implementation in +the GNU C library has none of the problems mentioned above. What +follows is a step-by-step analysis of the points raised above. The +evaluation is based on the current state of the development (as of +January 1999). The development of the @code{iconv} functions is not +complete, but basic functionality has solidified. + +The GNU C library's @code{iconv} implementation uses shared loadable +modules to implement the conversions. A very small number of +conversions are built into the library itself but these are only rather +trivial conversions. + +All the benefits of loadable modules are available in the GNU C library +implementation. This is especially appealing since the interface is +well documented (see below), and it is therefore easy to write new +conversion modules. The drawback of using loadable objects is not a +problem in the GNU C library, at least on ELF systems. Since the +library is able to load shared objects even in statically linked +binaries, static linking need not be forbidden in case one wants to use +@code{iconv}. + +The second problem mentioned is the number of supported conversions. 
+Currently, the GNU C library supports more than 150 character sets. Given +the way the implementation is designed, the number of supported conversions +is greater than 22350 (@math{150} times @math{149}). If any conversion +from or to a character set is missing, it can be added easily. + +Impressive as it may be, this high number is due to the +fact that the GNU C library implementation of @code{iconv} does not have +the third problem mentioned above (i.e., whenever there is a conversion +from a character set @math{@cal{A}} to @math{@cal{B}} and from +@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from +@math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open} +returns an error and sets @code{errno} to @code{EINVAL}, there is no +known way, directly or indirectly, to perform the wanted conversion. + +@cindex triangulation +Triangulation is achieved by providing for each character set a +conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} +as an intermediate representation it is possible to @dfn{triangulate} +(i.e., convert with an intermediate representation). + +There is no inherent requirement to provide a conversion to @w{ISO +10646} for a new character set, and it is also possible to provide other +conversions where neither source nor destination character set is @w{ISO +10646}. The existing set of conversions is simply meant to cover all +conversions that might be of interest. + +@cindex ISO-2022-JP +@cindex EUC-JP +All currently available conversions use the triangulation method above, +making conversion run unnecessarily slow. If, for example, somebody +often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution +would involve direct conversion between the two character sets, skipping +the conversion to @w{ISO 10646} first. The two character sets of interest +are much more similar to each other than to @w{ISO 10646}. 
+ +In such a situation one can easily write a new conversion and provide it +as a better alternative. The GNU C library @code{iconv} implementation +would automatically use the module implementing the conversion if it is +specified to be more efficient. + +@subsubsection Format of @file{gconv-modules} files + +All information about the available conversions comes from a file named +@file{gconv-modules}, which can be found in any of the directories along +the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented +text files, where each of the lines has one of the following formats: + +@itemize @bullet +@item +If the first non-whitespace character is a @kbd{#} the line contains only +comments and is ignored. + +@item +Lines starting with @code{alias} define an alias name for a character +set. Two more words are expected on the line. The first word +defines the alias name, and the second defines the original name of the +character set. The effect is that it is possible to use the alias name +in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open} and +achieve the same result as when using the real character set name. + +This is quite important as a character set often has many different +names. There is normally an official name but this need not correspond to +the most popular name. Besides this many character sets have special +names that are somehow constructed. For example, all character sets +specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} +where @var{nnn} is the registration number. This allows programs that +know about the registration number to construct character set names and +use them in @code{iconv_open} calls. More on the available names and +aliases follows below. + +@item +Lines starting with @code{module} introduce an available conversion +module. These lines must contain three or four more words. 
+ +The first word specifies the source character set, the second word the +destination character set of the conversion implemented in this module, and +the third word is the name of the loadable module. The filename is +constructed by appending the usual shared object suffix (normally +@file{.so}) and this file is then supposed to be found in the same +directory the @file{gconv-modules} file is in. The last word on the line, +which is optional, is a numeric value representing the cost of the +conversion. If this word is missing, a cost of @math{1} is assumed. The +numeric value itself does not matter that much; what counts are the +relative values of the sums of costs for all possible conversion paths. +Below is a more precise description of the use of the cost value. +@end itemize + +Return to the example above, where one has written a module to convert +directly from ISO-2022-JP to EUC-JP and back. All that has to be done is +to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a directory +and add a file @file{gconv-modules} with the following content in the +same directory: + +@smallexample +module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1 +module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1 +@end smallexample + +To see why this is sufficient, it is necessary to understand how the +conversion used by @code{iconv} (and described in the descriptor) is +selected. The approach to this problem is quite simple. + +At the first call of the @code{iconv_open} function the program reads +all available @file{gconv-modules} files and builds up two tables: one +containing all the known aliases and another that contains the +information about the conversions and which shared object implements +them. + +@subsubsection Finding the conversion path in @code{iconv} + +The set of available conversions forms a directed graph with weighted +edges. The weights on the edges are the costs specified in the +@file{gconv-modules} files. 
The @code{iconv_open} function uses an +algorithm suitable for finding the best path in such a graph and so +constructs a list of conversions that must be performed in succession +to get the transformation from the source to the destination character +set. + +It is straightforward to explain why the above @file{gconv-modules} file +makes the @code{iconv} implementation choose the specific ISO-2022-JP to +EUC-JP conversion module instead of the conversion coming with the +library itself. Since the latter conversion takes two +steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to +EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules} +file, however, specifies that the new conversion module can perform this +conversion at a cost of only @math{1}. + +A mysterious aspect of the @file{gconv-modules} file above (and also +the file coming with the GNU C library) is the names of the character +sets specified in the @code{module} lines. Why do almost all the names +end in @code{//}? And this is not all: the names can actually be +regular expressions. At this point in time this mystery should not be +revealed, unless you have the relevant spell-casting materials: ashes +from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix +blessed by St.@: Emacs, assorted herbal roots from Central America, sand +from Cebu, etc. Sorry! @strong{The part of the implementation where +this is used is not yet finished. For now please simply follow the +existing examples. It'll become clearer once it is. --drepper} + +A last remark about the @file{gconv-modules} file concerns the names not +ending with @code{//}. A character set named @code{INTERNAL} is often +mentioned. From the discussion above and the chosen name it should have +become clear that this is the name for the representation used in the +intermediate step of the triangulation. We have said that this is UCS-4 +but actually that is not quite right. 
The UCS-4 specification also +includes the specification of the byte ordering used. Since a UCS-4 value +consists of four bytes, a stored value is affected by byte ordering. The +internal representation is @emph{not} the same as UCS-4 in case the byte +ordering of the processor (or at least the running process) is not the +same as the one required for UCS-4. This is done for performance reasons +as one does not want to perform unnecessary byte-swapping operations if +one is not interested in actually seeing the result in UCS-4. To avoid +trouble with endianness, the internal representation consistently is named +@code{INTERNAL} even on big-endian systems where the representations are +identical. + +@subsubsection @code{iconv} module data structures + +So far this section has described how modules are located and considered +to be used. What remains to be described is the interface of the modules +so that one can write new ones. This section describes the interface as +it is in use in January 1999. The interface will change a bit in the +future but, with luck, only in an upwardly compatible way. + +The definitions necessary to write new modules are publicly available +in the non-standard header @file{gconv.h}. The following text, +therefore, describes the definitions from this header file. First, +however, it is necessary to get an overview. + +From the perspective of the user of @code{iconv} the interface is quite +simple: the @code{iconv_open} function returns a handle that can be used +in calls to @code{iconv}, and finally the handle is freed with a call to +@code{iconv_close}. The problem is that the handle has to be able to +represent the possibly long sequences of conversion steps and also the +state of each conversion since the handle is all that is passed to the +@code{iconv} function. Therefore, the data structures are really the +elements necessary to understand the implementation. + +We need two different kinds of data structures. 
The first describes the +conversion and the second describes the state etc. There are really two +type definitions like this in @file{gconv.h}. +@pindex gconv.h + +@comment gconv.h +@comment GNU +@deftp {Data type} {struct __gconv_step} +This data structure describes one conversion a module can perform. For +each function in a loaded module with conversion functions there is +exactly one object of this type. This object is shared by all users of +the conversion (i.e., this object does not contain any information +corresponding to an actual conversion; it only describes the conversion +itself). + +@table @code +@item struct __gconv_loaded_object *__shlib_handle +@itemx const char *__modname +@itemx int __counter +All these elements of the structure are used internally in the C library +to coordinate loading and unloading the shared object. One must not expect any +of the other elements to be available or initialized. + +@item const char *__from_name +@itemx const char *__to_name +@code{__from_name} and @code{__to_name} contain the names of the source and +destination character sets. They can be used to identify the actual +conversion to be carried out since one module might implement conversions +for more than one character set and/or direction. + +@item gconv_fct __fct +@itemx gconv_init_fct __init_fct +@itemx gconv_end_fct __end_fct +These elements contain pointers to the functions in the loadable module. +The interface will be explained below. + +@item int __min_needed_from +@itemx int __max_needed_from +@itemx int __min_needed_to +@itemx int __max_needed_to +These values have to be supplied in the init function of the module. The +@code{__min_needed_from} value specifies the minimum number of bytes a +character of the source character set needs. The @code{__max_needed_from} +value specifies the maximum, which also includes possible shift sequences. 
+ +The @code{__min_needed_to} and @code{__max_needed_to} values serve the +same purpose as @code{__min_needed_from} and @code{__max_needed_from} but +this time for the destination character set. + +It is crucial that these values be accurate since otherwise the +conversion functions will have problems or not work at all. + +@item int __stateful +This element must also be initialized by the init function. +@code{int __stateful} is nonzero if the source character set is stateful. +Otherwise it is zero. + +@item void *__data +This element can be used freely by the conversion functions in the +module. @code{void *__data} can be used to communicate extra information +from one call to another. @code{void *__data} need not be initialized if +not needed at all. If the @code{void *__data} element is assigned a pointer +to dynamically allocated memory (presumably in the init function) it has +to be made sure that the end function deallocates the memory. Otherwise +the application will leak memory. + +It is important to be aware that this data structure is shared by all +users of this specific conversion and therefore the @code{__data} +element must not contain data specific to one specific use of the +conversion function. +@end table +@end deftp + +@comment gconv.h +@comment GNU +@deftp {Data type} {struct __gconv_step_data} +This is the data structure that contains the information specific to +each use of the conversion functions. + + +@table @code +@item char *__outbuf +@itemx char *__outbufend +These elements specify the output buffer for the conversion step. The +@code{__outbuf} element points to the beginning of the buffer, and +@code{__outbufend} points to the byte following the last byte in the +buffer. The conversion function must not assume anything about the size +of the buffer but it can be safely assumed that there is room for at +least one complete character in the output buffer. 
+ +Once the conversion is finished, if the conversion is the last step, the +@code{__outbuf} element must be modified to point after the last byte +written into the buffer to signal how much output is available. If this +conversion step is not the last one, the element must not be modified. +The @code{__outbufend} element must not be modified. + +@item int __is_last +This element is nonzero if this conversion step is the last one. This +information is necessary for the recursion. See the description of the +conversion function internals below. This element must never be +modified. + +@item int __invocation_counter +The conversion function can use this element to see how many calls of +the conversion function already happened. Some character sets require a +certain prolog when generating output, and by comparing this value with +zero, one can find out whether it is the first call and whether, +therefore, the prolog should be emitted. This element must never be +modified. + +@item int __internal_use +This element is another one rarely used but needed in certain +situations. It is assigned a nonzero value in case the conversion +functions are used to implement @code{mbsrtowcs} et.al.@: (i.e., the +function is not used directly through the @code{iconv} interface). + +This sometimes makes a difference as it is expected that the +@code{iconv} functions are used to translate entire texts while the +@code{mbsrtowcs} functions are normally used only to convert single +strings and might be used multiple times to convert entire texts. + +But in this situation we would have a problem complying with some rules of +the character set specification. Some character sets require a prolog, +which must appear exactly once for an entire text. If a number of +@code{mbsrtowcs} calls are used to convert the text, only the first call +must add the prolog. 
However, because there is no communication between the +different calls of @code{mbsrtowcs}, the conversion functions have no +possibility to find this out. The situation is different for sequences +of @code{iconv} calls since the handle allows access to the needed +information. + +The @code{int __internal_use} element is mostly used together with +@code{__invocation_counter} as follows: + +@smallexample +if (!data->__internal_use + && data->__invocation_counter == 0) + /* @r{Emit prolog.} */ + ... +@end smallexample + +This element must never be modified. + +@item mbstate_t *__statep +The @code{__statep} element points to an object of type @code{mbstate_t} +(@pxref{Keeping the state}). The conversion of a stateful character +set must use the object pointed to by @code{__statep} to store +information about the conversion state. The @code{__statep} element +itself must never be modified. + +@item mbstate_t __state +This element must @emph{never} be used directly. It is only part of +this structure to have the needed space allocated. +@end table +@end deftp + +@subsubsection @code{iconv} module interfaces + +With the knowledge about the data structures we now can describe the +conversion function itself. To understand the interface a bit of +knowledge is necessary about the functionality in the C library that +loads the objects with the conversions. + +It is often the case that one conversion is used more than once (i.e., +there are several @code{iconv_open} calls for the same set of character +sets during one program run). The @code{mbsrtowcs} et.al.@: functions in +the GNU C library also use the @code{iconv} functionality, which +increases the number of uses of the same functions even more. + +Because of this multiple use of conversions, the modules do not get +loaded exclusively for one conversion. Instead a module once loaded can +be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls +at the same time. 
The splitting of the information between conversion- +function-specific information and conversion data makes this possible. +The last section showed the two data structures used to do this. + +This is of course also reflected in the interface and semantics of the +functions that the modules must provide. There are three functions that +must have the following names: + +@table @code +@item gconv_init +The @code{gconv_init} function initializes the conversion function +specific data structure. This very same object is shared by all +conversions that use this conversion and, therefore, no state information +about the conversion itself must be stored in here. If a module +implements more than one conversion, the @code{gconv_init} function will +be called multiple times. + +@item gconv_end +The @code{gconv_end} function is responsible for freeing all resources +allocated by the @code{gconv_init} function. If there is nothing to do, +this function can be missing. Special care must be taken if the module +implements more than one conversion and the @code{gconv_init} function +does not allocate the same resources for all conversions. + +@item gconv +This is the actual conversion function. It is called to convert one +block of text. It gets passed the conversion step information +initialized by @code{gconv_init} and the conversion data, specific to +this use of the conversion functions. +@end table + +There are three data types defined for the three module interface +functions and these define the interface. + +@comment gconv.h +@comment GNU +@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *) +This specifies the interface of the initialization function of the +module. It is called exactly once for each conversion the module +implements. + +As explained in the description of the @code{struct __gconv_step} data +structure above the initialization function has to initialize parts of +it. 
+ +@table @code +@item __min_needed_from +@itemx __max_needed_from +@itemx __min_needed_to +@itemx __max_needed_to +These elements must be initialized to the exact numbers of the minimum +and maximum number of bytes used by one character in the source and +destination character sets, respectively. If the characters all have the +same size, the minimum and maximum values are the same. + +@item __stateful +This element must be initialized to a nonzero value if the source +character set is stateful. Otherwise it must be zero. +@end table + +If the initialization function needs to communicate some information +to the conversion function, this communication can happen using the +@code{__data} element of the @code{__gconv_step} structure. But since +this data is shared by all the conversions, it must not be modified by +the conversion function. The example below shows how this can be used. + +@smallexample +#define MIN_NEEDED_FROM 1 +#define MAX_NEEDED_FROM 4 +#define MIN_NEEDED_TO 4 +#define MAX_NEEDED_TO 4 + +int +gconv_init (struct __gconv_step *step) +@{ + /* @r{Determine which direction.} */ + struct iso2022jp_data *new_data; + enum direction dir = illegal_dir; + enum variant var = illegal_var; + int result; + + if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0) + @{ + dir = from_iso2022jp; + var = iso2022jp; + @} + else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0) + @{ + dir = to_iso2022jp; + var = iso2022jp; + @} + else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0) + @{ + dir = from_iso2022jp; + var = iso2022jp2; + @} + else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0) + @{ + dir = to_iso2022jp; + var = iso2022jp2; + @} + + result = __GCONV_NOCONV; + if (dir != illegal_dir) + @{ + new_data = (struct iso2022jp_data *) + malloc (sizeof (struct iso2022jp_data)); + + result = __GCONV_NOMEM; + if (new_data != NULL) + @{ + new_data->dir = dir; + new_data->var = var; + step->__data = new_data; + + if (dir == 
from_iso2022jp) + @{ + step->__min_needed_from = MIN_NEEDED_FROM; + step->__max_needed_from = MAX_NEEDED_FROM; + step->__min_needed_to = MIN_NEEDED_TO; + step->__max_needed_to = MAX_NEEDED_TO; + @} + else + @{ + step->__min_needed_from = MIN_NEEDED_TO; + step->__max_needed_from = MAX_NEEDED_TO; + step->__min_needed_to = MIN_NEEDED_FROM; + step->__max_needed_to = MAX_NEEDED_FROM + 2; + @} + + /* @r{Yes, this is a stateful encoding.} */ + step->__stateful = 1; + + result = __GCONV_OK; + @} + @} + + return result; +@} +@end smallexample + +The function first checks which conversion is wanted. The module from +which this function is taken implements four different conversions; +which one is selected can be determined by comparing the names. The +comparison should always be done without paying attention to the case. + +Next, a data structure, which contains the necessary information about +which conversion is selected, is allocated. The data structure +@code{struct iso2022jp_data} is locally defined since, outside the +module, this data is not used at all. Please note that if all four +conversions this module supports are requested there are four data +blocks. + +One interesting thing is the initialization of the @code{__min_} and +@code{__max_} elements of the step data object. A single ISO-2022-JP +character can consist of one to four bytes. Therefore the +@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined +this way. The output is always the @code{INTERNAL} character set (aka +UCS-4) and therefore each character consists of exactly four bytes. For +the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into +account that escape sequences might be necessary to switch the character +sets. Therefore the @code{__max_needed_to} element for this direction +gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the +two bytes needed for the escape sequences to signal the switching. 
The +asymmetry in the maximum values for the two directions can be explained +easily: when reading ISO-2022-JP text, escape sequences can be handled +alone (i.e., it is not necessary to process a real character since the +effect of the escape sequence can be recorded in the state information). +The situation is different for the other direction. Since it is in +general not known which character comes next, one cannot emit escape +sequences to change the state in advance. This means the escape +sequences have to be emitted together with the next character. +Therefore one needs more room than only for the character itself. + +The possible return values of the initialization function are: + +@table @code +@item __GCONV_OK +The initialization succeeded. +@item __GCONV_NOCONV +The requested conversion is not supported in the module. This can +happen if the @file{gconv-modules} file has errors. +@item __GCONV_NOMEM +Memory required to store additional information could not be allocated. +@end table +@end deftypevr + +The function called before the module is unloaded is significantly +easier. It often has nothing at all to do, in which case it can be left +out completely. + +@comment gconv.h +@comment GNU +@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *) +The task of this function is to free all resources allocated in the +initialization function. Therefore only the @code{__data} element of +the object pointed to by the argument is of interest. Continuing the +example from the initialization function, the finalization function +looks like this: + +@smallexample +void +gconv_end (struct __gconv_step *data) +@{ + free (data->__data); +@} +@end smallexample +@end deftypevr + +The most important function is the conversion function itself, which can +get quite complicated for complex character sets. But since this is not +of interest here, we will only describe a possible skeleton for the +conversion function. 
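Before turning to that skeleton, the pairing of the initialization and finalization functions described above can be tried outside the C library. The following is only a sketch under stated assumptions: @code{struct demo_step} is a hypothetical stand-in with just the three fields discussed in the text (the real @code{struct __gconv_step} from @file{gconv.h} has more members), and @code{demo_init}, @code{demo_end}, @code{demo_equal_nocase}, and the @code{DEMO_*} codes are names invented for this illustration, not part of any library interface.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct __gconv_step; NOT the real type.  */
struct demo_step
{
  const char *from_name;
  const char *to_name;
  void *data;                 /* shared by all users of this conversion */
};

/* Invented status codes, playing the role of the __GCONV_* values.  */
enum { DEMO_OK = 0, DEMO_NOCONV = 1, DEMO_NOMEM = 2 };

enum demo_dir { demo_from, demo_to };

/* Case-insensitive string equality (illustrative helper).  */
static int
demo_equal_nocase (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      ++a;
      ++b;
    }
  return *a == *b;
}

/* Decide the direction by comparing names, then store it in DATA.  */
int
demo_init (struct demo_step *step)
{
  enum demo_dir dir;
  enum demo_dir *priv;

  if (demo_equal_nocase (step->from_name, "ISO-2022-JP//"))
    dir = demo_from;
  else if (demo_equal_nocase (step->to_name, "ISO-2022-JP//"))
    dir = demo_to;
  else
    return DEMO_NOCONV;

  priv = malloc (sizeof *priv);
  if (priv == NULL)
    return DEMO_NOMEM;

  *priv = dir;
  step->data = priv;
  return DEMO_OK;
}

/* Release exactly what demo_init allocated.  */
void
demo_end (struct demo_step *step)
{
  free (step->data);
  step->data = NULL;
}
```

Because the step object is shared, the sketch stores only the direction in the data element, that is, information describing the conversion and not the state of one particular use of it.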
+ +@comment gconv.h +@comment GNU +@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int) +The conversion function can be called for two basic reasons: to convert +text or to reset the state. From the description of the @code{iconv} +function it can be seen why the flushing mode is necessary. What mode +is selected is determined by the sixth argument, an integer. This +argument being nonzero means that flushing is selected. + +Common to both modes is where the output buffer can be found. The +information about this buffer is stored in the conversion step data. A +pointer to this information is passed as the second argument to this +function. The description of the @code{struct __gconv_step_data} +structure has more information on the conversion step data. + +@cindex stateful +What has to be done for flushing depends on the source character set. +If the source character set is not stateful, nothing has to be done. +Otherwise the function has to emit a byte sequence to bring the state +object into the initial state. Once all this has happened the other +conversion modules in the chain of conversions have to get the same +chance. Whether another step follows can be determined from the +@code{__is_last} element of the step data structure to which the second +parameter points. + +The more interesting mode is when actual text has to be converted. The +first step in this case is to convert as much text as possible from the +input buffer and store the result in the output buffer. The start of the +input buffer is determined by the third argument, which is a pointer to a +pointer variable referencing the beginning of the buffer. The fourth +argument is a pointer to the byte right after the last byte in the buffer. + +The conversion has to be performed according to the current state if the +character set is stateful. 
The state is stored in an object pointed to +by the @code{__statep} element of the step data (second argument). Once +either the input buffer is empty or the output buffer is full the +conversion stops. At this point, the pointer variable referenced by the +third parameter must point to the byte following the last processed +byte (i.e., if all of the input is consumed, this pointer and the fourth +parameter have the same value). + +What now happens depends on whether this step is the last one. If it is +the last step, the only thing that has to be done is to update the +@code{__outbuf} element of the step data structure to point after the +last written byte. This update gives the caller the information on how +much text is available in the output buffer. In addition, the variable +pointed to by the fifth parameter, which is of type @code{size_t}, must +be incremented by the number of characters (@emph{not bytes}) that were +converted in a non-reversible way. Then, the function can return. + +In case the step is not the last one, the later conversion functions have +to get a chance to do their work. Therefore, the appropriate conversion +function has to be called. The information about the functions is +stored in the conversion data structures, passed as the first parameter. +This information and the step data are stored in arrays, so the next +element in both cases can be found by simple pointer arithmetic: + +@smallexample +int +gconv (struct __gconv_step *step, struct __gconv_step_data *data, + const char **inbuf, const char *inbufend, size_t *written, + int do_flush) +@{ + struct __gconv_step *next_step = step + 1; + struct __gconv_step_data *next_data = data + 1; + ... +@end smallexample + +The @code{next_step} pointer references the next step information and +@code{next_data} the next data record. 
The call of the next function +therefore will look similar to this: + +@smallexample + next_step->__fct (next_step, next_data, &outerr, outbuf, + written, 0) +@end smallexample + +But this is not yet all. Once the function call returns the conversion +function might have some more to do. If the return value of the function +is @code{__GCONV_EMPTY_INPUT}, more room is available in the output +buffer. Unless the input buffer is empty, the conversion functions start +all over again and process the rest of the input buffer. If the return +value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have +to recover from this. + +A requirement for the conversion function is that the input buffer +pointer (the third argument) always point to the last character that +was put in converted form into the output buffer. This is trivially +true after the conversion performed in the current step, but if the +conversion functions deeper downstream stop prematurely, not all +characters from the output buffer are consumed and, therefore, the input +buffer pointers must be backed off to the right position. + +Correcting the input buffers is easy to do if the input and output +character sets have a fixed width for all characters. In this situation +we can compute how many characters are left in the output buffer and, +therefore, can correct the input buffer pointer appropriately with a +similar computation. Things get tricky if either character set +has characters represented with variable length byte sequences, and it +gets even more complicated if the conversion has to take care of the +state. In these cases the conversion has to be performed once again, from +the known state before the initial conversion (i.e., if necessary the +state of the conversion has to be reset and the conversion loop has to be +executed again). The difference now is that it is known how much input +must be created, and the conversion can stop before converting the first +unused character. 
Once this is done the input buffer pointers must be +updated again and the function can return. + +One final thing should be mentioned. If it is necessary for the +conversion to know whether it is the first invocation (in case a prolog +has to be emitted), the conversion function should increment the +@code{__invocation_counter} element of the step data structure just +before returning to the caller. See the description of the @code{struct +__gconv_step_data} structure above for more information on how this can +be used. + +The return value must be one of the following values: + +@table @code +@item __GCONV_EMPTY_INPUT +All input was consumed and there is room left in the output buffer. +@item __GCONV_FULL_OUTPUT +No more room in the output buffer. In case this is not the last step +this value is propagated down from the call of the next conversion +function in the chain. +@item __GCONV_INCOMPLETE_INPUT +The input buffer is not entirely empty since it contains an incomplete +character sequence. +@end table + +The following example provides a framework for a conversion function. +In case a new conversion has to be written the holes in this +implementation have to be filled and that is it. + +@smallexample +int +gconv (struct __gconv_step *step, struct __gconv_step_data *data, + const char **inbuf, const char *inbufend, size_t *written, + int do_flush) +@{ + struct __gconv_step *next_step = step + 1; + struct __gconv_step_data *next_data = data + 1; + gconv_fct fct = next_step->__fct; + int status; + + /* @r{If the function is called with no input this means we have} + @r{to reset to the initial state. The possibly partly} + @r{converted input is dropped.} */ + if (do_flush) + @{ + status = __GCONV_OK; + + /* @r{Possibly emit a byte sequence which puts the state object} + @r{into the initial state.} */ + + /* @r{Call the steps down the chain if there are any but only} + @r{if we successfully emitted the escape sequence.} */ + if (status == __GCONV_OK && ! 
data->__is_last) + status = fct (next_step, next_data, NULL, NULL, + written, 1); + @} + else + @{ + /* @r{We preserve the initial values of the pointer variables.} */ + const char *inptr = *inbuf; + char *outbuf = data->__outbuf; + char *outend = data->__outbufend; + char *outptr; + + do + @{ + /* @r{Remember the start value for this round.} */ + inptr = *inbuf; + /* @r{The outbuf buffer is empty.} */ + outptr = outbuf; + + /* @r{For stateful encodings the state must be safe here.} */ + + /* @r{Run the conversion loop. @code{status} is set} + @r{appropriately afterwards.} */ + + /* @r{If this is the last step, leave the loop. There is} + @r{nothing we can do.} */ + if (data->__is_last) + @{ + /* @r{Store information about how many bytes are} + @r{available.} */ + data->__outbuf = outbuf; + + /* @r{If any non-reversible conversions were performed,} + @r{add the number to @code{*written}.} */ + + break; + @} + + /* @r{Write out all output that was produced.} */ + if (outbuf > outptr) + @{ + const char *outerr = data->__outbuf; + int result; + + result = fct (next_step, next_data, &outerr, + outbuf, written, 0); + + if (result != __GCONV_EMPTY_INPUT) + @{ + if (outerr != outbuf) + @{ + /* @r{Reset the input buffer pointer. We} + @r{document here the complex case.} */ + size_t nstatus; + + /* @r{Reload the pointers.} */ + *inbuf = inptr; + outbuf = outptr; + + /* @r{Possibly reset the state.} */ + + /* @r{Redo the conversion, but this time} + @r{the end of the output buffer is at} + @r{@code{outerr}.} */ + @} + + /* @r{Change the status.} */ + status = result; + @} + else + /* @r{All the output is consumed, we can make} + @r{ another run if everything was ok.} */ + if (status == __GCONV_FULL_OUTPUT) + status = __GCONV_OK; + @} + @} + while (status == __GCONV_OK); + + /* @r{We finished one use of this step.} */ + ++data->__invocation_counter; + @} + + return status; +@} +@end smallexample +@end deftypevr + +This information should be sufficient to write new modules. 
Anybody +doing so should also take a look at the available source code in the GNU +C library sources. It contains many examples of working and optimized +modules. + @c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation
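The chaining of conversion steps described in this section can also be exercised in isolation. The sketch below is not the library's implementation: @code{struct demo_step}, @code{struct demo_data}, the @code{DEMO_*} status codes, and all function names are hypothetical stand-ins, the two "conversions" are trivial byte-for-byte mappings, and flushing, state handling, and the count of non-reversible conversions are left out. It only illustrates three points from the text: the next step is found by pointer arithmetic, only the last step advances its output pointer, and in the fixed-width case the input pointer is backed off by the number of bytes a later step did not consume.

```c
#include <stddef.h>

struct demo_step;
struct demo_data;
typedef int (*demo_fct) (struct demo_step *, struct demo_data *,
                         const char **, const char *);

/* Hypothetical stand-ins for the step and step-data structures.  */
struct demo_step { demo_fct fct; };
struct demo_data { char *outbuf; char *outbufend; int is_last; };

enum { DEMO_EMPTY_INPUT, DEMO_FULL_OUTPUT };   /* invented codes */

static char up (char c) { return c >= 'a' && c <= 'z' ? c - 'a' + 'A' : c; }
static char under (char c) { return c == ' ' ? '_' : c; }

/* Shared skeleton: convert, then hand the output to the next step.  */
static int
demo_convert (struct demo_step *step, struct demo_data *data,
              const char **inbuf, const char *inbufend, char (*map) (char))
{
  int status;

  do
    {
      char *outstart = data->outbuf;
      char *outptr = outstart;

      /* One byte in, one byte out, until input is empty or output full.  */
      while (*inbuf < inbufend && outptr < data->outbufend)
        *outptr++ = map (*(*inbuf)++);
      status = *inbuf == inbufend ? DEMO_EMPTY_INPUT : DEMO_FULL_OUTPUT;

      if (data->is_last)
        {
          data->outbuf = outptr;   /* signal how much output is available */
          break;
        }

      if (outptr > outstart)
        {
          const char *outerr = outstart;
          int result = step[1].fct (step + 1, data + 1, &outerr, outptr);

          if (result != DEMO_EMPTY_INPUT)
            {
              /* Fixed-width case: back off by the unconsumed bytes.  */
              *inbuf -= outptr - outerr;
              status = result;
              break;
            }
        }
    }
  while (status == DEMO_FULL_OUTPUT);

  return status;
}

static int
demo_upper (struct demo_step *s, struct demo_data *d,
            const char **in, const char *inend)
{
  return demo_convert (s, d, in, inend, up);
}

static int
demo_under (struct demo_step *s, struct demo_data *d,
            const char **in, const char *inend)
{
  return demo_convert (s, d, in, inend, under);
}

/* Run the two-step chain; returns the number of bytes written to OUT.  */
size_t
demo_run (const char *text, size_t len, char *out, size_t outsize,
          size_t *consumed, int *status)
{
  char tmp[4];   /* deliberately small intermediate buffer */
  struct demo_step steps[2] = { { demo_upper }, { demo_under } };
  struct demo_data data[2] = { { tmp, tmp + sizeof tmp, 0 },
                               { out, out + outsize, 1 } };
  const char *in = text;

  *status = steps[0].fct (&steps[0], &data[0], &in, text + len);
  *consumed = (size_t) (in - text);
  return (size_t) (data[1].outbuf - out);
}
```

Calling @code{demo_run} with an output buffer that is too small shows the back-off: the reported number of consumed input bytes then matches the number of bytes that actually reached the final buffer, so a caller could continue from that position with a fresh buffer.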
\ No newline at end of file |