author | Ulrich Drepper <drepper@redhat.com> | 2001-11-05 08:04:39 +0000 |
---|---|---|
committer | Ulrich Drepper <drepper@redhat.com> | 2001-11-05 08:04:39 +0000 |
commit | 91f07167e37541706554e4117c32aae1bd436cc9 (patch) | |
tree | 05ece0714b396155a8e923f8f226ec8edafe7757 /manual/charset.texi | |
parent | 50d274e5a66e4baed5fc0ade52650970a1728798 (diff) | |
Editing.
Diffstat (limited to 'manual/charset.texi')
-rw-r--r-- | manual/charset.texi | 5784 |
1 files changed, 2892 insertions, 2892 deletions
diff --git a/manual/charset.texi b/manual/charset.texi index bb9cc64b8d..b7b2f734a8 100644 --- a/manual/charset.texi +++ b/manual/charset.texi @@ -1,2892 +1,2892 @@ -@node Character Set Handling, Locales, String and Array Utilities, Top -@c %MENU% Support for extended character sets -@chapter Character Set Handling - -@ifnottex -@macro cal{text} -\text\ -@end macro -@end ifnottex - -Character sets used in the early days of computing had only six, seven, -or eight bits for each character: there was never a case where more than -eight bits (one byte) were used to represent a single character. The -limitations of this approach became more apparent as more people -grappled with non-Roman character sets, where not all the characters -that make up a language's character set can be represented by @math{2^8} -choices. This chapter shows the functionality which was added to the C -library to support multiple character sets. - -@menu -* Extended Char Intro:: Introduction to Extended Characters. -* Charset Function Overview:: Overview about Character Handling - Functions. -* Restartable multibyte conversion:: Restartable multibyte conversion - Functions. -* Non-reentrant Conversion:: Non-reentrant Conversion Function. -* Generic Charset Conversion:: Generic Charset Conversion. -@end menu - - -@node Extended Char Intro -@section Introduction to Extended Characters - -A variety of solutions to overcome the differences between -character sets with a 1:1 relation between bytes and characters and -character sets with ratios of 2:1 or 4:1 exist. The remainder of this -section gives a few examples to help understand the design decisions -made while developing the functionality of the @w{C library}. - -@cindex internal representation -A distinction we have to make right away is between internal and -external representation. @dfn{Internal representation} means the -representation used by a program while keeping the text in memory. 
-External representations are used when text is stored or transmitted -through whatever communication channel. Examples of external -representations include files lying in a directory that are going to be -read and parsed. - -Traditionally there has been no difference between the two representations. -It was equally comfortable and useful to use the same single-byte -representation internally and externally. This changes with more and -larger character sets. - -One of the problems to overcome with the internal representation is -handling text that is externally encoded using different character -sets. Assume a program which reads two texts and compares them using -some metric. The comparison can be usefully done only if the texts are -internally kept in a common format. - -@cindex wide character -For such a common format (@math{=} character set) eight bits are certainly -no longer enough. So the smallest entity will have to grow: @dfn{wide -characters} will now be used. Instead of one byte, two or four will -be used. (Three are not good to address in memory and more -than four bytes seem not to be necessary.) - -@cindex Unicode -@cindex ISO 10646 -As shown in some other part of this manual, -@c !!! Ahem, wide char string functions are not yet covered -- drepper -there exists a completely new family of functions which can handle texts -of this kind in memory. The most commonly used character sets for such -internal wide character representations are Unicode and @w{ISO 10646} -(also known as UCS for Universal Character Set). Unicode was originally -planned as a 16-bit character set, whereas @w{ISO 10646} was designed to -be a 31-bit large code space. The two standards are practically identical. -They have the same character repertoire and code table, but Unicode specifies -added semantics. 
At the moment, only characters in the first @code{0x10000} -code positions (the so-called Basic Multilingual Plane, BMP) have been -assigned, but the assignment of more specialized characters outside this -16-bit space is already in progress. A number of encodings have been -defined for Unicode and @w{ISO 10646} characters: -@cindex UCS-2 -@cindex UCS-4 -@cindex UTF-8 -@cindex UTF-16 -UCS-2 is a 16-bit word that can only represent characters -from the BMP, UCS-4 is a 32-bit word that can represent any Unicode -and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where -ASCII characters are represented by ASCII bytes and non-ASCII characters -by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension -of UCS-2 in which pairs of certain UCS-2 words can be used to encode -non-BMP characters up to @code{0x10ffff}. - -To represent wide characters the @code{char} type is not suitable. For -this reason the @w{ISO C} standard introduces a new type which is -designed to keep one character of a wide character string. To maintain -the similarity there is also a type corresponding to @code{int} for -those functions which take a single wide character. - -@comment stddef.h -@comment ISO -@deftp {Data type} wchar_t -This data type is used as the base type for wide character strings. -I.e., arrays of objects of this type are the equivalent of @code{char[]} -for multibyte character strings. The type is defined in @file{stddef.h}. - -The @w{ISO C90} standard, where this type was introduced, does not say -anything specific about the representation. It only requires that this -type is capable of storing all elements of the basic character set. -Therefore it would be legitimate to define @code{wchar_t} as -@code{char}. This might make sense for embedded systems. - -But for GNU systems this type is always 32 bits wide. It is therefore -capable of representing all UCS-4 values and therefore covering all of -@w{ISO 10646}. 
Some Unix systems define @code{wchar_t} as a 16-bit type and -thereby follow Unicode very strictly. This is perfectly fine with the -standard but it also means that to represent all characters from Unicode -and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in -fact a multi-wide-character encoding. But this contradicts the purpose -of the @code{wchar_t} type. -@end deftp - -@comment wchar.h -@comment ISO -@deftp {Data type} wint_t -@code{wint_t} is a data type used for parameters and variables which -contain a single wide character. As the name already suggests it is the -equivalent to @code{int} when using the normal @code{char} strings. The -types @code{wchar_t} and @code{wint_t} often have the same -representation if their size is 32 bits wide but if @code{wchar_t} is -defined as @code{char} the type @code{wint_t} must be defined as -@code{int} due to the parameter promotion. - -@pindex wchar.h -This type is defined in @file{wchar.h} and got introduced in -@w{Amendment 1} to @w{ISO C90}. -@end deftp - -As for the @code{char} data type, there also exist macros -specifying the minimum and maximum value representable in an object of -type @code{wchar_t}. - -@comment wchar.h -@comment ISO -@deftypevr Macro wint_t WCHAR_MIN -The macro @code{WCHAR_MIN} evaluates to the minimum value representable -by an object of type @code{wint_t}. - -This macro got introduced in @w{Amendment 1} to @w{ISO C90}. -@end deftypevr - -@comment wchar.h -@comment ISO -@deftypevr Macro wint_t WCHAR_MAX -The macro @code{WCHAR_MAX} evaluates to the maximum value representable -by an object of type @code{wint_t}. - -This macro got introduced in @w{Amendment 1} to @w{ISO C90}. -@end deftypevr - -Another special wide character value is the equivalent to @code{EOF}. 
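Editorial illustration, not part of the commit: the limits described in the removed text above can be inspected with a few lines of C. This is only a sketch; the helper name `describe_wchar_limits` is hypothetical, and the values it reports depend on the platform (the text above states that `wchar_t` is 32 bits wide on GNU systems, other systems may differ).

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper, not part of any library: format the size and
   range of wchar_t into BUF.  Returns what snprintf returns, i.e. the
   number of characters the full message needs.  The casts to long long
   are there because wchar_t may be signed or unsigned.  */
int
describe_wchar_limits (char *buf, size_t buflen)
{
  assert (buf != NULL);
  return snprintf (buf, buflen, "wchar_t: %zu bytes, range %lld..%lld",
                   sizeof (wchar_t),
                   (long long) WCHAR_MIN, (long long) WCHAR_MAX);
}
```

Calling `describe_wchar_limits (buf, sizeof buf)` and printing `buf` shows whether a given system uses a 16-bit or a 32-bit `wchar_t`.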
- -@comment wchar.h -@comment ISO -@deftypevr Macro wint_t WEOF -The macro @code{WEOF} evaluates to a constant expression of type -@code{wint_t} whose value is different from any member of the extended -character set. - -@code{WEOF} need not be the same value as @code{EOF} and unlike -@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like - -@smallexample -@{ - int c; - ... - while ((c = getc (fp)) >= 0) - ... -@} -@end smallexample - -@noindent -has to be rewritten to explicitly use @code{WEOF} when wide characters -are used. - -@smallexample -@{ - wint_t c; - ... - while ((c = getwc (fp)) != WEOF) - ... -@} -@end smallexample - -@pindex wchar.h -This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is -defined in @file{wchar.h}. -@end deftypevr - - -These internal representations present problems when it comes to storage -and transmittal. Since a single wide character consists of more -than one byte, they are affected by byte-ordering. I.e., machines with -different endiannesses would see different values when accessing the same data. -This also applies to communication protocols, which are all byte-based, -and therefore the sender has to decide about splitting the wide -character in bytes. A last (but not least important) point is that wide -characters often require more storage space than a customized byte -oriented character set. - -@cindex multibyte character -@cindex EBCDIC - For all the above reasons, an external encoding which is different -from the internal encoding is often used if the latter is UCS-2 or UCS-4. -The external encoding is byte-based and can be chosen appropriately for -the environment and for the texts to be handled. There exist a variety -of different character sets which can be used for this external -encoding. They will not be exhaustively presented -here--instead, a description of the major groups will suffice. All of -the ASCII-based character sets fulfill one requirement: they are -"filesystem safe". 
This means that the character @code{'/'} is used in -the encoding @emph{only} to represent itself. Things are a bit -different for character sets like EBCDIC (Extended Binary Coded Decimal -Interchange Code, a character set family used by IBM) but if the -operating system does not understand EBCDIC directly the parameters to -system calls have to be converted first anyhow. - -@itemize @bullet -@item -The simplest character sets are single-byte character sets. There can -be only up to 256 characters (for @w{8 bit} character sets) which is not -sufficient to cover all languages but might be sufficient to handle a -specific text. Handling of @w{8 bit} character sets is simple. This is -not true for the other kinds presented later and therefore the -application one uses might require the use of @w{8 bit} character sets. - -@cindex ISO 2022 -@item -The @w{ISO 2022} standard defines a mechanism for extended character -sets where one character @emph{can} be represented by more than one -byte. This is achieved by associating a state with the text. Embedded -in the text can be characters which can be used to change the state. -Each byte in the text might have a different interpretation in each -state. The state might even influence whether a given byte stands for a -character on its own or whether it has to be combined with some more -bytes. - -@cindex EUC -@cindex Shift_JIS -@cindex SJIS -In most uses of @w{ISO 2022} the defined character sets do not allow -state changes which cover more than the next character. This has the -big advantage that whenever one can identify the beginning of the byte -sequence of a character one can interpret a text correctly. Examples of -character sets using this policy are the various EUC character sets -(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) -or Shift_JIS (SJIS, a Japanese encoding). 
- -But there are also character sets using a state which is valid for more -than one character and has to be changed by another byte sequence. -Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. - -@item -@cindex ISO 6937 -Early attempts to fix 8 bit character sets for other languages using the -Roman alphabet led to character sets like @w{ISO 6937}. Here bytes -representing characters like the acute accent do not produce output -themselves: one has to combine them with other characters to get the -desired result. E.g., the byte sequence @code{0xc2 0x61} (non-spacing -acute accent, followed by lower-case `a') is needed to get the ``small a with -acute'' character. To get the acute accent character on its own, one has -to write @code{0xc2 0x20} (the non-spacing acute followed by a space). - -This type of character set is used in some embedded systems such as -teletex. - -@item -@cindex UTF-8 -Instead of converting the Unicode or @w{ISO 10646} text used internally, -it is often also sufficient to simply use an encoding different than -UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an -encoding: UTF-8. This encoding is able to represent all of @w{ISO -10646} 31 bits in a byte string of length one to six. - -@cindex UTF-7 -There were a few other attempts to encode @w{ISO 10646} such as UTF-7 -but UTF-8 is today the only encoding which should be used. In fact, -UTF-8 will hopefully soon be the only external encoding that has to be -supported. It proves to be universally usable and the only disadvantage -is that it favors Roman languages by making the byte string -representation of other scripts (Cyrillic, Greek, Asian scripts) longer -than necessary if using a specific character set for these scripts. -Methods like the Unicode compression scheme can alleviate these -problems. -@end itemize - -The question remaining is: how to select the character set or encoding -to use. 
The answer: you cannot decide about it yourself; it is decided -by the developers of the system or the majority of the users. Since the -goal is interoperability one has to use whatever the other people one -works with use. If there are no constraints the selection is based on -the requirements the expected circle of users will have. I.e., if a -project is expected to only be used in, say, Russia it is fine to use -KOI8-R or a similar character set. But if at the same time people from, -say, Greece are participating one should use a character set which allows -all people to collaborate. - -The most widely useful solution seems to be: go with the most general -character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding -and problems about users not being able to use their own language -adequately are a thing of the past. - -One final comment about the choice of the wide character representation -is necessary at this point. We have said above that the natural choice -is using Unicode or @w{ISO 10646}. This is not required, but at least -encouraged, by the @w{ISO C} standard. The standard defines at least a -macro @code{__STDC_ISO_10646__} that is only defined on systems where -the @code{wchar_t} type encodes @w{ISO 10646} characters. If this -symbol is not defined one should as much as possible avoid making -assumptions about the wide character representation. If the programmer -uses only the functions provided by the C library to handle wide -character strings there should not be any compatibility problems with -other systems. - -@node Charset Function Overview -@section Overview about Character Handling Functions - -A Unix @w{C library} contains three different sets of functions in two -families to handle character set conversion. The one function family -is specified in the @w{ISO C} standard and therefore is portable even -beyond the Unix world. 
- -The most commonly known set of functions, coming from the @w{ISO C90} -standard, is unfortunately the least useful one. In fact, these -functions should be avoided whenever possible, especially when -developing libraries (as opposed to applications). - -The second family of functions got introduced in the early Unix standards -(XPG2) and is still part of the latest and greatest Unix standard: -@w{Unix 98}. It is also the most powerful and useful set of functions. -But we will start with the functions defined in @w{Amendment 1} to -@w{ISO C90}. - -@node Restartable multibyte conversion -@section Restartable Multibyte Conversion Functions - -The @w{ISO C} standard defines functions to convert strings from a -multibyte representation to wide character strings. There are a number -of peculiarities: - -@itemize @bullet -@item -The character set assumed for the multibyte encoding is not specified -as an argument to the functions. Instead the character set specified by -the @code{LC_CTYPE} category of the current locale is used; see -@ref{Locale Categories}. - -@item -The functions handling more than one character at a time require NUL -terminated strings as the argument. I.e., converting blocks of text -does not work unless one can add a NUL byte at an appropriate place. -The GNU C library contains some extensions to the standard which allow -specifying a size but basically they also expect terminated strings. -@end itemize - -Despite these limitations the @w{ISO C} functions can very well be used -in many contexts. In graphical user interfaces, for instance, it is not -uncommon to have functions which require text to be displayed in a wide -character string if it is not simple ASCII. The text itself might come -from a file with translations and the user should decide about the -current locale which determines the translation and therefore also the -external encoding used. In such a situation (and many others) the -functions described here are perfect. 
If more freedom while performing -the conversion is necessary take a look at the @code{iconv} functions -(@pxref{Generic Charset Conversion}). - -@menu -* Selecting the Conversion:: Selecting the conversion and its properties. -* Keeping the state:: Representing the state of the conversion. -* Converting a Character:: Converting Single Characters. -* Converting Strings:: Converting Multibyte and Wide Character - Strings. -* Multibyte Conversion Example:: A Complete Multibyte Conversion Example. -@end menu - -@node Selecting the Conversion -@subsection Selecting the conversion and its properties - -We already said above that the currently selected locale for the -@code{LC_CTYPE} category decides about the conversion which is performed -by the functions we are about to describe. Each locale uses its own -character set (given as an argument to @code{localedef}) and this is the -one assumed as the external multibyte encoding. The wide character -character set always is UCS-4, at least on GNU systems. - -A characteristic of each multibyte character set is the maximum number -of bytes which can be necessary to represent one character. This -information is quite important when writing code which uses the -conversion functions. The examples below illustrate this. -The @w{ISO C} standard defines two macros which provide this information. - - -@comment limits.h -@comment ISO -@deftypevr Macro int MB_LEN_MAX -This macro specifies the maximum number of bytes in the multibyte -sequence for a single character in any of the supported locales. It is -a compile-time constant and it is defined in @file{limits.h}. -@pindex limits.h -@end deftypevr - -@comment stdlib.h -@comment ISO -@deftypevr Macro int MB_CUR_MAX -@code{MB_CUR_MAX} expands into a positive integer expression that is the -maximum number of bytes in a multibyte character in the current locale. -The value is never greater than @code{MB_LEN_MAX}. 
Unlike -@code{MB_LEN_MAX} this macro need not be a compile-time constant and in -fact, in the GNU C library it is not. - -@pindex stdlib.h -@code{MB_CUR_MAX} is defined in @file{stdlib.h}. -@end deftypevr - -Two different macros are necessary since strictly @w{ISO C90} compilers -do not allow variable length array definitions but still it is desirable -to avoid dynamic allocation. This incomplete piece of code shows the -problem: - -@smallexample -@{ - char buf[MB_LEN_MAX]; - ssize_t len = 0; - - while (! feof (fp)) - @{ - fread (&buf[len], 1, MB_CUR_MAX - len, fp); - /* @r{... process} buf */ - len -= used; - @} -@} -@end smallexample - -The code in the inner loop is expected to have always enough bytes in -the array @var{buf} to convert one multibyte character. The array -@var{buf} has to be sized statically since many compilers do not allow a -variable size. The @code{fread} call makes sure that always -@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it isn't -a problem if @code{MB_CUR_MAX} is not a compile-time constant. - - -@node Keeping the state -@subsection Representing the state of the conversion - -@cindex stateful -In the introduction of this chapter it was said that certain character -sets use a @dfn{stateful} encoding. I.e., the encoded values depend in -some way on the previous bytes in the text. - -Since the conversion functions allow converting a text in more than one -step we must have a way to pass this information from one call of the -functions to another. - -@comment wchar.h -@comment ISO -@deftp {Data type} mbstate_t -@cindex shift state -A variable of type @code{mbstate_t} can contain all the information -about the @dfn{shift state} needed from one call to a conversion -function to another. - -@pindex wchar.h -This type is defined in @file{wchar.h}. It got introduced in -@w{Amendment 1} to @w{ISO C90}. 
-@end deftp - -To use objects of this type the programmer has to define such objects -(normally as local variables on the stack) and pass a pointer to the -object to the conversion functions. This way the conversion function -can update the object if the current multibyte character set is -stateful. - -There is no specific function or initializer to put the state object in -any specific state. The rules are that the object should always -represent the initial state before the first use and this is achieved by -clearing the whole variable with code such as follows: - -@smallexample -@{ - mbstate_t state; - memset (&state, '\0', sizeof (state)); - /* @r{from now on @var{state} can be used.} */ - ... -@} -@end smallexample - -When using the conversion functions to generate output it is often -necessary to test whether the current state corresponds to the initial -state. This is necessary, for example, to decide whether or not to emit -escape sequences to set the state to the initial state at certain -sequence points. Communication protocols often require this. - -@comment wchar.h -@comment ISO -@deftypefun int mbsinit (const mbstate_t *@var{ps}) -This function determines whether the state object pointed to by @var{ps} -is in the initial state or not. If @var{ps} is a null pointer or the -object is in the initial state the return value is nonzero. Otherwise -it is zero. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. -@end deftypefun - -Code using this function often looks similar to this: - -@c Fix the example to explicitly say how to generate the escape sequence -@c to restore the initial state. -@smallexample -@{ - mbstate_t state; - memset (&state, '\0', sizeof (state)); - /* @r{Use @var{state}.} */ - ... - if (! 
mbsinit (&state)) - @{ - /* @r{Emit code to return to initial state.} */ - const wchar_t empty[] = L""; - const wchar_t *srcp = empty; - wcsrtombs (outbuf, &srcp, outbuflen, &state); - @} - ... -@} -@end smallexample - -The code to emit the escape sequence to get back to the initial state is -interesting. The @code{wcsrtombs} function can be used to determine the -necessary output code (@pxref{Converting Strings}). Please note that on -GNU systems it is not necessary to perform this extra action for the -conversion from multibyte text to wide character text since the wide -character encoding is not stateful. But there is nothing mentioned in -any standard which prohibits @code{wchar_t} from using a stateful -encoding. - -@node Converting a Character -@subsection Converting Single Characters - -The most fundamental of the conversion functions are those dealing with -single characters. Please note that this does not always mean single -bytes. But since there is very often a subset of the multibyte -character set which consists of single byte sequences there are -functions to help with converting bytes. One very important and often -applicable scenario is where ASCII is a subpart of the multibyte -character set. I.e., all ASCII characters stand for themselves and all -other characters have at least a first byte which is beyond the range -@math{0} to @math{127}. - -@comment wchar.h -@comment ISO -@deftypefun wint_t btowc (int @var{c}) -The @code{btowc} function (``byte to wide character'') converts a valid -single byte character @var{c} in the initial shift state into the wide -character equivalent using the conversion rules from the currently -selected locale of the @code{LC_CTYPE} category. - -If @code{(unsigned char) @var{c}} is not a valid single byte multibyte -character or if @var{c} is @code{EOF} the function returns @code{WEOF}. - -Please note the restriction of @var{c} being tested for validity only in -the initial shift state. 
There is no @code{mbstate_t} object used from -which the state information is taken and the function also does not use -any static state. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. -@end deftypefun - -Despite the limitation that the single byte value always is interpreted -in the initial state this function is actually useful most of the time. -Most character sets are either entirely single-byte character sets or they -are extensions of ASCII. But then it is possible to write code like this -(not that this specific example is very useful): - -@smallexample -wchar_t * -itow (unsigned long int val) -@{ - static wchar_t buf[30]; - wchar_t *wcp = &buf[29]; - *wcp = L'\0'; - while (val != 0) - @{ - *--wcp = btowc ('0' + val % 10); - val /= 10; - @} - if (wcp == &buf[29]) - *--wcp = L'0'; - return wcp; -@} -@end smallexample - -Why is it necessary to use such a complicated implementation and not -simply cast @code{'0' + val % 10} to a wide character? The answer is -that there is no guarantee that one can perform this kind of arithmetic -on the characters of the character set used for @code{wchar_t} -representation. In other situations the bytes are not constant at -compile time and so the compiler cannot do the work. In situations like -this it is necessary to use @code{btowc}. - -@noindent -There also is a function for the conversion in the other direction. - -@comment wchar.h -@comment ISO -@deftypefun int wctob (wint_t @var{c}) -The @code{wctob} function (``wide character to byte'') takes as the -parameter a valid wide character. If the multibyte representation for -this character in the initial state is exactly one byte long the return -value of this function is this character. Otherwise the return value is -@code{EOF}. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. 
-@end deftypefun - -There are more general functions to convert a single character from -multibyte representation to wide characters and vice versa. These -functions pose no limit on the length of the multibyte representation -and they also do not require it to be in the initial state. - -@comment wchar.h -@comment ISO -@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps}) -@cindex stateful -The @code{mbrtowc} function (``multibyte restartable to wide -character'') converts the next multibyte character in the string pointed -to by @var{s} into a wide character and stores it in the wide character -string pointed to by @var{pwc}. The conversion is performed according -to the locale currently selected for the @code{LC_CTYPE} category. If -the conversion for the character set used in the locale requires a state -the multibyte string is interpreted in the state represented by the -object pointed to by @var{ps}. If @var{ps} is a null pointer, a static, -internal state variable used only by the @code{mbrtowc} function is -used. - -If the next multibyte character corresponds to the NUL wide character -the return value of the function is @math{0} and the state object is -afterwards in the initial state. If the next @var{n} or fewer bytes -form a correct multibyte character the return value is the number of -bytes starting from @var{s} which form the multibyte character. The -conversion state is updated according to the bytes consumed in the -conversion. In both cases the wide character (either the @code{L'\0'} -or the one found in the conversion) is stored in the string pointed to -by @var{pwc} iff @var{pwc} is not null. - -If the first @var{n} bytes of the multibyte string possibly form a valid -multibyte character but there are more than @var{n} bytes needed to -complete it the return value of the function is @code{(size_t) -2} and -no value is stored. 
Please note that this can happen even if @var{n} -has a value greater or equal to @code{MB_CUR_MAX} since the input might -contain redundant shift sequences. - -If the first @code{n} bytes of the multibyte string cannot possibly form -a valid multibyte character also no value is stored, the global variable -@code{errno} is set to the value @code{EILSEQ} and the function returns -@code{(size_t) -1}. The conversion state is afterwards undefined. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. -@end deftypefun - -Using this function is straightforward. A function which copies a -multibyte string into a wide character string while at the same time -converting all lowercase characters into uppercase could look like this -(this is not the final version, just an example; it has no error -checking, and sometimes leaks memory): - -@smallexample -wchar_t * -mbstouwcs (const char *s) -@{ - size_t len = strlen (s); - wchar_t *result = malloc ((len + 1) * sizeof (wchar_t)); - wchar_t *wcp = result; - wchar_t tmp[1]; - mbstate_t state; - size_t nbytes; - - memset (&state, '\0', sizeof (state)); - while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0) - @{ - if (nbytes >= (size_t) -2) - /* Invalid input string. */ - return NULL; - *wcp++ = towupper (tmp[0]); - len -= nbytes; - s += nbytes; - @} - return result; -@} -@end smallexample - -The use of @code{mbrtowc} should be clear. A single wide character is -stored in @code{@var{tmp}[0]} and the number of consumed bytes is stored -in the variable @var{nbytes}. In case the conversion was successful -the uppercase variant of the wide character is stored in the -@var{result} array and the pointer to the input string and the number of -available bytes are adjusted. - -The only non-obvious thing about the function might be the way memory is -allocated for the result. 
The above code uses the fact that there can -never be more wide characters in the converted results than there are -bytes in the multibyte input string. This method yields a -pessimistic guess about the size of the result and if many wide -character strings have to be constructed this way or the strings are -long, the extra memory allocated because the input string -contains multibyte characters might be significant. It would be -possible to resize the allocated memory block to the correct size before -returning it. A better solution might be to allocate just the right -amount of space for the result right away. Unfortunately there is no -function to compute the length of the wide character string directly -from the multibyte string. But there is a function which does part of -the work. - -@comment wchar.h -@comment ISO -@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps}) -The @code{mbrlen} function (``multibyte restartable length'') computes -the number of at most @var{n} bytes starting at @var{s} which form the -next valid and complete multibyte character. - -If the next multibyte character corresponds to the NUL wide character -the return value is @math{0}. If the next @var{n} bytes form a valid -multibyte character the number of bytes belonging to this multibyte -character byte sequence is returned. - -If the first @var{n} bytes possibly form a valid multibyte -character but it is incomplete the return value is @code{(size_t) -2}. -Otherwise the multibyte character sequence is invalid and the return -value is @code{(size_t) -1}. - -The multibyte sequence is interpreted in the state represented by the -object pointed to by @var{ps}. If @var{ps} is a null pointer, a state -object local to @code{mbrlen} is used. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. 
-@end deftypefun - -The attentive reader now will of course note that @code{mbrlen} can be -implemented as - -@smallexample -mbrtowc (NULL, s, n, ps != NULL ? ps : &internal) -@end smallexample - -This is true and in fact is mentioned in the official specification. -Now, how can this function be used to determine the length of the wide -character string created from a multibyte character string? It is not -directly usable but we can define a function @code{mbslen} using it: - -@smallexample -size_t -mbslen (const char *s) -@{ - mbstate_t state; - size_t result = 0; - size_t nbytes; - memset (&state, '\0', sizeof (state)); - while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0) - @{ - if (nbytes >= (size_t) -2) - /* @r{Something is wrong.} */ - return (size_t) -1; - s += nbytes; - ++result; - @} - return result; -@} -@end smallexample - -This function simply calls @code{mbrlen} for each multibyte character -in the string and counts the number of function calls. Please note that -we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen} -call. This is OK since a) this value is at least as large as the length of the -longest multibyte character sequence and b) because we know that the -string @var{s} ends with a NUL byte which cannot be part of any other -multibyte character sequence but the one representing the NUL wide -character. Therefore the @code{mbrlen} function will never read invalid -memory. - -Now that this function is available (just to make this clear, this -function is @emph{not} part of the GNU C library) we can compute the -number of bytes required to store the wide character string converted -from the multibyte character string @var{s} using - -@smallexample -wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t); -@end smallexample - -Please note that the @code{mbslen} function is quite inefficient.
An -implementation of @code{mbstouwcs} using @code{mbslen} would -have to perform the conversion of the multibyte character input string -twice and this conversion might be quite expensive. So it is necessary -to think about the consequences of using the easier but imprecise method -before doing the work twice. - -@comment wchar.h -@comment ISO -@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps}) -The @code{wcrtomb} function (``wide character restartable to -multibyte'') converts a single wide character into a multibyte string -corresponding to that wide character. - -If @var{s} is a null pointer the function resets the state stored in -the object pointed to by @var{ps} (or the internal @code{mbstate_t} -object) to the initial state. This can also be achieved by a call like -this: - -@smallexample -wcrtomb (temp_buf, L'\0', ps) -@end smallexample - -@noindent -since if @var{s} is a null pointer @code{wcrtomb} performs as if it -writes into an internal buffer which is guaranteed to be large enough. - -If @var{wc} is the NUL wide character @code{wcrtomb} emits, if -necessary, a shift sequence to get the state @var{ps} into the initial -state followed by a single NUL byte, which is stored in the string @var{s}. - -Otherwise a byte sequence (possibly including shift sequences) is -written into the string @var{s}. This only happens if @var{wc} is a -valid wide character, i.e., it has a multibyte representation in the -character set selected by the locale of the @code{LC_CTYPE} category. If -@var{wc} is not a valid wide character nothing is stored in the string -@var{s}, @code{errno} is set to @code{EILSEQ}, the conversion state in -@var{ps} is undefined and the return value is @code{(size_t) -1}. - -If no error occurred the function returns the number of bytes stored in -the string @var{s}. This includes all bytes representing shift -sequences.
- -One word about the interface of the function: there is no parameter -specifying the length of the array @var{s}. Instead the function -assumes that there are at least @code{MB_CUR_MAX} bytes available since -this is the maximum length of any byte sequence representing a single -character. So the caller has to make sure that there is enough space -available, otherwise buffer overruns can occur. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and is -declared in @file{wchar.h}. -@end deftypefun - -Using this function is as easy as using @code{mbrtowc}. The following -example appends a wide character string to a multibyte character string. -Again, the code is not really useful (or correct); it is simply here to -demonstrate the use and some problems. - -@smallexample -char * -mbscatwcs (char *s, size_t len, const wchar_t *ws) -@{ - mbstate_t state; - /* @r{Find the end of the existing string.} */ - char *wp = strchr (s, '\0'); - len -= wp - s; - memset (&state, '\0', sizeof (state)); - do - @{ - size_t nbytes; - if (len < MB_CUR_MAX) - @{ - /* @r{We cannot guarantee that the next} - @r{character fits into the buffer, so} - @r{return an error.} */ - errno = E2BIG; - return NULL; - @} - nbytes = wcrtomb (wp, *ws, &state); - if (nbytes == (size_t) -1) - /* @r{Error in the conversion.} */ - return NULL; - len -= nbytes; - wp += nbytes; - @} - while (*ws++ != L'\0'); - return s; -@} -@end smallexample - -First the function has to find the end of the string currently in the -array @var{s}. The @code{strchr} call does this very efficiently since a -requirement for multibyte character representations is that the NUL byte -is never used except to represent itself (and in this context, the end -of the string). - -After initializing the state object the loop is entered where the first -task is to make sure there is enough room in the array @var{s}. We -abort if there are not at least @code{MB_CUR_MAX} bytes available.
This -is not always optimal but we have no other choice. We might have fewer -than @code{MB_CUR_MAX} bytes available but the next multibyte character -might also be only one byte long. At the time the @code{wcrtomb} call -returns it is too late to decide whether the buffer was large enough or -not. If this solution is really unsuitable there is a very slow but -more accurate solution. - -@smallexample - ... - if (len < MB_CUR_MAX) - @{ - mbstate_t temp_state; - char temp_buf[MB_LEN_MAX]; - memcpy (&temp_state, &state, sizeof (state)); - if (wcrtomb (temp_buf, *ws, &temp_state) > len) - @{ - /* @r{We cannot guarantee that the next} - @r{character fits into the buffer, so} - @r{return an error.} */ - errno = E2BIG; - return NULL; - @} - @} - ... -@end smallexample - -Here we perform the conversion into a scratch buffer first so -that we are afterwards in the position to make an exact decision about -the buffer size. Please note that the new @code{wcrtomb} call writes -into a scratch buffer of @code{MB_LEN_MAX} bytes; a null pointer cannot -be used for the destination here because @code{wcrtomb} would then -ignore the wide character argument and only reset the state. The most unusual thing about this piece of code certainly -is the duplication of the conversion state object. But think about -this: if a change of the state is necessary to emit the next multibyte -character we want to have the same shift state change performed in the -real conversion. Therefore we have to preserve the initial shift state -information. - -There are certainly many more and even better solutions to this problem. -This example is only meant for educational purposes. - -@node Converting Strings -@subsection Converting Multibyte and Wide Character Strings - -The functions described in the previous section only convert a single -character at a time. Most operations to be performed in real-world -programs include strings and therefore the @w{ISO C} standard also -defines conversions on entire strings.
However, the defined set of -functions is quite limited, thus the GNU C library contains a few -extensions which can help in some important situations. - -@comment wchar.h -@comment ISO -@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) -The @code{mbsrtowcs} function (``multibyte string restartable to wide -character string'') converts a NUL terminated multibyte character -string at @code{*@var{src}} into an equivalent wide character string, -including the NUL wide character at the end. The conversion is started -using the state information from the object pointed to by @var{ps} or -from an internal object of @code{mbsrtowcs} if @var{ps} is a null -pointer. Before returning, the state object is updated to match the state after the -last converted character. The state is the initial state if the -terminating NUL byte is reached and converted. - -If @var{dst} is not a null pointer the result is stored in the array -pointed to by @var{dst}, otherwise the conversion result is not -available since it is stored in an internal buffer. - -If @var{len} wide characters are stored in the array @var{dst} before -reaching the end of the input string the conversion stops and @var{len} -is returned. If @var{dst} is a null pointer @var{len} is never checked. - -Another reason for a premature return from the function call is if the -input string contains an invalid multibyte sequence. In this case the -global variable @code{errno} is set to @code{EILSEQ} and the function -returns @code{(size_t) -1}. - -@c XXX The ISO C9x draft seems to have a problem here. It says that PS -@c is not updated if DST is NULL. This is not said straight forward and -@c none of the other functions is described like this. It would make sense -@c to define the function this way but I don't think it is meant like this. - -In all other cases the function returns the number of wide characters -converted during this call.
If @var{dst} is not null @code{mbsrtowcs} -stores in the pointer pointed to by @var{src} a null pointer (if the NUL -byte in the input string was reached) or the address of the byte -following the last converted multibyte character. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and is -declared in @file{wchar.h}. -@end deftypefun - -The definition of this function has one limitation which has to be -understood. The requirement that the input string at @code{*@var{src}} has to be NUL -terminated causes problems if one wants to convert buffers with text. A -buffer is normally not a collection of NUL terminated strings but instead a -continuous collection of lines, separated by newline characters. Now -assume a function to convert one line from a buffer is needed. Since -the line is not NUL terminated the source pointer cannot directly point -into the unmodified text buffer. This means, either one inserts the NUL -byte at the appropriate place for the time of the @code{mbsrtowcs} -function call (which is not doable for a read-only buffer or in a -multi-threaded application) or one copies the line into an extra buffer -where it can be terminated by a NUL byte. Note that it is not in -general possible to limit the number of characters to convert by setting -the parameter @var{len} to any specific value. Since it is not known -how many bytes each multibyte character sequence is in length one -could only guess. - -@cindex stateful -There is still a problem with the method of NUL-terminating a line right -after the newline character which could lead to very strange results. -As said in the description of the @code{mbsrtowcs} function above the -conversion state is guaranteed to be in the initial shift state after -processing the NUL byte at the end of the input string. But this NUL -byte is not really part of the text.
I.e., the conversion state after -the newline in the original text could be something different than the -initial shift state and therefore the first character of the next line -is encoded using this state. But the state in question is never -accessible to the user since the conversion stops after the NUL byte -(which resets the state). Most stateful character sets in use today -require that the shift state after a newline is the initial state--but -this is not a strict guarantee. Therefore simply NUL terminating a -piece of a running text is not always an adequate solution and therefore -never should be used in generally used code. - -The generic conversion interface (@pxref{Generic Charset Conversion}) -does not have this limitation (it simply works on buffers, not -strings), and the GNU C library contains a set of functions which take -additional parameters specifying the maximal number of bytes which are -consumed from the input string. This way the problem of -@code{mbsrtowcs}'s example above could be solved by determining the line -length and passing this length to the function. - -@comment wchar.h -@comment ISO -@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) -The @code{wcsrtombs} function (``wide character string restartable to -multibyte string'') converts the NUL terminated wide character string at -@code{*@var{src}} into an equivalent multibyte character string and -stores the result in the array pointed to by @var{dst}. The NUL wide -character is also converted. The conversion starts in the state -described in the object pointed to by @var{ps} or by a state object -locally to @code{wcsrtombs} in case @var{ps} is a null pointer. If -@var{dst} is a null pointer the conversion is performed as usual but the -result is not available. 
If all characters of the input string were -successfully converted and if @var{dst} is not a null pointer the -pointer pointed to by @var{src} gets assigned a null pointer. - -If one of the wide characters in the input string has no valid multibyte -character equivalent the conversion stops early, sets the global -variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. - -Another reason for a premature stop is if @var{dst} is not a null -pointer and the next converted character would require more than -@var{len} bytes in total in the array @var{dst}. In this case (and if -@var{dst} is not a null pointer) the pointer pointed to by @var{src} is -assigned a value pointing to the wide character right after the last one -successfully converted. - -Except in the case of an encoding error the return value of the function -is the number of bytes in all the multibyte character sequences stored -in @var{dst}. Before returning the state in the object pointed to by -@var{ps} (or the internal object in case @var{ps} is a null pointer) is -updated to reflect the state after the last conversion. The state is -the initial shift state in case the terminating NUL wide character was -converted. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and is -declared in @file{wchar.h}. -@end deftypefun - -The restriction mentioned above for the @code{mbsrtowcs} function -also applies here. There is no possibility to directly control the number of -input characters. One has to place the NUL wide character at the -correct place or control the consumed input indirectly via the available -output array size (the @var{len} parameter). - -@comment wchar.h -@comment GNU -@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) -The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} -function.
All the parameters are the same except for @var{nmc} which is -new. The return value is the same as for @code{mbsrtowcs}. - -This new parameter specifies how many bytes at most can be used from the -multibyte character string. I.e., the multibyte character string -@code{*@var{src}} need not be NUL terminated. But if a NUL byte is -found within the @var{nmc} first bytes of the string the conversion -stops here. - -This function is a GNU extension. It is meant to work around the -problems mentioned above. Now it is possible to convert a buffer with -multibyte character text piece by piece without having to care about -inserting NUL bytes and the effect of NUL bytes on the conversion state. -@end deftypefun - -A function to convert a multibyte string into a wide character string -and display it could be written like this (this is not a really useful -example): - -@smallexample -void -showmbs (const char *src, FILE *fp) -@{ - mbstate_t state; - int cnt = 0; - memset (&state, '\0', sizeof (state)); - while (1) - @{ - wchar_t linebuf[100]; - const char *endp = strchr (src, '\n'); - size_t n; - - /* @r{Exit if there is no more line.} */ - if (endp == NULL) - break; - - n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); - /* @r{Stop at an invalid multibyte sequence.} */ - if (n == (size_t) -1) - break; - linebuf[n] = L'\0'; - fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf); - /* @r{Continue with the text after the newline.} */ - src = endp + 1; - @} -@} -@end smallexample - -There is no problem with the state after a call to @code{mbsnrtowcs}. -Since we don't insert characters in the strings which were not in there -right from the beginning and we use @var{state} only for the conversion -of the given buffer there is no problem with altering the state. - -@comment wchar.h -@comment GNU -@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps}) -The @code{wcsnrtombs} function implements the conversion from wide -character strings to multibyte character strings.
It is similar to -@code{wcsrtombs} but it takes, just like @code{mbsnrtowcs}, an extra -parameter which specifies the length of the input string. - -No more than @var{nwc} wide characters from the input string -@code{*@var{src}} are converted. If the input string contains a NUL -wide character in the first @var{nwc} characters the conversion stops at -this place. - -This function is a GNU extension and just like @code{mbsnrtowcs} it -helps in situations where no NUL terminated input strings are available. -@end deftypefun - - -@node Multibyte Conversion Example -@subsection A Complete Multibyte Conversion Example - -The example programs given in the last sections are only brief and do -not contain all the error checking etc. Presented here is a complete -and documented example. It features the @code{mbrtowc} function but it -should be easy to derive versions using the other functions. - -@smallexample -int -file_mbsrtowcs (int input, int output) -@{ - /* @r{Note the use of @code{MB_LEN_MAX}.} - @r{@code{MB_CUR_MAX} cannot portably be used here.} */ - char buffer[BUFSIZ + MB_LEN_MAX]; - mbstate_t state; - int filled = 0; - int eof = 0; - - /* @r{Initialize the state.} */ - memset (&state, '\0', sizeof (state)); - - while (!eof) - @{ - ssize_t nread; - ssize_t nwrite; - char *inp = buffer; - wchar_t outbuf[BUFSIZ]; - wchar_t *outp = outbuf; - - /* @r{Fill up the buffer from the input file.} */ - nread = read (input, buffer + filled, BUFSIZ); - if (nread < 0) - @{ - perror ("read"); - return 0; - @} - /* @r{If we reach end of file, make a note to read no more.} */ - if (nread == 0) - eof = 1; - - /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */ - filled += nread; - - /* @r{Convert those bytes to wide characters--as many as we can.} */ - while (1) - @{ - size_t thislen = mbrtowc (outp, inp, filled, &state); - /* @r{Stop converting at invalid character;} - @r{this can mean we have read just the first part} - @r{of a valid character.} */ - if (thislen
>= (size_t) -2) - break; - /* @r{We want to handle embedded NUL bytes} - @r{but the return value is 0. Correct this.} */ - if (thislen == 0) - thislen = 1; - /* @r{Advance past this character.} */ - inp += thislen; - filled -= thislen; - ++outp; - @} - - /* @r{Write the wide characters we just made.} */ - nwrite = write (output, outbuf, - (outp - outbuf) * sizeof (wchar_t)); - if (nwrite < 0) - @{ - perror ("write"); - return 0; - @} - - /* @r{See if we have a @emph{real} invalid character.} */ - if ((eof && filled > 0) || filled >= MB_CUR_MAX) - @{ - error (0, 0, "invalid multibyte character"); - return 0; - @} - - /* @r{If any characters must be carried forward,} - @r{put them at the beginning of @code{buffer}.} */ - if (filled > 0) - memmove (buffer, inp, filled); - @} - - return 1; -@} -@end smallexample - - -@node Non-reentrant Conversion -@section Non-reentrant Conversion Function - -The functions described in the previous section are defined in -@w{Amendment 1} to @w{ISO C90}. But the original @w{ISO C90} standard also -contained functions for character set conversion. The reason that they -were not described first is that they are almost entirely -useless. - -The problem is that all the functions for conversion defined in @w{ISO -C90} use a local state. This implies that multiple conversions at the -same time (not only when using threads) cannot be done, and that you -cannot first convert single characters and then strings since you cannot -tell the conversion functions which state to use. - -These functions are therefore usable only in a very limited set of -situations. One must complete converting the entire string before -starting a new one and each string/text must be converted with the same -function (there is no problem with the library itself; it is guaranteed -that no library function changes the state of any of these functions).
-@strong{For the above reasons it is highly requested that the functions -from the last section are used in place of non-reentrant conversion -functions.} - -@menu -* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single - Characters. -* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. -* Shift State:: States in Non-reentrant Functions. -@end menu - -@node Non-reentrant Character Conversion -@subsection Non-reentrant Conversion of Single Characters - -@comment stdlib.h -@comment ISO -@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size}) -The @code{mbtowc} (``multibyte to wide character'') function when called -with non-null @var{string} converts the first multibyte character -beginning at @var{string} to its corresponding wide character code. It -stores the result in @code{*@var{result}}. - -@code{mbtowc} never examines more than @var{size} bytes. (The idea is -to supply for @var{size} the number of bytes of data you have in hand.) - -@code{mbtowc} with non-null @var{string} distinguishes three -possibilities: the first @var{size} bytes at @var{string} start with -valid multibyte character, they start with an invalid byte sequence or -just part of a character, or @var{string} points to an empty string (a -null character). - -For a valid multibyte character, @code{mbtowc} converts it to a wide -character and stores that in @code{*@var{result}}, and returns the -number of bytes in that character (always at least @math{1}, and never -more than @var{size}). - -For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an -empty string, it returns @math{0}, also storing @code{'\0'} in -@code{*@var{result}}. - -If the multibyte character code uses shift characters, then -@code{mbtowc} maintains and updates a shift state as it scans. If you -call @code{mbtowc} with a null pointer for @var{string}, that -initializes the shift state to its standard initial value. 
It also -returns nonzero if the multibyte character code in use actually has a -shift state. @xref{Shift State}. -@end deftypefun - -@comment stdlib.h -@comment ISO -@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) -The @code{wctomb} (``wide character to multibyte'') function converts -the wide character code @var{wchar} to its corresponding multibyte -character sequence, and stores the result in bytes starting at -@var{string}. At most @code{MB_CUR_MAX} characters are stored. - -@code{wctomb} with non-null @var{string} distinguishes three -possibilities for @var{wchar}: a valid wide character code (one that can -be translated to a multibyte character), an invalid code, and @code{L'\0'}. - -Given a valid code, @code{wctomb} converts it to a multibyte character, -storing the bytes starting at @var{string}. Then it returns the number -of bytes in that character (always at least @math{1}, and never more -than @code{MB_CUR_MAX}). - -If @var{wchar} is an invalid wide character code, @code{wctomb} returns -@math{-1}. If @var{wchar} is @code{L'\0'}, it returns @code{0}, also -storing @code{'\0'} in @code{*@var{string}}. - -If the multibyte character code uses shift characters, then -@code{wctomb} maintains and updates a shift state as it scans. If you -call @code{wctomb} with a null pointer for @var{string}, that -initializes the shift state to its standard initial value. It also -returns nonzero if the multibyte character code in use actually has a -shift state. @xref{Shift State}. - -Calling this function with a @var{wchar} argument of zero when -@var{string} is not null has the side-effect of reinitializing the -stored shift state @emph{as well as} storing the multibyte character -@code{'\0'} and returning @math{0}. -@end deftypefun - -Similar to @code{mbrlen} there is also a non-reentrant function which -computes the length of a multibyte character. It can be defined in -terms of @code{mbtowc}. 
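As a minimal sketch of that remark, a length function can be written as a thin wrapper around @code{mbtowc}. The name @code{my_mblen} is hypothetical and is not part of the C library. One difference from the real function is assumed here: the wrapper shares the shift state of @code{mbtowc} instead of keeping its own.

```c
#include <stddef.h>
#include <stdlib.h>

/* my_mblen is a hypothetical name, not a library function.  */
int
my_mblen (const char *string, size_t size)
{
  /* A null pointer for the result makes mbtowc discard the wide
     character, so only the byte count (or 0 or -1) is returned.
     With a null STRING the call resets the shift state of mbtowc
     and reports whether the encoding is state-dependent.  */
  return mbtowc (NULL, string, size);
}
```

A call such as @code{my_mblen (s, strlen (s))} can then be used wherever only the length of the next multibyte character is needed, subject to the same shift state rules as @code{mbtowc}.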
- -@comment stdlib.h -@comment ISO -@deftypefun int mblen (const char *@var{string}, size_t @var{size}) -The @code{mblen} function with a non-null @var{string} argument returns -the number of bytes that make up the multibyte character beginning at -@var{string}, never examining more than @var{size} bytes. (The idea is -to supply for @var{size} the number of bytes of data you have in hand.) - -The return value of @code{mblen} distinguishes three possibilities: the -first @var{size} bytes at @var{string} start with valid multibyte -character, they start with an invalid byte sequence or just part of a -character, or @var{string} points to an empty string (a null character). - -For a valid multibyte character, @code{mblen} returns the number of -bytes in that character (always at least @code{1}, and never more than -@var{size}). For an invalid byte sequence, @code{mblen} returns -@math{-1}. For an empty string, it returns @math{0}. - -If the multibyte character code uses shift characters, then @code{mblen} -maintains and updates a shift state as it scans. If you call -@code{mblen} with a null pointer for @var{string}, that initializes the -shift state to its standard initial value. It also returns a nonzero -value if the multibyte character code in use actually has a shift state. -@xref{Shift State}. - -@pindex stdlib.h -The function @code{mblen} is declared in @file{stdlib.h}. -@end deftypefun - - -@node Non-reentrant String Conversion -@subsection Non-reentrant Conversion of Strings - -For convenience reasons the @w{ISO C90} standard defines also functions -to convert entire strings instead of single characters. These functions -suffer from the same problems as their reentrant counterparts from -@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}. 
- -@comment stdlib.h -@comment ISO -@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size}) -The @code{mbstowcs} (``multibyte string to wide character string'') -function converts the null-terminated string of multibyte characters -@var{string} to an array of wide character codes, storing not more than -@var{size} wide characters into the array beginning at @var{wstring}. -The terminating null character counts towards the size, so if @var{size} -is less than the actual number of wide characters resulting from -@var{string}, no terminating null character is stored. - -The conversion of characters from @var{string} begins in the initial -shift state. - -If an invalid multibyte character sequence is found, this function -returns a value of @math{-1}. Otherwise, it returns the number of wide -characters stored in the array @var{wstring}. This number does not -include the terminating null character, which is present if the number -is less than @var{size}. - -Here is an example showing how to convert a string of multibyte -characters, allocating enough space for the result. - -@smallexample -wchar_t * -mbstowcs_alloc (const char *string) -@{ - size_t size = strlen (string) + 1; - wchar_t *buf = xmalloc (size * sizeof (wchar_t)); - - size = mbstowcs (buf, string, size); - if (size == (size_t) -1) - return NULL; - buf = xrealloc (buf, (size + 1) * sizeof (wchar_t)); - return buf; -@} -@end smallexample - -@end deftypefun - -@comment stdlib.h -@comment ISO -@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size}) -The @code{wcstombs} (``wide character string to multibyte string'') -function converts the null-terminated wide character array @var{wstring} -into a string containing multibyte characters, storing not more than -@var{size} bytes starting at @var{string}, followed by a terminating -null character if there is room. The conversion of characters begins in -the initial shift state. 
- -The terminating null character counts towards the size, so if @var{size} -is less than or equal to the number of bytes needed in @var{wstring}, no -terminating null character is stored. - -If a code that does not correspond to a valid multibyte character is -found, this function returns a value of @math{-1}. Otherwise, the -return value is the number of bytes stored in the array @var{string}. -This number does not include the terminating null character, which is -present if the number is less than @var{size}. -@end deftypefun - -@node Shift State -@subsection States in Non-reentrant Functions - -In some multibyte character codes, the @emph{meaning} of any particular -byte sequence is not fixed; it depends on what other sequences have come -earlier in the same string. Typically there are just a few sequences -that can change the meaning of other sequences; these few are called -@dfn{shift sequences} and we say that they set the @dfn{shift state} for -other sequences that follow. - -To illustrate shift state and shift sequences, suppose we decide that -the sequence @code{0200} (just one byte) enters Japanese mode, in which -pairs of bytes in the range from @code{0240} to @code{0377} are single -characters, while @code{0201} enters Latin-1 mode, in which single bytes -in the range from @code{0240} to @code{0377} are characters, and -interpreted according to the ISO Latin-1 character set. This is a -multibyte code which has two alternative shift states (``Japanese mode'' -and ``Latin-1 mode''), and two shift sequences that specify particular -shift states. - -When the multibyte character code in use has shift states, then -@code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update -the current shift state as they scan the string. 
To make this work -properly, you must follow these rules: - -@itemize @bullet -@item -Before starting to scan a string, call the function with a null pointer -for the multibyte character address---for example, @code{mblen (NULL, -0)}. This initializes the shift state to its standard initial value. - -@item -Scan the string one character at a time, in order. Do not ``back up'' -and rescan characters already scanned, and do not intersperse the -processing of different strings. -@end itemize - -Here is an example of using @code{mblen} following these rules: - -@smallexample -void -scan_string (char *s) -@{ - int length = strlen (s); - - /* @r{Initialize shift state.} */ - mblen (NULL, 0); - - while (1) - @{ - int thischar = mblen (s, length); - /* @r{Deal with end of string and invalid characters.} */ - if (thischar == 0) - break; - if (thischar == -1) - @{ - error (0, 0, "invalid multibyte character"); - break; - @} - /* @r{Advance past this character.} */ - s += thischar; - length -= thischar; - @} -@} -@end smallexample - -The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not -reentrant when using a multibyte code that uses a shift state. However, -no other library functions call these functions, so you don't have to -worry that the shift state will be changed mysteriously. - - -@node Generic Charset Conversion -@section Generic Charset Conversion - -The conversion functions mentioned so far in this chapter all had in -common that they operate on character sets which are not directly -specified by the functions. The multibyte encoding used is specified by -the currently selected locale for the @code{LC_CTYPE} category. The -wide character set is fixed by the implementation (in the case of the GNU C -library it is always UCS-4 encoded @w{ISO 10646}).
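The dependence on the locale can be made visible with a small sketch. The helper @code{mbstowcs_in_locale} is a hypothetical name introduced only for illustration; it assumes that the caller knows a locale name valid on the system and that no other thread depends on the @code{LC_CTYPE} setting while it runs.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a library function: convert S to wide
   characters using the multibyte encoding of the locale LOCALE_NAME.
   Returns the number of wide characters stored in DST, or
   (size_t) -1 on failure.  */
size_t
mbstowcs_in_locale (const char *locale_name, wchar_t *dst,
                    const char *s, size_t dstlen)
{
  size_t result = (size_t) -1;
  const char *current = setlocale (LC_CTYPE, NULL);
  char *saved;

  if (current == NULL)
    return result;
  /* Copy the name; a later setlocale call may overwrite it.  */
  saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return result;
  strcpy (saved, current);

  /* The encoding is never passed to mbstowcs; it is selected
     implicitly by switching the process-wide LC_CTYPE category.  */
  if (setlocale (LC_CTYPE, locale_name) != NULL)
    {
      result = mbstowcs (dst, s, dstlen);
      /* Restore the previous selection.  */
      setlocale (LC_CTYPE, saved);
    }
  free (saved);
  return result;
}
```

The character set appears nowhere in the conversion call itself; the only way to choose it is to change global state around the call and restore it afterwards.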
- -This locale dependence of course causes several problems when it comes to general character -conversion: - -@itemize @bullet -@item -For every conversion where neither the source nor the destination character -set is the character set of the locale for the @code{LC_CTYPE} category, -one has to change the @code{LC_CTYPE} locale using @code{setlocale}. - -This introduces major problems for the rest of the program since -several more functions (e.g., the character classification functions, -@pxref{Classification of Characters}) use the @code{LC_CTYPE} category. - -@item -Parallel conversions to and from different character sets are not -possible since the @code{LC_CTYPE} selection is global and shared by all -threads. - -@item -If neither the source nor the destination character set is the character -set used for @code{wchar_t} representation there is at least a two-step -process necessary to convert a text using the functions above. One -would have to select the source character set as the multibyte encoding, -convert the text into a @code{wchar_t} text, select the destination -character set as the multibyte encoding and convert the wide character -text to the multibyte (@math{=} destination) character set. - -Even if this is possible (which is not guaranteed) it is very tedious -work. It also suffers from the other two problems even more because -the locale has to be changed constantly. -@end itemize - - -The XPG2 standard defines a completely new set of functions which has -none of these limitations. They are not at all coupled to the selected -locales and they put no constraints on the character sets selected for -source and destination. Only the set of available conversions -limits them. The standard does not specify that any conversion at all -must be available. Which conversions are provided is a measure of the quality of the implementation. - -In the following text first the interface to @code{iconv}, the -conversion function, will be described.
Comparisons with other -implementations will show what pitfalls lie in the way of portable -applications. Finally, the implementation is described as far as -it is of interest to the advanced user who wants to extend the conversion -capabilities. - -@menu -* Generic Conversion Interface:: Generic Character Set Conversion Interface. -* iconv Examples:: A complete @code{iconv} example. -* Other iconv Implementations:: Some Details about other @code{iconv} - Implementations. -* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C - library. -@end menu - -@node Generic Conversion Interface -@subsection Generic Character Set Conversion Interface - -This set of functions follows the traditional cycle of using a resource: -open--use--close. The interface consists of three functions, each of -which implements one step. - -Before the interfaces are described it is necessary to introduce a -datatype. Just like other open--use--close interfaces the functions -introduced here work using handles and the @file{iconv.h} header -defines a special type for the handles used. - -@comment iconv.h -@comment XPG2 -@deftp {Data Type} iconv_t -This data type is an abstract type defined in @file{iconv.h}. The user -must not assume anything about the definition of this type; it must be -treated as completely opaque. - -Objects of this type can get assigned handles for the conversions using -the @code{iconv} functions. The objects themselves need not be freed but -the conversions for which the handles stand have to be. -@end deftp - -@noindent -The first step is the function to create a handle. - -@comment iconv.h -@comment XPG2 -@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode}) -The @code{iconv_open} function has to be used before starting a -conversion.
The two parameters this function takes determine the -source and destination character set for the conversion and if the -implementation has the possibility to perform such a conversion the -function returns a handle. - -If the wanted conversion is not available the function returns -@code{(iconv_t) -1}. In this case the global variable @code{errno} can -have the following values: - -@table @code -@item EMFILE -The process already has @code{OPEN_MAX} file descriptors open. -@item ENFILE -The system limit of open files is reached. -@item ENOMEM -Not enough memory to carry out the operation. -@item EINVAL -The conversion from @var{fromcode} to @var{tocode} is not supported. -@end table - -It is not possible to use the same descriptor in different threads to -perform independent conversions. Within the data structures associated -with the descriptor there is information about the conversion state. -This must not be messed up by using it in different conversions. - -An @code{iconv} descriptor is like a file descriptor in that for every use a -new descriptor must be created. The descriptor does not stand for all -of the conversions from @var{fromcode} to @var{tocode}. - -The GNU C library implementation of @code{iconv_open} has one -significant extension to other implementations. To ease the extension -of the set of available conversions the implementation allows storing -the necessary files with data and code in arbitrarily many directories. -How this extension has to be written will be explained below -(@pxref{glibc iconv Implementation}). Here it is only important to say -that all directories mentioned in the @code{GCONV_PATH} environment -variable are considered if they contain a file @file{gconv-modules}. -These directories need not necessarily be created by the system -administrator. In fact, this extension is introduced to help users -write and use their own, new conversions.
Of course, for security reasons this does not work -in SUID binaries; in this case only the system -directory is considered and this normally is -@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment -variable is examined exactly once at the first call of the -@code{iconv_open} function. Later modifications of the variable have no -effect. - -@pindex iconv.h -This function was introduced early, in the X/Open Portability Guide, -@w{version 2}. It is supported by all commercial Unices as it is -required for the Unix branding. However, the quality and completeness -of the implementations vary widely. The function is declared in -@file{iconv.h}. -@end deftypefun - -The @code{iconv} implementation can associate large data structures with -the handle returned by @code{iconv_open}. Therefore it is crucial to -free all the resources once all conversions are carried out and the -conversion is not needed anymore. - -@comment iconv.h -@comment XPG2 -@deftypefun int iconv_close (iconv_t @var{cd}) -The @code{iconv_close} function frees all resources associated with the -handle @var{cd} which must have been returned by a successful call to -the @code{iconv_open} function. - -If the function call was successful the return value is @math{0}. -Otherwise it is @math{-1} and @code{errno} is set appropriately. -Defined errors are: - -@table @code -@item EBADF -The conversion descriptor is invalid. -@end table - -@pindex iconv.h -This function was introduced together with the rest of the @code{iconv} -functions in XPG2 and it is declared in @file{iconv.h}. -@end deftypefun - -The standard defines only one actual conversion function. This has -therefore the most general interface: it allows conversion from one -buffer to another. Conversion from a file to a buffer, vice versa, or -even file to file can be implemented on top of it.
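Before turning to the conversion function itself, the open and close steps described so far can be sketched as follows. The helper `conversion_available` is a hypothetical name used only for illustration; which character set names are accepted depends on the local implementation.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to TOCODE
   can be opened, 0 if the implementation reports it as unsupported
   (EINVAL), and -1 on any other error.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  /* The handle may own large data structures, so release it as soon
     as it is no longer needed.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      return -1;
    }

  return 1;
}
```

Note the argument order: the destination character set comes first, as in the `iconv_open` prototype above.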
- -@comment iconv.h -@comment XPG2 -@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) -@cindex stateful -The @code{iconv} function converts the text in the input buffer -according to the rules associated with the descriptor @var{cd} and -stores the result in the output buffer. It is possible to call the -function for the same text several times in a row since for stateful -character sets the necessary state information is kept in the data -structures associated with the descriptor. - -The input buffer is specified by @code{*@var{inbuf}} and it contains -@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for -communicating the used input back to the caller (see below). It is -important to note that the buffer pointer is of type @code{char} and the -length is measured in bytes even if the input text is encoded in wide -characters. - -The output buffer is specified in a similar way. @code{*@var{outbuf}} -points to the beginning of the buffer with at least -@code{*@var{outbytesleft}} bytes room for the result. The buffer -pointer again is of type @code{char} and the length is measured in -bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer the -conversion is performed but no output is available. - -If @var{inbuf} is a null pointer the @code{iconv} function performs the -necessary action to put the state of the conversion into the initial -state. This is obviously a no-op for non-stateful encodings, but if the -encoding has a state such a function call might put some byte sequences -in the output buffer which perform the necessary state changes. The -next call with @var{inbuf} not being a null pointer then simply goes on -from the initial state. It is important that the programmer never makes -any assumption on whether the conversion has to deal with states or not. 
-Even if the input and output character sets are not stateful the -implementation might still have to keep states. This is due to the -implementation chosen for the GNU C library as it is described below. -Therefore an @code{iconv} call to reset the state should always be -performed if some protocol requires this for the output text. - -The conversion stops for three reasons. The first is that all -characters from the input buffer are converted. This actually can mean -two things: really all bytes from the input buffer are consumed or -there are some bytes at the end of the buffer which possibly can form a -complete character but the input is incomplete. The second reason for a -stop is when the output buffer is full. And the third reason is that -the input contains invalid characters. - -In all these cases the buffer pointers after the last successful -conversion, for input and output buffer, are stored in @var{inbuf} and -@var{outbuf} and the available room in each buffer is stored in -@var{inbytesleft} and @var{outbytesleft}. - -Since the character sets selected in the @code{iconv_open} call can be -almost arbitrary there can be situations where the input buffer contains -valid characters which have no identical representation in the output -character set. The behavior in this situation is undefined. The -@emph{current} behavior of the GNU C library in this situation is to -return with an error immediately. This certainly is not the most -desirable solution. Therefore future versions will provide better ones -but they are not yet finished. - -If all input from the input buffer is successfully converted and stored -in the output buffer the function returns the number of non-reversible -conversions performed. In all other cases the return value is -@code{(size_t) -1} and @code{errno} is set appropriately. In this case -the value pointed to by @var{inbytesleft} is nonzero. 
- -@table @code -@item EILSEQ -The conversion stopped because of an invalid byte sequence in the input. -After the call @code{*@var{inbuf}} points at the first byte of the -invalid byte sequence. - -@item E2BIG -The conversion stopped because it ran out of space in the output buffer. - -@item EINVAL -The conversion stopped because of an incomplete byte sequence at the end -of the input buffer. - -@item EBADF -The @var{cd} argument is invalid. -@end table - -@pindex iconv.h -This function was introduced in the XPG2 standard and is declared in the -@file{iconv.h} header. -@end deftypefun - -The definition of the @code{iconv} function is quite good overall. It -provides quite flexible functionality. The only problems lie in the -boundary cases, which are incomplete byte sequences at the end of the -input buffer and invalid input. A third problem, which is not really -a design problem, is the way conversions are selected. The standard -does not say anything about the legitimate names or about a minimal set of -available conversions. We will see below how this negatively impacts other -implementations. - - -@node iconv Examples -@subsection A complete @code{iconv} example - -The example below features a solution for a common problem. Given that -one knows the internal encoding used by the system for @code{wchar_t} -strings one often is in the position to read text from a file and store -it in wide character buffers. One can do this using @code{mbsrtowcs} -but then we run into the problems discussed above.
- -@smallexample -int -file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail) -@{ - char inbuf[BUFSIZ]; - size_t insize = 0; - char *wrptr = (char *) outbuf; - int result = 0; - iconv_t cd; - - cd = iconv_open ("WCHAR_T", charset); - if (cd == (iconv_t) -1) - @{ - /* @r{Something went wrong.} */ - if (errno == EINVAL) - error (0, 0, "conversion from '%s' to wchar_t not available", - charset); - else - perror ("iconv_open"); - - /* @r{Terminate the output string.} */ - *outbuf = L'\0'; - - return -1; - @} - - while (avail > 0) - @{ - size_t nread; - size_t nconv; - char *inptr = inbuf; - - /* @r{Read more input.} */ - nread = read (fd, inbuf + insize, sizeof (inbuf) - insize); - if (nread == 0) - @{ - /* @r{When we come here the file is completely read.} - @r{This still could mean there are some unused} - @r{characters in the @code{inbuf}. Put them back.} */ - if (lseek (fd, -insize, SEEK_CUR) == -1) - result = -1; - - /* @r{Now write out the byte sequence to get into the} - @r{initial state if this is necessary.} */ - iconv (cd, NULL, NULL, &wrptr, &avail); - - break; - @} - insize += nread; - - /* @r{Do the conversion.} */ - nconv = iconv (cd, &inptr, &insize, &wrptr, &avail); - if (nconv == (size_t) -1) - @{ - /* @r{Not everything went right. It might only be} - @r{an unfinished byte sequence at the end of the} - @r{buffer. Or it is a real problem.} */ - if (errno == EINVAL) - /* @r{This is harmless. Simply move the unused} - @r{bytes to the beginning of the buffer so that} - @r{they can be used in the next round.} */ - memmove (inbuf, inptr, insize); - else - @{ - /* @r{It is a real problem. Maybe we ran out of} - @r{space in the output buffer or we have invalid} - @r{input. 
In any case move the file pointer back to} - @r{the position of the last processed byte.} */ - lseek (fd, -insize, SEEK_CUR); - result = -1; - break; - @} - @} - @} - - /* @r{Terminate the output string.} */ - if (avail >= sizeof (wchar_t)) - *((wchar_t *) wrptr) = L'\0'; - - if (iconv_close (cd) != 0) - perror ("iconv_close"); - - return (wchar_t *) wrptr - outbuf; -@} -@end smallexample - -@cindex stateful -This example shows the most important aspects of using the @code{iconv} -functions. It shows how successive calls to @code{iconv} can be used to -convert large amounts of text. The user does not have to care about -stateful encodings as the functions take care of everything. - -An interesting point is the case where @code{iconv} returns an error and -@code{errno} is set to @code{EINVAL}. This is not really an error in -the transformation. It can happen whenever the input character set -contains byte sequences of more than one byte for some character and -texts are not processed in one piece. In this case there is a chance -that a multibyte sequence is cut. The caller can then simply read the -remainder of the text and feed the offending bytes together with new -characters from the input to @code{iconv} and continue the work. The -internal state kept in the descriptor is @emph{not} unspecified after -such an event as it is the case with the conversion functions from the -@w{ISO C} standard. - -The example also shows the problem of using wide character strings with -@code{iconv}. As explained in the description of the @code{iconv} -function above the function always takes a pointer to a @code{char} -array and the available space is measured in bytes. In the example the -output buffer is a wide character buffer. Therefore we use a local -variable @var{wrptr} of type @code{char *} which is used in the -@code{iconv} calls. - -This looks rather innocent but can lead to problems on platforms which -have tight restrictions on alignment.
Therefore the caller of -@code{iconv} has to make sure that the pointers passed are suitable for -access of characters from the appropriate character set. Since in the -above case the input parameter to the function is a @code{wchar_t} -pointer this is the case (unless the user violates alignment when -computing the parameter). But in other situations, especially when -writing generic functions where one does not know what type of character -set one uses and therefore treats text as a sequence of bytes, it might -become tricky. - - -@node Other iconv Implementations -@subsection Some Details about other @code{iconv} Implementations - -This is not really the place to discuss the @code{iconv} implementation -of other systems but it is necessary to know a bit about them to write -portable programs. The above mentioned problems with the specification -of the @code{iconv} functions can lead to portability issues. - -The first thing to notice is that due to the large number of character -sets in use it is certainly not practical to encode the conversions -directly in the C library. Therefore the conversion information must -come from files outside the C library. This is usually done in one or -both of the following ways: - -@itemize @bullet -@item -The C library contains a set of generic conversion functions which can -read the needed conversion tables and other information from data files. -These files get loaded when necessary. - -This solution is problematic as it requires a great deal of effort to -apply to all character sets (potentially an infinite set). The -differences in the structure of the different character sets are so large -that many different variants of the table processing functions must be -developed. On top of this the generic nature of these functions makes -them slower than specifically implemented functions.
- -@item -The C library only contains a framework which can dynamically load -object files and execute the conversion functions contained therein. - -This solution provides much more flexibility. The C library itself -contains only very little code and therefore reduces the general memory -footprint. Also, with a documented interface between the C library and -the loadable modules it is possible for third parties to extend the set -of available conversion modules. A drawback of this solution is that -dynamic loading must be available. -@end itemize - -Some implementations in commercial Unices implement a mixture of -these possibilities, the majority only the second solution. Using -loadable modules moves the code out of the library itself and keeps the -door open for extensions and improvements. But this design is also -limiting on some platforms since not many platforms support dynamic -loading in statically linked programs. On platforms without this -capability it is therefore not possible to use this interface in -statically linked programs. The GNU C library has on ELF platforms no -problems with dynamic loading in these situations and therefore this -point is moot. The danger is that one gets used to this and -forgets about the restrictions on other systems. - -A second thing to know about other @code{iconv} implementations is that -the number of available conversions is often very limited. Some -implementations provide in the standard release (not special -international or developer releases) at most 100 to 200 conversion -possibilities. This does not mean 200 different character sets are -supported. E.g., conversions from one character set to a set of, say, -10 others count as 10 conversions. Together with the other direction -this already makes 20. One can imagine the thin coverage these platforms -provide. Some Unix vendors even provide only a handful of conversions -which renders them useless for almost all uses.
- -This directly leads to a third and probably the most problematic point. -Given the way the @code{iconv} conversion functions are implemented on all -known Unix systems, the availability of the conversion functions from -character set @math{@cal{A}} to @math{@cal{B}} and the conversion from -@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the -conversion from @math{@cal{A}} to @math{@cal{C}} is available. - -This might not seem unreasonable or problematic at first but it is -quite a big problem as one will notice shortly after hitting it. To show -the problem, assume we are writing a program which has to convert from -@math{@cal{A}} to @math{@cal{C}}. A call like - -@smallexample -cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}"); -@end smallexample - -@noindent -does fail according to the assumption above. But what does the program -do now? The conversion is really necessary and therefore simply giving -up is not an option. - -This is a nuisance. The @code{iconv} function should take care of this. -But how should the program proceed from here on? If it tried to -convert to character set @math{@cal{B}} first the two @code{iconv_open} -calls - -@smallexample -cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}"); -@end smallexample - -@noindent -and - -@smallexample -cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}"); -@end smallexample - -@noindent -will succeed but how to find @math{@cal{B}}? - -Unfortunately, the answer is: there is no general solution. On some -systems guessing might help. On those systems most character sets can -convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. -Besides this only some very system-specific methods can help.
Since the -conversion functions come from loadable modules and these modules must -be stored somewhere in the filesystem, one @emph{could} try to find them -and determine from the available files which conversions are available -and whether there is an indirect route from @math{@cal{A}} to -@math{@cal{C}}. - -This shows one of the design errors of @code{iconv} mentioned above. It -should at least be possible to determine the list of available -conversions programmatically so that if @code{iconv_open} says there is -no such conversion, one could make sure this also is true for indirect -routes. - - -@node glibc iconv Implementation -@subsection The @code{iconv} Implementation in the GNU C library - -After reading about the problems of @code{iconv} implementations in the -last section it is certainly good to note that the implementation in -the GNU C library has none of the problems mentioned above. What -follows is a step-by-step analysis of the points raised above. The -evaluation is based on the current state of the development (as of -January 1999). The development of the @code{iconv} functions is not -complete, but basic functionality has solidified. - -The GNU C library's @code{iconv} implementation uses shared loadable -modules to implement the conversions. A very small number of -conversions are built into the library itself but these are only rather -trivial conversions. - -All the benefits of loadable modules are available in the GNU C library -implementation. This is especially appealing since the interface is -well documented (see below) and it therefore is easy to write new -conversion modules. The drawback of using loadable objects is not a -problem in the GNU C library, at least on ELF systems. Since the -library is able to load shared objects even in statically linked -binaries this means that static linking need not be forbidden in -case one wants to use @code{iconv}. - -The second mentioned problem is the number of supported conversions.
-Currently, the GNU C library supports more than 150 character sets. The -way the implementation is designed the number of supported conversions -is greater than 22350 (@math{150} times @math{149}). If any conversion -from or to a character set is missing it can easily be added. - -Particularly impressive as it may be, this high number is due to the -fact that the GNU C library implementation of @code{iconv} does not have -the third problem mentioned above. I.e., whenever there is a conversion -from a character set @math{@cal{A}} to @math{@cal{B}} and from -@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from -@math{@cal{A}} to @math{@cal{C}} directly. If @code{iconv_open} -returns an error and sets @code{errno} to @code{EINVAL} this really -means there is no known way, directly or indirectly, to perform the -wanted conversion. - -@cindex triangulation -This is achieved by providing for each character set a conversion from -and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an -intermediate representation it is possible to @dfn{triangulate}, i.e., -to convert by way of an intermediate representation. - -There is no inherent requirement to provide a conversion to @w{ISO -10646} for a new character set and it is also possible to provide other -conversions where neither source nor destination character set is @w{ISO -10646}. The currently existing set of conversions is simply meant to -cover all conversions which might be of interest. - -@cindex ISO-2022-JP -@cindex EUC-JP -All currently available conversions use the triangulation method above, -making conversion run unnecessarily slowly. If, e.g., somebody often -needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution -would involve direct conversion between the two character sets, skipping -the detour through @w{ISO 10646}. The two character sets of interest -are much more similar to each other than to @w{ISO 10646}.
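To make the idea of triangulation concrete, the following sketch performs the two legs by hand with two separate descriptors. This only illustrates the principle and is not how the GNU C library implements it internally, where a single descriptor covers the whole path. The helper `convert_via` and its fixed-size intermediate buffer are assumptions made for brevity.

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string IN from the
   character set FROM to TO by way of the intermediate character set
   VIA.  Returns the number of bytes stored in OUT, or (size_t) -1 on
   failure.  The whole text must fit into the intermediate buffer.  */
size_t
convert_via (const char *to, const char *via, const char *from,
             const char *in, char *out, size_t outsize)
{
  char tmp[1024];
  char *inptr = (char *) in;
  char *tmpptr = tmp;
  char *outptr = out;
  size_t inleft = strlen (in);
  size_t tmpleft = sizeof (tmp);
  size_t outleft = outsize;
  size_t result = (size_t) -1;
  iconv_t cd1, cd2;

  cd1 = iconv_open (via, from);
  if (cd1 == (iconv_t) -1)
    return result;
  cd2 = iconv_open (to, via);
  if (cd2 == (iconv_t) -1)
    {
      iconv_close (cd1);
      return result;
    }

  /* First leg: FROM -> VIA, then emit any pending shift sequence.  */
  if (iconv (cd1, &inptr, &inleft, &tmpptr, &tmpleft) != (size_t) -1
      && iconv (cd1, NULL, NULL, &tmpptr, &tmpleft) != (size_t) -1)
    {
      /* Second leg: VIA -> TO, reading what the first leg produced.  */
      char *p = tmp;
      size_t left = sizeof (tmp) - tmpleft;

      if (iconv (cd2, &p, &left, &outptr, &outleft) != (size_t) -1
          && iconv (cd2, NULL, NULL, &outptr, &outleft) != (size_t) -1)
        result = outsize - outleft;
    }

  iconv_close (cd1);
  iconv_close (cd2);
  return result;
}
```

A call such as `convert_via ("UTF-8", "WCHAR_T", "ASCII", text, buf, sizeof buf)` routes the text through the wide character representation. Every byte is converted twice, which is the overhead a direct conversion avoids.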
- -When triangulation is too slow for a frequently needed conversion, one can easily write a direct conversion and provide it -as a better alternative. The GNU C library @code{iconv} implementation -would automatically use the module implementing the conversion if it is -specified to be more efficient. - -@subsubsection Format of @file{gconv-modules} files - -All information about the available conversions comes from a file named -@file{gconv-modules} which can be found in any of the directories along -the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented -text files, where each of the lines has one of the following formats: - -@itemize @bullet -@item -If the first non-whitespace character is a @kbd{#} the line contains -only comments and is ignored. - -@item -Lines starting with @code{alias} define an alias name for a character -set. There are two more words expected on the line. The first one -defines the alias name and the second defines the original name of the -character set. The effect is that it is possible to use the alias name -in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open} and -achieve the same result as when using the real character set name. - -This is quite important as a character set often has many different -names. There is normally an official name but this need not -correspond to the most popular name. Besides this many character sets -have special names which are constructed systematically. E.g., all character -sets specified by the ISO have an alias of the form -@code{ISO-IR-@var{nnn}} where @var{nnn} is the registration number. -This allows programs which know about the registration number to -construct character set names and use them in @code{iconv_open} calls. -More on the available names and aliases follows below. - -@item -Lines starting with @code{module} introduce an available conversion -module. These lines must contain three or four more words.
- -The first word specifies the source character set, the second word the -destination character set of the conversion implemented in this module. The -third word is the name of the loadable module. The filename is -constructed by appending the usual shared object suffix (normally -@file{.so}) and this file is then supposed to be found in the same -directory the @file{gconv-modules} file is in. The last word on the -line, which is optional, is a numeric value representing the cost of the -conversion. If this word is missing a cost of @math{1} is assumed. The -numeric value itself does not matter that much; what counts are the -relative values of the sums of costs for all possible conversion paths. -Below is a more precise description of the use of the cost value. -@end itemize - -Let us return to the example above where one has written a module to directly -convert from ISO-2022-JP to EUC-JP and back. All that has to be done is -to put the new module, say @file{ISO2022JP-EUCJP.so}, in a directory -and add a file @file{gconv-modules} with the following content in the -same directory: - -@smallexample -module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1 -module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1 -@end smallexample - -To see why this is sufficient, it is necessary to understand how the -conversion used by @code{iconv} (and described in the descriptor) is -selected. The approach to this problem is quite simple. - -At the first call of the @code{iconv_open} function the program reads -all available @file{gconv-modules} files and builds up two tables: one -containing all the known aliases and another which contains the -information about the conversions and which shared object implements -them. - -@subsubsection Finding the conversion path in @code{iconv} - -The set of available conversions forms a directed graph with weighted -edges. The weights on the edges are the costs specified in the -@file{gconv-modules} files.
The @code{iconv_open} function uses an -algorithm suitable for searching for the best path in such a graph and so -constructs a list of conversions which must be performed in succession -to get the transformation from the source to the destination character -set. - -Explaining why the above @file{gconv-modules} file allows the -@code{iconv} implementation to resolve the specific ISO-2022-JP to -EUC-JP conversion module instead of the conversion coming with the -library itself is straightforward. Since the latter conversion takes two -steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to -EUC-JP) the cost is @math{1+1 = 2}. But the above @file{gconv-modules} -file specifies that the new conversion module can perform this -conversion at a cost of only @math{1}. - -A mysterious piece about the @file{gconv-modules} file above (and also -the file coming with the GNU C library) are the names of the character -sets specified in the @code{module} lines. Why do almost all the names -end in @code{//}? And this is not all: the names can actually be -regular expressions. At this point in time this mystery should not be -revealed, unless you have the relevant spell-casting materials: ashes -from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix -blessed by St.@: Emacs, assorted herbal roots from Central America, sand -from Cebu, etc. Sorry! @strong{The part of the implementation where -this is used is not yet finished. For now please simply follow the -existing examples. It'll become clearer once it is. --drepper} - -A last remark about the @file{gconv-modules} file concerns the names not -ending with @code{//}. There often is a character set named -@code{INTERNAL} mentioned. From the discussion above and the chosen -name it should have become clear that this is the name for the -representation used in the intermediate step of the triangulation. We -have said that this is UCS-4 but actually that is not quite right.
The -UCS-4 specification also includes the specification of the byte ordering -used. Since a UCS-4 value consists of four bytes a stored value is -affected by byte ordering. The internal representation is @emph{not} -the same as UCS-4 in case the byte ordering of the processor (or at least -the running process) is not the same as the one required for UCS-4. This -is done for performance reasons as one does not want to perform -unnecessary byte-swapping operations if one is not interested in actually -seeing the result in UCS-4. To avoid trouble with endianness the internal -representation consistently is named @code{INTERNAL} even on big-endian -systems where the representations are identical. - -@subsubsection @code{iconv} module data structures - -So far this section described how modules are located and considered to -be used. What remains to be described is the interface of the modules -so that one can write new ones. This section describes the interface as -it is in use in January 1999. The interface will change a bit in the future -but hopefully only in an upward compatible way. - -The definitions necessary to write new modules are publicly available -in the non-standard header @file{gconv.h}. The following text will -therefore describe the definitions from this header file. But first it -is necessary to get an overview. - -From the perspective of the user of @code{iconv} the interface is quite -simple: the @code{iconv_open} function returns a handle which can be -used in calls to @code{iconv} and finally the handle is freed with a call -to @code{iconv_close}. The problem is: the handle has to be able to -represent the possibly long sequences of conversion steps and also the -state of each conversion since the handle is all that is passed to the -@code{iconv} function. Therefore the data structures are really the -key to understanding the implementation. - -We need two different kinds of data structures.
The first describes the -conversion and the second describes the state etc. There are really two -type definitions like this in @file{gconv.h}. -@pindex gconv.h - -@comment gconv.h -@comment GNU -@deftp {Data type} {struct __gconv_step} -This data structure describes one conversion a module can perform. For -each function in a loaded module with conversion functions there is -exactly one object of this type. This object is shared by all users of -the conversion. I.e., this object does not contain any information -corresponding to an actual conversion. It only describes the conversion -itself. - -@table @code -@item struct __gconv_loaded_object *__shlib_handle -@itemx const char *__modname -@itemx int __counter -All these elements of the structure are used internally in the C library -to coordinate loading and unloading the shared. One must not expect any -of the other elements be available or initialized. - -@item const char *__from_name -@itemx const char *__to_name -@code{__from_name} and @code{__to_name} contain the names of the source and -destination character sets. They can be used to identify the actual -conversion to be carried out since one module might implement -conversions for more than one character set and/or direction. - -@item gconv_fct __fct -@itemx gconv_init_fct __init_fct -@itemx gconv_end_fct __end_fct -These elements contain pointers to the functions in the loadable module. -The interface will be explained below. - -@item int __min_needed_from -@itemx int __max_needed_from -@itemx int __min_needed_to -@itemx int __max_needed_to; -These values have to be filled in the init function of the module. The -@code{__min_needed_from} value specifies how many bytes a character of -the source character set at least needs. The @code{__max_needed_from} -specifies the maximum value which also includes possible shift -sequences. 
- -The @code{__min_needed_to} and @code{__max_needed_to} values serve the -same purpose but this time for the destination character set. - -It is crucial that these values are accurate since otherwise the -conversion functions will have problems or not work at all. - -@item int __stateful -This element must also be initialized by the init function. It is -nonzero if the source character set is stateful. Otherwise it is zero. - -@item void *__data -This element can be used freely by the conversion functions in the -module. It can be used to communicate extra information from one call -to another. It need not be initialized if not needed at all. If this -element gets assigned a pointer to dynamically allocated memory -(presumably in the init function) it has to be made sure that the end -function deallocates the memory. Otherwise the application will leak -memory. - -It is important to be aware that this data structure is shared by all -users of this specification conversion and therefore the @code{__data} -element must not contain data specific to one specific use of the -conversion function. -@end table -@end deftp - -@comment gconv.h -@comment GNU -@deftp {Data type} {struct __gconv_step_data} -This is the data structure which contains the information specific to -each use of the conversion functions. - -@table @code -@item char *__outbuf -@itemx char *__outbufend -These elements specify the output buffer for the conversion step. The -@code{__outbuf} element points to the beginning of the buffer and -@code{__outbufend} points to the byte following the last byte in the -buffer. The conversion function must not assume anything about the size -of the buffer but it can be safely assumed the there is room for at -least one complete character in the output buffer. 
- -Once the conversion is finished and the conversion is the last step the -@code{__outbuf} element must be modified to point after last last byte -written into the buffer to signal how much output is available. If this -conversion step is not the last one the element must not be modified. -The @code{__outbufend} element must not be modified. - -@item int __is_last -This element is nonzero if this conversion step is the last one. This -information is necessary for the recursion. See the description of the -conversion function internals below. This element must never be -modified. - -@item int __invocation_counter -The conversion function can use this element to see how many calls of -the conversion function already happened. Some character sets require -when generating output a certain prolog and by comparing this value with -zero one can find out whether it is the first call and therefore the -prolog should be emitted or not. This element must never be modified. - -@item int __internal_use -This element is another one rarely used but needed in certain -situations. It got assigned a nonzero value in case the conversion -functions are used to implement @code{mbsrtowcs} et.al. I.e., the -function is not used directly through the @code{iconv} interface. - -This sometimes makes a difference as it is expected that the -@code{iconv} functions are used to translate entire texts while the -@code{mbsrtowcs} functions are normally only used to convert single -strings and might be used multiple times to convert entire texts. - -But in this situation we would have problem complying with some rules of -the character set specification. Some character sets require a prolog -which must appear exactly once for an entire text. If a number of -@code{mbsrtowcs} calls are used to convert the text only the first call -must add the prolog. But since there is no communication between the -different calls of @code{mbsrtowcs} the conversion functions have no -possibility to find this out. 
The situation is different for sequences -of @code{iconv} calls since the handle allows access to the needed -information. - -This element is mostly used together with @code{__invocation_counter} in -a way like this: - -@smallexample -if (!data->__internal_use - && data->__invocation_counter == 0) - /* @r{Emit prolog.} */ - ... -@end smallexample - -This element must never be modified. - -@item mbstate_t *__statep -The @code{__statep} element points to an object of type @code{mbstate_t} -(@pxref{Keeping the state}). The conversion of an stateful character -set must use the object pointed to by this element to store information -about the conversion state. The @code{__statep} element itself must -never be modified. - -@item mbstate_t __state -This element @emph{never} must be used directly. It is only part of -this structure to have the needed space allocated. -@end table -@end deftp - -@subsubsection @code{iconv} module interfaces - -With the knowledge about the data structures we now can describe the -conversion functions itself. To understand the interface a bit of -knowledge about the functionality in the C library which loads the -objects with the conversions is necessary. - -It is often the case that one conversion is used more than once. I.e., -there are several @code{iconv_open} calls for the same set of character -sets during one program run. The @code{mbsrtowcs} et.al.@: functions in -the GNU C library also use the @code{iconv} functionality which -increases the number of uses of the same functions even more. - -For this reason the modules do not get loaded exclusively for one -conversion. Instead a module once loaded can be used by arbitrarily many -@code{iconv} or @code{mbsrtowcs} calls at the same time. The splitting -of the information between conversion function specific information and -conversion data makes this possible. The last section showed the two -data structures used to do this. 
- -This is of course also reflected in the interface and semantics of the -functions the modules must provide. There are three functions which -must have the following names: - -@table @code -@item gconv_init -The @code{gconv_init} function initializes the conversion function -specific data structure. This very same object is shared by all -conversion which use this conversion and therefore no state information -about the conversion itself must be stored in here. If a module -implements more than one conversion the @code{gconv_init} function will be -called multiple times. - -@item gconv_end -The @code{gconv_end} function is responsible to free all resources -allocated by the @code{gconv_init} function. If there is nothing to do -this function can be missing. Special care must be taken if the module -implements more than one conversion and the @code{gconv_init} function -does not allocate the same resources for all conversions. - -@item gconv -This is the actual conversion function. It is called to convert one -block of text. It gets passed the conversion step information -initialized by @code{gconv_init} and the conversion data, specific to -this use of the conversion functions. -@end table - -There are three data types defined for the three module interface -function and these define the interface. - -@comment gconv.h -@comment GNU -@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *) -This specifies the interface of the initialization function of the -module. It is called exactly once for each conversion the module -implements. - -As explained int the description of the @code{struct __gconv_step} data -structure above the initialization function has to initialize parts of -it. 
- -@table @code -@item __min_needed_from -@itemx __max_needed_from -@itemx __min_needed_to -@itemx __max_needed_to -These elements must be initialized to the exact numbers of the minimum -and maximum number of bytes used by one character in the source and -destination character set respectively. If the characters all have the -same size the minimum and maximum values are the same. - -@item __stateful -This element must be initialized to an nonzero value if the source -character set is stateful. Otherwise it must be zero. -@end table - -If the initialization function needs to communication some information -to the conversion function this can happen using the @code{__data} -element of the @code{__gconv_step} structure. But since this data is -shared by all the conversion is must not be modified by the conversion -function. How this can be used is shown in the example below. - -@smallexample -#define MIN_NEEDED_FROM 1 -#define MAX_NEEDED_FROM 4 -#define MIN_NEEDED_TO 4 -#define MAX_NEEDED_TO 4 - -int -gconv_init (struct __gconv_step *step) -@{ - /* @r{Determine which direction.} */ - struct iso2022jp_data *new_data; - enum direction dir = illegal_dir; - enum variant var = illegal_var; - int result; - - if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0) - @{ - dir = from_iso2022jp; - var = iso2022jp; - @} - else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0) - @{ - dir = to_iso2022jp; - var = iso2022jp; - @} - else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0) - @{ - dir = from_iso2022jp; - var = iso2022jp2; - @} - else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0) - @{ - dir = to_iso2022jp; - var = iso2022jp2; - @} - - result = __GCONV_NOCONV; - if (dir != illegal_dir) - @{ - new_data = (struct iso2022jp_data *) - malloc (sizeof (struct iso2022jp_data)); - - result = __GCONV_NOMEM; - if (new_data != NULL) - @{ - new_data->dir = dir; - new_data->var = var; - step->__data = new_data; - - if (dir == from_iso2022jp) - @{ 
- step->__min_needed_from = MIN_NEEDED_FROM; - step->__max_needed_from = MAX_NEEDED_FROM; - step->__min_needed_to = MIN_NEEDED_TO; - step->__max_needed_to = MAX_NEEDED_TO; - @} - else - @{ - step->__min_needed_from = MIN_NEEDED_TO; - step->__max_needed_from = MAX_NEEDED_TO; - step->__min_needed_to = MIN_NEEDED_FROM; - step->__max_needed_to = MAX_NEEDED_FROM + 2; - @} - - /* @r{Yes, this is a stateful encoding.} */ - step->__stateful = 1; - - result = __GCONV_OK; - @} - @} - - return result; -@} -@end smallexample - -The function first checks which conversion is wanted. The module from -which this function is taken implements four different conversion and -which one is selected can be determined by comparing the names. The -comparison should always be done without paying attention to the case. - -Then a data structure is allocated which contains the necessary -information about which conversion is selected. The data structure -@code{struct iso2022jp_data} is locally defined since outside the module -this data is not used at all. Please note that if all four conversions -this modules supports are requested there are four data blocks. - -One interesting thing is the initialization of the @code{__min_} and -@code{__max_} elements of the step data object. A single ISO-2022-JP -character can consist of one to four bytes. Therefore the -@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined -this way. The output is always the @code{INTERNAL} character set (aka -UCS-4) and therefore each character consists of exactly four bytes. For -the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into -account that escape sequences might be necessary to switch the character -sets. Therefore the @code{__max_needed_to} element for this direction -gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the -two bytes needed for the escape sequences to single the switching. 
The -asymmetry in the maximum values for the two directions can be explained -easily: when reading ISO-2022-JP text escape sequences can be handled -alone. I.e., it is not necessary to process a real character since the -effect of the escape sequence can be recorded in the state information. -The situation is different for the other direction. Since it is in -general not known which character comes next one cannot emit escape -sequences to change the state in advance. This means the escape -sequences which have to be emitted together with the next character. -Therefore one needs more room then only for the character itself. - -The possible return values of the initialization function are: - -@table @code -@item __GCONV_OK -The initialization succeeded -@item __GCONV_NOCONV -The requested conversion is not supported in the module. This can -happen if the @file{gconv-modules} file has errors. -@item __GCONV_NOMEM -Memory required to store additional information could not be allocated. -@end table -@end deftypevr - -The functions called before the module is unloaded is significantly -easier. It often has nothing at all to do in which case it can be left -out completely. - -@comment gconv.h -@comment GNU -@deftypevr {Data type} void {(*__gconv_end_fct)} (struct gconv_step *) -The task of this function is it to free all resources allocated in the -initialization function. Therefore only the @code{__data} element of -the object pointed to by the argument is of interest. Continuing the -example from the initialization function, the finalization function -looks like this: - -@smallexample -void -gconv_end (struct __gconv_step *data) -@{ - free (data->__data); -@} -@end smallexample -@end deftypevr - -The most important function is the conversion function itself. It can -get quite complicated for complex character sets. But since this is not -of interest here we will only describe a possible skeleton for the -conversion function. 
- -@comment gconv.h -@comment GNU -@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int) -The conversion function can be called for two basic reason: to convert -text or to reset the state. From the description of the @code{iconv} -function it can be seen why the flushing mode is necessary. What mode -is selected is determined by the sixth argument, an integer. If it is -nonzero it means that flushing is selected. - -Common to both mode is where the output buffer can be found. The -information about this buffer is stored in the conversion step data. A -pointer to this is passed as the second argument to this function. The -description of the @code{struct __gconv_step_data} structure has more -information on this. - -@cindex stateful -What has to be done for flushing depends on the source character set. -If it is not stateful nothing has to be done. Otherwise the function -has to emit a byte sequence to bring the state object in the initial -state. Once this all happened the other conversion modules in the chain -of conversions have to get the same chance. Whether another step -follows can be determined from the @code{__is_last} element of the step -data structure to which the first parameter points. - -The more interesting mode is when actually text has to be converted. -The first step in this case is to convert as much text as possible from -the input buffer and store the result in the output buffer. The start -of the input buffer is determined by the third argument which is a -pointer to a pointer variable referencing the beginning of the buffer. -The fourth argument is a pointer to the byte right after the last byte -in the buffer. - -The conversion has to be performed according to the current state if the -character set is stateful. The state is stored in an object pointed to -by the @code{__statep} element of the step data (second argument). 
Once -either the input buffer is empty or the output buffer is full the -conversion stops. At this point the pointer variable referenced by the -third parameter must point to the byte following the last processed -byte. I.e., if all of the input is consumed this pointer and the fourth -parameter have the same value. - -What now happens depends on whether this step is the last one or not. -If it is the last step the only thing which has to be done is to update -the @code{__outbuf} element of the step data structure to point after the -last written byte. This gives the caller the information on how much -text is available in the output buffer. Beside this the variable -pointed to by the fifth parameter, which is of type @code{size_t}, must -be incremented by the number of characters (@emph{not bytes}) which were -converted in a non-reversible way. Then the function can return. - -In case the step is not the last one the later conversion functions have -to get a chance to do their work. Therefore the appropriate conversion -function has to be called. The information about the functions is -stored in the conversion data structures, passed as the first parameter. -This information and the step data are stored in arrays so the next -element in both cases can be found by simple pointer arithmetic: - -@smallexample -int -gconv (struct __gconv_step *step, struct __gconv_step_data *data, - const char **inbuf, const char *inbufend, size_t *written, - int do_flush) -@{ - struct __gconv_step *next_step = step + 1; - struct __gconv_step_data *next_data = data + 1; - ... -@end smallexample - -The @code{next_step} pointer references the next step information and -@code{next_data} the next data record. The call of the next function -therefore will look similar to this: - -@smallexample - next_step->__fct (next_step, next_data, &outerr, outbuf, - written, 0) -@end smallexample - -But this is not yet all. 
Once the function call returns the conversion -function might have some more to do. If the return value of the -function is @code{__GCONV_EMPTY_INPUT} this means there is more room in -the output buffer. Unless the input buffer is empty the conversion -functions start all over again and processes the rest of the input -buffer. If the return value is not @code{__GCONV_EMPTY_INPUT} something -went wrong and we have to recover from this. - -A requirement for the conversion function is that the input buffer -pointer (the third argument) always points to the last character which -was put in the converted form in the output buffer. This is trivially -true after the conversion performed in the current step. But if the -conversion functions deeper down the stream stop prematurely not all -characters from the output buffer are consumed and therefore the input -buffer pointers must be backed of to the right position. - -This is easy to do if the input and output character sets have a fixed -width for all characters. In this situation we can compute how many -characters are left in the output buffer and therefore can correct the -input buffer pointer appropriate with a similar computation. Things are -getting tricky if either character set has character represented with -variable length byte sequences and it gets even more complicated if the -conversion has to take care of the state. In these cases the conversion -has to be performed once again, from the known state before the initial -conversion. I.e., if necessary the state of the conversion has to be -reset and the conversion loop has to be executed again. The difference -now is that it is known how much input must be created and the -conversion can stop before converting the first unused character. Once -this is done the input buffer pointers must be updated again and the -function can return. - -One final thing should be mentioned. 
If it is necessary for the -conversion to know whether it is the first invocation (in case a prolog -has to be emitted) the conversion function should just before returning -to the caller increment the @code{__invocation_counter} element of the -step data structure. See the description of the @code{struct -__gconv_step_data} structure above for more information on how this can -be used. - -The return value must be one of the following values: - -@table @code -@item __GCONV_EMPTY_INPUT -All input was consumed and there is room left in the output buffer. -@item __GCONV_FULL_OUTPUT -No more room in the output buffer. In case this is not the last step -this value is propagated down from the call of the next conversion -function in the chain. -@item __GCONV_INCOMPLETE_INPUT -The input buffer is not entirely empty since it contains an incomplete -character sequence. -@end table - -The following example provides a framework for a conversion function. -In case a new conversion has to be written the holes in this -implementation have to be filled and that is it. - -@smallexample -int -gconv (struct __gconv_step *step, struct __gconv_step_data *data, - const char **inbuf, const char *inbufend, size_t *written, - int do_flush) -@{ - struct __gconv_step *next_step = step + 1; - struct __gconv_step_data *next_data = data + 1; - gconv_fct fct = next_step->__fct; - int status; - - /* @r{If the function is called with no input this means we have} - @r{to reset to the initial state. The possibly partly} - @r{converted input is dropped.} */ - if (do_flush) - @{ - status = __GCONV_OK; - - /* @r{Possible emit a byte sequence which put the state object} - @r{into the initial state.} */ - - /* @r{Call the steps down the chain if there are any but only} - @r{if we successfully emitted the escape sequence.} */ - if (status == __GCONV_OK && ! 
data->__is_last) - status = fct (next_step, next_data, NULL, NULL, - written, 1); - @} - else - @{ - /* @r{We preserve the initial values of the pointer variables.} */ - const char *inptr = *inbuf; - char *outbuf = data->__outbuf; - char *outend = data->__outbufend; - char *outptr; - - do - @{ - /* @r{Remember the start value for this round.} */ - inptr = *inbuf; - /* @r{The outbuf buffer is empty.} */ - outptr = outbuf; - - /* @r{For stateful encodings the state must be safe here.} */ - - /* @r{Run the conversion loop. @code{status} is set} - @r{appropriately afterwards.} */ - - /* @r{If this is the last step leave the loop, there is} - @r{nothing we can do.} */ - if (data->__is_last) - @{ - /* @r{Store information about how many bytes are} - @r{available.} */ - data->__outbuf = outbuf; - - /* @r{If any non-reversible conversions were performed,} - @r{add the number to @code{*written}.} */ - - break; - @} - - /* @r{Write out all output which was produced.} */ - if (outbuf > outptr) - @{ - const char *outerr = data->__outbuf; - int result; - - result = fct (next_step, next_data, &outerr, - outbuf, written, 0); - - if (result != __GCONV_EMPTY_INPUT) - @{ - if (outerr != outbuf) - @{ - /* @r{Reset the input buffer pointer. We} - @r{document here the complex case.} */ - size_t nstatus; - - /* @r{Reload the pointers.} */ - *inbuf = inptr; - outbuf = outptr; - - /* @r{Possibly reset the state.} */ - - /* @r{Redo the conversion, but this time} - @r{the end of the output buffer is at} - @r{@code{outerr}.} */ - @} - - /* @r{Change the status.} */ - status = result; - @} - else - /* @r{All the output is consumed, we can make} - @r{ another run if everything was ok.} */ - if (status == __GCONV_FULL_OUTPUT) - status = __GCONV_OK; - @} - @} - while (status == __GCONV_OK); - - /* @r{We finished one use of this step.} */ - ++data->__invocation_counter; - @} - - return status; -@} -@end smallexample -@end deftypevr - -This information should be sufficient to write new modules. 
Anybody -doing so should also take a look at the available source code in the GNU -C library sources. It contains many examples of working and optimized -modules. +@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computing had only six, seven,
+or eight bits for each character: there was never a case where more than
+eight bits (one byte) were used to represent a single character. The
+limitations of this approach became more apparent as more people
+grappled with non-Roman character sets, where not all the characters
+that make up a language's character set can be represented by @math{2^8}
+choices. This chapter shows the functionality that was added to the C
+library to support multiple character sets.
+
+@menu
+* Extended Char Intro:: Introduction to Extended Characters.
+* Charset Function Overview:: Overview about Character Handling
+ Functions.
+* Restartable multibyte conversion:: Restartable multibyte conversion
+ Functions.
+* Non-reentrant Conversion:: Non-reentrant Conversion Function.
+* Generic Charset Conversion:: Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+A variety of solutions is available to overcome the differences between
+character sets with a 1:1 relation between bytes and characters and
+character sets with ratios of 2:1 or 4:1. The remainder of this
+section gives a few examples to help understand the design decisions
+made while developing the functionality of the @w{C library}.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation. @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through some communication channel. Examples of external
+representations include files waiting in a directory to be
+read and parsed.
+
+Traditionally there has been no difference between the two representations.
+It was equally comfortable and useful to use the same single-byte
+representation internally and externally. This comfort level decreases
+with more and larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling text that is externally encoded using different character
+sets. Assume a program that reads two texts and compares them using
+some metric. The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are certainly
+no longer enough. So the smallest entity will have to grow: @dfn{wide
+characters} will now be used. Instead of one byte per character, two or
+four will be used. (Three are not good to address in memory and
+more than four bytes seem not to be necessary.)
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual,
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+a completely new family of functions has been created that can handle wide
+character texts in memory. The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}
+(also known as UCS for Universal Character Set). Unicode was originally
+planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
+be a 31-bit code space. The two standards are practically identical.
+They have the same character repertoire and code table, but Unicode specifies
+added semantics. At the moment, only characters in the first @code{0x10000}
+code positions (the so-called Basic Multilingual Plane, BMP) have been
+assigned, but the assignment of more specialized characters outside this
+16-bit space is already in progress. A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
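The UTF-16 extension mechanism described above can be sketched in a few lines of C. This is an illustration only, not code from the C library; the function name @code{encode_utf16} is hypothetical. It assumes the standard surrogate ranges (@code{0xd800}-@code{0xdbff} for the first word, @code{0xdc00}-@code{0xdfff} for the second).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of 16-bit words stored in OUT (1 or 2),
   or 0 if CP cannot be represented in UTF-16.  */
size_t
encode_utf16 (uint32_t cp, uint16_t out[2])
{
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* Reserved for the surrogates themselves.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;   /* BMP character: same word as in UCS-2.  */
      return 1;
    }
  if (cp > 0x10ffff)
    return 0;                   /* Beyond the reach of UTF-16.  */

  cp -= 0x10000;                /* 20 significant bits remain.  */
  out[0] = (uint16_t) (0xd800 + (cp >> 10));    /* Upper 10 bits.  */
  out[1] = (uint16_t) (0xdc00 + (cp & 0x3ff));  /* Lower 10 bits.  */
  return 2;
}
```

A BMP character yields one word, while a character such as @code{0x1f600} yields the pair @code{0xd83d}, @code{0xde00}. This is why a 16-bit @code{wchar_t} ends up holding a multi-word encoding.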
+
+To represent wide characters the @code{char} type is not suitable. For
+this reason the @w{ISO C} standard introduces a new type that is
+designed to keep one character of a wide character string. To maintain
+the similarity there is also a type corresponding to @code{int} for
+those functions that take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+I.e., arrays of objects of this type are the equivalent of @code{char[]}
+for multibyte character strings. The type is defined in @file{stddef.h}.
+
+The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
+say anything specific about the representation. It only requires that
+this type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as @code{char},
+which might make sense for embedded systems.
+
+But for GNU systems @code{wchar_t} is always 32 bits wide and is therefore
+capable of representing all UCS-4 values, thus covering all of
+@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
+and thereby follow Unicode very strictly. This definition is perfectly
+fine with the standard, but it also means that to represent all
+characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
+characters, which is in fact a multi-wide-character encoding. But
+resorting to multi-wide-character encoding contradicts the purpose of the
+@code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables that
+contain a single wide character. As the name suggests this type is the
+equivalent of @code{int} when using the normal @code{char} strings. The
+types @code{wchar_t} and @code{wint_t} often have the same
+representation if they are 32 bits wide, but if @code{wchar_t} is
+defined as @code{char}, the type @code{wint_t} must be defined as
+@code{int} due to parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in
+@w{Amendment 1} to @w{ISO C90}.
+@end deftp
+
+As there are for the @code{char} data type, macros are available for
+specifying the minimum and maximum value representable in an object of
+type @code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
+@end deftypevr
+
+Another special wide character value is the equivalent to @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
+
+@smallexample
+@{
+ int c;
+ ...
+ while ((c = getc (fp)) >= 0)
+ ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to use @code{WEOF} explicitly when wide characters
+are used:
+
+@smallexample
+@{
+ wint_t c;
+ ...
+ while ((c = getwc (fp)) != WEOF)
+ ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to storage
+and transmittal. Because each single wide character consists of more
+than one byte, they are affected by byte-ordering. Thus, machines with
+different endiannesses would see different values when accessing the same
+data. This byte ordering concern also applies to communication protocols
+that are all byte-based and therefore require that the sender has to
+decide about splitting the wide character in bytes. A last (but not least
+important) point is that wide characters often require more storage space
+than a customized byte-oriented character set.
+@cindex multibyte character
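+One way to sidestep the byte-ordering problem is to fix the order
+explicitly whenever a wide character leaves the program. The following
+is only a minimal sketch and not part of the C library: the names
+@code{put_ucs4_be} and @code{get_ucs4_be} are hypothetical, and a 32-bit
+wide character value and big-endian order on the wire are assumed.

```c
#include <stdint.h>

/* Hypothetical helper: store the 32-bit value WC into OUT as four
   bytes, most significant byte first (big-endian).  */
void
put_ucs4_be (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) ((wc >> 16) & 0xff);
  out[2] = (unsigned char) ((wc >> 8) & 0xff);
  out[3] = (unsigned char) (wc & 0xff);
}

/* Hypothetical helper: the inverse of put_ucs4_be.  It reads four
   bytes in big-endian order, whatever the byte order of the host.  */
uint32_t
get_ucs4_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

+Because only shifts and masks are used, both functions give the same
+result on little-endian and big-endian machines.
+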
+@cindex EBCDIC
+ For all the above reasons, an external encoding that is different
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
+The external encoding is byte-based and can be chosen appropriately for
+the environment and for the texts to be handled. A variety of different
+character sets can be used for this external encoding (information that
+will not be exhaustively presented here--instead, a description of the
+major groups will suffice). All of the ASCII-based character sets
+fulfill one requirement: they are ``filesystem safe.'' This means
+that the character @code{'/'} is used in the encoding @emph{only} to
+represent itself. Things are a bit different for character sets like
+EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
+family used by IBM), but if the operating system does not understand
+EBCDIC directly the parameters to system calls have to be converted first
+anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are single-byte character sets. There can
+be only up to 256 characters (for @w{8 bit} character sets), which is
+not sufficient to cover all languages but might be sufficient to handle
+a specific text. Handling of @w{8 bit} character sets is simple. This
+is not true for other kinds presented later, and therefore, the
+application one uses might require the use of @w{8 bit} character sets.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte. This is achieved by associating a state with the text.
+Characters that can be used to change the state can be embedded in the
+text. Each byte in the text might have a different interpretation in each
+state. The state might even influence whether a given byte stands for a
+character on its own or whether it has to be combined with some more
+bytes.
+
+@cindex EUC
+@cindex Shift_JIS
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes which cover more than the next character. This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly. Examples of
+character sets using this policy are the various EUC character sets
+(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or Shift_JIS (SJIS, a Japanese encoding).
+
+But there are also character sets using a state which is valid for more
+than one character and has to be changed by another byte sequence.
+Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using the
+Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
+representing characters like the acute accent do not produce output
+themselves: one has to combine them with other characters to get the
+desired result. For example, one has to write the byte sequence
+@code{0xc2 0x61} (non-spacing acute accent, followed by lower-case `a')
+to get the ``small a with acute'' character. To get the acute accent
+character on its own, one has to write @code{0xc2 0x20} (the non-spacing
+acute followed by a space).
+
+Character sets like @w{ISO 6937} are used in some embedded systems such
+as teletex.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
+it is often also sufficient to simply use an encoding different than
+UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8. This encoding is able to represent all of @w{ISO
+10646} 31 bits in a byte string of length one to six.
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
+but UTF-8 is today the only encoding which should be used. In fact, with
+any luck UTF-8 will soon be the only external encoding that has to be
+supported. It proves to be universally usable and its only disadvantage
+is that it favors Roman languages by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary if using a specific character set for these scripts.
+Methods like the Unicode compression scheme can alleviate these
+problems.
+@end itemize
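+
+How the UTF-8 byte string of length one to six comes about can be
+sketched in a few lines. This is an illustration only: the function
+name @code{ucs4_to_utf8} is hypothetical, the code follows the 31-bit
+form mentioned above (current versions of the standards restrict UTF-8
+to at most four bytes), and it makes no attempt to reject surrogates or
+other values that are not valid characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the UCS-4 value WC as UTF-8 into OUT,
   which must have room for six bytes.  Returns the number of bytes
   written, or 0 if WC does not fit into 31 bits.  */
size_t
ucs4_to_utf8 (uint32_t wc, unsigned char *out)
{
  size_t len, i;
  unsigned char lead;

  if (wc < 0x80)
    {
      /* ASCII characters stand for themselves.  */
      out[0] = (unsigned char) wc;
      return 1;
    }
  else if (wc < 0x800)
    len = 2, lead = 0xc0;
  else if (wc < 0x10000)
    len = 3, lead = 0xe0;
  else if (wc < 0x200000)
    len = 4, lead = 0xf0;
  else if (wc < 0x4000000)
    len = 5, lead = 0xf8;
  else if (wc < 0x80000000u)
    len = 6, lead = 0xfc;
  else
    return 0;

  /* Fill the continuation bytes from the end; each carries six bits
     and has the form 10xxxxxx.  */
  for (i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (wc & 0x3f));
      wc >>= 6;
    }
  /* The remaining high bits go into the lead byte.  */
  out[0] = (unsigned char) (lead | wc);
  return len;
}
```

+Note that no byte of a multibyte sequence lies in the ASCII range, which
+is what makes the encoding filesystem safe in the sense described above.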
+
+The question remaining is: how to select the character set or encoding
+to use. The answer: you cannot decide about it yourself, it is decided
+by the developers of the system or the majority of the users. Since the
+goal is interoperability one has to use whatever the other people one
+works with use. If there are no constraints, the selection is based on
+the requirements the expected circle of users will have. In other words,
+if a project is expected to be used in only, say, Russia it is fine to use
+KOI8-R or a similar character set. But if at the same time people from,
+say, Greece are participating one should use a character set which allows
+all people to collaborate.
+
+The most widely useful solution seems to be: go with the most general
+character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
+and problems about users not being able to use their own language
+adequately are a thing of the past.
+
+One final comment about the choice of the wide character representation
+is necessary at this point. We have said above that the natural choice
+is using Unicode or @w{ISO 10646}. This is not required, but at least
+encouraged, by the @w{ISO C} standard. The standard defines at least a
+macro @code{__STDC_ISO_10646__} that is only defined on systems where
+the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
+symbol is not defined one should avoid making assumptions about the wide
+character representation. If the programmer uses only the functions
+provided by the C library to handle wide character strings there should
+be no compatibility problems with other systems.
+
+@node Charset Function Overview
+@section Overview about Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two
+families to handle character set conversion. One of the function families
+(the most commonly used) is specified in the @w{ISO C90} standard and,
+therefore, is portable even beyond the Unix world. Unfortunately this
+family is the least useful one. These functions should be avoided
+whenever possible, especially when developing libraries (as opposed to
+applications).
+
+The second family of functions got introduced in the early Unix standards
+(XPG2) and is still part of the latest and greatest Unix standard:
+@w{Unix 98}. It is also the most powerful and useful set of functions.
+But we will start with the functions defined in @w{Amendment 1} to
+@w{ISO C90}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings. There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions. Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument. I.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place.
+The GNU C library contains some extensions to the standard that allow
+specifying a size, but basically they also expect terminated strings.
+@end itemize
+
+Despite these limitations the @w{ISO C} functions can be used in many
+contexts. In graphical user interfaces, for instance, it is not
+uncommon to have functions that require text to be displayed in a wide
+character string if the text is not simple ASCII. The text itself might come
+from a file with translations and the user should decide about the
+current locale which determines the translation and therefore also the
+external encoding used. In such a situation (and many others) the
+functions described here are perfect. If more freedom while performing
+the conversion is necessary take a look at the @code{iconv} functions
+(@pxref{Generic Charset Conversion}).
+
+@menu
+* Selecting the Conversion:: Selecting the conversion and its properties.
+* Keeping the state:: Representing the state of the conversion.
+* Converting a Character:: Converting Single Characters.
+* Converting Strings:: Converting Multibyte and Wide Character
+ Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides about the conversion which is performed
+by the functions we are about to describe. Each locale uses its own
+character set (given as an argument to @code{localedef}) and this is the
+one assumed as the external multibyte encoding. The wide character set
+always is UCS-4, at least on GNU systems.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes that can be necessary to represent one character. This
+information is quite important when writing code that uses the
+conversion functions (as shown in the examples below).
+The @w{ISO C} standard defines two macros which provide this information.
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales. It is
+a compile-time constant and is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}. Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
+the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strictly @w{ISO C90} compilers
+do not allow variable length array definitions, but still it is desirable
+to avoid dynamic allocation. This incomplete piece of code shows the
+problem:
+
+@smallexample
+@{
+ char buf[MB_LEN_MAX];
+ ssize_t len = 0;
+
+ while (! feof (fp))
+ @{
+ fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+ /* @r{... process} buf */
+ len -= used;
+ @}
+@}
+@end smallexample
+
+The code in the inner loop is expected to have always enough bytes in
+the array @var{buf} to convert one multibyte character. The array
+@var{buf} has to be sized statically since many compilers do not allow a
+variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
+bytes are always available in @var{buf}. Note that it isn't
+a problem if @code{MB_CUR_MAX} is not a compile-time constant.
+
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding. That is, the encoded values depend
+in some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+@code{mbstate_t} is defined in @file{wchar.h}. It was introduced in
+@w{Amendment 1} to @w{ISO C90}.
+@end deftp
+
+To use objects of type @code{mbstate_t} the programmer has to define such
+objects (normally as local variables on the stack) and pass a pointer to
+the object to the conversion functions. This way the conversion function
+can update the object if the current multibyte character set is stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state. The rules are that the object should always
+represent the initial state before the first use, and this is achieved by
+clearing the whole variable with code such as follows:
+
+@smallexample
+@{
+ mbstate_t state;
+ memset (&state, '\0', sizeof (state));
+ /* @r{from now on @var{state} can be used.} */
+ ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state. This is necessary, for example, to decide whether to emit
+escape sequences to set the state to the initial state at certain
+sequence points. Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+The @code{mbsinit} function determines whether the state object pointed
+to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
+the object is in the initial state the return value is nonzero. Otherwise
+it is zero.
+
+@pindex wchar.h
+@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Code using @code{mbsinit} often looks similar to this:
+
+@c Fix the example to explicitly say how to generate the escape sequence
+@c to restore the initial state.
+@smallexample
+@{
+ mbstate_t state;
+ memset (&state, '\0', sizeof (state));
+ /* @r{Use @var{state}.} */
+ ...
+ if (! mbsinit (&state))
+ @{
+ /* @r{Emit code to return to initial state.} */
+ const wchar_t empty[] = L"";
+ const wchar_t *srcp = empty;
+ wcsrtombs (outbuf, &srcp, outbuflen, &state);
+ @}
+ ...
+@}
+@end smallexample
+
+The code to emit the escape sequence to get back to the initial state is
+interesting. The @code{wcsrtombs} function can be used to determine the
+necessary output code (@pxref{Converting Strings}). Please note that on
+GNU systems it is not necessary to perform this extra action for the
+conversion from multibyte text to wide character text since the wide
+character encoding is not stateful. But there is nothing mentioned in
+any standard which prohibits making @code{wchar_t} using a stateful
+encoding.
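+
+The two operations on state objects described in this section, clearing
+and testing, can be wrapped up as shown below. This is a sketch only;
+the names @code{reset_state} and @code{at_initial_state} are hypothetical
+and not part of the C library.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: put *PS into the initial conversion state by
   clearing the whole object.  */
void
reset_state (mbstate_t *ps)
{
  memset (ps, '\0', sizeof (*ps));
}

/* Hypothetical helper: return 1 if *PS describes the initial
   conversion state and 0 otherwise.  */
int
at_initial_state (const mbstate_t *ps)
{
  return mbsinit (ps) != 0;
}
```

+A freshly cleared object always tests as initial, and in a locale with a
+stateless encoding it stays that way after every completely converted
+character.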
+
+@node Converting a Character
+@subsection Converting Single Characters
+
+The most fundamental of the conversion functions are those dealing with
+single characters. Please note that this does not always mean single
+bytes. But since there is very often a subset of the multibyte
+character set which consists of single byte sequences there are
+functions to help with converting bytes. Frequently, ASCII is a subpart
+of the multibyte character set. In such a scenario, each ASCII character
+stands for itself, and all other characters have at least a first byte
+that is beyond the range @math{0} to @math{127}.
+
+@comment wchar.h
+@comment ISO
+@deftypefun wint_t btowc (int @var{c})
+The @code{btowc} function (``byte to wide character'') converts a valid
+single byte character @var{c} in the initial shift state into the wide
+character equivalent using the conversion rules from the currently
+selected locale of the @code{LC_CTYPE} category.
+
+If @code{(unsigned char) @var{c}} is no valid single byte multibyte
+character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
+
+Please note the restriction of @var{c} being tested for validity only in
+the initial shift state. No @code{mbstate_t} object is used from
+which the state information is taken, and the function also does not use
+any static state.
+
+@pindex wchar.h
+The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
+and is declared in @file{wchar.h}.
+@end deftypefun
+
+Despite the limitation that the single byte value always is interpreted
+in the initial state this function is actually useful most of the time.
+Most character sets are either entirely single-byte character sets or
+they are extensions to ASCII. This makes it possible to write code like
+this (not that this specific example is very useful):
+
+@smallexample
+wchar_t *
+itow (unsigned long int val)
+@{
+ static wchar_t buf[30];
+ wchar_t *wcp = &buf[29];
+ *wcp = L'\0';
+ while (val != 0)
+ @{
+ *--wcp = btowc ('0' + val % 10);
+ val /= 10;
+ @}
+ if (wcp == &buf[29])
+ *--wcp = L'0';
+ return wcp;
+@}
+@end smallexample
+
+Why is it necessary to use such a complicated implementation and not
+simply cast @code{'0' + val % 10} to a wide character? The answer is
+that there is no guarantee that one can perform this kind of arithmetic
+on the characters of the character set used for @code{wchar_t}
+representation. In other situations the bytes are not constant at
+compile time and so the compiler cannot do the work. In situations like
+this it is necessary to use @code{btowc}.
+
+@noindent
+There also is a function for the conversion in the other direction.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int wctob (wint_t @var{c})
+The @code{wctob} function (``wide character to byte'') takes as the
+parameter a valid wide character. If the multibyte representation for
+this character in the initial state is exactly one byte long the return
+value of this function is this character. Otherwise the return value is
+@code{EOF}.
+
+@pindex wchar.h
+@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
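+
+Taken together, @code{btowc} and @code{wctob} allow a simple test of
+whether a byte is a complete single-byte character in the current
+locale. The helper below is a hypothetical sketch, not a library
+function:

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: return nonzero if the byte C is a complete
   single-byte character in the initial shift state of the current
   locale and converts back to the same byte.  */
int
byte_round_trips (unsigned char c)
{
  wint_t wc = btowc (c);

  if (wc == WEOF)
    /* C is not a valid single-byte character on its own.  */
    return 0;
  return wctob (wc) == (int) c;
}
```

+For bytes that start a longer multibyte sequence in the current locale
+@code{btowc} returns @code{WEOF}, so the helper reports zero for them.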
+
+There are more general functions to convert single characters from
+multibyte representation to wide characters and vice versa. These
+functions pose no limit on the length of the multibyte representation
+and they also do not require it to be in the initial state.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
+@cindex stateful
+The @code{mbrtowc} function (``multibyte restartable to wide
+character'') converts the next multibyte character in the string pointed
+to by @var{s} into a wide character and stores it in the wide character
+string pointed to by @var{pwc}. The conversion is performed according
+to the locale currently selected for the @code{LC_CTYPE} category. If
+the conversion for the character set used in the locale requires a state,
+the multibyte string is interpreted in the state represented by the
+object pointed to by @var{ps}. If @var{ps} is a null pointer, a static,
+internal state variable used only by the @code{mbrtowc} function is
+used.
+
+If the next multibyte character corresponds to the NUL wide character,
+the return value of the function is @math{0} and the state object is
+afterwards in the initial state. If the next @var{n} or fewer bytes
+form a correct multibyte character, the return value is the number of
+bytes starting from @var{s} that form the multibyte character. The
+conversion state is updated according to the bytes consumed in the
+conversion. In both cases the wide character (either the @code{L'\0'}
+or the one found in the conversion) is stored in the string pointed to
+by @var{pwc} if @var{pwc} is not null.
+
+If the first @var{n} bytes of the multibyte string possibly form a valid
+multibyte character but there are more than @var{n} bytes needed to
+complete it, the return value of the function is @code{(size_t) -2} and
+no value is stored. Please note that this can happen even if @var{n}
+has a value greater than or equal to @code{MB_CUR_MAX} since the input
+might contain redundant shift sequences.
+
+If the first @var{n} bytes of the multibyte string cannot possibly form
+a valid multibyte character, no value is stored, the global variable
+@code{errno} is set to the value @code{EILSEQ}, and the function returns
+@code{(size_t) -1}. The conversion state is afterwards undefined.
+
+@pindex wchar.h
+@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Use of @code{mbrtowc} is straightforward. A function which copies a
+multibyte string into a wide character string while at the same time
+converting all lowercase characters into uppercase could look like this
+(this is not the final version, just an example; it has no error
+checking, and sometimes leaks memory):
+
+@smallexample
+wchar_t *
+mbstouwcs (const char *s)
+@{
+  /* @r{Include the terminating NUL byte in the conversion.}  */
+  size_t len = strlen (s) + 1;
+  wchar_t *result = malloc (len * sizeof (wchar_t));
+  wchar_t *wcp = result;
+  wchar_t tmp[1];
+  mbstate_t state;
+  size_t nbytes;
+
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* @r{Invalid input string.}  */
+        return NULL;
+      *wcp++ = towupper (tmp[0]);
+      len -= nbytes;
+      s += nbytes;
+    @}
+  /* @r{Terminate the result string.}  */
+  *wcp = L'\0';
+  return result;
+@}
+@end smallexample
+
+The use of @code{mbrtowc} should be clear. A single wide character is
+stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
+in the variable @var{nbytes}. If the conversion is successful, the
+uppercase variant of the wide character is stored in the @var{result}
+array and the pointer to the input string and the number of available
+bytes are adjusted.
+
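+The example above deliberately omits error handling. A variant that
+checks the allocation, releases the buffer on failure, and terminates the
+result might look as follows. This is a sketch only, and the name
+@code{mbstouwcs_checked} is hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical variant of mbstouwcs with error checking.  Returns a
   newly allocated wide character string the caller must free, or
   NULL on allocation failure or invalid input.  */
wchar_t *
mbstouwcs_checked (const char *s)
{
  /* Include the terminating NUL byte so that mbrtowc sees it.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp;
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (&tmp, s, len, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or truncated input: do not leak the buffer.  */
          free (result);
          return NULL;
        }
      *wcp++ = (wchar_t) towupper ((wint_t) tmp);
      len -= nbytes;
      s += nbytes;
    }
  /* mbrtowc returned 0: the terminating NUL was reached.  */
  *wcp = L'\0';
  return result;
}
```

+The caller owns the returned string and releases it with @code{free}.
+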
+The only non-obvious thing about @code{mbrtowc} might be the way memory
+is allocated for the result. The above code uses the fact that there
+can never be more wide characters in the converted results than there are
+bytes in the multibyte input string. This method yields a pessimistic
+guess about the size of the result, and if many wide character strings
+have to be constructed this way or if the strings are long, the extra
+memory required to be allocated because the input string contains
+multibyte characters might be significant. The allocated memory block can
+be resized to the correct size before returning it, but a better solution
+might be to allocate just the right amount of space for the result right
+away. Unfortunately there is no function to compute the length of the wide
+character string directly from the multibyte string. There is, however, a
+function which does part of the work.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
+The @code{mbrlen} function (``multibyte restartable length'') computes
+the number of at most @var{n} bytes starting at @var{s} which form the
+next valid and complete multibyte character.
+
+If the next multibyte character corresponds to the NUL wide character,
+the return value is @math{0}. If the next @var{n} bytes form a valid
+multibyte character, the number of bytes belonging to this multibyte
+character byte sequence is returned.
+
+If the first @var{n} bytes possibly form a valid multibyte
+character but the character is incomplete, the return value is
+@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
+and the return value is @code{(size_t) -1}.
+
+The multibyte sequence is interpreted in the state represented by the
+object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
+object local to @code{mbrlen} is used.
+
+@pindex wchar.h
+@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+The attentive reader now will note that @code{mbrlen} can be implemented
+as
+
+@smallexample
+mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
+@end smallexample
+
+This is true and in fact is mentioned in the official specification.
+How can this function be used to determine the length of the wide
+character string created from a multibyte character string? It is not
+directly usable, but we can define a function @code{mbslen} using it:
+
+@smallexample
+size_t
+mbslen (const char *s)
+@{
+ mbstate_t state;
+ size_t result = 0;
+ size_t nbytes;
+ memset (&state, '\0', sizeof (state));
+ while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
+ @{
+ if (nbytes >= (size_t) -2)
+ /* @r{Something is wrong.} */
+ return (size_t) -1;
+ s += nbytes;
+ ++result;
+ @}
+ return result;
+@}
+@end smallexample
+
+This function simply calls @code{mbrlen} for each multibyte character
+in the string and counts the number of function calls. Please note that
+we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
+call. This is acceptable since a) this value is at least as large as the length of
+the longest multibyte character sequence and b) we know that the string
+@var{s} ends with a NUL byte, which cannot be part of any other multibyte
+character sequence but the one representing the NUL wide character.
+Therefore, the @code{mbrlen} function will never read invalid memory.
+
+Now that this function is available (just to make this clear, this
+function is @emph{not} part of the GNU C library) we can compute the
+number of bytes required to store the wide character string converted
+from the multibyte character string @var{s} using
+
+@smallexample
+wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
+@end smallexample
+
+Please note that the @code{mbslen} function is quite inefficient. The
+implementation of @code{mbstouwcs} with @code{mbslen} would have to
+perform the conversion of the multibyte character input string twice, and
+this conversion might be quite expensive. So it is necessary to think
+about the consequences of using the easier but imprecise method before
+doing the work twice.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
+The @code{wcrtomb} function (``wide character restartable to
+multibyte'') converts a single wide character into a multibyte string
+corresponding to that wide character.
+
+If @var{s} is a null pointer, the function resets the state stored in
+the objects pointed to by @var{ps} (or the internal @code{mbstate_t}
+object) to the initial state. This can also be achieved by a call like
+this:
+
+@smallexample
+wcrtomb (temp_buf, L'\0', ps)
+@end smallexample
+
+@noindent
+since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
+writes into an internal buffer, which is guaranteed to be large enough.
+
+If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
+necessary, a shift sequence to get the state @var{ps} into the initial
+state followed by a single NUL byte, which is stored in the string
+@var{s}.
+
+Otherwise a byte sequence (possibly including shift sequences) is written
+into the string @var{s}. This only happens if @var{wc} is a valid wide
+character (i.e., it has a multibyte representation in the character set
+selected by locale of the @code{LC_CTYPE} category). If @var{wc} is no
+valid wide character, nothing is stored in the string @var{s},
+@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
+is undefined and the return value is @code{(size_t) -1}.
+
+If no error occurred the function returns the number of bytes stored in
+the string @var{s}. This includes all bytes representing shift
+sequences.
+
+One word about the interface of the function: there is no parameter
+specifying the length of the array @var{s}. Instead the function
+assumes that there are at least @code{MB_CUR_MAX} bytes available since
+this is the maximum length of any byte sequence representing a single
+character. So the caller has to make sure that there is enough space
+available, otherwise buffer overruns can occur.
+
+@pindex wchar.h
+@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following
+example appends a wide character string to a multibyte character string.
+Again, the code is not really useful (or correct), it is simply here to
+demonstrate the use and some problems.
+
+@smallexample
+char *
+mbscatwcs (char *s, size_t len, const wchar_t *ws)
+@{
+ mbstate_t state;
+ /* @r{Find the end of the existing string.} */
+ char *wp = strchr (s, '\0');
+ len -= wp - s;
+ memset (&state, '\0', sizeof (state));
+ do
+ @{
+ size_t nbytes;
+ if (len < MB_CUR_MAX)
+ @{
+ /* @r{We cannot guarantee that the next}
+ @r{character fits into the buffer, so}
+ @r{return an error.} */
+ errno = E2BIG;
+ return NULL;
+ @}
+ nbytes = wcrtomb (wp, *ws, &state);
+ if (nbytes == (size_t) -1)
+ /* @r{Error in the conversion.} */
+ return NULL;
+ len -= nbytes;
+ wp += nbytes;
+ @}
+ while (*ws++ != L'\0');
+ return s;
+@}
+@end smallexample
+
+First the function has to find the end of the string currently in the
+array @var{s}. The @code{strchr} call does this very efficiently since a
+requirement for multibyte character representations is that the NUL byte
+is never used except to represent itself (and in this context, the end
+of the string).
+
+After initializing the state object the loop is entered where the first
+task is to make sure there is enough room in the array @var{s}. We
+abort if there are not at least @code{MB_CUR_MAX} bytes available. This
+is not always optimal but we have no other choice. We might have less
+than @code{MB_CUR_MAX} bytes available but the next multibyte character
+might also be only one byte long. At the time the @code{wcrtomb} call
+returns it is too late to decide whether the buffer was large enough. If
+this solution is unsuitable, there is a very slow but more accurate
+solution.
+
+@smallexample
+  ...
+  if (len < MB_CUR_MAX)
+    @{
+      mbstate_t temp_state;
+      char temp_buf[MB_LEN_MAX];
+      memcpy (&temp_state, &state, sizeof (state));
+      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
+        @{
+          /* @r{We cannot guarantee that the next}
+             @r{character fits into the buffer, so}
+             @r{return an error.}  */
+          errno = E2BIG;
+          return NULL;
+        @}
+    @}
+  ...
+@end smallexample
+
+Here we perform the conversion that might overflow the buffer so that
+we are afterwards in the position to make an exact decision about the
+buffer size. Please note that the destination of the new @code{wcrtomb}
+call is a temporary buffer and not a null pointer: as described above,
+with a null pointer @code{wcrtomb} ignores @var{wc} and acts as if the
+NUL wide character were being converted. The most unusual thing about
+this piece of code certainly is the duplication of the conversion state
+object, but if a change of the state is necessary to emit the next
+multibyte character, we want to have the same shift state change
+performed in the real conversion. Therefore, we have to preserve the
+initial shift state information.
+
+There are certainly many more and even better solutions to this problem.
+This example is only provided for educational purposes.
+
+@node Converting Strings
+@subsection Converting Multibyte and Wide Character Strings
+
+The functions described in the previous section only convert a single
+character at a time. Most operations to be performed in real-world
+programs include strings and therefore the @w{ISO C} standard also
+defines conversions on entire strings. However, the defined set of
+functions is quite limited; therefore, the GNU C library contains a few
+extensions which can help in some important situations.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsrtowcs} function (``multibyte string restartable to wide
+character string'') converts a NUL-terminated multibyte character
+string at @code{*@var{src}} into an equivalent wide character string,
+including the NUL wide character at the end. The conversion is started
+using the state information from the object pointed to by @var{ps} or
+from an internal object of @code{mbsrtowcs} if @var{ps} is a null
+pointer. Before returning, the state object is updated to match the state
+after the last converted character. The state is the initial state if the
+terminating NUL byte is reached and converted.
+
+If @var{dst} is not a null pointer, the result is stored in the array
+pointed to by @var{dst}; otherwise, the conversion result is not
+available since it is stored in an internal buffer.
+
+If @var{len} wide characters are stored in the array @var{dst} before
+reaching the end of the input string, the conversion stops and @var{len}
+is returned. If @var{dst} is a null pointer, @var{len} is never checked.
+
+Another reason for a premature return from the function call is if the
+input string contains an invalid multibyte sequence. In this case the
+global variable @code{errno} is set to @code{EILSEQ} and the function
+returns @code{(size_t) -1}.
+
+@c XXX The ISO C9x draft seems to have a problem here. It says that PS
+@c is not updated if DST is NULL. This is not said straightforward and
+@c none of the other functions is described like this. It would make sense
+@c to define the function this way but I don't think it is meant like this.
+
+In all other cases the function returns the number of wide characters
+converted during this call. If @var{dst} is not null, @code{mbsrtowcs}
+stores in the pointer pointed to by @var{src} either a null pointer (if
+the NUL byte in the input string was reached) or the address of the byte
+following the last converted multibyte character.
+
+@pindex wchar.h
+@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
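+
+As an illustration of the null @var{dst} case described above, the
+following sketch measures a string in a first pass and converts it in a
+second one. The helper name @code{mbs_to_wcs_alloc} is hypothetical and
+not part of any library; the sketch assumes the whole input is a single
+NUL-terminated string in the encoding of the current locale.
+

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL-terminated multibyte string MBS
   into a freshly allocated wide character string.  Returns NULL on an
   invalid multibyte sequence or if memory cannot be allocated.  */
wchar_t *
mbs_to_wcs_alloc (const char *mbs)
{
  mbstate_t state;
  const char *src = mbs;
  wchar_t *result;
  size_t n;

  /* First pass: with a null destination only the length is computed
     and the len argument is ignored.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, again from the initial state.  */
  memset (&state, '\0', sizeof (state));
  src = mbs;
  if (mbsrtowcs (result, &src, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

+
+The caller owns the returned array and releases it with @code{free}.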
+
+The definition of the @code{mbsrtowcs} function has one important
+limitation. The requirement that the input at @code{*@var{src}} has to
+be a NUL-terminated string causes problems if one wants to convert
+buffers with text. A buffer is normally not a collection of
+NUL-terminated strings but instead a continuous collection of lines,
+separated by newline characters. Now
+assume that a function to convert one line from a buffer is needed. Since
+the line is not NUL-terminated the source pointer cannot directly point
+into the unmodified text buffer. This means that either one inserts the
+NUL byte at the appropriate place for the duration of the
+@code{mbsrtowcs} function call (which is not doable for a read-only
+buffer or in a multi-threaded application) or one copies the line into
+an extra buffer where it can be terminated by a NUL byte. Note that it
+is not in general
+possible to limit the number of characters to convert by setting the
+parameter @var{len} to any specific value. Since it is not known how
+many bytes each multibyte character sequence is in length, one can only
+guess.
+
+@cindex stateful
+There is still a problem with the method of NUL-terminating a line right
+after the newline character which could lead to very strange results.
+As said in the description of the @code{mbsrtowcs} function above the
+conversion state is guaranteed to be in the initial shift state after
+processing the NUL byte at the end of the input string. But this NUL
+byte is not really part of the text. I.e., the conversion state after
+the newline in the original text could be something different than the
+initial shift state and therefore the first character of the next line
+is encoded using this state. But the state in question is never
+accessible to the user since the conversion stops after the NUL byte
+(which resets the state). Most stateful character sets in use today
+require that the shift state after a newline be the initial state--but
+this is not a strict guarantee. Simply NUL-terminating a piece of
+running text is therefore not always an adequate solution and should
+not be used in general-purpose code.
+
+The generic conversion interface (@pxref{Generic Charset Conversion})
+does not have this limitation (it simply works on buffers, not
+strings), and the GNU C library contains a set of functions which take
+additional parameters specifying the maximal number of bytes which are
+consumed from the input string. This way the problem described for
+@code{mbsrtowcs} above can be solved by determining the line
+length and passing this length to the function.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsrtombs} function (``wide character string restartable to
+multibyte string'') converts the NUL-terminated wide character string at
+@code{*@var{src}} into an equivalent multibyte character string and
+stores the result in the array pointed to by @var{dst}. The NUL wide
+character is also converted. The conversion starts in the state
+described in the object pointed to by @var{ps} or by a state object
+local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
+@var{dst} is a null pointer, the conversion is performed as usual but the
+result is not available. If all characters of the input string were
+successfully converted and if @var{dst} is not a null pointer, the
+pointer pointed to by @var{src} gets assigned a null pointer.
+
+If one of the wide characters in the input string has no valid multibyte
+character equivalent, the conversion stops early, sets the global
+variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
+
+Another reason for a premature stop is if @var{dst} is not a null
+pointer and storing the next converted character would bring the total
+number of bytes written to the array @var{dst} above @var{len}. In this
+case the pointer pointed to by @var{src} is assigned a value pointing to
+the wide character right after the last one successfully converted.
+
+Except in the case of an encoding error the return value of the
+@code{wcsrtombs} function is the number of bytes in all the multibyte
+character sequences stored in @var{dst}. Before returning the state in
+the object pointed to by @var{ps} (or the internal object in case
+@var{ps} is a null pointer) is updated to reflect the state after the
+last conversion. The state is the initial shift state in case the
+terminating NUL wide character was converted.
+
+@pindex wchar.h
+The @code{wcsrtombs} function was introduced in @w{Amendment 1} to
+@w{ISO C90} and is declared in @file{wchar.h}.
+@end deftypefun
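+
+The way @var{len} and @code{*@var{src}} work together can be sketched
+as follows: a long wide character string is written out through a small
+fixed buffer, and each call resumes where the previous one stopped. The
+name @code{put_wcs_chunked} and the buffer size are assumptions made for
+this illustration only.
+

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: write the wide string WS to FP as multibyte
   text using a small fixed buffer.  Returns the number of bytes
   written, or (size_t) -1 on a conversion or write error.  */
size_t
put_wcs_chunked (const wchar_t *ws, FILE *fp)
{
  /* Room for at least one multibyte character, so that every call to
     wcsrtombs can make progress.  */
  char buf[4 * MB_LEN_MAX];
  mbstate_t state;
  const wchar_t *src = ws;
  size_t total = 0;

  memset (&state, '\0', sizeof (state));

  /* wcsrtombs stores a null pointer in src once the terminating NUL
     wide character has been converted.  */
  while (src != NULL)
    {
      size_t n = wcsrtombs (buf, &src, sizeof (buf), &state);
      if (n == (size_t) -1)
        return (size_t) -1;
      if (fwrite (buf, 1, n, fp) != n)
        return (size_t) -1;
      total += n;
    }
  return total;
}
```

+
+The same state object is passed to every call, so a shift state that is
+active at the end of one chunk carries over to the next.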
+
+The restriction mentioned above for the @code{mbsrtowcs} function applies
+here also. There is no possibility of directly controlling the number of
+input characters. One has to place the NUL wide character at the correct
+place or control the consumed input indirectly via the available output
+array size (the @var{len} parameter).
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
+function. All the parameters are the same except for @var{nmc} which is
+new. The return value is the same as for @code{mbsrtowcs}.
+
+This new parameter specifies how many bytes at most can be used from the
+multibyte character string. In other words, the multibyte character
+string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte is
+found within the @var{nmc} first bytes of the string, the conversion
+stops here.
+
+This function is a GNU extension. It is meant to work around the
+problems mentioned above. Now it is possible to convert a buffer with
+multibyte character text piece for piece without having to care about
+inserting NUL bytes and the effect of NUL bytes on the conversion state.
+@end deftypefun
+
+A function to convert a multibyte string into a wide character string
+and display it could be written like this (this is not a really useful
+example):
+
+@smallexample
+void
+showmbs (const char *src, FILE *fp)
+@{
+  mbstate_t state;
+  int cnt = 0;
+  memset (&state, '\0', sizeof (state));
+  while (1)
+    @{
+      wchar_t linebuf[100];
+      const char *endp = strchr (src, '\n');
+      size_t n;
+
+      /* @r{Exit if there is no more line.} */
+      if (endp == NULL)
+        break;
+
+      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
+      /* @r{Stop at an invalid multibyte sequence.} */
+      if (n == (size_t) -1)
+        break;
+      linebuf[n] = L'\0';
+      fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf);
+
+      /* @r{Continue after the newline; the unconverted rest}
+         @r{of an overlong line is skipped.} */
+      src = endp + 1;
+    @}
+@}
+@end smallexample
+
+There is no problem with the state after a call to @code{mbsnrtowcs}.
+Since we don't insert characters in the strings which were not in there
+right from the beginning and we use @var{state} only for the conversion
+of the given buffer, there is no problem with altering the state.
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsnrtombs} function implements the conversion from wide
+character strings to multibyte character strings. It is similar to
+@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra
+parameter, which specifies the length of the input string.
+
+No more than @var{nwc} wide characters from the input string
+@code{*@var{src}} are converted. If the input string contains a NUL
+wide character in the first @var{nwc} characters, the conversion stops at
+this place.
+
+The @code{wcsnrtombs} function is a GNU extension and just like
+@code{mbsnrtowcs} helps in situations where no NUL-terminated input
+strings are available.
+@end deftypefun
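+
+A minimal sketch of converting part of a wide character buffer that is
+not NUL-terminated might look like this. The helper
+@code{wide_slice_to_mb} is hypothetical; defining @code{_GNU_SOURCE} is
+an assumption about how the declaration of @code{wcsnrtombs} is made
+visible.
+

```c
#define _GNU_SOURCE 1
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most NWC wide characters starting at
   WBUF, which need not be NUL-terminated, into the array DST of DSTLEN
   bytes.  The result is always NUL-terminated.  Returns the number of
   bytes stored, not counting the NUL byte, or (size_t) -1 on error.  */
size_t
wide_slice_to_mb (char *dst, size_t dstlen, const wchar_t *wbuf, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = wbuf;
  size_t n;

  if (dstlen == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Reserve one byte of DST for the terminating NUL byte.  */
  n = wcsnrtombs (dst, &src, nwc, dstlen - 1, &state);
  if (n != (size_t) -1)
    dst[n] = '\0';
  return n;
}
```

+
+If the output array is too small, the conversion stops early and the
+result is a truncated but still NUL-terminated string.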
+
+
+@node Multibyte Conversion Example
+@subsection A Complete Multibyte Conversion Example
+
+The example programs given in the previous sections are brief and omit
+most of the error checking. Presented here is a complete
+and documented example. It features the @code{mbrtowc} function but it
+should be easy to derive versions using the other functions.
+
+@smallexample
+int
+file_mbsrtowcs (int input, int output)
+@{
+  /* @r{Note the use of @code{MB_LEN_MAX}.}
+     @r{@code{MB_CUR_MAX} cannot portably be used here.} */
+  char buffer[BUFSIZ + MB_LEN_MAX];
+  mbstate_t state;
+  int filled = 0;
+  int eof = 0;
+
+  /* @r{Initialize the state.} */
+  memset (&state, '\0', sizeof (state));
+
+  while (!eof)
+    @{
+      ssize_t nread;
+      ssize_t nwrite;
+      char *inp = buffer;
+      wchar_t outbuf[BUFSIZ];
+      wchar_t *outp = outbuf;
+
+      /* @r{Fill up the buffer from the input file.} */
+      nread = read (input, buffer + filled, BUFSIZ);
+      if (nread < 0)
+        @{
+          perror ("read");
+          return 0;
+        @}
+      /* @r{If we reach end of file, make a note to read no more.} */
+      if (nread == 0)
+        eof = 1;
+
+      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
+      filled += nread;
+
+      /* @r{Convert those bytes to wide characters--as many as we can.} */
+      while (filled > 0)
+        @{
+          /* @r{Save the state so a failed attempt can be undone.} */
+          mbstate_t saved = state;
+          size_t thislen = mbrtowc (outp, inp, filled, &state);
+          /* @r{Stop converting at an invalid or incomplete character;}
+             @r{the latter means we have read just the first part}
+             @r{of a valid character.  Restore the state so that}
+             @r{the bytes can be converted again later.} */
+          if (thislen == (size_t) -1 || thislen == (size_t) -2)
+            @{
+              state = saved;
+              break;
+            @}
+          /* @r{We want to handle embedded NUL bytes}
+             @r{but the return value is 0.  Correct this.} */
+          if (thislen == 0)
+            thislen = 1;
+          /* @r{Advance past this character.} */
+          inp += thislen;
+          filled -= thislen;
+          ++outp;
+        @}
+
+      /* @r{Write the wide characters we just made.} */
+      nwrite = write (output, outbuf,
+                      (outp - outbuf) * sizeof (wchar_t));
+      if (nwrite < 0)
+        @{
+          perror ("write");
+          return 0;
+        @}
+
+      /* @r{See if we have a @emph{real} invalid character.} */
+      if ((eof && filled > 0) || filled >= MB_CUR_MAX)
+        @{
+          error (0, 0, "invalid multibyte character");
+          return 0;
+        @}
+
+      /* @r{If any characters must be carried forward,}
+         @r{put them at the beginning of @code{buffer}.} */
+      if (filled > 0)
+        memmove (buffer, inp, filled);
+    @}
+
+  return 1;
+@}
+@end smallexample
+
+
+@node Non-reentrant Conversion
+@section Non-reentrant Conversion Function
+
+The functions described in the previous section are defined in
+@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
+also contained functions for character set conversion. The reason that
+these original functions are not described first is that they are almost
+entirely useless.
+
+The problem is that all the conversion functions described in the
+original @w{ISO C90} use a local state. Using a local state implies that
+multiple conversions at the same time (not only when using threads)
+cannot be done, and that you cannot first convert single characters and
+then strings since you cannot tell the conversion functions which state
+to use.
+
+These original functions are therefore usable only in a very limited set
+of situations. One must complete converting the entire string before
+starting a new one, and each string/text must be converted with the same
+function (there is no problem with the library itself; it is guaranteed
+that no library function changes the state of any of these functions).
+@strong{For the above reasons it is strongly recommended that the
+functions described in the previous section be used in place of
+non-reentrant conversion functions.}
+
+@menu
+* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single
+ Characters.
+* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings.
+* Shift State:: States in Non-reentrant Functions.
+@end menu
+
+@node Non-reentrant Character Conversion
+@subsection Non-reentrant Conversion of Single Characters
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})
+The @code{mbtowc} (``multibyte to wide character'') function when called
+with non-null @var{string} converts the first multibyte character
+beginning at @var{string} to its corresponding wide character code. It
+stores the result in @code{*@var{result}}.
+
+@code{mbtowc} never examines more than @var{size} bytes. (The idea is
+to supply for @var{size} the number of bytes of data you have in hand.)
+
+@code{mbtowc} with non-null @var{string} distinguishes three
+possibilities: the first @var{size} bytes at @var{string} start with
+valid multibyte characters, they start with an invalid byte sequence or
+just part of a character, or @var{string} points to an empty string (a
+null character).
+
+For a valid multibyte character, @code{mbtowc} converts it to a wide
+character and stores that in @code{*@var{result}}, and returns the
+number of bytes in that character (always at least @math{1} and never
+more than @var{size}).
+
+For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an
+empty string, it returns @math{0}, also storing @code{L'\0'} in
+@code{*@var{result}}.
+
+If the multibyte character code uses shift characters, then
+@code{mbtowc} maintains and updates a shift state as it scans. If you
+call @code{mbtowc} with a null pointer for @var{string}, that
+initializes the shift state to its standard initial value. It also
+returns nonzero if the multibyte character code in use actually has a
+shift state. @xref{Shift State}.
+@end deftypefun
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
+The @code{wctomb} (``wide character to multibyte'') function converts
+the wide character code @var{wchar} to its corresponding multibyte
+character sequence, and stores the result in bytes starting at
+@var{string}. At most @code{MB_CUR_MAX} bytes are stored.
+
+@code{wctomb} with non-null @var{string} distinguishes three
+possibilities for @var{wchar}: a valid wide character code (one that can
+be translated to a multibyte character), an invalid code, and @code{L'\0'}.
+
+Given a valid code, @code{wctomb} converts it to a multibyte character,
+storing the bytes starting at @var{string}. Then it returns the number
+of bytes in that character (always at least @math{1} and never more
+than @code{MB_CUR_MAX}).
+
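+If @var{wchar} is an invalid wide character code, @code{wctomb} returns
+@math{-1}. If @var{wchar} is @code{L'\0'}, it stores @code{'\0'}
+(preceded by any shift sequence needed to return to the initial shift
+state) and returns the number of bytes stored, which is @math{1} when
+no shift sequence is needed.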
+
+If the multibyte character code uses shift characters, then
+@code{wctomb} maintains and updates a shift state as it scans. If you
+call @code{wctomb} with a null pointer for @var{string}, that
+initializes the shift state to its standard initial value. It also
+returns nonzero if the multibyte character code in use actually has a
+shift state. @xref{Shift State}.
+
+Calling this function with a @var{wchar} argument of zero when
+@var{string} is not null has the side-effect of reinitializing the
+stored shift state @emph{as well as} storing the multibyte character
+@code{'\0'} and returning the number of bytes stored.
+@end deftypefun
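+
+For illustration, a loop over @code{wctomb} could be sketched like
+this. The helper @code{wcs_to_mb_simple} is hypothetical. It assumes
+that nothing else uses @code{wctomb} while it runs, and it does not
+emit a final shift sequence, so it is only adequate for encodings
+without shift states.
+

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the wide string WS into the array DST
   of DSTLEN bytes, one character at a time.  Returns the number of
   bytes stored, not counting the terminating NUL byte, or -1 if a
   character cannot be converted or the array is too small.  */
int
wcs_to_mb_simple (char *dst, size_t dstlen, const wchar_t *ws)
{
  char tmp[MB_LEN_MAX];
  size_t used = 0;

  if (dstlen == 0)
    return -1;

  /* Reset the internal shift state before starting.  */
  wctomb (NULL, L'\0');

  for (; *ws != L'\0'; ++ws)
    {
      int n = wctomb (tmp, *ws);
      if (n < 0)
        return -1;
      /* Keep one byte free for the terminating NUL byte.  */
      if (used + (size_t) n >= dstlen)
        return -1;
      memcpy (dst + used, tmp, (size_t) n);
      used += (size_t) n;
    }
  dst[used] = '\0';
  return (int) used;
}
```

+
+Converting into the scratch array @code{tmp} first avoids writing past
+the end of @var{dst} when the last character does not fit.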
+
+Similar to @code{mbrlen} there is also a non-reentrant function which
+computes the length of a multibyte character. It can be defined in
+terms of @code{mbtowc}.
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int mblen (const char *@var{string}, size_t @var{size})
+The @code{mblen} function with a non-null @var{string} argument returns
+the number of bytes that make up the multibyte character beginning at
+@var{string}, never examining more than @var{size} bytes. (The idea is
+to supply for @var{size} the number of bytes of data you have in hand.)
+
+The return value of @code{mblen} distinguishes three possibilities: the
+first @var{size} bytes at @var{string} start with valid multibyte
+characters, they start with an invalid byte sequence or just part of a
+character, or @var{string} points to an empty string (a null character).
+
+For a valid multibyte character, @code{mblen} returns the number of
+bytes in that character (always at least @math{1} and never more than
+@var{size}). For an invalid byte sequence, @code{mblen} returns
+@math{-1}. For an empty string, it returns @math{0}.
+
+If the multibyte character code uses shift characters, then @code{mblen}
+maintains and updates a shift state as it scans. If you call
+@code{mblen} with a null pointer for @var{string}, that initializes the
+shift state to its standard initial value. It also returns a nonzero
+value if the multibyte character code in use actually has a shift state.
+@xref{Shift State}.
+
+@pindex stdlib.h
+The function @code{mblen} is declared in @file{stdlib.h}.
+@end deftypefun
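+
+The remark that this function can be defined in terms of
+@code{mbtowc} can be sketched as below. The name @code{my_mblen} is
+hypothetical, and unlike a real @code{mblen} this version shares its
+shift state with @code{mbtowc}.
+

```c
#include <stdlib.h>

/* Hypothetical stand-in for mblen, for illustration only.  A null
   result pointer tells mbtowc to examine the character without
   storing the wide character anywhere.  */
int
my_mblen (const char *string, size_t size)
{
  return mbtowc (NULL, string, size);
}
```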
+
+
+@node Non-reentrant String Conversion
+@subsection Non-reentrant Conversion of Strings
+
+For convenience the @w{ISO C90} standard also defines functions to
+convert entire strings instead of single characters. These functions
+suffer from the same problems as their reentrant counterparts from
+@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
+
+@comment stdlib.h
+@comment ISO
+@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
+The @code{mbstowcs} (``multibyte string to wide character string'')
+function converts the null-terminated string of multibyte characters
+@var{string} to an array of wide character codes, storing not more than
+@var{size} wide characters into the array beginning at @var{wstring}.
+The terminating null character counts towards the size, so if @var{size}
+is less than the actual number of wide characters resulting from
+@var{string}, no terminating null character is stored.
+
+The conversion of characters from @var{string} begins in the initial
+shift state.
+
+If an invalid multibyte character sequence is found, the @code{mbstowcs}
+function returns a value of @math{-1}. Otherwise, it returns the number
+of wide characters stored in the array @var{wstring}. This number does
+not include the terminating null character, which is present if the
+number is less than @var{size}.
+
+Here is an example showing how to convert a string of multibyte
+characters, allocating enough space for the result.
+
+@smallexample
+wchar_t *
+mbstowcs_alloc (const char *string)
+@{
+  size_t size = strlen (string) + 1;
+  wchar_t *buf = xmalloc (size * sizeof (wchar_t));
+
+  size = mbstowcs (buf, string, size);
+  if (size == (size_t) -1)
+    @{
+      /* @r{Do not leak the buffer on a conversion error.} */
+      free (buf);
+      return NULL;
+    @}
+  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
+  return buf;
+@}
+@end smallexample
+
+@end deftypefun
+
+@comment stdlib.h
+@comment ISO
+@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
+The @code{wcstombs} (``wide character string to multibyte string'')
+function converts the null-terminated wide character array @var{wstring}
+into a string containing multibyte characters, storing not more than
+@var{size} bytes starting at @var{string}, followed by a terminating
+null character if there is room. The conversion of characters begins in
+the initial shift state.
+
+The terminating null character counts towards the size, so if @var{size}
+is less than or equal to the number of bytes needed for @var{wstring}, no
+terminating null character is stored.
+
+If a code that does not correspond to a valid multibyte character is
+found, the @code{wcstombs} function returns a value of @math{-1}.
+Otherwise, the return value is the number of bytes stored in the array
+@var{string}. This number does not include the terminating null character,
+which is present if the number is less than @var{size}.
+@end deftypefun
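+
+A counterpart to the @code{mbstowcs_alloc} example for the opposite
+direction might be sketched as follows. The name
+@code{wcstombs_alloc} is hypothetical, and plain @code{malloc} is used
+instead of @code{xmalloc} so that the sketch is self-contained.
+

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert WSTRING into a freshly allocated
   multibyte string.  Returns NULL on a conversion error or if memory
   cannot be allocated.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  /* Worst case: every wide character, including the terminating one,
     needs MB_CUR_MAX bytes.  */
  size_t size = (wcslen (wstring) + 1) * MB_CUR_MAX;
  char *buf = malloc (size);
  char *shrunk;
  size_t n;

  if (buf == NULL)
    return NULL;

  n = wcstombs (buf, wstring, size);
  if (n == (size_t) -1)
    {
      free (buf);
      return NULL;
    }

  /* n is less than size here, so the terminating null character was
     stored.  Give back the memory that is not needed.  */
  shrunk = realloc (buf, n + 1);
  return shrunk != NULL ? shrunk : buf;
}
```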
+
+@node Shift State
+@subsection States in Non-reentrant Functions
+
+In some multibyte character codes, the @emph{meaning} of any particular
+byte sequence is not fixed; it depends on what other sequences have come
+earlier in the same string. Typically there are just a few sequences that
+can change the meaning of other sequences; these few are called
+@dfn{shift sequences} and we say that they set the @dfn{shift state} for
+other sequences that follow.
+
+To illustrate shift state and shift sequences, suppose we decide that
+the sequence @code{0200} (just one byte) enters Japanese mode, in which
+pairs of bytes in the range from @code{0240} to @code{0377} are single
+characters, while @code{0201} enters Latin-1 mode, in which single bytes
+in the range from @code{0240} to @code{0377} are characters, and
+interpreted according to the ISO Latin-1 character set. This is a
+multibyte code which has two alternative shift states (``Japanese mode''
+and ``Latin-1 mode''), and two shift sequences that specify particular
+shift states.
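+
+To make the made-up code above concrete, here is a small sketch of a
+decoder that only counts characters. It is not a library interface. It
+assumes that scanning starts in Latin-1 mode and that every byte other
+than the two shift bytes counts as a single character unless it starts
+a pair in Japanese mode; the text above does not specify either point.
+

```c
#include <stddef.h>

enum toy_mode { TOY_LATIN1, TOY_JAPANESE };

/* Count the characters in the N bytes at S under the made-up code.
   Returns (size_t) -1 if a two-byte character is cut short.  */
size_t
toy_count_chars (const unsigned char *s, size_t n)
{
  enum toy_mode mode = TOY_LATIN1;   /* Assumed initial shift state.  */
  size_t count = 0;
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = s[i];

      if (c == 0200)
        {
          /* Shift sequence: enter Japanese mode.  */
          mode = TOY_JAPANESE;
          i += 1;
        }
      else if (c == 0201)
        {
          /* Shift sequence: enter Latin-1 mode.  */
          mode = TOY_LATIN1;
          i += 1;
        }
      else if (mode == TOY_JAPANESE && c >= 0240)
        {
          /* A pair of bytes forms one character.  */
          if (i + 1 >= n || s[i + 1] < 0240)
            return (size_t) -1;
          i += 2;
          count += 1;
        }
      else
        {
          /* A single byte is one character.  */
          i += 1;
          count += 1;
        }
    }
  return count;
}
```

+
+The bytes @code{0241 0242} count as two characters in Latin-1 mode but
+as one character after the shift sequence @code{0200}, which is exactly
+why the scanner has to remember the shift state.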
+
+When the multibyte character code in use has shift states, then
+@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update
+the current shift state as they scan the string. To make this work
+properly, you must follow these rules:
+
+@itemize @bullet
+@item
+Before starting to scan a string, call the function with a null pointer
+for the multibyte character address---for example, @code{mblen (NULL,
+0)}. This initializes the shift state to its standard initial value.
+
+@item
+Scan the string one character at a time, in order. Do not ``back up''
+and rescan characters already scanned, and do not intersperse the
+processing of different strings.
+@end itemize
+
+Here is an example of using @code{mblen} following these rules:
+
+@smallexample
+void
+scan_string (char *s)
+@{
+  int length = strlen (s);
+
+  /* @r{Initialize shift state.} */
+  mblen (NULL, 0);
+
+  while (1)
+    @{
+      int thischar = mblen (s, length);
+      /* @r{Deal with end of string and invalid characters.} */
+      if (thischar == 0)
+        break;
+      if (thischar == -1)
+        @{
+          error (0, 0, "invalid multibyte character");
+          break;
+        @}
+      /* @r{Advance past this character.} */
+      s += thischar;
+      length -= thischar;
+    @}
+@}
+@end smallexample
+
+The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
+reentrant when using a multibyte code that uses a shift state. However,
+no other library functions call these functions, so you don't have to
+worry that the shift state will be changed mysteriously.
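+
+For comparison, the same scan can be sketched with the restartable
+@code{mbrlen} function and a caller-owned state object, which avoids
+the hidden shared state altogether. The name @code{scan_string_r} and
+the choice to return a character count are assumptions made for this
+illustration.
+

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical restartable variant: returns the number of multibyte
   characters in S, or (size_t) -1 if S contains an invalid or
   incomplete character.  */
size_t
scan_string_r (const char *s)
{
  mbstate_t state;
  size_t length = strlen (s);
  size_t count = 0;

  /* The state belongs to this call only.  */
  memset (&state, '\0', sizeof (state));

  while (length > 0)
    {
      size_t thischar = mbrlen (s, length, &state);
      if (thischar == (size_t) -1 || thischar == (size_t) -2)
        return (size_t) -1;
      if (thischar == 0)
        break;
      /* Advance past this character.  */
      s += thischar;
      length -= thischar;
      ++count;
    }
  return count;
}
```

+
+Because each call has its own state object, several strings can be
+scanned in an interleaved fashion or from different threads.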
+
+
+@node Generic Charset Conversion
+@section Generic Charset Conversion
+
+The conversion functions mentioned so far in this chapter all had in
+common that they operate on character sets that are not directly
+specified by the functions. The multibyte encoding used is specified by
+the currently selected locale for the @code{LC_CTYPE} category. The
+wide character set is fixed by the implementation (in the case of the GNU C
+library it is always UCS-4 encoded @w{ISO 10646}).
+
+This of course causes several problems when it comes to general
+character conversion:
+
+@itemize @bullet
+@item
+For every conversion where neither the source nor the destination
+character set is the character set of the locale for the @code{LC_CTYPE}
+category, one has to change the @code{LC_CTYPE} locale using
+@code{setlocale}.
+
+Changing the @code{LC_CTYPE} locale introduces major problems for the rest
+of the program since several more functions (e.g., the character
+classification functions, @pxref{Classification of Characters}) use the
+@code{LC_CTYPE} category.
+
+@item
+Parallel conversions to and from different character sets are not
+possible since the @code{LC_CTYPE} selection is global and shared by all
+threads.
+
+@item
+If neither the source nor the destination character set is the character
+set used for @code{wchar_t} representation, there is at least a two-step
+process necessary to convert a text using the functions above. One would
+have to select the source character set as the multibyte encoding,
+convert the text into a @code{wchar_t} text, select the destination
+character set as the multibyte encoding, and convert the wide character
+text to the multibyte (@math{=} destination) character set.
+
+Even if this is possible (which is not guaranteed) it is very tedious
+work. It also suffers even more from the other two points raised
+above because the locale has to be changed constantly.
+@end itemize
+
+The XPG2 standard defines a completely new set of functions which has
+none of these limitations. They are not at all coupled to the selected
+locales, and they have no constraints on the character sets selected for
+source and destination. Only the set of available conversions limits
+them. The standard does not specify that any conversion at all must be
+available. Such availability is a measure of the quality of the
+implementation.
+
+The following text first describes the interface to @code{iconv} and
+then the conversion function itself. Comparisons with other
+implementations will show what obstacles stand in the way of portable
+applications. Finally, the implementation is described in so far as might
+interest the advanced user who wants to extend conversion capabilities.
+
+@menu
+* Generic Conversion Interface:: Generic Character Set Conversion Interface.
+* iconv Examples:: A complete @code{iconv} example.
+* Other iconv Implementations:: Some Details about other @code{iconv}
+ Implementations.
+* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C
+ library.
+@end menu
+
+@node Generic Conversion Interface
+@subsection Generic Character Set Conversion Interface
+
+This set of functions follows the traditional cycle of using a resource:
+open--use--close. The interface consists of three functions, each of
+which implements one step.
+
+Before the interfaces are described it is necessary to introduce a
+data type. Just like other open--use--close interfaces the functions
+introduced here work using handles and the @file{iconv.h} header
+defines a special type for the handles used.
+
+@comment iconv.h
+@comment XPG2
+@deftp {Data Type} iconv_t
+This data type is an abstract type defined in @file{iconv.h}. The user
+must not assume anything about the definition of this type; it must be
+completely opaque.
+
+Objects of this type can be assigned handles for conversions by the
+@code{iconv} functions. The objects themselves need not be freed, but
+the conversions the handles stand for have to be.
+@end deftp
+
+@noindent
+The first step is the function to create a handle.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
+The @code{iconv_open} function has to be used before starting a
+conversion. The two parameters this function takes determine the
+source and destination character set for the conversion, and if the
+implementation has the possibility to perform such a conversion, the
+function returns a handle.
+
+If the wanted conversion is not available, the @code{iconv_open} function
+returns @code{(iconv_t) -1}. In this case the global variable
+@code{errno} can have the following values:
+
+@table @code
+@item EMFILE
+The process already has @code{OPEN_MAX} file descriptors open.
+@item ENFILE
+The system limit of open files is reached.
+@item ENOMEM
+Not enough memory to carry out the operation.
+@item EINVAL
+The conversion from @var{fromcode} to @var{tocode} is not supported.
+@end table
+
+It is not possible to use the same descriptor in different threads to
+perform independent conversions. The data structures associated
+with the descriptor include information about the conversion state.
+This state must not be corrupted by using it in different conversions.
+
+An @code{iconv} descriptor is like a file descriptor in that a new
+descriptor must be created for every use. The descriptor does not stand
+for all of the conversions from @var{fromcode} to @var{tocode}.
+
+The GNU C library implementation of @code{iconv_open} has one
+significant extension to other implementations. To ease the extension
+of the set of available conversions, the implementation allows storing
+the necessary files with data and code in an arbitrary number of
+directories. How this extension must be written will be explained below
+(@pxref{glibc iconv Implementation}). Here it is only important to say
+that all directories mentioned in the @code{GCONV_PATH} environment
+variable are considered only if they contain a file @file{gconv-modules}.
+These directories need not necessarily be created by the system
+administrator. In fact, this extension is introduced to help users
+write and use their own, new conversions. For security reasons this
+does not work in SUID binaries; in this case only the system
+directory is considered and this normally is
+@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment variable
+is examined exactly once at the first call of the @code{iconv_open}
+function. Later modifications of the variable have no effect.
+
+@pindex iconv.h
+The @code{iconv_open} function was introduced early in the X/Open
+Portability Guide, @w{version 2}. It is supported by all commercial
+Unices as it is required for the Unix branding. However, the quality and
+completeness of the implementation varies widely. The @code{iconv_open}
+function is declared in @file{iconv.h}.
+@end deftypefun
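+
+As a minimal sketch of the error handling just described, the helper
+below wraps @code{iconv_open} and separates an unsupported conversion
+(@code{EINVAL}) from the resource-related failures. The name
+@code{open_conversion} and the wording of the messages are assumptions
+made for illustration; they are not part of any library interface.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a conversion and report why it failed.
   Returns (iconv_t) -1 on failure, with errno left as set by
   iconv_open.  */
iconv_t
open_conversion (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    {
      /* fprintf may change errno, so remember the original value.  */
      int saved_errno = errno;

      if (saved_errno == EINVAL)
        fprintf (stderr, "conversion from '%s' to '%s' is not supported\n",
                 fromcode, tocode);
      else
        fprintf (stderr, "iconv_open: %s\n", strerror (saved_errno));

      errno = saved_errno;
    }

  return cd;
}
```

+A caller can use the helper exactly like @code{iconv_open} and still
+inspect @code{errno} afterwards, since the value is restored before
+returning.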
+
+The @code{iconv} implementation can associate large data structures with
+the handle returned by @code{iconv_open}. Therefore, it is crucial to
+free all the resources once all conversions are carried out and the
+conversion is not needed anymore.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun int iconv_close (iconv_t @var{cd})
+The @code{iconv_close} function frees all resources associated with the
+handle @var{cd}, which must have been returned by a successful call to
+the @code{iconv_open} function.
+
+If the function call was successful the return value is @math{0}.
+Otherwise it is @math{-1} and @code{errno} is set appropriately.
+Defined errors are:
+
+@table @code
+@item EBADF
+The conversion descriptor is invalid.
+@end table
+
+@pindex iconv.h
+The @code{iconv_close} function was introduced together with the rest
+of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.
+@end deftypefun
+
+The standard defines only one actual conversion function. This has,
+therefore, the most general interface: it allows conversion from one
+buffer to another. Conversion from a file to a buffer, vice versa, or
+even file to file can be implemented on top of it.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
+@cindex stateful
+The @code{iconv} function converts the text in the input buffer
+according to the rules associated with the descriptor @var{cd} and
+stores the result in the output buffer. It is possible to call the
+function for the same text several times in a row since for stateful
+character sets the necessary state information is kept in the data
+structures associated with the descriptor.
+
+The input buffer is specified by @code{*@var{inbuf}} and it contains
+@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for
+communicating the used input back to the caller (see below). It is
+important to note that the buffer pointer is of type @code{char *} and the
+length is measured in bytes even if the input text is encoded in wide
+characters.
+
+The output buffer is specified in a similar way. @code{*@var{outbuf}}
+points to the beginning of the buffer with at least
+@code{*@var{outbytesleft}} bytes room for the result. The buffer
+pointer again is of type @code{char *} and the length is measured in
+bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the
+conversion is performed but no output is available.
+
+If @var{inbuf} is a null pointer, the @code{iconv} function performs the
+necessary action to put the state of the conversion into the initial
+state. This is obviously a no-op for non-stateful encodings, but if the
+encoding has a state, such a function call might put some byte sequences
+in the output buffer, which perform the necessary state changes. The
+next call with @var{inbuf} not being a null pointer then simply goes on
+from the initial state. It is important that the programmer never makes
+any assumption as to whether the conversion has to deal with states. Even
+if the input and output character sets are not stateful, the
+implementation might still have to keep states. This is due to the
+implementation chosen for the GNU C library as it is described below.
+Therefore an @code{iconv} call to reset the state should always be
+performed if some protocol requires this for the output text.
+
+The conversion stops for one of three reasons. The first is that all
+characters from the input buffer are converted. This actually can mean
+two things: either all bytes from the input buffer are consumed or
+there are some bytes at the end of the buffer that possibly can form a
+complete character but the input is incomplete. The second reason for a
+stop is that the output buffer is full. And the third reason is that
+the input contains invalid characters.
+
+In all of these cases the buffer pointers after the last successful
+conversion, for input and output buffer, are stored in
+@code{*@var{inbuf}} and @code{*@var{outbuf}}, and the available room in
+each buffer is stored in @code{*@var{inbytesleft}} and
+@code{*@var{outbytesleft}}.
+
+Since the character sets selected in the @code{iconv_open} call can be
+almost arbitrary, there can be situations where the input buffer contains
+valid characters, which have no identical representation in the output
+character set. The behavior in this situation is undefined. The
+@emph{current} behavior of the GNU C library in this situation is to
+return with an error immediately. This certainly is not the most
+desirable solution; therefore, future versions will provide better ones,
+but they are not yet finished.
+
+If all input from the input buffer is successfully converted and stored
+in the output buffer, the function returns the number of non-reversible
+conversions performed. In all other cases the return value is
+@code{(size_t) -1} and @code{errno} is set appropriately. In such cases
+the value pointed to by @var{inbytesleft} is nonzero.
+
+@table @code
+@item EILSEQ
+The conversion stopped because of an invalid byte sequence in the input.
+After the call, @code{*@var{inbuf}} points at the first byte of the
+invalid byte sequence.
+
+@item E2BIG
+The conversion stopped because it ran out of space in the output buffer.
+
+@item EINVAL
+The conversion stopped because of an incomplete byte sequence at the end
+of the input buffer.
+
+@item EBADF
+The @var{cd} argument is invalid.
+@end table
+
+@pindex iconv.h
+The @code{iconv} function was introduced in the XPG2 standard and is
+declared in the @file{iconv.h} header.
+@end deftypefun
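+
+The calling convention described above can be sketched as follows. The
+helper @code{convert_all} is a hypothetical name; it assumes that the
+whole input and output fit in memory, and it simply reports failure
+instead of recovering from @code{E2BIG}, @code{EILSEQ}, or
+@code{EINVAL}.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN with descriptor CD into
   OUT, which has room for OUTLEN bytes.  Returns the number of bytes
   written, or (size_t) -1 with errno as set by iconv.  */
size_t
convert_all (iconv_t cd, const char *in, size_t inlen,
             char *out, size_t outlen)
{
  /* iconv takes a non-const pointer but does not modify the input.  */
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outlen;

  /* Convert the text.  iconv advances both pointers and decrements both
     counts to reflect what was consumed and produced.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;

  /* A null input buffer asks for the byte sequence, if any, that brings
     the output back to the initial state.  */
  if (iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;

  return (size_t) (outptr - out);
}
```

+Because the reset call is made unconditionally, the caller does not need
+to know whether either character set is stateful.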
+
+The definition of the @code{iconv} function is quite good overall. It
+provides quite flexible functionality. The only problems lie in the
+boundary cases, which are incomplete byte sequences at the end of the
+input buffer and invalid input. A third problem, which is not really
+a design problem, is the way conversions are selected. The standard
+does not say anything about the legitimate names or about a minimal set
+of available conversions. We will see how this negatively impacts other
+implementations, as demonstrated below.
+
+@node iconv Examples
+@subsection A complete @code{iconv} example
+
+The example below features a solution for a common problem. Given that
+one knows the internal encoding used by the system for @code{wchar_t}
+strings, one often is in the position to read text from a file and store
+it in wide character buffers. One can do this using @code{mbsrtowcs},
+but then we run into the problems discussed above.
+
+@smallexample
+#include <errno.h>
+#include <error.h>
+#include <iconv.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <wchar.h>
+
+/* @r{@var{avail} is the size of @var{outbuf} in bytes.} */
+int
+file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
+@{
+  char inbuf[BUFSIZ];
+  size_t insize = 0;
+  char *wrptr = (char *) outbuf;
+  int result = 0;
+  iconv_t cd;
+
+  cd = iconv_open ("WCHAR_T", charset);
+  if (cd == (iconv_t) -1)
+    @{
+      /* @r{Something went wrong.} */
+      if (errno == EINVAL)
+        error (0, 0, "conversion from '%s' to wchar_t not available",
+               charset);
+      else
+        perror ("iconv_open");
+
+      /* @r{Terminate the output string.} */
+      *outbuf = L'\0';
+
+      return -1;
+    @}
+
+  while (avail > 0)
+    @{
+      ssize_t nread;
+      size_t nconv;
+      char *inptr = inbuf;
+
+      /* @r{Read more input.} */
+      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
+      if (nread < 0)
+        @{
+          /* @r{The read itself failed.  Report it and stop.} */
+          perror ("read");
+          result = -1;
+          break;
+        @}
+      if (nread == 0)
+        @{
+          /* @r{When we come here the file is completely read.}
+             @r{This still could mean there are some unused}
+             @r{characters in the @code{inbuf}.  Put them back.} */
+          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
+            result = -1;
+
+          /* @r{Now write out the byte sequence to get into the}
+             @r{initial state if this is necessary.} */
+          iconv (cd, NULL, NULL, &wrptr, &avail);
+
+          break;
+        @}
+      insize += nread;
+
+      /* @r{Do the conversion.} */
+      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
+      if (nconv == (size_t) -1)
+        @{
+          /* @r{Not everything went right.  It might only be}
+             @r{an unfinished byte sequence at the end of the}
+             @r{buffer.  Or it is a real problem.} */
+          if (errno == EINVAL)
+            /* @r{This is harmless.  Simply move the unused}
+               @r{bytes to the beginning of the buffer so that}
+               @r{they can be used in the next round.} */
+            memmove (inbuf, inptr, insize);
+          else
+            @{
+              /* @r{It is a real problem.  Maybe we ran out of}
+                 @r{space in the output buffer or we have invalid}
+                 @r{input.  In any case back the file pointer to}
+                 @r{the position of the last processed byte.} */
+              lseek (fd, -(off_t) insize, SEEK_CUR);
+              result = -1;
+              break;
+            @}
+        @}
+    @}
+
+  /* @r{Terminate the output string.} */
+  if (avail >= sizeof (wchar_t))
+    *((wchar_t *) wrptr) = L'\0';
+
+  if (iconv_close (cd) != 0)
+    perror ("iconv_close");
+
+  /* @r{Report a failure if one was recorded; otherwise return the}
+     @r{number of wide characters stored.} */
+  return result < 0 ? result : (int) ((wchar_t *) wrptr - outbuf);
+@}
+@end smallexample
+
+@cindex stateful
+This example shows the most important aspects of using the @code{iconv}
+functions. It shows how successive calls to @code{iconv} can be used to
+convert large amounts of text. The user does not have to care about
+stateful encodings as the functions take care of everything.
+
+An interesting point is the case where @code{iconv} returns an error and
+@code{errno} is set to @code{EINVAL}. This is not really an error in the
+transformation. It can happen whenever the input character set contains
+byte sequences of more than one byte for some character and texts are not
+processed in one piece. In this case there is a chance that a multibyte
+sequence is cut. The caller can then simply read the remainder of the
+text and feed the offending bytes together with new characters from the
+input to @code{iconv} and continue the work. The internal state kept in
+the descriptor is @emph{not} unspecified after such an event as is the
+case with the conversion functions from the @w{ISO C} standard.
+
+The example also shows the problem of using wide character strings with
+@code{iconv}. As explained in the description of the @code{iconv}
+function above, the function always takes a pointer to a @code{char}
+array and the available space is measured in bytes. In the example, the
+output buffer is a wide character buffer; therefore, we use a local
+variable @var{wrptr} of type @code{char *}, which is used in the
+@code{iconv} calls.
+
+This looks rather innocent but can lead to problems on platforms that
+have tight restrictions on alignment. Therefore the caller of @code{iconv}
+has to make sure that the pointers passed are suitable for access of
+characters from the appropriate character set. Since, in the
+above case, the input parameter to the function is a @code{wchar_t}
+pointer, this is the case (unless the user violates alignment when
+computing the parameter). But in other situations, especially when
+writing generic functions where one does not know what type of character
+set one uses and, therefore, treats text as a sequence of bytes, it might
+become tricky.
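+
+The pointer handling discussed above can be sketched for an in-memory
+conversion. The helper @code{utf8_to_wcs} is hypothetical; it assumes
+that the implementation accepts the name @code{WCHAR_T} as in the
+example above, and it relies on the caller passing a genuine
+@code{wchar_t} array so that the @code{char *} alias is suitably
+aligned.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert the UTF-8 text IN (INLEN bytes) into the
   wide character array OUT, which has room for OUTCOUNT elements
   including the terminating null.  Returns the number of wide
   characters stored, not counting the terminator, or (size_t) -1.
   On failure OUT is not terminated.  */
size_t
utf8_to_wcs (const char *in, size_t inlen, wchar_t *out, size_t outcount)
{
  if (outcount == 0)
    return (size_t) -1;

  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  /* OUT is a wchar_t array, so this byte pointer is suitably aligned.  */
  char *wrptr = (char *) out;
  size_t inleft = inlen;
  /* iconv counts bytes, not wide characters.  Keep one element free for
     the terminator.  */
  size_t outleft = (outcount - 1) * sizeof (wchar_t);

  size_t rc = iconv (cd, &inptr, &inleft, &wrptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  /* Translate the byte position back into a wide character count.  */
  size_t nwide = (size_t) ((wchar_t *) wrptr - out);
  out[nwide] = L'\0';
  return nwide;
}
```

+The byte count passed to @code{iconv} is derived from the element count
+with @code{sizeof (wchar_t)}, which keeps the two units from being
+confused.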
+
+@node Other iconv Implementations
+@subsection Some Details about other @code{iconv} Implementations
+
+This is not really the place to discuss the @code{iconv} implementation
+of other systems but it is necessary to know a bit about them to write
+portable programs. The above-mentioned problems with the specification
+of the @code{iconv} functions can lead to portability issues.
+
+The first thing to notice is that, due to the large number of character
+sets in use, it is certainly not practical to encode the conversions
+directly in the C library. Therefore, the conversion information must
+come from files outside the C library. This is usually done in one or
+both of the following ways:
+
+@itemize @bullet
+@item
+The C library contains a set of generic conversion functions which can
+read the needed conversion tables and other information from data files.
+These files get loaded when necessary.
+
+This solution is problematic as it requires a great deal of effort to
+apply to all character sets (potentially an infinite set). The
+differences in the structure of the different character sets are so large
+that many different variants of the table-processing functions must be
+developed. In addition, the generic nature of these functions makes them
+slower than specifically implemented functions.
+
+@item
+The C library only contains a framework which can dynamically load
+object files and execute the conversion functions contained therein.
+
+This solution provides much more flexibility. The C library itself
+contains only very little code and therefore reduces the general memory
+footprint. Also, with a documented interface between the C library and
+the loadable modules it is possible for third parties to extend the set
+of available conversion modules. A drawback of this solution is that
+dynamic loading must be available.
+@end itemize
+
+Some implementations in commercial Unices implement a mixture of these
+possibilities; the majority implement only the second solution. Using
+loadable modules moves the code out of the library itself and keeps
+the door open for extensions and improvements, but this design is also
+limiting on some platforms since not many platforms support dynamic
+loading in statically linked programs. On platforms without this
+capability it is therefore not possible to use this interface in
+statically linked programs. The GNU C library has, on ELF platforms, no
+problems with dynamic loading in these situations; therefore, this
+point is moot. The danger is that one gets accustomed to this situation
+and forgets about the restrictions on other systems.
+
+A second thing to know about other @code{iconv} implementations is that
+the number of available conversions is often very limited. Some
+implementations provide, in the standard release (not special
+international or developer releases), at most 100 to 200 conversion
+possibilities. This does not mean 200 different character sets are
+supported; for example, conversions from one character set to a set of 10
+others might count as 10 conversions. Together with the other direction
+this makes 20 conversion possibilities used up by one character set. One
+can imagine the thin coverage these platforms provide. Some Unix vendors
+even provide only a handful of conversions which renders them useless for
+almost all uses.
+
+This directly leads to a third and probably the most problematic point.
+Given the way the @code{iconv} conversion functions are implemented on
+all known Unix systems, the availability of the conversion from
+character set @math{@cal{A}} to @math{@cal{B}} and of the conversion from
+@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
+conversion from @math{@cal{A}} to @math{@cal{C}} is available.
+
+This might not seem unreasonable or problematic at first, but it is
+quite a big problem, as one will notice shortly after hitting it. To
+show the problem, assume we write a program which has to convert from
+@math{@cal{A}} to @math{@cal{C}}. A call like
+
+@smallexample
+cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
+@end smallexample
+
+@noindent
+fails according to the assumption above. But what does the program
+do now? The conversion is necessary; therefore, simply giving up is not
+an option.
+
+This is a nuisance. The @code{iconv} function should take care of this.
+But how should the program proceed from here on? If it tries to convert
+to character set @math{@cal{B}} first, the two @code{iconv_open}
+calls
+
+@smallexample
+cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
+@end smallexample
+
+@noindent
+and
+
+@smallexample
+cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
+@end smallexample
+
+@noindent
+will succeed, but how to find @math{@cal{B}}?
+
+Unfortunately, the answer is: there is no general solution. On some
+systems guessing might help. On those systems most character sets can
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
+this, only some very system-specific methods can help. Since the
+conversion functions come from loadable modules and these modules must
+be stored somewhere in the filesystem, one @emph{could} try to find them
+and determine from the available files which conversions are available
+and whether there is an indirect route from @math{@cal{A}} to
+@math{@cal{C}}.
+
+This example shows one of the design errors of @code{iconv} mentioned
+above. It should at least be possible to determine the list of available
+conversions programmatically so that if @code{iconv_open} says there is no
+such conversion, one could make sure this also is true for indirect
+routes.
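+
+Where a usable intermediate character set is known, the manual detour
+can be sketched like this. The function names are hypothetical, UTF-8
+is only assumed to be a workable pivot (which, as said above, is a guess
+on some systems), and the caller is assumed to supply a scratch buffer
+large enough for the intermediate text.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: run one conversion over a whole buffer.
   Returns the number of bytes produced or (size_t) -1.  */
static size_t
convert_step (const char *tocode, const char *fromcode,
              const char *in, size_t inlen, char *out, size_t outlen)
{
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outlen;

  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (rc != (size_t) -1)
    /* Return to the initial state in case the output is stateful.  */
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);
  iconv_close (cd);

  return rc == (size_t) -1 ? (size_t) -1 : (size_t) (outptr - out);
}

/* Use the direct conversion if iconv_open offers it; otherwise go
   through UTF-8 by hand, using SCRATCH for the intermediate text.  */
size_t
convert_with_pivot (const char *tocode, const char *fromcode,
                    const char *in, size_t inlen,
                    char *scratch, size_t scratchlen,
                    char *out, size_t outlen)
{
  iconv_t probe = iconv_open (tocode, fromcode);
  if (probe != (iconv_t) -1)
    {
      iconv_close (probe);
      return convert_step (tocode, fromcode, in, inlen, out, outlen);
    }

  /* No direct route: first to the pivot, then from the pivot.  */
  size_t n = convert_step ("UTF-8", fromcode, in, inlen,
                           scratch, scratchlen);
  if (n == (size_t) -1)
    return (size_t) -1;

  return convert_step (tocode, "UTF-8", scratch, n, out, outlen);
}
```

+This only works when both halves of the detour exist, which is exactly
+what a program cannot determine in general.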
+
+@node glibc iconv Implementation
+@subsection The @code{iconv} Implementation in the GNU C library
+
+After reading about the problems of @code{iconv} implementations in the
+last section it is certainly good to note that the implementation in
+the GNU C library has none of the problems mentioned above. What
+follows is a step-by-step analysis of the points raised above. The
+evaluation is based on the current state of the development (as of
+January 1999). The development of the @code{iconv} functions is not
+complete, but basic functionality has solidified.
+
+The GNU C library's @code{iconv} implementation uses shared loadable
+modules to implement the conversions. A very small number of
+conversions are built into the library itself but these are only rather
+trivial conversions.
+
+All the benefits of loadable modules are available in the GNU C library
+implementation. This is especially appealing since the interface is
+well documented (see below), and it, therefore, is easy to write new
+conversion modules. The drawback of using loadable objects is not a
+problem in the GNU C library, at least on ELF systems. Since the
+library is able to load shared objects even in statically linked
+binaries, static linking need not be forbidden in case one wants to use
+@code{iconv}.
+
+The second mentioned problem is the number of supported conversions.
+Currently, the GNU C library supports more than 150 character sets. The
+way the implementation is designed, the number of supported conversions
+is greater than 22350 (@math{150} times @math{149}). If any conversion
+from or to a character set is missing, it can be added easily.
+
+Impressive as it may be, this high number is due to the
+fact that the GNU C library implementation of @code{iconv} does not have
+the third problem mentioned above (i.e., whenever there is a conversion
+from a character set @math{@cal{A}} to @math{@cal{B}} and from
+@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
+@math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open}
+returns an error and sets @code{errno} to @code{EINVAL}, there is no
+known way, directly or indirectly, to perform the wanted conversion.
+
+@cindex triangulation
+Triangulation is achieved by providing for each character set a
+conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646}
+as an intermediate representation it is possible to @dfn{triangulate}
+(i.e., convert with an intermediate representation).
+
+There is no inherent requirement to provide a conversion to @w{ISO
+10646} for a new character set, and it is also possible to provide other
+conversions where neither source nor destination character set is @w{ISO
+10646}. The existing set of conversions is simply meant to cover all
+conversions that might be of interest.
+
+@cindex ISO-2022-JP
+@cindex EUC-JP
+All currently available conversions use the triangulation method above,
+making conversion run unnecessarily slow. If, for example, somebody
+often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
+would involve direct conversion between the two character sets, skipping
+the intermediate step through @w{ISO 10646}. The two character sets of interest
+are much more similar to each other than to @w{ISO 10646}.
+
+In such a situation one easily can write a new conversion and provide it
+as a better alternative. The GNU C library @code{iconv} implementation
+would automatically use the module implementing the conversion if it is
+specified to be more efficient.
+
+@subsubsection Format of @file{gconv-modules} files
+
+All information about the available conversions comes from a file named
+@file{gconv-modules} which can be found in any of the directories along
+the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented
+text files, where each of the lines has one of the following formats:
+
+@itemize @bullet
+@item
+If the first non-whitespace character is a @kbd{#} the line contains only
+comments and is ignored.
+
+@item
+Lines starting with @code{alias} define an alias name for a character
+set. Two more words are expected on the line. The first word
+defines the alias name, and the second defines the original name of the
+character set. The effect is that it is possible to use the alias name
+in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open} and
+achieve the same result as when using the real character set name.
+
+This is quite important as a character set often has many different
+names. There is normally an official name but this need not correspond to
+the most popular name. Besides this, many character sets have special
+names that are somehow constructed. For example, all character sets
+specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
+where @var{nnn} is the registration number. This allows programs which
+know about the registration number to construct character set names and
+use them in @code{iconv_open} calls. More on the available names and
+aliases follows below.
+
+@item
+Lines starting with @code{module} introduce an available conversion
+module. These lines must contain three or four more words.
+
+The first word specifies the source character set, the second word the
+destination character set of the conversion implemented in this module, and
+the third word is the name of the loadable module. The filename is
+constructed by appending the usual shared object suffix (normally
+@file{.so}) and this file is then supposed to be found in the same
+directory the @file{gconv-modules} file is in. The last word on the line,
+which is optional, is a numeric value representing the cost of the
+conversion. If this word is missing, a cost of @math{1} is assumed. The
+numeric value itself does not matter that much; what counts are the
+relative values of the sums of costs for all possible conversion paths.
+Below is a more precise description of the use of the cost value.
+@end itemize
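+
+To make the three line formats concrete, here is a rough sketch of how
+one line could be classified. It is not the parser the library uses;
+the type and function names, the 63-character field limit, and the
+treatment of blank lines as ignorable are assumptions made for
+illustration.

```c
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char from[64];    /* alias name, or source character set */
  char to[64];      /* original name, or destination character set */
  char module[64];  /* module name without the shared object suffix */
  int cost;         /* stays at 1 when the line gives no cost */
};

/* Illustrative parser for a single line of a gconv-modules file.  */
struct parsed_line
parse_gconv_line (const char *line)
{
  struct parsed_line p = { LINE_INVALID, "", "", "", 1 };
  char keyword[16];
  int n;

  /* Skip leading whitespace; comments and blank lines are ignored.  */
  line += strspn (line, " \t");
  if (*line == '#' || *line == '\n' || *line == '\0')
    {
      p.kind = LINE_IGNORED;
      return p;
    }

  if (sscanf (line, "%15s", keyword) != 1)
    return p;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias and the original name.  */
      if (sscanf (line, "%*s %63s %63s", p.from, p.to) == 2)
        p.kind = LINE_ALIAS;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words: source, destination, module name,
         and an optional cost.  */
      n = sscanf (line, "%*s %63s %63s %63s %d",
                  p.from, p.to, p.module, &p.cost);
      if (n == 3 || n == 4)
        p.kind = LINE_MODULE;
    }

  return p;
}
```

+A @code{module} line without a fourth word leaves the cost at its
+default of @math{1}, matching the rule stated above.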
+
+Let us return to the example above, where one has written a module to
+directly convert from ISO-2022-JP to EUC-JP and back. All that has to be done is
+to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a directory
+and add a file @file{gconv-modules} with the following content in the
+same directory:
+
+@smallexample
+module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
+module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
+@end smallexample
+
+To see why this is sufficient, it is necessary to understand how the
+conversion used by @code{iconv} (and described in the descriptor) is
+selected. The approach to this problem is quite simple.
+
+At the first call of the @code{iconv_open} function the program reads
+all available @file{gconv-modules} files and builds up two tables: one
+containing all the known aliases and another that contains the
+information about the conversions and which shared object implements
+them.
+
+@subsubsection Finding the conversion path in @code{iconv}
+
+The set of available conversions forms a directed graph with weighted
+edges. The weights on the edges are the costs specified in the
+@file{gconv-modules} files. The @code{iconv_open} function uses an
+algorithm suitable for searching for the best path in such a graph and so
+constructs a list of conversions which must be performed in succession
+to get the transformation from the source to the destination character
+set.
+
+Explaining why the above @file{gconv-modules} file allows the
+@code{iconv} implementation to resolve the specific ISO-2022-JP to
+EUC-JP conversion module instead of the conversion coming with the
+library itself is straightforward. Since the latter conversion takes two
+steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
+EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules}
+file, however, specifies that the new conversion module can perform this
+conversion with only the cost of @math{1}.
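+
+The cost comparison can be reproduced with a small shortest-path sketch.
+The text above does not say which algorithm @code{iconv_open} uses;
+Bellman-Ford relaxation is chosen here only because it is short, and all
+names in the block are hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* One available conversion step with its cost.  */
struct edge
{
  const char *from;
  const char *to;
  int cost;
};

#define MAX_NODES 16

/* Return the index of NAME in NAMES, adding it if necessary, or -1 if
   the table is full.  */
static int
node_index (const char *names[], int *count, const char *name)
{
  for (int i = 0; i < *count; i++)
    if (strcmp (names[i], name) == 0)
      return i;
  if (*count == MAX_NODES)
    return -1;
  names[*count] = name;
  return (*count)++;
}

/* Cheapest total cost from SRC to DST over EDGES, or -1 if there is no
   route (or more than MAX_NODES character sets are involved).  */
int
cheapest_cost (const struct edge *edges, size_t nedges,
               const char *src, const char *dst)
{
  const char *names[MAX_NODES];
  int dist[MAX_NODES];
  int count = 0;

  int s = node_index (names, &count, src);
  int d = node_index (names, &count, dst);
  for (size_t i = 0; i < nedges; i++)
    if (node_index (names, &count, edges[i].from) < 0
        || node_index (names, &count, edges[i].to) < 0)
      return -1;

  for (int i = 0; i < count; i++)
    dist[i] = INT_MAX;
  dist[s] = 0;

  /* Relax every edge count - 1 times.  */
  for (int round = 1; round < count; round++)
    for (size_t i = 0; i < nedges; i++)
      {
        int u = node_index (names, &count, edges[i].from);
        int v = node_index (names, &count, edges[i].to);
        if (dist[u] != INT_MAX && dist[u] + edges[i].cost < dist[v])
          dist[v] = dist[u] + edges[i].cost;
      }

  return dist[d] == INT_MAX ? -1 : dist[d];
}
```

+With only the two steps through @code{INTERNAL} the result is
+@math{2}; adding the direct edge of cost @math{1} makes it the cheaper
+route.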
+
+A mysterious item about the @file{gconv-modules} file above (and also
+the file coming with the GNU C library) are the names of the character
+sets specified in the @code{module} lines. Why do almost all the names
+end in @code{//}? And this is not all: the names can actually be
+regular expressions. At this point in time this mystery should not be
+revealed, unless you have the relevant spell-casting materials: ashes
+from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
+blessed by St.@: Emacs, assorted herbal roots from Central America, sand
+from Cebu, etc. Sorry! @strong{The part of the implementation where
+this is used is not yet finished. For now please simply follow the
+existing examples. It'll become clearer once it is. --drepper}
+
+A last remark about the @file{gconv-modules} is about the names not
+ending with @code{//}. A character set named @code{INTERNAL} is often
+mentioned. From the discussion above and the chosen name it should have
+become clear that this is the name for the representation used in the
+intermediate step of the triangulation. We have said that this is UCS-4
+but actually that is not quite right. The UCS-4 specification also
+includes the specification of the byte ordering used. Since a UCS-4 value
+consists of four bytes, a stored value is affected by byte ordering. The
+internal representation is @emph{not} the same as UCS-4 in case the byte
+ordering of the processor (or at least the running process) is not the
+same as the one required for UCS-4. This is done for performance reasons
+as one does not want to perform unnecessary byte-swapping operations if
+one is not interested in actually seeing the result in UCS-4. To avoid
+trouble with endianness, the internal representation consistently is named
+@code{INTERNAL} even on big-endian systems where the representations are
+identical.
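+
+The byte-order point can be illustrated with a small sketch. It
+assumes, as the text implies, that UCS-4 stores the most significant
+byte first, and it uses a @code{uint32_t} in host byte order as a
+stand-in for the @code{INTERNAL} representation; both function names are
+hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Store CH in the byte order assumed for UCS-4 (most significant byte
   first), regardless of the host byte order.  */
void
store_ucs4 (uint32_t ch, unsigned char out[4])
{
  out[0] = (unsigned char) (ch >> 24);
  out[1] = (unsigned char) (ch >> 16);
  out[2] = (unsigned char) (ch >> 8);
  out[3] = (unsigned char) ch;
}

/* Return nonzero if a value kept in host byte order already has the
   UCS-4 byte layout, that is, if no byte swapping would be needed.  */
int
host_order_is_ucs4 (void)
{
  uint32_t probe = 0x01020304;
  unsigned char host[4];
  unsigned char ucs4[4];

  memcpy (host, &probe, sizeof host);
  store_ucs4 (probe, ucs4);
  return memcmp (host, ucs4, sizeof host) == 0;
}
```

+On a little-endian host the second function returns zero, which is the
+case where converting to real UCS-4 costs an extra byte swap.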
+
+@subsubsection @code{iconv} module data structures
+
+So far this section has described how modules are located and considered
+to be used. What remains to be described is the interface of the modules
+so that one can write new ones. This section describes the interface as
+it is in use in January 1999. The interface will change a bit in the
+future but, with luck, only in an upwardly compatible way.
+
+The definitions necessary to write new modules are publicly available
+in the non-standard header @file{gconv.h}. The following text,
+therefore, describes the definitions from this header file. First,
+however, it is necessary to get an overview.
+
+From the perspective of the user of @code{iconv} the interface is quite
+simple: the @code{iconv_open} function returns a handle that can be used
+in calls to @code{iconv}, and finally the handle is freed with a call to
+@code{iconv_close}. The problem is that the handle has to be able to
+represent the possibly long sequences of conversion steps and also the
+state of each conversion since the handle is all that is passed to the
+@code{iconv} function. Therefore, the data structures are really the
+key to understanding the implementation.
+
+We need two different kinds of data structures. The first describes the
+conversion and the second describes the state etc. There are really two
+type definitions like this in @file{gconv.h}.
+@pindex gconv.h
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct __gconv_step}
+This data structure describes one conversion a module can perform. For
+each function in a loaded module with conversion functions there is
+exactly one object of this type. This object is shared by all users of
+the conversion (i.e., this object does not contain any information
+corresponding to an actual conversion; it only describes the conversion
+itself).
+
+@table @code
+@item struct __gconv_loaded_object *__shlib_handle
+@itemx const char *__modname
+@itemx int __counter
+All these elements of the structure are used internally in the C library
+to coordinate loading and unloading the shared object. One must not expect any
+of the other elements to be available or initialized.
+
+@item const char *__from_name
+@itemx const char *__to_name
+@code{__from_name} and @code{__to_name} contain the names of the source and
+destination character sets. They can be used to identify the actual
+conversion to be carried out since one module might implement conversions
+for more than one character set and/or direction.
+
+@item gconv_fct __fct
+@itemx gconv_init_fct __init_fct
+@itemx gconv_end_fct __end_fct
+These elements contain pointers to the functions in the loadable module.
+The interface will be explained below.
+
+@item int __min_needed_from
+@itemx int __max_needed_from
+@itemx int __min_needed_to
+@itemx int __max_needed_to
+These values have to be supplied in the init function of the module. The
+@code{__min_needed_from} value specifies how many bytes a character of
+the source character set at least needs. The @code{__max_needed_from}
+specifies the maximum value that also includes possible shift sequences.
+
+The @code{__min_needed_to} and @code{__max_needed_to} values serve the
+same purpose as @code{__min_needed_from} and @code{__max_needed_from} but
+this time for the destination character set.
+
+It is crucial that these values be accurate since otherwise the
+conversion functions will have problems or not work at all.
+
+@item int __stateful
+This element must also be initialized by the init function.
+@code{__stateful} is nonzero if the source character set is stateful.
+Otherwise it is zero.
+
+@item void *__data
+This element can be used freely by the conversion functions in the
+module. @code{__data} can be used to communicate extra information
+from one call to another. @code{__data} need not be initialized if
+not needed at all. If the @code{__data} element is assigned a pointer
+to dynamically allocated memory (presumably in the init function), the
+end function must deallocate the memory. Otherwise
+the application will leak memory.
+
+It is important to be aware that this data structure is shared by all
+users of this specific conversion and therefore the @code{__data}
+element must not contain data specific to one specific use of the
+conversion function.
+@end table
+@end deftp
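+
+As a rough illustration of the values an init function has to supply,
+the sketch below fills in a stand-in structure for a hypothetical
+stateless 8-bit character set that is converted to a four-byte
+intermediate form. @code{struct demo_step} and @code{demo_step_init}
+are invented for this example; the real structure is declared in
+@file{gconv.h}, and the real init function interface is not shown here.

```c
#include <stddef.h>

/* Stand-in for the fields of struct __gconv_step discussed above.  The
   real definition lives in gconv.h and has more members.  */
struct demo_step
{
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
  int stateful;
  void *data;
};

/* Hypothetical initialization for a stateless 8-bit character set that
   is converted to a four-byte intermediate representation.  */
void
demo_step_init (struct demo_step *step)
{
  step->min_needed_from = 1;  /* every source character is one byte */
  step->max_needed_from = 1;  /* and there are no shift sequences */
  step->min_needed_to = 4;    /* every result character is four bytes */
  step->max_needed_to = 4;
  step->stateful = 0;         /* the source character set has no state */
  step->data = NULL;          /* no shared extra information is needed */
}
```

+Because the object is shared by all users of the conversion, nothing
+specific to one particular conversion run is stored in it.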
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct __gconv_step_data}
+This is the data structure that contains the information specific to
+each use of the conversion functions.
+
+
+@table @code
+@item char *__outbuf
+@itemx char *__outbufend
+These elements specify the output buffer for the conversion step. The
+@code{__outbuf} element points to the beginning of the buffer, and
+@code{__outbufend} points to the byte following the last byte in the
+buffer. The conversion function must not assume anything about the size
+of the buffer but it can be safely assumed that there is room for at
+least one complete character in the output buffer.
+
+Once the conversion is finished, if the conversion is the last step, the
+@code{__outbuf} element must be modified to point after the last byte
+written into the buffer to signal how much output is available. If this
+conversion step is not the last one, the element must not be modified.
+The @code{__outbufend} element must not be modified.
+
+@item int __is_last
+This element is nonzero if this conversion step is the last one. This
+information is necessary for the recursion. See the description of the
+conversion function internals below. This element must never be
+modified.
+
+@item int __invocation_counter
+The conversion function can use this element to see how many calls of
+the conversion function already happened. Some character sets require a
+certain prolog when generating output, and by comparing this value with
+zero, one can find out whether it is the first call and whether,
+therefore, the prolog should be emitted. This element must never be
+modified.
+
+@item int __internal_use
+This element is another one rarely used but needed in certain
+situations. It is assigned a nonzero value in case the conversion
+functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
+function is not used directly through the @code{iconv} interface).
+
+This sometimes makes a difference as it is expected that the
+@code{iconv} functions are used to translate entire texts while the
+@code{mbsrtowcs} functions are normally used only to convert single
+strings and might be used multiple times to convert entire texts.
+
+But in this situation we would have problems complying with some rules of
+the character set specification. Some character sets require a prolog
+which must appear exactly once for an entire text. If a number of
+@code{mbsrtowcs} calls are used to convert the text, only the first call
+must add the prolog. However, because there is no communication between the
+different calls of @code{mbsrtowcs}, the conversion functions have no
+possibility to find this out. The situation is different for sequences
+of @code{iconv} calls since the handle allows access to the needed
+information.
+
+The @code{int __internal_use} element is mostly used together with
+@code{__invocation_counter} as follows:
+
+@smallexample
+if (!data->__internal_use
+ && data->__invocation_counter == 0)
+ /* @r{Emit prolog.} */
+ ...
+@end smallexample
+
+This element must never be modified.
+
+@item mbstate_t *__statep
+The @code{__statep} element points to an object of type @code{mbstate_t}
+(@pxref{Keeping the state}). The conversion of a stateful character
+set must use the object pointed to by @code{__statep} to store
+information about the conversion state. The @code{__statep} element
+itself must never be modified.
+
+@item mbstate_t __state
+This element must @emph{never} be used directly. It is only part of
+this structure to have the needed space allocated.
+@end table
+@end deftp
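+
+The interplay of @code{__internal_use}, @code{__invocation_counter}, and
+@code{__outbuf} can be sketched in plain C as follows. This is only an
+illustration under stated assumptions: @code{struct step_data_sketch},
+@code{maybe_emit_prolog}, and @code{finish_call} are hypothetical names
+that mirror a few elements of @code{struct __gconv_step_data}; they are
+not part of @file{gconv.h}.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the elements of struct __gconv_step_data
   used in this sketch.  A real module uses the type from gconv.h.  */
struct step_data_sketch
{
  char *outbuf;            /* Mirrors __outbuf.  */
  char *outbufend;         /* Mirrors __outbufend.  */
  int is_last;             /* Mirrors __is_last.  */
  int invocation_counter;  /* Mirrors __invocation_counter.  */
  int internal_use;        /* Mirrors __internal_use.  */
};

/* Copy the prolog to DST if this is the first call made through the
   iconv interface.  Return the number of bytes written, or 0 if no
   prolog is needed.  */
static size_t
maybe_emit_prolog (const struct step_data_sketch *data, char *dst,
                   const char *prolog, size_t prolog_len)
{
  if (!data->internal_use && data->invocation_counter == 0)
    {
      /* The caller is assumed to have provided enough room.  */
      assert ((size_t) (data->outbufend - dst) >= prolog_len);
      memcpy (dst, prolog, prolog_len);
      return prolog_len;
    }
  return 0;
}

/* Bookkeeping at the end of one call.  OUTPTR points after the last
   byte written.  Only the last step publishes the amount of output.  */
static void
finish_call (struct step_data_sketch *data, char *outptr)
{
  if (data->is_last)
    data->outbuf = outptr;
  ++data->invocation_counter;
}
```

+In this sketch the end pointer is never written, and the counter is
+advanced only once per call, just before returning.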
+
+@subsubsection @code{iconv} module interfaces
+
+With the knowledge about the data structures we now can describe the
+conversion function itself. To understand the interface a bit of
+knowledge is necessary about the functionality in the C library that
+loads the objects with the conversions.
+
+It is often the case that one conversion is used more than once (i.e.,
+there are several @code{iconv_open} calls for the same set of character
+sets during one program run). The @code{mbsrtowcs} et al.@: functions in
+the GNU C library also use the @code{iconv} functionality, which
+increases the number of uses of the same functions even more.
+
+Because of this multiple use of conversions, the modules do not get
+loaded exclusively for one conversion. Instead a module once loaded can
+be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls
+at the same time. The splitting of the information between conversion-
+function-specific information and conversion data makes this possible.
+The last section showed the two data structures used to do this.
+
+This is of course also reflected in the interface and semantics of the
+functions that the modules must provide. There are three functions that
+must have the following names:
+
+@table @code
+@item gconv_init
+The @code{gconv_init} function initializes the conversion-function-specific
+data structure. This very same object is shared by all
+conversions that use this conversion and, therefore, state information
+about the conversion itself must not be stored here. If a module
+implements more than one conversion, the @code{gconv_init} function will
+be called multiple times.
+
+@item gconv_end
+The @code{gconv_end} function is responsible for freeing all resources
+allocated by the @code{gconv_init} function. If there is nothing to do,
+this function can be missing. Special care must be taken if the module
+implements more than one conversion and the @code{gconv_init} function
+does not allocate the same resources for all conversions.
+
+@item gconv
+This is the actual conversion function. It is called to convert one
+block of text. It gets passed the conversion step information
+initialized by @code{gconv_init} and the conversion data, specific to
+this use of the conversion functions.
+@end table
+
+There are three data types defined for the three module interface
+functions and these define the interface.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)
+This specifies the interface of the initialization function of the
+module. It is called exactly once for each conversion the module
+implements.
+
+As explained in the description of the @code{struct __gconv_step} data
+structure above, the initialization function has to initialize parts of
+it.
+
+@table @code
+@item __min_needed_from
+@itemx __max_needed_from
+@itemx __min_needed_to
+@itemx __max_needed_to
+These elements must be initialized to the exact numbers of the minimum
+and maximum number of bytes used by one character in the source and
+destination character sets, respectively. If the characters all have the
+same size, the minimum and maximum values are the same.
+
+@item __stateful
+This element must be initialized to a nonzero value if the source
+character set is stateful. Otherwise it must be zero.
+@end table
+
+If the initialization function needs to communicate some information
+to the conversion function, this communication can happen using the
+@code{__data} element of the @code{__gconv_step} structure. But since
+this data is shared by all the conversions, it must not be modified by
+the conversion function. The example below shows how this can be used.
+
+@smallexample
+#define MIN_NEEDED_FROM 1
+#define MAX_NEEDED_FROM 4
+#define MIN_NEEDED_TO 4
+#define MAX_NEEDED_TO 4
+
+int
+gconv_init (struct __gconv_step *step)
+@{
+ /* @r{Determine which direction.} */
+ struct iso2022jp_data *new_data;
+ enum direction dir = illegal_dir;
+ enum variant var = illegal_var;
+ int result;
+
+ if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
+ @{
+ dir = from_iso2022jp;
+ var = iso2022jp;
+ @}
+ else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
+ @{
+ dir = to_iso2022jp;
+ var = iso2022jp;
+ @}
+ else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
+ @{
+ dir = from_iso2022jp;
+ var = iso2022jp2;
+ @}
+ else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
+ @{
+ dir = to_iso2022jp;
+ var = iso2022jp2;
+ @}
+
+ result = __GCONV_NOCONV;
+ if (dir != illegal_dir)
+ @{
+ new_data = (struct iso2022jp_data *)
+ malloc (sizeof (struct iso2022jp_data));
+
+ result = __GCONV_NOMEM;
+ if (new_data != NULL)
+ @{
+ new_data->dir = dir;
+ new_data->var = var;
+ step->__data = new_data;
+
+ if (dir == from_iso2022jp)
+ @{
+ step->__min_needed_from = MIN_NEEDED_FROM;
+ step->__max_needed_from = MAX_NEEDED_FROM;
+ step->__min_needed_to = MIN_NEEDED_TO;
+ step->__max_needed_to = MAX_NEEDED_TO;
+ @}
+ else
+ @{
+ step->__min_needed_from = MIN_NEEDED_TO;
+ step->__max_needed_from = MAX_NEEDED_TO;
+ step->__min_needed_to = MIN_NEEDED_FROM;
+ step->__max_needed_to = MAX_NEEDED_FROM + 2;
+ @}
+
+ /* @r{Yes, this is a stateful encoding.} */
+ step->__stateful = 1;
+
+ result = __GCONV_OK;
+ @}
+ @}
+
+ return result;
+@}
+@end smallexample
+
+The function first checks which conversion is wanted. The module from
+which this function is taken implements four different conversions;
+which one is selected can be determined by comparing the names. The
+comparison should always be done without paying attention to the case.
+
+Next, a data structure, which contains the necessary information about
+which conversion is selected, is allocated. The data structure
+@code{struct iso2022jp_data} is locally defined since, outside the
+module, this data is not used at all. Please note that if all four
+conversions this module supports are requested, there are four data
+blocks.
+
+One interesting thing is the initialization of the @code{__min_} and
+@code{__max_} elements of the step data object. A single ISO-2022-JP
+character can consist of one to four bytes. Therefore the
+@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
+this way. The output is always the @code{INTERNAL} character set (aka
+UCS-4) and therefore each character consists of exactly four bytes. For
+the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
+account that escape sequences might be necessary to switch the character
+sets. Therefore the @code{__max_needed_to} element for this direction
+gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
+two bytes needed for the escape sequences to signal the switching. The
+asymmetry in the maximum values for the two directions can be explained
+easily: when reading ISO-2022-JP text, escape sequences can be handled
+alone (i.e., it is not necessary to process a real character since the
+effect of the escape sequence can be recorded in the state information).
+The situation is different for the other direction. Since it is in
+general not known which character comes next, one cannot emit escape
+sequences to change the state in advance. This means the escape
+sequences have to be emitted together with the next character.
+Therefore one needs more room than only for the character itself.
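+
+The reason the output direction needs extra room can be sketched with a
+toy stateful encoding. Everything in this sketch is an assumption made
+for illustration: the two character sets, the two-byte escape sequence,
+and the names @code{toy_put} and @code{TOY_MAX_NEEDED_TO} are
+hypothetical and do not describe ISO-2022-JP itself.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy encoding: two character sets, one byte per
   character, and a two-byte escape sequence to switch sets.  */
enum toy_set { TOY_SET_A, TOY_SET_B };

#define TOY_MAX_CHAR      1
#define TOY_ESC_LEN       2
#define TOY_MAX_NEEDED_TO (TOY_MAX_CHAR + TOY_ESC_LEN)

/* Write character C from set WANT to OUT.  If the current state
   differs, the escape sequence is written together with the
   character, so both must fit.  Return the number of bytes written,
   or 0 if there is not enough room; in that case neither the buffer
   nor the state is changed.  */
static size_t
toy_put (enum toy_set *state, enum toy_set want, char c,
         char *out, char *outend)
{
  assert (out <= outend);
  size_t need = (*state != want ? TOY_ESC_LEN : 0) + TOY_MAX_CHAR;

  if ((size_t) (outend - out) < need)
    return 0;

  if (*state != want)
    {
      *out++ = '\x1b';
      *out++ = want == TOY_SET_A ? 'A' : 'B';
      *state = want;
    }
  *out = c;
  return need;
}
```

+Because the escape sequence cannot be emitted ahead of time, the worst
+case for one output character in this toy encoding is
+@code{TOY_MAX_NEEDED_TO} bytes, not @code{TOY_MAX_CHAR}.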
+
+The possible return values of the initialization function are:
+
+@table @code
+@item __GCONV_OK
+The initialization succeeded.
+@item __GCONV_NOCONV
+The requested conversion is not supported in the module. This can
+happen if the @file{gconv-modules} file has errors.
+@item __GCONV_NOMEM
+Memory required to store additional information could not be allocated.
+@end table
+@end deftypevr
+
+The function called before the module is unloaded is significantly
+easier. It often has nothing at all to do, in which case it can be left
+out completely.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
+The task of this function is to free all resources allocated in the
+initialization function. Therefore only the @code{__data} element of
+the object pointed to by the argument is of interest. Continuing the
+example from the initialization function, the finalization function
+looks like this:
+
+@smallexample
+void
+gconv_end (struct __gconv_step *data)
+@{
+ free (data->__data);
+@}
+@end smallexample
+@end deftypevr
+
+The most important function is the conversion function itself, which can
+get quite complicated for complex character sets. But since this is not
+of interest here, we will only describe a possible skeleton for the
+conversion function.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
+The conversion function can be called for two basic reasons: to convert
+text or to reset the state. From the description of the @code{iconv}
+function it can be seen why the flushing mode is necessary. What mode
+is selected is determined by the sixth argument, an integer. This
+argument being nonzero means that flushing is selected.
+
+Common to both modes is where the output buffer can be found. The
+information about this buffer is stored in the conversion step data. A
+pointer to this information is passed as the second argument to this
+function. The description of the @code{struct __gconv_step_data}
+structure has more information on the conversion step data.
+
+@cindex stateful
+What has to be done for flushing depends on the source character set.
+If the source character set is not stateful, nothing has to be done.
+Otherwise the function has to emit a byte sequence to bring the state
+object into the initial state. Once all this has happened, the other
+conversion modules in the chain of conversions have to get the same
+chance. Whether another step follows can be determined from the
+@code{__is_last} element of the step data structure to which the second
+parameter points.
+
+The more interesting mode is when actual text has to be converted. The
+first step in this case is to convert as much text as possible from the
+input buffer and store the result in the output buffer. The start of the
+input buffer is determined by the third argument which is a pointer to a
+pointer variable referencing the beginning of the buffer. The fourth
+argument is a pointer to the byte right after the last byte in the buffer.
+
+The conversion has to be performed according to the current state if the
+character set is stateful. The state is stored in an object pointed to
+by the @code{__statep} element of the step data (second argument). Once
+either the input buffer is empty or the output buffer is full the
+conversion stops. At this point, the pointer variable referenced by the
+third parameter must point to the byte following the last processed
+byte (i.e., if all of the input is consumed, this pointer and the fourth
+parameter have the same value).
+
+What now happens depends on whether this step is the last one. If it is
+the last step, the only thing that has to be done is to update the
+@code{__outbuf} element of the step data structure to point after the
+last written byte. This update gives the caller the information on how
+much text is available in the output buffer. In addition, the variable
+pointed to by the fifth parameter, which is of type @code{size_t}, must
+be incremented by the number of characters (@emph{not bytes}) that were
+converted in a non-reversible way. Then, the function can return.
+
+In case the step is not the last one, the later conversion functions have
+to get a chance to do their work. Therefore, the appropriate conversion
+function has to be called. The information about the functions is
+stored in the conversion data structures, passed as the first parameter.
+This information and the step data are stored in arrays, so the next
+element in both cases can be found by simple pointer arithmetic:
+
+@smallexample
+int
+gconv (struct __gconv_step *step, struct __gconv_step_data *data,
+ const char **inbuf, const char *inbufend, size_t *written,
+ int do_flush)
+@{
+ struct __gconv_step *next_step = step + 1;
+ struct __gconv_step_data *next_data = data + 1;
+ ...
+@end smallexample
+
+The @code{next_step} pointer references the next step information and
+@code{next_data} the next data record. The call of the next function
+therefore will look similar to this:
+
+@smallexample
+ next_step->__fct (next_step, next_data, &outerr, outbuf,
+ written, 0)
+@end smallexample
+
+But this is not yet all. Once the function call returns the conversion
+function might have some more to do. If the return value of the function
+is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
+buffer. Unless the input buffer is empty, the conversion functions start
+all over again and process the rest of the input buffer. If the return
+value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
+to recover from this.
+
+A requirement for the conversion function is that the input buffer
+pointer (the third argument) always point to the last character that
+was put in converted form into the output buffer. This is trivially
+true after the conversion performed in the current step, but if the
+conversion functions deeper downstream stop prematurely, not all
+characters from the output buffer are consumed and, therefore, the input
+buffer pointers must be backed off to the right position.
+
+Correcting the input buffers is easy to do if the input and output
+character sets have a fixed width for all characters. In this situation
+we can compute how many characters are left in the output buffer and,
+therefore, can correct the input buffer pointer appropriately with a
+similar computation. Things get tricky if either character set
+has characters represented with variable length byte sequences, and it
+gets even more complicated if the conversion has to take care of the
+state. In these cases the conversion has to be performed once again, from
+the known state before the initial conversion (i.e., if necessary the
+state of the conversion has to be reset and the conversion loop has to be
+executed again). The difference now is that it is known how much output
+must be created, and the conversion can stop before converting the first
+unused character. Once this is done the input buffer pointers must be
+updated again and the function can return.
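+
+For the simple fixed-width case the correction described above might
+look like the following sketch. The function name
+@code{back_off_input} and its parameters are hypothetical; the sketch
+assumes that every character occupies exactly @code{in_width} bytes in
+the input and @code{out_width} bytes in the output.
+

```c
#include <assert.h>
#include <stddef.h>

/* Compute the corrected input pointer after a later step stopped
   early.  INPTR_AFTER is the input position after this step's
   conversion, OUT_PRODUCED_END points after the last byte this step
   produced, and OUT_CONSUMED_END points after the last byte the next
   step actually consumed.  Both character sets are assumed to be
   fixed width.  */
static const char *
back_off_input (const char *inptr_after, size_t in_width,
                const char *out_produced_end,
                const char *out_consumed_end, size_t out_width)
{
  assert (out_consumed_end <= out_produced_end);
  size_t unconsumed_bytes
    = (size_t) (out_produced_end - out_consumed_end);

  /* With fixed-width output a partial character cannot remain.  */
  assert (unconsumed_bytes % out_width == 0);

  /* One unconsumed output character corresponds to exactly one
     input character that has to be handed back.  */
  return inptr_after - (unconsumed_bytes / out_width) * in_width;
}
```

+With variable-width or stateful character sets no such arithmetic is
+possible, which is why the conversion has to be redone in those cases.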
+
+One final thing should be mentioned. If it is necessary for the
+conversion to know whether it is the first invocation (in case a prolog
+has to be emitted), the conversion function should increment the
+@code{__invocation_counter} element of the step data structure just
+before returning to the caller. See the description of the @code{struct
+__gconv_step_data} structure above for more information on how this can
+be used.
+
+The return value must be one of the following values:
+
+@table @code
+@item __GCONV_EMPTY_INPUT
+All input was consumed and there is room left in the output buffer.
+@item __GCONV_FULL_OUTPUT
+No more room in the output buffer. In case this is not the last step
+this value is propagated down from the call of the next conversion
+function in the chain.
+@item __GCONV_INCOMPLETE_INPUT
+The input buffer is not entirely empty since it contains an incomplete
+character sequence.
+@end table
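+
+How a conversion loop arrives at one of these three results can be
+sketched with a trivial fixed-width conversion. The sketch makes
+several assumptions: the names @code{widen_loop} and
+@code{enum sketch_status} are hypothetical stand-ins, and the
+conversion itself (widening 16-bit units in host byte order to 32-bit
+units) is chosen only because it needs no tables.
+

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for __GCONV_EMPTY_INPUT,
   __GCONV_FULL_OUTPUT, and __GCONV_INCOMPLETE_INPUT.  */
enum sketch_status
{
  SKETCH_EMPTY_INPUT,
  SKETCH_FULL_OUTPUT,
  SKETCH_INCOMPLETE_INPUT
};

/* Convert as many 2-byte units as possible into 4-byte units.  On
   return *INBUF points after the last processed input byte and
   *OUTBUF after the last written output byte.  */
static enum sketch_status
widen_loop (const char **inbuf, const char *inend,
            char **outbuf, char *outend)
{
  const char *in = *inbuf;
  char *out = *outbuf;
  enum sketch_status status;

  assert (in <= inend && out <= outend);

  for (;;)
    {
      if (in == inend)
        {
          status = SKETCH_EMPTY_INPUT;
          break;
        }
      if (inend - in < 2)
        {
          /* A partial character remains in the input.  */
          status = SKETCH_INCOMPLETE_INPUT;
          break;
        }
      if (outend - out < 4)
        {
          status = SKETCH_FULL_OUTPUT;
          break;
        }

      uint16_t narrow;
      memcpy (&narrow, in, 2);
      uint32_t wide = narrow;
      memcpy (out, &wide, 4);
      in += 2;
      out += 4;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

+A real conversion function would additionally hand the output to the
+next step and propagate that step's result, as described above.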
+
+The following example provides a framework for a conversion function.
+In case a new conversion has to be written the holes in this
+implementation have to be filled and that is it.
+
+@smallexample
+int
+gconv (struct __gconv_step *step, struct __gconv_step_data *data,
+ const char **inbuf, const char *inbufend, size_t *written,
+ int do_flush)
+@{
+ struct __gconv_step *next_step = step + 1;
+ struct __gconv_step_data *next_data = data + 1;
+ __gconv_fct fct = next_step->__fct;
+ int status;
+
+ /* @r{If the function is called with no input this means we have}
+ @r{to reset to the initial state. The possibly partly}
+ @r{converted input is dropped.} */
+ if (do_flush)
+ @{
+ status = __GCONV_OK;
+
+ /* @r{Possibly emit a byte sequence which puts the state object}
+ @r{into the initial state.} */
+
+ /* @r{Call the steps down the chain if there are any but only}
+ @r{if we successfully emitted the escape sequence.} */
+ if (status == __GCONV_OK && ! data->__is_last)
+ status = fct (next_step, next_data, NULL, NULL,
+ written, 1);
+ @}
+ else
+ @{
+ /* @r{We preserve the initial values of the pointer variables.} */
+ const char *inptr = *inbuf;
+ char *outbuf = data->__outbuf;
+ char *outend = data->__outbufend;
+ char *outptr;
+
+ do
+ @{
+ /* @r{Remember the start value for this round.} */
+ inptr = *inbuf;
+ /* @r{The outbuf buffer is empty.} */
+ outptr = outbuf;
+
+ /* @r{For stateful encodings the state must be saved here.} */
+
+ /* @r{Run the conversion loop. @code{status} is set}
+ @r{appropriately afterwards.} */
+
+ /* @r{If this is the last step, leave the loop. There is}
+ @r{nothing we can do.} */
+ if (data->__is_last)
+ @{
+ /* @r{Store information about how many bytes are}
+ @r{available.} */
+ data->__outbuf = outbuf;
+
+ /* @r{If any non-reversible conversions were performed,}
+ @r{add the number to @code{*written}.} */
+
+ break;
+ @}
+
+ /* @r{Write out all output which was produced.} */
+ if (outbuf > outptr)
+ @{
+ const char *outerr = data->__outbuf;
+ int result;
+
+ result = fct (next_step, next_data, &outerr,
+ outbuf, written, 0);
+
+ if (result != __GCONV_EMPTY_INPUT)
+ @{
+ if (outerr != outbuf)
+ @{
+ /* @r{Reset the input buffer pointer. We}
+ @r{document here the complex case.} */
+ size_t nstatus;
+
+ /* @r{Reload the pointers.} */
+ *inbuf = inptr;
+ outbuf = outptr;
+
+ /* @r{Possibly reset the state.} */
+
+ /* @r{Redo the conversion, but this time}
+ @r{the end of the output buffer is at}
+ @r{@code{outerr}.} */
+ @}
+
+ /* @r{Change the status.} */
+ status = result;
+ @}
+ else
+ /* @r{All the output is consumed, we can make}
+ @r{ another run if everything was ok.} */
+ if (status == __GCONV_FULL_OUTPUT)
+ status = __GCONV_OK;
+ @}
+ @}
+ while (status == __GCONV_OK);
+
+ /* @r{We finished one use of this step.} */
+ ++data->__invocation_counter;
+ @}
+
+ return status;
+@}
+@end smallexample
+@end deftypevr
+
+This information should be sufficient to write new modules. Anybody
+doing so should also take a look at the available source code in the GNU
+C library sources. It contains many examples of working and optimized
+modules.
+
+@c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation
\ No newline at end of file