path: root/manual/charset.texi
author Ulrich Drepper <drepper@redhat.com> 1999-01-11 20:13:43 +0000
committer Ulrich Drepper <drepper@redhat.com> 1999-01-11 20:13:43 +0000
commit 390955cbdeb674bead490fc3f74a8a0893ea83cf (patch)
tree 2900fdc697f52133f633c09edbbe712882736bf0 /manual/charset.texi
parent 68ef28edc2f1bafa417da1ac8d35a3bf2a1b565b (diff)
Update.
1999-01-11 Ulrich Drepper <drepper@cygnus.com> * ctype/Versions [GLIBC_2.0]: Export __ctype32_b. * include/wctype.h: Declare __iswctype. * stdio-common/vfscanf.c (__vfscanf): Use __iswspace instead of iswspace. * wctype/Makefile (routines): Add wcextra_l. * wctype/wcextra.c (iswblank): Implement function here and don't use __iswctype. (__iswblank_l): Move definition to... * wctype/wcextra_l.c: ...here. New file. * wctype/wcfuncs.c: Really implement functions and don't call __iswctype or __towctrans. * wctype/wctype.h: Change isw* and tow* macros. Don't call __iswctype or __towctrans. Instead optimize constant argument case. * iconv/gconv.h: Fix typos. * iconv/skeleton.c: Fix typos. Optimize init function a bit. Correctly emit escape sequence to return to initial state in conversion function. * iconvdata/iso-2022-jp.c (gconv_init): Correctly initialize max_needed_to element. * manual/mbyte.texi: Removed. This is now described in charset.texi. * manual/charset.texi: New file. * manual/Makefile (chapters): Replace mbyte by charset. * manual/ctype.texi: Document wide character functions. * manual/intro.texi: Fix reference to mbyte chapter. * manual/lang.texi: Likewise. * manual/locale.texi: Likewise. * manual/stdio.texi: Likewise. * manual/string.texi: Fix @node line for new charset chapter. * manual/libc.texinfo (UPDATED): Updated. Also update copyright years. * manual/memory.texi (savestring): Optimize code to give a good example. * manual/filesys.texi: Fix wording. Patches by Jim Meyering. * nscd/nscd_getgr_r.c: Include stdint.h to get uintptr_t definition. * nscd/nscd_getpw_r.c: Likewise. * nscd/nscd_gethst_r.c: Likewise. * stdlib/stdtold_l.c: Always include xlocale.h. 1999-01-11 Geoffrey Keating <geoffk@ozemail.com.au> * stdlib/fpioconst.h (LDBL_MAX_10_EXP_LOG): Define to be same as DBL_MAX_10_EXP_LOG if there is no long double. (_fpioconst_pow10): Always use size as LDBL_MAX_10_EXP_LOG to match printf_fp.c. 
1999-01-10 Andreas Jaeger <aj@arthur.rhein-neckar.de> * timezone/Makefile ($(testdata)/GB): Changed to ... ($(testdata)/Europe/London): ... for tst-timezone test. ($(objpfx)tst-timezone.out): Change GB to Europe/London. * timezone/tst-timezone.c (main): Enable DST switching test, change GB to Europe/London. 1999-01-10 Philip Blundell <philb@gnu.org> * socket/Makefile (headers): Remove bits/sockunion.h. 1999-01-09 Philip Blundell <philb@gnu.org> * socket/sys/socket.h: Don't include <bits/sockunion.h>. * sysdeps/generic/bits/sockunion.h: Deleted. * sysdeps/unix/sysv/linux/bits/sockunion.h: Likewise. 1999-01-08 H.J. Lu <hjl@gnu.org> * io/fts.c (fts_close): Don't access memory after having it freed.
Diffstat (limited to 'manual/charset.texi')
-rw-r--r-- manual/charset.texi 2846
1 files changed, 2846 insertions, 0 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
new file mode 100644
index 0000000000..6179128e3c
--- /dev/null
+++ b/manual/charset.texi
@@ -0,0 +1,2846 @@
+@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computing had only six, seven,
+or eight bits for each character. In no case were more bits used than
+would fit into one byte, which nowadays is almost exclusively @w{8 bits}
+wide. This of course leads to problems as soon as the characters needed
+at one time cannot all be represented by the at most 256 available
+values. This chapter describes the functionality which was added to the
+C library to overcome this problem.
+
+@menu
+* Extended Char Intro:: Introduction to Extended Characters.
+* Charset Function Overview:: Overview about Character Handling
+ Functions.
+* Restartable multibyte conversion:: Restartable multibyte conversion
+ Functions.
+* Non-reentrant Conversion:: Non-reentrant Conversion Function.
+* Generic Charset Conversion:: Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+To overcome the limitations of character sets with a 1:1 relation
+between bytes and characters people came up with a variety of solutions.
+The remainder of this section gives a few examples to help in
+understanding the design decisions made while developing the
+functionality of the @w{C library} to support them.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation. @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through whatever communication channel.
+
+Traditionally there was no difference between the two representations.
+It was equally comfortable and useful to use the same one-byte
+representation internally and externally. This changes with more and
+larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling texts which are externally encoded using different character
+sets. Assume a program which reads two texts and compares them using
+some metric. The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are certainly
+not enough anymore. So the smallest entity will have to grow: @dfn{wide
+characters} will be used. Here instead of one byte one uses two or four
+(three are not good to address in memory and more than four bytes seem
+not to be necessary).
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+there exists a completely new family of functions which can handle texts
+of this kind in memory. The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}.
+The former is a subset of the latter and is used when wide characters
+are chosen to be 2 bytes (@math{= 16} bits) wide. The standard names of
+the
+@cindex UCS2
+@cindex UCS4
+encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
+(@math{= 32} bits).
+
+To represent wide characters the @code{char} type is certainly not
+suitable. For this reason the @w{ISO C} standard introduces a new type
+which is designed to keep one character of a wide character string. To
+maintain the similarity there is also a type corresponding to @code{int}
+for those functions which take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+I.e., arrays of objects of this type are the equivalent of @code{char[]}
+for multibyte character strings. The type is defined in @file{stddef.h}.
+
+The @w{ISO C89} standard, where this type was introduced, does not say
+anything specific about the representation. It only requires that this
+type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as
+@code{char}. This might make sense for embedded systems.
+
+But for GNU systems this type is always 32 bits wide. It is therefore
+capable of representing all UCS4 values and so covers all of @w{ISO
+10646}. Some Unix systems define @code{wchar_t} as a 16 bit type and
+thereby follow Unicode very strictly. This is perfectly fine with the
+standard but it also means that to represent all characters from Unicode
+and @w{ISO 10646} one has to use surrogate characters, which is in fact
+a multi-wide-character encoding. But this contradicts the purpose of
+the @code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables which
+contain a single wide character. As the name already suggests it is the
+equivalent to @code{int} when using the normal @code{char} strings. The
+types @code{wchar_t} and @code{wint_t} often have the same
+representation if they are 32 bits wide, but if @code{wchar_t} is
+defined as @code{char} the type @code{wint_t} must be defined as
+@code{int} due to the parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and got introduced in the second
+amendment to @w{ISO C89}.
+@end deftp
+
+As for the @code{char} data type, there also exist macros specifying
+the minimum and maximum value representable in an object of type
+@code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wchar_t}.
+
+This macro got introduced in the second amendment to @w{ISO C89}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wchar_t}.
+
+This macro got introduced in the second amendment to @w{ISO C89}.
+@end deftypevr
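+
+As a minimal sketch (not part of the library), the limits can be used
+to check at run time whether @code{wchar_t} is wide enough for every
+UCS4 value. The helper name @code{wchar_fits_ucs4} is hypothetical and
+chosen only for this illustration.
+

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: return nonzero if wchar_t on this platform can
   hold every UCS4 value, i.e. every value up to 0x7fffffff.  */
int
wchar_fits_ucs4 (void)
{
  return (uintmax_t) WCHAR_MAX >= 0x7fffffffu;
}
```

+
+On a system with a 32 bit @code{wchar_t} a call to
+@code{wchar_fits_ucs4 ()} is expected to return nonzero, while a system
+with a 16 bit @code{wchar_t} would return zero.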
+
+Another special wide character value is the equivalent to @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
+
+@smallexample
+@{
+  int c;
+  ...
+  while ((c = getc (fp)) >= 0)
+    ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to explicitly use @code{WEOF} when wide characters
+are used.
+
+@smallexample
+@{
+  wint_t c;
+  ...
+  while ((c = getwc (fp)) != WEOF)
+    ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in the second amendment to @w{ISO C89} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to storing
+and transmitting them. Since a single wide character consists of more
+than one byte they are affected by byte-ordering. I.e., machines with
+different endiannesses would see different values when accessing the
+same data. This also applies to communication protocols, which are all
+byte-based, so the sender has to decide how to split the wide character
+into bytes. A last but not least important point is that wide
+characters often require more storage space than a customized byte
+oriented character set.
+
+@cindex multibyte character
+This is why most of the time an external encoding which is different
+from the internal encoding is used if the latter is UCS2 or UCS4. The
+external encoding is byte-based and can be chosen appropriately for the
+environment and for the texts to be handled. There exists a variety of
+different character sets which can be used, too many to be handled
+completely here. We restrict ourselves here to a description of the
+major groups. All of the ASCII-based character sets fulfill one
+requirement: they are ``filesystem safe''. This means that the
+character @code{'/'} is used in the encoding @emph{only} to represent
+itself. Things are a bit different for character sets like EBCDIC but
+if the operating system does not understand EBCDIC directly the
+parameters to system calls have to be converted first anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are one-byte character sets. There can be
+only up to 256 characters (for @w{8 bit} character sets) which is not
+sufficient to cover all languages but might be sufficient to handle a
+specific text. Another reason to choose this is because of constraints
+from interaction with other programs.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte. This is achieved by associating a state with the text. Embedded
+in the text can be characters which can be used to change the state.
+Each byte in the text might have a different interpretation in each
+state. The state might even influence whether a given byte stands for a
+character on its own or whether it has to be combined with some more
+bytes.
+
+@cindex EUC
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes which cover more than the next character. This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly. Examples of
+character sets using this policy are the various EUC character sets
+(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or SJIS (Shift JIS, a Japanese encoding).
+
+But there are also character sets using a state which is valid for more
+than one character and has to be changed by another byte sequence.
+Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using the
+Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
+representing characters like the acute accent do not produce output on
+their own. One has to combine them with other characters. For example,
+one writes the byte sequence @code{0xc2 0x61} (non-spacing acute accent,
+followed by lower-case `a') to get the ``small a with acute'' character.
+To get the acute accent character on its own one has to write
+@code{0xc2 0x20} (the non-spacing acute followed by a space).
+
+This type of character set is quite frequently used in embedded
+systems such as video text.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally
+it is often also sufficient to simply use an encoding different than
+UCS2/UCS4. The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8. This encoding is able to represent all 31 bits of
+@w{ISO 10646} in a byte string of length one to six.
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646} such as UTF-7
+but UTF-8 is today the only encoding which should be used. In fact,
+UTF-8 will hopefully soon be the only external encoding which has to be
+supported. It proves to be universally usable and the only disadvantage
+is that it favors Latin languages very much by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary if using a specific character set for these scripts. But
+with methods like the Unicode compression scheme one can overcome these
+problems and the ever growing memory and storage capacities do the rest.
+@end itemize
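+
+To make the UTF-8 scheme described above concrete, here is a minimal
+sketch of an encoder for the 31 bit form with sequences of one to six
+bytes. The function name @code{ucs4_to_utf8} is hypothetical and not
+part of the C library; the later definition of UTF-8 restricts
+sequences to at most four bytes, which this sketch deliberately does
+not enforce.
+

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the UCS4 value C as UTF-8 into OUT and
   return the number of bytes written (1 to 6).  Returns 0 if C does
   not fit into 31 bits.  */
size_t
ucs4_to_utf8 (uint32_t c, unsigned char out[6])
{
  size_t len, i;

  if (c < 0x80)
    {
      /* ASCII characters stand for themselves.  */
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c > 0x7fffffff)
    return 0;

  if (c < 0x800)
    len = 2;
  else if (c < 0x10000)
    len = 3;
  else if (c < 0x200000)
    len = 4;
  else if (c < 0x4000000)
    len = 5;
  else
    len = 6;

  /* Fill the continuation bytes (10xxxxxx) from the end.  */
  for (i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (c & 0x3f));
      c >>= 6;
    }

  /* The first byte has LEN leading one bits followed by a zero bit
     and the remaining high bits of the character.  */
  out[0] = (unsigned char) ((0xff00 >> len) | c);
  return len;
}
```

+
+With this sketch the character U+00E9 (``small e with acute'') is
+encoded as the two bytes @code{0xc3 0xa9}, while every ASCII character
+remains a single byte.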
+
+The question remaining now is: how to select the character set or
+encoding to use. The answer is mostly: you cannot decide about it
+yourself, it is decided by the developers of the system or the majority
+of the users. Since the goal is interoperability one has to use
+whatever the other people one works with use. If there are no
+constraints the selection is based on the requirements the expected
+circle of users will have. I.e., if a project is expected to only be
+used in, say, Russia it is fine to use KOI8-R or a similar character
+set. But if at the same time people from, say, Greece are participating
+one should use a character set which allows all people to collaborate.
+
+One piece of general advice here could be: go with the most general
+character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
+and problems about users not being able to use their own language
+adequately are a thing of the past.
+
+One final comment about the choice of the wide character representation
+is necessary at this point. We have said above that the natural choice
+is using Unicode or @w{ISO 10646}. This is not specified in any
+standard, though. The @w{ISO C} standard does not say anything
+specific about the @code{wchar_t} type. There might be systems where
+the developers decided differently. Therefore one should as much as
+possible avoid making assumptions about the wide character
+representation although GNU systems will always work as described above. If the
+programmer uses only the functions provided by the C library to handle
+wide character strings there should not be any compatibility problems
+with other systems.
+
+@node Charset Function Overview
+@section Overview about Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two
+families to handle character set conversion. One function family
+is specified in the @w{ISO C} standard and therefore is portable even
+beyond the Unix world.
+
+The most commonly known set of functions, coming from the @w{ISO C89}
+standard, is unfortunately the least useful one. In fact, these
+functions should be avoided whenever possible, especially when
+developing libraries (as opposed to applications).
+
+The second family of functions got introduced in the early Unix standards
+(XPG2) and is still part of the latest and greatest Unix standard:
+@w{Unix 98}. It is also the most powerful and useful set of functions.
+But we will start with the functions defined in the second amendment to
+@w{ISO C89}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings. There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions. Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument. I.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place.
+The GNU C library contains some extensions to the standard which allow
+specifying a size but basically they also expect terminated strings.
+@end itemize
+
+Despite these limitations the @w{ISO C} functions can very well be used
+in many contexts. In graphical user interfaces, for instance, it is not
+uncommon to have functions which require text to be displayed in a wide
+character string if it is not simple ASCII. The text itself might come
+from a file with translations and of course the user should decide about
+the current locale which determines the translation and therefore also
+the external encoding used. In such a situation (and many others) the
+functions described here are perfect. If more freedom while performing
+the conversion is necessary take a look at the @code{iconv} functions
+(@pxref{Generic Charset Conversion}).
+
+@menu
+* Selecting the Conversion:: Selecting the conversion and its properties.
+* Keeping the state:: Representing the state of the conversion.
+* Converting a Character:: Converting Single Characters.
+* Converting Strings:: Converting Multibyte and Wide Character
+ Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides about the conversion which is performed
+by the functions we are about to describe. Each locale uses its own
+character set (given as an argument to @code{localedef}) and this is the
+one assumed as the external multibyte encoding. The wide character set
+always is UCS4. So we can see here already where the limitations of
+these conversion functions are.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes which can be necessary to represent one character. This
+information is quite important when writing code which uses the
+conversion functions, as the examples below show.
+The @w{ISO C} standard defines two macros which provide this information.
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+This macro specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales. It is
+a compile-time constant and it is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}. Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant and in
+fact, in the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strict @w{ISO C89} compilers
+do not allow variable length array definitions but still it is desirable
+to avoid dynamic allocation. This incomplete piece of code shows the
+problem:
+
+@smallexample
+@{
+  char buf[MB_LEN_MAX];
+  ssize_t len = 0;
+
+  while (! feof (fp))
+    @{
+      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+      /* @r{... process} buf @r{and set} used @r{to the number of bytes consumed} */
+      memmove (buf, &buf[used], len - used);
+      len -= used;
+    @}
+@}
+@end smallexample
+
+The code in the inner loop is expected to have always enough bytes in
+the array @var{buf} to convert one multibyte character. The array
+@var{buf} has to be sized statically since many compilers do not allow a
+variable size. The @code{fread} call makes sure that always
+@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it is no
+problem if @code{MB_CUR_MAX} is not a compile-time constant.
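+
+The relation between the two macros can also be checked at run time.
+The following is only a sketch; the helper name
+@code{mb_limits_consistent} is hypothetical and not part of the
+library.
+

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: return nonzero if a buffer of MB_LEN_MAX bytes
   is large enough for any single multibyte character of the current
   locale.  MB_CUR_MAX is evaluated at run time, so it is copied into a
   variable rather than used as an array size.  */
int
mb_limits_consistent (void)
{
  size_t cur = MB_CUR_MAX;
  return cur >= 1 && cur <= MB_LEN_MAX;
}
```

+
+A conforming implementation is expected to return nonzero here for
+every locale, which is what makes the statically sized array in the
+example above safe.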
+
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding. I.e., the encoded values depend in
+some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h}. It got introduced in the second
+amendment to @w{ISO C89}.
+@end deftp
+
+To use objects of this type the programmer has to define such objects
+(normally as local variables on the stack) and pass a pointer to the
+object to the conversion functions. This way the conversion function
+can update the object if the current multibyte character set is
+stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state. The rules are that the object should always
+represent the initial state before the first use and this is achieved by
+clearing the whole variable with code such as follows:
+
+@smallexample
+@{
+ mbstate_t state;
+ memset (&state, '\0', sizeof (state));
+ /* @r{from now on @var{state} can be used.} */
+ ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether current state corresponds to the initial
+state. This is necessary, for example, to decide whether or not to emit
+escape sequences to set the state to the initial state at certain
+sequence points. Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+This function determines whether the state object pointed to by @var{ps}
+is in the initial state or not. If @var{ps} is a null pointer or the
+object is in the initial state the return value is nonzero. Otherwise
+it is zero.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C89} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Code using this function often looks similar to this:
+
+@smallexample
+@{
+ mbstate_t state;
+ memset (&state, '\0', sizeof (state));
+ /* @r{Use @var{state}.} */
+ ...
+ if (! mbsinit (&state))
+ @{
+ /* @r{Emit code to return to initial state.} */
+ fputs ("@r{whatever needed}", fp);
+ @}
+ ...
+@}
+@end smallexample
+
+@node Converting a Character
+@subsection Converting Single Characters
+
+The most fundamental of the conversion functions are those dealing with
+single characters. Please note that this does not always mean single
+bytes. But since there is very often a subset of the multibyte
+character set which consists of single byte sequences there are
+functions to help with converting bytes. One very important and often
+applicable scenario is where ASCII is a subpart of the multibyte
+character set. I.e., all ASCII characters stand for themselves and all
+other characters have at least a first byte which is beyond the range
+@math{0} to @math{127}.
+
+@comment wchar.h
+@comment ISO
+@deftypefun wint_t btowc (int @var{c})
+The @code{btowc} function (``byte to wide character'') converts a valid
+single byte character in the initial shift state into the wide character
+equivalent using the conversion rules from the currently selected locale
+of the @code{LC_CTYPE} category.
+
+If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
+character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
+
+Please note the restriction of @var{c} being tested for validity only in
+the initial shift state. There is no @code{mbstate_t} object used from
+which the state information is taken and the function also does not use
+any static state.
+
+@pindex wchar.h
+This function was introduced in the second amendment of @w{ISO C89} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Despite the limitation that the single byte value always is interpreted
+in the initial state this function is actually useful most of the time.
+Most character sets are either entirely single-byte character sets or
+they are extensions to ASCII. So it is possible to write code like this
+(not that this specific example is useful):
+
+@smallexample
+wchar_t *
+itow (unsigned long int val)
+@{
+ static wchar_t buf[30];
+ wchar_t *wcp = &buf[29];
+ *wcp = L'\0';
+ while (val != 0)
+ @{
+ *--wcp = btowc ('0' + val % 10);
+ val /= 10;
+ @}
+ if (wcp == &buf[29])
+ *--wcp = btowc ('0');
+ return wcp;
+@}
+@end smallexample
+
+The question is why it is necessary to use such a complicated
+implementation and not simply cast @code{'0' + val % 10} to a wide
+character. The answer is that there is no guarantee that the compiler
+knows about the wide character set used at runtime. Even if on some
+systems the wide character equivalent of a given single-byte character
+is simply that character cast to @code{wchar_t}, there is no guarantee
+that this is the case everywhere.
+
+There also is a function for the conversion in the other direction.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int wctob (wint_t @var{c})
+The @code{wctob} function (``wide character to byte'') takes as the
+parameter a valid wide character. If the multibyte representation for
+this character in the initial state is exactly one byte long the return
+value of this function is this character. Otherwise the return value is
+@code{EOF}.
+
+@pindex wchar.h
+This function was introduced in the second amendment of @w{ISO C89} and
+is declared in @file{wchar.h}.
+@end deftypefun
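+
+The two functions can be combined to test whether a byte is a complete
+single-byte character that survives a round trip through the wide
+character representation. This is only a sketch; the helper name
+@code{byte_round_trips} is hypothetical.
+

```c
#include <wchar.h>

/* Hypothetical helper: return nonzero if the byte C is a valid
   single-byte character in the initial shift state of the current
   locale and converts back to the same byte.  */
int
byte_round_trips (unsigned char c)
{
  wint_t wc = btowc (c);
  return wc != WEOF && wctob (wc) == (int) c;
}
```

+
+For ASCII letters and digits such as @code{'A'} or @code{'0'} the
+function is expected to return nonzero in the locales commonly used on
+GNU systems. For a byte which only starts a longer multibyte sequence
+@code{btowc} returns @code{WEOF} and the helper returns zero.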
+
+There are more general functions to convert single characters from
+multibyte representation to wide characters and vice versa. These
+functions pose no limit on the length of the multibyte representation
+and they also do not require it to be in the initial state.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
+@cindex stateful
+The @code{mbrtowc} function (``multibyte restartable to wide
+character'') converts the next multibyte character in the string pointed
+to by @var{s} into a wide character and stores it in the wide character
+string pointed to by @var{pwc}. The conversion is performed according
+to the locale currently selected for the @code{LC_CTYPE} category. If
+the character set for the locale is stateful the multibyte string is
+interpreted in the state represented by the object pointed to by
+@var{ps}. If @var{ps} is a null pointer a static, internal state
+variable used only by the @code{mbrtowc} function is used.
+
+If the next multibyte character corresponds to the NUL wide character
+the return value of the function is @math{0} and the state object is
+afterwards in the initial state. If the next @var{n} or fewer bytes
+form a correct multibyte character the return value is the number of
+bytes starting from @var{s} which form the multibyte character. The
+conversion state is updated according to the bytes consumed in the
+conversion. In both cases the wide character (either the @code{L'\0'}
+or the one found in the conversion) is stored in the string pointed to
+by @var{pwc} if and only if @var{pwc} is not null.
+
+If the first @var{n} bytes of the multibyte string possibly form a valid
+multibyte character but there are more than @var{n} bytes needed to
+complete it the return value of the function is @code{(size_t) -2} and
+no value is stored. Please note that this can happen even if @var{n}
+has a value greater than or equal to @code{MB_CUR_MAX} since the input
+might contain redundant shift sequences.
+
+If the first @var{n} bytes of the multibyte string cannot possibly
+form a valid multibyte character also no value is stored, the global
+variable @code{errno} is set to the value @code{EILSEQ} and the function
+returns @code{(size_t) -1}. The conversion state is afterwards undefined.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C89} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Using this function is straightforward. A function which copies a
+multibyte string into a wide character string while at the same time
+converting all lowercase characters into uppercase could look like this
+(this is not the final version, just an example; it has no error
+checking and sometimes leaks memory):
+
+@smallexample
+wchar_t *
+mbstouwcs (const char *s)
+@{
+  /* Include the terminating NUL byte in the conversion. */
+  size_t len = strlen (s) + 1;
+  wchar_t *result = malloc (len * sizeof (wchar_t));
+  wchar_t *wcp = result;
+  wchar_t tmp[1];
+  mbstate_t state;
+  size_t nbytes;
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* Invalid input string. */
+        return NULL;
+      *wcp++ = towupper (tmp[0]);
+      len -= nbytes;
+      s += nbytes;
+    @}
+  /* Terminate the wide character string. */
+  *wcp = L'\0';
+  return result;
+@}
+@end smallexample
+
+The use of @code{mbrtowc} should be clear. A single wide character is
+stored in @code{@var{tmp}[0]} and the number of consumed bytes is stored
+in the variable @var{nbytes}. In case the conversion was successful
+the uppercase variant of the wide character is stored in the
+@var{result} array and the pointer to the input string and the number of
+available bytes are adjusted.
+
+The only non-obvious thing about the function might be the way memory is
+allocated for the result. The above code uses the fact that there can
+never be more wide characters in the converted results than there are
+bytes in the multibyte input string. This method yields a
+pessimistic guess about the size of the result and if many wide
+character strings have to be constructed this way or the strings are
+long, the extra memory required to store the wide character strings
+might be significant. It would of course be possible to resize the
+allocated memory block to the correct size before returning it. A
+better solution might be to allocate just the right amount of space for
+the result right away. Unfortunately there is no function to compute
+the length of the wide character string directly from the multibyte
+string. But there is a function which does part of the work.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
+The @code{mbrlen} function (``multibyte restartable length'') computes
+the number of bytes, out of at most @var{n} bytes starting at @var{s},
+which form the next valid and complete multibyte character.
+
+If the next multibyte character corresponds to the NUL wide character
+the return value is @math{0}. If the next @var{n} bytes form a valid
+multibyte character the number of bytes belonging to this multibyte
+character byte sequence is returned.
+
+If the first @var{n} bytes possibly form a valid multibyte
+character but it is incomplete, the return value is @code{(size_t) -2}.
+Otherwise the multibyte character sequence is invalid and the return
+value is @code{(size_t) -1}.
+
+The multibyte sequence is interpreted in the state represented by the
+object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
+object local to @code{mbrlen} is used.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C89} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+The attentive reader will of course note that @code{mbrlen} can be
+implemented as
+
+@smallexample
+mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
+@end smallexample
+
+This is true and in fact is mentioned in the official specification.
+Now, how can this function be used to determine the length of the wide
+character string created from a multibyte character string? It is not
+directly usable but we can define a function @code{mbslen} using it:
+
+@smallexample
+size_t
+mbslen (const char *s)
+@{
+ mbstate_t state;
+ size_t result = 0;
+ size_t nbytes;
+ memset (&state, '\0', sizeof (state));
+ while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
+ @{
+ if (nbytes >= (size_t) -2)
+ /* @r{Something is wrong.} */
+ return (size_t) -1;
+ s += nbytes;
+ ++result;
+ @}
+ return result;
+@}
+@end smallexample
+
+This function simply calls @code{mbrlen} for each multibyte character
+in the string and counts the number of function calls. Please note that
+we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
+call. This is OK since a) this value is at least as large as the length
+of the longest multibyte character sequence and b) we know that the
+string @var{s} ends with a NUL byte which cannot be part of any other
+multibyte character sequence but the one representing the NUL wide
+character. Therefore the @code{mbrlen} function will never read invalid
+memory.
+
+Now that this function is available (just to make this clear, this
+function is @emph{not} part of the GNU C library) we can compute the
+number of bytes required to store the wide character string converted
+from the multibyte character string @var{s} using
+
+@smallexample
+wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
+@end smallexample
+
+Please note that the @code{mbslen} function is quite inefficient. An
+implementation of @code{mbstouwcs} using @code{mbslen} would
+have to perform the conversion of the multibyte character input string
+twice and this conversion might be quite expensive. So it is necessary
+to think about the consequences of using the easier but imprecise method
+before doing the work twice.
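+
+To make the cost of the exact method concrete, here is a sketch of a
+conversion which first counts the characters and then converts. The
+names @code{mbslen_sketch} and @code{mbstowcs_exact} are hypothetical;
+the counting loop is the one shown above under a different name, and
+the second pass uses @code{mbsrtowcs}, which is only one of several
+possible ways to do the actual conversion.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* The counting loop from the text, renamed (hypothetical name) so it
   cannot clash with a function of the same name in another library.  */
static size_t
mbslen_sketch (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        /* Invalid or incomplete sequence.  */
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}

/* Hypothetical helper: allocate exactly as many wide characters as
   the input needs.  The input is scanned twice.  */
wchar_t *
mbstowcs_exact (const char *s)
{
  size_t nchars = mbslen_sketch (s);   /* First pass: count.  */
  const char *src = s;
  wchar_t *result;
  mbstate_t state;

  if (nchars == (size_t) -1)
    return NULL;
  result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert, including the terminating NUL.  */
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, nchars + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

+Both passes decode every multibyte character, which is the doubled
+work the paragraph above warns about.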
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
+The @code{wcrtomb} function (``wide character restartable to
+multibyte'') converts a single wide character into a multibyte string
+corresponding to that wide character.
+
+If @var{s} is a null pointer the function resets the state stored in the
+object pointed to by @var{ps} to the initial state. This can also be
+achieved by a call like this:
+
+@smallexample
+wcrtomb (temp_buf, L'\0', ps)
+@end smallexample
+
+@noindent
+since when @var{s} is a null pointer @code{wcrtomb} performs as if it
+writes into an internal buffer which is guaranteed to be large enough.
+
+If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
+necessary, a shift sequence to get the state @var{ps} into the initial
+state, followed by a single NUL byte, which is stored in the string
+@var{s}.
+
+Otherwise a byte sequence (possibly including shift sequences) is
+written into the string @var{s}. This of course only happens if
+@var{wc} is a valid wide character, i.e., it has a multibyte
+representation in the character set selected by the locale of the
+@code{LC_CTYPE} category. If @var{wc} is not a valid wide character,
+nothing is stored in the string @var{s}, @code{errno} is set to
+@code{EILSEQ}, the conversion state in @var{ps} is undefined and the
+return value is @code{(size_t) -1}.
+
+If no error occurred the function returns the number of bytes stored in
+the string @var{s}. This includes all bytes representing shift
+sequences.
+
+One word about the interface of the function: there is no parameter
+specifying the length of the array @var{s}. Instead the function
+assumes that there are at least @code{MB_CUR_MAX} bytes available since
+this is the maximum length of any byte sequence representing a single
+character. So the caller has to make sure that there is enough space
+available, otherwise buffer overruns can occur.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Using this function is as easy as using @code{mbrtowc}. The following
+example appends a wide character string to a multibyte character string.
+Again, the code is not really useful, it is simply here to demonstrate
+the use and some problems.
+
+@smallexample
+char *
+mbscatwc (char *s, size_t len, const wchar_t *ws)
+@{
+ mbstate_t state;
+ char *wp = strchr (s, '\0');
+ len -= wp - s;
+ memset (&state, '\0', sizeof (state));
+ do
+ @{
+ size_t nbytes;
+ if (len < MB_CUR_MAX)
+ @{
+ /* @r{We cannot guarantee that the next}
+ @r{character fits into the buffer, so}
+ @r{return an error.} */
+ errno = E2BIG;
+ return NULL;
+ @}
+ nbytes = wcrtomb (wp, *ws, &state);
+ if (nbytes == (size_t) -1)
+ /* @r{Error in the conversion.} */
+ return NULL;
+ len -= nbytes;
+ wp += nbytes;
+ @}
+ while (*ws++ != L'\0');
+ return s;
+@}
+@end smallexample
+
+First the function has to find the end of the string currently in the
+array @var{s}. The @code{strchr} call does this very efficiently since a
+requirement for multibyte character representations is that the NUL byte
+never is used except to represent itself (and in this context, the end
+of the string).
+
+After initializing the state object the loop is entered where the first
+task is to make sure there is enough room in the array @var{s}. We
+abort if there are not at least @code{MB_CUR_MAX} bytes available. This
+is not always optimal but we have no other choice. We might have fewer
+than @code{MB_CUR_MAX} bytes available but the next multibyte character
+might also be only one byte long. At the time the @code{wcrtomb} call
+returns it is too late to decide whether the buffer was large enough or
+not. If this solution is really unsuitable there is a very slow but
+more accurate solution.
+
+@smallexample
+  ...
+  if (len < MB_CUR_MAX)
+    @{
+      mbstate_t temp_state;
+      char temp_buf[MB_LEN_MAX];
+      memcpy (&temp_state, &state, sizeof (state));
+      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
+        @{
+          /* @r{We cannot guarantee that the next}
+             @r{character fits into the buffer, so}
+             @r{return an error.} */
+          errno = E2BIG;
+          return NULL;
+        @}
+    @}
+  ...
+@end smallexample
+
+Here we perform the conversion into a temporary buffer, which cannot
+overflow, so that we are afterwards in the position to make an exact
+decision about the buffer size. Please note that the destination
+buffer in the new @code{wcrtomb} call must not be a null pointer: with a
+null pointer @code{wcrtomb} ignores @var{wc} and only resets the state,
+as described above. The most unusual thing about this piece of code
+certainly is the duplication of the conversion state object. But think
+about it: if a change of the state is necessary to emit the next
+multibyte character we want to have the same shift state change
+performed in the real conversion. Therefore we have to preserve the
+initial shift state information.
+
+There are certainly many more and even better solutions to this problem.
+This example is only meant for educational purposes.
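+
+One such alternative, sketched here under stated assumptions, always
+converts into a small local buffer and copies the bytes only when they
+fit, so each character is converted once. The name
+@code{mbscatwc_tmp} is hypothetical, and restoring the original
+terminating NUL byte on failure is a design choice of this sketch, not
+something the example above does.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical variant of mbscatwc.  LEN is the total size of the
   array S.  Returns S on success; on failure returns NULL and leaves
   S holding its original contents.  */
char *
mbscatwc_tmp (char *s, size_t len, const wchar_t *ws)
{
  mbstate_t state;
  char *start = strchr (s, '\0');
  char *wp = start;

  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    {
      /* MB_LEN_MAX bytes are enough for any single character.  */
      char tmp[MB_LEN_MAX];
      size_t nbytes = wcrtomb (tmp, *ws, &state);

      if (nbytes == (size_t) -1)
        {
          /* Conversion error; errno is EILSEQ.  */
          *start = '\0';
          return NULL;
        }
      if (nbytes > len)
        {
          /* The character does not fit into what is left of S.  */
          *start = '\0';
          errno = E2BIG;
          return NULL;
        }
      memcpy (wp, tmp, nbytes);
      wp += nbytes;
      len -= nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
```

+The price is one extra @code{memcpy} of a few bytes per character; in
+exchange no second conversion and no copy of the state object are
+needed.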
+
+@node Converting Strings
+@subsection Converting Multibyte and Wide Character Strings
+
+The functions described in the previous section only convert a single
+character at a time. Most operations to be performed in real-world
+programs include strings and therefore the @w{ISO C} standard also
+defines conversions on entire strings. The defined set of functions is
+quite limited, though. Therefore the GNU C library contains a few
+extensions which are necessary in some important situations.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsrtowcs} function (``multibyte string restartable to wide
+character string'') converts the NUL-terminated multibyte character
+string at @code{*@var{src}} into an equivalent wide character string,
+including the NUL wide character at the end. The conversion is started
+using the state information from the object pointed to by @var{ps} or
+from an internal object of @code{mbsrtowcs} if @var{ps} is a null
+pointer. Before returning, the state object is updated to match the
+state after the last converted character. The state is the initial
+state if the terminating NUL byte is reached and converted.
+
+If @var{dst} is not a null pointer the result is stored in the array
+pointed to by @var{dst}, otherwise the conversion result is not
+available since it is stored in an internal buffer.
+
+If @var{len} wide characters are stored in the array @var{dst} before
+reaching the end of the input string the conversion stops and @var{len}
+is returned. If @var{dst} is a null pointer @var{len} is never checked.
+
+Another reason for a premature return from the function call is if the
+input string contains an invalid multibyte sequence. In this case the
+global variable @code{errno} is set to @code{EILSEQ} and the function
+returns @code{(size_t) -1}.
+
+@c XXX The ISO C9x draft seems to have a problem here. It says that PS
+@c is not updated if DST is NULL. This is not said straight forward and
+@c none of the other functions is described like this. It would make sense
+@c to define the function this way but I don't think it is meant like this.
+
+In all other cases the function returns the number of wide characters
+converted during this call. If @var{dst} is not null @code{mbsrtowcs}
+stores in the pointer pointed to by @var{src} a null pointer (if the NUL
+byte in the input string was reached) or the address of the byte
+following the last converted multibyte character.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+The definition of this function has one limitation which has to be
+understood. The requirement that the input at @code{*@var{src}} has to
+be a NUL-terminated string causes problems if one wants to convert
+buffers with text. A buffer is normally not a collection of
+NUL-terminated strings but instead a continuous collection of lines,
+separated by newline characters. Now assume a function to convert one
+line from a buffer is needed. Since the line is not NUL terminated the
+source pointer cannot directly point into the unmodified text buffer.
+This means one either inserts the NUL byte at the appropriate place for
+the duration of the @code{mbsrtowcs} function call (which is not doable
+for a read-only buffer or in a multi-threaded application) or copies the
+line into an extra buffer where it can be terminated by a NUL byte.
+Note that it is not in general possible to limit the number of
+characters to convert by setting the parameter @var{len} to any specific
+value. Since it is not known how many bytes each multibyte character
+sequence is in length, one could only guess.
+
+@cindex stateful
+There is still a problem with the method of NUL-terminating a line right
+after the newline character which could lead to very strange results.
+As said in the description of the @code{mbsrtowcs} function above the
+conversion state is guaranteed to be in the initial shift state after
+processing the NUL byte at the end of the input string. But this NUL
+byte is not really part of the text. I.e., the conversion state after
+the newline in the original text could be something different than the
+initial shift state and therefore the first character of the next line
+is encoded using this state. But the state in question is never
+accessible to the user since the conversion stops after the NUL byte.
+Fortunately most stateful character sets in use today require that the
+shift state after a newline is the initial state but this is no
+guarantee. Therefore simply NUL terminating a piece of a running text
+is not always the adequate solution.
+
+The generic conversion
+@comment XXX reference to iconv
+interface does not have this limitation (it simply works on buffers, not
+strings) but there is another way. The GNU C library contains a set of
+functions which take an additional parameter specifying the maximal
+number of bytes which are consumed from the input string. This way the
+problem of the example above could be solved by determining the line
+length and passing this length to the function.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsrtombs} function (``wide character string restartable to
+multibyte string'') converts the NUL terminated wide character string at
+@code{*@var{src}} into an equivalent multibyte character string and
+stores the result in the array pointed to by @var{dst}. The NUL wide
+character is also converted. The conversion starts in the state
+described in the object pointed to by @var{ps} or by a state object
+local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
+@var{dst} is a null pointer the conversion is performed as usual but the
+result is not available. If all characters of the input string were
+successfully converted and if @var{dst} is not a null pointer the
+pointer pointed to by @var{src} gets assigned a null pointer.
+
+If one of the wide characters in the input string has no valid multibyte
+character equivalent the conversion stops early, sets the global
+variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
+
+Another reason for a premature stop is if @var{dst} is not a null
+pointer and the next converted character would require more than
+@var{len} bytes in total to be stored in the array @var{dst}. In this
+case the pointer pointed to by @var{src} is
+assigned a value pointing to the wide character right after the last one
+successfully converted.
+
+Except in the case of an encoding error the return value of the function
+is the number of bytes in all the multibyte character sequences stored
+in @var{dst}. Before returning the state in the object pointed to by
+@var{ps} (or the internal object in case @var{ps} is a null pointer) is
+updated to reflect the state after the last conversion. The state is
+the initial shift state in case the terminating NUL wide character was
+converted.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+The restriction mentioned above for the @code{mbsrtowcs} function also
+applies here. There is no possibility to directly control the number of
+input characters. One has to place the NUL wide character at the
+correct place or control the consumed input indirectly via the available
+output array size (the @var{len} parameter).
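+
+The behavior for a null @var{dst} described above is useful for sizing
+the output before converting. The following is a minimal sketch of
+that idiom; the name @code{wcstombs_alloc} is hypothetical and the
+two-pass structure is an assumption about what the caller can afford.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL-terminated wide character
   string WS into a newly allocated multibyte string, or return NULL
   on an encoding or allocation error.  */
char *
wcstombs_alloc (const wchar_t *ws)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t nbytes;
  char *result;

  /* First pass: with a null destination nothing is stored, but the
     return value is the number of bytes needed, not counting the
     terminating NUL byte.  */
  memset (&state, '\0', sizeof (state));
  nbytes = wcsrtombs (NULL, &src, 0, &state);
  if (nbytes == (size_t) -1)
    return NULL;

  result = malloc (nbytes + 1);
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, starting again from the initial
     state and from the beginning of the input.  */
  src = ws;
  memset (&state, '\0', sizeof (state));
  if (wcsrtombs (result, &src, nbytes + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

+As with the @code{mbslen} approach earlier in this chapter, the input
+is converted twice, so this trades time for an exact allocation.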
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
+function. All the parameters are the same except for @var{nmc} which is
+new. The return value is the same as for @code{mbsrtowcs}.
+
+This new parameter specifies how many bytes at most can be used from the
+multibyte character string. I.e., the multibyte character string
+@code{*@var{src}} need not be NUL terminated. But if a NUL byte is
+found within the @var{nmc} first bytes of the string the conversion
+stops here.
+
+This function is a GNU extension. It is meant to work around the
+problems mentioned above. Now it is possible to convert a buffer with
+multibyte character text piece by piece without having to care about
+inserting NUL bytes and the effect of NUL bytes on the conversion state.
+@end deftypefun
+
+A function to convert a multibyte string into a wide character string
+and display it could be written like this (this is not a really useful
+example):
+
+@smallexample
+void
+showmbs (const char *src, FILE *fp)
+@{
+  mbstate_t state;
+  int cnt = 0;
+  memset (&state, '\0', sizeof (state));
+  while (1)
+    @{
+      wchar_t linebuf[100];
+      const char *endp = strchr (src, '\n');
+      size_t n;
+
+      /* @r{Exit if there is no more line.} */
+      if (endp == NULL)
+        break;
+
+      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
+      /* @r{Stop at an invalid multibyte sequence.} */
+      if (n == (size_t) -1)
+        break;
+      linebuf[n] = L'\0';
+      fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);
+
+      /* @r{Step over the newline once the whole line is converted.} */
+      if (src == endp)
+        ++src;
+    @}
+@}
+@end smallexample
+
+There is no more problem with the state after a call to
+@code{mbsnrtowcs}. Since we don't insert characters in the strings
+which were not in there right from the beginning and we use @var{state}
+only for the conversion of the given buffer there is no problem with
+mixing the state up.
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsnrtombs} function implements the conversion from wide
+character strings to multibyte character strings. It is similar to
+@code{wcsrtombs} but it takes, just like @code{mbsnrtowcs}, an extra
+parameter which specifies the length of the input string.
+
+No more than @var{nwc} wide characters from the input string
+@code{*@var{src}} are converted. If the input string contains a NUL
+wide character in the first @var{nwc} characters, the conversion stops
+at this place.
+
+This function is a GNU extension and just like @code{mbsnrtowcs} it
+helps in situations where no NUL-terminated input strings are available.
+@end deftypefun
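+
+As an illustration, the following sketch converts only the first
+@var{nwc} wide characters of a longer string. The helper name
+@code{wcs_prefix_to_mbs} and its convention of always terminating the
+output are assumptions made for this example. @code{_GNU_SOURCE} is
+defined because the declaration of @code{wcsnrtombs} may not be visible
+in a strict ISO C compilation mode.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1
#endif
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most NWC wide characters from WS
   into DST, which has room for DSTLEN bytes, and terminate DST with a
   NUL byte.  Returns the number of bytes stored, not counting the
   terminator, or (size_t) -1 on error.  */
size_t
wcs_prefix_to_mbs (char *dst, size_t dstlen, const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t n;

  if (dstlen == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Reserve one byte of DST for the terminating NUL byte.  */
  n = wcsnrtombs (dst, &src, nwc, dstlen - 1, &state);
  if (n == (size_t) -1)
    return n;
  dst[n] = '\0';
  return n;
}
```

+Note that for a stateful encoding the output of such a partial
+conversion need not end in the initial shift state, since the
+terminating NUL wide character was never converted.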
+
+
+@node Multibyte Conversion Example
+@subsection A Complete Multibyte Conversion Example
+
+The example programs given in the last sections are only brief and do
+not contain all the error checking etc. Therefore here comes a complete
+and documented example. It features the @code{mbrtowc} function but it
+should be easy to derive versions using the other functions.
+
+@smallexample
+int
+file_mbsrtowcs (int input, int output)
+@{
+ /* @r{Note the use of @code{MB_LEN_MAX}.}
+ @r{@code{MB_CUR_MAX} cannot portably be used here.} */
+ char buffer[BUFSIZ + MB_LEN_MAX];
+ mbstate_t state;
+ int filled = 0;
+ int eof = 0;
+
+ /* @r{Initialize the state.} */
+ memset (&state, '\0', sizeof (state));
+
+ while (!eof)
+ @{
+ ssize_t nread;
+ ssize_t nwrite;
+ char *inp = buffer;
+ wchar_t outbuf[BUFSIZ];
+ wchar_t *outp = outbuf;
+
+ /* @r{Fill up the buffer from the input file.} */
+ nread = read (input, buffer + filled, BUFSIZ);
+ if (nread < 0)
+ @{
+ perror ("read");
+ return 0;
+ @}
+ /* @r{If we reach end of file, make a note to read no more.} */
+ if (nread == 0)
+ eof = 1;
+
+ /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
+ filled += nread;
+
+ /* @r{Convert those bytes to wide characters--as many as we can.} */
+ while (filled > 0)
+ @{
+ mbstate_t saved = state;
+ size_t thislen = mbrtowc (outp, inp, filled, &state);
+ /* @r{Stop converting at an invalid character or at the}
+ @r{first part of a valid character. Restore the state}
+ @r{so that the remaining bytes can be scanned again.} */
+ if (thislen == (size_t) -1 || thislen == (size_t) -2)
+ @{
+ state = saved;
+ break;
+ @}
+ /* @r{We want to handle embedded NUL bytes}
+ @r{but the return value is 0. Correct this.} */
+ if (thislen == 0)
+ thislen = 1;
+ /* @r{Advance past this character.} */
+ inp += thislen;
+ filled -= thislen;
+ ++outp;
+ @}
+
+ /* @r{Write the wide characters we just made.} */
+ nwrite = write (output, outbuf,
+ (outp - outbuf) * sizeof (wchar_t));
+ if (nwrite < 0)
+ @{
+ perror ("write");
+ return 0;
+ @}
+
+ /* @r{See if we have a @emph{real} invalid character.} */
+ if ((eof && filled > 0) || filled >= MB_CUR_MAX)
+ @{
+ error (0, 0, "invalid multibyte character");
+ return 0;
+ @}
+
+ /* @r{If any characters must be carried forward,}
+ @r{put them at the beginning of @code{buffer}.} */
+ if (filled > 0)
+ memmove (buffer, inp, filled);
+ @}
+
+ return 1;
+@}
+@end smallexample
+
+
+@node Non-reentrant Conversion
+@section Non-reentrant Conversion Function
+
+The functions described in the last section are defined in the second
+amendment to @w{ISO C89}. But the original @w{ISO C89} standard also
+contained functions for character set conversion. The reason that they
+are not described in the first place is that they are almost entirely
+useless.
+
+The problem is that all the functions for conversion defined in @w{ISO
+C89} use a local state. This does not only mean that multiple
+conversions at the same time (not only when using threads) cannot be
+done. It also means that you cannot first convert single characters and
+then strings since you cannot tell the conversion functions which state
+to use.
+
+These functions are therefore usable only in a very limited set of
+situations. One must complete converting the entire string before
+starting a new one and each string/text must be converted with the same
+function (there is no problem with the library itself; it is guaranteed
+that no library function changes the state of any of these functions).
+For these reasons it is @emph{highly} recommended to use the functions
+from the last section.
+
+@menu
+* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single
+ Characters.
+* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings.
+* Shift State:: States in Non-reentrant Functions.
+@end menu
+
+@node Non-reentrant Character Conversion
+@subsection Non-reentrant Conversion of Single Characters
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size})
+The @code{mbtowc} (``multibyte to wide character'') function when called
+with non-null @var{string} converts the first multibyte character
+beginning at @var{string} to its corresponding wide character code. It
+stores the result in @code{*@var{result}}.
+
+@code{mbtowc} never examines more than @var{size} bytes. (The idea is
+to supply for @var{size} the number of bytes of data you have in hand.)
+
+@code{mbtowc} with non-null @var{string} distinguishes three
+possibilities: the first @var{size} bytes at @var{string} start with
+a valid multibyte character, they start with an invalid byte sequence or
+just part of a character, or @var{string} points to an empty string (a
+null character).
+
+For a valid multibyte character, @code{mbtowc} converts it to a wide
+character and stores that in @code{*@var{result}}, and returns the
+number of bytes in that character (always at least @code{1}, and never
+more than @var{size}).
+
+For an invalid byte sequence, @code{mbtowc} returns @code{-1}. For an
+empty string, it returns @code{0}, also storing @code{0} in
+@code{*@var{result}}.
+
+If the multibyte character code uses shift characters, then
+@code{mbtowc} maintains and updates a shift state as it scans. If you
+call @code{mbtowc} with a null pointer for @var{string}, that
+initializes the shift state to its standard initial value. It also
+returns nonzero if the multibyte character code in use actually has a
+shift state. @xref{Shift State}.
+@end deftypefun
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
+The @code{wctomb} (``wide character to multibyte'') function converts
+the wide character code @var{wchar} to its corresponding multibyte
+character sequence, and stores the result in bytes starting at
+@var{string}. At most @code{MB_CUR_MAX} characters are stored.
+
+@code{wctomb} with non-null @var{string} distinguishes three
+possibilities for @var{wchar}: a valid wide character code (one that can
+be translated to a multibyte character), an invalid code, and @code{0}.
+
+Given a valid code, @code{wctomb} converts it to a multibyte character,
+storing the bytes starting at @var{string}. Then it returns the number
+of bytes in that character (always at least @code{1}, and never more
+than @code{MB_CUR_MAX}).
+
+If @var{wchar} is an invalid wide character code, @code{wctomb} returns
+@code{-1}. If @var{wchar} is @code{0}, it stores @code{0} in the string
+(after a shift sequence, if one is needed) and returns the number of
+bytes stored, which is @code{1} when no shift sequence is emitted.
+
+If the multibyte character code uses shift characters, then
+@code{wctomb} maintains and updates a shift state as it scans. If you
+call @code{wctomb} with a null pointer for @var{string}, that
+initializes the shift state to its standard initial value. It also
+returns nonzero if the multibyte character code in use actually has a
+shift state. @xref{Shift State}.
+
+Calling this function with a @var{wchar} argument of zero when
+@var{string} is not null has the side-effect of reinitializing the
+stored shift state @emph{as well as} storing the multibyte character
+@code{0} and returning the number of bytes stored.
+@end deftypefun
+
+Similar to @code{mbrlen} there is also a non-reentrant function which
+computes the length of a multibyte character. It can be defined in
+terms of @code{mbtowc}.
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int mblen (const char *@var{string}, size_t @var{size})
+The @code{mblen} function with a non-null @var{string} argument returns
+the number of bytes that make up the multibyte character beginning at
+@var{string}, never examining more than @var{size} bytes. (The idea is
+to supply for @var{size} the number of bytes of data you have in hand.)
+
+The return value of @code{mblen} distinguishes three possibilities: the
+first @var{size} bytes at @var{string} start with a valid multibyte
+character, they start with an invalid byte sequence or just part of a
+character, or @var{string} points to an empty string (a null character).
+
+For a valid multibyte character, @code{mblen} returns the number of
+bytes in that character (always at least @code{1}, and never more than
+@var{size}). For an invalid byte sequence, @code{mblen} returns
+@code{-1}. For an empty string, it returns @code{0}.
+
+If the multibyte character code uses shift characters, then @code{mblen}
+maintains and updates a shift state as it scans. If you call
+@code{mblen} with a null pointer for @var{string}, that initializes the
+shift state to its standard initial value. It also returns nonzero if
+the multibyte character code in use actually has a shift state.
+@xref{Shift State}.
+
+@pindex stdlib.h
+The function @code{mblen} is declared in @file{stdlib.h}.
+@end deftypefun
+
+
+@node Non-reentrant String Conversion
+@subsection Non-reentrant Conversion of Strings
+
+For convenience the @w{ISO C89} standard also defines functions
+to convert entire strings instead of single characters. These functions
+suffer from the same problems as their reentrant counterparts from the
+second amendment to @w{ISO C89}; see @ref{Converting Strings}.
+
+@comment stdlib.h
+@comment ISO
+@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
+The @code{mbstowcs} (``multibyte string to wide character string'')
+function converts the null-terminated string of multibyte characters
+@var{string} to an array of wide character codes, storing not more than
+@var{size} wide characters into the array beginning at @var{wstring}.
+The terminating null character counts towards the size, so if @var{size}
+is less than the actual number of wide characters resulting from
+@var{string}, no terminating null character is stored.
+
+The conversion of characters from @var{string} begins in the initial
+shift state.
+
+If an invalid multibyte character sequence is found, this function
+returns a value of @code{-1}. Otherwise, it returns the number of wide
+characters stored in the array @var{wstring}. This number does not
+include the terminating null character, which is present if the number
+is less than @var{size}.
+
+Here is an example showing how to convert a string of multibyte
+characters, allocating enough space for the result.
+
+@smallexample
+wchar_t *
+mbstowcs_alloc (const char *string)
+@{
+ size_t size = strlen (string) + 1;
+ wchar_t *buf = xmalloc (size * sizeof (wchar_t));
+
+ size = mbstowcs (buf, string, size);
+ if (size == (size_t) -1)
+ return NULL;
+ buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
+ return buf;
+@}
+@end smallexample
+
+@end deftypefun
+
+@comment stdlib.h
+@comment ISO
+@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
+The @code{wcstombs} (``wide character string to multibyte string'')
+function converts the null-terminated wide character array @var{wstring}
+into a string containing multibyte characters, storing not more than
+@var{size} bytes starting at @var{string}, followed by a terminating
+null character if there is room. The conversion of characters begins in
+the initial shift state.
+
+The terminating null character counts towards the size, so if @var{size}
+is less than or equal to the number of bytes needed for @var{wstring}, no
+terminating null character is stored.
+
+If a code that does not correspond to a valid multibyte character is
+found, this function returns a value of @code{-1}. Otherwise, the
+return value is the number of bytes stored in the array @var{string}.
+This number does not include the terminating null character, which is
+present if the number is less than @var{size}.
+@end deftypefun
+
+@node Shift State
+@subsection States in Non-reentrant Functions
+
+In some multibyte character codes, the @emph{meaning} of any particular
+byte sequence is not fixed; it depends on what other sequences have come
+earlier in the same string. Typically there are just a few sequences
+that can change the meaning of other sequences; these few are called
+@dfn{shift sequences} and we say that they set the @dfn{shift state} for
+other sequences that follow.
+
+To illustrate shift state and shift sequences, suppose we decide that
+the sequence @code{0200} (just one byte) enters Japanese mode, in which
+pairs of bytes in the range from @code{0240} to @code{0377} are single
+characters, while @code{0201} enters Latin-1 mode, in which single bytes
+in the range from @code{0240} to @code{0377} are characters, and
+interpreted according to the ISO Latin-1 character set. This is a
+multibyte code which has two alternative shift states (``Japanese mode''
+and ``Latin-1 mode''), and two shift sequences that specify particular
+shift states.
+
+When the multibyte character code in use has shift states, then
+@code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
+the current shift state as they scan the string. To make this work
+properly, you must follow these rules:
+
+@itemize @bullet
+@item
+Before starting to scan a string, call the function with a null pointer
+for the multibyte character address---for example, @code{mblen (NULL,
+0)}. This initializes the shift state to its standard initial value.
+
+@item
+Scan the string one character at a time, in order. Do not ``back up''
+and rescan characters already scanned, and do not intersperse the
+processing of different strings.
+@end itemize
+
+Here is an example of using @code{mblen} following these rules:
+
+@smallexample
+void
+scan_string (char *s)
+@{
+ int length = strlen (s);
+
+ /* @r{Initialize shift state.} */
+ mblen (NULL, 0);
+
+ while (1)
+ @{
+ int thischar = mblen (s, length);
+ /* @r{Deal with end of string and invalid characters.} */
+ if (thischar == 0)
+ break;
+ if (thischar == -1)
+ @{
+ error ("invalid multibyte character");
+ break;
+ @}
+ /* @r{Advance past this character.} */
+ s += thischar;
+ length -= thischar;
+ @}
+@}
+@end smallexample
+
+The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
+reentrant when using a multibyte code that uses a shift state. However,
+no other library functions call these functions, so you don't have to
+worry that the shift state will be changed mysteriously.
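+
+The same two rules apply to @code{mbtowc}. As a sketch only, the
+following hypothetical function @code{count_mb_chars} resets the shift
+state and then scans the string strictly in order to count its
+characters.

```c
#include <stdlib.h>
#include <string.h>

/* Count the multibyte characters in S according to the current locale.
   Returns -1 if an invalid sequence is found.  (Hypothetical helper.)  */
long
count_mb_chars (const char *s)
{
  size_t left = strlen (s);
  long count = 0;

  /* Rule 1: reset the shift state before scanning.  */
  mbtowc (NULL, NULL, 0);

  /* Rule 2: scan one character at a time, in order.  */
  while (left > 0)
    {
      wchar_t wc;
      int n = mbtowc (&wc, s, left);

      if (n < 0)
        return -1;
      if (n == 0)
        break;
      s += n;
      left -= (size_t) n;
      ++count;
    }
  return count;
}
```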
+
+
+@node Generic Charset Conversion
+@section Generic Charset Conversion
+
+The conversion functions mentioned so far in this chapter all have in
+common that they operate on character sets which are not directly
+specified by the functions. The multibyte encoding used is specified by
+the currently selected locale for the @code{LC_CTYPE} category. The
+wide character set is fixed by the implementation (in the case of the
+GNU C library it is always @w{ISO 10646}).
+
+This of course causes several problems when it comes to general
+character conversion:
+
+@itemize @bullet
+@item
+For every conversion where neither the source nor the destination
+character set is the character set of the locale for the
+@code{LC_CTYPE} category, one has to change the @code{LC_CTYPE} locale
+using @code{setlocale}.
+
+This introduces major problems for the rest of the program since
+several more functions (e.g., the character classification functions,
+@pxref{Classification of Characters}) use the @code{LC_CTYPE} category.
+
+@item
+Parallel conversions to and from different character sets are not
+possible since the @code{LC_CTYPE} selection is global and shared by all
+threads.
+
+@item
+If neither the source nor the destination character set is the character
+set used for the @code{wchar_t} representation, at least a two-step
+process is necessary to convert a text using the functions above. One
+would have to select the source character set as the multibyte encoding,
+convert the text into a @code{wchar_t} text, select the destination
+character set as the multibyte encoding and convert the wide character
+text to the multibyte (i.e., destination) character set.
+
+Even if this is possible (which is not guaranteed) it is very tedious
+work. It also suffers even more from the two problems raised above
+because the locale has to be changed constantly.
+@end itemize
+
+
+The XPG2 standard defines a completely new set of functions which has
+none of these limitations. They are not at all coupled to the selected
+locales and they put no constraints on the character sets selected for
+source and destination. They are limited only by the set of available
+conversions. The standard does not specify that any conversion at all
+must be available; this is a measure of the quality of the
+implementation.
+
+In the following text the interface is described first. It is called
+the @code{iconv} interface here, after the name of the conversion
+function. Then the implementation is described as far as it is of
+interest to the advanced user who wants to extend the conversion
+capabilities. Comparisons with other implementations will show what
+pitfalls lie in the way of portable applications.
+
+@menu
+* Generic Conversion Interface:: Generic Character Set Conversion Interface.
+* iconv Examples:: A complete @code{iconv} example.
+* Other iconv Implementations:: Some Details about other @code{iconv}
+ Implementations.
+* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C
+ library.
+@end menu
+
+@node Generic Conversion Interface
+@subsection Generic Character Set Conversion Interface
+
+This set of functions follows the traditional cycle of using a resource:
+open--use--close. The interface consists of three functions, each of
+which implements one step.
+
+Before the interfaces are described it is necessary to introduce a
+data type. Just like other open--use--close interfaces, the functions
+introduced here work using handles, and the @file{iconv.h} header
+defines a special type for the handles used.
+
+@comment iconv.h
+@comment XPG2
+@deftp {Data Type} iconv_t
+This data type is an abstract type defined in @file{iconv.h}. The user
+must not assume anything about the definition of this type; it must be
+treated as completely opaque.
+
+Objects of this type can be assigned handles for the conversions using
+the @code{iconv} functions. The objects themselves need not be freed,
+but the conversions for which the handles stand have to be.
+@end deftp
+
+@noindent
+The first step is the function to create a handle.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
+The @code{iconv_open} function has to be used before starting a
+conversion. The two parameters this function takes determine the
+source and destination character sets for the conversion, and if the
+implementation is able to perform such a conversion the
+function returns a handle.
+
+If the wanted conversion is not available the function returns
+@code{(iconv_t) -1}. In this case the global variable @code{errno} can
+have the following values:
+
+@table @code
+@item EMFILE
+The process already has @code{OPEN_MAX} file descriptors open.
+@item ENFILE
+The system limit of open files is reached.
+@item ENOMEM
+Not enough memory to carry out the operation.
+@item EINVAL
+The conversion from @var{fromcode} to @var{tocode} is not supported.
+@end table
+
+It is not possible to use the same descriptor in different threads to
+perform independent conversions. Within the data structures associated
+with the descriptor there is information about the conversion state.
+This must of course not be corrupted by using it in different
+conversions.
+
+An @code{iconv} descriptor is like a file descriptor in that a new
+descriptor must be created for every use. The descriptor does not stand
+for all of the conversions from @var{fromcode} to @var{tocode}.
+
+The GNU C library implementation of @code{iconv_open} has one
+significant extension compared to other implementations. To ease the
+extension of the set of available conversions the implementation allows
+storing the necessary files with data and code in arbitrarily many
+directories. How these extensions have to be written will be explained
+below (@pxref{glibc iconv Implementation}). Here it is only important
+to say that all directories mentioned in the @code{GCONV_PATH}
+environment variable are considered if they contain a file
+@file{gconv-modules}. These directories need not necessarily be created
+by the system administrator. In fact, this extension is introduced to
+help users write and use their own, new conversions. Of course this
+does not work for security reasons in SUID binaries; in this case only
+the system directory is considered and this normally is
+@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment
+variable is examined exactly once at the first call of the
+@code{iconv_open} function. Later modifications of the variable have no
+effect.
+
+@pindex iconv.h
+This function was introduced early in the X/Open Portability Guide,
+@w{version 2}. It is supported by all commercial Unices as it is
+required for the Unix branding. The quality and completeness of the
+implementation varies widely, though. The function is declared in
+@file{iconv.h}.
+@end deftypefun
+
+The @code{iconv} implementation can associate large data structures with
+the handle returned by @code{iconv_open}. Therefore it is crucial to
+free all the resources once all conversions are carried out and the
+conversion is not needed anymore.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun int iconv_close (iconv_t @var{cd})
+The @code{iconv_close} function frees all resources associated with the
+handle @var{cd} which must have been returned by a successful call to
+the @code{iconv_open} function.
+
+If the function call was successful the return value is @math{0}.
+Otherwise it is @math{-1} and @code{errno} is set appropriately.
+Defined errors are:
+
+@table @code
+@item EBADF
+The conversion descriptor is invalid.
+@end table
+
+@pindex iconv.h
+This function was introduced together with the rest of the @code{iconv}
+functions in XPG2 and it is declared in @file{iconv.h}.
+@end deftypefun
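+
+The open and close steps can be sketched together as follows. The
+function name @code{conversion_available} is hypothetical; it merely
+probes whether a conversion can be opened and releases the descriptor
+again at once.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Probe whether a conversion from FROMCODE to TOCODE can be opened.
   Returns 1 if it is available, 0 if it is not supported (EINVAL),
   and -1 on any other error.  (Hypothetical helper.)  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  /* Every successfully opened descriptor must be closed again.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      return -1;
    }
  return 1;
}
```

+Note the argument order: the destination character set comes first,
+as in @code{iconv_open} itself.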
+
+The standard defines only one actual conversion function. This has
+therefore the most general interface: it allows conversion from one
+buffer to another. Conversion from a file to a buffer, vice versa, or
+even file to file can be implemented on top of it.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun size_t iconv (iconv_t @var{cd}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
+@cindex stateful
+The @code{iconv} function converts the text in the input buffer
+according to the rules associated with the descriptor @var{cd} and
+stores the result in the output buffer. It is possible to call the
+function several times in a row for successive parts of the same text,
+since for stateful character sets the necessary state information is
+kept in the data structures associated with the descriptor.
+
+The input buffer is specified by @code{*@var{inbuf}} and it contains
+@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for
+communicating the used input back to the caller (see below). It is
+important to note that the buffer pointer is of type @code{char *} and
+the length is measured in bytes even if the input text is encoded in
+wide characters.
+
+The output buffer is specified in a similar way. @code{*@var{outbuf}}
+points to the beginning of the buffer with at least
+@code{*@var{outbytesleft}} bytes of room for the result. The buffer
+pointer again is of type @code{char *} and the length is measured in
+bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer the
+conversion is performed but no output is available.
+
+If @var{inbuf} is a null pointer the @code{iconv} function performs the
+necessary action to put the state of the conversion into the initial
+state. This is obviously a no-op for non-stateful encodings, but if the
+encoding has a state such a function call might put some byte sequences
+in the output buffer which perform the necessary state changes. The
+next call with @var{inbuf} not being a null pointer then simply goes on
+from the initial state. It is important that the programmer never makes
+any assumption on whether the conversion has to deal with states or not.
+Even if the input and output character sets are not stateful the
+implementation might still have to keep states. This is due to the
+implementation chosen for the GNU C library as it is described below.
+Therefore an @code{iconv} call to reset the state should always be
+performed if some protocol requires this for the output text.
+
+The conversion stops for one of three reasons. The first is that all
+characters from the input buffer are converted. This actually can mean
+two things: either all bytes from the input buffer are consumed, or
+there are some bytes at the end of the buffer which could form a
+complete character but the input is incomplete. The second reason for a
+stop is that the output buffer is full. And the third reason is that
+the input contains invalid characters.
+
+In all these cases the buffer pointers after the last successful
+conversion, for input and output buffer, are stored in
+@code{*@var{inbuf}} and @code{*@var{outbuf}} and the available room in
+each buffer is stored in @code{*@var{inbytesleft}} and
+@code{*@var{outbytesleft}}.
+
+Since the character sets selected in the @code{iconv_open} call can be
+almost arbitrary there can be situations where the input buffer contains
+valid characters which have no identical representation in the output
+character set. The behavior in this situation is undefined. The
+@emph{current} behavior of the GNU C library in this situation is to
+return with an error immediately. This certainly is not the most
+desirable solution. Therefore future versions will provide better ones
+but they are not yet finished.
+
+If all input from the input buffer is successfully converted and stored
+in the output buffer the function returns the number of non-reversible
+conversions performed. In all other cases the return value is
+@code{(size_t) -1} and @code{errno} is set appropriately. In this case
+the value pointed to by @var{inbytesleft} is nonzero.
+
+@table @code
+@item EILSEQ
+The conversion stopped because of an invalid byte sequence in the input.
+After the call @code{*@var{inbuf}} points at the first byte of the
+invalid byte sequence.
+
+@item E2BIG
+The conversion stopped because it ran out of space in the output buffer.
+
+@item EINVAL
+The conversion stopped because of an incomplete byte sequence at the end
+of the input buffer.
+
+@item EBADF
+The @var{cd} argument is invalid.
+@end table
+
+@pindex iconv.h
+This function was introduced in the XPG2 standard and is declared in the
+@file{iconv.h} header.
+@end deftypefun
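+
+As a minimal sketch of the calling conventions described above, the
+following hypothetical function @code{convert_buffer} converts one
+complete buffer with an already opened descriptor. It resets the
+conversion state first and, after the conversion, makes the extra call
+with a null @var{inbuf} so that any byte sequence needed to return to
+the initial state is written. It assumes the prototype with a
+@code{char **} input argument; where the argument is declared
+@code{const char **} the cast has to be adjusted.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert the INLEN bytes at IN using descriptor CD into OUT, which has
   room for OUTSIZE bytes.  On success store the number of bytes written
   in *OUTLEN and return 0.  Otherwise return -1 with errno as set by
   iconv (EILSEQ, E2BIG or EINVAL).  (Hypothetical helper.)  */
int
convert_buffer (iconv_t cd, const char *in, size_t inlen,
                char *out, size_t outsize, size_t *outlen)
{
  char *inptr = (char *) in;    /* iconv does not modify the input bytes */
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;

  /* Put the descriptor into the initial state.  */
  iconv (cd, NULL, NULL, NULL, NULL);

  /* Lengths are counted in bytes; the pointers and counts are updated
     to reflect how far the conversion got.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    return -1;

  /* Emit whatever is needed to return to the initial state.  */
  if (iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    return -1;

  *outlen = outsize - outleft;
  return 0;
}
```

+Since the whole text is passed in one piece here, @code{EINVAL} means
+the input really ends in the middle of a character.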
+
+The definition of the @code{iconv} function is quite good overall. It
+provides quite flexible functionality. The only problems lie in the
+boundary cases, which are incomplete byte sequences at the end of the
+input buffer and invalid input. A third problem, which is not really a
+design problem, is the way conversions are selected. The standard does
+not say anything about the legitimate names or about a minimal set of
+available conversions. We will see how this has negative impacts in the
+discussion of other implementations further down.
+
+
+@node iconv Examples
+@subsection A complete @code{iconv} example
+
+The example below features a solution for a common problem. Given that
+one knows the internal encoding used by the system for @code{wchar_t}
+strings one often is in the position to read text from a file and store
+it in wide character buffers. One can do this using @code{mbsrtowcs}
+but then we run into the problems discussed above.
+
+@smallexample
+int
+file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
+@{
+ char inbuf[BUFSIZ];
+ size_t insize = 0;
+ char *wrptr = (char *) outbuf;
+ int result = 0;
+ iconv_t cd;
+
+ cd = iconv_open ("UCS4", charset);
+ if (cd == (iconv_t) -1)
+ @{
+ /* @r{Something went wrong.} */
+ if (errno == EINVAL)
+ error (0, 0, "conversion from `%s' to `UCS4' not available",
+ charset);
+ else
+ perror ("iconv_open");
+
+ /* @r{Terminate the output string.} */
+ *outbuf = L'\0';
+
+ return -1;
+ @}
+
+ while (avail > 0)
+ @{
+ size_t nread;
+ size_t nconv;
+ char *inptr = inbuf;
+
+ /* @r{Read more input.} */
+ nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
+ if (nread == 0)
+ @{
+ /* @r{When we come here the file is completely read.}
+ @r{This still could mean there are some unused}
+ @r{characters in the @code{inbuf}. Put them back.} */
+ if (lseek (fd, -insize, SEEK_CUR) == -1)
+ result = -1;
+ break;
+ @}
+ insize += nread;
+
+ /* @r{Do the conversion.} */
+ nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
+ if (nconv == (size_t) -1)
+ @{
+ /* @r{Not everything went right. It might only be}
+ @r{an unfinished byte sequence at the end of the}
+ @r{buffer. Or it is a real problem.} */
+ if (errno == EINVAL)
+ /* @r{This is harmless. Simply move the unused}
+ @r{bytes to the beginning of the buffer so that}
+ @r{they can be used in the next round.} */
+ memmove (inbuf, inptr, insize);
+ else
+ @{
+ /* @r{It is a real problem. Maybe we ran out of}
+ @r{space in the output buffer or we have invalid}
+ @r{input. In any case back the file pointer to}
+ @r{the position of the last processed byte.} */
+ lseek (fd, -insize, SEEK_CUR);
+ result = -1;
+ break;
+ @}
+ @}
+ @}
+
+ /* @r{Terminate the output string.} */
+ *((wchar_t *) wrptr) = L'\0';
+
+ if (iconv_close (cd) != 0)
+ perror ("iconv_close");
+
+ return (wchar_t *) wrptr - outbuf;
+@}
+@end smallexample
+
+@cindex stateful
+This example shows the most important aspects of using the @code{iconv}
+functions. It shows how successive calls to @code{iconv} can be used to
+convert large amounts of text. The user does not have to care about
+stateful encodings as the functions take care of everything.
+
+An interesting point is the case where @code{iconv} returns an error and
+@code{errno} is set to @code{EINVAL}. This is not really an error in
+the transformation. It can happen whenever the input character set
+contains byte sequences of more than one byte for some character and
+texts are not processed in one piece. In this case there is a chance
+that a multibyte sequence is cut. The caller can then simply read the
+remainder of the text and feed the offending bytes together with new
+characters from the input to @code{iconv} and continue the work. The
+internal state kept in the descriptor is @emph{not} unspecified after
+such an event, as is the case with the conversion functions from the
+@w{ISO C} standard.
+
+The example also shows the problem of using wide character strings with
+@code{iconv}. As explained in the description of the @code{iconv}
+function above the function always takes a pointer to a @code{char}
+array and the available space is measured in bytes. In the example the
+output buffer is a wide character buffer. Therefore we use a local
+variable @var{wrptr} of type @code{char *} which is used in the
+@code{iconv} calls.
+
+This looks rather innocent but can lead to problems on platforms which
+have tight restrictions on alignment. Therefore the caller of
+@code{iconv} has to make sure that the pointers passed are suitable for
+access of characters from the appropriate character set. Since in the
+above case the input parameter to the function is a @code{wchar_t}
+pointer this is the case (unless the user violates alignment when
+computing the parameter). But in other situations, especially when
+writing generic functions where one does not know what type of character
+set one uses and therefore treats text as a sequence of bytes, it might
+become tricky.
+
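+One way to stay on the safe side is to let a real @code{wchar_t} array
+provide the storage and to derive the @code{char *} pointer from it.
+The following is only a sketch; the function name @code{bytes_to_wcs}
+is hypothetical, and it assumes that the implementation accepts the
+character set name @code{WCHAR_T} for the native wide character
+representation, which is not something the text above promises.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Convert the INLEN bytes at IN, encoded in FROMCODE, into the wide
   character array OUT with room for OUTCHARS elements including the
   terminator.  Returns the number of wide characters stored, or
   (size_t) -1 on failure.  (Hypothetical helper.)
   Assumption: the implementation knows the name "WCHAR_T".  */
size_t
bytes_to_wcs (const char *fromcode, const char *in, size_t inlen,
              wchar_t *out, size_t outchars)
{
  iconv_t cd;
  char *inptr = (char *) in;
  char *wrptr = (char *) out;   /* suitably aligned: OUT is a wchar_t array */
  size_t inleft = inlen;
  size_t outleft;
  size_t rc;

  if (outchars == 0)
    return (size_t) -1;

  /* iconv counts bytes, not wide characters; keep one element free
     for the terminating null.  */
  outleft = (outchars - 1) * sizeof (wchar_t);

  cd = iconv_open ("WCHAR_T", fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inleft, &wrptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  *(wchar_t *) wrptr = L'\0';
  return (size_t) ((wchar_t *) wrptr - out);
}
```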
+
+@node Other iconv Implementations
+@subsection Some Details about other @code{iconv} Implementations
+
+This is not really the place to discuss the @code{iconv} implementation
+of other systems but it is necessary to know a bit about them to write
+portable programs. The above mentioned problems with the specification
+of the @code{iconv} functions can lead to portability issues.
+
+The first thing to notice is that due to the large number of character
+sets in use it is certainly not practical to encode the conversions
+directly in the C library. Therefore the conversion information must
+come from files outside the C library. This is usually done in one or
+both of the following ways:
+
+@itemize @bullet
+@item
+The C library contains a set of generic conversion functions which can
+read the needed conversion tables and other information from data files.
+These files get loaded when necessary.
+
+This solution is problematic as it can be applied to all character
+sets only with very much effort (maybe it is even impossible). The
+differences in the structure of the different character sets are so
+large that many different variants of the table processing functions
+must be developed. On top of this the generic nature of these functions
+makes them slower than specifically implemented functions.
+
+@item
+The C library only contains a framework which can dynamically load
+object files and execute the conversion functions contained therein.
+
+This solution provides much more flexibility. The C library itself
+contains only very little code and therefore reduces the general memory
+footprint. Also, with a documented interface between the C library and
+the loadable modules it is possible for third parties to extend the set
+of available conversion modules. A drawback of this solution is that
+dynamic loading must be available.
+@end itemize
+
+Some implementations in commercial Unices implement a mixture of these
+possibilities, the majority only the second solution. This often leads
+to problems, though. Since the conversion modules must be dynamically
+loaded the system must offer this possibility for all programs. But
+this is not the case. At least some platforms (if not all) are not able
+to dynamically load objects if the program is linked statically. This
+is often solved by outlawing static linking entirely, but this is
+surely a weak solution. The GNU C library does not have this
+restriction though it also uses dynamic loading. The danger is that one
+gets used to this and forgets about the restriction on other
+systems.
+
+A second thing to know about other @code{iconv} implementations is that
+the number of available conversions is often very limited. Some
+implementations provide in the standard release (not the special
+international release, if such a thing exists) at most 100 to 200
+conversion possibilities. This does not mean 200 different character
+sets are supported. E.g., conversions from one character set to a set
+of, say, 10 others count as 10 conversions. Together with the other
+direction this already makes 20. One can imagine the thin coverage
+these platforms provide. Some Unix vendors even provide only a handful
+of conversions which renders them useless for almost all uses.
+
+This directly leads to a third and probably the most problematic point.
+The way the @code{iconv} conversion functions are implemented on all
+known Unix systems, the availability of the conversion functions from
+character set @math{@cal{A}} to @math{@cal{B}} and the conversion from
+@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
+conversion from @math{@cal{A}} to @math{@cal{C}} is available.
+
+This might not seem unreasonable or problematic at first but it is
+quite a big problem, as one will notice shortly after hitting it. To
+show the problem, assume we are writing a program which has to convert
+from @math{@cal{A}} to @math{@cal{C}}. A call like
+
+@smallexample
+cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
+@end smallexample
+
+@noindent
+fails according to the assumption above. But what does the program
+do now? The conversion is really necessary and therefore simply giving
+up is not an option.
+
+First this is of course a nuisance. The @code{iconv} function should
+take care of this. But second, how should the program proceed from here
+on? If it tried to convert to character set @math{@cal{B}} first,
+the two @code{iconv_open} calls
+
+@smallexample
+cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
+@end smallexample
+
+@noindent
+and
+
+@smallexample
+cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
+@end smallexample
+
+@noindent
+will succeed but how to find @math{@cal{B}}?
+
+The answer is unfortunately: there is no general solution. On some
+systems guessing might help. On those systems most character sets can
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Beyond
+this only some very system-specific methods can help. Since the
+conversion functions come from loadable modules and these modules must
+be stored somewhere in the filesystem, one @emph{could} try to find them
+and determine from the available files which conversions are available
+and whether there is an indirect route from @math{@cal{A}} to
+@math{@cal{C}}.
+
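+The guessing approach can be sketched as follows. The function names
+@code{can_convert} and @code{find_intermediate} are hypothetical, and
+the list of candidates for @math{@cal{B}} has to be supplied by the
+caller since there is no portable way to enumerate the character sets
+a system knows about.

```c
#include <iconv.h>
#include <stddef.h>

/* Return 1 if iconv_open (TOCODE, FROMCODE) succeeds, 0 otherwise.
   (Hypothetical helper.)  */
static int
can_convert (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    return 0;
  iconv_close (cd);
  return 1;
}

/* Search the N names in CANDIDATES for a character set B such that both
   FROMCODE -> B and B -> TOCODE are available.  Returns the first such
   name, or NULL if none of the candidates works.  (Hypothetical helper.)  */
const char *
find_intermediate (const char *tocode, const char *fromcode,
                   const char *const *candidates, size_t n)
{
  size_t i;

  for (i = 0; i < n; ++i)
    if (can_convert (candidates[i], fromcode)
        && can_convert (tocode, candidates[i]))
      return candidates[i];
  return NULL;
}
```

+A null result only says that none of the guesses worked; it does not
+prove that no indirect route exists.
+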
+This shows one of the design errors of @code{iconv} mentioned above. It
+should at least be possible to determine the list of available
+conversions programmatically so that if @code{iconv_open} says there is
+no such conversion, one could make sure this also is true for indirect
+routes.
+
+
+@node glibc iconv Implementation
+@subsection The @code{iconv} Implementation in the GNU C library
+
+After reading about the problems of @code{iconv} implementations in the
+last section it is certainly good to read here that the implementation
+in the GNU C library has none of the problems mentioned above. But step
+by step now. We will now address the points raised above. The
+evaluation is based on the current state of the development (as of
+January 1999). The development of the @code{iconv} functions is not
+entirely finished by now but things can only get better.
+
+The GNU C library's @code{iconv} implementation uses shared loadable
+modules to implement the conversions. A very small number of
+conversions are built into the library itself but these are only rather
+trivial conversions.
+
+All the benefits of loadable modules are available in the GNU C library
+implementation. This is especially interesting since the interface is
+well documented (see below) and it therefore is easy to write new
+conversion modules. The drawback of using loadable objects is not a
+problem in the GNU C library, at least on ELF systems. Since the
+library is able to load shared objects even in statically linked
+binaries, static linking need not be forbidden in case
+one wants to use @code{iconv}.
+
+The second problem mentioned is the number of supported conversions.
+First, the GNU C library supports more than 150 character sets. And the
+way the implementation is designed the number of supported conversions
+is greater than 22350 (@math{150} times @math{149}). If any conversion
+from or to a character set is missing it can easily be added.
+
+This high number is due to the fact that the GNU C library
+implementation of @code{iconv} does not have the third problem mentioned
+above. I.e., whenever there is a conversion from a character set
+@math{@cal{A}} to @math{@cal{B}} and from @math{@cal{B}} to
+@math{@cal{C}} it always is possible to convert from @math{@cal{A}} to
+@math{@cal{C}} directly. If @code{iconv_open} returns an error and
+sets @code{errno} to @code{EINVAL} this really means there is no known
+way, directly or indirectly, to perform the wanted conversion.
+
+@cindex triangulation
+This is achieved by providing for each character set a conversion from
+and to UCS4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
+intermediate representation it is possible to ``triangulate''.
+
+There is no inherent requirement to provide a conversion to @w{ISO
+10646} for a new character set and it is also possible to provide other
+conversions where neither source nor destination character set is @w{ISO
+10646}. The currently existing set of conversions is simply meant to
+cover all conversions which might be of interest. What could be done
+in the future is improving the speed of certain conversions.
+
+@cindex ISO-2022-JP
+@cindex EUC-JP
+Since all currently available conversions use the triangulation method,
+frequently used conversions run unnecessarily slowly. If, e.g.,
+somebody often needs the conversion from ISO-2022-JP to EUC-JP it is not
+the best way to convert the input to @w{ISO 10646} first. The two
+character sets of interest are much more similar to each other than to
+@w{ISO 10646}.
+
+In such a situation one can easily write a new conversion and provide it
+as a better alternative. The GNU C library @code{iconv} implementation
+would automatically use the module implementing the conversion if it is
+specified to be more efficient.
+
+@subsubsection Format of @file{gconv-modules} files
+
+All information about the available conversions comes from a file named
+@file{gconv-modules} which can be found in any of the directories along
+the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented
+text files, where each of the lines has one of the following formats:
+
+@itemize @bullet
+@item
+If the first non-whitespace character is a @kbd{#} the line contains
+only comments and is ignored.
+
+@item
+Lines starting with @code{alias} define an alias name for a character
+set. There are two more words expected on the line. The first one
+defines the alias name and the second defines the original name of the
+character set. The effect is that it is possible to use the alias name
+in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open}
+and achieve the same result as when using the real character set name.
+
+This is quite important as a character set often has many different
+names. There is normally an official name but this need not
+correspond to the most popular name. Besides this many character sets
+have special names which are constructed systematically. E.g., all
+character sets specified by the ISO have an alias of the form
+@code{ISO-IR-@var{nnn}} where @var{nnn} is the registration number.
+This allows programs which know about the registration number to
+construct character set names and use them in @code{iconv_open} calls.
+More on the available names and aliases follows below.
+
+@item
+Lines starting with @code{module} introduce an available conversion
+module. These lines must contain three or four more words.
+
+The first word specifies the source character set, the second word the
+destination character set of the conversion implemented in this module.
+The third word is the name of the loadable module. The filename is
+constructed by appending the usual shared object suffix (normally
+@file{.so}) and this file is then supposed to be found in the same
+directory the @file{gconv-modules} file is in. The last word on the
+line, which is optional, is a numeric value representing the cost of the
+conversion. If this word is missing a cost of @math{1} is assumed. The
+numeric value itself does not matter that much; what counts are the
+relative values of the sums of costs for all possible conversion paths.
+Below is a more precise description of the use of the cost value.
+@end itemize
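+
+Put together, a small @file{gconv-modules} file using all three kinds
+of lines might look like the following. The entries are illustrative
+only and are modeled on the way the ISO 8859-1 conversion could be
+described; the module name @file{ISO8859-1} and the alias shown are
+assumptions, and the @code{INTERNAL} character set is explained
+further below.

```
# Lines whose first non-whitespace character is `#' are comments.

#       alias name      original name
alias   ISO-IR-100//    ISO-8859-1//

#       from            to              module          cost
module  ISO-8859-1//    INTERNAL        ISO8859-1       1
module  INTERNAL        ISO-8859-1//    ISO8859-1       1
```

+Both @code{module} lines name the same shared object, which is then
+expected as @file{ISO8859-1.so} in the same directory.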
+
+Let us come back to the example where one has written a module to
+directly convert from ISO-2022-JP to EUC-JP and back. All that has to
+be done is to put the new module, say @file{ISO2022JP-EUCJP.so}, in a
+directory and add a file @file{gconv-modules} with the following content
+in the same directory:
+
+@smallexample
+module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
+module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
+@end smallexample
+
+To see why this is enough it is necessary to understand how the
+conversion used by @code{iconv} and described in the descriptor is
+selected. The approach to this problem is quite simple.
+
+At the first call of the @code{iconv_open} function the program reads
+all available @file{gconv-modules} files and builds up two tables: one
+containing all the known aliases and another which contains the
+information about the conversions and which shared object implements
+them.
+
+@subsubsection Finding the conversion path in @code{iconv}
+
+The set of available conversions forms a directed graph with weighted
+edges. The weights on the edges are of course the costs specified in
+the @file{gconv-modules} files. The @code{iconv_open} function
+therefore uses an algorithm suitable to search for the best path in such
+a graph and so constructs a list of conversions which must be performed
+in succession to get the transformation from the source to the
+destination character set.
+
+Now it can easily be seen why the above @file{gconv-modules} file
+allows the @code{iconv} implementation to pick up the specific
+ISO-2022-JP to EUC-JP conversion module instead of the conversion coming
+with the library itself. Since the latter conversion takes two steps
+(from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
+EUC-JP) the cost is @math{1+1 = 2}. But the above @file{gconv-modules}
+file specifies that the new conversion module can perform this
+conversion with a cost of only @math{1}.
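+
+To make the effect of the costs concrete, the following C sketch
+computes the cheapest path over a small set of @code{module} lines. It
+is only an illustration and not the code used by @code{iconv_open}: the
+type @code{struct edge}, the function @code{cheapest_path}, the limit of
+16 edges, and the choice of Bellman-Ford relaxation are all assumptions
+made for this example.
+
```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* One "module" line seen as a directed, weighted edge.  Hypothetical
   type for this sketch; it is not part of gconv.h.  */
struct edge
{
  const char *from;
  const char *to;
  int cost;
};

/* Return the index of NAME in NAMES, adding it if it is new.  */
static size_t
node_index (const char *names[], size_t *nnames, const char *name)
{
  size_t i;

  for (i = 0; i < *nnames; ++i)
    if (strcmp (names[i], name) == 0)
      return i;
  names[(*nnames)++] = name;
  return i;
}

/* Return the smallest sum of costs over all paths from FROM to TO, or
   -1 if no path exists.  Handles at most 16 edges.  */
int
cheapest_path (const struct edge *edges, size_t nedges,
               const char *from, const char *to)
{
  const char *names[34];        /* 2 endpoints + 2 names per edge.  */
  int dist[34];
  size_t nnames = 0, src, dst, i, round;

  assert (nedges <= 16);
  src = node_index (names, &nnames, from);
  dst = node_index (names, &nnames, to);
  for (i = 0; i < nedges; ++i)
    {
      node_index (names, &nnames, edges[i].from);
      node_index (names, &nnames, edges[i].to);
    }

  for (i = 0; i < nnames; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Bellman-Ford: relax every edge (number of nodes - 1) times.  */
  for (round = 0; round + 1 < nnames; ++round)
    for (i = 0; i < nedges; ++i)
      {
        size_t u = node_index (names, &nnames, edges[i].from);
        size_t v = node_index (names, &nnames, edges[i].to);

        if (dist[u] != INT_MAX && dist[u] + edges[i].cost < dist[v])
          dist[v] = dist[u] + edges[i].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```
+
+With only the two conversions through @code{INTERNAL}, each of cost 1,
+this function returns 2 for the path from ISO-2022-JP to EUC-JP. Once
+the direct edge with cost 1 is added it returns 1, which is why the new
+module is preferred.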
+
+A bit mysterious about the @file{gconv-modules} file above (and also the
+file coming with the GNU C library) are the names of the character sets
+specified in the @code{module} lines. Why do almost all the names end
+in @code{//}? And this is not all: the names can actually be regular
+expressions. At this point in time this mystery will not be revealed.
+Sorry! @strong{The part of the implementation where this is used is not
+yet finished. For now please simply follow the existing examples.
+It'll become clearer once it is. --drepper}
+
+A last remark about the @file{gconv-modules} file concerns the names not
+ending with @code{//}. There often is a character set named
+@code{INTERNAL} mentioned. From the discussion above and the chosen
+name it should have become clear that this is the name for the
+representation used in the intermediate step of the triangulation. We
+have said that this is UCS4 but actually this is not quite right. The
+UCS4 specification also includes the specification of the byte ordering
+used. Since a UCS4 value consists of four bytes a stored value is
+affected by byte ordering. The internal representation is @emph{not}
+the same as UCS4 in case the byte ordering of the processor (or at least
+the running process) is not the same as the one required for UCS4. This
+is done for performance reasons as one does not want to perform
+unnecessary byte-swapping operations if one is not interested in actually
+seeing the result in UCS4. To avoid trouble with endianness the internal
+representation consistently is named @code{INTERNAL} even on big-endian
+systems where the representations are identical.
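+
+The difference between the two representations can be shown with a few
+lines of C. This is a sketch under the assumption, taken from the text
+above, that UCS4 stores the most significant byte first while
+@code{INTERNAL} is simply a 32-bit value in host byte order. The
+function names @code{internal_to_ucs4} and @code{internal_matches_ucs4}
+are invented for this example.
+
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store the host-order value WC as a UCS4 byte sequence, which this
   sketch takes to be big-endian (most significant byte first).  */
void
internal_to_ucs4 (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Return nonzero if the in-memory form of WC on this machine is already
   identical to its UCS4 byte sequence.  For values with distinct bytes
   this is the case only on a big-endian host.  */
int
internal_matches_ucs4 (uint32_t wc)
{
  unsigned char ucs4[4];

  assert (sizeof wc == sizeof ucs4);
  internal_to_ucs4 (wc, ucs4);
  return memcmp (&wc, ucs4, sizeof ucs4) == 0;
}
```
+
+On a little-endian machine @code{internal_matches_ucs4 (0x3042)}
+returns zero, which is exactly the byte swap the @code{INTERNAL}
+representation avoids between conversion steps.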
+
+@subsubsection @code{iconv} module data structures
+
+So far this section described how modules are located and considered to
+be used. What remains to be described is the interface of the modules
+so that one can write new ones. This section describes the interface as
+it is in use in January 1999. The interface will change a bit in the
+future but hopefully only in an upward compatible way.
+
+The definitions necessary to write new modules are publicly available
+in the non-standard header @file{gconv.h}. The following text will
+therefore describe the definitions from this header file. But first it
+is necessary to get an overview.
+
+From the perspective of the user of @code{iconv} the interface is quite
+simple: the @code{iconv_open} function returns a handle which can be
+used in calls to @code{iconv} and finally the handle is freed with a
+call to @code{iconv_close}. The problem is: the handle has to be able
+to represent the possibly long sequences of conversion steps and also
+the state of each conversion since the handle is all that is passed to
+the @code{iconv} function. Therefore the data structures are really the
+key to understanding the implementation.
+
+We need two different kinds of data structures. The first describes the
+conversion and the second describes the state etc. There are really two
+type definitions like this in @file{gconv.h}.
+@pindex gconv.h
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct gconv_step}
+This data structure describes one conversion a module can perform. For
+each function in a loaded module with conversion functions there is
+exactly one object of this type. This object is shared by all users of
+the conversion. I.e., this object does not contain any information
+corresponding to an actual conversion. It only describes the conversion
+itself.
+
+@table @code
+@item struct gconv_loaded_object *shlib_handle
+@itemx const char *modname
+@itemx int counter
+All these elements of the structure are used internally in the C library
+to coordinate loading and unloading the shared object. One must not
+expect any of the other elements to be available or initialized.
+
+@item const char *from_name
+@itemx const char *to_name
+@code{from_name} and @code{to_name} contain the names of the source and
+destination character sets. They can be used to identify the actual
+conversion to be carried out since one module might implement
+conversions for more than one character set and/or direction.
+
+@item gconv_fct fct
+@itemx gconv_init_fct init_fct
+@itemx gconv_end_fct end_fct
+These elements contain pointers to the functions in the loadable module.
+The interface will be explained below.
+
+@item int min_needed_from
+@itemx int max_needed_from
+@itemx int min_needed_to
+@itemx int max_needed_to
+These values have to be filled in the init function of the module.
+The @code{min_needed_from} value specifies how many bytes a character of
+the source character set at least needs. The @code{max_needed_from}
+specifies the maximum value which also includes possible shift
+sequences.
+
+The @code{min_needed_to} and @code{max_needed_to} values serve the same
+purpose but this time for the destination character set.
+
+It is crucial that these values are accurate since otherwise the
+conversion functions will have problems or not work at all.
+
+@item int stateful
+This element must also be initialized by the init function. It is
+nonzero if the source character set is stateful. Otherwise it is zero.
+
+@item void *data
+This element can be used freely by the conversion functions in the
+module. It can be used to communicate extra information from one call
+to another. It need not be initialized if not needed at all. If this
+element gets assigned a pointer to dynamically allocated memory
+(presumably in the init function) it has to be made sure that the end
+function deallocates the memory. Otherwise the application will leak
+memory.
+
+It is important to be aware that this data structure is shared by all
+users of this specific conversion and therefore the @code{data}
+element must not contain data specific to one particular use of the
+conversion function.
+@end table
+@end deftp
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct gconv_step_data}
+This is the data structure which contains the information specific to
+each use of the conversion functions.
+
+@table @code
+@item char *outbuf
+@itemx char *outbufend
+These elements specify the output buffer for the conversion step. The
+@code{outbuf} element points to the beginning of the buffer and
+@code{outbufend} points to the byte following the last byte in the
+buffer. The conversion function must not assume anything about the size
+of the buffer but it can be safely assumed that there is room for at
+least one complete character in the output buffer.
+
+Once the conversion is finished and the conversion is the last step the
+@code{outbuf} element must be modified to point after the last byte
+written into the buffer to signal how much output is available. If this
+conversion step is not the last one the element must not be modified.
+The @code{outbufend} element must not be modified.
+
+@item int is_last
+This element is nonzero if this conversion step is the last one. This
+information is necessary for the recursion. See the description of the
+conversion function internals below. This element must never be
+modified.
+
+@item int invocation_counter
+The conversion function can use this element to see how many calls of
+the conversion function already happened. Some character sets require a
+certain prolog when generating output and by comparing this value with
+zero one can find out whether it is the first call and therefore whether
+the prolog should be emitted. This element must never be modified.
+
+@item int internal_use
+This element is another one rarely used but needed in certain
+situations. It is assigned a nonzero value in case the conversion
+functions are used to implement @code{mbsrtowcs} et al., i.e., the
+function is not used directly through the @code{iconv} interface.
+
+This sometimes makes a difference as it is expected that the
+@code{iconv} functions are used to translate entire texts while the
+@code{mbsrtowcs} functions are normally only used to convert single
+strings and might be used multiple times to convert entire texts.
+
+But in this situation we would have a problem complying with some rules
+of the character set specification. Some character sets require a
+prolog which must appear exactly once for an entire text. If a number
+of @code{mbsrtowcs} calls are used to convert the text only the first
+call must add the prolog. But since there is no communication between
+the different calls of @code{mbsrtowcs} the conversion functions have no
+possibility to find this out. The situation is different for sequences
+of @code{iconv} calls since the handle allows access to the needed
+information.
+
+This element is mostly used together with @code{invocation_counter} in a
+way like this:
+
+@smallexample
+if (!data->internal_use && data->invocation_counter == 0)
+ /* @r{Emit prolog.} */
+ ...
+@end smallexample
+
+This element must never be modified.
+
+@item mbstate_t *statep
+The @code{statep} element points to an object of type @code{mbstate_t}
+(@pxref{Keeping the state}). The conversion of a stateful character
+set must use the object pointed to by this element to store information
+about the conversion state. The @code{statep} element itself must never
+be modified.
+
+@item mbstate_t __state
+This element must @emph{never} be used directly. It is only part of
+this structure to have the needed space allocated.
+@end table
+@end deftp
+
+@subsubsection @code{iconv} module interfaces
+
+With the knowledge about the data structures we now can describe the
+conversion functions themselves. To understand the interface a bit of
+knowledge about the functionality in the C library which loads the
+objects with the conversions is necessary.
+
+It is often the case that one conversion is used more than once. I.e.,
+there are several @code{iconv_open} calls for the same set of character
+sets during one program run. The @code{mbsrtowcs} et al.@: functions in
+the GNU C library also use the @code{iconv} functionality which
+increases the number of uses of the same functions even more.
+
+For this reason the modules do not get loaded exclusively for one
+conversion. Instead a module once loaded can be used by arbitrarily many
+@code{iconv} or @code{mbsrtowcs} calls at the same time. The splitting
+of the information between conversion function specific information and
+conversion data makes this possible. The last section showed the two
+data structures used to do this.
+
+This is of course also reflected in the interface and semantics of the
+functions the modules must provide. There are three functions which
+must have the following names:
+
+@table @code
+@item gconv_init
+The @code{gconv_init} function initializes the conversion function
+specific data structure. This very same object is shared by all
+conversions which use this conversion and therefore no state information
+about the conversion itself may be stored in here. If a module
+implements more than one conversion the @code{gconv_init} function will be
+called multiple times.
+
+@item gconv_end
+The @code{gconv_end} function is responsible for freeing all resources
+allocated by the @code{gconv_init} function. If there is nothing to do
+this function can be missing. Special care must be taken if the module
+implements more than one conversion and the @code{gconv_init} function
+does not allocate the same resources for all conversions.
+
+@item gconv
+This is the actual conversion function. It is called to convert one
+block of text. It gets passed the conversion step information
+initialized by @code{gconv_init} and the conversion data, specific to
+this use of the conversion functions.
+@end table
+
+There are three data types defined for the three module interface
+functions and these define the interface.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int (*gconv_init_fct) (struct gconv_step *)
+This specifies the interface of the initialization function of the
+module. It is called exactly once for each conversion the module
+implements.
+
+As explained in the description of the @code{struct gconv_step} data
+structure above the initialization function has to initialize parts of
+it.
+
+@table @code
+@item min_needed_from
+@itemx max_needed_from
+@itemx min_needed_to
+@itemx max_needed_to
+These elements must be initialized to the exact numbers of the minimum
+and maximum number of bytes used by one character in the source and
+destination character set respectively. If the characters all have the
+same size the minimum and maximum values are the same.
+
+@item stateful
+This element must be initialized to a nonzero value if the source
+character set is stateful. Otherwise it must be zero.
+@end table
+
+If the initialization function needs to communicate some information
+to the conversion function this can happen using the @code{data} element
+of the @code{gconv_step} structure. But since this data is shared by
+all the conversions it must not be modified by the conversion function.
+How this can be used is shown in the example below.
+
+@smallexample
+#define MIN_NEEDED_FROM 1
+#define MAX_NEEDED_FROM 4
+#define MIN_NEEDED_TO 4
+#define MAX_NEEDED_TO 4
+
+int
+gconv_init (struct gconv_step *step)
+@{
+ /* @r{Determine which direction.} */
+ struct iso2022jp_data *new_data;
+ enum direction dir = illegal_dir;
+ enum variant var = illegal_var;
+ int result;
+
+ if (__strcasecmp (step->from_name, "ISO-2022-JP//") == 0)
+ @{
+ dir = from_iso2022jp;
+ var = iso2022jp;
+ @}
+ else if (__strcasecmp (step->to_name, "ISO-2022-JP//") == 0)
+ @{
+ dir = to_iso2022jp;
+ var = iso2022jp;
+ @}
+ else if (__strcasecmp (step->from_name, "ISO-2022-JP-2//") == 0)
+ @{
+ dir = from_iso2022jp;
+ var = iso2022jp2;
+ @}
+ else if (__strcasecmp (step->to_name, "ISO-2022-JP-2//") == 0)
+ @{
+ dir = to_iso2022jp;
+ var = iso2022jp2;
+ @}
+
+ result = GCONV_NOCONV;
+ if (dir != illegal_dir)
+ @{
+ new_data = (struct iso2022jp_data *)
+ malloc (sizeof (struct iso2022jp_data));
+
+ result = GCONV_NOMEM;
+ if (new_data != NULL)
+ @{
+ new_data->dir = dir;
+ new_data->var = var;
+ step->data = new_data;
+
+ if (dir == from_iso2022jp)
+ @{
+ step->min_needed_from = MIN_NEEDED_FROM;
+ step->max_needed_from = MAX_NEEDED_FROM;
+ step->min_needed_to = MIN_NEEDED_TO;
+ step->max_needed_to = MAX_NEEDED_TO;
+ @}
+ else
+ @{
+ step->min_needed_from = MIN_NEEDED_TO;
+ step->max_needed_from = MAX_NEEDED_TO;
+ step->min_needed_to = MIN_NEEDED_FROM;
+ step->max_needed_to = MAX_NEEDED_FROM + 2;
+ @}
+
+ /* @r{Yes, this is a stateful encoding.} */
+ step->stateful = 1;
+
+ result = GCONV_OK;
+ @}
+ @}
+
+ return result;
+@}
+@end smallexample
+
+The function first checks which conversion is wanted. The module from
+which this function is taken implements four different conversions and
+which one is selected can be determined by comparing the names. The
+comparison should always be done without paying attention to the case.
+
+Then a data structure is allocated which contains the necessary
+information about which conversion is selected. The data structure
+@code{struct iso2022jp_data} is locally defined since outside the module
+this data is not used at all. Please note that if all four conversions
+this module supports are requested there are four data blocks.
+
+One interesting thing is the initialization of the @code{min_} and
+@code{max_} elements of the step object. A single ISO-2022-JP
+character can consist of one to four bytes. Therefore the
+@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
+this way. The output is always the @code{INTERNAL} character set (aka
+UCS4) and therefore each character consists of exactly four bytes. For
+the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
+account that escape sequences might be necessary to switch the character
+sets. Therefore the @code{max_needed_to} element for this direction
+gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
+two bytes needed for the escape sequences to signal the switching. The
+asymmetry in the maximum values for the two directions can be explained
+easily: when reading ISO-2022-JP text escape sequences can be handled
+alone. I.e., it is not necessary to process a real character since the
+effect of the escape sequence can be recorded in the state information.
+The situation is different for the other direction. Since it is in
+general not known which character comes next one cannot emit escape
+sequences to change the state in advance. This means the escape
+sequences have to be emitted together with the next character.
+Therefore one needs more room than only for the character itself.
+
+The possible return values of the initialization function are:
+
+@table @code
+@item GCONV_OK
+The initialization succeeded.
+@item GCONV_NOCONV
+The requested conversion is not supported in the module. This can
+happen if the @file{gconv-modules} file has errors.
+@item GCONV_NOMEM
+Memory required to store additional information could not be allocated.
+@end table
+@end deftypevr
+
+The function called before the module is unloaded is significantly
+simpler. It often has nothing at all to do, in which case it can be left
+out completely.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} void (*gconv_end_fct) (struct gconv_step *)
+The task of this function is to free all resources allocated in the
+initialization function. Therefore only the @code{data} element of the
+object pointed to by the argument is of interest. Continuing the
+example from the initialization function, the finalization function
+looks like this:
+
+@smallexample
+void
+gconv_end (struct gconv_step *data)
+@{
+ free (data->data);
+@}
+@end smallexample
+@end deftypevr
+
+The most important function of course is the conversion function itself.
+It can get quite complicated for complex character sets. But since this
+is not of interest here we will only describe a possible skeleton for
+the conversion function.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int (*gconv_fct) (struct gconv_step *, struct gconv_step_data *, const char **, const char *, size_t *, int)
+The conversion function can be called for two basic reasons: to convert
+text or to reset the state. From the description of the @code{iconv}
+function it can be seen why the flushing mode is necessary. What mode
+is selected is determined by the sixth argument, an integer. If it is
+nonzero it means that flushing is selected.
+
+Common to both modes is where the output buffer can be found. The
+information about this buffer is stored in the conversion step data. A
+pointer to this is passed as the second argument to this function. The
+description of the @code{struct gconv_step_data} structure has more
+information on this.
+
+@cindex stateful
+What has to be done for flushing depends on the source character set.
+If it is not stateful nothing has to be done. Otherwise the function
+has to emit a byte sequence to bring the state object into the initial
+state. Once this all happened the other conversion modules in the chain
+of conversions have to get the same chance. Whether another step
+follows can be determined from the @code{is_last} element of the step
+data structure to which the second parameter points.
+
+The more interesting mode is when text actually has to be converted.
+The first step in this case is to convert as much text as possible from
+the input buffer and store the result in the output buffer. The start
+of the input buffer is determined by the third argument which is a
+pointer to a pointer variable referencing the beginning of the buffer.
+The fourth argument is a pointer to the byte right after the last byte
+in the buffer.
+
+The conversion has to be performed according to the current state if the
+character set is stateful. The state is stored in an object pointed to
+by the @code{statep} element of the step data (second argument). Once
+either the input buffer is empty or the output buffer is full the
+conversion stops. At this point the pointer variable referenced by the
+third parameter must point to the byte following the last processed
+byte. I.e., if all of the input is consumed this pointer and the fourth
+parameter have the same value.
+
+What now happens depends on whether this step is the last one or not.
+If it is the last step the only thing which has to be done is to update
+the @code{outbuf} element of the step data structure to point after the
+last written byte. This gives the caller the information on how much
+text is available in the output buffer. Besides this the variable
+pointed to by the fifth parameter, which is of type @code{size_t}, must
+be incremented by the number of characters (@emph{not bytes}) which were
+written in the output buffer. Then the function can return.
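+
+The bookkeeping described in the last three paragraphs can be sketched
+for a deliberately trivial, stateless case. The function below is not
+taken from the C library and does not use the @code{gconv_fct}
+signature: the name @code{latin1_to_internal}, the two
+@code{SKETCH_} status values, and passing the output buffer as separate
+arguments are all assumptions made to keep the example self-contained.
+It converts ISO 8859-1 bytes to four-byte values in host byte order.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the real status codes; hypothetical names.  */
enum { SKETCH_EMPTY_INPUT, SKETCH_FULL_OUTPUT };

/* Convert as many bytes as possible from [*INBUF, INBUFEND) into
   four-byte host-order values stored at [*OUTBUF, OUTBUFEND).  On
   return *INBUF points after the last processed byte, *OUTBUF after the
   last written byte, and *WRITTEN has been incremented by the number of
   characters (not bytes) produced.  */
int
latin1_to_internal (const char **inbuf, const char *inbufend,
                    char **outbuf, char *outbufend, size_t *written)
{
  const unsigned char *in = (const unsigned char *) *inbuf;
  const unsigned char *inend = (const unsigned char *) inbufend;
  char *out = *outbuf;
  int status = SKETCH_EMPTY_INPUT;

  assert (in <= inend && out <= outbufend);

  while (in < inend)
    {
      uint32_t wc = *in;

      /* Stop before writing a partial character.  */
      if ((size_t) (outbufend - out) < sizeof wc)
        {
          status = SKETCH_FULL_OUTPUT;
          break;
        }

      memcpy (out, &wc, sizeof wc);
      out += sizeof wc;
      ++in;
      ++*written;
    }

  /* Report how far we got in both buffers.  */
  *inbuf = (const char *) in;
  *outbuf = out;
  return status;
}
```
+
+If all of the input is consumed, @code{*inbuf} ends up equal to
+@code{inbufend}, matching the rule stated above. A real module would
+additionally consult the state object for stateful character sets.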
+
+In case the step is not the last one the later conversion functions have
+to get a chance to do their work. Therefore the appropriate conversion
+function has to be called. The information about the functions is
+stored in the conversion data structures, passed as the first parameter.
+This information and the step data are stored in arrays so the next
+element in both cases can be found by simple pointer arithmetic:
+
+@smallexample
+int
+gconv (struct gconv_step *step, struct gconv_step_data *data,
+ const char **inbuf, const char *inbufend, size_t *written,
+ int do_flush)
+@{
+ struct gconv_step *next_step = step + 1;
+ struct gconv_step_data *next_data = data + 1;
+ ...
+@end smallexample
+
+The @code{next_step} pointer references the next step information and
+@code{next_data} the next data record. The call of the next function
+therefore will look similar to this:
+
+@smallexample
+ next_step->fct (next_step, next_data, &outerr, outbuf, written, 0)
+@end smallexample
+
+But this is not yet all. Once the function call returns the conversion
+function might have some more to do. If the return value of the
+function is @code{GCONV_EMPTY_INPUT} this means there is more room in
+the output buffer. Unless the input buffer is empty the conversion
+function starts all over again and processes the rest of the input
+buffer. If the return value is not @code{GCONV_EMPTY_INPUT} something
+went wrong and we have to recover from this.
+
+A requirement for the conversion function is that the input buffer
+pointer (the third argument) always points to the last character which
+was put in the converted form in the output buffer. This is trivially
+true after the conversion performed in the current step. But if the
+conversion functions deeper down the stream stop prematurely not all
+characters from the output buffer are consumed and therefore the input
+buffer pointers must be backed off to the right position.
+
+This is easy to do if the input and output character sets have a fixed
+width for all characters. In this situation we can compute how many
+characters are left in the output buffer and therefore can correct the
+input buffer pointer appropriately with a similar computation. Things
+are getting tricky if either character set has characters represented
+with variable length byte sequences and it gets even more complicated if
+the conversion has to take care of the state. In these cases the
+conversion has to be performed once again, from the known state before
+the initial conversion. I.e., if necessary the state of the conversion
+has to be reset and the conversion loop has to be executed again. The
+difference now is that it is known how much output (that is, input for
+the next step) must be created and the conversion can stop before
+converting the first unused character. Once this is done the input
+buffer pointers must be updated again and the function can return.
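+
+For the simple fixed-width case the correction is plain arithmetic. The
+helper below is a sketch with invented names (@code{back_off_input} and
+its parameters); it assumes every character occupies exactly
+@code{in_width} bytes in the input and @code{out_width} bytes in the
+output, and that the next step stopped on a character boundary.
+
```c
#include <assert.h>
#include <stddef.h>

/* INPTR_AFTER is the input pointer after this step converted
   everything it could.  The output produced ends at OUTBUF, but the
   next step consumed it only up to OUTERR.  Return the input pointer
   backed off to the first character whose converted form was not
   consumed.  */
const char *
back_off_input (const char *inptr_after, const char *outerr,
                const char *outbuf, size_t in_width, size_t out_width)
{
  size_t unused_bytes;
  size_t unused_chars;

  assert (in_width > 0 && out_width > 0 && outerr <= outbuf);

  /* Number of output bytes, and thus characters, left unconsumed.  */
  unused_bytes = (size_t) (outbuf - outerr);
  assert (unused_bytes % out_width == 0);
  unused_chars = unused_bytes / out_width;

  /* The same number of characters must be given back on the input.  */
  return inptr_after - unused_chars * in_width;
}
```
+
+For example, with one-byte input characters and four-byte output
+characters, eight unconsumed output bytes mean the input pointer moves
+back by two bytes. The variable-width and stateful cases cannot use
+this shortcut and need the repeated conversion described above.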
+
+One final thing should be mentioned. If it is necessary for the
+conversion to know whether it is the first invocation (in case a prolog
+has to be emitted) the conversion function should just before returning
+to the caller increment the @code{invocation_counter} element of the
+step data structure. See the description of the @code{struct
+gconv_step_data} structure above for more information on how this can be
+used.
+
+The return value must be one of the following values:
+
+@table @code
+@item GCONV_EMPTY_INPUT
+All input was consumed and there is room left in the output buffer.
+@item GCONV_FULL_OUTPUT
+No more room in the output buffer. In case this is not the last step
+this value is propagated down from the call of the next conversion
+function in the chain.
+@item GCONV_INCOMPLETE_INPUT
+The input buffer is not entirely empty since it contains an incomplete
+character sequence.
+@end table
+
+The following example provides a framework for a conversion function.
+In case a new conversion has to be written the holes in this
+implementation have to be filled and that is it.
+
+@smallexample
+int
+gconv (struct gconv_step *step, struct gconv_step_data *data,
+ const char **inbuf, const char *inbufend, size_t *written,
+ int do_flush)
+@{
+ struct gconv_step *next_step = step + 1;
+ struct gconv_step_data *next_data = data + 1;
+ gconv_fct fct = next_step->fct;
+ int status;
+
+ /* @r{If the function is called with no input this means we have}
+ @r{to reset to the initial state. The possibly partly}
+ @r{converted input is dropped.} */
+ if (do_flush)
+ @{
+ status = GCONV_OK;
+
+ /* @r{Possibly emit a byte sequence which puts the state object}
+ @r{into the initial state.} */
+
+ /* @r{Call the steps down the chain if there are any but only}
+ @r{if we successfully emitted the escape sequence.} */
+ if (status == GCONV_OK && ! data->is_last)
+ status = fct (next_step, next_data, NULL, NULL,
+ written, 1);
+ @}
+ else
+ @{
+ /* @r{We preserve the initial values of the pointer variables.} */
+ const char *inptr = *inbuf;
+ char *outbuf = data->outbuf;
+ char *outend = data->outbufend;
+ char *outptr;
+
+ /* @r{This variable is used to count the number of characters}
+ @r{we actually converted.} */
+ size_t converted = 0;
+
+ do
+ @{
+ /* @r{Remember the start value for this round.} */
+ inptr = *inbuf;
+ /* @r{The outbuf buffer is empty.} */
+ outptr = outbuf;
+
+ /* @r{For stateful encodings the state must be saved here.} */
+
+ /* @r{Run the conversion loop. @code{status} is set}
+ @r{appropriately afterwards.} */
+
+ /* @r{If this is the last step leave the loop, there is}
+ @r{nothing we can do.} */
+ if (data->is_last)
+ @{
+ /* @r{Store information about how many bytes are}
+ @r{available.} */
+ data->outbuf = outbuf;
+
+ /* @r{Remember how many characters we converted.} */
+ *written += converted;
+
+ break;
+ @}
+
+ /* @r{Write out all output which was produced.} */
+ if (outbuf > outptr)
+ @{
+ const char *outerr = data->outbuf;
+ int result;
+
+ result = fct (next_step, next_data, &outerr,
+ outbuf, written, 0);
+
+ if (result != GCONV_EMPTY_INPUT)
+ @{
+ if (outerr != outbuf)
+ @{
+ /* @r{Reset the input buffer pointer. We}
+ @r{document here the complex case.} */
+ size_t nstatus;
+
+ /* @r{Reload the pointers.} */
+ *inbuf = inptr;
+ outbuf = outptr;
+
+ /* @r{Possibly reset the state.} */
+
+ /* @r{Redo the conversion, but this time}
+ @r{the end of the output buffer is at}
+ @r{@code{outerr}.} */
+ @}
+
+ /* @r{Change the status.} */
+ status = result;
+ @}
+ else
+ /* @r{All the output is consumed, we can make}
+ @r{ another run if everything was ok.} */
+ if (status == GCONV_FULL_OUTPUT)
+ status = GCONV_OK;
+ @}
+ @}
+ while (status == GCONV_OK);
+
+ /* @r{We finished one use of this step.} */
+ ++data->invocation_counter;
+ @}
+
+ return status;
+@}
+@end smallexample
+@end deftypevr
+
+This information should be sufficient to write new modules. Anybody
+doing so should also take a look at the available source code in the GNU
+C library sources. It contains many examples of working and optimized
+modules.