aboutsummaryrefslogtreecommitdiff
path: root/manual/mbyte.texi
diff options
context:
space:
mode:
authorUlrich Drepper <drepper@redhat.com>1999-01-14 12:58:05 +0000
committerUlrich Drepper <drepper@redhat.com>1999-01-14 12:58:05 +0000
commit0d0323043255c75bd21fa9a5a8701b3c2728b809 (patch)
tree5cc8cd2f70c5e2dfae74aa5e40bebb4d3a867ca7 /manual/mbyte.texi
parent1b1594b3ad57e96b060d1089360053b52fc4a5b1 (diff)
downloadglibc-0d0323043255c75bd21fa9a5a8701b3c2728b809.tar
glibc-0d0323043255c75bd21fa9a5a8701b3c2728b809.tar.gz
glibc-0d0323043255c75bd21fa9a5a8701b3c2728b809.tar.bz2
glibc-0d0323043255c75bd21fa9a5a8701b3c2728b809.zip
Update.
* sysdeps/unix/sysv/linux/arm/Dist: Add sys/user.h.
Diffstat (limited to 'manual/mbyte.texi')
-rw-r--r--manual/mbyte.texi696
1 files changed, 0 insertions, 696 deletions
diff --git a/manual/mbyte.texi b/manual/mbyte.texi
deleted file mode 100644
index 8f3c1924fa..0000000000
--- a/manual/mbyte.texi
+++ /dev/null
@@ -1,696 +0,0 @@
-@node Extended Characters, Locales, String and Array Utilities, Top
-@c %MENU% Support for extended character sets
-@chapter Extended Characters
-
-A number of languages use character sets that are larger than the range
-of values of type @code{char}. Japanese and Chinese are probably the
-most familiar examples.
-
-The GNU C library includes support for two mechanisms for dealing with
-extended character sets: multibyte characters and wide characters. This
-chapter describes how to use these mechanisms, and the functions for
-converting between them.
-@cindex extended character sets
-
-The behavior of the functions in this chapter is affected by the current
-locale for character classification---the @code{LC_CTYPE} category; see
-@ref{Locale Categories}. This choice of locale selects which multibyte
-code is used, and also controls the meanings and characteristics of wide
-character codes.
-
-@menu
-* Extended Char Intro:: Multibyte codes versus wide characters.
-* Locales and Extended Chars:: The locale selects the character codes.
-* Multibyte Char Intro:: How multibyte codes are represented.
-* Wide Char Intro:: How wide characters are represented.
-* Wide String Conversion:: Converting wide strings to multibyte code
- and vice versa.
-* Length of Char:: how many bytes make up one multibyte char.
-* Converting One Char:: Converting a string character by character.
-* Example of Conversion:: Example showing why converting
- one character at a time may be useful.
-* Shift State:: Multibyte codes with "shift characters".
-@end menu
-
-@node Extended Char Intro, Locales and Extended Chars, , Extended Characters
-@section Introduction to Extended Characters
-
-You can represent extended characters in either of two ways:
-
-@itemize @bullet
-@item
-As @dfn{multibyte characters} which can be embedded in an ordinary
-string, an array of @code{char} objects. Their advantage is that many
-programs and operating systems can handle occasional multibyte
-characters scattered among ordinary ASCII characters, without any
-change.
-
-@item
-@cindex wide characters
-As @dfn{wide characters}, which are like ordinary characters except that
-they occupy more bits. The wide character data type, @code{wchar_t},
-has a range large enough to hold extended character codes as well as
-old-fashioned ASCII codes.
-
-An advantage of wide characters is that each character is a single data
-object, just like ordinary ASCII characters. There are a few
-disadvantages:
-
-@itemize @bullet
-@item
-Each existing program must be modified and recompiled to make it use
-wide characters.
-
-@item
-Files of wide characters cannot be read by programs that expect ordinary
-characters.
-@end itemize
-@end itemize
-
-Typically, you use the multibyte character representation as part of the
-external program interface, such as reading or writing text to files.
-However, it's usually easier to perform internal manipulations on
-strings containing extended characters on arrays of @code{wchar_t}
-objects, since the uniform representation makes most editing operations
-easier. If you do use multibyte characters for files and wide
-characters for internal operations, you need to convert between them
-when you read and write data.
-
-If your system supports extended characters, then it supports them both
-as multibyte characters and as wide characters. The library includes
-functions you can use to convert between the two representations.
-These functions are described in this chapter.
-
-@node Locales and Extended Chars, Multibyte Char Intro, Extended Char Intro, Extended Characters
-@section Locales and Extended Characters
-
-A computer system can support more than one multibyte character code,
-and more than one wide character code. The user controls the choice of
-codes through the current locale for character classification
-(@pxref{Locales}). Each locale specifies a particular multibyte
-character code and a particular wide character code. The choice of locale
-influences the behavior of the conversion functions in the library.
-
-Some locales support neither wide characters nor nontrivial multibyte
-characters. In these locales, the library conversion functions still
-work, even though what they do is basically trivial.
-
-If you select a new locale for character classification, the internal
-shift state maintained by these functions can become confused, so it's
-not a good idea to change the locale while you are in the middle of
-processing a string.
-
-@node Multibyte Char Intro, Wide Char Intro, Locales and Extended Chars, Extended Characters
-@section Multibyte Characters
-@cindex multibyte characters
-
-In the ordinary ASCII code, a sequence of characters is a sequence of
-bytes, and each character is one byte. This is very simple, but
-allows for only 256 distinct characters.
-
-In a @dfn{multibyte character code}, a sequence of characters is a
-sequence of bytes, but each character may occupy one or more consecutive
-bytes of the sequence.
-
-@cindex basic byte sequence
-There are many different ways of designing a multibyte character code;
-different systems use different codes. To specify a particular code
-means designating the @dfn{basic} byte sequences---those which represent
-a single character---and what characters they stand for. A code that a
-computer can actually use must have a finite number of these basic
-sequences, and typically none of them is more than a few characters
-long.
-
-These sequences need not all have the same length. In fact, many of
-them are just one byte long. Because the basic ASCII characters in the
-range from @code{0} to @code{0177} are so important, they stand for
-themselves in all multibyte character codes. That is to say, a byte
-whose value is @code{0} through @code{0177} is always a character in
-itself. The characters which are more than one byte must always start
-with a byte in the range from @code{0200} through @code{0377}.
-
-The byte value @code{0} can be used to terminate a string, just as it is
-often used in a string of ASCII characters.
-
-Specifying the basic byte sequences that represent single characters
-automatically gives meanings to many longer byte sequences, as more than
-one character. For example, if the two byte sequence @code{0205 049}
-stands for the Greek letter alpha, then @code{0205 049 065} must stand
-for an alpha followed by an @samp{A} (ASCII code 065), and @code{0205 049
-0205 049} must stand for two alphas in a row.
-
-If any byte sequence can have more than one meaning as a sequence of
-characters, then the multibyte code is ambiguous---and no good. The
-codes that systems actually use are all unambiguous.
-
-In most codes, there are certain sequences of bytes that have no meaning
-as a character or characters. These are called @dfn{invalid}.
-
-The simplest possible multibyte code is a trivial one:
-
-@quotation
-The basic sequences consist of single bytes.
-@end quotation
-
-This particular code is equivalent to not using multibyte characters at
-all. It has no invalid sequences. But it can handle only 256 different
-characters.
-
-Here is another possible code which can handle 9376 different
-characters:
-
-@quotation
-The basic sequences consist of
-
-@itemize @bullet
-@item
-single bytes with values in the range @code{0} through @code{0237}.
-
-@item
-two-byte sequences, in which both of the bytes have values in the range
-from @code{0240} through @code{0377}.
-@end itemize
-@end quotation
-
-@noindent
-This code or a similar one is used on some systems to represent Japanese
-characters. The invalid sequences are those which consist of an odd
-number of consecutive bytes in the range from @code{0240} through
-@code{0377}.
-
-Here is another multibyte code which can handle more distinct extended
-characters---in fact, almost thirty million:
-
-@quotation
-The basic sequences consist of
-
-@itemize @bullet
-@item
-single bytes with values in the range @code{0} through @code{0177}.
-
-@item
-sequences of up to four bytes in which the first byte is in the range
-from @code{0200} through @code{0237}, and the remaining bytes are in the
-range from @code{0240} through @code{0377}.
-@end itemize
-@end quotation
-
-@noindent
-In this code, any sequence that starts with a byte in the range
-from @code{0240} through @code{0377} is invalid.
-
-And here is another variant which has the advantage that removing the
-last byte or bytes from a valid character can never produce another
-valid character. (This property is convenient when you want to search
-strings for particular characters.)
-
-@quotation
-The basic sequences consist of
-
-@itemize @bullet
-@item
-single bytes with values in the range @code{0} through @code{0177}.
-
-@item
-two-byte sequences in which the first byte is in the range from
-@code{0200} through @code{0207}, and the second byte is in the range
-from @code{0240} through @code{0377}.
-
-@item
-three-byte sequences in which the first byte is in the range from
-@code{0210} through @code{0217}, and the other bytes are in the range
-from @code{0240} through @code{0377}.
-
-@item
-four-byte sequences in which the first byte is in the range from
-@code{0220} through @code{0227}, and the other bytes are in the range
-from @code{0240} through @code{0377}.
-@end itemize
-@end quotation
-
-@noindent
-The list of invalid sequences for this code is long and not worth
-stating in full; examples of invalid sequences include @code{0240} and
-@code{0220 0300 065}.
-
-The number of @emph{possible} multibyte codes is astronomical. But a
-given computer system will support at most a few different codes. (One
-of these codes may allow for thousands of different characters.)
-Another computer system may support a completely different code. The
-library facilities described in this chapter are helpful because they
-package up the knowledge of the details of a particular computer
-system's multibyte code, so your programs need not know them.
-
-You can use special standard macros to find out the maximum possible
-number of bytes in a character in the currently selected multibyte
-code with @code{MB_CUR_MAX}, and the maximum for @emph{any} multibyte
-code supported on your computer with @code{MB_LEN_MAX}.
-
-@comment limits.h
-@comment ISO
-@deftypevr Macro int MB_LEN_MAX
-This is the maximum length of a multibyte character for any supported
-locale. It is defined in @file{limits.h}.
-@pindex limits.h
-@end deftypevr
-
-@comment stdlib.h
-@comment ISO
-@deftypevr Macro int MB_CUR_MAX
-This macro expands into a (possibly non-constant) positive integer
-expression that is the maximum number of bytes in a multibyte character
-in the current locale. The value is never greater than @code{MB_LEN_MAX}.
-
-@pindex stdlib.h
-@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
-@end deftypevr
-
-Normally, each basic sequence in a particular character code stands for
-one character, the same character regardless of context. Some multibyte
-character codes have a concept of @dfn{shift state}; certain codes,
-called @dfn{shift sequences}, change to a different shift state, and the
-meaning of some or all basic sequences varies according to the current
-shift state. In fact, the set of basic sequences might even be
-different depending on the current shift state. @xref{Shift State}, for
-more information on handling this sort of code.
-
-What happens if you try to pass a string containing multibyte characters
-to a function that doesn't know about them? Normally, such a function
-treats a string as a sequence of bytes, and interprets certain byte
-values specially; all other byte values are ``ordinary''. As long as a
-multibyte character doesn't contain any of the special byte values, the
-function should pass it through as if it were several ordinary
-characters.
-
-For example, let's figure out what happens if you use multibyte
-characters in a file name. The functions such as @code{open} and
-@code{unlink} that operate on file names treat the name as a sequence of
-byte values, with @samp{/} as the only special value. Any other byte
-values are copied, or compared, in sequence, and all byte values are
-treated alike. Thus, you may think of the file name as a sequence of
-bytes or as a string containing multibyte characters; the same behavior
-makes sense equally either way, provided no multibyte character contains
-a @samp{/}.
-
-@node Wide Char Intro, Wide String Conversion, Multibyte Char Intro, Extended Characters
-@section Wide Character Introduction
-
-@dfn{Wide characters} are much simpler than multibyte characters. They
-are simply characters with more than eight bits, so that they have room
-for more than 256 distinct codes. The wide character data type,
-@code{wchar_t}, has a range large enough to hold extended character
-codes as well as old-fashioned ASCII codes.
-
-An advantage of wide characters is that each character is a single data
-object, just like ordinary ASCII characters. Wide characters also have
-some disadvantages:
-
-@itemize @bullet
-@item
-A program must be modified and recompiled in order to use wide
-characters at all.
-
-@item
-Files of wide characters cannot be read by programs that expect ordinary
-characters.
-@end itemize
-
-Wide character values @code{0} through @code{0177} are always identical
-in meaning to the ASCII character codes. The wide character value zero
-is often used to terminate a string of wide characters, just as a single
-byte with value zero often terminates a string of ordinary characters.
-
-@comment stddef.h
-@comment ISO
-@deftp {Data Type} wchar_t
-This is the ``wide character'' type, an integer type whose range is
-large enough to represent all distinct values in any extended character
-set in the supported locales. @xref{Locales}, for more information
-about locales. This type is defined in the header file @file{stddef.h}.
-@pindex stddef.h
-@end deftp
-
-If your system supports extended characters, then each extended
-character has both a wide character code and a corresponding multibyte
-basic sequence.
-
-@cindex code, character
-@cindex character code
-In this chapter, the term @dfn{code} is used to refer to a single
-extended character object to emphasize the distinction from the
-@code{char} data type.
-
-@node Wide String Conversion, Length of Char, Wide Char Intro, Extended Characters
-@section Conversion of Extended Strings
-@cindex extended strings, converting representations
-@cindex converting extended strings
-
-@pindex stdlib.h
-The @code{mbstowcs} function converts a string of multibyte characters
-to a wide character array. The @code{wcstombs} function does the
-reverse. These functions are declared in the header file
-@file{stdlib.h}.
-
-In most programs, these functions are the only ones you need for
-conversion between wide strings and multibyte character strings. But
-they have limitations. If your data is not null-terminated or is not
-all in core at once, you probably need to use the low-level conversion
-functions to convert one character at a time. @xref{Converting One
-Char}.
-
-@comment stdlib.h
-@comment ISO
-@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
-The @code{mbstowcs} (``multibyte string to wide character string'')
-function converts the null-terminated string of multibyte characters
-@var{string} to an array of wide character codes, storing not more than
-@var{size} wide characters into the array beginning at @var{wstring}.
-The terminating null character counts towards the size, so if @var{size}
-is less than the actual number of wide characters resulting from
-@var{string}, no terminating null character is stored.
-
-The conversion of characters from @var{string} begins in the initial
-shift state.
-
-If an invalid multibyte character sequence is found, this function
-returns a value of @code{-1}. Otherwise, it returns the number of wide
-characters stored in the array @var{wstring}. This number does not
-include the terminating null character, which is present if the number
-is less than @var{size}.
-
-Here is an example showing how to convert a string of multibyte
-characters, allocating enough space for the result.
-
-@smallexample
-wchar_t *
-mbstowcs_alloc (const char *string)
-@{
- size_t size = strlen (string) + 1;
- wchar_t *buf = xmalloc (size * sizeof (wchar_t));
-
- size = mbstowcs (buf, string, size);
- if (size == (size_t) -1)
- return NULL;
- buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
- return buf;
-@}
-@end smallexample
-
-@end deftypefun
-
-@comment stdlib.h
-@comment ISO
-@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
-The @code{wcstombs} (``wide character string to multibyte string'')
-function converts the null-terminated wide character array @var{wstring}
-into a string containing multibyte characters, storing not more than
-@var{size} bytes starting at @var{string}, followed by a terminating
-null character if there is room. The conversion of characters begins in
-the initial shift state.
-
-The terminating null character counts towards the size, so if @var{size}
-is less than or equal to the number of bytes needed in @var{wstring}, no
-terminating null character is stored.
-
-If a code that does not correspond to a valid multibyte character is
-found, this function returns a value of @code{-1}. Otherwise, the
-return value is the number of bytes stored in the array @var{string}.
-This number does not include the terminating null character, which is
-present if the number is less than @var{size}.
-@end deftypefun
-
-@node Length of Char, Converting One Char, Wide String Conversion, Extended Characters
-@section Multibyte Character Length
-@cindex multibyte character, length of
-@cindex length of multibyte character
-
-This section describes how to scan a string containing multibyte
-characters, one character at a time. The difficulty in doing this
-is to know how many bytes each character contains. Your program
-can use @code{mblen} to find this out.
-
-@comment stdlib.h
-@comment ISO
-@deftypefun int mblen (const char *@var{string}, size_t @var{size})
-The @code{mblen} function with a non-null @var{string} argument returns
-the number of bytes that make up the multibyte character beginning at
-@var{string}, never examining more than @var{size} bytes. (The idea is
-to supply for @var{size} the number of bytes of data you have in hand.)
-
-The return value of @code{mblen} distinguishes three possibilities: the
-first @var{size} bytes at @var{string} start with valid multibyte
-character, they start with an invalid byte sequence or just part of a
-character, or @var{string} points to an empty string (a null character).
-
-For a valid multibyte character, @code{mblen} returns the number of
-bytes in that character (always at least @code{1}, and never more than
-@var{size}). For an invalid byte sequence, @code{mblen} returns
-@code{-1}. For an empty string, it returns @code{0}.
-
-If the multibyte character code uses shift characters, then @code{mblen}
-maintains and updates a shift state as it scans. If you call
-@code{mblen} with a null pointer for @var{string}, that initializes the
-shift state to its standard initial value. It also returns nonzero if
-the multibyte character code in use actually has a shift state.
-@xref{Shift State}.
-
-@pindex stdlib.h
-The function @code{mblen} is declared in @file{stdlib.h}.
-@end deftypefun
-
-@node Converting One Char, Example of Conversion, Length of Char, Extended Characters
-@section Conversion of Extended Characters One by One
-@cindex extended characters, converting
-@cindex converting extended characters
-
-@pindex stdlib.h
-You can convert multibyte characters one at a time to wide characters
-with the @code{mbtowc} function. The @code{wctomb} function does the
-reverse. These functions are declared in @file{stdlib.h}.
-
-@comment stdlib.h
-@comment ISO
-@deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size})
-The @code{mbtowc} (``multibyte to wide character'') function when called
-with non-null @var{string} converts the first multibyte character
-beginning at @var{string} to its corresponding wide character code. It
-stores the result in @code{*@var{result}}.
-
-@code{mbtowc} never examines more than @var{size} bytes. (The idea is
-to supply for @var{size} the number of bytes of data you have in hand.)
-
-@code{mbtowc} with non-null @var{string} distinguishes three
-possibilities: the first @var{size} bytes at @var{string} start with
-valid multibyte character, they start with an invalid byte sequence or
-just part of a character, or @var{string} points to an empty string (a
-null character).
-
-For a valid multibyte character, @code{mbtowc} converts it to a wide
-character and stores that in @code{*@var{result}}, and returns the
-number of bytes in that character (always at least @code{1}, and never
-more than @var{size}).
-
-For an invalid byte sequence, @code{mbtowc} returns @code{-1}. For an
-empty string, it returns @code{0}, also storing @code{0} in
-@code{*@var{result}}.
-
-If the multibyte character code uses shift characters, then
-@code{mbtowc} maintains and updates a shift state as it scans. If you
-call @code{mbtowc} with a null pointer for @var{string}, that
-initializes the shift state to its standard initial value. It also
-returns nonzero if the multibyte character code in use actually has a
-shift state. @xref{Shift State}.
-@end deftypefun
-
-@comment stdlib.h
-@comment ISO
-@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
-The @code{wctomb} (``wide character to multibyte'') function converts
-the wide character code @var{wchar} to its corresponding multibyte
-character sequence, and stores the result in bytes starting at
-@var{string}. At most @code{MB_CUR_MAX} characters are stored.
-
-@code{wctomb} with non-null @var{string} distinguishes three
-possibilities for @var{wchar}: a valid wide character code (one that can
-be translated to a multibyte character), an invalid code, and @code{0}.
-
-Given a valid code, @code{wctomb} converts it to a multibyte character,
-storing the bytes starting at @var{string}. Then it returns the number
-of bytes in that character (always at least @code{1}, and never more
-than @code{MB_CUR_MAX}).
-
-If @var{wchar} is an invalid wide character code, @code{wctomb} returns
-@code{-1}. If @var{wchar} is @code{0}, it returns @code{0}, also
-storing @code{0} in @code{*@var{string}}.
-
-If the multibyte character code uses shift characters, then
-@code{wctomb} maintains and updates a shift state as it scans. If you
-call @code{wctomb} with a null pointer for @var{string}, that
-initializes the shift state to its standard initial value. It also
-returns nonzero if the multibyte character code in use actually has a
-shift state. @xref{Shift State}.
-
-Calling this function with a @var{wchar} argument of zero when
-@var{string} is not null has the side-effect of reinitializing the
-stored shift state @emph{as well as} storing the multibyte character
-@code{0} and returning @code{0}.
-@end deftypefun
-
-@node Example of Conversion, Shift State, Converting One Char, Extended Characters
-@section Character-by-Character Conversion Example
-
-Here is an example that reads multibyte character text from descriptor
-@code{input} and writes the corresponding wide characters to descriptor
-@code{output}. We need to convert characters one by one for this
-example because @code{mbstowcs} is unable to continue past a null
-character, and cannot cope with an apparently invalid partial character
-by reading more input.
-
-@smallexample
-int
-file_mbstowcs (int input, int output)
-@{
- char buffer[BUFSIZ + MB_LEN_MAX];
- int filled = 0;
- int eof = 0;
-
- while (!eof)
- @{
- int nread;
- int nwrite;
- char *inp = buffer;
- wchar_t outbuf[BUFSIZ];
- wchar_t *outp = outbuf;
-
- /* @r{Fill up the buffer from the input file.} */
- nread = read (input, buffer + filled, BUFSIZ);
- if (nread < 0)
- @{
- perror ("read");
- return 0;
- @}
- /* @r{If we reach end of file, make a note to read no more.} */
- if (nread == 0)
- eof = 1;
-
- /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
- filled += nread;
-
- /* @r{Convert those bytes to wide characters--as many as we can.} */
- while (1)
- @{
- int thislen = mbtowc (outp, inp, filled);
- /* Stop converting at invalid character;
- this can mean we have read just the first part
- of a valid character. */
- if (thislen == -1)
- break;
- /* @r{Treat null character like any other,}
- @r{but also reset shift state.} */
- if (thislen == 0) @{
- thislen = 1;
- mbtowc (NULL, NULL, 0);
- @}
- /* @r{Advance past this character.} */
- inp += thislen;
- filled -= thislen;
- outp++;
- @}
-
- /* @r{Write the wide characters we just made.} */
- nwrite = write (output, outbuf,
- (outp - outbuf) * sizeof (wchar_t));
- if (nwrite < 0)
- @{
- perror ("write");
- return 0;
- @}
-
- /* @r{See if we have a @emph{real} invalid character.} */
- if ((eof && filled > 0) || filled >= MB_CUR_MAX)
- @{
- error ("invalid multibyte character");
- return 0;
- @}
-
- /* @r{If any characters must be carried forward,}
- @r{put them at the beginning of @code{buffer}.} */
- if (filled > 0)
- memcpy (inp, buffer, filled);
- @}
- @}
-
- return 1;
-@}
-@end smallexample
-
-@node Shift State, , Example of Conversion, Extended Characters
-@section Multibyte Codes Using Shift Sequences
-
-In some multibyte character codes, the @emph{meaning} of any particular
-byte sequence is not fixed; it depends on what other sequences have come
-earlier in the same string. Typically there are just a few sequences
-that can change the meaning of other sequences; these few are called
-@dfn{shift sequences} and we say that they set the @dfn{shift state} for
-other sequences that follow.
-
-To illustrate shift state and shift sequences, suppose we decide that
-the sequence @code{0200} (just one byte) enters Japanese mode, in which
-pairs of bytes in the range from @code{0240} to @code{0377} are single
-characters, while @code{0201} enters Latin-1 mode, in which single bytes
-in the range from @code{0240} to @code{0377} are characters, and
-interpreted according to the ISO Latin-1 character set. This is a
-multibyte code which has two alternative shift states (``Japanese mode''
-and ``Latin-1 mode''), and two shift sequences that specify particular
-shift states.
-
-When the multibyte character code in use has shift states, then
-@code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
-the current shift state as they scan the string. To make this work
-properly, you must follow these rules:
-
-@itemize @bullet
-@item
-Before starting to scan a string, call the function with a null pointer
-for the multibyte character address---for example, @code{mblen (NULL,
-0)}. This initializes the shift state to its standard initial value.
-
-@item
-Scan the string one character at a time, in order. Do not ``back up''
-and rescan characters already scanned, and do not intersperse the
-processing of different strings.
-@end itemize
-
-Here is an example of using @code{mblen} following these rules:
-
-@smallexample
-void
-scan_string (char *s)
-@{
- int length = strlen (s);
-
- /* @r{Initialize shift state.} */
- mblen (NULL, 0);
-
- while (1)
- @{
- int thischar = mblen (s, length);
- /* @r{Deal with end of string and invalid characters.} */
- if (thischar == 0)
- break;
- if (thischar == -1)
- @{
- error ("invalid multibyte character");
- break;
- @}
- /* @r{Advance past this character.} */
- s += thischar;
- length -= thischar;
- @}
-@}
-@end smallexample
-
-The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
-reentrant when using a multibyte code that uses a shift state. However,
-no other library functions call these functions, so you don't have to
-worry that the shift state will be changed mysteriously.