path: root/localedata/unicode-gen
Age  Commit message  Author
2024-01-14  localedata/unicode-gen/utf8_gen.py: fix Hangul syllable name  Mike FABIAN
Resolves: BZ #29506
2024-01-08  localedata: unicode-gen: Remove redundant \s* from regexp, fix comments  Mike FABIAN
2024-01-01  Update copyright dates with scripts/update-copyrights  Paul Eggert
2023-09-16  Update to Unicode 15.1.0 [BZ #30854]  Mike FABIAN
Unicode 15.1.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 15.1.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Total removed characters in newly generated CHARMAP: 0
Total changed characters in newly generated CHARMAP: 0
Total added characters in newly generated CHARMAP: 627
Total removed characters in newly generated WIDTH: 0
Total changed characters in newly generated WIDTH: 0
Total added characters in newly generated WIDTH: 627
alpha: Added 622 characters in new ctype which were not in old ctype
graph: Added 627 characters in new ctype which were not in old ctype
print: Added 627 characters in new ctype which were not in old ctype
punct: Added 5 characters in new ctype which were not in old ctype
The five characters added to punct are:
2FFC;IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM RIGHT;So;0;ON;;;;;N;;;;;
2FFD;IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER RIGHT;So;0;ON;;;;;N;;;;;
2FFE;IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL REFLECTION;So;0;ON;;;;;N;;;;;
2FFF;IDEOGRAPHIC DESCRIPTION CHARACTER ROTATION;So;0;ON;;;;;N;;;;;
31EF;IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION;So;0;ON;;;;;N;;;;;
The Unicode announcement blog entry says "[...] adds 627 characters, [...] additions include 622 CJK unified ideographs in a new block, [...]", so that looks OK. The Unicode blog mentions "six completely new emoji" but they don't appear here as they are all sequences and not single code points.
Resolves: BZ #30854
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
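The five records quoted in the entry above follow the semicolon-separated layout of UnicodeData.txt (15 fields per line). As a minimal sketch, the fields most relevant to ctype classification can be extracted as shown below. The `UnicodeRecord` type and the `parse_unicode_data_line` helper are hypothetical names chosen for illustration; they are not taken from the generator scripts.

```python
from typing import NamedTuple


class UnicodeRecord(NamedTuple):
    """Subset of one UnicodeData.txt record (hypothetical helper type)."""
    code_point: int
    name: str
    category: str   # General_Category, e.g. 'So'
    bidi: str       # Bidi_Class, e.g. 'ON'


def parse_unicode_data_line(line: str) -> UnicodeRecord:
    """Split one UnicodeData.txt line into the fields used here."""
    fields = line.strip().split(';')
    if len(fields) != 15:
        raise ValueError('expected 15 fields, got {}'.format(len(fields)))
    # Field 0: code point (hex), 1: name, 2: general category, 4: bidi class.
    return UnicodeRecord(int(fields[0], 16), fields[1], fields[2], fields[4])


record = parse_unicode_data_line(
    '31EF;IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION;So;0;ON;;;;;N;;;;;')
print(hex(record.code_point), record.category, record.name)
```

All five added characters carry the general category So (Symbol, other), which is consistent with the log reporting them under punct rather than alpha.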
2023-09-16  localedata/unicode-gen/utf8_gen.py: adapt regexp to get relevant lines from EastAsianWidth.txt  Mike FABIAN
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2023-09-16  Fix regexp syntax warnings in localedata/unicode-gen/ctype_compatibility.py  Mike FABIAN
Fix these:
$ python -m py_compile ./ctype_compatibility.py
./ctype_compatibility.py:146: SyntaxWarning: invalid escape sequence '\)'
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
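The warning quoted above arises because `\)` inside an ordinary Python string literal is not a recognised escape sequence. A minimal sketch of the usual remedy, a raw string literal, follows. The pattern and the `extract_parenthesized` helper are illustrative assumptions and are not copied from ctype_compatibility.py.

```python
import re

# A raw string (r'...') passes the backslashes through to the regexp
# engine unchanged, so '\(' and '\)' mean literal parentheses and the
# compiler emits no "invalid escape sequence" warning.
# Hypothetical pattern for illustration only.
PAREN_GROUP = re.compile(r'\((?P<inner>[^()]*)\)')


def extract_parenthesized(text: str) -> list:
    """Return the contents of every non-nested (...) group in text."""
    return [match.group('inner') for match in PAREN_GROUP.finditer(text)]


print(extract_parenthesized('(<U0041>,<U0061>);(<U0042>,<U0062>)'))
```

Doubling each backslash in a normal string literal would also silence the warning; the raw string is generally easier to read for regular expressions.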
2023-01-06  Update copyright dates with scripts/update-copyrights  Joseph Myers
2022-10-06  Update to Unicode 15.0.0 [BZ #29604]  Mike FABIAN
Unicode 15.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 15.0.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Total added characters in newly generated CHARMAP: 4489
Total removed characters in newly generated WIDTH: 0
Total changed characters in newly generated WIDTH: 0
Total added characters in newly generated WIDTH: 4257
alpha: Added 4389 characters in new ctype which were not in old ctype
combining: Added 42 characters in new ctype which were not in old ctype
combining_level3: Added 34 characters in new ctype which were not in old ctype
graph: Added 4489 characters in new ctype which were not in old ctype
lower: Added 73 characters in new ctype which were not in old ctype
print: Added 4489 characters in new ctype which were not in old ctype
punct: Missing 5 characters of old ctype in new ctype
punct: Missing: ఄ 0xc04 TELUGU SIGN COMBINING ANUSVARA ABOVE
punct: Missing: ྂ 0xf82 TIBETAN SIGN NYI ZLA NAA DA
punct: Missing: ྃ 0xf83 TIBETAN SIGN SNA LDAN
punct: Missing: 𑂀 0x11080 KAITHI SIGN CANDRABINDU
punct: Missing: 𑂁 0x11081 KAITHI SIGN ANUSVARA
That’s OK, because these are now Alphabetic in DerivedCoreProperties.txt
punct: Added 105 characters in new ctype which were not in old ctype
Resolves: BZ #29604
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-01-01  Update copyright dates with scripts/update-copyrights  Paul Eggert
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2021-10-04  Update to Unicode 14.0.0 [BZ #28390]  Mike FABIAN
Unicode 14.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 14.0.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Total added characters in newly generated CHARMAP: 838
Total removed characters in newly generated WIDTH: 1
(Characters not in WIDTH get width 1 by default, i.e. these have width 1 now.)
removed: <U1734> 0 : eaw=N category=Mc bidi=L name=HANUNOO SIGN PAMUDPOD
That seems intentional, the character had category Mn (Mark, nonspacing) before and now has Mc (Mark, spacing combining)
Total changed characters in newly generated WIDTH: 0
Total added characters in newly generated WIDTH: 175
2021-09-03  Remove "Contributed by" lines  Siddhesh Poyarekar
We stopped adding "Contributed by" or similar lines in sources in 2012 in favour of git logs and keeping the Contributors section of the glibc manual up to date. Removing these lines makes the license header a bit more consistent across files and also removes the possibility of error in attribution when license blocks or files are copied across since the contributed-by lines don't actually reflect reality in those cases.
Move all "Contributed by" and similar lines (Written by, Test by, etc.) into a new file CONTRIBUTED-BY to retain record of these contributions. These contributors are also mentioned in manual/contrib.texi, so we just maintain this additional record as a courtesy to the earlier developers.
The following scripts were used to filter a list of files to edit in place and to clean up the CONTRIBUTED-BY file respectively. These were not added to the glibc sources because they're not expected to be of any use in future given that this is a one time task:
https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc
https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-01-02  Update copyright dates with scripts/update-copyrights  Paul Eggert
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
2020-06-26  Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]  Mike FABIAN
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2020-04-21  Bug 25819: Update to Unicode 13.0.0  Mike FABIAN
Unicode 13.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 13.0.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Total added characters in newly generated CHARMAP: 5930
Total added characters in newly generated WIDTH: 5536
2020-01-01  Update copyright dates with scripts/update-copyrights.  Joseph Myers
2019-09-07  Prefer https to http for gnu.org and fsf.org URLs  Paul Eggert
Also, change sources.redhat.com to sourceware.org. This patch was automatically generated by running the following shell script, which uses GNU sed, and which avoids modifying files imported from upstream:
sed -ri '
  s,(http|ftp)(://(.*\.)?(gnu|fsf|sourceware)\.org($|[^.]|\.[^a-z])),https\2,g
  s,(http|ftp)(://(.*\.)?)sources\.redhat\.com($|[^.]|\.[^a-z]),https\2sourceware.org\4,g
' \
  $(find $(git ls-files) -prune -type f \
      ! -name '*.po' \
      ! -name 'ChangeLog*' \
      ! -path COPYING ! -path COPYING.LIB \
      ! -path manual/fdl-1.3.texi ! -path manual/lgpl-2.1.texi \
      ! -path manual/texinfo.tex ! -path scripts/config.guess \
      ! -path scripts/config.sub ! -path scripts/install-sh \
      ! -path scripts/mkinstalldirs ! -path scripts/move-if-change \
      ! -path INSTALL ! -path locale/programs/charmap-kw.h \
      ! -path po/libc.pot ! -path sysdeps/gnu/errlist.c \
      ! '(' -name configure \
            -execdir test -f configure.ac -o -f configure.in ';' ')' \
      ! '(' -name preconfigure \
            -execdir test -f preconfigure.ac ';' ')' \
      -print)
and then by running 'make dist-prepare' to regenerate files built from the altered files, and then executing the following to cleanup:
  chmod a+x sysdeps/unix/sysv/linux/riscv/configure
  # Omit irrelevant whitespace and comment-only changes,
  # perhaps from a slightly-different Autoconf version.
  git checkout -f \
    sysdeps/csky/configure \
    sysdeps/hppa/configure \
    sysdeps/riscv/configure \
    sysdeps/unix/sysv/linux/csky/configure
  # Omit changes that caused a pre-commit check to fail like this:
  # remote: *** error: sysdeps/powerpc/powerpc64/ppc-mcount.S: trailing lines
  git checkout -f \
    sysdeps/powerpc/powerpc64/ppc-mcount.S \
    sysdeps/unix/sysv/linux/s390/s390-64/syscall.S
  # Omit change that caused a pre-commit check to fail like this:
  # remote: *** error: sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S: last line does not end in newline
  git checkout -f sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S
2019-05-13  Bug 24535: Update to Unicode 12.1.0  Mike FABIAN
Unicode 12.1.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 12.1.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Some info about the number of characters added or changed:
Total added characters in newly generated CHARMAP: 1
added: <U32FF> /xe3/x8b/xbf SQUARE ERA NAME REIWA
Total added characters in newly generated WIDTH: 1
added: <U32FF> 2 : eaw=W category=So bidi=L name=SQUARE ERA NAME REIWA
graph: Added 1 characters in new ctype which were not in old ctype
graph: Added: ㋿ U+32FF SQUARE ERA NAME REIWA
print: Added 1 characters in new ctype which were not in old ctype
print: Added: ㋿ U+32FF SQUARE ERA NAME REIWA
punct: Added 1 characters in new ctype which were not in old ctype
punct: Added: ㋿ U+32FF SQUARE ERA NAME REIWA
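The CHARMAP line in the entry above pairs a `<Uxxxx>` symbol with the character's UTF-8 bytes written as `/xNN`. As a rough sketch of how such a line can be derived from a code point, consider the following; the helper names `ucs_symbol`, `charmap_bytes` and `charmap_line` are assumptions made for this illustration, and the single-space separators simply mirror the line as quoted in the log.

```python
def ucs_symbol(code_point: int) -> str:
    """Format a code point as <Uxxxx> (BMP) or <Uxxxxxxxx> (beyond the BMP)."""
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def charmap_bytes(code_point: int) -> str:
    """Render the UTF-8 encoding of a code point in /xNN notation.

    Surrogate code points cannot be encoded this way and raise
    UnicodeEncodeError.
    """
    return ''.join('/x{:02x}'.format(byte)
                   for byte in chr(code_point).encode('utf-8'))


def charmap_line(code_point: int, name: str) -> str:
    """Assemble one charmap entry: symbol, encoded bytes, character name."""
    return '{} {} {}'.format(
        ucs_symbol(code_point), charmap_bytes(code_point), name)


print(charmap_line(0x32FF, 'SQUARE ERA NAME REIWA'))
```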
2019-03-08  Bug 24307: Update to Unicode 12.0.0  Mike FABIAN
Unicode 12.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 12.0.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Some info about the number of characters added or changed:
Total added characters in newly generated CHARMAP: 554
Total added characters in newly generated WIDTH: 106
alpha: Missing 8 characters of old ctype in new ctype
(These are combining marks, apparently they were removed from alpha on purpose)
alpha: Added 295 characters in new ctype which were not in old ctype
combining: Missing 2 characters of old ctype in new ctype
(U+1CF2 VEDIC SIGN ARDHAVISARGA and U+1CF3 VEDIC SIGN ROTATED ARDHAVISARGA, these are now "Alphabetic" in Unicode 12.0.0)
combining: Added 37 characters in new ctype which were not in old ctype
combining_level3: Missing 2 characters of old ctype in new ctype
(U+1CF2 VEDIC SIGN ARDHAVISARGA and U+1CF3 VEDIC SIGN ROTATED ARDHAVISARGA, these are now "Alphabetic" in Unicode 12.0.0)
combining_level3: Added 26 characters in new ctype which were not in old ctype
graph: Added 554 characters in new ctype which were not in old ctype
lower: Added 6 characters in new ctype which were not in old ctype
print: Added 554 characters in new ctype which were not in old ctype
punct: Missing 29 characters of old ctype in new ctype
(These characters have all become "Alphabetic" in Unicode 12.0.0. Therefore, they are not in "punct" anymore (see: is_punct() in unicode_utils.py))
punct: Added 296 characters in new ctype which were not in old ctype
tolower: Added 7 characters in new ctype which were not in old ctype
totitle: Added 7 characters in new ctype which were not in old ctype
toupper: Added 7 characters in new ctype which were not in old ctype
upper: Added 7 characters in new ctype which were not in old ctype
[BZ #24307]
* localedata/unicode-gen/Makefile (UNICODE_VERSION): Set to 12.0.0.
* localedata/unicode-gen/DerivedCoreProperties.txt: Update to Unicode 12.0.0.
* localedata/unicode-gen/EastAsianWidth.txt: Likewise.
* localedata/unicode-gen/PropList.txt: Likewise.
* localedata/unicode-gen/UnicodeData.txt: Likewise.
* localedata/unicode-gen/ctype_compatibility_test_cases.py: U+108D became "Alphabetic" in Unicode 12.0.0. Adapt test case.
* localedata/charmaps/UTF-8: Regenerate.
* localedata/locales/i18n_ctype: Likewise.
* localedata/locales/tr_TR: Likewise.
* localedata/locales/translit_circle: Likewise.
* localedata/locales/translit_cjk_compat: Likewise.
* localedata/locales/translit_combining: Likewise.
* localedata/locales/translit_compat: Likewise.
* localedata/locales/translit_font: Likewise.
* localedata/locales/translit_fraction: Likewise.
2019-01-01  Update copyright dates with scripts/update-copyrights.  Joseph Myers
* All files with FSF copyright notices: Update copyright dates using scripts/update-copyrights.
* locale/programs/charmap-kw.h: Regenerated.
* locale/programs/locfile-kw.h: Likewise.
2018-07-10  Put the correct Unicode version number 11.0.0 into the generated files  Mike FABIAN
In some places there was still the old Unicode version 10.0.0 in the files.
* localedata/charmaps/UTF-8: Use correct Unicode version 11.0.0 in comment.
* localedata/locales/i18n_ctype: Use correct Unicode version in comments and headers.
* localedata/unicode-gen/utf8_gen.py: Add option to specify Unicode version
* localedata/unicode-gen/Makefile: Use option to specify Unicode version for utf8_gen.py
2018-07-04  Bug 23308: Update to Unicode 11.0.0  Mike FABIAN
Unicode 11.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 11.0.0, using the generator scripts contributed by Mike FABIAN (Red Hat).
Some info about the number of characters added:
Total added characters in newly generated CHARMAP: 684
Total added characters in newly generated WIDTH: 119
alpha: Added 380 characters in new ctype which were not in old ctype
combining: Added 56 characters in new ctype which were not in old ctype
combining_level3: Added 37 characters in new ctype which were not in old ctype
graph: Added 684 characters in new ctype which were not in old ctype
lower: Added 82 characters in new ctype which were not in old ctype
print: Added 684 characters in new ctype which were not in old ctype
punct: Added 304 characters in new ctype which were not in old ctype
tolower: Added 79 characters in new ctype which were not in old ctype
totitle: Added 33 characters in new ctype which were not in old ctype
toupper: Added 79 characters in new ctype which were not in old ctype
upper: Added 79 characters in new ctype which were not in old ctype
No characters were removed.
[BZ #23308]
* unicode-gen/Makefile (UNICODE_VERSION): Set to 11.0.0.
* localedata/unicode-gen/DerivedCoreProperties.txt: Update to Unicode 11.0.0.
* localedata/unicode-gen/EastAsianWidth.txt: likewise.
* localedata/unicode-gen/PropList.txt: likewise.
* localedata/unicode-gen/UnicodeData.txt: likewise.
* localedata/charmaps/UTF-8: Regenerate.
* localedata/locales/i18n_ctype: likewise.
* localedata/locales/tr_TR: likewise.
* localedata/locales/translit_circle: likewise.
* localedata/locales/translit_cjk_compat: likewise.
* localedata/locales/translit_combining: likewise.
* localedata/locales/translit_compat: likewise.
* localedata/locales/translit_font: likewise.
* localedata/locales/translit_fraction: likewise.
2018-01-01  Update copyright dates with scripts/update-copyrights.  Joseph Myers
* All files with FSF copyright notices: Update copyright dates using scripts/update-copyrights.
* locale/programs/charmap-kw.h: Regenerated.
* locale/programs/locfile-kw.h: Likewise.
2017-10-31  localedata: Once again correct and regenerate i18n_ctype.  Rafal Luzynski
Following the previous work by Carlos O'Donell the category of LC_CTYPE is correctly set to "i18n:2012" rather than "unicode:2014" and the i18n_ctype file is once again regenerated from scratch to make sure it does not contain any manual additions except the copyright message.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* localedata/unicode-gen/gen_unicode_ctype.py (output_head): category of LC_CTYPE set to "i18n:2012".
* localedata/locales/i18n_ctype: Regenerate.
2017-10-25  localedata: Fix unicode-gen check target.  Carlos O'Donell
After the transition to generating a distinct file for Unicode ctype information e.g. i18n_ctype, the check target was left with the wrong target name. This patch fixes the check target and regenerates the files with more information than previously used, filling in the LC_IDENTIFICATION data.
Tested on x86_64 by regenerating from Unicode source files, and running checks. Tested by subsequently rebuilding all locales. No regressions in testsuite.
Signed-off-by: Carlos O'Donell <carlos@redhat.com>
Reported-by: Rafal Luzynski <digitalfreak@lingonborough.com>
2017-10-13  localedata: Reorganize Unicode LC_CTYPE inclusion.  Carlos O'Donell
The commit does the following things:
* Move non-transliteration Unicode generated data to i18n_ctype.
* Copy the i18n_ctype data into i18n and add transliteration.
In the future, any locale which needs Unicode LC_CTYPE data can also just use `copy i18n_ctype` and get the base character classes and maps without transliteration.
Tested by compiling all the locales and my prototype C.UTF-8 which uses it.
Signed-off-by: Carlos O'Donell <carlos@redhat.com>
2017-09-06  Improve utf8_gen.py to set the width for characters with Prepended_Concatenation_Mark property to 1  Mike FABIAN
[BZ #22070]
* localedata/unicode-gen/utf8_gen.py: Set the width for characters with Prepended_Concatenation_Mark property to 1
* localedata/charmaps/UTF-8: Updated using the improved script.
2017-09-06  Write all ranges of neighbouring characters with the same width using the range notation in charmaps/UTF-8  Mike FABIAN
Writing ranges of neighbouring characters with the same width like this
<U000E0100>...<U000E01EF> 0
in charmaps/UTF-8 is more efficient than writing many single character lines like:
<U000E0100> 0
<U000E0101> 0
...
[BZ #21750]
* unicode-gen/utf8_gen.py: Write all ranges of neighbouring characters with the same width using the range notation in charmaps/UTF-8.
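The range notation described in the entry above amounts to run-length grouping of adjacent code points that share a width. The following is a minimal sketch of that idea, not the code in utf8_gen.py; the function names and the single-space separator are assumptions made for illustration.

```python
from typing import Dict, List


def ucs_symbol(code_point: int) -> str:
    """Format a code point as <Uxxxx> or, beyond the BMP, <Uxxxxxxxx>."""
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def format_entry(first: int, last: int, width: int) -> str:
    """Render a single character or a first...last range with its width."""
    if first == last:
        return '{} {}'.format(ucs_symbol(first), width)
    return '{}...{} {}'.format(ucs_symbol(first), ucs_symbol(last), width)


def compress_widths(widths: Dict[int, int]) -> List[str]:
    """Merge neighbouring code points with equal width into range lines."""
    lines = []
    first = last = None
    for code_point in sorted(widths):
        same_run = (first is not None
                    and code_point == last + 1
                    and widths[code_point] == widths[first])
        if same_run:
            last = code_point          # extend the current run
            continue
        if first is not None:
            lines.append(format_entry(first, last, widths[first]))
        first = last = code_point      # start a new run
    if first is not None:
        lines.append(format_entry(first, last, widths[first]))
    return lines


# The 240 code points from the commit message collapse into one line.
example = {0xE0100 + offset: 0 for offset in range(0xF0)}
print(compress_widths(example))
```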
2017-08-17  Resolve some historically special cases of ambiguous width  Thorsten Glaser
[BZ #21750]
* unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
* unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
* unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
* unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
2017-08-17  Handle more cases of combining characters  Thorsten Glaser
[BZ #21750]
* unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
2017-08-17  UnicodeData has precedence over EastAsianWidth  Thorsten Glaser
[BZ #19852]
[BZ #21750]
* unicode-gen/utf8_gen.py: Process EastAsianWidth lines before UnicodeData lines so the latter have precedence; remove hack to group output by EastAsianWidth ranges.
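The precedence rule in the entry above can be pictured as two passes over one table, where the later pass overwrites the earlier one. The sketch below is an illustration under stated assumptions rather than the logic of utf8_gen.py: it assumes East Asian Width values W and F map to width 2, it treats only categories Me and Mn as zero-width (as the neighbouring entry describes), and all function names are hypothetical.

```python
from typing import Dict


def build_width_table(east_asian_width: Dict[int, str],
                      categories: Dict[int, str]) -> Dict[int, int]:
    """Build a code point -> width table; later passes take precedence."""
    widths = {}
    # Pass 1: EastAsianWidth data. Wide and Fullwidth get width 2
    # (an assumption of this sketch).
    for code_point, eaw in east_asian_width.items():
        if eaw in ('W', 'F'):
            widths[code_point] = 2
    # Pass 2: UnicodeData categories. Enclosing and nonspacing marks are
    # combining, so they get width 0 and override anything from pass 1.
    for code_point, category in categories.items():
        if category in ('Me', 'Mn'):
            widths[code_point] = 0
    return widths


def width_of(code_point: int, widths: Dict[int, int]) -> int:
    """Characters absent from the table default to width 1."""
    return widths.get(code_point, 1)


# Toy input: a wide letter, a mark that is also listed as wide, and a
# character that appears in neither special case.
table = build_width_table(
    {0x3042: 'W', 0x302A: 'W'},
    {0x3042: 'Lo', 0x302A: 'Mn', 0x0041: 'Lu'})
print(width_of(0x3042, table), width_of(0x302A, table), width_of(0x0041, table))
```

Running the passes in this order means no explicit conflict check is needed: whichever source is processed last wins.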
2017-06-22  Bug 21533: Update to Unicode 10.0.0  Mike FABIAN
* Unicode 10.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 10.0.0, using generator scripts contributed by Mike FABIAN (Red Hat).
2017-02-21  Bug 20313: Update to Unicode 9.0.0  Mike FABIAN
* Unicode 9.0.0 Support: Character encoding, character type info, and transliteration tables are all updated to Unicode 9.0.0, using generator scripts contributed by Mike FABIAN (Red Hat).
2017-01-01  Update copyright dates with scripts/update-copyrights.  Joseph Myers
2016-06-11  unicode-gen: include standard comment file header  Mike Frysinger
We deployed this header to all the locale files, so make sure we include it in the generated ones too so we don't lose it.
2016-01-04  Update copyright dates with scripts/update-copyrights.  Joseph Myers
2015-12-11  Automate LC_CTYPE generation for tr_TR, update to Unicode 8.0.0 (bug 18491).  Joseph Myers
This patch makes the automation of Unicode LC_CTYPE generation also support generating the modified LC_CTYPE used for Turkish (where case conversions of 'i' and 'I' differ from ASCII conventions), so allowing that to be more readily kept in sync for future Unicode updates. The patch includes the locale update generated by the scripts.
Tested for x86_64.
[BZ #18491]
* unicode-gen/unicode_utils.py (to_upper_turkish): New function.
(to_lower_turkish): Likewise.
* unicode-gen/gen_unicode_ctype.py (output_tables): Support producing output with Turkish case conversions.
(--turkish): New command-line option.
* unicode-gen/Makefile (GENERATED): Add tr_TR.
(tr_TR): New rule.
* locales/tr_TR: Regenerate LC_CTYPE.
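The Turkish difference mentioned in the entry above is that 'i' uppercases to U+0130 (capital I with dot above) and 'I' lowercases to U+0131 (dotless i). The sketch below illustrates the idea of overriding a default mapping for those code points. It is not the implementation of `to_upper_turkish` or `to_lower_turkish` in unicode_utils.py; the names `turkish_upper` and `turkish_lower` are hypothetical, and the fallback through Python's own `str.upper` and `str.lower` is an assumption made to keep the example self-contained.

```python
def _single(mapped: str, original: int) -> int:
    """Use a mapping only if it yields exactly one code point."""
    return ord(mapped) if len(mapped) == 1 else original


def turkish_upper(code_point: int) -> int:
    """Uppercase mapping with the Turkish rule for 'i'."""
    if code_point == 0x0069:          # i -> capital I with dot above
        return 0x0130
    return _single(chr(code_point).upper(), code_point)


def turkish_lower(code_point: int) -> int:
    """Lowercase mapping with the Turkish rules for 'I' and U+0130."""
    if code_point == 0x0049:          # I -> dotless i
        return 0x0131
    if code_point == 0x0130:          # capital I with dot above -> i
        return 0x0069
    return _single(chr(code_point).lower(), code_point)


for cp in (0x0069, 0x0049, 0x0061):
    print(hex(cp), hex(turkish_upper(cp)), hex(turkish_lower(cp)))
```

Keeping the exceptions as explicit code point checks in front of a generic mapping makes it straightforward to regenerate the locale whenever the underlying Unicode data changes.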
2015-12-10  Update to Unicode 8.0.0.  Mike FABIAN
Update __STDC_ISO_10646__ to 201505L for Unicode 8.0.0. Update character encoding, ctype, and transliteration tables. New scripts autogenerate transliteration tables.
2015-12-09  Update transliteration support to Unicode 7.0.0.  Carlos O'Donell
The transliteration files are now autogenerated from upstream Unicode data.
2015-02-23  Amendments to Unicode 7 update.  Alexandre Oliva
for ChangeLog
* include/stdc-predef.h (__STDC_ISO_10646__): Update to 201304L, for Unicode 7.
for localedata/ChangeLog
* unicode-gen/ctype_compatibility.py: Use date ranges in copyright notice.
* unicode-gen/ctype_compatibility_test_cases.py: Likewise.
* unicode-gen/gen_unicode_ctype.py: Likewise.
* unicode-gen/utf8_compatibility.py: Likewise.
* unicode-gen/utf8_gen.py: Likewise. Use upper case for global variables, use tuples for global constant arrays. From Mike FABIAN. Suggested by Mike Frysinger <vapier@gentoo.org>.
2015-02-20  Unicode 7.0.0 update; added generator scripts.  Alexandre Oliva
for localedata/ChangeLog
[BZ #17588]
[BZ #13064]
[BZ #14094]
[BZ #17998]
* unicode-gen/Makefile: New.
* unicode-gen/unicode-license.txt: New, from Unicode.
* unicode-gen/UnicodeData.txt: New, from Unicode.
* unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
* unicode-gen/EastAsianWidth.txt: New, from Unicode.
* unicode-gen/gen_unicode_ctype.py: New generator, from Mike FABIAN <mfabian@redhat.com>.
* unicode-gen/ctype_compatibility.py: New verifier, from Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
* unicode-gen/ctype_compatibility_test_cases.py: New verifier module, from Mike FABIAN.
* unicode-gen/utf8_gen.py: New generator, from Pravin Satpute and Mike FABIAN.
* unicode-gen/utf8_compatibility.py: New verifier, from Pravin Satpute and Mike FABIAN.
* charmaps/UTF-8: Update.
* locales/i18n: Update.
* gen-unicode-ctype.c: Remove.
* tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns true for ordinal indicators.