1 -*-mode: text; coding: utf-8;-*-
2
3 Copyright (C) 2002-2016 Free Software Foundation, Inc.
4 See the end of the file for license conditions.
5
6 Importing a new Unicode Standard version into Emacs
7 -------------------------------------------------------------
8
9 Emacs uses the following files from the Unicode Character Database
10 (a.k.a. "UCD"):
11
12 . UnicodeData.txt
13 . Blocks.txt
14 . BidiMirroring.txt
15 . BidiBrackets.txt
16 . IVD_Sequences.txt
17
18 First, these files need to be copied into admin/unidata/, and then
19 Emacs should be rebuilt for them to take effect. Rebuilding Emacs
20 updates several derived files elsewhere in the Emacs source tree,
21 mainly in lisp/international/.
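Before rebuilding, it can help to confirm that every file in the list
above actually landed in admin/unidata/. The following is a minimal
sketch in Python, not part of the Emacs build; the function name
missing_ucd_files is hypothetical, and only the file names come from
these notes.

```python
from pathlib import Path

# File names taken from the list in these notes.
REQUIRED_UCD_FILES = (
    "UnicodeData.txt",
    "Blocks.txt",
    "BidiMirroring.txt",
    "BidiBrackets.txt",
    "IVD_Sequences.txt",
)


def missing_ucd_files(directory):
    """Return the required UCD file names that are absent from DIRECTORY."""
    root = Path(directory)
    return [name for name in REQUIRED_UCD_FILES if not (root / name).is_file()]
```

Run from the top of the source tree, missing_ucd_files("admin/unidata")
should return an empty list once all five files have been copied.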
22
23 When Emacs is rebuilt for the first time after importing the new
24 files, pay attention to any warning or error messages. In particular,
25 admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines
26 new bidirectional attributes of characters, because unidata-gen.el,
27 bidi.c and dispextern.h need to be updated in that case; failure to do
28 so will cause aborts in redisplay.
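The same kind of check can be approximated outside Emacs before the
rebuild. The sketch below is an illustration in Python, not the logic
of unidata-gen.el: it collects the Bidi_Class field (the fifth
semicolon-separated field of UnicodeData.txt) and reports any value
outside a known set. The set used here is the Unicode Standard's own
list as of Unicode 6.3, which is an assumption standing in for whatever
unidata-gen.el, bidi.c and dispextern.h currently know about.

```python
# Bidi_Class values as of Unicode 6.3, which added LRI, RLI, FSI and PDI.
# Assumption: this mirrors the Unicode Standard, not the Emacs sources.
KNOWN_BIDI_CLASSES = frozenset("""
    L R AL EN ES ET AN CS NSM BN B S WS ON
    LRE LRO RLE RLO PDF LRI RLI FSI PDI
""".split())


def unknown_bidi_classes(lines, known=KNOWN_BIDI_CLASSES):
    """Return the set of bidi classes in UnicodeData.txt LINES not in KNOWN."""
    found = set()
    for line in lines:
        fields = line.rstrip("\n").split(";")
        # Field 4 (counting from 0) is Bidi_Class.
        if len(fields) > 4 and fields[4]:
            found.add(fields[4])
    return found - known
```

An empty result for the new UnicodeData.txt suggests that no new
bidirectional attributes were introduced; a non-empty one names the
values that would need support in the files mentioned above.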
29
30 Next, review the changes in UnicodeData.txt vs the previous version
31 used by Emacs. Any changes, be they the introduction of new scripts or
32 the addition of codepoints to existing scripts, might need corresponding
33 changes in the data used for filling the category-table, case-table,
34 and char-width-table. The additional scripts should cause automatic
35 updates in charscript.el, but it is a good idea to look at the results
36 and see if any changes in admin/unidata/blocks.awk are required.
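One way to make that review manageable is to reduce the two versions of
UnicodeData.txt to the codepoints that were added or whose fields
changed. The following Python sketch is only an aid for the reviewer
and is not used by the Emacs build; the function names are
hypothetical. Note that "<..., First>" and "<..., Last>" range markers
are compared like ordinary lines, so codepoints inside a range are not
expanded.

```python
def parse_unicode_data(lines):
    """Map each codepoint in UnicodeData.txt LINES to its list of fields."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = fields[1:]
    return table


def diff_unicode_data(old_lines, new_lines):
    """Return (added, changed) codepoints between two UnicodeData.txt versions."""
    old = parse_unicode_data(old_lines)
    new = parse_unicode_data(new_lines)
    added = sorted(cp for cp in new if cp not in old)
    changed = sorted(cp for cp in new if cp in old and new[cp] != old[cp])
    return added, changed
```

The "added" list points at codepoints that may need entries in the
category, case and width data; the "changed" list points at existing
characters whose properties were revised.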
37
38 Any new scripts added by UnicodeData.txt will also need updates to
39 script-representative-chars defined in fontset.el, and also the list
40 of OTF script tags in otf-script-alist, whose source is on this page:
41
42 https://www.microsoft.com/typography/otspec/scripttags.htm
43
44 Other databases in fontset.el might also need to be updated.
45
46 The function 'ucs-names', defined in lisp/international/mule-cmds.el,
47 might need to be updated because it knows about used and unused ranges
48 of Unicode codepoints, which a new release of the Unicode Standard
49 could change.
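To see which ranges a new UnicodeData.txt actually lists, so they can
be compared by hand with the ranges 'ucs-names' assumes, something like
the following Python sketch may help. It is an illustration only and
says nothing about how 'ucs-names' itself is implemented; the function
name assigned_ranges is hypothetical. It treats each "<..., First>" /
"<..., Last>" pair as one range and merges adjacent codepoints.

```python
def assigned_ranges(lines):
    """Return sorted (first, last) codepoint ranges listed in UnicodeData.txt."""
    points = []          # (first, last) pairs in file order
    range_start = None   # pending start of a "<..., First>" entry
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            range_start = code
        elif name.endswith(", Last>") and range_start is not None:
            points.append((range_start, code))
            range_start = None
        else:
            points.append((code, code))
    # Merge entries that touch or overlap into contiguous ranges.
    merged = []
    for first, last in sorted(points):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged
```

Running this on the old and the new UnicodeData.txt and comparing the
two lists shows which previously unused ranges have become used.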
50
51 Problems, fixmes and other Unicode-related issues
52 -------------------------------------------------------------
53
54 Notes by fx to record various things of variable importance. Handa
55 needs to check them -- don't take them too seriously, especially with
56 regard to completeness.
57
58 * SINGLE_BYTE_CHAR_P returns true for Latin-1 characters, which has
59 undesirable effects. E.g.:
60 (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil
61 (multibyte-string-p (concat [?£])) => nil
62 (text-char-description ?£) => "M-#"
63
64 These examples are all fixed by the change of 2002-10-14, but
65 questionable uses of SINGLE_BYTE_CHAR_P remain in the
66 code (keymap.c and print.c).
67
68 * Rationalize character syntax and its relationship to the Unicode
69 database. (Applies mainly to symbol and punctuation syntax.)
70
71 * Fontset handling and customization need work. We want to relate
72 fonts to scripts, probably based on the Unicode blocks. The
73 presence of small-repertoire 10646-encoded fonts in XFree 4 is a
74 pain, not currently worked round.
75
76 With the change on 2002-07-26, multiple fonts can be
77 specified in a fontset for a specific range of characters.
78 Each range can also be specified by script. Before using
79 ISO10646 fonts, Emacs checks their repertoires to avoid
80 fonts that don't have a glyph for a specific character.
81
82 fx has worked on fontset customization, but was stymied by
83 basic problems with the way the default face is dealt with
84 (and something else, I think). This needs revisiting.
85
86 * Work is also needed on charset and coding system priorities.
87
88 * The relevant bits of latin1-disp.el need porting (and probably
89 re-naming/updating). See also cyril-util.el.
90
91 * Quail files need more work now that the encoding is largely irrelevant.
92
93 * What to do with the old coding categories stuff?
94
95 * The preferred-coding-system property of charsets should probably be
96 junked unless it can be made more useful now.
97
98 * find-multibyte-characters needs looking at.
99
100 * Implement Korean cp949/UHC, BIG5-HKSCS and any other important missing
101 charsets.
102
103 * Lazy-load tables for unify-charset somehow?
104
105 Actually, Emacs clears out all charset maps and unify-map just
106 before dumping, and they are loaded again on demand by the
107 dumped emacs. However, those maps (char tables) generated while
108 temacs is running can't be removed from the dumped emacs.
109
110 * iso-2022 charsets get unified on i/o.
111
112 With the change on 2003-01-06, decoding routines put the 'charset'
113 property onto decoded text, and the iso-2022 encoder pays attention
114 to it. Thus, for instance, reading and writing with
115 iso-2022-7bit preserve the original designation sequences.
116 Would the property name 'preferred-charset' be better?
117
118 We may have to use this property to choose a font.
119
120 * Revisit locale processing: look at treating the language and
121 charset parts separately. (Language should affect things like
122 spelling and calendar, but that's not a Unicode issue.)
123
124 * Handle Unicode combining characters usefully, e.g. diacritics, and
125 handle more scripts specifically (à la Devanagari). There are
126 issues with canonicalization.
127
128 * We need tabular input methods, e.g. for maths symbols. (Not
129 specific to Unicode.)
130
131 * Need multibyte text in menus, e.g. for the above. (Not specific to
132 Unicode -- see Emacs etc/TODO, but now mostly works with gtk.)
133
134 * There's currently no support for Unicode normalization.
135
136 * Populate char-width-table correctly for Unicode characters and
137 worry about what happens when double-width charsets covering
138 non-CJK characters are unified.
139
140 * There are type errors lurking, e.g. in
141 Fcheck_coding_systems_region. Define ENABLE_CHECKING to find them.
142
143 * Old auto-save files, and similar files, such as Gnus drafts,
144 containing non-ASCII characters probably won't be re-read correctly.
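
The canonicalization issue mentioned in the items on combining
characters and normalization can be shown with a small example. The
sketch below uses Python's standard unicodedata module purely as an
illustration; it does not describe anything Emacs does. The same
visible character "é" can be stored either precomposed or as a base
letter plus a combining accent, and the two forms compare as different
strings unless both are normalized first. The helper name
canonically_equal is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here precomposed == decomposed is false, while
canonically_equal(precomposed, decomposed) is true, which is why
searching and comparison are affected when normalization is missing.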
145
146
147 Source file encoding
148 --------------------
149
150 Most Emacs source files are encoded in UTF-8 (or in ASCII, which is a
151 subset), but there are a few exceptions, listed below. Perhaps
152 someday many of these files will be converted to UTF-8, for
153 convenience when using tools like 'grep -r', but this might need
154 nontrivial changes to the build process.
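
A rough way to find candidates for the list of exceptions is to test
whether each file's bytes decode as UTF-8. The following Python sketch
is a hypothetical helper, not a tool shipped with Emacs, and both
function names are invented for illustration.

```python
def is_utf8(data):
    """Return True if the bytes DATA decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_files(paths):
    """Return those PATHS whose contents are not valid UTF-8."""
    result = []
    for path in paths:
        with open(path, "rb") as stream:
            if not is_utf8(stream.read()):
                result.append(path)
    return result
```

This check has a blind spot: files in iso-2022-7bit consist only of
7-bit bytes, so they pass as UTF-8 even though their text is not
readable that way. For the 8-bit encodings it works as expected; for
instance, the single byte written '\202' in octal is rejected by
is_utf8, and decodes to 'é' under cp850.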
155
156 * chinese-big5
157
158 These are verbatim copies of files taken from external sources.
159 They haven't been converted to UTF-8.
160
161 leim/CXTERM-DIC/4Corner.tit
162 leim/CXTERM-DIC/ARRAY30.tit
163 leim/CXTERM-DIC/ECDICT.tit
164 leim/CXTERM-DIC/ETZY.tit
165 leim/CXTERM-DIC/PY-b5.tit
166 leim/CXTERM-DIC/Punct-b5.tit
167 leim/CXTERM-DIC/QJ-b5.tit
168 leim/CXTERM-DIC/ZOZY.tit
169 leim/MISC-DIC/CTLau-b5.html
170 leim/MISC-DIC/cangjie-table.b5
171
172 * chinese-iso-8bit
173
174 These are verbatim copies of files taken from external sources.
175 They haven't been converted to UTF-8.
176
177 leim/CXTERM-DIC/CCDOSPY.tit
178 leim/CXTERM-DIC/Punct.tit
179 leim/CXTERM-DIC/QJ.tit
180 leim/CXTERM-DIC/SW.tit
181 leim/CXTERM-DIC/TONEPY.tit
182 leim/MISC-DIC/CTLau.html
183 leim/MISC-DIC/pinyin.map
184 leim/MISC-DIC/ziranma.cin
185
186 * cp850
187
188 This file contains non-ASCII characters in unibyte strings. When
189 editing a keyboard layout it's more convenient to see 'é' than
190 '\202', and the MS-DOS compiler requires the single byte if a
191 backslash escape is not being used.
192
193 src/msdos.c
194
195 * iso-2022-cn-ext
196
197 This file is externally generated from leim/MISC-DIC/cangjie-table.b5
198 by a Big5->CNS converter. It hasn't been converted to UTF-8.
199
200 leim/MISC-DIC/cangjie-table.cns
201
202 * japanese-iso-8bit
203
204 SKK-JISYO.L is a verbatim copy of a file taken from an external source.
205 It hasn't been converted to UTF-8.
206
207 leim/SKK-DIC/SKK-JISYO.L
208
209 * japanese-shift-jis
210
211 This is a verbatim copy of a file taken from an external source.
212 It hasn't been converted to UTF-8.
213
214 admin/charsets/mapfiles/cns2ucsdkw.txt
215
216 * iso-2022-7bit
217
218 This file switches between CJK charsets, and that switching cannot be encoded in UTF-8.
219
220 etc/HELLO
221
222 Each of these files contains just one CJK charset, but Emacs
223 currently has no easy way to specify set-charset-priority on a
224 per-file basis, so converting any of these files to UTF-8 might
225 change the file's appearance when viewed by an Emacs that is
226 operating in some other language environment.
227
228 etc/tutorials/TUTORIAL.ja
229 lisp/international/ja-dic-cnv.el
230 lisp/international/ja-dic-utl.el
231 lisp/international/kinsoku.el
232 lisp/international/kkc.el
233 lisp/international/titdic-cnv.el
234 lisp/language/japan-util.el
235 lisp/language/japanese.el
236 lisp/leim/quail/cyril-jis.el
237 lisp/leim/quail/hanja-jis.el
238 lisp/leim/quail/japanese.el
239 lisp/leim/quail/py-punct.el
240 lisp/leim/quail/pypunct-b5.el
241
242 This file contains just Chinese characters, and has the same problem.
243 Also, it contains characters that cannot be encoded in UTF-8.
244
245 lisp/international/titdic-cnv.el
246
247 * utf-8-emacs
248
249 These files contain characters that cannot be encoded in UTF-8.
250
251 lisp/language/ethio-util.el
252 lisp/language/ethiopic.el
253 lisp/language/ind-util.el
254 lisp/language/tibet-util.el
255 lisp/language/tibetan.el
256 lisp/leim/quail/ethiopic.el
257 lisp/leim/quail/tibetan.el
258
259 * binary files
260
261 These files contain binary data, and are not text files.
262 Some of the entries in this list are patterns, and stand for any
263 files with the listed extension.
264
265 *.gz
266 *.icns
267 *.ico
268 *.pbm
269 *.pdf
270 *.png
271 *.sig
272 etc/e/eterm-color
273 etc/package-keyring.gpg
274 msdos/emacs.pif
275 nextstep/GNUstep/Emacs.base/Resources/emacs.tiff
276 nt/icons/hand.cur
277
278 \f
279 This file is part of GNU Emacs.
280
281 GNU Emacs is free software: you can redistribute it and/or modify
282 it under the terms of the GNU General Public License as published by
283 the Free Software Foundation, either version 3 of the License, or
284 (at your option) any later version.
285
286 GNU Emacs is distributed in the hope that it will be useful,
287 but WITHOUT ANY WARRANTY; without even the implied warranty of
288 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
289 GNU General Public License for more details.
290
291 You should have received a copy of the GNU General Public License
292 along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>.