code.delx.au - gnu-emacs/blob - doc/lispref/nonascii.texi
(Text Representations): Rewrite to make consistent with Emacs 23
1 @c -*-texinfo-*-
2 @c This is part of the GNU Emacs Lisp Reference Manual.
3 @c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
4 @c 2005, 2006, 2007, 2008 Free Software Foundation, Inc.
5 @c See the file elisp.texi for copying conditions.
6 @setfilename ../../info/characters
7 @node Non-ASCII Characters, Searching and Matching, Text, Top
8 @chapter Non-@acronym{ASCII} Characters
9 @cindex multibyte characters
10 @cindex characters, multi-byte
11 @cindex non-@acronym{ASCII} characters
12
13 This chapter covers the special issues relating to characters and
14 how they are stored in strings and buffers.
15
16 @menu
17 * Text Representations:: How Emacs represents text.
18 * Converting Representations:: Converting unibyte to multibyte and vice versa.
19 * Selecting a Representation:: Treating a byte sequence as unibyte or multi.
20 * Character Codes:: How unibyte and multibyte relate to
21 codes of individual characters.
22 * Character Sets:: The space of possible character codes
23 is divided into various character sets.
24 * Chars and Bytes:: More information about multibyte encodings.
25 * Splitting Characters:: Converting a character to its byte sequence.
26 * Scanning Charsets:: Which character sets are used in a buffer?
27 * Translation of Characters:: Translation tables are used for conversion.
28 * Coding Systems:: Coding systems are conversions for saving files.
29 * Input Methods:: Input methods allow users to enter various
30 non-ASCII characters without special keyboards.
31 * Locales:: Interacting with the POSIX locale.
32 @end menu
33
34 @node Text Representations
35 @section Text Representations
36 @cindex text representation
37
38 Emacs buffers and strings support a large repertoire of characters
39 from many different scripts, so that users can type and display
40 text in almost any known written language.
41
42 @cindex character codepoint
43 @cindex codespace
44 @cindex Unicode
45 To support this multitude of characters and scripts, Emacs closely
46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
47 unique number, called a @dfn{codepoint}, to each and every character.
48 The range of codepoints defined by Unicode, or the Unicode
49 @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs
50 extends this range with codepoints in the range @code{3FFF80..3FFFFF},
51 which it uses for representing raw 8-bit bytes that cannot be
52 interpreted as characters. Thus, a character codepoint in Emacs is a
53 22-bit integer number.
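
As a rough check of these bounds, the largest codepoint can be compared
with the figures above.  This is only a sketch: it assumes Emacs 23 or
later, and the function @code{max-char} is not described in this
section.

@example
(max-char)
     @result{} 4194303
(= (max-char) #x3FFFFF)
     @result{} t
(< (max-char) (expt 2 22))
     @result{} t
@end example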
54
55 @cindex internal representation of characters
56 @cindex characters, representation in buffers and strings
57 @cindex multibyte text
58 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
59 that are codepoints of text characters within buffers and strings.
60 Rather, Emacs uses a variable-length internal representation of
61 characters that stores each character as a sequence of one to five 8-bit
62 bytes, depending on the magnitude of its codepoint@footnote{
63 This internal representation is based on one of the encodings defined
64 by the Unicode Standard, called @dfn{UTF-8}, for representing any
65 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
66 codepoints it uses for raw 8-bit bytes.}.
67 For example, any @acronym{ASCII} character takes up only 1 byte, a
68 Latin-1 character takes up 2 bytes, etc. We call this representation
69 of text @dfn{multibyte}, because it can use several bytes for a single
70 character.
71
72 Outside Emacs, characters can be represented in many different
73 encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
74 between these external encodings and the internal representation, as
75 appropriate, when it reads text into a buffer or a string, or when it
76 writes text to a disk file or passes it to some other process.
77
78 Occasionally, Emacs needs to hold and manipulate encoded text or
79 binary non-text data in its buffer or string. For example, when Emacs
80 visits a file, it first reads the file's text verbatim into a buffer,
81 and only then converts it to the internal representation. Before the
82 conversion, the buffer holds encoded text.
83
84 @cindex unibyte text
85 Encoded text is not really text, as far as Emacs is concerned, but
86 rather a sequence of raw 8-bit bytes. We call buffers and strings
87 that hold encoded text @dfn{unibyte} buffers and strings, because
88 Emacs treats them as a sequence of individual bytes. In particular,
89 Emacs usually displays unibyte buffers and strings as octal codes such
90 as @code{\237}. We recommend that you never use unibyte buffers and
91 strings except for manipulating encoded text or binary non-text data.
92
93 In a buffer, the buffer-local value of the variable
94 @code{enable-multibyte-characters} specifies the representation used.
95 The representation for a string is determined and recorded in the string
96 when the string is constructed.
97
98 @defvar enable-multibyte-characters
99 This variable specifies the current buffer's text representation.
100 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
101 it contains unibyte encoded text or binary non-text data.
102
103 You cannot set this variable directly; instead, use the function
104 @code{set-buffer-multibyte} to change a buffer's representation.
105 @end defvar
106
107 @defvar default-enable-multibyte-characters
108 This variable's value is entirely equivalent to @code{(default-value
109 'enable-multibyte-characters)}, and setting this variable changes that
110 default value. Setting the local binding of
111 @code{enable-multibyte-characters} in a specific buffer is not allowed,
112 but changing the default value is supported, and it is a reasonable
113 thing to do, because it has no effect on existing buffers.
114
115 The @samp{--unibyte} command line option does its job by setting the
116 default value to @code{nil} early in startup.
117 @end defvar
118
119 @defun position-bytes position
120 Buffer positions are measured in character units. This function
121 returns the byte-position corresponding to buffer position
122 @var{position} in the current buffer. This is 1 at the start of the
123 buffer, and counts upward in bytes. If @var{position} is out of
124 range, the value is @code{nil}.
125 @end defun
126
127 @defun byte-to-position byte-position
128 Return the buffer position, in character units, corresponding to
129 byte-position @var{byte-position} in the current buffer. If
130 @var{byte-position} is out of range, the value is @code{nil}.
131 @end defun
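
The two functions above can be tried together in a scratch buffer.
This is only a sketch: it assumes a multibyte temporary buffer that
holds the character with codepoint @code{#xE9} (two bytes in the
internal representation) followed by @samp{a}.

@example
(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string #xE9) "a")
  (list (position-bytes 1) (position-bytes 2)
        (position-bytes 3) (byte-to-position 3)))
     @result{} (1 3 4 2)
@end example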
132
133 @defun multibyte-string-p string
134 Return @code{t} if @var{string} is a multibyte string, @code{nil}
135 otherwise.
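
For illustration (a sketch, assuming the strings are built as shown
rather than read from a file or a buffer):

@example
(multibyte-string-p "abc")
     @result{} nil
(multibyte-string-p (string #xE9))
     @result{} t
@end example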
136 @end defun
137
138 @defun string-bytes string
139 @cindex string, number of bytes
140 This function returns the number of bytes in @var{string}.
141 If @var{string} is a multibyte string, this can be greater than
142 @code{(length @var{string})}.
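
A short sketch of the difference between the two counts, assuming the
strings are built with the function @code{string} from the codepoints
shown:

@example
(string-bytes (string ?a))
     @result{} 1
(string-bytes (string #xE9))
     @result{} 2
(length (string #xE9))
     @result{} 1
(string-bytes (string #x20AC))
     @result{} 3
@end example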
143 @end defun
144
145 @defun unibyte-string &rest bytes
146 This function concatenates all of its arguments @var{bytes} and makes the
147 result a unibyte string.
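
For example, as a sketch with arbitrarily chosen byte values:

@example
(unibyte-string 104 105)
     @result{} "hi"
(multibyte-string-p (unibyte-string 200 233))
     @result{} nil
(string-bytes (unibyte-string 200 233))
     @result{} 2
@end example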
148 @end defun
149
150 @node Converting Representations
151 @section Converting Text Representations
152
153 Emacs can convert unibyte text to multibyte; it can also convert
154 multibyte text to unibyte, though this conversion loses information. In
155 general these conversions happen when inserting text into a buffer, or
156 when putting text from several strings together in one string. You can
157 also explicitly convert a string's contents to either representation.
158
159 Emacs chooses the representation for a string based on the text that
160 it is constructed from. The general rule is to convert unibyte text to
161 multibyte text when combining it with other multibyte text, because the
162 multibyte representation is more general and can hold whatever
163 characters the unibyte text has.
164
165 When inserting text into a buffer, Emacs converts the text to the
166 buffer's representation, as specified by
167 @code{enable-multibyte-characters} in that buffer. In particular, when
168 you insert multibyte text into a unibyte buffer, Emacs converts the text
169 to unibyte, even though this conversion cannot in general preserve all
170 the characters that might be in the multibyte text. The other natural
171 alternative, to convert the buffer contents to multibyte, is not
172 acceptable because the buffer's representation is a choice made by the
173 user that cannot be overridden automatically.
174
175 Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
176 unchanged, and likewise character codes 128 through 159. It converts
177 the non-@acronym{ASCII} codes 160 through 255 by adding the value
178 @code{nonascii-insert-offset} to each character code. By setting this
179 variable, you specify which character set the unibyte characters
180 correspond to (@pxref{Character Sets}). For example, if
181 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
182 'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
183 correspond to Latin 1. If it is 2688, which is @code{(- (make-char
184 'greek-iso8859-7) 128)}, then they correspond to Greek letters.
185
186 Converting multibyte text to unibyte is simpler: it discards all but
187 the low 8 bits of each character code. If @code{nonascii-insert-offset}
188 has a reasonable value, corresponding to the beginning of some character
189 set, this conversion is the inverse of the other: converting unibyte
190 text to multibyte and back to unibyte reproduces the original unibyte
191 text.
192
193 @defvar nonascii-insert-offset
194 This variable specifies the amount to add to a non-@acronym{ASCII} character
195 when converting unibyte text to multibyte. It also applies when
196 @code{self-insert-command} inserts a character in the unibyte
197 non-@acronym{ASCII} range, 128 through 255. However, the functions
198 @code{insert} and @code{insert-char} do not perform this conversion.
199
200 The right value to use to select character set @var{cs} is @code{(-
201 (make-char @var{cs}) 128)}. If the value of
202 @code{nonascii-insert-offset} is zero, then conversion actually uses the
203 value for the Latin 1 character set, rather than zero.
204 @end defvar
205
206 @defvar nonascii-translation-table
207 This variable provides a more general alternative to
208 @code{nonascii-insert-offset}. You can use it to specify independently
209 how to translate each code in the range of 128 through 255 into a
210 multibyte character. The value should be a char-table, or @code{nil}.
211 If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
212 @end defvar
213
214 The next three functions either return the argument @var{string}, or a
215 newly created string with no text properties.
216
217 @defun string-make-unibyte string
218 This function converts the text of @var{string} to unibyte
219 representation, if it isn't already, and returns the result. If
220 @var{string} is a unibyte string, it is returned unchanged. Multibyte
221 character codes are converted to unibyte according to
222 @code{nonascii-translation-table} or, if that is @code{nil}, using
223 @code{nonascii-insert-offset}. If the lookup in the translation table
224 fails, this function takes just the low 8 bits of each character.
225 @end defun
226
227 @defun string-make-multibyte string
228 This function converts the text of @var{string} to multibyte
229 representation, if it isn't already, and returns the result. If
230 @var{string} is a multibyte string or consists entirely of
231 @acronym{ASCII} characters, it is returned unchanged. In particular,
232 if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
233 string is unibyte. (When the characters are all @acronym{ASCII},
234 Emacs primitives will treat the string the same way whether it is
235 unibyte or multibyte.) If @var{string} is unibyte and contains
236 non-@acronym{ASCII} characters, the function
237 @code{unibyte-char-to-multibyte} is used to convert each unibyte
238 character to a multibyte character.
239 @end defun
240
241 @defun string-to-multibyte string
242 This function returns a multibyte string containing the same sequence
243 of character codes as @var{string}. Unlike
244 @code{string-make-multibyte}, this function unconditionally returns a
245 multibyte string. If @var{string} is a multibyte string, it is
246 returned unchanged.
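
The difference from @code{string-make-multibyte} shows up with a
unibyte argument that is entirely @acronym{ASCII}.  The following is a
sketch, assuming the literal @code{"abc"} is a unibyte string:

@example
(multibyte-string-p (string-to-multibyte "abc"))
     @result{} t
(multibyte-string-p (string-make-multibyte "abc"))
     @result{} nil
@end example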
247 @end defun
248
249 @defun multibyte-char-to-unibyte char
250 This function converts the multibyte character @var{char} to a unibyte
251 character, based on @code{nonascii-translation-table} and
252 @code{nonascii-insert-offset}.
253 @end defun
254
255 @defun unibyte-char-to-multibyte char
256 This function converts the unibyte character @var{char} to a multibyte
257 character, based on @code{nonascii-translation-table} and
258 @code{nonascii-insert-offset}.
259 @end defun
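
Since @acronym{ASCII} characters are left unchanged by these
conversions, a minimal sanity check looks like this (a sketch using the
character @samp{A}, code 65):

@example
(unibyte-char-to-multibyte ?A)
     @result{} 65
(multibyte-char-to-unibyte ?A)
     @result{} 65
@end example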
260
261 @node Selecting a Representation
262 @section Selecting a Representation
263
264 Sometimes it is useful to examine an existing buffer or string as
265 multibyte when it was unibyte, or vice versa.
266
267 @defun set-buffer-multibyte multibyte
268 Set the representation type of the current buffer. If @var{multibyte}
269 is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
270 is @code{nil}, the buffer becomes unibyte.
271
272 This function leaves the buffer contents unchanged when viewed as a
273 sequence of bytes. As a consequence, it can change the contents viewed
274 as characters; a sequence of two bytes which is treated as one character
275 in multibyte representation will count as two characters in unibyte
276 representation. Character codes 128 through 159 are an exception. They
277 are represented by one byte in a unibyte buffer, but when the buffer is
278 set to multibyte, they are converted to two-byte sequences, and vice
279 versa.
280
281 This function sets @code{enable-multibyte-characters} to record which
282 representation is in use. It also adjusts various data in the buffer
283 (including overlays, text properties and markers) so that they cover the
284 same text as they did before.
285
286 You cannot use @code{set-buffer-multibyte} on an indirect buffer,
287 because indirect buffers always inherit the representation of the
288 base buffer.
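
The following sketch illustrates the change in character count.  It
assumes a temporary buffer containing the single character with
codepoint @code{#xE9}, which occupies two bytes in the internal
representation:

@example
(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string #xE9))
  (let ((before (buffer-size)))
    (set-buffer-multibyte nil)
    (list before (buffer-size) enable-multibyte-characters)))
     @result{} (1 2 nil)
@end example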
289 @end defun
290
291 @defun string-as-unibyte string
292 This function returns a string with the same bytes as @var{string} but
293 treating each byte as a character. This means that the value may have
294 more characters than @var{string} has.
295
296 If @var{string} is already a unibyte string, then the value is
297 @var{string} itself. Otherwise it is a newly created string, with no
298 text properties. If @var{string} is multibyte, any characters it
299 contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
300 are converted to the corresponding single byte.
301 @end defun
302
303 @defun string-as-multibyte string
304 This function returns a string with the same bytes as @var{string} but
305 treating each multibyte sequence as one character. This means that the
306 value may have fewer characters than @var{string} has.
307
308 If @var{string} is already a multibyte string, then the value is
309 @var{string} itself. Otherwise it is a newly created string, with no
310 text properties. If @var{string} is unibyte and contains any individual
311 8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
312 the corresponding multibyte character of charset @code{eight-bit-control}
313 or @code{eight-bit-graphic}.
314 @end defun
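
As a sketch of how these two functions reinterpret the same bytes,
assume the two-byte sequence @code{#xC3 #xA9}, which is the internal
multibyte form of the character with codepoint @code{#xE9} (233):

@example
(length (string-as-unibyte (string #xE9)))
     @result{} 2
(length (string-as-multibyte (unibyte-string #xC3 #xA9)))
     @result{} 1
(aref (string-as-multibyte (unibyte-string #xC3 #xA9)) 0)
     @result{} 233
@end example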
315
316 @node Character Codes
317 @section Character Codes
318 @cindex character codes
319
320 The unibyte and multibyte text representations use different
321 character codes. The valid character codes for unibyte representation
322 range from 0 to 255---the values that can fit in one byte. The valid
323 character codes for multibyte representation range from 0 to 4194303,
324 but not all values in that range are valid. The values 128 through
325 255 do not usually show up in multibyte text, but they can occur if
326 you do explicit encoding and decoding (@pxref{Explicit Encoding}).
327 Some other character codes cannot occur at all in multibyte text.
328 Only the @acronym{ASCII} codes 0 through 127 are completely legitimate
329 in both representations.
330
331 @defun characterp charcode
332 This returns @code{t} if @var{charcode} is a valid character, and
333 @code{nil} otherwise.
334
335 @example
336 (characterp 65)
337 @result{} t
338 (characterp 256)
339 @result{} t
340 (characterp 4194303)
341 @result{} t
342 (characterp 4194304)
343 @result{} nil
344 @end example
345 @end defun
346
347 @node Character Sets
348 @section Character Sets
349 @cindex character sets
350
351 Emacs classifies characters into various @dfn{character sets}, each of
352 which has a name which is a symbol. Each character belongs to one and
353 only one character set.
354
355 In general, there is one character set for each distinct script. For
356 example, @code{latin-iso8859-1} is one character set,
357 @code{greek-iso8859-7} is another, and @code{ascii} is another. An
358 Emacs character set can hold at most 9025 characters; therefore, in some
359 cases, characters that would logically be grouped together are split
360 into several character sets. For example, one set of Chinese
361 characters, generally known as Big 5, is divided into two Emacs
362 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
363
364 @acronym{ASCII} characters are in character set @code{ascii}. The
365 non-@acronym{ASCII} characters 128 through 159 are in character set
366 @code{eight-bit-control}, and codes 160 through 255 are in character set
367 @code{eight-bit-graphic}.
368
369 @defun charsetp object
370 Returns @code{t} if @var{object} is a symbol that names a character set,
371 @code{nil} otherwise.
372 @end defun
373
374 @defvar charset-list
375 The value is a list of all defined character set names.
376 @end defvar
377
378 @defun charset-list
379 This function returns the value of @code{charset-list}. It is only
380 provided for backward compatibility.
381 @end defun
382
383 @defun char-charset character
384 This function returns the name of the character set that @var{character}
385 belongs to, or the symbol @code{unknown} if @var{character} is not a
386 valid character.
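
For instance, restricted to @acronym{ASCII} (a sketch; the symbol
@code{no-such-charset} is a made-up name used only to show the
@code{nil} case of @code{charsetp}):

@example
(char-charset ?A)
     @result{} ascii
(charsetp (char-charset ?A))
     @result{} t
(charsetp 'no-such-charset)
     @result{} nil
@end example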
387 @end defun
388
389 @defun charset-plist charset
390 This function returns the charset property list of the character set
391 @var{charset}. Although @var{charset} is a symbol, this is not the same
392 as the property list of that symbol. Charset properties are used for
393 special purposes within Emacs.
394 @end defun
395
396 @deffn Command list-charset-chars charset
397 This command displays a list of characters in the character set
398 @var{charset}.
399 @end deffn
400
401 @node Chars and Bytes
402 @section Characters and Bytes
403 @cindex bytes and characters
404
405 @cindex introduction sequence (of character)
406 @cindex dimension (of character set)
407 In multibyte representation, each character occupies one or more
408 bytes. Each character set has an @dfn{introduction sequence}, which is
409 normally one or two bytes long. (Exception: the @code{ascii} character
410 set and the @code{eight-bit-graphic} character set have a zero-length
411 introduction sequence.) The introduction sequence is the beginning of
412 the byte sequence for any character in the character set. The rest of
413 the character's bytes distinguish it from the other characters in the
414 same character set. Depending on the character set, there are either
415 one or two distinguishing bytes; the number of such bytes is called the
416 @dfn{dimension} of the character set.
417
418 @defun charset-dimension charset
419 This function returns the dimension of @var{charset}; at present, the
420 dimension is always 1 or 2.
421 @end defun
422
423 @defun charset-bytes charset
424 This function returns the number of bytes used to represent a character
425 in character set @var{charset}.
426 @end defun
427
428 This is the simplest way to determine the byte length of a character
429 set's introduction sequence:
430
431 @example
432 (- (charset-bytes @var{charset})
433 (charset-dimension @var{charset}))
434 @end example
435
436 @node Splitting Characters
437 @section Splitting Characters
438 @cindex character as bytes
439
440 The functions in this section convert between characters and the byte
441 values used to represent them. For most purposes, there is no need to
442 be concerned with the sequence of bytes used to represent a character,
443 because Emacs translates automatically when necessary.
444
445 @defun split-char character
446 Return a list containing the name of the character set of
447 @var{character}, followed by one or two byte values (integers) which
448 identify @var{character} within that character set. The number of byte
449 values is the character set's dimension.
450
451 If @var{character} is invalid as a character code, @code{split-char}
452 returns a list consisting of the symbol @code{unknown} and @var{character}.
453
454 @example
455 (split-char 2248)
456 @result{} (latin-iso8859-1 72)
457 (split-char 65)
458 @result{} (ascii 65)
459 (split-char 128)
460 @result{} (eight-bit-control 128)
461 @end example
462 @end defun
463
464 @c FIXME: update split-char and make-char
465 @cindex generate characters in charsets
466 @defun make-char charset &optional code1 code2
467 This function returns the character in character set @var{charset} whose
468 position codes are @var{code1} and @var{code2}. This is roughly the
469 inverse of @code{split-char}. Normally, you should specify either one
470 or both of @var{code1} and @var{code2} according to the dimension of
471 @var{charset}. For example,
472
473 @example
474 (make-char 'latin-iso8859-1 72)
475 @result{} 2248
476 @end example
477
478 Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
479 before they are used to index @var{charset}. Thus you may use, for
480 instance, an ISO 8859 character code rather than subtracting 128, as
481 is necessary to index the corresponding Emacs charset.
482 @end defun
483
484 @node Scanning Charsets
485 @section Scanning for Character Sets
486
487 Sometimes it is useful to find out which character sets appear in a
488 part of a buffer or a string. One use for this is in determining which
489 coding systems (@pxref{Coding Systems}) are capable of representing all
490 of the text in question.
491
492 @defun charset-after &optional pos
493 This function returns the charset of the character in the current buffer
494 at position @var{pos}. If @var{pos} is omitted or @code{nil}, it
495 defaults to the current value of point. If @var{pos} is out of range,
496 the value is @code{nil}.
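
A small sketch, assuming a temporary buffer that contains only the
character @samp{A}, so that position 100 is out of range:

@example
(with-temp-buffer
  (insert "A")
  (list (charset-after 1) (charset-after 100)))
     @result{} (ascii nil)
@end example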
497 @end defun
498
499 @defun find-charset-region beg end &optional translation
500 This function returns a list of the character sets that appear in the
501 current buffer between positions @var{beg} and @var{end}.
502
503 The optional argument @var{translation} specifies a translation table to
504 be used in scanning the text (@pxref{Translation of Characters}). If it
505 is non-@code{nil}, then each character in the region is translated
506 through this table, and the value returned describes the translated
507 characters instead of the characters actually in the buffer.
508 @end defun
509
510 @defun find-charset-string string &optional translation
511 This function returns a list of the character sets that appear in the
512 string @var{string}. It is just like @code{find-charset-region}, except
513 that it applies to the contents of @var{string} instead of part of the
514 current buffer.
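
As a sketch with purely @acronym{ASCII} text, the string and region
variants report the same character set:

@example
(find-charset-string "abc")
     @result{} (ascii)
(with-temp-buffer
  (insert "abc")
  (find-charset-region (point-min) (point-max)))
     @result{} (ascii)
@end example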
515 @end defun
516
517 @node Translation of Characters
518 @section Translation of Characters
519 @cindex character translation tables
520 @cindex translation tables
521
522 A @dfn{translation table} is a char-table that specifies a mapping
523 of characters into characters. These tables are used in encoding and
524 decoding, and for other purposes. Some coding systems specify their
525 own particular translation tables; there are also default translation
526 tables which apply to all other coding systems.
527
528 For instance, the coding-system @code{utf-8} has a translation table
529 that maps characters of various charsets (e.g.,
530 @code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
531 it can encode Latin-2 characters into UTF-8. Meanwhile,
532 @code{unify-8859-on-decoding-mode} operates by specifying
533 @code{standard-translation-table-for-decode} to translate
534 Latin-@var{x} characters into corresponding Unicode characters.
535
536 @defun make-translation-table &rest translations
537 This function returns a translation table based on the argument
538 @var{translations}. Each element of @var{translations} should be a
539 list of elements of the form @code{(@var{from} . @var{to})}; this says
540 to translate the character @var{from} into @var{to}.
541
542 The arguments and the forms in each argument are processed in order,
543 and if a previous form already translates @var{to} to some other
544 character, say @var{to-alt}, @var{from} is also translated to
545 @var{to-alt}.
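
The chaining rule can be seen in a small sketch.  The characters are
arbitrary: the first argument maps @samp{b} to @samp{c} (code 99), so
the later mapping of @samp{a} to @samp{b} ends up at @samp{c} as well,
while an unmapped character yields @code{nil}.

@example
(let ((table (make-translation-table '((?b . ?c))
                                     '((?a . ?b)))))
  (list (aref table ?a) (aref table ?b) (aref table ?z)))
     @result{} (99 99 nil)
@end example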
546 @end defun
547
548 In decoding, the translation table's translations are applied to the
549 characters that result from ordinary decoding. If a coding system has
550 property @code{translation-table-for-decode}, that specifies the
551 translation table to use. (This is a property of the coding system,
552 as returned by @code{coding-system-get}, not a property of the symbol
553 that is the coding system's name. @xref{Coding System Basics,, Basic
554 Concepts of Coding Systems}.) Otherwise, if
555 @code{standard-translation-table-for-decode} is non-@code{nil},
556 decoding uses that table.
557
558 In encoding, the translation table's translations are applied to the
559 characters in the buffer, and the result of translation is actually
560 encoded. If a coding system has property
561 @code{translation-table-for-encode}, that specifies the translation
562 table to use. Otherwise the variable
563 @code{standard-translation-table-for-encode} specifies the translation
564 table.
565
566 @defvar standard-translation-table-for-decode
567 This is the default translation table for decoding, for
568 coding systems that don't specify any other translation table.
569 @end defvar
570
571 @defvar standard-translation-table-for-encode
572 This is the default translation table for encoding, for
573 coding systems that don't specify any other translation table.
574 @end defvar
575
576 @node Coding Systems
577 @section Coding Systems
578
579 @cindex coding system
580 When Emacs reads or writes a file, and when Emacs sends text to a
581 subprocess or receives text from a subprocess, it normally performs
582 character code conversion and end-of-line conversion as specified
583 by a particular @dfn{coding system}.
584
585 How to define a coding system is an arcane matter, and is not
586 documented here.
587
588 @menu
589 * Coding System Basics:: Basic concepts.
590 * Encoding and I/O:: How file I/O functions handle coding systems.
591 * Lisp and Coding Systems:: Functions to operate on coding system names.
592 * User-Chosen Coding Systems:: Asking the user to choose a coding system.
593 * Default Coding Systems:: Controlling the default choices.
594 * Specifying Coding Systems:: Requesting a particular coding system
595 for a single file operation.
596 * Explicit Encoding:: Encoding or decoding text without doing I/O.
597 * Terminal I/O Encoding:: Use of encoding for terminal I/O.
598 * MS-DOS File Types:: How DOS "text" and "binary" files
599 relate to coding systems.
600 @end menu
601
602 @node Coding System Basics
603 @subsection Basic Concepts of Coding Systems
604
605 @cindex character code conversion
606 @dfn{Character code conversion} involves conversion between the encoding
607 used inside Emacs and some other encoding. Emacs supports many
608 different encodings, in that it can convert to and from them. For
609 example, it can convert text to or from encodings such as Latin 1, Latin
610 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
611 cases, Emacs supports several alternative encodings for the same
612 characters; for example, there are three coding systems for the Cyrillic
613 (Russian) alphabet: ISO, Alternativnyj, and KOI8.
614
615 Most coding systems specify a particular character code for
616 conversion, but some of them leave the choice unspecified---to be chosen
617 heuristically for each file, based on the data.
618
619 In general, a coding system doesn't guarantee roundtrip identity:
620 decoding a byte sequence using a coding system, then encoding the
621 resulting text in the same coding system, can produce a different byte
622 sequence. However, the following coding systems do guarantee that the
623 byte sequence will be the same as what you originally decoded:
624
625 @quotation
626 chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
627 greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
628 iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
629 japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
630 @end quotation
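
For a coding system in the list above, the byte-level round trip can be
checked directly.  The following is only a sketch: it relies on the
functions @code{decode-coding-string} and @code{encode-coding-string}
(@pxref{Explicit Encoding}), which are not described in this node, and
on two arbitrarily chosen Latin-1 bytes.

@example
(let ((bytes (unibyte-string #xC8 #xE9)))
  (equal (encode-coding-string
          (decode-coding-string bytes 'iso-latin-1)
          'iso-latin-1)
         bytes))
     @result{} t
@end example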
631
632 Encoding buffer text and then decoding the result can also fail to
633 reproduce the original text. For instance, if you encode Latin-2
634 characters with @code{utf-8} and decode the result using the same
635 coding system, you'll get Unicode characters (of charset
636 @code{mule-unicode-0100-24ff}). If you encode Unicode characters with
637 @code{iso-latin-2} and decode the result with the same coding system,
638 you'll get Latin-2 characters.
639
640 @cindex EOL conversion
641 @cindex end-of-line conversion
642 @cindex line end conversion
643 @dfn{End of line conversion} handles three different conventions used
644 on various systems for representing end of line in files. The Unix
645 convention is to use the linefeed character (also called newline). The
646 DOS convention is to use a carriage-return and a linefeed at the end of
647 a line. The Mac convention is to use just carriage-return.
648
649 @cindex base coding system
650 @cindex variant coding system
651 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
652 conversion unspecified, to be chosen based on the data. @dfn{Variant
653 coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
654 @code{latin-1-mac} specify the end-of-line conversion explicitly as
655 well. Most base coding systems have three corresponding variants whose
656 names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
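
The effect of the variants can be sketched with the explicit decoding
and encoding functions (@pxref{Explicit Encoding}).  This is an
illustration under assumptions, not a definitive recipe: it uses the
variants of @code{iso-latin-1}, and the function
@code{coding-system-eol-type} is not described in this node.

@example
(decode-coding-string "a\r\nb" 'iso-latin-1-dos)
     @result{} "a\nb"
(encode-coding-string "a\nb" 'iso-latin-1-dos)
     @result{} "a\r\nb"
(decode-coding-string "a\rb" 'iso-latin-1-mac)
     @result{} "a\nb"
(coding-system-eol-type 'iso-latin-1-unix)
     @result{} 0
(coding-system-eol-type 'iso-latin-1-dos)
     @result{} 1
(coding-system-eol-type 'iso-latin-1-mac)
     @result{} 2
@end example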
657
658 The coding system @code{raw-text} is special in that it prevents
659 character code conversion, and causes the buffer visited with that
660 coding system to be a unibyte buffer. It does not specify the
661 end-of-line conversion, allowing that to be determined as usual by the
662 data, and has the usual three variants which specify the end-of-line
663 conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
664 it specifies no conversion of either character codes or end-of-line.
665
666 The coding system @code{emacs-mule} specifies that the data is
667 represented in the internal Emacs encoding. This is like
668 @code{raw-text} in that no code conversion happens, but different in
669 that the result is multibyte data.
670
671 @defun coding-system-get coding-system property
672 This function returns the specified property of the coding system
673 @var{coding-system}. Most coding system properties exist for internal
674 purposes, but one that you might find useful is @code{mime-charset}.
675 That property's value is the name used in MIME for the character coding
676 which this coding system can read and write. Examples:
677
678 @example
679 (coding-system-get 'iso-latin-1 'mime-charset)
680 @result{} iso-8859-1
681 (coding-system-get 'iso-2022-cn 'mime-charset)
682 @result{} iso-2022-cn
683 (coding-system-get 'cyrillic-koi8 'mime-charset)
684 @result{} koi8-r
685 @end example
686
687 The value of the @code{mime-charset} property is also defined
688 as an alias for the coding system.
689 @end defun
690
691 @node Encoding and I/O
692 @subsection Encoding and I/O
693
694 The principal purpose of coding systems is for use in reading and
695 writing files. The function @code{insert-file-contents} uses
696 a coding system for decoding the file data, and @code{write-region}
697 uses one to encode the buffer contents.
698
699 You can specify the coding system to use either explicitly
700 (@pxref{Specifying Coding Systems}), or implicitly using a default
701 mechanism (@pxref{Default Coding Systems}). But these methods may not
702 completely specify what to do. For example, they may choose a coding
703 system such as @code{undecided} which leaves the character code
704 conversion to be determined from the data. In these cases, the I/O
705 operation finishes the job of choosing a coding system. Very often
706 you will want to find out afterwards which coding system was chosen.
707
708 @defvar buffer-file-coding-system
709 This buffer-local variable records the coding system used for saving the
710 buffer and for writing part of the buffer with @code{write-region}. If
711 the text to be written cannot be safely encoded using the coding system
712 specified by this variable, these operations select an alternative
713 encoding by calling the function @code{select-safe-coding-system}
714 (@pxref{User-Chosen Coding Systems}). If selecting a different encoding
715 requires asking the user to specify a coding system,
716 @code{buffer-file-coding-system} is updated to the newly selected coding
717 system.
718
719 @code{buffer-file-coding-system} does @emph{not} affect sending text
720 to a subprocess.
721 @end defvar
722
723 @defvar save-buffer-coding-system
724 This variable specifies the coding system for saving the buffer (by
725 overriding @code{buffer-file-coding-system}). Note that it is not used
726 for @code{write-region}.
727
728 When a command to save the buffer starts out to use
729 @code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
730 and that coding system cannot handle
731 the actual text in the buffer, the command asks the user to choose
732 another coding system (by calling @code{select-safe-coding-system}).
733 After that happens, the command also updates
734 @code{buffer-file-coding-system} to represent the coding system that
735 the user specified.
736 @end defvar
737
738 @defvar last-coding-system-used
739 I/O operations for files and subprocesses set this variable to the
740 coding system name that was used. The explicit encoding and decoding
741 functions (@pxref{Explicit Encoding}) set it too.
742
743 @strong{Warning:} Since receiving subprocess output sets this variable,
744 it can change whenever Emacs waits; therefore, you should copy the
745 value shortly after the function call that stores the value you are
746 interested in.
747 @end defvar
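
As a minimal sketch of the advice above, a helper can copy the value
immediately after the read that set it. The name
@code{my-file-coding-system} is hypothetical and not part of Emacs:

@lisp
(defun my-file-coding-system (file)
  "Return the coding system Emacs chose when decoding FILE."
  (with-temp-buffer
    (insert-file-contents file)
    ;; Copy the value right away, before Emacs gets a chance to
    ;; wait and process subprocess output.
    last-coding-system-used))
@end lisp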
748
749 The variable @code{selection-coding-system} specifies how to encode
750 selections for the window system. @xref{Window System Selections}.
751
752 @defvar file-name-coding-system
753 The variable @code{file-name-coding-system} specifies the coding
754 system to use for encoding file names. Emacs encodes file names using
755 that coding system for all file operations. If
756 @code{file-name-coding-system} is @code{nil}, Emacs uses a default
757 coding system determined by the selected language environment. In the
758 default language environment, any non-@acronym{ASCII} characters in
759 file names are not encoded specially; they appear in the file system
760 using the internal Emacs representation.
761 @end defvar
762
763 @strong{Warning:} if you change @code{file-name-coding-system} (or
764 the language environment) in the middle of an Emacs session, problems
765 can result if you have already visited files whose names were encoded
766 using the earlier coding system and are handled differently under the
767 new coding system. If you try to save one of these buffers under the
768 visited file name, saving may use the wrong file name, or it may get
769 an error. If such a problem happens, use @kbd{C-x C-w} to specify a
770 new file name for that buffer.
771
772 @node Lisp and Coding Systems
773 @subsection Coding Systems in Lisp
774
775 Here are the Lisp facilities for working with coding systems:
776
777 @defun coding-system-list &optional base-only
778 This function returns a list of all coding system names (symbols). If
779 @var{base-only} is non-@code{nil}, the value includes only the
780 base coding systems. Otherwise, it includes alias and variant coding
781 systems as well.
782 @end defun
783
784 @defun coding-system-p object
785 This function returns @code{t} if @var{object} is a coding system
786 name, or if @var{object} is itself @code{nil}.
787 @end defun
788
789 @defun check-coding-system coding-system
790 This function checks the validity of @var{coding-system}.
791 If that is valid, it returns @var{coding-system}.
792 Otherwise it signals an error with condition @code{coding-system-error}.
793 @end defun
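
For illustration, this sketch applies both functions to a name that
is assumed not to be a defined coding system
(@code{no-such-coding-system} is made up for this example):

@lisp
(coding-system-p 'latin-1)
     @result{} t
(coding-system-p nil)
     @result{} t
(coding-system-p 'no-such-coding-system)
     @result{} nil

;; check-coding-system signals an error instead of returning nil.
(condition-case nil
    (check-coding-system 'no-such-coding-system)
  (coding-system-error 'invalid))
     @result{} invalid
@end lisp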
794
795 @defun coding-system-eol-type coding-system
796 This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
797 conversion used by @var{coding-system}. If @var{coding-system}
798 specifies a certain eol conversion, the return value is an integer 0,
799 1, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
800 respectively. If @var{coding-system} doesn't specify eol conversion
801 explicitly, the return value is a vector of coding systems, each one
802 with one of the possible eol conversion types, like this:
803
804 @lisp
805 (coding-system-eol-type 'latin-1)
806 @result{} [latin-1-unix latin-1-dos latin-1-mac]
807 @end lisp
808
809 @noindent
810 If this function returns a vector, Emacs will decide, as part of the
811 text encoding or decoding process, what eol conversion to use. For
812 decoding, the end-of-line format of the text is auto-detected, and the
813 eol conversion is set to match it (e.g., DOS-style CRLF format will
814 imply @code{dos} eol conversion). For encoding, the eol conversion is
815 taken from the appropriate default coding system (e.g.,
816 @code{default-buffer-file-coding-system} for
817 @code{buffer-file-coding-system}), or from the default eol conversion
818 appropriate for the underlying platform.
819 @end defun
820
821 @defun coding-system-change-eol-conversion coding-system eol-type
822 This function returns a coding system which is like @var{coding-system}
823 except for its eol conversion, which is specified by @var{eol-type}.
824 @var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
825 @code{nil}. If it is @code{nil}, the returned coding system determines
826 the end-of-line conversion from the data.
827
828 @var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
829 @code{dos} and @code{mac}, respectively.
830 @end defun
831
832 @defun coding-system-change-text-conversion eol-coding text-coding
833 This function returns a coding system which uses the end-of-line
834 conversion of @var{eol-coding}, and the text conversion of
835 @var{text-coding}. If @var{text-coding} is @code{nil}, it returns
836 @code{undecided}, or one of its variants according to @var{eol-coding}.
837 @end defun
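
The following sketch combines these functions. It inspects the
results with @code{coding-system-eol-type} rather than relying on the
exact names returned, since those may be aliases that differ between
Emacs versions:

@lisp
(coding-system-eol-type 'latin-1-dos)
     @result{} 1

;; Request DOS end-of-line conversion for latin-1.
(coding-system-eol-type
 (coding-system-change-eol-conversion 'latin-1 'dos))
     @result{} 1

;; A nil eol-type leaves the conversion to be determined from the data.
(vectorp (coding-system-eol-type
          (coding-system-change-eol-conversion 'latin-1-dos nil)))
     @result{} t

;; Eol conversion of latin-1-mac, text conversion of utf-8.
(coding-system-eol-type
 (coding-system-change-text-conversion 'latin-1-mac 'utf-8))
     @result{} 2
@end lisp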
838
839 @defun find-coding-systems-region from to
840 This function returns a list of coding systems that could be used to
841 encode the text between @var{from} and @var{to}. All coding systems in
842 the list can safely encode any multibyte characters in that portion of
843 the text.
844
845 If the text contains no multibyte characters, the function returns the
846 list @code{(undecided)}.
847 @end defun
848
849 @defun find-coding-systems-string string
850 This function returns a list of coding systems that could be used to
851 encode the text of @var{string}. All coding systems in the list can
852 safely encode any multibyte characters in @var{string}. If the text
853 contains no multibyte characters, this returns the list
854 @code{(undecided)}.
855 @end defun
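
For example, a pure @acronym{ASCII} string yields @code{(undecided)},
while for a string containing a non-@acronym{ASCII} character the list
names the coding systems able to encode it. The exact list depends on
the Emacs version and configuration, so this sketch only checks for
one plausible member and states no result for it:

@lisp
(find-coding-systems-string "abc")
     @result{} (undecided)

;; (string #xe9) builds a one-character string holding
;; LATIN SMALL LETTER E WITH ACUTE.
;; The value is expected to be non-nil in a typical configuration.
(memq 'utf-8 (find-coding-systems-string (string #xe9)))
@end lisp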
856
857 @defun find-coding-systems-for-charsets charsets
858 This function returns a list of coding systems that could be used to
859 encode all the character sets in the list @var{charsets}.
860 @end defun
861
862 @defun detect-coding-region start end &optional highest
863 This function chooses a plausible coding system for decoding the text
864 from @var{start} to @var{end}. This text should be a byte sequence
865 (@pxref{Explicit Encoding}).
866
867 Normally this function returns a list of coding systems that could
868 handle decoding the text that was scanned. They are listed in order of
869 decreasing priority. But if @var{highest} is non-@code{nil}, then the
870 return value is just one coding system, the one that is highest in
871 priority.
872
873 If the region contains only @acronym{ASCII} characters except for such
874 ISO-2022 control characters as @code{ESC}, the value is
875 @code{undecided} or @code{(undecided)}, or a variant specifying
876 end-of-line conversion, if that can be deduced from the text.
877 @end defun
878
879 @defun detect-coding-string string &optional highest
880 This function is like @code{detect-coding-region} except that it
881 operates on the contents of @var{string} instead of bytes in the buffer.
882 @end defun
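
A short sketch of both return conventions follows. For text that is
not pure @acronym{ASCII}, the result depends on the current coding
system priorities, so no value is shown for the last form:

@lisp
;; Pure ASCII text with no line ends: nothing can be deduced.
(detect-coding-string "abc")
     @result{} (undecided)
(detect-coding-string "abc" t)
     @result{} undecided

;; Two bytes that happen to form a valid UTF-8 sequence.
;; The value is a list of candidates in priority order.
(detect-coding-string "\303\251")
@end lisp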
883
884 @xref{Coding systems for a subprocess,, Process Information}, in
885 particular the description of the functions
886 @code{process-coding-system} and @code{set-process-coding-system}, for
887 how to examine or set the coding systems used for I/O to a subprocess.
888
889 @node User-Chosen Coding Systems
890 @subsection User-Chosen Coding Systems
891
892 @cindex select safe coding system
893 @defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
894 This function selects a coding system for encoding specified text,
895 asking the user to choose if necessary. Normally the specified text
896 is the text in the current buffer between @var{from} and @var{to}. If
897 @var{from} is a string, the string specifies the text to encode, and
898 @var{to} is ignored.
899
900 If @var{default-coding-system} is non-@code{nil}, that is the first
901 coding system to try; if that can handle the text,
902 @code{select-safe-coding-system} returns that coding system. It can
903 also be a list of coding systems; then the function tries each of them
904 one by one. After trying all of them, it next tries the current
905 buffer's value of @code{buffer-file-coding-system} (if it is not
906 @code{undecided}), then the value of
907 @code{default-buffer-file-coding-system} and finally the user's most
908 preferred coding system, which the user can set using the command
909 @code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
910 Coding Systems, emacs, The GNU Emacs Manual}).
911
912 If one of those coding systems can safely encode all the specified
913 text, @code{select-safe-coding-system} chooses it and returns it.
914 Otherwise, it asks the user to choose from a list of coding systems
915 which can encode all the text, and returns the user's choice.
916
917 @var{default-coding-system} can also be a list whose first element is
918 t and whose other elements are coding systems. Then, if no coding
919 system in the list can handle the text, @code{select-safe-coding-system}
920 queries the user immediately, without trying any of the three
921 alternatives described above.
922
923 The optional argument @var{accept-default-p}, if non-@code{nil},
924 should be a function to determine whether a coding system selected
925 without user interaction is acceptable. @code{select-safe-coding-system}
926 calls this function with one argument, the base coding system of the
927 selected coding system. If @var{accept-default-p} returns @code{nil},
928 @code{select-safe-coding-system} rejects the silently selected coding
929 system, and asks the user to select a coding system from a list of
930 possible candidates.
931
932 @vindex select-safe-coding-system-accept-default-p
933 If the variable @code{select-safe-coding-system-accept-default-p} is
934 non-@code{nil}, its value overrides the value of
935 @var{accept-default-p}.
936
937 As a final step, before returning the chosen coding system,
938 @code{select-safe-coding-system} checks whether that coding system is
939 consistent with what would be selected if the contents of the region
940 were read from a file. (If not, this could lead to data corruption in
941 a file subsequently re-visited and edited.) Normally,
942 @code{select-safe-coding-system} uses @code{buffer-file-name} as the
943 file for this purpose, but if @var{file} is non-@code{nil}, it uses
944 that file instead (this can be relevant for @code{write-region} and
945 similar functions). If it detects an apparent inconsistency,
946 @code{select-safe-coding-system} queries the user before selecting the
947 coding system.
948 @end defun
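
As a hedged sketch of a call that should not need to consult the user:
the text is given as a string, and the default list starts with
@code{t} so that only the listed coding system is tried. The example
assumes that @code{utf-8} can encode the text, in which case that
coding system (possibly in a variant form) is returned without a
query:

@lisp
;; (string #xe9) is a one-character non-ASCII string.
;; The second argument is ignored because the first is a string.
(select-safe-coding-system (string #xe9) nil '(t utf-8))
@end lisp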
949
950 Here are two functions you can use to let the user specify a coding
951 system, with completion. @xref{Completion}.
952
953 @defun read-coding-system prompt &optional default
954 This function reads a coding system using the minibuffer, prompting with
955 string @var{prompt}, and returns the coding system name as a symbol. If
956 the user enters null input, @var{default} specifies which coding system
957 to return. It should be a symbol or a string.
958 @end defun
959
960 @defun read-non-nil-coding-system prompt
961 This function reads a coding system using the minibuffer, prompting with
962 string @var{prompt}, and returns the coding system name as a symbol. If
963 the user tries to enter null input, it asks the user to try again.
964 @xref{Coding Systems}.
965 @end defun
966
967 @node Default Coding Systems
968 @subsection Default Coding Systems
969
970 This section describes variables that specify the default coding
971 system for certain files or when running certain subprograms, and the
972 function that I/O operations use to access them.
973
974 The idea of these variables is that you set them once and for all to the
975 defaults you want, and then do not change them again. To specify a
976 particular coding system for a particular operation in a Lisp program,
977 don't change these variables; instead, override them using
978 @code{coding-system-for-read} and @code{coding-system-for-write}
979 (@pxref{Specifying Coding Systems}).
980
981 @defvar auto-coding-regexp-alist
982 This variable is an alist of text patterns and corresponding coding
983 systems. Each element has the form @code{(@var{regexp}
984 . @var{coding-system})}; a file whose first few kilobytes match
985 @var{regexp} is decoded with @var{coding-system} when its contents are
986 read into a buffer. The settings in this alist take priority over
987 @code{coding:} tags in the files and the contents of
988 @code{file-coding-system-alist} (see below). The default value is set
989 so that Emacs automatically recognizes mail files in Babyl format and
990 reads them with no code conversions.
991 @end defvar
992
993 @defvar file-coding-system-alist
994 This variable is an alist that specifies the coding systems to use for
995 reading and writing particular files. Each element has the form
996 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
997 expression that matches certain file names. The element applies to file
998 names that match @var{pattern}.
999
1000 The @sc{cdr} of the element, @var{coding}, should be either a coding
1001 system, a cons cell containing two coding systems, or a function name (a
1002 symbol with a function definition). If @var{coding} is a coding system,
1003 that coding system is used for both reading the file and writing it. If
1004 @var{coding} is a cons cell containing two coding systems, its @sc{car}
1005 specifies the coding system for decoding, and its @sc{cdr} specifies the
1006 coding system for encoding.
1007
1008 If @var{coding} is a function name, the function should take one
1009 argument, a list of all arguments passed to
1010 @code{find-operation-coding-system}. It must return a coding system
1011 or a cons cell containing two coding systems. This value has the same
1012 meaning as described above.
1013
1014 If @var{coding} (or the value returned by the above function) is
1015 @code{undecided}, the normal code-detection is performed.
1016 @end defvar
1017
1018 @defvar process-coding-system-alist
1019 This variable is an alist specifying which coding systems to use for a
1020 subprocess, depending on which program is running in the subprocess. It
1021 works like @code{file-coding-system-alist}, except that @var{pattern} is
1022 matched against the program name used to start the subprocess. The coding
1023 system or systems specified in this alist are used to initialize the
1024 coding systems used for I/O to the subprocess, but you can specify
1025 other coding systems later using @code{set-process-coding-system}.
1026 @end defvar
1027
1028 @strong{Warning:} Coding systems such as @code{undecided}, which
1029 determine the coding system from the data, do not work entirely reliably
1030 with asynchronous subprocess output. This is because Emacs handles
1031 asynchronous subprocess output in batches, as it arrives. If the coding
1032 system leaves the character code conversion unspecified, or leaves the
1033 end-of-line conversion unspecified, Emacs must try to detect the proper
1034 conversion from one batch at a time, and this does not always work.
1035
1036 Therefore, with an asynchronous subprocess, if at all possible, use a
1037 coding system which determines both the character code conversion and
1038 the end-of-line conversion---that is, one like @code{latin-1-unix},
1039 rather than @code{undecided} or @code{latin-1}.
1040
1041 @defvar network-coding-system-alist
1042 This variable is an alist that specifies the coding system to use for
1043 network streams. It works much like @code{file-coding-system-alist},
1044 with the difference that the @var{pattern} in an element may be either a
1045 port number or a regular expression. If it is a regular expression, it
1046 is matched against the network service name used to open the network
1047 stream.
1048 @end defvar
1049
1050 @defvar default-process-coding-system
1051 This variable specifies the coding systems to use for subprocess (and
1052 network stream) input and output, when nothing else specifies what to
1053 do.
1054
1055 The value should be a cons cell of the form @code{(@var{input-coding}
1056 . @var{output-coding})}. Here @var{input-coding} applies to input from
1057 the subprocess, and @var{output-coding} applies to output to it.
1058 @end defvar
1059
1060 @defvar auto-coding-functions
1061 This variable holds a list of functions that try to determine a
1062 coding system for a file based on its undecoded contents.
1063
1064 Each function in this list should be written to look at text in the
1065 current buffer, but should not modify it in any way. The buffer will
1066 contain undecoded text of parts of the file. Each function should
1067 take one argument, @var{size}, which tells it how many characters to
1068 look at, starting from point. If the function succeeds in determining
1069 a coding system for the file, it should return that coding system.
1070 Otherwise, it should return @code{nil}.
1071
1072 If a file has a @samp{coding:} tag, that takes precedence, so these
1073 functions won't be called.
1074 @end defvar
1075
1076 @defun find-operation-coding-system operation &rest arguments
1077 This function returns the coding system to use (by default) for
1078 performing @var{operation} with @var{arguments}. The value has this
1079 form:
1080
1081 @example
1082 (@var{decoding-system} . @var{encoding-system})
1083 @end example
1084
1085 The first element, @var{decoding-system}, is the coding system to use
1086 for decoding (in case @var{operation} does decoding), and
1087 @var{encoding-system} is the coding system for encoding (in case
1088 @var{operation} does encoding).
1089
1090 The argument @var{operation} is a symbol, one of @code{write-region},
1091 @code{start-process}, @code{call-process}, @code{call-process-region},
1092 @code{insert-file-contents}, or @code{open-network-stream}. These are
1093 the names of the Emacs I/O primitives that can do character code and
1094 eol conversion.
1095
1096 The remaining arguments should be the same arguments that might be given
1097 to the corresponding I/O primitive. Depending on the primitive, one
1098 of those arguments is selected as the @dfn{target}. For example, if
1099 @var{operation} does file I/O, whichever argument specifies the file
1100 name is the target. For subprocess primitives, the program name is the
1101 target. For @code{open-network-stream}, the target is the service name
1102 or port number.
1103
1104 Depending on @var{operation}, this function looks up the target in
1105 @code{file-coding-system-alist}, @code{process-coding-system-alist},
1106 or @code{network-coding-system-alist}. If the target is found in the
1107 alist, @code{find-operation-coding-system} returns its association in
1108 the alist; otherwise it returns @code{nil}.
1109
1110 If @var{operation} is @code{insert-file-contents}, the argument
1111 corresponding to the target may be a cons cell of the form
1112 @code{(@var{filename} . @var{buffer})}. In that case, @var{filename}
1113 is a file name to look up in @code{file-coding-system-alist}, and
1114 @var{buffer} is a buffer that contains the file's contents (not yet
1115 decoded). If @code{file-coding-system-alist} specifies a function to
1116 call for this file, and that function needs to examine the file's
1117 contents (as it usually does), it should examine the contents of
1118 @var{buffer} instead of reading the file.
1119 @end defun
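
The lookup can be sketched as follows. The alist is bound to a single
made-up entry so that the result does not depend on the user's
configuration; the @file{.my-ext} extension and the file names are
purely illustrative, and the files need not exist:

@lisp
(let ((file-coding-system-alist
       '(("\\.my-ext\\'" . (latin-1 . utf-8)))))
  ;; The file name is the target for insert-file-contents.
  (find-operation-coding-system
   'insert-file-contents "/tmp/example.my-ext"))
     @result{} (latin-1 . utf-8)

(let ((file-coding-system-alist
       '(("\\.my-ext\\'" . (latin-1 . utf-8)))))
  ;; No element matches this name.
  (find-operation-coding-system
   'insert-file-contents "/tmp/example.txt"))
     @result{} nil
@end lisp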
1120
1121 @node Specifying Coding Systems
1122 @subsection Specifying a Coding System for One Operation
1123
1124 You can specify the coding system for a specific operation by binding
1125 the variables @code{coding-system-for-read} and/or
1126 @code{coding-system-for-write}.
1127
1128 @defvar coding-system-for-read
1129 If this variable is non-@code{nil}, it specifies the coding system to
1130 use for reading a file, or for input from a synchronous subprocess.
1131
1132 It also applies to any asynchronous subprocess or network stream, but in
1133 a different way: the value of @code{coding-system-for-read} when you
1134 start the subprocess or open the network stream specifies the input
1135 decoding method for that subprocess or network stream. It remains in
1136 use for that subprocess or network stream unless and until overridden.
1137
1138 The right way to use this variable is to bind it with @code{let} for a
1139 specific I/O operation. Its global value is normally @code{nil}, and
1140 you should not globally set it to any other value. Here is an example
1141 of the right way to use the variable:
1142
1143 @example
1144 ;; @r{Read the file with no character code conversion.}
1145 ;; @r{Assume @acronym{crlf} represents end-of-line.}
1146 (let ((coding-system-for-read 'emacs-mule-dos))
1147 (insert-file-contents filename))
1148 @end example
1149
1150 When its value is non-@code{nil}, this variable takes precedence over
1151 all other methods of specifying a coding system to use for input,
1152 including @code{file-coding-system-alist},
1153 @code{process-coding-system-alist} and
1154 @code{network-coding-system-alist}.
1155 @end defvar
1156
1157 @defvar coding-system-for-write
1158 This works much like @code{coding-system-for-read}, except that it
1159 applies to output rather than input. It affects writing to files,
1160 as well as sending output to subprocesses and network connections.
1161
1162 When a single operation does both input and output, as do
1163 @code{call-process-region} and @code{start-process}, both
1164 @code{coding-system-for-read} and @code{coding-system-for-write}
1165 affect it.
1166 @end defvar
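
By analogy with the example for @code{coding-system-for-read}, here
is a sketch of binding this variable around a single write. The
function name @code{my-write-utf-8} is hypothetical:

@lisp
(defun my-write-utf-8 (text file)
  "Write the string TEXT to FILE as UTF-8 with Unix line ends."
  (let ((coding-system-for-write 'utf-8-unix))
    ;; When the first argument is a string, write-region writes
    ;; that string and ignores the second argument.
    (write-region text nil file)))
@end lisp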
1167
1168 @defvar inhibit-eol-conversion
1169 When this variable is non-@code{nil}, no end-of-line conversion is done,
1170 no matter which coding system is specified. This applies to all the
1171 Emacs I/O and subprocess primitives, and to the explicit encoding and
1172 decoding functions (@pxref{Explicit Encoding}).
1173 @end defvar
1174
1175 @node Explicit Encoding
1176 @subsection Explicit Encoding and Decoding
1177 @cindex encoding in coding systems
1178 @cindex decoding in coding systems
1179
1180 All the operations that transfer text in and out of Emacs have the
1181 ability to use a coding system to encode or decode the text.
1182 You can also explicitly encode and decode text using the functions
1183 in this section.
1184
1185 The result of encoding, and the input to decoding, are not ordinary
1186 text. They logically consist of a series of byte values; that is, a
1187 series of characters whose codes are in the range 0 through 255. In a
1188 multibyte buffer or string, character codes 128 through 159 are
1189 represented by multibyte sequences, but this is invisible to Lisp
1190 programs.
1191
1192 The usual way to read a file into a buffer as a sequence of bytes, so
1193 you can decode the contents explicitly, is with
1194 @code{insert-file-contents-literally} (@pxref{Reading from Files});
1195 alternatively, specify a non-@code{nil} @var{rawfile} argument when
1196 visiting a file with @code{find-file-noselect}. These methods result in
1197 a unibyte buffer.
1198
1199 The usual way to use the byte sequence that results from explicitly
1200 encoding text is to copy it to a file or process---for example, to write
1201 it with @code{write-region} (@pxref{Writing to Files}), and suppress
1202 encoding by binding @code{coding-system-for-write} to
1203 @code{no-conversion}.
1204
1205 Here are the functions to perform explicit encoding or decoding. The
1206 encoding functions produce sequences of bytes; the decoding functions
1207 are meant to operate on sequences of bytes. All of these functions
1208 discard text properties.
1209
1210 @deffn Command encode-coding-region start end coding-system
1211 This command encodes the text from @var{start} to @var{end} according
1212 to coding system @var{coding-system}. The encoded text replaces the
1213 original text in the buffer. The result of encoding is logically a
1214 sequence of bytes, but the buffer remains multibyte if it was multibyte
1215 before.
1216
1217 This command returns the length of the encoded text.
1218 @end deffn
1219
1220 @defun encode-coding-string string coding-system &optional nocopy
1221 This function encodes the text in @var{string} according to coding
1222 system @var{coding-system}. It returns a new string containing the
1223 encoded text, except when @var{nocopy} is non-@code{nil}, in which
1224 case the function may return @var{string} itself if the encoding
1225 operation is trivial. The result of encoding is a unibyte string.
1226 @end defun
1227
1228 @deffn Command decode-coding-region start end coding-system
1229 This command decodes the text from @var{start} to @var{end} according
1230 to coding system @var{coding-system}. The decoded text replaces the
1231 original text in the buffer. To make explicit decoding useful, the text
1232 before decoding ought to be a sequence of byte values, but both
1233 multibyte and unibyte buffers are acceptable.
1234
1235 This command returns the length of the decoded text.
1236 @end deffn
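
The two region commands can be sketched as a round trip in a
temporary multibyte buffer. After encoding, the single character
occupies two byte values; decoding restores it:

@lisp
(with-temp-buffer
  (insert (string #xe9))        ; one non-ASCII character
  (encode-coding-region (point-min) (point-max) 'utf-8)
  ;; The buffer now holds the two bytes of the UTF-8 sequence.
  (let ((encoded-size (buffer-size)))
    (decode-coding-region (point-min) (point-max) 'utf-8)
    (list encoded-size (buffer-size))))
     @result{} (2 1)
@end lisp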
1237
1238 @defun decode-coding-string string coding-system &optional nocopy
1239 This function decodes the text in @var{string} according to coding
1240 system @var{coding-system}. It returns a new string containing the
1241 decoded text, except when @var{nocopy} is non-@code{nil}, in which
1242 case the function may return @var{string} itself if the decoding
1243 operation is trivial. To make explicit decoding useful, the contents
1244 of @var{string} ought to be a sequence of byte values, but a multibyte
1245 string is acceptable.
1246 @end defun
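
For illustration, here is the string counterpart of that round trip.
The character is written as @code{(string #xe9)} to keep the example
in @acronym{ASCII}:

@lisp
(encode-coding-string (string #xe9) 'utf-8)
     @result{} "\303\251"
(encode-coding-string (string #xe9) 'latin-1)
     @result{} "\351"

;; The result of encoding is a unibyte string.
(multibyte-string-p (encode-coding-string (string #xe9) 'utf-8))
     @result{} nil

;; Decoding the bytes gives back the original character.
(string= (decode-coding-string "\303\251" 'utf-8) (string #xe9))
     @result{} t
@end lisp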
1247
1248 @defun decode-coding-inserted-region from to filename &optional visit beg end replace
1249 This function decodes the text from @var{from} to @var{to} as if
1250 it were being read from file @var{filename} using @code{insert-file-contents}
1251 with the rest of the arguments provided.
1252
1253 The normal way to use this function is after reading text from a file
1254 without decoding, if you decide you would rather have decoded it.
1255 Instead of deleting the text and reading it again, this time with
1256 decoding, you can call this function.
1257 @end defun
1258
1259 @node Terminal I/O Encoding
1260 @subsection Terminal I/O Encoding
1261
1262 Emacs can decode keyboard input using a coding system, and encode
1263 terminal output. This is useful for terminals that transmit or display
1264 text using a particular encoding such as Latin-1. Emacs does not set
1265 @code{last-coding-system-used} for encoding or decoding for the
1266 terminal.
1267
1268 @defun keyboard-coding-system
1269 This function returns the coding system that is in use for decoding
1270 keyboard input---or @code{nil} if no coding system is to be used.
1271 @end defun
1272
1273 @deffn Command set-keyboard-coding-system coding-system
1274 This command specifies @var{coding-system} as the coding system to
1275 use for decoding keyboard input. If @var{coding-system} is @code{nil},
1276 that means do not decode keyboard input.
1277 @end deffn
1278
1279 @defun terminal-coding-system
1280 This function returns the coding system that is in use for encoding
1281 terminal output---or @code{nil} for no encoding.
1282 @end defun
1283
1284 @deffn Command set-terminal-coding-system coding-system
1285 This command specifies @var{coding-system} as the coding system to use
1286 for encoding terminal output. If @var{coding-system} is @code{nil},
1287 that means do not encode terminal output.
1288 @end deffn
1289
1290 @node MS-DOS File Types
1291 @subsection MS-DOS File Types
1292 @cindex DOS file types
1293 @cindex MS-DOS file types
1294 @cindex Windows file types
1295 @cindex file types on MS-DOS and Windows
1296 @cindex text files and binary files
1297 @cindex binary files and text files
1298
1299 On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
1300 end-of-line conversion for a file by looking at the file's name. This
1301 feature classifies files as @dfn{text files} and @dfn{binary files}. By
1302 ``binary file'' we mean a file of literal byte values that are not
1303 necessarily meant to be characters; Emacs does no end-of-line conversion
1304 and no character code conversion for them. On the other hand, the bytes
1305 in a text file are intended to represent characters; when you create a
1306 new file whose name implies that it is a text file, Emacs uses DOS
1307 end-of-line conversion.
1308
1309 @defvar buffer-file-type
1310 This variable, automatically buffer-local in each buffer, records the
1311 file type of the buffer's visited file. When a buffer does not specify
1312 a coding system with @code{buffer-file-coding-system}, this variable is
1313 used to determine which coding system to use when writing the contents
1314 of the buffer. It should be @code{nil} for text, @code{t} for binary.
1315 If it is @code{t}, the coding system is @code{no-conversion}.
1316 Otherwise, @code{undecided-dos} is used.
1317
1318 Normally this variable is set by visiting a file; it is set to
1319 @code{nil} if the file was visited without any actual conversion.
1320 @end defvar
1321
1322 @defopt file-name-buffer-file-type-alist
1323 This variable holds an alist for recognizing text and binary files.
1324 Each element has the form @code{(@var{regexp} . @var{type})}, where
1325 @var{regexp} is matched against the file name, and @var{type} may be
1326 @code{nil} for text, @code{t} for binary, or a function to call to
1327 compute which. If it is a function, then it is called with a single
1328 argument (the file name) and should return @code{t} or @code{nil}.
1329
1330 When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
1331 which coding system to use when reading a file. For a text file,
1332 @code{undecided-dos} is used. For a binary file, @code{no-conversion}
1333 is used.
1334
1335 If no element in this alist matches a given file name, then
1336 @code{default-buffer-file-type} says how to treat the file.
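
The lookup described above can be sketched in Lisp.  This is only an
illustration, not the code Emacs itself uses: the function name
@code{my-file-type-from-alist} and the sample alist are hypothetical,
and the @var{default} argument stands in for
@code{default-buffer-file-type}.

@example
(defun my-file-type-from-alist (file alist default)
  "Return the type that ALIST assigns to FILE, or DEFAULT if none matches.
ALIST has elements of the form (REGEXP . TYPE); TYPE is nil for text,
t for binary, or a function called with FILE."
  (let ((found nil)
        (type default))
    (dolist (elt alist)
      ;; Only the first matching element counts.
      (when (and (not found)
                 (string-match (car elt) file))
        (setq found t)
        (setq type (if (functionp (cdr elt))
                       (funcall (cdr elt) file)
                     (cdr elt)))))
    type))

;; Hypothetical sample entries.
(defvar my-sample-type-alist
  '(("\\.exe\\'" . t)
    ("\\.txt\\'" . nil)))

(my-file-type-from-alist "prog.exe" my-sample-type-alist nil)
     ;; @result{} t
(my-file-type-from-alist "notes.txt" my-sample-type-alist t)
     ;; @result{} nil
@end example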
1337 @end defopt
1338
1339 @defopt default-buffer-file-type
1340 This variable says how to handle files for which
1341 @code{file-name-buffer-file-type-alist} says nothing about the type.
1342
1343 If this variable is non-@code{nil}, then these files are treated as
1344 binary: the coding system @code{no-conversion} is used. Otherwise,
1345 nothing special is done for them---the coding system is deduced solely
1346 from the file contents, in the usual Emacs fashion.
1347 @end defopt
1348
1349 @node Input Methods
1350 @section Input Methods
1351 @cindex input methods
1352
1353 @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
1354 characters from the keyboard. Unlike coding systems, which translate
1355 non-@acronym{ASCII} characters to and from encodings meant to be read by
1356 programs, input methods provide human-friendly commands. (@xref{Input
1357 Methods,,, emacs, The GNU Emacs Manual}, for information on how users
1358 use input methods to enter text.) How to define input methods is not
1359 yet documented in this manual, but here we describe how to use them.
1360
1361 Each input method has a name, which is currently a string;
1362 in the future, symbols may also be usable as input method names.
1363
1364 @defvar current-input-method
1365 This variable holds the name of the input method now active in the
1366 current buffer. (It automatically becomes local in each buffer when set
1367 in any fashion.) It is @code{nil} if no input method is active in the
1368 buffer now.
1369 @end defvar
1370
1371 @defopt default-input-method
1372 This variable holds the default input method for commands that choose an
1373 input method. Unlike @code{current-input-method}, this variable is
1374 normally global.
1375 @end defopt
1376
1377 @deffn Command set-input-method input-method
1378 This command activates input method @var{input-method} for the current
1379 buffer. It also sets @code{default-input-method} to @var{input-method}.
1380 If @var{input-method} is @code{nil}, this command deactivates any input
1381 method for the current buffer.
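
For instance, the following sketch activates an input method in a
temporary buffer and then turns it off again.  It assumes that an input
method named @code{"latin-1-prefix"} is available in your Emacs, and it
binds @code{default-input-method} with @code{let} only so that the
example leaves the global default unchanged.

@example
(let ((default-input-method default-input-method))
  (with-temp-buffer
    (set-input-method "latin-1-prefix")
    ;; current-input-method now names the active method.
    (let ((active current-input-method))
      (set-input-method nil)
      ;; After deactivation, current-input-method is nil again.
      (list active current-input-method))))
@end example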
1382 @end deffn
1383
1384 @defun read-input-method-name prompt &optional default inhibit-null
1385 This function reads an input method name with the minibuffer, prompting
1386 with @var{prompt}. If @var{default} is non-@code{nil}, it is returned
1387 when the user enters empty input. However, if
1388 @var{inhibit-null} is non-@code{nil}, empty input signals an error.
1389
1390 The returned value is a string.
1391 @end defun
1392
1393 @defvar input-method-alist
1394 This variable defines all the supported input methods.
1395 Each element defines one input method, and should have the form:
1396
1397 @example
1398 (@var{input-method} @var{language-env} @var{activate-func}
1399 @var{title} @var{description} @var{args}...)
1400 @end example
1401
1402 Here @var{input-method} is the input method name, a string;
1403 @var{language-env} is another string, the name of the language
1404 environment this input method is recommended for. (That serves only for
1405 documentation purposes.)
1406
1407 @var{activate-func} is a function to call to activate this method. The
1408 @var{args}, if any, are passed as arguments to @var{activate-func}. All
1409 told, the arguments to @var{activate-func} are @var{input-method} and
1410 the @var{args}.
1411
1412 @var{title} is a string to display in the mode line while this method is
1413 active. @var{description} is a string describing this method and what
1414 it is good for.
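
As a brief illustration, the following form looks up one entry and
extracts some of its fields by position.  The method name
@code{"latin-1-prefix"} is an assumption; if no such method is defined,
the form simply returns @code{nil}.

@example
(let ((entry (assoc "latin-1-prefix" input-method-alist)))
  (when entry
    (list (nth 1 entry)     ; language environment
          (nth 3 entry)     ; title shown in the mode line
          (nth 4 entry))))  ; description
@end example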
1415 @end defvar
1416
1417 The fundamental interface to input methods is through the
1418 variable @code{input-method-function}. @xref{Reading One Event},
1419 and @ref{Invoking the Input Method}.
1420
1421 @node Locales
1422 @section Locales
1423 @cindex locale
1424
1425 POSIX defines the concept of ``locales'', which determine the language
1426 to use in language-related features. The following Emacs variables
1427 control how Emacs interacts with these features.
1428
1429 @defvar locale-coding-system
1430 @cindex keyboard input decoding on X
1431 This variable specifies the coding system to use for decoding system
1432 error messages and---on the X Window System only---keyboard input, for
1433 encoding the format argument to @code{format-time-string}, and for
1434 decoding the return value of @code{format-time-string}.
1435 @end defvar
1436
1437 @defvar system-messages-locale
1438 This variable specifies the locale to use for generating system error
1439 messages. Changing the locale can cause messages to come out in a
1440 different language or in a different orthography. If the variable is
1441 @code{nil}, the locale is specified by environment variables in the
1442 usual POSIX fashion.
1443 @end defvar
1444
1445 @defvar system-time-locale
1446 This variable specifies the locale to use for formatting time values.
1447 Changing the locale can cause dates and times to be formatted according
1448 to the conventions of a different language. If the variable is @code{nil}, the
1449 locale is specified by environment variables in the usual POSIX fashion.
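
For example, a program can bind this variable temporarily to obtain
output in a known locale.  The sketch below assumes that the @samp{C}
locale is available, as it is on POSIX systems; it formats the start of
the epoch in Universal Time.

@example
(let ((system-time-locale "C"))
  ;; (0 0) is the start of the epoch, 1970-01-01 00:00:00 UTC.
  (format-time-string "%A" '(0 0) t))
     ;; @result{} "Thursday"
@end example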
1450 @end defvar
1451
1452 @defun locale-info item
1453 This function returns locale data @var{item} for the current POSIX
1454 locale, if available. @var{item} should be one of these symbols:
1455
1456 @table @code
1457 @item codeset
1458 Return the character set as a string (locale item @code{CODESET}).
1459
1460 @item days
1461 Return a 7-element vector of day names (locale items
1462 @code{DAY_1} through @code{DAY_7}).
1463
1464 @item months
1465 Return a 12-element vector of month names (locale items @code{MON_1}
1466 through @code{MON_12}).
1467
1468 @item paper
1469 Return a list @code{(@var{width} @var{height})} for the default paper
1470 size measured in millimeters (locale items @code{PAPER_WIDTH} and
1471 @code{PAPER_HEIGHT}).
1472 @end table
1473
1474 If the system can't provide the requested information, or if
1475 @var{item} is not one of those symbols, the value is @code{nil}. All
1476 strings in the return value are decoded using
1477 @code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual},
1478 for more information about locales and locale items.
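
The values depend on the locale in effect, so the following sketch
shows no results for the first two forms.  The symbol
@code{no-such-item} is made up here to show the @code{nil} return for
an unknown item.

@example
(locale-info 'codeset)        ; a string naming the codeset, or nil
(let ((days (locale-info 'days)))
  (when days
    (aref days 0)))           ; name for DAY_1, or nil if unavailable
(locale-info 'no-such-item)
     ;; @result{} nil
@end example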
1479 @end defun
1480
1481 @ignore
1482 arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
1483 @end ignore