1 @c -*-texinfo-*-
2 @c This is part of the GNU Emacs Lisp Reference Manual.
3 @c Copyright (C) 1998 Free Software Foundation, Inc.
4 @c See the file elisp.texi for copying conditions.
5 @setfilename ../info/characters
6 @node Non-ASCII Characters, Searching and Matching, Text, Top
7 @chapter Non-ASCII Characters
8 @cindex multibyte characters
9 @cindex non-ASCII characters
10
11 This chapter covers the special issues relating to non-@sc{ASCII}
12 characters and how they are stored in strings and buffers.
13
14 @menu
15 * Text Representations::
16 * Converting Representations::
17 * Selecting a Representation::
18 * Character Codes::
19 * Character Sets::
20 * Scanning Charsets::
21 * Chars and Bytes::
22 * Coding Systems::
23 * Lisp and Coding Systems::
24 * Default Coding Systems::
25 * Specifying Coding Systems::
26 * Explicit Encoding::
27 * MS-DOS File Types::
28 * MS-DOS Subprocesses::
29 @end menu
30
31 @node Text Representations
32 @section Text Representations
33 @cindex text representations
34
35 Emacs has two @dfn{text representations}---two ways to represent text
36 in a string or buffer. These are called @dfn{unibyte} and
37 @dfn{multibyte}. Each string, and each buffer, uses one of these two
38 representations. For most purposes, you can ignore the issue of
39 representations, because Emacs converts text between them as
40 appropriate. Occasionally in Lisp programming you will need to pay
41 attention to the difference.
42
43 @cindex unibyte text
44 In unibyte representation, each character occupies one byte and
45 therefore the possible character codes range from 0 to 255. Codes 0
46 through 127 are @sc{ASCII} characters; the codes from 128 through 255
47 are used for one non-@sc{ASCII} character set (you can choose which
48 character set by setting the variable @code{nonascii-insert-offset}).
49
50 @cindex leading code
51 @cindex multibyte text
52 In multibyte representation, a character may occupy more than one
53 byte, and as a result, the full range of Emacs character codes can be
54 stored. The first byte of a multibyte character is always in the range
55 128 through 159 (octal 0200 through 0237). These values are called
56 @dfn{leading codes}. The first byte determines which character set the
57 character belongs to (@pxref{Character Sets}); in particular, it
58 determines how many bytes long the sequence is. The second and
59 subsequent bytes of a multibyte character are always in the range 160
60 through 255 (octal 0240 through 0377).
61
62 In a buffer, the buffer-local value of the variable
63 @code{enable-multibyte-characters} specifies the representation used.
64 The representation for a string is determined based on the string
65 contents when the string is constructed.
66
67 @tindex enable-multibyte-characters
68 @defvar enable-multibyte-characters
69 This variable specifies the current buffer's text representation.
70 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
71 it contains unibyte text.
72
73 You cannot set this variable directly; instead, use the function
74 @code{set-buffer-multibyte} to change a buffer's representation.
75 @end defvar
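
As a small illustration of the variable described above, the following
sketch reports which representation the current buffer uses. It only
reads @code{enable-multibyte-characters}; it does not try to set it.

@example
;; @r{Report the text representation of the current buffer.}
(if enable-multibyte-characters
    (message "This buffer holds multibyte text")
  (message "This buffer holds unibyte text"))
@end example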
76
77 @tindex default-enable-multibyte-characters
78 @defvar default-enable-multibyte-characters
79 This variable's value is entirely equivalent to @code{(default-value
80 'enable-multibyte-characters)}, and setting this variable changes that
81 default value. Although setting the local binding of
82 @code{enable-multibyte-characters} in a specific buffer is dangerous,
83 changing the default value is safe, and it is a reasonable thing to do.
84
85 The @samp{--unibyte} command line option does its job by setting the
86 default value to @code{nil} early in startup.
87 @end defvar
88
89 @tindex multibyte-string-p
90 @defun multibyte-string-p string
91 Return @code{t} if @var{string} contains multibyte characters.
92 @end defun
93
94 @node Converting Representations
95 @section Converting Text Representations
96
97 Emacs can convert unibyte text to multibyte; it can also convert
98 multibyte text to unibyte, though this conversion loses information. In
99 general these conversions happen when inserting text into a buffer, or
100 when putting text from several strings together in one string. You can
101 also explicitly convert a string's contents to either representation.
102
103 Emacs chooses the representation for a string based on the text that
104 it is constructed from. The general rule is to convert unibyte text to
105 multibyte text when combining it with other multibyte text, because the
106 multibyte representation is more general and can hold whatever
107 characters the unibyte text has.
108
109 When inserting text into a buffer, Emacs converts the text to the
110 buffer's representation, as specified by
111 @code{enable-multibyte-characters} in that buffer. In particular, when
112 you insert multibyte text into a unibyte buffer, Emacs converts the text
113 to unibyte, even though this conversion cannot in general preserve all
114 the characters that might be in the multibyte text. The other natural
115 alternative, to convert the buffer contents to multibyte, is not
116 acceptable because the buffer's representation is a choice made by the
117 user that cannot be overridden automatically.
118
119 Converting unibyte text to multibyte text leaves @sc{ASCII} characters
120 unchanged, and likewise 128 through 159. It converts the non-@sc{ASCII}
121 codes 160 through 255 by adding the value @code{nonascii-insert-offset}
122 to each character code. By setting this variable, you specify which
123 character set the unibyte characters correspond to. For example, if
124 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
125 'latin-iso8859-1 0) 128)}, then the unibyte non-@sc{ASCII} characters
126 correspond to Latin 1. If it is 2688, which is @code{(- (make-char
127 'greek-iso8859-7 0) 128)}, then they correspond to Greek letters.
128
129 Converting multibyte text to unibyte is simpler: it performs
130 logical-and of each character code with 255. If
131 @code{nonascii-insert-offset} has a reasonable value, corresponding to
132 the beginning of some character set, this conversion is the inverse of
133 the other: converting unibyte text to multibyte and back to unibyte
134 reproduces the original unibyte text.
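
As a worked illustration of the arithmetic described above, assume that
@code{nonascii-insert-offset} is 2048 (the Latin 1 value) and take the
unibyte code 200. Plain integer arithmetic shows the round trip; the
choice of 200 is only an example value.

@example
;; @r{Unibyte to multibyte: add the offset.}
(+ 200 2048)
     @result{} 2248
;; @r{Multibyte to unibyte: logical-and with 255.}
(logand 2248 255)
     @result{} 200
@end example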
135
136 @tindex nonascii-insert-offset
137 @defvar nonascii-insert-offset
138 This variable specifies the amount to add to a non-@sc{ASCII} character
139 when converting unibyte text to multibyte. It also applies when
140 @code{insert-char} or @code{self-insert-command} inserts a character in
141 the unibyte non-@sc{ASCII} range, 128 through 255.
142
143 The right value to use to select character set @var{cs} is @code{(-
144 (make-char @var{cs} 0) 128)}. If the value of
145 @code{nonascii-insert-offset} is zero, then conversion actually uses the
146 value for the Latin 1 character set, rather than zero.
147 @end defvar
148
149 @tindex nonascii-translate-table
150 @defvar nonascii-translate-table
151 This variable provides a more general alternative to
152 @code{nonascii-insert-offset}. You can use it to specify independently
153 how to translate each code in the range of 128 through 255 into a
154 multibyte character. The value should be a vector, or @code{nil}.
155 If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
156 @end defvar
157
158 @tindex string-make-unibyte
159 @defun string-make-unibyte string
160 This function converts the text of @var{string} to unibyte
161 representation, if it isn't already, and returns the result. If
162 @var{string} is a unibyte string, it is returned unchanged.
163 @end defun
164
165 @tindex string-make-multibyte
166 @defun string-make-multibyte string
167 This function converts the text of @var{string} to multibyte
168 representation, if it isn't already, and returns the result. If
169 @var{string} is a multibyte string, it is returned unchanged.
170 @end defun
171
172 @node Selecting a Representation
173 @section Selecting a Representation
174
175 Sometimes it is useful to examine an existing buffer or string as
176 multibyte when it was unibyte, or vice versa.
177
178 @tindex set-buffer-multibyte
179 @defun set-buffer-multibyte multibyte
180 Set the representation type of the current buffer. If @var{multibyte}
181 is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
182 is @code{nil}, the buffer becomes unibyte.
183
184 This function leaves the buffer contents unchanged when viewed as a
185 sequence of bytes. As a consequence, it can change the contents viewed
186 as characters; a sequence of two bytes which is treated as one character
187 in multibyte representation will count as two characters in unibyte
188 representation.
189
190 This function sets @code{enable-multibyte-characters} to record which
191 representation is in use. It also adjusts various data in the buffer
192 (including overlays, text properties and markers) so that they cover the
193 same text as they did before.
194 @end defun
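
The following is a minimal sketch of one way @code{set-buffer-multibyte}
might be used: it temporarily views the current buffer as unibyte in
order to count its bytes, then restores the previous representation.
The variable name @code{was-multibyte} is purely illustrative.

@example
;; @r{Remember the current representation so it can be restored.}
(let ((was-multibyte enable-multibyte-characters))
  (unwind-protect
      (progn
        ;; @r{In unibyte representation, each byte is one character.}
        (set-buffer-multibyte nil)
        (message "Buffer holds %d bytes" (buffer-size)))
    (set-buffer-multibyte was-multibyte)))
@end example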
195
196 @tindex string-as-unibyte
197 @defun string-as-unibyte string
198 This function returns a string with the same bytes as @var{string} but
199 treating each byte as a character. This means that the value may have
200 more characters than @var{string} has.
201
202 If @var{string} is unibyte already, then the value is @var{string}
203 itself.
204 @end defun
205
206 @tindex string-as-multibyte
207 @defun string-as-multibyte string
208 This function returns a string with the same bytes as @var{string} but
209 treating each multibyte sequence as one character. This means that the
210 value may have fewer characters than @var{string} has.
211
212 If @var{string} is multibyte already, then the value is @var{string}
213 itself.
214 @end defun
215
216 @node Character Codes
217 @section Character Codes
218 @cindex character codes
219
220 The unibyte and multibyte text representations use different character
221 codes. The valid character codes for unibyte representation range from
222 0 to 255---the values that can fit in one byte. The valid character
223 codes for multibyte representation range from 0 to 524287, but not all
224 values in that range are valid. In particular, the values 128 through
225 255 are not legitimate in multibyte text (though they can occur in ``raw
226 bytes''; @pxref{Explicit Encoding}). Only the @sc{ASCII} codes 0
227 through 127 are fully legitimate in both representations.
228
229 @defun char-valid-p charcode
230 This returns @code{t} if @var{charcode} is valid for either one of the two
231 text representations.
232
233 @example
234 (char-valid-p 65)
235 @result{} t
236 (char-valid-p 256)
237 @result{} nil
238 (char-valid-p 2248)
239 @result{} t
240 @end example
241 @end defun
242
243 @node Character Sets
244 @section Character Sets
245 @cindex character sets
246
247 Emacs classifies characters into various @dfn{character sets}, each of
248 which has a name which is a symbol. Each character belongs to one and
249 only one character set.
250
251 In general, there is one character set for each distinct script. For
252 example, @code{latin-iso8859-1} is one character set,
253 @code{greek-iso8859-7} is another, and @code{ascii} is another. An
254 Emacs character set can hold at most 9025 characters; therefore, in some
255 cases, characters that would logically be grouped together are split
256 into several character sets. For example, one set of Chinese characters
257 is divided into seven Emacs character sets, @code{chinese-cns11643-1}
258 through @code{chinese-cns11643-7}.
259
260 @tindex charsetp
261 @defun charsetp object
262 Return @code{t} if @var{object} is a character set name symbol,
263 @code{nil} otherwise.
264 @end defun
265
266 @tindex charset-list
267 @defun charset-list
268 This function returns a list of all defined character set names.
269 @end defun
270
271 @tindex char-charset
272 @defun char-charset character
273 This function returns the name of the character
274 set that @var{character} belongs to.
275 @end defun
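
For instance, using the character codes that appear in the
@code{split-char} examples later in this chapter, one would expect
@code{char-charset} to behave as follows:

@example
(char-charset 65)
     @result{} ascii
(char-charset 2248)
     @result{} latin-iso8859-1
@end example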
276
277 @node Scanning Charsets
278 @section Scanning for Character Sets
279
280 Sometimes it is useful to find out which character sets appear in a
281 part of a buffer or a string. One use for this is in determining which
282 coding systems (@pxref{Coding Systems}) are capable of representing all
283 of the text in question.
284
285 @tindex find-charset-region
286 @defun find-charset-region beg end &optional unification
287 This function returns a list of the character sets
288 that appear in the current buffer between positions @var{beg}
289 and @var{end}.
290 @end defun
291
292 @tindex find-charset-string
293 @defun find-charset-string string &optional unification
294 This function returns a list of the character sets
295 that appear in the string @var{string}.
296 @end defun
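
As a rough sketch of how these functions might be used, the helper
below simply reports the character sets found in a string. The name
@code{my-report-charsets} is hypothetical and not part of Emacs.

@example
;; @r{Hypothetical helper: show which character sets STRING uses.}
(defun my-report-charsets (string)
  (message "Character sets used: %S"
           (find-charset-string string)))
@end example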
297
298 @node Chars and Bytes
299 @section Characters and Bytes
300 @cindex bytes and characters
301
302 In multibyte representation, each character occupies one or more
303 bytes. The functions in this section convert between characters and the
304 byte values used to represent them. For most purposes, there is no need
305 to be concerned with the number of bytes used to represent a character
306 because Emacs translates automatically when necessary.
307
308 @tindex char-bytes
309 @defun char-bytes character
310 This function returns the number of bytes used to represent the
311 character @var{character}. In most cases, this is the same as
312 @code{(length (split-char @var{character}))}; the only exception is for
313 @sc{ASCII} characters and the codes used in unibyte text, which use just one
314 byte.
315
316 @example
317 (char-bytes 2248)
318 @result{} 2
319 (char-bytes 65)
320 @result{} 1
321 @end example
322
323 This function's values are correct for both multibyte and unibyte
324 representations, because the non-@sc{ASCII} character codes used in
325 those two representations do not overlap.
326
327 @example
328 (char-bytes 192)
329 @result{} 1
330 @end example
331 @end defun
332
333 @tindex split-char
334 @defun split-char character
335 Return a list containing the name of the character set of
336 @var{character}, followed by one or two byte-values which identify
337 @var{character} within that character set.
338
339 @example
340 (split-char 2248)
341 @result{} (latin-iso8859-1 72)
342 (split-char 65)
343 @result{} (ascii 65)
344 @end example
345
346 Unibyte non-@sc{ASCII} characters are considered as part of
347 the @code{ascii} character set:
348
349 @example
350 (split-char 192)
351 @result{} (ascii 192)
352 @end example
353 @end defun
354
355 @tindex make-char
356 @defun make-char charset &rest byte-values
357 This function returns the character in character set @var{charset}
358 identified by @var{byte-values}. This is roughly the opposite of
359 @code{split-char}.
360
361 @example
362 (make-char 'latin-iso8859-1 72)
363 @result{} 2248
364 @end example
365 @end defun
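
Because @code{split-char} returns the character set name followed by the
byte values, its result can be handed back to @code{make-char}. A
minimal sketch, using the same character code as the examples above:

@example
;; @r{Split a character and reassemble it.}
(apply 'make-char (split-char 2248))
     @result{} 2248
@end example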
366
367 @node Coding Systems
368 @section Coding Systems
369
370 @cindex coding system
371 When Emacs reads or writes a file, and when Emacs sends text to a
372 subprocess or receives text from a subprocess, it normally performs
373 character code conversion and end-of-line conversion as specified
374 by a particular @dfn{coding system}.
375
376 @cindex character code conversion
377 @dfn{Character code conversion} involves conversion between the encoding
378 used inside Emacs and some other encoding. Emacs supports many
379 different encodings, in that it can convert to and from them. For
380 example, it can convert text to or from encodings such as Latin 1, Latin
381 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
382 cases, Emacs supports several alternative encodings for the same
383 characters; for example, there are three coding systems for the Cyrillic
384 (Russian) alphabet: ISO, Alternativnyj, and KOI8.
385
386 Most coding systems specify a particular character code for
387 conversion, but some of them leave this unspecified---to be chosen
388 heuristically based on the data.
389
390 @cindex end of line conversion
391 @dfn{End of line conversion} handles three different conventions used
392 on various systems for representing end of line in files. The Unix
393 convention is to use the linefeed character (also called newline). The
394 DOS convention is to use the two character sequence, carriage-return
395 linefeed, at the end of a line. The Mac convention is to use just
396 carriage-return.
397
398 @cindex base coding system
399 @cindex variant coding system
400 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
401 conversion unspecified, to be chosen based on the data. @dfn{Variant
402 coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
403 @code{latin-1-mac} specify the end-of-line conversion explicitly as
404 well. Each base coding system has three corresponding variants whose
405 names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
406
407 @node Lisp and Coding Systems
408 @subsection Coding Systems in Lisp
409
410 Here are Lisp facilities for working with coding systems:
411
412 @tindex coding-system-list
413 @defun coding-system-list &optional base-only
414 This function returns a list of all coding system names (symbols). If
415 @var{base-only} is non-@code{nil}, the value includes only the
416 base coding systems. Otherwise, it includes variant coding systems as well.
417 @end defun
418
419 @tindex coding-system-p
420 @defun coding-system-p object
421 This function returns @code{t} if @var{object} is a coding system
422 name.
423 @end defun
424
425 @tindex check-coding-system
426 @defun check-coding-system coding-system
427 This function checks the validity of @var{coding-system}.
428 If that is valid, it returns @var{coding-system}.
429 Otherwise it signals an error with condition @code{coding-system-error}.
430 @end defun
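
Since @code{check-coding-system} signals an error for an invalid name,
a caller may want to trap it. The following is only a sketch; the
symbol @code{no-such-coding-system} is a made-up name assumed not to be
defined.

@example
;; @r{Trap the error signaled for an invalid coding system name.}
(condition-case nil
    (check-coding-system 'no-such-coding-system)
  (coding-system-error
   (message "Not a valid coding system")))
@end example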
431
432 @tindex find-safe-coding-system
433 @defun find-safe-coding-system from to
434 This function returns a list of coding systems suitable for encoding
435 the text between @var{from} and @var{to}. All coding systems in the
436 list can safely encode any multibyte characters in the text.
437
438 If the text contains no multibyte characters, the value is a list whose
439 single element is @code{undecided}.
440 @end defun
441
442 @tindex detect-coding-region
443 @defun detect-coding-region start end highest
444 This function chooses a plausible coding system for decoding the text
445 from @var{start} to @var{end}. This text should be ``raw bytes''
446 (@pxref{Explicit Encoding}).
447
448 Normally this function returns a list of coding systems that could
449 handle decoding the text that was scanned. They are listed in order of
450 decreasing priority, based on the priority specified by the user with
451 @code{prefer-coding-system}. But if @var{highest} is non-@code{nil},
452 then the return value is just one coding system, the one that is highest
453 in priority.
454 @end defun
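
For example, assuming the current buffer holds raw bytes, a call of the
following shape asks for just the single highest-priority guess:

@example
;; @r{Guess one coding system for the raw bytes in the whole buffer.}
(detect-coding-region (point-min) (point-max) t)
@end example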
455
456 @tindex detect-coding-string
457 @defun detect-coding-string string highest
458 This function is like @code{detect-coding-region} except that it
459 operates on the contents of @var{string} instead of bytes in the buffer.
460 @end defun
461
462 @defun find-operation-coding-system operation &rest arguments
463 This function returns the coding system to use (by default) for
464 performing @var{operation} with @var{arguments}. The value has this
465 form:
466
467 @example
468 (@var{decoding-system} . @var{encoding-system})
469 @end example
470
471 Here @var{decoding-system} is the coding system to use
472 for decoding (in case @var{operation} does decoding), and
473 @var{encoding-system} is the coding system for encoding (in case
474 @var{operation} does encoding).
475
476 The argument @var{operation} should be an Emacs I/O primitive:
477 @code{insert-file-contents}, @code{write-region}, @code{call-process},
478 @code{call-process-region}, @code{start-process}, or
479 @code{open-network-stream}.
480
481 The remaining arguments should be the same arguments that might be given
482 to that I/O primitive. Depending on which primitive, one of those
483 arguments is selected as the @dfn{target}. For example, if
484 @var{operation} does file I/O, whichever argument specifies the file
485 name is the target. For subprocess primitives, the process name is the
486 target. For @code{open-network-stream}, the target is the service name
487 or port number.
488
489 This function looks up the target in @code{file-coding-system-alist},
490 @code{process-coding-system-alist}, or
491 @code{network-coding-system-alist}, depending on @var{operation}.
492 @xref{Default Coding Systems}.
493 @end defun
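
As an illustrative sketch, the following asks which coding system would
be used by default to decode a file when reading it. The file name
@file{/tmp/example.txt} is a placeholder, and the variable name
@code{systems} is arbitrary.

@example
;; @r{Look up the default coding systems for reading a given file.}
(let ((systems (find-operation-coding-system
                'insert-file-contents "/tmp/example.txt")))
  (message "Would decode with %S" (car systems)))
@end example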
494
495 Here are two functions you can use to let the user specify a coding
496 system, with completion. @xref{Completion}.
497
498 @tindex read-coding-system
499 @defun read-coding-system prompt default
500 This function reads a coding system using the minibuffer, prompting with
501 string @var{prompt}, and returns the coding system name as a symbol. If
502 the user enters null input, @var{default} specifies which coding system
503 to return. It should be a symbol or a string.
504 @end defun
505
506 @tindex read-non-nil-coding-system
507 @defun read-non-nil-coding-system prompt
508 This function reads a coding system using the minibuffer, prompting with
509 string @var{prompt}, and returns the coding system name as a symbol. If
510 the user tries to enter null input, it asks the user to try again.
511 @xref{Coding Systems}.
512 @end defun
513
514 @node Default Coding Systems
515 @section Default Coding Systems
516
517 These variables specify which coding system to use by default for
518 certain files or when running certain subprograms. The idea of these
519 variables is that you set them once and for all to the defaults you
520 want, and then do not change them again. To specify a particular coding
521 system for a particular operation in a Lisp program, don't change these
522 variables; instead, override them using @code{coding-system-for-read}
523 and @code{coding-system-for-write} (@pxref{Specifying Coding Systems}).
524
525 @tindex file-coding-system-alist
526 @defvar file-coding-system-alist
527 This variable is an alist that specifies the coding systems to use for
528 reading and writing particular files. Each element has the form
529 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
530 expression that matches certain file names. The element applies to file
531 names that match @var{pattern}.
532
533 The @sc{cdr} of the element, @var{coding}, should be either a coding
534 system, a cons cell containing two coding systems, or a function symbol.
535 If @var{coding} is a coding system, that coding system is used for both
536 reading the file and writing it. If @var{coding} is a cons cell containing
537 two coding systems, its @sc{car} specifies the coding system for
538 decoding, and its @sc{cdr} specifies the coding system for encoding.
539
540 If @var{coding} is a function symbol, the function must return a coding
541 system or a cons cell containing two coding systems. This value is used
542 as described above.
543 @end defvar
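
A minimal sketch of adding an element to this alist follows. The file
name extension @samp{.lat1} is invented for the example, and the
assumption is that such files should be read and written as Latin 1.

@example
;; @r{Use Latin 1 for files whose names end in ".lat1".}
(setq file-coding-system-alist
      (cons '("\\.lat1\\'" . latin-1)
            file-coding-system-alist))
@end example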
544
545 @tindex process-coding-system-alist
546 @defvar process-coding-system-alist
547 This variable is an alist specifying which coding systems to use for a
548 subprocess, depending on which program is running in the subprocess. It
549 works like @code{file-coding-system-alist}, except that @var{pattern} is
550 matched against the program name used to start the subprocess. The coding
551 system or systems specified in this alist are used to initialize the
552 coding systems used for I/O to the subprocess, but you can specify
553 other coding systems later using @code{set-process-coding-system}.
554 @end defvar
555
556 @tindex network-coding-system-alist
557 @defvar network-coding-system-alist
558 This variable is an alist that specifies the coding system to use for
559 network streams. It works much like @code{file-coding-system-alist},
560 with the difference that the @var{pattern} in an element may be either a
561 port number or a regular expression. If it is a regular expression, it
562 is matched against the network service name used to open the network
563 stream.
564 @end defvar
565
566 @tindex default-process-coding-system
567 @defvar default-process-coding-system
568 This variable specifies the coding systems to use for subprocess (and
569 network stream) input and output, when nothing else specifies what to
570 do.
571
572 The value should be a cons cell of the form @code{(@var{input-coding}
573 . @var{output-coding})}. Here @var{input-coding} applies to input from
574 the subprocess, and @var{output-coding} applies to output to it.
575 @end defvar
576
577 @node Specifying Coding Systems
578 @section Specifying a Coding System for One Operation
579
580 You can specify the coding system for a specific operation by binding
581 the variables @code{coding-system-for-read} and/or
582 @code{coding-system-for-write}.
583
584 @tindex coding-system-for-read
585 @defvar coding-system-for-read
586 If this variable is non-@code{nil}, it specifies the coding system to
587 use for reading a file, or for input from a synchronous subprocess.
588
589 It also applies to any asynchronous subprocess or network stream, but in
590 a different way: the value of @code{coding-system-for-read} when you
591 start the subprocess or open the network stream specifies the input
592 decoding method for that subprocess or network stream. It remains in
593 use for that subprocess or network stream unless and until overridden.
594
595 The right way to use this variable is to bind it with @code{let} for a
596 specific I/O operation. Its global value is normally @code{nil}, and
597 you should not globally set it to any other value. Here is an example
598 of the right way to use the variable:
599
600 @example
601 ;; @r{Read the file with no character code conversion.}
602 ;; @r{Assume @sc{crlf} represents end-of-line.}
603 (let ((coding-system-for-read 'emacs-mule-dos))
604   (insert-file-contents filename))
605 @end example
606
607 When its value is non-@code{nil}, @code{coding-system-for-read} takes
608 precedence over all other methods of specifying a coding system to use for
609 input, including @code{file-coding-system-alist},
610 @code{process-coding-system-alist} and
611 @code{network-coding-system-alist}.
612 @end defvar
613
614 @tindex coding-system-for-write
615 @defvar coding-system-for-write
616 This works much like @code{coding-system-for-read}, except that it
617 applies to output rather than input. It affects writing to files,
618 subprocesses, and net connections.
619
620 When a single operation does both input and output, as do
621 @code{call-process-region} and @code{start-process}, both
622 @code{coding-system-for-read} and @code{coding-system-for-write}
623 affect it.
624 @end defvar
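
By analogy with the example given for @code{coding-system-for-read}, a
sketch of binding @code{coding-system-for-write} around a single output
operation might look like this. The output file name is a placeholder.

@example
;; @r{Write the buffer as Latin 1 with Unix end-of-line convention.}
(let ((coding-system-for-write 'latin-1-unix))
  (write-region (point-min) (point-max)
                "/tmp/example-output.txt"))
@end example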
625
626 @tindex last-coding-system-used
627 @defvar last-coding-system-used
628 All I/O operations that use a coding system set this variable
629 to the coding system name that was used.
630 @end defvar
631
632 @tindex inhibit-eol-conversion
633 @defvar inhibit-eol-conversion
634 When this variable is non-@code{nil}, no end-of-line conversion is done,
635 no matter which coding system is specified. This applies to all the
636 Emacs I/O and subprocess primitives, and to the explicit encoding and
637 decoding functions (@pxref{Explicit Encoding}).
638 @end defvar
639
640 @tindex keyboard-coding-system
641 @defun keyboard-coding-system
642 This function returns the coding system that is in use for decoding
643 keyboard input---or @code{nil} if no coding system is to be used.
644 @end defun
645
646 @tindex set-keyboard-coding-system
647 @defun set-keyboard-coding-system coding-system
648 This function specifies @var{coding-system} as the coding system to
649 use for decoding keyboard input. If @var{coding-system} is @code{nil},
650 that means do not decode keyboard input.
651 @end defun
652
653 @tindex terminal-coding-system
654 @defun terminal-coding-system
655 This function returns the coding system that is in use for encoding
656 terminal output---or @code{nil} for no encoding.
657 @end defun
658
659 @tindex set-terminal-coding-system
660 @defun set-terminal-coding-system coding-system
661 This function specifies @var{coding-system} as the coding system to use
662 for encoding terminal output. If @var{coding-system} is @code{nil},
663 that means do not encode terminal output.
664 @end defun
665
666 See also the functions @code{process-coding-system} and
667 @code{set-process-coding-system}. @xref{Process Information}.
668
669 See also @code{read-coding-system} in @ref{High-Level Completion}.
670
671 @node Explicit Encoding
672 @section Explicit Encoding and Decoding
673 @cindex encoding text
674 @cindex decoding text
675
676 All the operations that transfer text in and out of Emacs have the
677 ability to use a coding system to encode or decode the text.
678 You can also explicitly encode and decode text using the functions
679 in this section.
680
681 @cindex raw bytes
682 The result of encoding, and the input to decoding, are not ordinary
683 text. They are ``raw bytes''---bytes that represent text in the same
684 way that an external file would. When a buffer contains raw bytes, it
685 is most natural to mark that buffer as using unibyte representation,
686 using @code{set-buffer-multibyte} (@pxref{Selecting a Representation}),
687 but this is not required. If the buffer's contents are only temporarily
688 raw, leave the buffer multibyte, which will be correct after you decode
689 them.
690
691 The usual way to get raw bytes in a buffer, for explicit decoding, is
692 to read them from a file with @code{insert-file-contents-literally}
693 (@pxref{Reading from Files}) or specify a non-@code{nil} @var{rawfile}
694 argument when visiting a file with @code{find-file-noselect}.
695
696 The usual way to use the raw bytes that result from explicitly
697 encoding text is to copy them to a file or process---for example, to
698 write them with @code{write-region} (@pxref{Writing to Files}), and
699 suppress encoding for that @code{write-region} call by binding
700 @code{coding-system-for-write} to @code{no-conversion}.
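
The procedure just described can be sketched as follows, assuming the
current buffer is a scratch buffer whose text may be replaced by its
encoded form. The coding system @code{latin-1} and the file name are
example choices only.

@example
;; @r{Encode the whole buffer in place; it now holds raw bytes.}
(encode-coding-region (point-min) (point-max) 'latin-1)
;; @r{Write those bytes out without encoding them a second time.}
(let ((coding-system-for-write 'no-conversion))
  (write-region (point-min) (point-max)
                "/tmp/example-encoded.txt"))
@end example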
701
702 @tindex encode-coding-region
703 @defun encode-coding-region start end coding-system
704 This function encodes the text from @var{start} to @var{end} according
705 to coding system @var{coding-system}. The encoded text replaces the
706 original text in the buffer. The result of encoding is ``raw bytes,''
707 but the buffer remains multibyte if it was multibyte before.
708 @end defun
709
710 @tindex encode-coding-string
711 @defun encode-coding-string string coding-system
712 This function encodes the text in @var{string} according to coding
713 system @var{coding-system}. It returns a new string containing the
714 encoded text. The result of encoding is a unibyte string of ``raw bytes.''
715 @end defun
716
717 @tindex decode-coding-region
718 @defun decode-coding-region start end coding-system
719 This function decodes the text from @var{start} to @var{end} according
720 to coding system @var{coding-system}. The decoded text replaces the
721 original text in the buffer. To make explicit decoding useful, the text
722 before decoding ought to be ``raw bytes.''
723 @end defun
724
725 @tindex decode-coding-string
726 @defun decode-coding-string string coding-system
727 This function decodes the text in @var{string} according to coding
728 system @var{coding-system}. It returns a new string containing the
729 decoded text. To make explicit decoding useful, the contents of
730 @var{string} ought to be ``raw bytes.''
731 @end defun
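
The string functions can be combined into a simple round trip, shown
here as a sketch with plain @sc{ASCII} text and the @code{latin-1}
coding system. The variable names are illustrative. If the round trip
preserves the text, the final comparison is non-@code{nil}.

@example
;; @r{Encode a string to raw bytes, decode it again, and compare.}
(let* ((text "hello")
       (bytes (encode-coding-string text 'latin-1))
       (back (decode-coding-string bytes 'latin-1)))
  (string= text back))
@end example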
732
733 @node MS-DOS File Types
734 @section MS-DOS File Types
735 @cindex DOS file types
736 @cindex MS-DOS file types
737 @cindex Windows file types
738 @cindex file types on MS-DOS and Windows
739 @cindex text files and binary files
740 @cindex binary files and text files
741
742 Emacs on MS-DOS and on MS-Windows recognizes certain file names as
743 text files or binary files. For a text file, Emacs always uses DOS
744 end-of-line conversion. For a binary file, Emacs does no end-of-line
745 conversion and no character code conversion.
746
747 @defvar buffer-file-type
748 This variable, automatically buffer-local in each buffer, records the
749 file type of the buffer's visited file. The value is @code{nil} for
750 text, @code{t} for binary. When a buffer does not specify a coding
751 system with @code{buffer-file-coding-system}, this variable is used by
752 the function @code{find-buffer-file-type-coding-system} to determine
753 which coding system to use when writing the contents of the buffer.
754 @end defvar
755
756 @defopt file-name-buffer-file-type-alist
757 This variable holds an alist for recognizing text and binary files.
758 Each element has the form @code{(@var{regexp} . @var{type})}, where
759 @var{regexp} is matched against the file name, and @var{type} may be
760 @code{nil} for text, @code{t} for binary, or a function to call to
761 compute which. If it is a function, then it is called with a single
762 argument (the file name) and should return @code{t} or @code{nil}.
763
764 When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
765 which coding system to use when reading a file. For a text file,
766 @code{undecided-dos} is used. For a binary file, @code{no-conversion}
767 is used.
768
769 If no element in this alist matches a given file name, then
770 @code{default-buffer-file-type} says how to treat the file.
771 @end defopt
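
As a sketch of the element format described above, the following adds
an entry that treats files with a hypothetical @samp{.dat} extension as
binary:

@example
;; @r{Treat files whose names end in ".dat" as binary.}
(setq file-name-buffer-file-type-alist
      (cons '("\\.dat\\'" . t)
            file-name-buffer-file-type-alist))
@end example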
772
773 @defopt default-buffer-file-type
774 This variable says how to handle files for which
775 @code{file-name-buffer-file-type-alist} says nothing about the type.
776
777 If this variable is non-@code{nil}, then these files are treated as
778 binary. Otherwise, nothing special is done for them---the coding system
779 is deduced solely from the file contents, in the usual Emacs fashion.
780 @end defopt
781
782 @node MS-DOS Subprocesses
783 @section MS-DOS Subprocesses
784
785 On Microsoft operating systems, these variables provide an alternative
786 way to specify the kind of end-of-line conversion to use for input and
787 output. The variable @code{binary-process-input} applies to input sent
788 to the subprocess, and @code{binary-process-output} applies to output
789 received from it. A non-@code{nil} value means the data is ``binary,''
790 and @code{nil} means the data is text.
791
792 @defvar binary-process-input
793 If this variable is @code{nil}, convert newlines to @sc{crlf} sequences in
794 the input to a synchronous subprocess.
795 @end defvar
796
797 @defvar binary-process-output
798 If this variable is @code{nil}, convert @sc{crlf} sequences to newlines in
799 the output from a synchronous subprocess.
800 @end defvar