   1 @c -*-texinfo-*-
   2 @c This is part of the GNU Emacs Lisp Reference Manual.
   3 @c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
   4 @c   2005, 2006, 2007, 2008  Free Software Foundation, Inc.
   5 @c See the file elisp.texi for copying conditions.
   6 @setfilename ../../info/characters
   7 @node Non-ASCII Characters, Searching and Matching, Text, Top
   8 @chapter Non-@acronym{ASCII} Characters
   9 @cindex multibyte characters
  10 @cindex characters, multi-byte
  11 @cindex non-@acronym{ASCII} characters
  12
  13   This chapter covers the special issues relating to characters and
  14 how they are stored in strings and buffers.
  15
  16 @menu
  17 * Text Representations::    How Emacs represents text.
  18 * Converting Representations::  Converting unibyte to multibyte and vice versa.
  19 * Selecting a Representation::  Treating a byte sequence as unibyte or multi.
  20 * Character Codes::         How unibyte and multibyte relate to
  21                                 codes of individual characters.
  22 * Character Sets::          The space of possible character codes
  23                                 is divided into various character sets.
  24 * Chars and Bytes::         More information about multibyte encodings.
  25 * Splitting Characters::    Converting a character to its byte sequence.
  26 * Scanning Charsets::       Which character sets are used in a buffer?
  27 * Translation of Characters::   Translation tables are used for conversion.
  28 * Coding Systems::          Coding systems are conversions for saving files.
  29 * Input Methods::           Input methods allow users to enter various
  30                                 non-ASCII characters without special keyboards.
  31 * Locales::                 Interacting with the POSIX locale.
  32 @end menu
  33
  34 @node Text Representations
  35 @section Text Representations
  36 @cindex text representation
  37
  38   Emacs buffers and strings support a large repertoire of characters
  39 from many different scripts.  This allows users to type and display
  40 text in almost any known written language.
  41
  42 @cindex character codepoint
  43 @cindex codespace
  44 @cindex Unicode
  45   To support this multitude of characters and scripts, Emacs closely
  46 follows the @dfn{Unicode Standard}.  The Unicode Standard assigns a
  47 unique number, called a @dfn{codepoint}, to each and every character.
  48 The range of codepoints defined by Unicode, or the Unicode
  49 @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive.  Emacs
  50 extends this range with codepoints in the range @code{3FFF80..3FFFFF},
  51 which it uses for representing raw 8-bit bytes that cannot be
  52 interpreted as characters.  Thus, a character codepoint in Emacs is a
  53 22-bit integer number.
  54
  55 @cindex internal representation of characters
  56 @cindex characters, representation in buffers and strings
  57 @cindex multibyte text
  58   To conserve memory, Emacs does not store the codepoints of text
  59 characters in buffers and strings as fixed-length 22-bit numbers.
  60 Rather, Emacs uses a variable-length internal representation of
  61 characters, which stores each character as a sequence of 1 to 5 8-bit
  62 bytes, depending on the magnitude of its codepoint@footnote{
  63 This internal representation is based on one of the encodings defined
  64 by the Unicode Standard, called @dfn{UTF-8}, for representing any
  65 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
  66 codepoints it uses for raw 8-bit bytes.}.
  67 For example, any @acronym{ASCII} character takes up only 1 byte, a
  68 Latin-1 character takes up 2 bytes, etc.  We call this representation
  69 of text @dfn{multibyte}, because it uses several bytes for each
  70 character.
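
  As a rough illustration (a sketch, assuming that @code{string}
yields a multibyte string when given a non-@acronym{ASCII} character),
@code{string-bytes}, described later in this section, can be used to
observe how many bytes a single character occupies:

@example
;; ASCII character: one byte.
(string-bytes (string ?a))
     @result{} 1
;; Latin-1 character U+00E9: two bytes.
(string-bytes (string #xe9))
     @result{} 2
;; CJK character U+4E2D: three bytes.
(string-bytes (string #x4e2d))
     @result{} 3
@end example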
  71
  72   Outside Emacs, characters can be represented in many different
  73 encodings, such as ISO-8859-1, GB-2312, Big-5, etc.  Emacs converts
  74 between these external encodings and the internal representation, as
  75 appropriate, when it reads text into a buffer or a string, or when it
  76 writes text to a disk file or passes it to some other process.
  77
  78   Occasionally, Emacs needs to hold and manipulate encoded text or
  79 binary non-text data in its buffer or string.  For example, when Emacs
  80 visits a file, it first reads the file's text verbatim into a buffer,
  81 and only then converts it to the internal representation.  Before the
  82 conversion, the buffer holds encoded text.
  83
  84 @cindex unibyte text
  85   Encoded text is not really text, as far as Emacs is concerned, but
  86 rather a sequence of raw 8-bit bytes.  We call buffers and strings
  87 that hold encoded text @dfn{unibyte} buffers and strings, because
  88 Emacs treats them as a sequence of individual bytes.  In particular,
  89 Emacs usually displays unibyte buffers and strings as octal codes such
  90 as @code{\237}.  We recommend that you never use unibyte buffers and
  91 strings except for manipulating encoded text or binary non-text data.
  92
  93   In a buffer, the buffer-local value of the variable
  94 @code{enable-multibyte-characters} specifies the representation used.
  95 The representation for a string is determined and recorded in the string
  96 when the string is constructed.
  97
  98 @defvar enable-multibyte-characters
  99 This variable specifies the current buffer's text representation.
 100 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
 101 it contains unibyte encoded text or binary non-text data.
 102
 103 You cannot set this variable directly; instead, use the function
 104 @code{set-buffer-multibyte} to change a buffer's representation.
 105 @end defvar
 106
 107 @defvar default-enable-multibyte-characters
 108 This variable's value is entirely equivalent to @code{(default-value
 109 'enable-multibyte-characters)}, and setting this variable changes that
 110 default value.  Setting the local binding of
 111 @code{enable-multibyte-characters} in a specific buffer is not allowed,
 112 but changing the default value is supported, and it is a reasonable
 113 thing to do, because it has no effect on existing buffers.
 114
 115 The @samp{--unibyte} command line option does its job by setting the
 116 default value to @code{nil} early in startup.
 117 @end defvar
 118
 119 @defun position-bytes position
 120 Buffer positions are measured in character units.  This function
 121 returns the byte-position corresponding to buffer position
 122 @var{position} in the current buffer.  This is 1 at the start of the
 123 buffer, and counts upward in bytes.  If @var{position} is out of
 124 range, the value is @code{nil}.
 125 @end defun
 126
 127 @defun byte-to-position byte-position
 128 Return the buffer position, in character units, corresponding to
 129 byte-position @var{byte-position} in the current buffer.  If
 130 @var{byte-position} is out of range, the value is @code{nil}.
 131 @end defun
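
  A small sketch of how character positions and byte positions can
diverge, assuming a multibyte temporary buffer (the usual result of
@code{with-temp-buffer}) that holds @samp{a}, the two-byte character
U+00E9, and @samp{b}:

@example
(with-temp-buffer
  (insert ?a #xe9 ?b)
  (list (position-bytes 1)      ; `a' starts at byte 1
        (position-bytes 2)      ; U+00E9 starts at byte 2
        (position-bytes 3)      ; `b' starts at byte 4
        (byte-to-position 4)    ; byte 4 is character position 3
        (position-bytes 10)))   ; out of range
     @result{} (1 2 4 3 nil)
@end example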
 132
 133 @defun multibyte-string-p string
 134 Return @code{t} if @var{string} is a multibyte string, @code{nil}
 135 otherwise.
 136 @end defun
 137
 138 @defun string-bytes string
 139 @cindex string, number of bytes
 140 This function returns the number of bytes in @var{string}.
 141 If @var{string} is a multibyte string, this can be greater than
 142 @code{(length @var{string})}.
 143 @end defun
 144
 145 @defun unibyte-string &rest bytes
 146 This function concatenates all its arguments @var{bytes} and makes the
 147 result a unibyte string.
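
For instance, the following sketch builds a unibyte string from two
byte values (72 and 105, the codes of @samp{H} and @samp{i}) and
compares it with a multibyte string made by @code{string}:

@example
(unibyte-string 72 105)
     @result{} "Hi"
(multibyte-string-p (unibyte-string 72 105))
     @result{} nil
(multibyte-string-p (string #x4e2d))
     @result{} t
@end example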
 148 @end defun
 149
 150 @node Converting Representations
 151 @section Converting Text Representations
 152
 153   Emacs can convert unibyte text to multibyte; it can also convert
 154 multibyte text to unibyte, though this conversion loses information.  In
 155 general these conversions happen when inserting text into a buffer, or
 156 when putting text from several strings together in one string.  You can
 157 also explicitly convert a string's contents to either representation.
 158
 159   Emacs chooses the representation for a string based on the text that
 160 it is constructed from.  The general rule is to convert unibyte text to
 161 multibyte text when combining it with other multibyte text, because the
 162 multibyte representation is more general and can hold whatever
 163 characters the unibyte text has.
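
  As an informal check of this rule, combining a unibyte string with a
multibyte one via @code{concat} yields a multibyte result (the
variable names @code{uni} and @code{multi} are only illustrative):

@example
(let ((uni (unibyte-string 97 98))     ; unibyte "ab"
      (multi (string #x4e2d)))         ; multibyte, one CJK character
  (list (multibyte-string-p uni)
        (multibyte-string-p multi)
        (multibyte-string-p (concat uni multi))
        (length (concat uni multi))))
     @result{} (nil t t 3)
@end example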
 164
 165   When inserting text into a buffer, Emacs converts the text to the
 166 buffer's representation, as specified by
 167 @code{enable-multibyte-characters} in that buffer.  In particular, when
 168 you insert multibyte text into a unibyte buffer, Emacs converts the text
 169 to unibyte, even though this conversion cannot in general preserve all
 170 the characters that might be in the multibyte text.  The other natural
 171 alternative, to convert the buffer contents to multibyte, is not
 172 acceptable because the buffer's representation is a choice made by the
 173 user that cannot be overridden automatically.
 174
 175   Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
 176 unchanged, and likewise character codes 128 through 159.  It converts
 177 the non-@acronym{ASCII} codes 160 through 255 by adding the value
 178 @code{nonascii-insert-offset} to each character code.  By setting this
 179 variable, you specify which character set the unibyte characters
 180 correspond to (@pxref{Character Sets}).  For example, if
 181 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
 182 'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
 183 correspond to Latin 1.  If it is 2688, which is @code{(- (make-char
 184 'greek-iso8859-7) 128)}, then they correspond to Greek letters.
 185
 186   Converting multibyte text to unibyte is simpler: it discards all but
 187 the low 8 bits of each character code.  If @code{nonascii-insert-offset}
 188 has a reasonable value, corresponding to the beginning of some character
 189 set, this conversion is the inverse of the other: converting unibyte
 190 text to multibyte and back to unibyte reproduces the original unibyte
 191 text.
 192
 193 @defvar nonascii-insert-offset
 194 This variable specifies the amount to add to a non-@acronym{ASCII} character
 195 when converting unibyte text to multibyte.  It also applies when
 196 @code{self-insert-command} inserts a character in the unibyte
 197 non-@acronym{ASCII} range, 128 through 255.  However, the functions
 198 @code{insert} and @code{insert-char} do not perform this conversion.
 199
 200 The right value to use to select character set @var{cs} is @code{(-
 201 (make-char @var{cs}) 128)}.  If the value of
 202 @code{nonascii-insert-offset} is zero, then conversion actually uses the
 203 value for the Latin 1 character set, rather than zero.
 204 @end defvar
 205
 206 @defvar nonascii-translation-table
 207 This variable provides a more general alternative to
 208 @code{nonascii-insert-offset}.  You can use it to specify independently
 209 how to translate each code in the range of 128 through 255 into a
 210 multibyte character.  The value should be a char-table, or @code{nil}.
 211 If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
 212 @end defvar
 213
 214 The next three functions either return the argument @var{string}, or a
 215 newly created string with no text properties.
 216
 217 @defun string-make-unibyte string
 218 This function converts the text of @var{string} to unibyte
 219 representation, if it isn't already, and returns the result.  If
 220 @var{string} is a unibyte string, it is returned unchanged.  Multibyte
 221 character codes are converted to unibyte according to
 222 @code{nonascii-translation-table} or, if that is @code{nil}, using
 223 @code{nonascii-insert-offset}.  If the lookup in the translation table
 224 fails, this function takes just the low 8 bits of each character.
 225 @end defun
 226
 227 @defun string-make-multibyte string
 228 This function converts the text of @var{string} to multibyte
 229 representation, if it isn't already, and returns the result.  If
 230 @var{string} is a multibyte string or consists entirely of
 231 @acronym{ASCII} characters, it is returned unchanged.  In particular,
 232 if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
 233 string is unibyte.  (When the characters are all @acronym{ASCII},
 234 Emacs primitives will treat the string the same way whether it is
 235 unibyte or multibyte.)  If @var{string} is unibyte and contains
 236 non-@acronym{ASCII} characters, the function
 237 @code{unibyte-char-to-multibyte} is used to convert each unibyte
 238 character to a multibyte character.
 239 @end defun
 240
 241 @defun string-to-multibyte string
 242 This function returns a multibyte string containing the same sequence
 243 of character codes as @var{string}.  Unlike
 244 @code{string-make-multibyte}, this function unconditionally returns a
 245 multibyte string.  If @var{string} is a multibyte string, it is
 246 returned unchanged.
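
A brief sketch of this behavior; the byte value 200 is an arbitrary
choice, and the resulting codepoint falls in the raw-byte range
@code{3FFF80..3FFFFF}:

@example
;; An ASCII-only literal is unibyte; the result is multibyte.
(multibyte-string-p "abc")
     @result{} nil
(multibyte-string-p (string-to-multibyte "abc"))
     @result{} t
;; A raw byte becomes the corresponding raw-byte character.
(= (aref (string-to-multibyte (unibyte-string 200)) 0) #x3fffc8)
     @result{} t
@end example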
 247 @end defun
 248
 249 @defun multibyte-char-to-unibyte char
 250 This function converts the multibyte character @var{char} to a unibyte
 251 character, based on @code{nonascii-translation-table} and
 252 @code{nonascii-insert-offset}.
 253 @end defun
 254
 255 @defun unibyte-char-to-multibyte char
 256 This function converts the unibyte character @var{char} to a multibyte
 257 character, based on @code{nonascii-translation-table} and
 258 @code{nonascii-insert-offset}.
 259 @end defun
 260
 261 @node Selecting a Representation
 262 @section Selecting a Representation
 263
 264   Sometimes it is useful to examine an existing buffer or string as
 265 multibyte when it was unibyte, or vice versa.
 266
 267 @defun set-buffer-multibyte multibyte
 268 Set the representation type of the current buffer.  If @var{multibyte}
 269 is non-@code{nil}, the buffer becomes multibyte.  If @var{multibyte}
 270 is @code{nil}, the buffer becomes unibyte.
 271
 272 This function leaves the buffer contents unchanged when viewed as a
 273 sequence of bytes.  As a consequence, it can change the contents viewed
 274 as characters; a sequence of two bytes which is treated as one character
 275 in multibyte representation will count as two characters in unibyte
 276 representation.  Character codes 128 through 159 are an exception.  They
 277 are represented by one byte in a unibyte buffer, but when the buffer is
 278 set to multibyte, they are converted to two-byte sequences, and vice
 279 versa.
 280
 281 This function sets @code{enable-multibyte-characters} to record which
 282 representation is in use.  It also adjusts various data in the buffer
 283 (including overlays, text properties and markers) so that they cover the
 284 same text as they did before.
 285
 286 You cannot use @code{set-buffer-multibyte} on an indirect buffer,
 287 because indirect buffers always inherit the representation of the
 288 base buffer.
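
The following sketch, assuming a temporary buffer that starts out
multibyte, shows the character count changing while the bytes stay the
same:

@example
(with-temp-buffer
  (insert #xe9)                 ; one character, stored as two bytes
  (let ((before (buffer-size)))
    (set-buffer-multibyte nil)  ; reinterpret the same bytes
    (list before (buffer-size) enable-multibyte-characters)))
     @result{} (1 2 nil)
@end example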
 289 @end defun
 290
 291 @defun string-as-unibyte string
 292 This function returns a string with the same bytes as @var{string} but
 293 treating each byte as a character.  This means that the value may have
 294 more characters than @var{string} has.
 295
 296 If @var{string} is already a unibyte string, then the value is
 297 @var{string} itself.  Otherwise it is a newly created string, with no
 298 text properties.  If @var{string} is multibyte, any characters it
 299 contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
 300 are converted to the corresponding single byte.
 301 @end defun
 302
 303 @defun string-as-multibyte string
 304 This function returns a string with the same bytes as @var{string} but
 305 treating each multibyte sequence as one character.  This means that the
 306 value may have fewer characters than @var{string} has.
 307
 308 If @var{string} is already a multibyte string, then the value is
 309 @var{string} itself.  Otherwise it is a newly created string, with no
 310 text properties.  If @var{string} is unibyte and contains any individual
 311 8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
 312 the corresponding multibyte character of charset @code{eight-bit-control}
 313 or @code{eight-bit-graphic}.
 314 @end defun
 315
 316 @node Character Codes
 317 @section Character Codes
 318 @cindex character codes
 319
 320   The unibyte and multibyte text representations use different
 321 character codes.  The valid character codes for unibyte representation
 322 range from 0 to 255---the values that can fit in one byte.  The valid
 323 character codes for multibyte representation range from 0 to 4194303,
 324 but not all values in that range are valid.  The values 128 through
 325 255 do not usually show up in multibyte text, but they can occur if
 326 you do explicit encoding and decoding (@pxref{Explicit Encoding}).
 327 Some other character codes cannot occur at all in multibyte text.
 328 Only the @acronym{ASCII} codes 0 through 127 are completely legitimate
 329 in both representations.
 330
 331 @defun characterp charcode
 332 This returns @code{t} if @var{charcode} is a valid character, and
 333 @code{nil} otherwise.
 334
 335 @example
 336 (characterp 65)
 337      @result{} t
 338 (characterp 256)
 339      @result{} nil
 340 (characterp 4194303)
 341      @result{} t
 342 (characterp 4194304)
 343      @result{} nil
 344 @end example
 345 @end defun
 346
 347 @node Character Sets
 348 @section Character Sets
 349 @cindex character sets
 350
 351   Emacs classifies characters into various @dfn{character sets}, each of
 352 which has a name which is a symbol.  Each character belongs to one and
 353 only one character set.
 354
 355   In general, there is one character set for each distinct script.  For
 356 example, @code{latin-iso8859-1} is one character set,
 357 @code{greek-iso8859-7} is another, and @code{ascii} is another.  An
 358 Emacs character set can hold at most 9025 characters; therefore, in some
 359 cases, characters that would logically be grouped together are split
 360 into several character sets.  For example, one set of Chinese
 361 characters, generally known as Big 5, is divided into two Emacs
 362 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
 363
 364   @acronym{ASCII} characters are in character set @code{ascii}.  The
 365 non-@acronym{ASCII} characters 128 through 159 are in character set
 366 @code{eight-bit-control}, and codes 160 through 255 are in character set
 367 @code{eight-bit-graphic}.
 368
 369 @defun charsetp object
 370 Returns @code{t} if @var{object} is a symbol that names a character set,
 371 @code{nil} otherwise.
 372 @end defun
 373
 374 @defvar charset-list
 375 The value is a list of all defined character set names.
 376 @end defvar
 377
 378 @defun charset-list
 379 This function returns the value of @code{charset-list}.  It is only
 380 provided for backward compatibility.
 381 @end defun
 382
 383 @defun char-charset character
 384 This function returns the name of the character set that @var{character}
 385 belongs to, or the symbol @code{unknown} if @var{character} is not a
 386 valid character.
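
For example (the symbol @code{no-such-charset} is a made-up name used
only for illustration):

@example
(char-charset ?a)
     @result{} ascii
(charsetp 'ascii)
     @result{} t
(charsetp 'no-such-charset)
     @result{} nil
@end example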
 387 @end defun
 388
 389 @defun charset-plist charset
 390 This function returns the charset property list of the character set
 391 @var{charset}.  Although @var{charset} is a symbol, this is not the same
 392 as the property list of that symbol.  Charset properties are used for
 393 special purposes within Emacs.
 394 @end defun
 395
 396 @deffn Command list-charset-chars charset
 397 This command displays a list of characters in the character set
 398 @var{charset}.
 399 @end deffn
 400
 401 @node Chars and Bytes
 402 @section Characters and Bytes
 403 @cindex bytes and characters
 404
 405 @cindex introduction sequence (of character)
 406 @cindex dimension (of character set)
 407   In multibyte representation, each character occupies one or more
 408 bytes.  Each character set has an @dfn{introduction sequence}, which is
 409 normally one or two bytes long.  (Exception: the @code{ascii} character
 410 set and the @code{eight-bit-graphic} character set have a zero-length
 411 introduction sequence.)  The introduction sequence is the beginning of
 412 the byte sequence for any character in the character set.  The rest of
 413 the character's bytes distinguish it from the other characters in the
 414 same character set.  Depending on the character set, there are either
 415 one or two distinguishing bytes; the number of such bytes is called the
 416 @dfn{dimension} of the character set.
 417
 418 @defun charset-dimension charset
 419 This function returns the dimension of @var{charset}; at present, the
 420 dimension is always 1 or 2.
 421 @end defun
 422
 423 @defun charset-bytes charset
 424 This function returns the number of bytes used to represent a character
 425 in character set @var{charset}.
 426 @end defun
 427
 428   This is the simplest way to determine the byte length of a character
 429 set's introduction sequence:
 430
 431 @example
 432 (- (charset-bytes @var{charset})
 433    (charset-dimension @var{charset}))
 434 @end example
 435
 436 @node Splitting Characters
 437 @section Splitting Characters
 438 @cindex character as bytes
 439
 440   The functions in this section convert between characters and the byte
 441 values used to represent them.  For most purposes, there is no need to
 442 be concerned with the sequence of bytes used to represent a character,
 443 because Emacs translates automatically when necessary.
 444
 445 @defun split-char character
 446 Return a list containing the name of the character set of
 447 @var{character}, followed by one or two byte values (integers) which
 448 identify @var{character} within that character set.  The number of byte
 449 values is the character set's dimension.
 450
 451 If @var{character} is invalid as a character code, @code{split-char}
 452 returns a list consisting of the symbol @code{unknown} and @var{character}.
 453
 454 @example
 455 (split-char 2248)
 456      @result{} (latin-iso8859-1 72)
 457 (split-char 65)
 458      @result{} (ascii 65)
 459 (split-char 128)
 460      @result{} (eight-bit-control 128)
 461 @end example
 462 @end defun
 463
 464 @c FIXME: update split-char and make-char
 465 @cindex generate characters in charsets
 466 @defun make-char charset &optional code1 code2
 467 This function returns the character in character set @var{charset} whose
 468 position codes are @var{code1} and @var{code2}.  This is roughly the
 469 inverse of @code{split-char}.  Normally, you should specify either one
 470 or both of @var{code1} and @var{code2} according to the dimension of
 471 @var{charset}.  For example,
 472
 473 @example
 474 (make-char 'latin-iso8859-1 72)
 475      @result{} 2248
 476 @end example
 477
 478 Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
 479 before they are used to index @var{charset}.  Thus you may use, for
 480 instance, an ISO 8859 character code rather than subtracting 128, as
 481 is necessary to index the corresponding Emacs charset.
 482 @end defun
 483
 484 @node Scanning Charsets
 485 @section Scanning for Character Sets
 486
 487   Sometimes it is useful to find out which character sets appear in a
 488 part of a buffer or a string.  One use for this is in determining which
 489 coding systems (@pxref{Coding Systems}) are capable of representing all
 490 of the text in question.
 491
 492 @defun charset-after &optional pos
 493 This function returns the charset of a character in the current buffer
 494 at position @var{pos}.  If @var{pos} is omitted or @code{nil}, it
 495 defaults to the current value of point.  If @var{pos} is out of range,
 496 the value is @code{nil}.
 497 @end defun
 498
 499 @defun find-charset-region beg end &optional translation
 500 This function returns a list of the character sets that appear in the
 501 current buffer between positions @var{beg} and @var{end}.
 502
 503 The optional argument @var{translation} specifies a translation table to
 504 be used in scanning the text (@pxref{Translation of Characters}).  If it
 505 is non-@code{nil}, then each character in the region is translated
 506 through this table, and the value returned describes the translated
 507 characters instead of the characters actually in the buffer.
 508 @end defun
 509
 510 @defun find-charset-string string &optional translation
 511 This function returns a list of the character sets that appear in the
 512 string @var{string}.  It is just like @code{find-charset-region}, except
 513 that it applies to the contents of @var{string} instead of part of the
 514 current buffer.
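
A minimal sketch with purely @acronym{ASCII} text, where only the
@code{ascii} character set is expected:

@example
(find-charset-string "abc")
     @result{} (ascii)
(with-temp-buffer
  (insert "abc")
  (list (charset-after 1)
        (find-charset-region (point-min) (point-max))))
     @result{} (ascii (ascii))
@end example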
 515 @end defun
 516
 517 @node Translation of Characters
 518 @section Translation of Characters
 519 @cindex character translation tables
 520 @cindex translation tables
 521
 522   A @dfn{translation table} is a char-table that specifies a mapping
 523 of characters into characters.  These tables are used in encoding and
 524 decoding, and for other purposes.  Some coding systems specify their
 525 own particular translation tables; there are also default translation
 526 tables which apply to all other coding systems.
 527
 528   For instance, the coding-system @code{utf-8} has a translation table
 529 that maps characters of various charsets (e.g.,
 530 @code{latin-iso8859-@var{x}}) into Unicode character sets.  This way,
 531 it can encode Latin-2 characters into UTF-8.  Meanwhile,
 532 @code{unify-8859-on-decoding-mode} operates by specifying
 533 @code{standard-translation-table-for-decode} to translate
 534 Latin-@var{x} characters into corresponding Unicode characters.
 535
 536 @defun make-translation-table &rest translations
 537 This function returns a translation table based on the argument
 538 @var{translations}.  Each element of @var{translations} should be a
 539 list of elements of the form @code{(@var{from} . @var{to})}; this says
 540 to translate the character @var{from} into @var{to}.
 541
 542 The arguments and the forms in each argument are processed in order,
 543 and if a previous form already translates @var{to} to some other
 544 character, say @var{to-alt}, @var{from} is also translated to
 545 @var{to-alt}.
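
As a sketch of both points, the tables below are queried with
@code{aref}, which works because a translation table is a char-table:

@example
;; A single mapping: `a' translates to `b'.
(aref (make-translation-table '((?a . ?b))) ?a)
     @result{} 98        ; ?b
;; `b' already translates to `c', so `a' ends up at `c' too.
(aref (make-translation-table '((?b . ?c) (?a . ?b))) ?a)
     @result{} 99        ; ?c
@end example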
 546 @end defun
 547
 548   In decoding, the translation table's translations are applied to the
 549 characters that result from ordinary decoding.  If a coding system has
 550 property @code{translation-table-for-decode}, that specifies the
 551 translation table to use.  (This is a property of the coding system,
 552 as returned by @code{coding-system-get}, not a property of the symbol
 553 that is the coding system's name. @xref{Coding System Basics,, Basic
 554 Concepts of Coding Systems}.)  Otherwise, if
 555 @code{standard-translation-table-for-decode} is non-@code{nil},
 556 decoding uses that table.
 557
 558   In encoding, the translation table's translations are applied to the
 559 characters in the buffer, and the result of translation is actually
 560 encoded.  If a coding system has property
 561 @code{translation-table-for-encode}, that specifies the translation
 562 table to use.  Otherwise the variable
 563 @code{standard-translation-table-for-encode} specifies the translation
 564 table.
 565
 566 @defvar standard-translation-table-for-decode
 567 This is the default translation table for decoding, for
 568 coding systems that don't specify any other translation table.
 569 @end defvar
 570
 571 @defvar standard-translation-table-for-encode
 572 This is the default translation table for encoding, for
 573 coding systems that don't specify any other translation table.
 574 @end defvar
 575
 576 @node Coding Systems
 577 @section Coding Systems
 578
 579 @cindex coding system
 580   When Emacs reads or writes a file, and when Emacs sends text to a
 581 subprocess or receives text from a subprocess, it normally performs
 582 character code conversion and end-of-line conversion as specified
 583 by a particular @dfn{coding system}.
 584
 585   How to define a coding system is an arcane matter, and is not
 586 documented here.
 587
 588 @menu
 589 * Coding System Basics::        Basic concepts.
 590 * Encoding and I/O::            How file I/O functions handle coding systems.
 591 * Lisp and Coding Systems::     Functions to operate on coding system names.
 592 * User-Chosen Coding Systems::  Asking the user to choose a coding system.
 593 * Default Coding Systems::      Controlling the default choices.
 594 * Specifying Coding Systems::   Requesting a particular coding system
 595                                     for a single file operation.
 596 * Explicit Encoding::           Encoding or decoding text without doing I/O.
 597 * Terminal I/O Encoding::       Use of encoding for terminal I/O.
 598 * MS-DOS File Types::           How DOS "text" and "binary" files
 599                                     relate to coding systems.
 600 @end menu
 601
 602 @node Coding System Basics
 603 @subsection Basic Concepts of Coding Systems
 604
 605 @cindex character code conversion
 606   @dfn{Character code conversion} involves conversion between the encoding
 607 used inside Emacs and some other encoding.  Emacs supports many
 608 different encodings, in that it can convert to and from them.  For
 609 example, it can convert text to or from encodings such as Latin 1, Latin
 610 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022.  In some
 611 cases, Emacs supports several alternative encodings for the same
 612 characters; for example, there are three coding systems for the Cyrillic
 613 (Russian) alphabet: ISO, Alternativnyj, and KOI8.
 614
 615   Most coding systems specify a particular character code for
 616 conversion, but some of them leave the choice unspecified---to be chosen
 617 heuristically for each file, based on the data.
 618
 619   In general, a coding system doesn't guarantee roundtrip identity:
 620 decoding a byte sequence using a coding system, then encoding the
 621 resulting text in the same coding system, can produce a different byte
 622 sequence.  However, the following coding systems do guarantee that the
 623 byte sequence will be the same as what you originally decoded:
 624
 625 @quotation
 626 chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
 627 greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
 628 iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
 629 japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
 630 @end quotation
 631
 632   Encoding buffer text and then decoding the result can also fail to
 633 reproduce the original text.  For instance, if you encode Latin-2
 634 characters with @code{utf-8} and decode the result using the same
 635 coding system, you'll get Unicode characters (of charset
 636 @code{mule-unicode-0100-24ff}).  If you encode Unicode characters with
 637 @code{iso-latin-2} and decode the result with the same coding system,
 638 you'll get Latin-2 characters.
 639
 640 @cindex EOL conversion
 641 @cindex end-of-line conversion
 642 @cindex line end conversion
 643   @dfn{End of line conversion} handles three different conventions used
 644 on various systems for representing end of line in files.  The Unix
 645 convention is to use the linefeed character (also called newline).  The
 646 DOS convention is to use a carriage-return and a linefeed at the end of
 647 a line.  The Mac convention is to use just carriage-return.
 648
 649 @cindex base coding system
 650 @cindex variant coding system
 651   @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
 652 conversion unspecified, to be chosen based on the data.  @dfn{Variant
 653 coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
 654 @code{latin-1-mac} specify the end-of-line conversion explicitly as
 655 well.  Most base coding systems have three corresponding variants whose
 656 names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
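
  For illustration, here is a short sketch of how the three variants
treat line ends when decoding.  The sample strings are made up for this
example, and it assumes @code{inhibit-eol-conversion} is @code{nil}:

@lisp
;; The -dos variant turns each CR LF pair into a single newline.
(decode-coding-string "one\r\ntwo" 'latin-1-dos)
     ;; @result{} "one\ntwo"
;; The -mac variant turns a lone CR into a newline.
(decode-coding-string "one\rtwo" 'latin-1-mac)
     ;; @result{} "one\ntwo"
;; The -unix variant leaves the carriage return in the text.
(decode-coding-string "one\r\ntwo" 'latin-1-unix)
     ;; @result{} "one\r\ntwo"
@end lisp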
 657
 658   The coding system @code{raw-text} is special in that it prevents
 659 character code conversion, and causes the buffer visited with that
 660 coding system to be a unibyte buffer.  It does not specify the
 661 end-of-line conversion, allowing that to be determined as usual by the
 662 data, and has the usual three variants which specify the end-of-line
 663 conversion.  @code{no-conversion} is equivalent to @code{raw-text-unix}:
 664 it specifies no conversion of either character codes or end-of-line.
 665
 666   The coding system @code{emacs-mule} specifies that the data is
 667 represented in the internal Emacs encoding.  This is like
 668 @code{raw-text} in that no code conversion happens, but different in
 669 that the result is multibyte data.
 670
 671 @defun coding-system-get coding-system property
 672 This function returns the specified property of the coding system
 673 @var{coding-system}.  Most coding system properties exist for internal
 674 purposes, but one that you might find useful is @code{mime-charset}.
 675 That property's value is the name used in MIME for the character coding
 676 which this coding system can read and write.  Examples:
 677
 678 @example
 679 (coding-system-get 'iso-latin-1 'mime-charset)
 680      @result{} iso-8859-1
 681 (coding-system-get 'iso-2022-cn 'mime-charset)
 682      @result{} iso-2022-cn
 683 (coding-system-get 'cyrillic-koi8 'mime-charset)
 684      @result{} koi8-r
 685 @end example
 686
 687 The value of the @code{mime-charset} property is also defined
 688 as an alias for the coding system.
 689 @end defun
 690
 691 @node Encoding and I/O
 692 @subsection Encoding and I/O
 693
 694   The principal purpose of coding systems is for use in reading and
 695 writing files.  The function @code{insert-file-contents} uses
 696 a coding system for decoding the file data, and @code{write-region}
 697 uses one to encode the buffer contents.
 698
 699   You can specify the coding system to use either explicitly
 700 (@pxref{Specifying Coding Systems}), or implicitly using a default
 701 mechanism (@pxref{Default Coding Systems}).  But these methods may not
 702 completely specify what to do.  For example, they may choose a coding
 703 system such as @code{undecided} which leaves the character code
 704 conversion to be determined from the data.  In these cases, the I/O
 705 operation finishes the job of choosing a coding system.  Very often
 706 you will want to find out afterwards which coding system was chosen.
 707
 708 @defvar buffer-file-coding-system
 709 This buffer-local variable records the coding system used for saving the
 710 buffer and for writing part of the buffer with @code{write-region}.  If
 711 the text to be written cannot be safely encoded using the coding system
 712 specified by this variable, these operations select an alternative
 713 encoding by calling the function @code{select-safe-coding-system}
 714 (@pxref{User-Chosen Coding Systems}).  If selecting a different encoding
 715 requires asking the user to specify a coding system,
 716 @code{buffer-file-coding-system} is updated to the newly selected coding
 717 system.
 718
 719 @code{buffer-file-coding-system} does @emph{not} affect sending text
 720 to a subprocess.
 721 @end defvar
 722
 723 @defvar save-buffer-coding-system
 724 This variable specifies the coding system for saving the buffer (by
 725 overriding @code{buffer-file-coding-system}).  Note that it is not used
 726 for @code{write-region}.
 727
 728 When a command to save the buffer starts out to use
 729 @code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
 730 and that coding system cannot handle
 731 the actual text in the buffer, the command asks the user to choose
 732 another coding system (by calling @code{select-safe-coding-system}).
 733 After that happens, the command also updates
 734 @code{buffer-file-coding-system} to represent the coding system that
 735 the user specified.
 736 @end defvar
 737
 738 @defvar last-coding-system-used
 739 I/O operations for files and subprocesses set this variable to the
 740 coding system name that was used.  The explicit encoding and decoding
 741 functions (@pxref{Explicit Encoding}) set it too.
 742
 743 @strong{Warning:} Since receiving subprocess output sets this variable,
 744 it can change whenever Emacs waits; therefore, you should copy the
 745 value shortly after the function call that stores the value you are
 746 interested in.
 747 @end defvar
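
  A minimal sketch of the advice in the warning above.  The function
@code{my-insert-file-noting-coding} is a hypothetical helper, not part
of Emacs; it reads the variable right after the call that sets it,
before Emacs has a chance to wait:

@lisp
(defun my-insert-file-noting-coding (filename)
  "Insert FILENAME at point; return the coding system used to decode it."
  (insert-file-contents filename)
  ;; Copy the value immediately, before any subprocess output
  ;; can arrive and overwrite it.
  (let ((used last-coding-system-used))
    used))
@end lisp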
 748
 749   The variable @code{selection-coding-system} specifies how to encode
 750 selections for the window system.  @xref{Window System Selections}.
 751
 752 @defvar file-name-coding-system
 753 The variable @code{file-name-coding-system} specifies the coding
 754 system to use for encoding file names.  Emacs encodes file names using
 755 that coding system for all file operations.  If
 756 @code{file-name-coding-system} is @code{nil}, Emacs uses a default
 757 coding system determined by the selected language environment.  In the
 758 default language environment, any non-@acronym{ASCII} characters in
 759 file names are not encoded specially; they appear in the file system
 760 using the internal Emacs representation.
 761 @end defvar
 762
 763   @strong{Warning:} if you change @code{file-name-coding-system} (or
 764 the language environment) in the middle of an Emacs session, problems
 765 can result if you have already visited files whose names were encoded
 766 using the earlier coding system and are handled differently under the
 767 new coding system.  If you try to save one of these buffers under the
 768 visited file name, saving may use the wrong file name, or it may get
 769 an error.  If such a problem happens, use @kbd{C-x C-w} to specify a
 770 new file name for that buffer.
 771
 772 @node Lisp and Coding Systems
 773 @subsection Coding Systems in Lisp
 774
 775   Here are the Lisp facilities for working with coding systems:
 776
 777 @defun coding-system-list &optional base-only
 778 This function returns a list of all coding system names (symbols).  If
 779 @var{base-only} is non-@code{nil}, the value includes only the
 780 base coding systems.  Otherwise, it includes alias and variant coding
 781 systems as well.
 782 @end defun
 783
 784 @defun coding-system-p object
 785 This function returns @code{t} if @var{object} is a coding system
 786 name or @code{nil}.
 787 @end defun
 788
 789 @defun check-coding-system coding-system
 790 This function checks the validity of @var{coding-system}.
 791 If that is valid, it returns @var{coding-system}.
 792 Otherwise it signals an error with condition @code{coding-system-error}.
 793 @end defun
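
  The two validity checks above can be sketched as follows.  The helper
@code{my-coding-system-or-nil} is hypothetical; it only shows one way
to turn the @code{coding-system-error} condition into a @code{nil}
return value:

@lisp
(coding-system-p 'iso-latin-1)
     ;; @result{} t
(coding-system-p nil)
     ;; @result{} t

(defun my-coding-system-or-nil (name)
  "Return NAME if it is a valid coding system, else nil."
  (condition-case nil
      (check-coding-system name)
    (coding-system-error nil)))
@end lisp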
 794
 795 @defun coding-system-eol-type coding-system
 796 This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
 797 conversion used by @var{coding-system}.  If @var{coding-system}
 798 specifies a certain eol conversion, the return value is an integer 0,
 799 1, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
 800 respectively.  If @var{coding-system} doesn't specify eol conversion
 801 explicitly, the return value is a vector of coding systems, each one
 802 with one of the possible eol conversion types, like this:
 803
 804 @lisp
 805 (coding-system-eol-type 'latin-1)
 806      @result{} [latin-1-unix latin-1-dos latin-1-mac]
 807 @end lisp
 808
 809 @noindent
 810 If this function returns a vector, Emacs will decide, as part of the
 811 text encoding or decoding process, what eol conversion to use.  For
 812 decoding, the end-of-line format of the text is auto-detected, and the
 813 eol conversion is set to match it (e.g., DOS-style CRLF format will
 814 imply @code{dos} eol conversion).  For encoding, the eol conversion is
 815 taken from the appropriate default coding system (e.g.,
 816 @code{default-buffer-file-coding-system} for
 817 @code{buffer-file-coding-system}), or from the default eol conversion
 818 appropriate for the underlying platform.
 819 @end defun
 820
 821 @defun coding-system-change-eol-conversion coding-system eol-type
 822 This function returns a coding system which is like @var{coding-system}
 823 except for its eol conversion, which is specified by @var{eol-type}.
 824 @var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
 825 @code{nil}.  If it is @code{nil}, the returned coding system determines
 826 the end-of-line conversion from the data.
 827
 828 @var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
 829 @code{dos} and @code{mac}, respectively.
 830 @end defun
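
  As a small sketch, the eol type of the returned coding system can be
checked with @code{coding-system-eol-type}.  Comparing the integer
codes avoids depending on which alias name Emacs returns:

@lisp
;; Ask for the DOS variant of a base coding system.
(coding-system-eol-type
 (coding-system-change-eol-conversion 'iso-latin-1 'dos))
     ;; @result{} 1
;; Switch an explicit DOS variant to Unix line ends.
(coding-system-eol-type
 (coding-system-change-eol-conversion 'iso-latin-1-dos 'unix))
     ;; @result{} 0
@end lisp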
 831
 832 @defun coding-system-change-text-conversion eol-coding text-coding
 833 This function returns a coding system which uses the end-of-line
 834 conversion of @var{eol-coding}, and the text conversion of
 835 @var{text-coding}.  If @var{text-coding} is @code{nil}, it returns
 836 @code{undecided}, or one of its variants according to @var{eol-coding}.
 837 @end defun
 838
 839 @defun find-coding-systems-region from to
 840 This function returns a list of coding systems that could be used to
 841 encode the text between @var{from} and @var{to}.  All coding systems in
 842 the list can safely encode any multibyte characters in that portion of
 843 the text.
 844
 845 If the text contains no multibyte characters, the function returns the
 846 list @code{(undecided)}.
 847 @end defun
 848
 849 @defun find-coding-systems-string string
 850 This function returns a list of coding systems that could be used to
 851 encode the text of @var{string}.  All coding systems in the list can
 852 safely encode any multibyte characters in @var{string}.  If the text
 853 contains no multibyte characters, this returns the list
 854 @code{(undecided)}.
 855 @end defun
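
  A brief sketch of how the returned list might be used.  The helper
@code{my-string-encodable-p} is hypothetical, and it assumes that the
list holds base coding systems, so it compares against the result of
@code{coding-system-base}:

@lisp
;; Pure ASCII text: nothing constrains the choice.
(find-coding-systems-string "hello")
     ;; @result{} (undecided)

(defun my-string-encodable-p (string coding-system)
  "Return non-nil if CODING-SYSTEM is listed as safe for STRING."
  (let ((safe (find-coding-systems-string string)))
    (or (equal safe '(undecided))
        (memq (coding-system-base coding-system) safe))))
@end lisp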
 856
 857 @defun find-coding-systems-for-charsets charsets
 858 This function returns a list of coding systems that could be used to
 859 encode all the character sets in the list @var{charsets}.
 860 @end defun
 861
 862 @defun detect-coding-region start end &optional highest
 863 This function chooses a plausible coding system for decoding the text
 864 from @var{start} to @var{end}.  This text should be a byte sequence
 865 (@pxref{Explicit Encoding}).
 866
 867 Normally this function returns a list of coding systems that could
 868 handle decoding the text that was scanned.  They are listed in order of
 869 decreasing priority.  But if @var{highest} is non-@code{nil}, then the
 870 return value is just one coding system, the one that is highest in
 871 priority.
 872
 873 If the region contains only @acronym{ASCII} characters except for such
 874 ISO-2022 control characters as @code{ESC}, the value is
 875 @code{undecided} or @code{(undecided)}, or a variant specifying
 876 end-of-line conversion, if that can be deduced from the text.
 877 @end defun
 878
 879 @defun detect-coding-string string &optional highest
 880 This function is like @code{detect-coding-region} except that it
 881 operates on the contents of @var{string} instead of bytes in the buffer.
 882 @end defun
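
  A minimal sketch of detection followed by decoding.  The helper
@code{my-decode-bytes} is hypothetical, and since detection is a
guess, the result is only as good as that guess:

@lisp
(defun my-decode-bytes (bytes)
  "Guess a coding system for the unibyte string BYTES and decode it."
  ;; A non-nil second argument asks for the single best candidate
  ;; instead of a list.
  (let ((coding (detect-coding-string bytes t)))
    (decode-coding-string bytes coding)))
@end lisp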
 883
 884   @xref{Coding systems for a subprocess,, Process Information}, in
 885 particular the description of the functions
 886 @code{process-coding-system} and @code{set-process-coding-system}, for
 887 how to examine or set the coding systems used for I/O to a subprocess.
 888
 889 @node User-Chosen Coding Systems
 890 @subsection User-Chosen Coding Systems
 891
 892 @cindex select safe coding system
 893 @defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
 894 This function selects a coding system for encoding specified text,
 895 asking the user to choose if necessary.  Normally the specified text
 896 is the text in the current buffer between @var{from} and @var{to}.  If
 897 @var{from} is a string, the string specifies the text to encode, and
 898 @var{to} is ignored.
 899
 900 If @var{default-coding-system} is non-@code{nil}, that is the first
 901 coding system to try; if that can handle the text,
 902 @code{select-safe-coding-system} returns that coding system.  It can
 903 also be a list of coding systems; then the function tries each of them
 904 one by one.  After trying all of them, it next tries the current
 905 buffer's value of @code{buffer-file-coding-system} (if it is not
 906 @code{undecided}), then the value of
 907 @code{default-buffer-file-coding-system} and finally the user's most
 908 preferred coding system, which the user can set using the command
 909 @code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
 910 Coding Systems, emacs, The GNU Emacs Manual}).
 911
 912 If one of those coding systems can safely encode all the specified
 913 text, @code{select-safe-coding-system} chooses it and returns it.
 914 Otherwise, it asks the user to choose from a list of coding systems
 915 which can encode all the text, and returns the user's choice.
 916
 917 @var{default-coding-system} can also be a list whose first element is
 918 @code{t} and whose other elements are coding systems.  Then, if no coding
 919 system in the list can handle the text, @code{select-safe-coding-system}
 920 queries the user immediately, without trying any of the three
 921 alternatives described above.
 922
 923 The optional argument @var{accept-default-p}, if non-@code{nil},
 924 should be a function to determine whether a coding system selected
 925 without user interaction is acceptable.  @code{select-safe-coding-system}
 926 calls this function with one argument, the base coding system of the
 927 selected coding system.  If @var{accept-default-p} returns @code{nil},
 928 @code{select-safe-coding-system} rejects the silently selected coding
 929 system, and asks the user to select a coding system from a list of
 930 possible candidates.
 931
 932 @vindex select-safe-coding-system-accept-default-p
 933 If the variable @code{select-safe-coding-system-accept-default-p} is
 934 non-@code{nil}, its value overrides the value of
 935 @var{accept-default-p}.
 936
 937 As a final step, before returning the chosen coding system,
 938 @code{select-safe-coding-system} checks whether that coding system is
 939 consistent with what would be selected if the contents of the region
 940 were read from a file.  (If not, this could lead to data corruption in
 941 a file subsequently re-visited and edited.)  Normally,
 942 @code{select-safe-coding-system} uses @code{buffer-file-name} as the
 943 file for this purpose, but if @var{file} is non-@code{nil}, it uses
 944 that file instead (this can be relevant for @code{write-region} and
 945 similar functions).  If it detects an apparent inconsistency,
 946 @code{select-safe-coding-system} queries the user before selecting the
 947 coding system.
 948 @end defun
 949
 950   Here are two functions you can use to let the user specify a coding
 951 system, with completion.  @xref{Completion}.
 952
 953 @defun read-coding-system prompt &optional default
 954 This function reads a coding system using the minibuffer, prompting with
 955 string @var{prompt}, and returns the coding system name as a symbol.  If
 956 the user enters null input, @var{default} specifies which coding system
 957 to return.  It should be a symbol or a string.
 958 @end defun
 959
 960 @defun read-non-nil-coding-system prompt
 961 This function reads a coding system using the minibuffer, prompting with
 962 string @var{prompt}, and returns the coding system name as a symbol.  If
 963 the user tries to enter null input, it asks the user to try again.
 964 @xref{Coding Systems}.
 965 @end defun
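
  For example, a command could read its argument this way.  The command
@code{my-show-mime-charset} is a hypothetical illustration; it combines
@code{read-non-nil-coding-system} with @code{coding-system-get}:

@lisp
(defun my-show-mime-charset (coding-system)
  "Display the MIME charset name of CODING-SYSTEM in the echo area."
  (interactive
   (list (read-non-nil-coding-system "Coding system: ")))
  (message "MIME charset of %s: %s"
           coding-system
           (coding-system-get coding-system 'mime-charset)))
@end lisp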
 966
 967 @node Default Coding Systems
 968 @subsection Default Coding Systems
 969
 970   This section describes variables that specify the default coding
 971 system for certain files or when running certain subprograms, and the
 972 function that I/O operations use to access them.
 973
 974   The idea of these variables is that you set them once and for all to the
 975 defaults you want, and then do not change them again.  To specify a
 976 particular coding system for a particular operation in a Lisp program,
 977 don't change these variables; instead, override them using
 978 @code{coding-system-for-read} and @code{coding-system-for-write}
 979 (@pxref{Specifying Coding Systems}).
 980
 981 @defvar auto-coding-regexp-alist
 982 This variable is an alist of text patterns and corresponding coding
 983 systems.  Each element has the form @code{(@var{regexp}
 984 . @var{coding-system})}; a file whose first few kilobytes match
 985 @var{regexp} is decoded with @var{coding-system} when its contents are
 986 read into a buffer.  The settings in this alist take priority over
 987 @code{coding:} tags in the files and the contents of
 988 @code{file-coding-system-alist} (see below).  The default value is set
 989 so that Emacs automatically recognizes mail files in Babyl format and
 990 reads them with no code conversions.
 991 @end defvar
 992
 993 @defvar file-coding-system-alist
 994 This variable is an alist that specifies the coding systems to use for
 995 reading and writing particular files.  Each element has the form
 996 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
 997 expression that matches certain file names.  The element applies to file
 998 names that match @var{pattern}.
 999
1000 The @sc{cdr} of the element, @var{coding}, should be either a coding
1001 system, a cons cell containing two coding systems, or a function name (a
1002 symbol with a function definition).  If @var{coding} is a coding system,
1003 that coding system is used for both reading the file and writing it.  If
1004 @var{coding} is a cons cell containing two coding systems, its @sc{car}
1005 specifies the coding system for decoding, and its @sc{cdr} specifies the
1006 coding system for encoding.
1007
1008 If @var{coding} is a function name, the function should take one
1009 argument, a list of all arguments passed to
1010 @code{find-operation-coding-system}.  It must return a coding system
1011 or a cons cell containing two coding systems.  This value has the same
1012 meaning as described above.
1013
1014 If @var{coding} (or what is returned by the above function) is
1015 @code{undecided}, the normal code-detection is performed.
1016 @end defvar
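
  As a sketch, here are two entries of the forms described above.  The
file name extensions @samp{.mylog} and @samp{.mydat} are hypothetical,
chosen only for this example:

@lisp
;; One coding system for both reading and writing.
(add-to-list 'file-coding-system-alist
             '("\\.mylog\\'" . iso-latin-1))
;; Separate coding systems for decoding and for encoding.
(add-to-list 'file-coding-system-alist
             '("\\.mydat\\'" . (iso-latin-1 . iso-latin-1-unix)))
@end lisp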
1017
1018 @defvar process-coding-system-alist
1019 This variable is an alist specifying which coding systems to use for a
1020 subprocess, depending on which program is running in the subprocess.  It
1021 works like @code{file-coding-system-alist}, except that @var{pattern} is
1022 matched against the program name used to start the subprocess.  The coding
1023 system or systems specified in this alist are used to initialize the
1024 coding systems used for I/O to the subprocess, but you can specify
1025 other coding systems later using @code{set-process-coding-system}.
1026 @end defvar
1027
1028   @strong{Warning:} Coding systems such as @code{undecided}, which
1029 determine the coding system from the data, do not work entirely reliably
1030 with asynchronous subprocess output.  This is because Emacs handles
1031 asynchronous subprocess output in batches, as it arrives.  If the coding
1032 system leaves the character code conversion unspecified, or leaves the
1033 end-of-line conversion unspecified, Emacs must try to detect the proper
1034 conversion from one batch at a time, and this does not always work.
1035
1036   Therefore, with an asynchronous subprocess, if at all possible, use a
1037 coding system which determines both the character code conversion and
1038 the end of line conversion---that is, one like @code{latin-1-unix},
1039 rather than @code{undecided} or @code{latin-1}.
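
  One way to follow this advice is sketched below.  The function
@code{my-start-cat} and the process name are hypothetical, and
@command{cat} is only a placeholder program.  The bindings take effect
when the subprocess is started and stay in use for it afterwards:

@lisp
(defun my-start-cat ()
  "Start `cat' with fully specified coding systems for its I/O."
  (let ((coding-system-for-read 'latin-1-unix)
        (coding-system-for-write 'latin-1-unix))
    (start-process "my-cat" nil "cat")))
@end lisp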
1040
1041 @defvar network-coding-system-alist
1042 This variable is an alist that specifies the coding system to use for
1043 network streams.  It works much like @code{file-coding-system-alist},
1044 with the difference that the @var{pattern} in an element may be either a
1045 port number or a regular expression.  If it is a regular expression, it
1046 is matched against the network service name used to open the network
1047 stream.
1048 @end defvar
1049
1050 @defvar default-process-coding-system
1051 This variable specifies the coding systems to use for subprocess (and
1052 network stream) input and output, when nothing else specifies what to
1053 do.
1054
1055 The value should be a cons cell of the form @code{(@var{input-coding}
1056 . @var{output-coding})}.  Here @var{input-coding} applies to input from
1057 the subprocess, and @var{output-coding} applies to output to it.
1058 @end defvar
1059
1060 @defvar auto-coding-functions
1061 This variable holds a list of functions that try to determine a
1062 coding system for a file based on its undecoded contents.
1063
1064 Each function in this list should be written to look at text in the
1065 current buffer, but should not modify it in any way.  The buffer will
1066 contain undecoded text of parts of the file.  Each function should
1067 take one argument, @var{size}, which tells it how many characters to
1068 look at, starting from point.  If the function succeeds in determining
1069 a coding system for the file, it should return that coding system.
1070 Otherwise, it should return @code{nil}.
1071
1072 If a file has a @samp{coding:} tag, that takes precedence, so these
1073 functions won't be called.
1074 @end defvar
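
  A minimal sketch of such a function.  Both the function name and the
marker text @samp{X-My-Charset: latin-2} are hypothetical conventions
invented for this example:

@lisp
(defun my-auto-coding-by-marker (size)
  "Return `iso-latin-2' if the marker occurs within SIZE chars of point."
  ;; Look without moving point or changing the buffer.
  (save-excursion
    (let ((limit (min (point-max) (+ (point) size))))
      (if (search-forward "X-My-Charset: latin-2" limit t)
          'iso-latin-2))))

(add-to-list 'auto-coding-functions 'my-auto-coding-by-marker)
@end lisp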
1075
1076 @defun find-operation-coding-system operation &rest arguments
1077 This function returns the coding system to use (by default) for
1078 performing @var{operation} with @var{arguments}.  The value has this
1079 form:
1080
1081 @example
1082 (@var{decoding-system} . @var{encoding-system})
1083 @end example
1084
1085 The first element, @var{decoding-system}, is the coding system to use
1086 for decoding (in case @var{operation} does decoding), and
1087 @var{encoding-system} is the coding system for encoding (in case
1088 @var{operation} does encoding).
1089
1090 The argument @var{operation} is a symbol, one of @code{write-region},
1091 @code{start-process}, @code{call-process}, @code{call-process-region},
1092 @code{insert-file-contents}, or @code{open-network-stream}.  These are
1093 the names of the Emacs I/O primitives that can do character code and
1094 eol conversion.
1095
1096 The remaining arguments should be the same arguments that might be given
1097 to the corresponding I/O primitive.  Depending on the primitive, one
1098 of those arguments is selected as the @dfn{target}.  For example, if
1099 @var{operation} does file I/O, whichever argument specifies the file
1100 name is the target.  For subprocess primitives, the process name is the
1101 target.  For @code{open-network-stream}, the target is the service name
1102 or port number.
1103
1104 Depending on @var{operation}, this function looks up the target in
1105 @code{file-coding-system-alist}, @code{process-coding-system-alist},
1106 or @code{network-coding-system-alist}.  If the target is found in the
1107 alist, @code{find-operation-coding-system} returns its association in
1108 the alist; otherwise it returns @code{nil}.
1109
1110 If @var{operation} is @code{insert-file-contents}, the argument
1111 corresponding to the target may be a cons cell of the form
1112 @code{(@var{filename} . @var{buffer})}.  In that case, @var{filename}
1113 is a file name to look up in @code{file-coding-system-alist}, and
1114 @var{buffer} is a buffer that contains the file's contents (not yet
1115 decoded).  If @code{file-coding-system-alist} specifies a function to
1116 call for this file, and that function needs to examine the file's
1117 contents (as it usually does), it should examine the contents of
1118 @var{buffer} instead of reading the file.
1119 @end defun
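
  A small sketch of a lookup for file input.  The @samp{.my-ext}
pattern and the file name are hypothetical; binding the alist locally
keeps the example independent of the user's configuration:

@lisp
(let ((file-coding-system-alist
       '(("\\.my-ext\\'" . (iso-latin-2 . iso-latin-2-unix)))))
  ;; The file name is the target for insert-file-contents.
  (find-operation-coding-system 'insert-file-contents "notes.my-ext"))
     ;; @result{} (iso-latin-2 . iso-latin-2-unix)
@end lisp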
1120
1121 @node Specifying Coding Systems
1122 @subsection Specifying a Coding System for One Operation
1123
1124   You can specify the coding system for a specific operation by binding
1125 the variables @code{coding-system-for-read} and/or
1126 @code{coding-system-for-write}.
1127
1128 @defvar coding-system-for-read
1129 If this variable is non-@code{nil}, it specifies the coding system to
1130 use for reading a file, or for input from a synchronous subprocess.
1131
1132 It also applies to any asynchronous subprocess or network stream, but in
1133 a different way: the value of @code{coding-system-for-read} when you
1134 start the subprocess or open the network stream specifies the input
1135 decoding method for that subprocess or network stream.  It remains in
1136 use for that subprocess or network stream unless and until overridden.
1137
1138 The right way to use this variable is to bind it with @code{let} for a
1139 specific I/O operation.  Its global value is normally @code{nil}, and
1140 you should not globally set it to any other value.  Here is an example
1141 of the right way to use the variable:
1142
1143 @example
1144 ;; @r{Read the file with no character code conversion.}
1145 ;; @r{Assume @acronym{crlf} represents end-of-line.}
1146 (let ((coding-system-for-read 'emacs-mule-dos))
1147   (insert-file-contents filename))
1148 @end example
1149
1150 When its value is non-@code{nil}, this variable takes precedence over
1151 all other methods of specifying a coding system to use for input,
1152 including @code{file-coding-system-alist},
1153 @code{process-coding-system-alist} and
1154 @code{network-coding-system-alist}.
1155 @end defvar
1156
1157 @defvar coding-system-for-write
1158 This works much like @code{coding-system-for-read}, except that it
1159 applies to output rather than input.  It affects writing to files,
1160 as well as sending output to subprocesses and net connections.
1161
1162 When a single operation does both input and output, as do
1163 @code{call-process-region} and @code{start-process}, both
1164 @code{coding-system-for-read} and @code{coding-system-for-write}
1165 affect it.
1166 @end defvar
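
  By analogy with the example for @code{coding-system-for-read}, here
is a sketch for output.  The function name is hypothetical:

@lisp
(defun my-write-buffer-as-latin-1 (filename)
  "Write the current buffer to FILENAME as Latin-1 with LF line ends."
  ;; Bind the variable only around this one operation.
  (let ((coding-system-for-write 'iso-latin-1-unix))
    (write-region (point-min) (point-max) filename)))
@end lisp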
1167
1168 @defvar inhibit-eol-conversion
1169 When this variable is non-@code{nil}, no end-of-line conversion is done,
1170 no matter which coding system is specified.  This applies to all the
1171 Emacs I/O and subprocess primitives, and to the explicit encoding and
1172 decoding functions (@pxref{Explicit Encoding}).
1173 @end defvar
1174
1175 @node Explicit Encoding
1176 @subsection Explicit Encoding and Decoding
1177 @cindex encoding in coding systems
1178 @cindex decoding in coding systems
1179
1180   All the operations that transfer text in and out of Emacs have the
1181 ability to use a coding system to encode or decode the text.
1182 You can also explicitly encode and decode text using the functions
1183 in this section.
1184
1185   The result of encoding, and the input to decoding, are not ordinary
1186 text.  They logically consist of a series of byte values; that is, a
1187 series of characters whose codes are in the range 0 through 255.  In a
1188 multibyte buffer or string, character codes 128 through 159 are
1189 represented by multibyte sequences, but this is invisible to Lisp
1190 programs.
1191
1192   The usual way to read a file into a buffer as a sequence of bytes, so
1193 you can decode the contents explicitly, is with
1194 @code{insert-file-contents-literally} (@pxref{Reading from Files});
1195 alternatively, specify a non-@code{nil} @var{rawfile} argument when
1196 visiting a file with @code{find-file-noselect}.  These methods result in
1197 a unibyte buffer.
1198
1199   The usual way to use the byte sequence that results from explicitly
1200 encoding text is to copy it to a file or process---for example, to write
1201 it with @code{write-region} (@pxref{Writing to Files}), and suppress
1202 encoding by binding @code{coding-system-for-write} to
1203 @code{no-conversion}.
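
  The approach just described might look like this.  The function
@code{my-write-encoded} is a hypothetical helper; it passes the encoded
string directly to @code{write-region}, which accepts a string in place
of the start position:

@lisp
(defun my-write-encoded (text coding-system filename)
  "Encode TEXT with CODING-SYSTEM and write the bytes to FILENAME."
  (let ((bytes (encode-coding-string text coding-system))
        ;; The bytes are already encoded, so suppress a second pass.
        (coding-system-for-write 'no-conversion))
    (write-region bytes nil filename)))
@end lisp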
1204
1205   Here are the functions to perform explicit encoding or decoding.  The
1206 encoding functions produce sequences of bytes; the decoding functions
1207 are meant to operate on sequences of bytes.  All of these functions
1208 discard text properties.
1209
1210 @deffn Command encode-coding-region start end coding-system
1211 This command encodes the text from @var{start} to @var{end} according
1212 to coding system @var{coding-system}.  The encoded text replaces the
1213 original text in the buffer.  The result of encoding is logically a
1214 sequence of bytes, but the buffer remains multibyte if it was multibyte
1215 before.
1216
1217 This command returns the length of the encoded text.
1218 @end deffn
1219
1220 @defun encode-coding-string string coding-system &optional nocopy
1221 This function encodes the text in @var{string} according to coding
1222 system @var{coding-system}.  It returns a new string containing the
1223 encoded text, except when @var{nocopy} is non-@code{nil}, in which
1224 case the function may return @var{string} itself if the encoding
1225 operation is trivial.  The result of encoding is a unibyte string.
1226 @end defun
1227
1228 @deffn Command decode-coding-region start end coding-system
1229 This command decodes the text from @var{start} to @var{end} according
1230 to coding system @var{coding-system}.  The decoded text replaces the
1231 original text in the buffer.  To make explicit decoding useful, the text
1232 before decoding ought to be a sequence of byte values, but both
1233 multibyte and unibyte buffers are acceptable.
1234
1235 This command returns the length of the decoded text.
1236 @end deffn
1237
1238 @defun decode-coding-string string coding-system &optional nocopy
1239 This function decodes the text in @var{string} according to coding
1240 system @var{coding-system}.  It returns a new string containing the
1241 decoded text, except when @var{nocopy} is non-@code{nil}, in which
1242 case the function may return @var{string} itself if the decoding
1243 operation is trivial.  To make explicit decoding useful, the contents
1244 of @var{string} ought to be a sequence of byte values, but a multibyte
1245 string is acceptable.
1246 @end defun
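
  A short sketch of the two string functions working together; the
sample bytes are an arbitrary illustration:

@lisp
;; Encoding yields a unibyte string.
(multibyte-string-p (encode-coding-string "abc" 'iso-latin-1))
     ;; @result{} nil
;; Five UTF-8 bytes decode to four characters, which Latin-1
;; encodes as four bytes.
(length (encode-coding-string
         (decode-coding-string "caf\303\251" 'utf-8)
         'iso-latin-1))
     ;; @result{} 4
@end lisp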
1247
1248 @defun decode-coding-inserted-region from to filename &optional visit beg end replace
1249 This function decodes the text from @var{from} to @var{to} as if
1250 it were being read from file @var{filename} by @code{insert-file-contents}
1251 called with the rest of the arguments provided.
1252
1253 The normal way to use this function is after reading text from a file
1254 without decoding, if you decide you would rather have decoded it.
1255 Instead of deleting the text and reading it again, this time with
1256 decoding, you can call this function.
1257 @end defun
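
  A sketch of that usage, under the assumption that the raw text was
inserted with @code{insert-file-contents-literally}.  The function name
is hypothetical:

@lisp
(defun my-insert-raw-then-decode (filename)
  "Insert FILENAME without decoding, then decode the inserted text."
  (let* ((start (point))
         ;; The return value has the form (ABSOLUTE-FILE-NAME LENGTH).
         (length (nth 1 (insert-file-contents-literally filename))))
    (decode-coding-inserted-region start (+ start length) filename)))
@end lisp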
1258
1259 @node Terminal I/O Encoding
1260 @subsection Terminal I/O Encoding
1261
1262   Emacs can decode keyboard input using a coding system, and encode
1263 terminal output.  This is useful for terminals that transmit or display
1264 text using a particular encoding such as Latin-1.  Emacs does not set
1265 @code{last-coding-system-used} for encoding or decoding for the
1266 terminal.
1267
1268 @defun keyboard-coding-system
1269 This function returns the coding system that is in use for decoding
1270 keyboard input---or @code{nil} if no coding system is to be used.
1271 @end defun
1272
1273 @deffn Command set-keyboard-coding-system coding-system
1274 This command specifies @var{coding-system} as the coding system to
1275 use for decoding keyboard input.  If @var{coding-system} is @code{nil},
1276 that means do not decode keyboard input.
1277 @end deffn
1278
1279 @defun terminal-coding-system
1280 This function returns the coding system that is in use for encoding
1281 terminal output---or @code{nil} for no encoding.
1282 @end defun
1283
1284 @deffn Command set-terminal-coding-system coding-system
1285 This command specifies @var{coding-system} as the coding system to use
1286 for encoding terminal output.  If @var{coding-system} is @code{nil},
1287 that means do not encode terminal output.
1288 @end deffn
1289
1290 @node MS-DOS File Types
1291 @subsection MS-DOS File Types
1292 @cindex DOS file types
1293 @cindex MS-DOS file types
1294 @cindex Windows file types
1295 @cindex file types on MS-DOS and Windows
1296 @cindex text files and binary files
1297 @cindex binary files and text files
1298
1299   On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
1300 end-of-line conversion for a file by looking at the file's name.  This
1301 feature classifies files as @dfn{text files} and @dfn{binary files}.  By
1302 ``binary file'' we mean a file of literal byte values that are not
1303 necessarily meant to be characters; Emacs does no end-of-line conversion
1304 and no character code conversion for them.  On the other hand, the bytes
1305 in a text file are intended to represent characters; when you create a
1306 new file whose name implies that it is a text file, Emacs uses DOS
1307 end-of-line conversion.
1308
1309 @defvar buffer-file-type
1310 This variable, automatically buffer-local in each buffer, records the
1311 file type of the buffer's visited file.  When a buffer does not specify
1312 a coding system with @code{buffer-file-coding-system}, this variable is
1313 used to determine which coding system to use when writing the contents
1314 of the buffer.  It should be @code{nil} for text, @code{t} for binary.
1315 If it is @code{t}, the coding system is @code{no-conversion}.
1316 Otherwise, @code{undecided-dos} is used.
1317
1318 Normally this variable is set by visiting a file; it is set to
1319 @code{nil} if the file was visited without any actual conversion.
1320 @end defvar
1321
1322 @defopt file-name-buffer-file-type-alist
1323 This variable holds an alist for recognizing text and binary files.
1324 Each element has the form @code{(@var{regexp} . @var{type})}, where
1325 @var{regexp} is matched against the file name, and @var{type} may be
1326 @code{nil} for text, @code{t} for binary, or a function to call to
1327 compute which.  If it is a function, then it is called with a single
1328 argument (the file name) and should return @code{t} or @code{nil}.
1329
1330 When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
1331 which coding system to use when reading a file.  For a text file,
1332 @code{undecided-dos} is used.  For a binary file, @code{no-conversion}
1333 is used.
1334
1335 If no element in this alist matches a given file name, then
1336 @code{default-buffer-file-type} says how to treat the file.
1337 @end defopt
1338
1339 @defopt default-buffer-file-type
1340 This variable says how to handle files for which
1341 @code{file-name-buffer-file-type-alist} says nothing about the type.
1342
1343 If this variable is non-@code{nil}, then these files are treated as
1344 binary: the coding system @code{no-conversion} is used.  Otherwise,
1345 nothing special is done for them---the coding system is deduced solely
1346 from the file contents, in the usual Emacs fashion.
1347 @end defopt
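
  For illustration, an init file on MS-DOS or MS-Windows might extend
the alist roughly as follows.  The two regexps are hypothetical
examples, and the @code{boundp} test is an assumption made here so
that the code does nothing on systems where the variable is not
defined.

@example
(when (boundp 'file-name-buffer-file-type-alist)
  ;; Treat files whose names end in ".dat" as binary.
  (add-to-list 'file-name-buffer-file-type-alist
               '("\\.dat\\'" . t))
  ;; Treat files whose names end in ".log" as text.
  (add-to-list 'file-name-buffer-file-type-alist
               '("\\.log\\'" . nil)))
@end example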
1348
1349 @node Input Methods
1350 @section Input Methods
1351 @cindex input methods
1352
1353   @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
1354 characters from the keyboard.  Unlike coding systems, which translate
1355 non-@acronym{ASCII} characters to and from encodings meant to be read by
1356 programs, input methods provide human-friendly commands.  (@xref{Input
1357 Methods,,, emacs, The GNU Emacs Manual}, for information on how users
1358 use input methods to enter text.)  How to define input methods is not
1359 yet documented in this manual, but here we describe how to use them.
1360
1361   Each input method has a name, which is currently a string;
1362 in the future, symbols may also be usable as input method names.
1363
1364 @defvar current-input-method
1365 This variable holds the name of the input method now active in the
1366 current buffer.  (It automatically becomes local in each buffer when set
1367 in any fashion.)  It is @code{nil} if no input method is active in the
1368 buffer now.
1369 @end defvar
1370
1371 @defopt default-input-method
1372 This variable holds the default input method for commands that choose an
1373 input method.  Unlike @code{current-input-method}, this variable is
1374 normally global.
1375 @end defopt
1376
1377 @deffn Command set-input-method input-method
1378 This command activates input method @var{input-method} for the current
1379 buffer.  It also sets @code{default-input-method} to @var{input-method}.
1380 If @var{input-method} is @code{nil}, this command deactivates any input
1381 method for the current buffer.
1382 @end deffn
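
  The following sketch shows how a Lisp program might turn an input
method on and off.  It assumes that an input method named
@code{"latin-1-prefix"} is installed; substitute any name that is
available on your system.

@example
;; Activate an input method in the current buffer.
;; "latin-1-prefix" is an assumed, illustrative method name.
(set-input-method "latin-1-prefix")

;; The buffer-local variable now holds the name of that method.
current-input-method

;; Deactivate the input method again.
(set-input-method nil)
@end example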
1383
1384 @defun read-input-method-name prompt &optional default inhibit-null
1385 This function reads an input method name with the minibuffer, prompting
1386 with @var{prompt}.  If @var{default} is non-@code{nil}, it is returned
1387 when the user enters empty input.  However, if
1388 @var{inhibit-null} is non-@code{nil}, empty input signals an error.
1389
1390 The returned value is a string.
1391 @end defun
1392
1393 @defvar input-method-alist
1394 This variable defines all the supported input methods.
1395 Each element defines one input method, and should have the form:
1396
1397 @example
1398 (@var{input-method} @var{language-env} @var{activate-func}
1399  @var{title} @var{description} @var{args}...)
1400 @end example
1401
1402 Here @var{input-method} is the input method name, a string;
1403 @var{language-env} is another string, the name of the language
1404 environment this input method is recommended for.  (That serves only for
1405 documentation purposes.)
1406
1407 @var{activate-func} is a function to call to activate this method.  The
1408 @var{args}, if any, are passed as arguments to @var{activate-func}.  All
1409 told, the arguments to @var{activate-func} are @var{input-method} and
1410 the @var{args}.
1411
1412 @var{title} is a string to display in the mode line while this method is
1413 active.  @var{description} is a string describing this method and what
1414 it is good for.
1415 @end defvar
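
  As a small sketch of the element layout described above, the
following looks up one input method and extracts some of its fields
by position.  The method name @code{"latin-1-prefix"} is an assumption;
the code returns @code{nil} if no such method is defined.

@example
(let ((entry (assoc "latin-1-prefix" input-method-alist)))
  (when entry
    (list (nth 1 entry)     ; language environment name
          (nth 3 entry)     ; title shown in the mode line
          (nth 4 entry))))  ; description string
@end example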
1416
1417   The fundamental interface to input methods is through the
1418 variable @code{input-method-function}.  @xref{Reading One Event},
1419 and @ref{Invoking the Input Method}.
1420
1421 @node Locales
1422 @section Locales
1423 @cindex locale
1424
1425   POSIX defines a concept of ``locales'' that control which language
1426 to use in language-related features.  These Emacs variables control
1427 how Emacs interacts with these features.
1428
1429 @defvar locale-coding-system
1430 @cindex keyboard input decoding on X
1431 This variable specifies the coding system to use for decoding system
1432 error messages and---on the X Window System only---keyboard input, for
1433 encoding the format argument to @code{format-time-string}, and for
1434 decoding the return value of @code{format-time-string}.
1435 @end defvar
1436
1437 @defvar system-messages-locale
1438 This variable specifies the locale to use for generating system error
1439 messages.  Changing the locale can cause messages to come out in a
1440 different language or in a different orthography.  If the variable is
1441 @code{nil}, the locale is specified by environment variables in the
1442 usual POSIX fashion.
1443 @end defvar
1444
1445 @defvar system-time-locale
1446 This variable specifies the locale to use for formatting time values.
1447 Changing the locale can cause time values to be formatted according to the
1448 conventions of a different language.  If the variable is @code{nil}, the
1449 locale is specified by environment variables in the usual POSIX fashion.
1450 @end defvar
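
  One possible use, sketched here, is to bind
@code{system-time-locale} temporarily around a call to
@code{format-time-string}.  The locale name @code{"C"} and the format
string are illustrative choices.

@example
;; Format the current date using the conventions of the "C"
;; locale, whatever locale the environment specifies.
(let ((system-time-locale "C"))
  (format-time-string "%A, %d %B %Y"))
@end example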
1451
1452 @defun locale-info item
1453 This function returns locale data @var{item} for the current POSIX
1454 locale, if available.  @var{item} should be one of these symbols:
1455
1456 @table @code
1457 @item codeset
1458 Return the character set as a string (locale item @code{CODESET}).
1459
1460 @item days
1461 Return a 7-element vector of day names (locale items
1462 @code{DAY_1} through @code{DAY_7}).
1463
1464 @item months
1465 Return a 12-element vector of month names (locale items @code{MON_1}
1466 through @code{MON_12}).
1467
1468 @item paper
1469 Return a list @code{(@var{width} @var{height})} for the default paper
1470 size measured in millimeters (locale items @code{PAPER_WIDTH} and
1471 @code{PAPER_HEIGHT}).
1472 @end table
1473
1474 If the system can't provide the requested information, or if
1475 @var{item} is not one of those symbols, the value is @code{nil}.  All
1476 strings in the return value are decoded using
1477 @code{locale-coding-system}.  @xref{Locales,,, libc, The GNU Libc Manual},
1478 for more information about locales and locale items.
1479 @end defun
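
  A brief sketch of calling @code{locale-info} with some of the items
listed above.  The results depend on the current locale and on what
the system provides, so the code checks for @code{nil} before using
the paper size.

@example
;; Name of the locale's character set, or nil if unavailable.
(locale-info 'codeset)

;; Vector of day names, or nil if unavailable.
(locale-info 'days)

;; Describe the default paper size when the system reports one.
(let ((paper (locale-info 'paper)))
  (when paper
    (format "%s x %s mm" (nth 0 paper) (nth 1 paper))))
@end example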
1480
1481 @ignore
1482    arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
1483 @end ignore