* minibuf.texi (Reading File Names): Fix introductory text.

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi

index 233fe59e1b184c19477d336906549087b4a975cf..9f8df7c77f226a895a4fd5099f5108bc98e5eb34 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -1,7 +1,7 @@
  @c -*-texinfo-*-
  @c This is part of the GNU Emacs Lisp Reference Manual.
  @c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c   2005, 2006, 2007, 2008  Free Software Foundation, Inc.
+@c   2005, 2006, 2007, 2008, 2009  Free Software Foundation, Inc.
  @c See the file elisp.texi for copying conditions.
  @setfilename ../../info/characters
  @node Non-ASCII Characters, Searching and Matching, Text, Top
@@ -10,19 +10,19 @@
  @cindex characters, multi-byte
  @cindex non-@acronym{ASCII} characters
  
-  This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+  This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
  
  @menu
-* Text Representations::    Unibyte and multibyte representations
+* Text Representations::    How Emacs represents text.
  * Converting Representations::  Converting unibyte to multibyte and vice versa.
  * Selecting a Representation::  Treating a byte sequence as unibyte or multi.
  * Character Codes::         How unibyte and multibyte relate to
                                  codes of individual characters.
+* Character Properties::    Character attributes that define their
+                                behavior and handling.
  * Character Sets::          The space of possible character codes
                                  is divided into various character sets.
-* Chars and Bytes::         More information about multibyte encodings.
-* Splitting Characters::    Converting a character to its byte sequence.
  * Scanning Charsets::       Which character sets are used in a buffer?
  * Translation of Characters::   Translation tables are used for conversion.
  * Coding Systems::          Coding systems are conversions for saving files.
@@ -33,55 +33,76 @@ characters and how they are stored in strings and buffers.
  
  @node Text Representations
  @section Text Representations
-@cindex text representations
-
-  Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer.  These are called @dfn{unibyte} and
-@dfn{multibyte}.  Each string, and each buffer, uses one of these two
-representations.  For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate.  Occasionally in Lisp programming you will need to pay
-attention to the difference.
+@cindex text representation
+
+  Emacs buffers and strings support a large repertoire of characters
+from many different scripts, allowing users to type and display text
+in almost any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+  To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}.  The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive.  Emacs
+extends this range with codepoints in the range @code{110000..3FFFFF},
+which it uses for representing characters that are not unified with
+Unicode and raw 8-bit bytes that cannot be interpreted as characters
+(the latter occupy the range @code{3FFF80..3FFFFF}).  Thus, a
+character codepoint in Emacs is a 22-bit integer number.
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
+that are codepoints of text characters within buffers and strings.
+Rather, Emacs uses a variable-length internal representation of
+characters, that stores each character as a sequence of 1 to 5 8-bit
+bytes, depending on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}.  For example, any @acronym{ASCII} character takes up only 1
+byte, a Latin-1 character takes up 2 bytes, etc.  We call this
+representation of text @dfn{multibyte}.
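
A minimal sketch of how the byte count grows with the codepoint, assuming a reasonably recent Emacs. The helper name `my-internal-char-bytes` is hypothetical and not part of Emacs; it only combines the existing functions `string` and `string-bytes`.

```elisp
;; Hypothetical helper, not part of Emacs: return the number of bytes
;; the internal representation uses for the character CODEPOINT.
(defun my-internal-char-bytes (codepoint)
  (string-bytes (string codepoint)))

;; ASCII, Latin-1, BMP, supplementary plane, beyond Unicode, raw byte.
(mapcar #'my-internal-char-bytes
        '(?a #xE9 #x20AC #x1F600 #x200000 #x3FFF80))
;; => (1 2 3 4 5 2)
```

The last element shows that a raw 8-bit byte, although it has one of the largest codepoints, is stored as a two-byte sequence.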
+
+  Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc.  Emacs converts
+between these external encodings and its internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+  Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffers or strings.  For example, when
+Emacs visits a file, it first reads the file's text verbatim into a
+buffer, and only then converts it to the internal representation.
+Before the conversion, the buffer holds encoded text.
  
  @cindex unibyte text
-  In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255.  Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
-  In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored.  The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237).  These values are called
-@dfn{leading codes}.  The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-
-  Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed.  But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences.  All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+  Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes.  We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes.  Usually, Emacs
+displays unibyte buffers and strings as octal codes such as
+@code{\237}.  We recommend that you never use unibyte buffers and
+strings except for manipulating encoded text or binary non-text data.
  
    In a buffer, the buffer-local value of the variable
  @code{enable-multibyte-characters} specifies the representation used.
  The representation for a string is determined and recorded in the string
  when the string is constructed.
  
-@defvar enable-multibyte-characters
+@defopt enable-multibyte-characters
  This variable specifies the current buffer's text representation.
  If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
  
  You cannot set this variable directly; instead, use the function
  @code{set-buffer-multibyte} to change a buffer's representation.
-@end defvar
+@end defopt
  
  @defvar default-enable-multibyte-characters
  This variable's value is entirely equivalent to @code{(default-value
@@ -96,20 +117,28 @@ default value to @code{nil} early in startup.
  @end defvar
  
  @defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units.  This function
+returns the byte-position corresponding to buffer position
  @var{position} in the current buffer.  This is 1 at the start of the
  buffer, and counts upward in bytes.  If @var{position} is out of
  range, the value is @code{nil}.
  @end defun
  
  @defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
+Return the buffer position, in character units, corresponding to given
  @var{byte-position} in the current buffer.  If @var{byte-position} is
-out of range, the value is @code{nil}.
+out of range, the value is @code{nil}.  In a multibyte buffer, an
+arbitrary value of @var{byte-position} need not fall on a character
+boundary; it can lie inside the multibyte sequence representing a
+single character.  In that case, this function returns the buffer
+position of the character whose multibyte sequence includes
+@var{byte-position}.  In other words, the value is the same for all
+byte positions that belong to the same character.
  @end defun
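
The relation between character positions and byte positions can be sketched as follows. This is an illustration only; the helper name `my-position-demo` is hypothetical, and the buffer is explicitly made multibyte so the result does not depend on defaults.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-position-demo ()
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert "a\u00e9b")            ; a, e-acute (2 bytes), b
    (list (position-bytes 3)        ; byte position of "b"
          (byte-to-position 2)      ; first byte of e-acute
          (byte-to-position 3)      ; second byte of e-acute
          (byte-to-position 4))))   ; the byte of "b"

(my-position-demo)
;; => (4 2 2 3)
```

Byte positions 2 and 3 both map to character position 2, because they belong to the same two-byte character.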
  
  @defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
  @end defun
  
  @defun string-bytes string
@@ -119,19 +148,25 @@ If @var{string} is a multibyte string, this can be greater than
  @code{(length @var{string})}.
  @end defun
  
+@defun unibyte-string &rest bytes
+This function concatenates all its argument @var{bytes} and makes the
+result a unibyte string.
+@end defun
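
As a rough illustration of the three string functions documented here, the sketch below compares a unibyte string with a multibyte one. The helper name `my-string-stats` is hypothetical and not part of Emacs.

```elisp
;; Hypothetical helper, not part of Emacs: report the representation,
;; the length in characters, and the length in bytes of STRING.
(defun my-string-stats (string)
  (list (multibyte-string-p string)
        (length string)
        (string-bytes string)))

(my-string-stats (unibyte-string 104 105 200))  ; "hi" plus raw byte 200
;; => (nil 3 3)
(my-string-stats "h\u00ef")                     ; h, i-diaeresis
;; => (t 2 3)
```

In the unibyte string, characters and bytes coincide; in the multibyte string, the non-ASCII character accounts for two of the three bytes.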
+
  @node Converting Representations
  @section Converting Text Representations
  
    Emacs can convert unibyte text to multibyte; it can also convert
-multibyte text to unibyte, though this conversion loses information.  In
-general these conversions happen when inserting text into a buffer, or
-when putting text from several strings together in one string.  You can
-also explicitly convert a string's contents to either representation.
-
-  Emacs chooses the representation for a string based on the text that
-it is constructed from.  The general rule is to convert unibyte text to
-multibyte text when combining it with other multibyte text, because the
-multibyte representation is more general and can hold whatever
+multibyte text to unibyte, provided that the multibyte text contains
+only @acronym{ASCII} and 8-bit raw bytes.  In general, these
+conversions happen when inserting text into a buffer, or when putting
+text from several strings together in one string.  You can also
+explicitly convert a string's contents to either representation.
+
+  Emacs chooses the representation for a string based on the text from
+which it is constructed.  The general rule is to convert unibyte text
+to multibyte text when combining it with other multibyte text, because
+the multibyte representation is more general and can hold whatever
  characters the unibyte text has.
  
    When inserting text into a buffer, Emacs converts the text to the
@@ -144,90 +179,48 @@ alternative, to convert the buffer contents to multibyte, is not
  acceptable because the buffer's representation is a choice made by the
  user that cannot be overridden automatically.
  
-  Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and likewise character codes 128 through 159.  It converts
-the non-@acronym{ASCII} codes 160 through 255 by adding the value
-@code{nonascii-insert-offset} to each character code.  By setting this
-variable, you specify which character set the unibyte characters
-correspond to (@pxref{Character Sets}).  For example, if
-@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
-'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
-correspond to Latin 1.  If it is 2688, which is @code{(- (make-char
-'greek-iso8859-7) 128)}, then they correspond to Greek letters.
-
-  Converting multibyte text to unibyte is simpler: it discards all but
-the low 8 bits of each character code.  If @code{nonascii-insert-offset}
-has a reasonable value, corresponding to the beginning of some character
-set, this conversion is the inverse of the other: converting unibyte
-text to multibyte and back to unibyte reproduces the original unibyte
-text.
-
-@defvar nonascii-insert-offset
-This variable specifies the amount to add to a non-@acronym{ASCII} character
-when converting unibyte text to multibyte.  It also applies when
-@code{self-insert-command} inserts a character in the unibyte
-non-@acronym{ASCII} range, 128 through 255.  However, the functions
-@code{insert} and @code{insert-char} do not perform this conversion.
-
-The right value to use to select character set @var{cs} is @code{(-
-(make-char @var{cs}) 128)}.  If the value of
-@code{nonascii-insert-offset} is zero, then conversion actually uses the
-value for the Latin 1 character set, rather than zero.
-@end defvar
+  Converting unibyte text to multibyte text leaves @acronym{ASCII}
+characters unchanged, and converts bytes with codes 128 through 159 to
+the multibyte representation of raw eight-bit bytes.
  
-@defvar nonascii-translation-table
-This variable provides a more general alternative to
-@code{nonascii-insert-offset}.  You can use it to specify independently
-how to translate each code in the range of 128 through 255 into a
-multibyte character.  The value should be a char-table, or @code{nil}.
-If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
-@end defvar
+  Converting multibyte text to unibyte converts all @acronym{ASCII}
+and eight-bit characters to their single-byte form, but loses
+information for non-@acronym{ASCII} characters by discarding all but
+the low 8 bits of each character's codepoint.  Converting unibyte text
+to multibyte and back to unibyte reproduces the original unibyte text.
  
-The next three functions either return the argument @var{string}, or a
+The next two functions either return the argument @var{string}, or a
  newly created string with no text properties.
  
-@defun string-make-unibyte string
-This function converts the text of @var{string} to unibyte
-representation, if it isn't already, and returns the result.  If
-@var{string} is a unibyte string, it is returned unchanged.  Multibyte
-character codes are converted to unibyte according to
-@code{nonascii-translation-table} or, if that is @code{nil}, using
-@code{nonascii-insert-offset}.  If the lookup in the translation table
-fails, this function takes just the low 8 bits of each character.
-@end defun
-
-@defun string-make-multibyte string
-This function converts the text of @var{string} to multibyte
-representation, if it isn't already, and returns the result.  If
-@var{string} is a multibyte string or consists entirely of
-@acronym{ASCII} characters, it is returned unchanged.  In particular,
-if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
-string is unibyte.  (When the characters are all @acronym{ASCII},
-Emacs primitives will treat the string the same way whether it is
-unibyte or multibyte.)  If @var{string} is unibyte and contains
-non-@acronym{ASCII} characters, the function
-@code{unibyte-char-to-multibyte} is used to convert each unibyte
-character to a multibyte character.
-@end defun
-
  @defun string-to-multibyte string
  This function returns a multibyte string containing the same sequence
-of character codes as @var{string}.  Unlike
-@code{string-make-multibyte}, this function unconditionally returns a
-multibyte string.  If @var{string} is a multibyte string, it is
-returned unchanged.
+of characters as @var{string}.  If @var{string} is a multibyte string,
+it is returned unchanged.  The function assumes that @var{string}
+includes only @acronym{ASCII} characters and raw 8-bit bytes; the
+latter are converted to their multibyte representation corresponding
+to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
+Representations, codepoints}).
+@end defun
+
+@defun string-to-unibyte string
+This function returns a unibyte string containing the same sequence of
+characters as @var{string}.  It signals an error if @var{string}
+contains a non-@acronym{ASCII} character.  If @var{string} is a
+unibyte string, it is returned unchanged.  Use this function for
+@var{string} arguments that contain only @acronym{ASCII} and eight-bit
+characters.
  @end defun
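
The two conversions can be tried on a string that mixes an ASCII byte with a raw 8-bit byte. This is a sketch under the assumptions stated in the text; the helper name `my-raw-byte-round-trip` and the symbol `not-convertible` are hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-raw-byte-round-trip ()
  (let* ((raw (unibyte-string 97 200))       ; "a" and raw byte #xC8
         (multi (string-to-multibyte raw)))
    (list (multibyte-string-p multi)
          ;; Raw byte #xC8 becomes the eight-bit character #x3FFFC8.
          (= (aref multi 1) #x3FFFC8)
          ;; Converting back reproduces the original unibyte string.
          (equal (string-to-unibyte multi) raw)
          ;; A genuine non-ASCII character cannot be converted.
          (condition-case nil
              (string-to-unibyte "\u00ef")
            (error 'not-convertible)))))

(my-raw-byte-round-trip)
;; => (t t t not-convertible)
```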
  
  @defun multibyte-char-to-unibyte char
-This convert the multibyte character @var{char} to a unibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+This converts the multibyte character @var{char} to a unibyte
+character, and returns that character.  If @var{char} is neither
+@acronym{ASCII} nor eight-bit, the function returns -1.
  @end defun
  
  @defun unibyte-char-to-multibyte char
  This converts the unibyte character @var{char} to a multibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+character, assuming @var{char} is either @acronym{ASCII} or a raw
+8-bit byte.
  @end defun
  
  @node Selecting a Representation
@@ -242,13 +235,13 @@ is non-@code{nil}, the buffer becomes multibyte.  If @var{multibyte}
  is @code{nil}, the buffer becomes unibyte.
  
  This function leaves the buffer contents unchanged when viewed as a
-sequence of bytes.  As a consequence, it can change the contents viewed
-as characters; a sequence of two bytes which is treated as one character
-in multibyte representation will count as two characters in unibyte
-representation.  Character codes 128 through 159 are an exception.  They
-are represented by one byte in a unibyte buffer, but when the buffer is
-set to multibyte, they are converted to two-byte sequences, and vice
-versa.
+sequence of bytes.  As a consequence, it can change the contents
+viewed as characters; for instance, a sequence of three bytes which is
+treated as one character in multibyte representation will count as
+three characters in unibyte representation.  Eight-bit characters
+representing raw bytes are an exception.  They are represented by one
+byte in a unibyte buffer, but when the buffer is set to multibyte,
+they are converted to two-byte sequences, and vice versa.
  
  This function sets @code{enable-multibyte-characters} to record which
  representation is in use.  It also adjusts various data in the buffer
@@ -261,255 +254,436 @@ base buffer.
  @end defun
  
  @defun string-as-unibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each byte as a character.  This means that the value may have
-more characters than @var{string} has.
-
-If @var{string} is already a unibyte string, then the value is
-@var{string} itself.  Otherwise it is a newly created string, with no
-text properties.  If @var{string} is multibyte, any characters it
-contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
-are converted to the corresponding single byte.
+If @var{string} is already a unibyte string, this function returns
+@var{string} itself.  Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each byte as a separate character
+(so that the value may have more characters than @var{string}); as an
+exception, each eight-bit character representing a raw byte is
+converted into a single byte.  The newly-created string contains no
+text properties.
  @end defun
  
  @defun string-as-multibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character.  This means that the
-value may have fewer characters than @var{string} has.
-
-If @var{string} is already a multibyte string, then the value is
-@var{string} itself.  Otherwise it is a newly created string, with no
-text properties.  If @var{string} is unibyte and contains any individual
-8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
-the corresponding multibyte character of charset @code{eight-bit-control}
-or @code{eight-bit-graphic}.
+If @var{string} is a multibyte string, this function returns
+@var{string} itself.  Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each multibyte sequence as one
+character.  This means that the value may have fewer characters than
+@var{string} has.  If a byte sequence in @var{string} is invalid as a
+multibyte representation of a single character, each byte in the
+sequence is treated as a raw 8-bit byte.  The newly-created string
+contains no text properties.
  @end defun
  
  @node Character Codes
  @section Character Codes
  @cindex character codes
  
-  The unibyte and multibyte text representations use different character
-codes.  The valid character codes for unibyte representation range from
-0 to 255---the values that can fit in one byte.  The valid character
-codes for multibyte representation range from 0 to 524287, but not all
-values in that range are valid.  The values 128 through 255 are not
-entirely proper in multibyte text, but they can occur if you do explicit
-encoding and decoding (@pxref{Explicit Encoding}).  Some other character
-codes cannot occur at all in multibyte text.  Only the @acronym{ASCII} codes
-0 through 127 are completely legitimate in both representations.
-
-@defun char-valid-p charcode &optional genericp
-This returns @code{t} if @var{charcode} is valid (either for unibyte
-text or for multibyte text).
+  The unibyte and multibyte text representations use different
+character codes.  The valid character codes for unibyte representation
+range from 0 to 255---the values that can fit in one byte.  The valid
+character codes for multibyte representation range from 0 to 4194303
+(#x3FFFFF).  In this code space, values 0 through 127 are for
+@acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F)
+are for non-@acronym{ASCII} characters.  Values 0 through 1114111
+(#x10FFFF) correspond to Unicode characters of the same codepoint;
+values 1114112 (#x110000) through 4194175 (#x3FFF7F) represent
+characters that are not unified with Unicode; and values 4194176
+(#x3FFF80) through 4194303 (#x3FFFFF) represent eight-bit raw bytes.
+
+@defun characterp charcode
+This returns @code{t} if @var{charcode} is a valid character, and
+@code{nil} otherwise.
  
  @example
-(char-valid-p 65)
+@group
+(characterp 65)
+     @result{} t
+@end group
+@group
+(characterp 4194303)
       @result{} t
-(char-valid-p 256)
+@end group
+@group
+(characterp 4194304)
       @result{} nil
-(char-valid-p 2248)
+@end group
+@end example
+@end defun
+
+@cindex maximum value of character codepoint
+@cindex codepoint, largest value
+@defun max-char
+This function returns the largest value that a valid character
+codepoint can have.
+
+@example
+@group
+(characterp (max-char))
       @result{} t
+@end group
+@group
+(characterp (1+ (max-char)))
+     @result{} nil
+@end group
  @end example
+@end defun
  
-If the optional argument @var{genericp} is non-@code{nil}, this
-function also returns @code{t} if @var{charcode} is a generic
-character (@pxref{Splitting Characters}).
+@defun get-byte &optional pos string
+This function returns the byte at character position @var{pos} in the
+current buffer.  If the current buffer is unibyte, this is literally
+the byte at that position.  If the buffer is multibyte, byte values of
+@acronym{ASCII} characters are the same as character codepoints,
+whereas eight-bit raw bytes are converted to their 8-bit codes.  The
+function signals an error if the character at @var{pos} is
+non-@acronym{ASCII}.
+
+The optional argument @var{string} means to get a byte value from that
+string instead of the current buffer.
  @end defun
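
A short sketch of `get-byte` applied to strings, covering the three cases described above. The helper name `my-get-byte-demo` and the symbol `not-a-byte` are hypothetical; string positions are zero-based.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-get-byte-demo ()
  (let ((raw (unibyte-string 97 200)))
    (list (get-byte 0 "a")                          ; ASCII character
          (get-byte 1 raw)                          ; byte in a unibyte string
          (get-byte 1 (string-to-multibyte raw))    ; eight-bit character
          ;; A non-ASCII character that is not a raw byte is an error.
          (condition-case nil
              (get-byte 0 "\u00ef")
            (error 'not-a-byte)))))

(my-get-byte-demo)
;; => (97 200 200 not-a-byte)
```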
  
-@node Character Sets
-@section Character Sets
-@cindex character sets
+@node Character Properties
+@section Character Properties
+@cindex character properties
+A @dfn{character property} is a named attribute of a character that
+specifies how the character behaves and how it should be handled
+during text processing and display.  Thus, character properties are an
+important part of specifying the character's semantics.
+
+  Emacs generally follows the Unicode Standard in its implementation
+of character properties.  In particular, Emacs supports the
+@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
+Model}, and the Emacs character property database is derived from the
+Unicode Character Database (@acronym{UCD}).  See the
+@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
+Properties chapter of the Unicode Standard}, for a detailed
+description of Unicode character properties and their meaning.  This
+section assumes you are already familiar with that chapter of the
+Unicode Standard, and want to apply that knowledge to Emacs Lisp
+programs.
  
-  Emacs classifies characters into various @dfn{character sets}, each of
-which has a name which is a symbol.  Each character belongs to one and
-only one character set.
+  In Emacs, each property has a name, which is a symbol, and a set of
+possible values, whose types depend on the property; if a character
+does not have a certain property, the value is @code{nil}.  As a
+general rule, the names of character properties in Emacs are produced
+from the corresponding Unicode properties by downcasing them and
+replacing each @samp{_} character with a dash @samp{-}.  For example,
+@code{Canonical_Combining_Class} becomes
+@code{canonical-combining-class}.  However, sometimes we shorten the
+names to make their use easier.
  
-  In general, there is one character set for each distinct script.  For
-example, @code{latin-iso8859-1} is one character set,
-@code{greek-iso8859-7} is another, and @code{ascii} is another.  An
-Emacs character set can hold at most 9025 characters; therefore, in some
-cases, characters that would logically be grouped together are split
-into several character sets.  For example, one set of Chinese
-characters, generally known as Big 5, is divided into two Emacs
-character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
+  Here is the full list of value types for all the character
+properties that Emacs knows about:
  
-  @acronym{ASCII} characters are in character set @code{ascii}.  The
-non-@acronym{ASCII} characters 128 through 159 are in character set
-@code{eight-bit-control}, and codes 160 through 255 are in character set
-@code{eight-bit-graphic}.
+@table @code
+@item name
+This property corresponds to the Unicode @code{Name} property.  The
+value is a string consisting of upper-case Latin letters A to Z,
+digits, spaces, and hyphen @samp{-} characters.
+
+@item general-category
+This property corresponds to the Unicode @code{General_Category}
+property.  The value is a symbol whose name is a 2-letter abbreviation
+of the character's classification.
+
+@item canonical-combining-class
+Corresponds to the Unicode @code{Canonical_Combining_Class} property.
+The value is an integer number.
+
+@item bidi-class
+Corresponds to the Unicode @code{Bidi_Class} property.  The value is a
+symbol whose name is the Unicode @dfn{directional type} of the
+character.
+
+@item decomposition
+Corresponds to the Unicode @code{Decomposition_Type} and
+@code{Decomposition_Value} properties.  The value is a list, whose
+first element may be a symbol representing a compatibility formatting
+tag, such as @code{small}@footnote{
+Note that the Unicode spec writes these tag names inside
+@samp{<..>} brackets.  The tag names in Emacs do not include the
+brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
+@samp{small}.
+}; the other elements are characters that give the compatibility
+decomposition sequence of this character.
+
+@item decimal-digit-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Decimal}.  The value is
+an integer number.
+
+@item digit-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Digit}.  The value is
+an integer number.  Examples of such characters include compatibility
+subscript and superscript digits, for which the value is the
+corresponding number.
+
+@item numeric-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Numeric}.  The value of
+this property is an integer or a floating-point number.  Examples of
+characters that have this property include fractions, subscripts,
+superscripts, Roman numerals, currency numerators, and encircled
+numbers.  For example, the value of this property for the character
+@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}.
+
+@item mirrored
+Corresponds to the Unicode @code{Bidi_Mirrored} property.  The value
+of this property is a symbol, either @code{Y} or @code{N}.
+
+@item old-name
+Corresponds to the Unicode @code{Unicode_1_Name} property.  The value
+is a string.
+
+@item iso-10646-comment
+Corresponds to the Unicode @code{ISO_Comment} property.  The value is
+a string.
+
+@item uppercase
+Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
+The value of this property is a single character.
+
+@item lowercase
+Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
+The value of this property is a single character.
+
+@item titlecase
+Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
+@dfn{Title case} is a special form of a character used when the first
+character of a word needs to be capitalized.  The value of this
+property is a single character.
+@end table
  
  
-@defun charsetp object
-Returns @code{t} if @var{object} is a symbol that names a character set,
-@code{nil} otherwise.
+@defun get-char-code-property char propname
+This function returns the value of @var{char}'s @var{propname} property.
+
+@example
+@group
+(get-char-code-property ?  'general-category)
+     @result{} Zs
+@end group
+@group
+(get-char-code-property ?1  'general-category)
+     @result{} Nd
+@end group
+@group
+(get-char-code-property ?\u2084 'digit-value) ; subscript 4
+     @result{} 4
+@end group
+@group
+(get-char-code-property ?\u2155 'numeric-value) ; one fifth
+     @result{} 0.2
+@end group
+@group
+(get-char-code-property ?\u2163 'numeric-value) ; Roman IV
+     @result{} 4
+@end group
+@end example
  @end defun
  
-@defvar charset-list
-The value is a list of all defined character set names.
-@end defvar
+@defun char-code-property-description prop value
+This function returns the description string of property @var{prop}'s
+@var{value}, or @code{nil} if @var{value} has no description.
  
  
-@defun charset-list
-This function returns the value of @code{charset-list}.  It is only
-provided for backward compatibility.
+@example
+@group
+(char-code-property-description 'general-category 'Zs)
+     @result{} "Separator, Space"
+@end group
+@group
+(char-code-property-description 'general-category 'Nd)
+     @result{} "Number, Decimal Digit"
+@end group
+@group
+(char-code-property-description 'numeric-value '1/5)
+     @result{} nil
+@end group
+@end example
  @end defun
  
-@defun char-charset character
-This function returns the name of the character set that @var{character}
-belongs to, or the symbol @code{unknown} if @var{character} is not a
-valid character.
+@defun put-char-code-property char propname value
+This function stores @var{value} as the value of the property
+@var{propname} for the character @var{char}.
  @end defun
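+
+  As a minimal sketch of how this function pairs with
+@code{get-char-code-property}, the following stores and then reads
+back a value.  The property name @code{my-note} is hypothetical,
+chosen only for illustration; it is not a Unicode property.
+
+@example
+@group
+;; `my-note' is an invented property name used for this sketch.
+(put-char-code-property ?a 'my-note "first letter")
+(get-char-code-property ?a 'my-note)
+     @result{} "first letter"
+@end group
+@end example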
  
-@defun charset-plist charset
-This function returns the charset property list of the character set
-@var{charset}.  Although @var{charset} is a symbol, this is not the same
-as the property list of that symbol.  Charset properties are used for
-special purposes within Emacs.
-@end defun
+@defvar char-script-table
+The value of this variable is a char-table (@pxref{Char-Tables}) that
+specifies, for each character, a symbol whose name is the script to
+which the character belongs, according to the Unicode Standard
+classification of the Unicode code space into script-specific blocks.
+This char-table has a single extra slot whose value is the list of all
+script symbols.
+@end defvar
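+
+  For instance, assuming the default tables that Emacs sets up at
+startup, the script of a character can be looked up with @code{aref}:
+
+@example
+@group
+(aref char-script-table ?a)
+     @result{} latin
+(aref char-script-table ?\u03b1) ; Greek small letter alpha
+     @result{} greek
+@end group
+@group
+;; The single extra slot holds the list of all script symbols.
+(char-table-extra-slot char-script-table 0)
+@end group
+@end example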
  
  
-@deffn Command list-charset-chars charset
-This command displays a list of characters in the character set
-@var{charset}.
-@end deffn
+@defvar char-width-table
+The value of this variable is a char-table that specifies the width of
+each character in columns that it will occupy on the screen.
+@end defvar
  
  
-@node Chars and Bytes
-@section Characters and Bytes
-@cindex bytes and characters
+@defvar printable-chars
+The value of this variable is a char-table that specifies, for each
+character, whether it is printable or not.  That is, if evaluating
+@code{(aref printable-chars char)} results in @code{t}, the character
+is printable, and if it results in @code{nil}, it is not.
+@end defvar
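+
+  The two tables above can be queried directly with @code{aref}.  The
+following is only a sketch and assumes the default table contents:
+
+@example
+@group
+(aref char-width-table ?a)     ; an ordinary Latin letter
+     @result{} 1
+(aref printable-chars ?a)
+     @result{} t
+(aref printable-chars ?\C-a)   ; the control character with code 1
+     @result{} nil
+@end group
+@end example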
  
  
-@cindex introduction sequence (of character)
-@cindex dimension (of character set)
-  In multibyte representation, each character occupies one or more
-bytes.  Each character set has an @dfn{introduction sequence}, which is
-normally one or two bytes long.  (Exception: the @code{ascii} character
-set and the @code{eight-bit-graphic} character set have a zero-length
-introduction sequence.)  The introduction sequence is the beginning of
-the byte sequence for any character in the character set.  The rest of
-the character's bytes distinguish it from the other characters in the
-same character set.  Depending on the character set, there are either
-one or two distinguishing bytes; the number of such bytes is called the
-@dfn{dimension} of the character set.
+@node Character Sets
+@section Character Sets
+@cindex character sets
  
  
-@defun charset-dimension charset
-This function returns the dimension of @var{charset}; at present, the
-dimension is always 1 or 2.
-@end defun
+@cindex charset
+@cindex coded character set
+An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
+in which each character is assigned a numeric code point.  (The
+Unicode standard calls this a @dfn{coded character set}.)  Each Emacs
+charset has a name which is a symbol.  A single character can belong
+to any number of different character sets, but it will generally have
+a different code point in each charset.  Examples of character sets
+include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
+@code{windows-1255}.  The code point assigned to a character in a
+charset is usually different from its code point used in Emacs buffers
+and strings.
+
+@cindex @code{emacs}, a charset
+@cindex @code{unicode}, a charset
+@cindex @code{eight-bit}, a charset
+  Emacs defines several special character sets.  The character set
+@code{unicode} includes all the characters whose Emacs code points are
+in the range @code{0..10FFFF}.  The character set @code{emacs}
+includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
+Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
+Emacs uses it to represent raw bytes encountered in text.
  
  
-@defun charset-bytes charset
-This function returns the number of bytes used to represent a character
-in character set @var{charset}.
+@defun charsetp object
+Returns @code{t} if @var{object} is a symbol that names a character set,
+@code{nil} otherwise.
  @end defun
  
-  This is the simplest way to determine the byte length of a character
-set's introduction sequence:
+@defvar charset-list
+The value is a list of all defined character set names.
+@end defvar
  
  
-@example
-(- (charset-bytes @var{charset})
-   (charset-dimension @var{charset}))
-@end example
+@defun charset-priority-list &optional highestp
+This function returns a list of all defined character sets ordered by
+their priority.  If @var{highestp} is non-@code{nil}, the function
+returns a single character set of the highest priority.
+@end defun
  
  
-@node Splitting Characters
-@section Splitting Characters
-@cindex character as bytes
+@defun set-charset-priority &rest charsets
+This function makes @var{charsets} the highest priority character sets.
+@end defun
  
  
-  The functions in this section convert between characters and the byte
-values used to represent them.  For most purposes, there is no need to
-be concerned with the sequence of bytes used to represent a character,
-because Emacs translates automatically when necessary.
+@defun char-charset character &optional restriction
+This function returns the name of the character set of highest
+priority that @var{character} belongs to.  @acronym{ASCII} characters
+are an exception: for them, this function always returns @code{ascii}.
  
  
-@defun split-char character
-Return a list containing the name of the character set of
-@var{character}, followed by one or two byte values (integers) which
-identify @var{character} within that character set.  The number of byte
-values is the character set's dimension.
+If @var{restriction} is non-@code{nil}, it should be a list of
+charsets to search.  Alternatively, it can be a coding system, in
+which case the returned charset must be supported by that coding
+system (@pxref{Coding Systems}).
+@end defun
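+
+  A short sketch of these functions in use.  Apart from the
+@acronym{ASCII} case, the values depend on the current charset
+priorities, so they are not shown:
+
+@example
+@group
+(char-charset ?a)
+     @result{} ascii
+@end group
+@group
+;; The single charset that currently has the highest priority.
+(charset-priority-list t)
+;; Consider only the two listed charsets for e-acute (U+00E9).
+(char-charset ?\u00e9 '(iso-8859-1 unicode))
+@end group
+@end example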
  
  
-If @var{character} is invalid as a character code, @code{split-char}
-returns a list consisting of the symbol @code{unknown} and @var{character}.
+@defun charset-plist charset
+This function returns the property list of the character set
+@var{charset}.  Although @var{charset} is a symbol, this is not the
+same as the property list of that symbol.  Charset properties include
+important information about the charset, such as its documentation
+string, short name, etc.
+@end defun
  
  
-@example
-(split-char 2248)
-     @result{} (latin-iso8859-1 72)
-(split-char 65)
-     @result{} (ascii 65)
-(split-char 128)
-     @result{} (eight-bit-control 128)
-@end example
+@defun put-charset-property charset propname value
+This function sets the @var{propname} property of @var{charset} to the
+given @var{value}.
  @end defun
  
-@cindex generate characters in charsets
-@defun make-char charset &optional code1 code2
-This function returns the character in character set @var{charset} whose
-position codes are @var{code1} and @var{code2}.  This is roughly the
-inverse of @code{split-char}.  Normally, you should specify either one
-or both of @var{code1} and @var{code2} according to the dimension of
-@var{charset}.  For example,
+@defun get-charset-property charset propname
+This function returns the value of @var{charset}'s property
+@var{propname}.
+@end defun
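+
+  The following sketch shows the charset predicate and the two
+property accessors together.  The property name @code{my-note} is
+hypothetical and used only for illustration:
+
+@example
+@group
+(charsetp 'ascii)
+     @result{} t
+(charsetp 'no-such-charset)
+     @result{} nil
+@end group
+@group
+;; `my-note' is an invented property name used for this sketch.
+(put-charset-property 'ascii 'my-note "7-bit")
+(get-charset-property 'ascii 'my-note)
+     @result{} "7-bit"
+@end group
+@end example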
  
  
-@example
-(make-char 'latin-iso8859-1 72)
-     @result{} 2248
-@end example
+@deffn Command list-charset-chars charset
+This command displays a list of characters in the character set
+@var{charset}.
+@end deffn
  
  
-Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
-before they are used to index @var{charset}.  Thus you may use, for
-instance, an ISO 8859 character code rather than subtracting 128, as
-is necessary to index the corresponding Emacs charset.
+  Emacs can convert between its internal representation of a character
+and the character's codepoint in a specific charset.  The following
+two functions support these conversions.
+
+@c FIXME: decode-char and encode-char accept and ignore an additional
+@c argument @var{restriction}.  When that argument actually makes a
+@c difference, it should be documented here.
+@defun decode-char charset code-point
+This function decodes a character that is assigned a @var{code-point}
+in @var{charset}, to the corresponding Emacs character, and returns
+it.  If @var{charset} doesn't contain a character of that code point,
+the value is @code{nil}.  If @var{code-point} doesn't fit in a Lisp
+integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
+specified as a cons cell @code{(@var{high} . @var{low})}, where
+@var{low} is the lower 16 bits of the value and @var{high} is the
+upper 16 bits.
  @end defun
  
-@cindex generic characters
-  If you call @code{make-char} with no @var{byte-values}, the result is
-a @dfn{generic character} which stands for @var{charset}.  A generic
-character is an integer, but it is @emph{not} valid for insertion in the
-buffer as a character.  It can be used in @code{char-table-range} to
-refer to the whole character set (@pxref{Char-Tables}).
-@code{char-valid-p} returns @code{nil} for generic characters.
-For example:
-
-@example
-(make-char 'latin-iso8859-1)
-     @result{} 2176
-(char-valid-p 2176)
-     @result{} nil
-(char-valid-p 2176 t)
-     @result{} t
-(split-char 2176)
-     @result{} (latin-iso8859-1 0)
-@end example
+@defun encode-char char charset
+This function returns the code point assigned to the character
+@var{char} in @var{charset}.  If the result does not fit in a Lisp
+integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
+that fits the second argument of @code{decode-char} above.  If
+@var{charset} doesn't have a codepoint for @var{char}, the value is
+@code{nil}.
+@end defun
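+
+  As an illustration of the two conversions, the sketch below uses
+the charsets @code{iso-8859-1} and @code{ascii}.  In
+@code{iso-8859-1} the code point of a character coincides with its
+Emacs character code, which keeps the results easy to check:
+
+@example
+@group
+(decode-char 'iso-8859-1 #xE9)     ; e-acute, U+00E9
+     @result{} 233
+(encode-char ?\u00e9 'iso-8859-1)
+     @result{} 233
+@end group
+@group
+(decode-char 'ascii #xE9)          ; no such code point in ascii
+     @result{} nil
+(encode-char ?\u03b1 'iso-8859-1)  ; Greek alpha is not in Latin-1
+     @result{} nil
+@end group
+@end example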
  
  
-The character sets @code{ascii}, @code{eight-bit-control}, and
-@code{eight-bit-graphic} don't have corresponding generic characters.  If
-@var{charset} is one of them and you don't supply @var{code1},
-@code{make-char} returns the character code corresponding to the
-smallest code in @var{charset}.
+  The following function comes in handy for applying a certain
+function to all or part of the characters in a charset:
+
+@defun map-charset-chars function charset &optional arg from to
+Call @var{function} for characters in @var{charset}.  @var{function}
+is called with two arguments.  The first one is a cons cell
+@code{(@var{from} . @var{to})}, where @var{from} and @var{to}
+indicate a range of characters contained in @var{charset}.  The second
+argument is the optional argument @var{arg}.
+
+By default, the range of codepoints passed to @var{function} includes
+all the characters in @var{charset}, but optional arguments @var{from}
+and @var{to} limit that to the range of characters between these two
+codepoints.  If either of them is @code{nil}, it defaults to the first
+or last codepoint of @var{charset}, respectively.
+@end defun
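+
+  A minimal sketch of a caller: it collects every range that
+@code{map-charset-chars} reports for the @code{ascii} charset.  The
+variable name @code{ranges} is an assumption of this example:
+
+@example
+@group
+(let (ranges)
+  (map-charset-chars
+   ;; Called with a (FROM . TO) cons cell and the optional ARG.
+   (lambda (range _arg) (push range ranges))
+   'ascii)
+  (nreverse ranges))
+@end group
+@end example
+
+The value is a list of cons cells, one for each range passed to the
+function.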
  
  @node Scanning Charsets
  @section Scanning for Character Sets
  
-  Sometimes it is useful to find out which character sets appear in a
-part of a buffer or a string.  One use for this is in determining which
-coding systems (@pxref{Coding Systems}) are capable of representing all
-of the text in question.
+  Sometimes it is useful to find out which character set a particular
+character belongs to.  One use for this is in determining which coding
+systems (@pxref{Coding Systems}) are capable of representing all of
+the text in question; another is to determine the font(s) for
+displaying that text.
  
  @defun charset-after &optional pos
-This function return the charset of a character in the current buffer
-at position @var{pos}.  If @var{pos} is omitted or @code{nil}, it
-defaults to the current value of point.  If @var{pos} is out of range,
-the value is @code{nil}.
+This function returns the charset of highest priority containing the
+character at position @var{pos} in the current buffer.  If @var{pos}
+is omitted or @code{nil}, it defaults to the current value of point.
+If @var{pos} is out of range, the value is @code{nil}.
  @end defun
  
  @defun find-charset-region beg end &optional translation
-This function returns a list of the character sets that appear in the
-current buffer between positions @var{beg} and @var{end}.
+This function returns a list of the character sets of highest priority
+that contain characters in the current buffer between positions
+@var{beg} and @var{end}.
  
  
-The optional argument @var{translation} specifies a translation table to
-be used in scanning the text (@pxref{Translation of Characters}).  If it
-is non-@code{nil}, then each character in the region is translated
+The optional argument @var{translation} specifies a translation table
+to use for scanning the text (@pxref{Translation of Characters}).  If
+it is non-@code{nil}, then each character in the region is translated
  through this table, and the value returned describes the translated
  characters instead of the characters actually in the buffer.
  @end defun
  
  @defun find-charset-string string &optional translation
-This function returns a list of the character sets that appear in the
-string @var{string}.  It is just like @code{find-charset-region}, except
-that it applies to the contents of @var{string} instead of part of the
-current buffer.
+This function returns a list of character sets of highest priority
+that contain characters in @var{string}.  It is just like
+@code{find-charset-region}, except that it applies to the contents of
+@var{string} instead of part of the current buffer.
  @end defun
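+
+  For example, pure @acronym{ASCII} text involves only the
+@code{ascii} charset.  This is a small sketch using a temporary
+buffer for the region variant:
+
+@example
+@group
+(find-charset-string "abc")
+     @result{} (ascii)
+(find-charset-string "")
+     @result{} nil
+@end group
+@group
+(with-temp-buffer
+  (insert "abc")
+  (find-charset-region (point-min) (point-max)))
+     @result{} (ascii)
+@end group
+@end example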
  
  @node Translation of Characters
@@ -517,19 +691,18 @@ current buffer.
  @cindex character translation tables
  @cindex translation tables
  
-  A @dfn{translation table} is a char-table that specifies a mapping
-of characters into characters.  These tables are used in encoding and
-decoding, and for other purposes.  Some coding systems specify their
-own particular translation tables; there are also default translation
-tables which apply to all other coding systems.
+  A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
+specifies a mapping of characters into characters.  These tables are
+used in encoding and decoding, and for other purposes.  Some coding
+systems specify their own particular translation tables; there are
+also default translation tables which apply to all other coding
+systems.
  
  
-  For instance, the coding-system @code{utf-8} has a translation table
-that maps characters of various charsets (e.g.,
-@code{latin-iso8859-@var{x}}) into Unicode character sets.  This way,
-it can encode Latin-2 characters into UTF-8.  Meanwhile,
-@code{unify-8859-on-decoding-mode} operates by specifying
-@code{standard-translation-table-for-decode} to translate
-Latin-@var{x} characters into corresponding Unicode characters.
+  A translation table has two extra slots.  The first is either
+@code{nil} or a translation table that performs the reverse
+translation; the second is the maximum number of characters to look up
+for translating sequences of characters (see the description of
+@code{make-translation-table-from-alist} below).
  
  @defun make-translation-table &rest translations
  This function returns a translation table based on the argument
@@ -541,47 +714,78 @@ The arguments and the forms in each argument are processed in order,
  and if a previous form already translates @var{to} to some other
  character, say @var{to-alt}, @var{from} is also translated to
  @var{to-alt}.
+@end defun
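+
+  As a small sketch of the rules above, a translation table can be
+inspected with @code{aref}.  The second form illustrates the
+@var{to-alt} rule: since @samp{b} already translates to @samp{c}, a
+later mapping of @samp{a} to @samp{b} ends up at @samp{c}.
+
+@example
+@group
+(let ((table (make-translation-table '((?a . ?b)))))
+  (aref table ?a))
+     @result{} 98     ; ?b
+@end group
+@group
+(let ((table (make-translation-table '((?b . ?c)) '((?a . ?b)))))
+  (aref table ?a))
+     @result{} 99     ; ?c
+@end group
+@end example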
  
  
-You can also map one whole character set into another character set with
-the same dimension.  To do this, you specify a generic character (which
-designates a character set) for @var{from} (@pxref{Splitting Characters}).
-In this case, if @var{to} is also a generic character, its character
-set should have the same dimension as @var{from}'s.  Then the
-translation table translates each character of @var{from}'s character
-set into the corresponding character of @var{to}'s character set.  If
-@var{from} is a generic character and @var{to} is an ordinary
-character, then the translation table translates every character of
-@var{from}'s character set into @var{to}.
-@end defun
-
-  In decoding, the translation table's translations are applied to the
-characters that result from ordinary decoding.  If a coding system has
-property @code{translation-table-for-decode}, that specifies the
-translation table to use.  (This is a property of the coding system,
-as returned by @code{coding-system-get}, not a property of the symbol
-that is the coding system's name. @xref{Coding System Basics,, Basic
-Concepts of Coding Systems}.)  Otherwise, if
-@code{standard-translation-table-for-decode} is non-@code{nil},
-decoding uses that table.
-
-  In encoding, the translation table's translations are applied to the
-characters in the buffer, and the result of translation is actually
-encoded.  If a coding system has property
-@code{translation-table-for-encode}, that specifies the translation
-table to use.  Otherwise the variable
-@code{standard-translation-table-for-encode} specifies the translation
-table.
+  During decoding, the translation table's translations are applied to
+the characters that result from ordinary decoding.  If a coding system
+has the property @code{:decode-translation-table}, that specifies the
+translation table to use, or a list of translation tables to apply in
+sequence.  (This is a property of the coding system, as returned by
+@code{coding-system-get}, not a property of the symbol that is the
+coding system's name.  @xref{Coding System Basics,, Basic Concepts of
+Coding Systems}.)  Finally, if
+@code{standard-translation-table-for-decode} is non-@code{nil}, the
+resulting characters are translated by that table.
+
+  During encoding, the translation table's translations are applied to
+the characters in the buffer, and the result of translation is
+actually encoded.  If a coding system has property
+@code{:encode-translation-table}, that specifies the translation table
+to use, or a list of translation tables to apply in sequence.  In
+addition, if the variable @code{standard-translation-table-for-encode}
+is non-@code{nil}, it specifies the translation table to use for
+translating the result.
  
  @defvar standard-translation-table-for-decode
-This is the default translation table for decoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for decoding.  If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
  @end defvar
  
  @defvar standard-translation-table-for-encode
-This is the default translation table for encoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for encoding.  If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
+@end defvar
+
+@defvar translation-table-for-input
+Self-inserting characters are translated through this translation
+table before they are inserted.  Search commands also translate their
+input through this table, so they can compare more reliably with
+what's in the buffer.
+
+This variable automatically becomes buffer-local when set.
  @end defvar
  
+@defun make-translation-table-from-vector vec
+This function returns a translation table made from @var{vec}, an
+array of 256 elements that maps byte values 0 through 255 to
+characters.  Elements may be @code{nil} for untranslated bytes.  The
+returned table has a translation table for reverse mapping in the
+first extra slot, and the value @code{1} in the second extra slot.
+
+This function provides an easy way to make a private coding system
+that maps each byte to a specific character.  You can specify the
+returned table and the reverse translation table using the properties
+@code{:decode-translation-table} and @code{:encode-translation-table}
+respectively in the @var{props} argument to
+@code{define-coding-system}.
+@end defun
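+
+  The following sketch builds such a table for a single byte.  The
+choice of mapping byte @code{#x80} to the Euro sign is an assumption
+made for illustration:
+
+@example
+@group
+(let ((vec (make-vector 256 nil)))
+  ;; Map byte #x80 to U+20AC; leave every other byte untranslated.
+  (aset vec #x80 ?\u20ac)
+  (let ((table (make-translation-table-from-vector vec)))
+    (list (aref table #x80)
+          (char-table-extra-slot table 1))))
+     @result{} (8364 1)
+@end group
+@end example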
+
+@defun make-translation-table-from-alist alist
+This function is similar to @code{make-translation-table} but returns
+a complex translation table rather than a simple one-to-one mapping.
+Each element of @var{alist} is of the form @code{(@var{from}
+. @var{to})}, where @var{from} and @var{to} are either characters or
+vectors specifying a sequence of characters.  If @var{from} is a
+character, that character is translated to @var{to} (i.e.@: to a
+character or a character sequence).  If @var{from} is a vector of
+characters, that sequence is translated to @var{to}.  The returned
+table has a translation table for reverse mapping in the first extra
+slot, and the maximum length of all the @var{from} character sequences
+in the second extra slot.
+@end defun
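+
+  A hedged sketch of a call that mixes a single character and a
+character sequence.  The particular mappings are invented for the
+example.  According to the description above, the second extra slot
+should then hold the length of the longest @var{from} sequence:
+
+@example
+@group
+(let ((table (make-translation-table-from-alist
+              '((?a . ?b)            ; character to character
+                ([?x ?y] . ?z)))))   ; two-character sequence
+  ;; Maximum length of the FROM sequences in ALIST.
+  (char-table-extra-slot table 1))
+@end group
+@end example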
+
  @node Coding Systems
  @section Coding Systems
  
@@ -612,48 +816,49 @@ documented here.
  @subsection Basic Concepts of Coding Systems
  
  @cindex character code conversion
-  @dfn{Character code conversion} involves conversion between the encoding
-used inside Emacs and some other encoding.  Emacs supports many
-different encodings, in that it can convert to and from them.  For
-example, it can convert text to or from encodings such as Latin 1, Latin
-2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022.  In some
-cases, Emacs supports several alternative encodings for the same
-characters; for example, there are three coding systems for the Cyrillic
-(Russian) alphabet: ISO, Alternativnyj, and KOI8.
-
-  Most coding systems specify a particular character code for
-conversion, but some of them leave the choice unspecified---to be chosen
-heuristically for each file, based on the data.
+  @dfn{Character code conversion} involves conversion between the
+internal representation of characters used inside Emacs and some other
+encoding.  Emacs supports many different encodings, in that it can
+convert to and from them.  For example, it can convert text to or from
+encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and
+several variants of ISO 2022.  In some cases, Emacs supports several
+alternative encodings for the same characters; for example, there are
+three coding systems for the Cyrillic (Russian) alphabet: ISO,
+Alternativnyj, and KOI8.
+
+  Every coding system specifies a particular set of character code
+conversions, but the coding system @code{undecided} is special: it
+leaves the choice unspecified, to be chosen heuristically for each
+file, based on the file's data.
  
    In general, a coding system doesn't guarantee roundtrip identity:
  decoding a byte sequence using coding system, then encoding the
  resulting text in the same coding system, can produce a different byte
-sequence.  However, the following coding systems do guarantee that the
-byte sequence will be the same as what you originally decoded:
+sequence.  But some coding systems do guarantee that the byte sequence
+will be the same as what you originally decoded.  Here are a few
+examples:
  
  @quotation
-chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
-greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
-iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
-japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
+iso-8859-1, utf-8, big5, shift_jis, euc-jp
  @end quotation
  
    Encoding buffer text and then decoding the result can also fail to
-reproduce the original text.  For instance, if you encode Latin-2
-characters with @code{utf-8} and decode the result using the same
-coding system, you'll get Unicode characters (of charset
-@code{mule-unicode-0100-24ff}).  If you encode Unicode characters with
-@code{iso-latin-2} and decode the result with the same coding system,
-you'll get Latin-2 characters.
+reproduce the original text.  For instance, if you encode a character
+with a coding system which does not support that character, the result
+is unpredictable, and thus decoding it using the same coding system
+may produce a different text.  Currently, Emacs can't report errors
+that result from encoding unsupported characters.
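+
+  As a rough illustration, the first form below checks a round trip
+for text that @code{utf-8} supports.  The second encodes a character
+that @code{iso-8859-1} does not support; its result is deliberately
+not shown, because it is unpredictable as explained above.
+
+@example
+@group
+(let ((text "caf\u00e9"))
+  (string= text
+           (decode-coding-string
+            (encode-coding-string text 'utf-8)
+            'utf-8)))
+     @result{} t
+@end group
+@group
+;; Greek alpha cannot be represented in iso-8859-1.
+(encode-coding-string "\u03b1" 'iso-8859-1)
+@end group
+@end example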
  
  @cindex EOL conversion
  @cindex end-of-line conversion
  @cindex line end conversion
-  @dfn{End of line conversion} handles three different conventions used
-on various systems for representing end of line in files.  The Unix
-convention is to use the linefeed character (also called newline).  The
-DOS convention is to use a carriage-return and a linefeed at the end of
-a line.  The Mac convention is to use just carriage-return.
+  @dfn{End of line conversion} handles three different conventions
+used on various systems for representing end of line in files.  The
+Unix convention, used on GNU and Unix systems, is to use the linefeed
+character (also called newline).  The DOS convention, used on
+MS-Windows and MS-DOS systems, is to use a carriage-return and a
+linefeed at the end of a line.  The Mac convention is to use just
+carriage-return.
  
  @cindex base coding system
  @cindex variant coding system
@@ -664,46 +869,64 @@ coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
  well.  Most base coding systems have three corresponding variants whose
  names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
  
+@vindex raw-text@r{ coding system}
    The coding system @code{raw-text} is special in that it prevents
-character code conversion, and causes the buffer visited with that
-coding system to be a unibyte buffer.  It does not specify the
-end-of-line conversion, allowing that to be determined as usual by the
-data, and has the usual three variants which specify the end-of-line
-conversion.  @code{no-conversion} is equivalent to @code{raw-text-unix}:
-it specifies no conversion of either character codes or end-of-line.
-
-  The coding system @code{emacs-mule} specifies that the data is
-represented in the internal Emacs encoding.  This is like
-@code{raw-text} in that no code conversion happens, but different in
-that the result is multibyte data.
+character code conversion, and causes the buffer visited with this
+coding system to be a unibyte buffer.  For historical reasons, you can
+save both unibyte and multibyte text with this coding system.  When
+you use @code{raw-text} to encode multibyte text, it does perform one
+character code conversion: it converts eight-bit characters to their
+single-byte external representation.  @code{raw-text} does not specify
+the end-of-line conversion, allowing that to be determined as usual by
+the data, and has the usual three variants which specify the
+end-of-line conversion.
+
+@vindex no-conversion@r{ coding system}
+@vindex binary@r{ coding system}
+  @code{no-conversion} (and its alias @code{binary}) is equivalent to
+@code{raw-text-unix}: it specifies no conversion of either character
+codes or end-of-line.
+
+@vindex emacs-internal@r{ coding system}
+@vindex utf-8-emacs@r{ coding system}
+  The coding system @code{utf-8-emacs} specifies that the data is
+represented in the internal Emacs encoding (@pxref{Text
+Representations}).  This is like @code{raw-text} in that no code
+conversion happens, but different in that the result is multibyte
+data.  The name @code{emacs-internal} is an alias for
+@code{utf-8-emacs}.
  
  @defun coding-system-get coding-system property
  This function returns the specified property of the coding system
  @var{coding-system}.  Most coding system properties exist for internal
-purposes, but one that you might find useful is @code{mime-charset}.
+purposes, but one that you might find useful is @code{:mime-charset}.
  That property's value is the name used in MIME for the character coding
  which this coding system can read and write.  Examples:
  
  @example
-(coding-system-get 'iso-latin-1 'mime-charset)
+(coding-system-get 'iso-latin-1 :mime-charset)
       @result{} iso-8859-1
-(coding-system-get 'iso-2022-cn 'mime-charset)
+(coding-system-get 'iso-2022-cn :mime-charset)
       @result{} iso-2022-cn
-(coding-system-get 'cyrillic-koi8 'mime-charset)
+(coding-system-get 'cyrillic-koi8 :mime-charset)
       @result{} koi8-r
  @end example
  
-The value of the @code{mime-charset} property is also defined
+The value of the @code{:mime-charset} property is also defined
  as an alias for the coding system.
  @end defun
  
+@defun coding-system-aliases coding-system
+This function returns the list of aliases of @var{coding-system}.
+@end defun
+
  @node Encoding and I/O
  @subsection Encoding and I/O
  
    The principal purpose of coding systems is for use in reading and
-writing files.  The function @code{insert-file-contents} uses
-a coding system for decoding the file data, and @code{write-region}
-uses one to encode the buffer contents.
+writing files.  The function @code{insert-file-contents} uses a coding
+system to decode the file data, and @code{write-region} uses one to
+encode the buffer contents.
  
    You can specify the coding system to use either explicitly
  (@pxref{Specifying Coding Systems}), or implicitly using a default
@@ -783,6 +1006,7 @@ new file name for that buffer.
  
    Here are the Lisp facilities for working with coding systems:
  
+@cindex list all coding systems
  @defun coding-system-list &optional base-only
  This function returns a list of all coding system names (symbols).  If
  @var{base-only} is non-@code{nil}, the value includes only the
@@ -795,12 +1019,17 @@ This function returns @code{t} if @var{object} is a coding system
  name or @code{nil}.
  @end defun
  
+@cindex validity of coding system
+@cindex coding system, validity check
  @defun check-coding-system coding-system
-This function checks the validity of @var{coding-system}.
-If that is valid, it returns @var{coding-system}.
-Otherwise it signals an error with condition @code{coding-system-error}.
+This function checks the validity of @var{coding-system}.  If that is
+valid, it returns @var{coding-system}.  If @var{coding-system} is
+@code{nil}, the function returns @code{nil}.  For any other values, it
+signals an error whose @code{error-symbol} is @code{coding-system-error}
+(@pxref{Signaling Errors, signal}).
  @end defun
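
The three outcomes of `check-coding-system` described above (valid name, `nil`, anything else) can be sketched as follows. The wrapper `my-checked-coding-system` and the name `my-no-such-coding-system` are hypothetical, chosen only for illustration.

```elisp
;; Sketch only: `my-checked-coding-system' is a hypothetical wrapper.
(defun my-checked-coding-system (coding-system)
  "Return CODING-SYSTEM if it is valid or nil, else the symbol `invalid'."
  (condition-case nil
      (check-coding-system coding-system)
    ;; Any other value signals `coding-system-error'.
    (coding-system-error 'invalid)))

(my-checked-coding-system 'utf-8)                    ; a valid name is returned as-is
(my-checked-coding-system nil)                       ; nil is passed through
(my-checked-coding-system 'my-no-such-coding-system) ; assumed not to be defined
```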
  
+@cindex eol type of coding system
  @defun coding-system-eol-type coding-system
  This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
  conversion used by @var{coding-system}.  If @var{coding-system}
@@ -827,6 +1056,7 @@ taken from the appropriate default coding system (e.g.,
  appropriate for the underlying platform.
  @end defun
  
+@cindex eol conversion of coding system
  @defun coding-system-change-eol-conversion coding-system eol-type
  This function returns a coding system which is like @var{coding-system}
  except for its eol conversion, which is specified by @code{eol-type}.
@@ -838,6 +1068,7 @@ the end-of-line conversion from the data.
  @code{dos} and @code{mac}, respectively.
  @end defun
  
+@cindex text conversion of coding system
  @defun coding-system-change-text-conversion eol-coding text-coding
  This function returns a coding system which uses the end-of-line
  conversion of @var{eol-coding}, and the text conversion of
@@ -845,6 +1076,8 @@ conversion of @var{eol-coding}, and the text conversion of
  @code{undecided}, or one of its variants according to @var{eol-coding}.
  @end defun
  
+@cindex safely encode region
+@cindex coding systems for encoding region
  @defun find-coding-systems-region from to
  This function returns a list of coding systems that could be used to
  encode a text between @var{from} and @var{to}.  All coding systems in
@@ -855,6 +1088,8 @@ If the text contains no multibyte characters, the function returns the
  list @code{(undecided)}.
  @end defun
  
+@cindex safely encode a string
+@cindex coding systems for encoding a string
  @defun find-coding-systems-string string
  This function returns a list of coding systems that could be used to
  encode the text of @var{string}.  All coding systems in the list can
@@ -863,15 +1098,34 @@ contains no multibyte characters, this returns the list
  @code{(undecided)}.
  @end defun
  
+@cindex charset, coding systems to encode
+@cindex safely encode characters in a charset
  @defun find-coding-systems-for-charsets charsets
  This function returns a list of coding systems that could be used to
  encode all the character sets in the list @var{charsets}.
  @end defun
  
+@defun check-coding-systems-region start end coding-system-list
+This function checks whether coding systems in the list
+@code{coding-system-list} can encode all the characters in the region
+between @var{start} and @var{end}.  If all of the coding systems in
+the list can encode the specified text, the function returns
+@code{nil}.  If some coding systems cannot encode some of the
+characters, the value is an alist, each element of which has the form
+@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning
+that @var{coding-system1} cannot encode characters at buffer positions
+@var{pos1}, @var{pos2}, @enddots{}.
+
+@var{start} may be a string, in which case @var{end} is ignored and
+the returned value references string indices instead of buffer
+positions.
+@end defun
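As a rough illustration of `check-coding-systems-region` applied to a string, the sketch below asks which of two candidate coding systems fail on a short text. The helper name `my-unencodable-positions` is hypothetical.

```elisp
;; Sketch only: `my-unencodable-positions' is a hypothetical helper.
(defun my-unencodable-positions (string coding-systems)
  "Return an alist of the CODING-SYSTEMS that cannot encode STRING.
Each element lists a coding system followed by the string indices of
the characters it cannot encode.  Return nil if all can encode STRING."
  ;; With a string as the first argument, the END argument is ignored.
  (check-coding-systems-region string nil coding-systems))

;; The text contains a Latin-1 character and a Han character; only the
;; latter is outside the repertoire of `iso-latin-1'.
(my-unencodable-positions "na\u00efve \u4e2d" '(utf-8 iso-latin-1))

;; Plain ASCII text can be encoded by both candidates.
(my-unencodable-positions "plain text" '(utf-8 iso-latin-1))
```

A non-`nil` result is a signal to pick another coding system, for example via `find-coding-systems-string`.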
+
  @defun detect-coding-region start end &optional highest
  This function chooses a plausible coding system for decoding the text
-from @var{start} to @var{end}.  This text should be a byte sequence
-(@pxref{Explicit Encoding}).
+from @var{start} to @var{end}.  This text should be a byte sequence,
+i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and
+eight-bit characters (@pxref{Explicit Encoding}).
  
  Normally this function returns a list of coding systems that could
  handle decoding the text that was scanned.  They are listed in order of
@@ -883,11 +1137,52 @@ If the region contains only @acronym{ASCII} characters except for such
  ISO-2022 control characters as @code{ESC}, the value is
  @code{undecided} or @code{(undecided)}, or a variant specifying
  end-of-line conversion, if that can be deduced from the text.
+
+If the region contains null bytes, the value is @code{no-conversion},
+even if the region contains text encoded in some coding system.
  @end defun
  
  @defun detect-coding-string string &optional highest
  This function is like @code{detect-coding-region} except that it
  operates on the contents of @var{string} instead of bytes in the buffer.
+@end defun
+
+@cindex null bytes, and decoding text
+@defvar inhibit-null-byte-detection
+If this variable has a non-@code{nil} value, null bytes are ignored
+when detecting the encoding of a region or a string.  This makes it
+possible to correctly detect the encoding of text that contains null
+bytes, such as Info files with Index nodes.
+@end defvar
+
+@defvar inhibit-iso-escape-detection
+If this variable has a non-@code{nil} value, ISO-2022 escape sequences
+are ignored when detecting the encoding of a region or a string.  The
+result is that no text is ever detected as encoded in some ISO-2022
+encoding, and all escape sequences become visible in a buffer.
+@strong{Warning:} @emph{Use this variable with extreme caution,
+because many files in the Emacs distribution use ISO-2022 encoding.}
+@end defvar
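The detection functions above can be tried on byte strings directly. This is a minimal sketch; `my-guess-coding` is a hypothetical helper, and the actual guess depends on the current coding system priorities, so no particular result is claimed for the first call.

```elisp
;; Sketch only: `my-guess-coding' is a hypothetical helper.
(defun my-guess-coding (bytes)
  "Return the most plausible coding system for the byte string BYTES."
  ;; A non-nil second argument asks for the single highest-priority
  ;; candidate instead of a list of candidates.
  (detect-coding-string bytes t))

;; Bytes produced by encoding in UTF-8; the guess depends on priorities.
(my-guess-coding (encode-coding-string "Gr\u00fcss Gott" 'utf-8))

;; A string containing a null byte.  According to the text above, this
;; is reported as `no-conversion' unless null-byte detection is inhibited.
(my-guess-coding "abc\0def")
```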
+
+@cindex charsets supported by a coding system
+@defun coding-system-charset-list coding-system
+This function returns the list of character sets (@pxref{Character
+Sets}) supported by @var{coding-system}.  Some coding systems that
+support too many character sets to list them all yield special values:
+@itemize @bullet
+@item
+If @var{coding-system} supports all the ISO-2022 charsets, the value
+is @code{iso-2022}.
+@item
+If @var{coding-system} supports all Emacs characters, the value is
+@code{(emacs)}.
+@item
+If @var{coding-system} supports all emacs-mule characters, the value
+is @code{emacs-mule}.
+@item
+If @var{coding-system} supports all Unicode characters, the value is
+@code{(unicode)}.
+@end itemize
  @end defun
  
    @xref{Coding systems for a subprocess,, Process Information}, in
@@ -906,6 +1201,10 @@ is the text in the current buffer between @var{from} and @var{to}.  If
  @var{from} is a string, the string specifies the text to encode, and
  @var{to} is ignored.
  
+If the specified text includes raw bytes (@pxref{Text
+Representations}), @code{select-safe-coding-system} suggests
+@code{raw-text} for its encoding.
+
  If @var{default-coding-system} is non-@code{nil}, that is the first
  coding system to try; if that can handle the text,
  @code{select-safe-coding-system} returns that coding system.  It can
@@ -975,6 +1274,8 @@ the user tries to enter null input, it asks the user to try again.
  
  @node Default Coding Systems
  @subsection Default Coding Systems
+@cindex default coding system
+@cindex coding system, automatically determined
  
    This section describes variables that specify the default coding
  system for certain files or when running certain subprograms, and the
@@ -987,7 +1288,8 @@ don't change these variables; instead, override them using
  @code{coding-system-for-read} and @code{coding-system-for-write}
  (@pxref{Specifying Coding Systems}).
  
-@defvar auto-coding-regexp-alist
+@cindex file contents, and default coding system
+@defopt auto-coding-regexp-alist
  This variable is an alist of text patterns and corresponding coding
  systems. Each element has the form @code{(@var{regexp}
  . @var{coding-system})}; a file whose first few kilobytes match
@@ -997,9 +1299,10 @@ read into a buffer.  The settings in this alist take priority over
  @code{file-coding-system-alist} (see below).  The default value is set
  so that Emacs automatically recognizes mail files in Babyl format and
  reads them with no code conversions.
-@end defvar
+@end defopt
  
-@defvar file-coding-system-alist
+@cindex file name, and default coding system
+@defopt file-coding-system-alist
  This variable is an alist that specifies the coding systems to use for
  reading and writing particular files.  Each element has the form
  @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
@@ -1022,8 +1325,16 @@ meaning as described above.
  
  If @var{coding} (or what is returned by the above function) is
  @code{undecided}, the normal code-detection is performed.
-@end defvar
+@end defopt
+
+@defopt auto-coding-alist
+This variable is an alist that specifies the coding systems to use for
+reading and writing particular files.  Its form is like that of
+@code{file-coding-system-alist}, but, unlike the latter, this variable
+takes priority over any @code{coding:} tags in the file.
+@end defopt
  
+@cindex program name, and default coding system
  @defvar process-coding-system-alist
  This variable is an alist specifying which coding systems to use for a
  subprocess, depending on which program is running in the subprocess.  It
@@ -1047,6 +1358,8 @@ coding system which determines both the character code conversion and
  the end of line conversion---that is, one like @code{latin-1-unix},
  rather than @code{undecided} or @code{latin-1}.
  
+@cindex port number, and default coding system
+@cindex network service name, and default coding system
  @defvar network-coding-system-alist
  This variable is an alist that specifies the coding system to use for
  network streams.  It works much like @code{file-coding-system-alist},
@@ -1066,7 +1379,8 @@ The value should be a cons cell of the form @code{(@var{input-coding}
  the subprocess, and @var{output-coding} applies to output to it.
  @end defvar
  
-@defvar auto-coding-functions
+@cindex default coding system, functions to determine
+@defopt auto-coding-functions
  This variable holds a list of functions that try to determine a
  coding system for a file based on its undecoded contents.
  
@@ -1080,7 +1394,40 @@ Otherwise, it should return @code{nil}.
  
  If a file has a @samp{coding:} tag, that takes precedence, so these
  functions won't be called.
-@end defvar
+@end defopt
+
+@defun find-auto-coding filename size
+This function tries to determine a suitable coding system for
+@var{filename}.  It examines the buffer visiting the named file, using
+the variables documented above in sequence, until it finds a match for
+one of the rules specified by these variables.  It then returns a cons
+cell of the form @code{(@var{coding} . @var{source})}, where
+@var{coding} is the coding system to use and @var{source} is a symbol,
+one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist},
+@code{:coding}, or @code{auto-coding-functions}, indicating which one
+supplied the matching rule.  The value @code{:coding} means the coding
+system was specified by the @code{coding:} tag in the file
+(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}).
+The order of looking for a matching rule is @code{auto-coding-alist}
+first, then @code{auto-coding-regexp-alist}, then the @code{coding:}
+tag, and lastly @code{auto-coding-functions}.  If no matching rule was
+found, the function returns @code{nil}.
+
+The second argument @var{size} is the size of text, in characters,
+following point.  The function examines text only within @var{size}
+characters after point.  Normally, the buffer should be positioned at
+the beginning when this function is called, because one of the places
+for the @code{coding:} tag is the first one or two lines of the file;
+in that case, @var{size} should be the size of the buffer.
+@end defun
+
+@defun set-auto-coding filename size
+This function returns a suitable coding system for file
+@var{filename}.  It uses @code{find-auto-coding} to find the coding
+system.  If no coding system could be determined, the function returns
+@code{nil}.  The argument @var{size} has the same meaning as in
+@code{find-auto-coding}.
+@end defun
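The calling convention of `find-auto-coding` (buffer positioned at the beginning, `size` equal to the buffer size) can be sketched as below. The helper `my-coding-from-tag` and the file name `example.txt` are hypothetical; the sketch assumes that no entry in `auto-coding-alist` or `auto-coding-regexp-alist` matches, so that the `coding:` tag is the rule that applies.

```elisp
;; Sketch only: `my-coding-from-tag' and "example.txt" are hypothetical.
(defun my-coding-from-tag (text)
  "Return the (CODING . SOURCE) pair `find-auto-coding' finds for TEXT."
  (with-temp-buffer
    (insert text)
    ;; Point must be at the beginning, because the coding: tag is
    ;; looked for in the first lines of the text following point.
    (goto-char (point-min))
    (find-auto-coding "example.txt" (buffer-size))))

;; The cdr of the result names the rule that matched; for a coding:
;; tag it is the keyword :coding.
(my-coding-from-tag ";; -*- coding: iso-8859-1 -*-\nsome text\n")
```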
  
  @defun find-operation-coding-system operation &rest arguments
  This function returns the coding system to use (by default) for
@@ -1174,12 +1521,39 @@ When a single operation does both input and output, as do
  affect it.
  @end defvar
  
-@defvar inhibit-eol-conversion
+@defopt inhibit-eol-conversion
  When this variable is non-@code{nil}, no end-of-line conversion is done,
  no matter which coding system is specified.  This applies to all the
  Emacs I/O and subprocess primitives, and to the explicit encoding and
  decoding functions (@pxref{Explicit Encoding}).
-@end defvar
+@end defopt
+
+@cindex priority order of coding systems
+@cindex coding systems, priority
+  Sometimes, you need to prefer several coding systems for some
+operation, rather than fix a single one.  Emacs lets you specify a
+priority order for using coding systems.  This ordering affects the
+sorting of lists of coding systems returned by functions such as
+@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
+
+@defun coding-system-priority-list &optional highestp
+This function returns the list of coding systems in the order of their
+current priorities.  Optional argument @var{highestp}, if
+non-@code{nil}, means return only the highest priority coding system.
+@end defun
+
+@defun set-coding-system-priority &rest coding-systems
+This function puts @var{coding-systems} at the beginning of the
+priority list for coding systems, thus making their priority higher
+than all the rest.
+@end defun
+
+@defmac with-coding-priority coding-systems &rest body@dots{}
+This macro executes @var{body}, like @code{progn} does
+(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of
+the priority list for coding systems.  @var{coding-systems} should be
+a list of coding systems to prefer during execution of @var{body}.
+@end defmac
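A temporary change of priorities with `with-coding-priority` might look like the following sketch. The helper name `my-highest-priority-with` is hypothetical; the sketch relies on the macro restoring the previous priority order when the body exits.

```elisp
;; Sketch only: `my-highest-priority-with' is a hypothetical helper.
(defun my-highest-priority-with (coding-systems)
  "Return the highest-priority coding system while CODING-SYSTEMS are preferred."
  (with-coding-priority coding-systems
    ;; A non-nil argument asks for only the highest-priority entry.
    (coding-system-priority-list t)))

;; While the body runs, `utf-8' is at the front of the priority list.
(my-highest-priority-with '(utf-8 iso-latin-1))
```

Unlike `set-coding-system-priority`, which changes the priorities until they are changed again, this form limits the change to the body.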
  
  @node Explicit Encoding
  @subsection Explicit Encoding and Decoding
@@ -1193,10 +1567,12 @@ in this section.
  
    The result of encoding, and the input to decoding, are not ordinary
  text.  They logically consist of a series of byte values; that is, a
-series of characters whose codes are in the range 0 through 255.  In a
-multibyte buffer or string, character codes 128 through 159 are
-represented by multibyte sequences, but this is invisible to Lisp
-programs.
+series of @acronym{ASCII} and eight-bit characters.  In unibyte
+buffers and strings, these characters have codes in the range 0
+through 255.  In a multibyte buffer or string, eight-bit characters
+have character codes higher than 255 (@pxref{Text Representations}),
+but Emacs transparently converts them to their single-byte values when
+you encode or decode such text.
  
    The usual way to read a file into a buffer as a sequence of bytes, so
  you can decode the contents explicitly, is with
@@ -1214,19 +1590,35 @@ encoding by binding @code{coding-system-for-write} to
    Here are the functions to perform explicit encoding or decoding.  The
  encoding functions produce sequences of bytes; the decoding functions
  are meant to operate on sequences of bytes.  All of these functions
-discard text properties.
+discard text properties.  They also set @code{last-coding-system-used}
+to the precise coding system they used.
  
-@deffn Command encode-coding-region start end coding-system
+@deffn Command encode-coding-region start end coding-system &optional destination
  This command encodes the text from @var{start} to @var{end} according
-to coding system @var{coding-system}.  The encoded text replaces the
-original text in the buffer.  The result of encoding is logically a
-sequence of bytes, but the buffer remains multibyte if it was multibyte
-before.
-
-This command returns the length of the encoded text.
+to coding system @var{coding-system}.  Normally, the encoded text
+replaces the original text in the buffer, but the optional argument
+@var{destination} can change that.  If @var{destination} is a buffer,
+the encoded text is inserted in that buffer after point (point does
+not move); if it is @code{t}, the command returns the encoded text as
+a unibyte string without inserting it.
+
+If encoded text is inserted in some buffer, this command returns the
+length of the encoded text.
+
+The result of encoding is logically a sequence of bytes, but the
+buffer remains multibyte if it was multibyte before, and any 8-bit
+bytes are converted to their multibyte representation (@pxref{Text
+Representations}).
+
+@cindex @code{undecided} coding-system, when encoding
+Do @emph{not} use @code{undecided} for @var{coding-system} when
+encoding text, since that may lead to unexpected results.  Instead,
+use @code{select-safe-coding-system} (@pxref{User-Chosen Coding
+Systems, select-safe-coding-system}) to suggest a suitable encoding,
+if there's no obvious pertinent value for @var{coding-system}.
  @end deffn
  
-@defun encode-coding-string string coding-system &optional nocopy
+@defun encode-coding-string string coding-system &optional nocopy buffer
  This function encodes the text in @var{string} according to coding
  system @var{coding-system}.  It returns a new string containing the
  encoded text, except when @var{nocopy} is non-@code{nil}, in which
@@ -1234,24 +1626,52 @@ case the function may return @var{string} itself if the encoding
  operation is trivial.  The result of encoding is a unibyte string.
  @end defun
  
-@deffn Command decode-coding-region start end coding-system
+@deffn Command decode-coding-region start end coding-system &optional destination
  This command decodes the text from @var{start} to @var{end} according
-to coding system @var{coding-system}.  The decoded text replaces the
-original text in the buffer.  To make explicit decoding useful, the text
-before decoding ought to be a sequence of byte values, but both
-multibyte and unibyte buffers are acceptable.
-
-This command returns the length of the decoded text.
+to coding system @var{coding-system}.  To make explicit decoding
+useful, the text before decoding ought to be a sequence of byte
+values, but both multibyte and unibyte buffers are acceptable (in the
+multibyte case, the raw byte values should be represented as eight-bit
+characters).  Normally, the decoded text replaces the original text in
+the buffer, but the optional argument @var{destination} can change
+that.  If @var{destination} is a buffer, the decoded text is inserted
+in that buffer after point (point does not move); if it is @code{t},
+the command returns the decoded text as a multibyte string without
+inserting it.
+
+If decoded text is inserted in some buffer, this command returns the
+length of the decoded text.
+
+This command puts a @code{charset} text property on the decoded text.
+The value of the property states the character set used to decode the
+original text.
  @end deffn
  
-@defun decode-coding-string string coding-system &optional nocopy
-This function decodes the text in @var{string} according to coding
-system @var{coding-system}.  It returns a new string containing the
-decoded text, except when @var{nocopy} is non-@code{nil}, in which
-case the function may return @var{string} itself if the decoding
-operation is trivial.  To make explicit decoding useful, the contents
-of @var{string} ought to be a sequence of byte values, but a multibyte
-string is acceptable.
+@defun decode-coding-string string coding-system &optional nocopy buffer
+This function decodes the text in @var{string} according to
+@var{coding-system}.  It returns a new string containing the decoded
+text, except when @var{nocopy} is non-@code{nil}, in which case the
+function may return @var{string} itself if the decoding operation is
+trivial.  To make explicit decoding useful, the contents of
+@var{string} ought to be a unibyte string with a sequence of byte
+values, but a multibyte string is also acceptable (assuming it
+contains 8-bit bytes in their multibyte form).
+
+If optional argument @var{buffer} specifies a buffer, the decoded text
+is inserted in that buffer after point (point does not move).  In this
+case, the return value is the length of the decoded text.
+
+@cindex @code{charset}, text property
+This function puts a @code{charset} text property on the decoded text.
+The value of the property states the character set used to decode the
+original text:
+
+@example
+@group
+(decode-coding-string "Gr\374ss Gott" 'latin-1)
+     @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1))
+@end group
+@end example
  @end defun
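
The string and region functions above can be exercised with a short round trip. This is a sketch under simple assumptions: the helper names `my-utf-8-round-trip` and `my-encode-as-latin-1` are hypothetical, and the sample text is the one used in the `decode-coding-string` example.

```elisp
;; Sketch only: both helpers are hypothetical.
(defun my-utf-8-round-trip (text)
  "Encode TEXT as UTF-8 and decode it again.
Return a list (MULTIBYTE-P BYTE-COUNT SAME-TEXT-P) describing the result."
  (let* ((bytes (encode-coding-string text 'utf-8))
         (back (decode-coding-string bytes 'utf-8)))
    ;; `string=' ignores the `charset' text property added by decoding.
    (list (multibyte-string-p bytes) (length bytes) (string= text back))))

(my-utf-8-round-trip "Gr\u00fcss Gott")   ; => (nil 11 t)

(defun my-encode-as-latin-1 (text)
  "Return TEXT encoded in Latin-1 as a unibyte string."
  (with-temp-buffer
    (insert text)
    ;; With DESTINATION t the encoded text is returned as a string
    ;; and the buffer is left unchanged.
    (encode-coding-region (point-min) (point-max) 'iso-latin-1 t)))

(my-encode-as-latin-1 "Gr\u00fcss Gott")  ; => "Gr\374ss Gott"
```

The encoded string is unibyte and one byte longer than the ten-character text, because the u-umlaut takes two bytes in UTF-8.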
  
  @defun decode-coding-inserted-region from to filename &optional visit beg end replace
@@ -1269,31 +1689,42 @@ decoding, you can call this function.
  @subsection Terminal I/O Encoding
  
    Emacs can decode keyboard input using a coding system, and encode
-terminal output.  This is useful for terminals that transmit or display
-text using a particular encoding such as Latin-1.  Emacs does not set
-@code{last-coding-system-used} for encoding or decoding for the
-terminal.
+terminal output.  This is useful for terminals that transmit or
+display text using a particular encoding such as Latin-1.  Emacs does
+not set @code{last-coding-system-used} for encoding or decoding of
+terminal I/O.
  
-@defun keyboard-coding-system
+@defun keyboard-coding-system &optional terminal
  This function returns the coding system that is in use for decoding
-keyboard input---or @code{nil} if no coding system is to be used.
+keyboard input from @var{terminal}---or @code{nil} if no coding system
+is to be used for that terminal.  If @var{terminal} is omitted or
+@code{nil}, it means the selected frame's terminal.  @xref{Multiple
+Terminals}.
  @end defun
  
-@deffn Command set-keyboard-coding-system coding-system
-This command specifies @var{coding-system} as the coding system to
-use for decoding keyboard input.  If @var{coding-system} is @code{nil},
-that means do not decode keyboard input.
+@deffn Command set-keyboard-coding-system coding-system &optional terminal
+This command specifies @var{coding-system} as the coding system to use
+for decoding keyboard input from @var{terminal}.  If
+@var{coding-system} is @code{nil}, that means do not decode keyboard
+input.  If @var{terminal} is a frame, it means that frame's terminal;
+if it is @code{nil}, that means the currently selected frame's
+terminal.  @xref{Multiple Terminals}.
  @end deffn
  
-@defun terminal-coding-system
+@defun terminal-coding-system &optional terminal
  This function returns the coding system that is in use for encoding
-terminal output---or @code{nil} for no encoding.
+terminal output from @var{terminal}---or @code{nil} if the output is
+not encoded.  If @var{terminal} is a frame, it means that frame's
+terminal; if it is @code{nil}, that means the currently selected
+frame's terminal.
  @end defun
  
-@deffn Command set-terminal-coding-system coding-system
+@deffn Command set-terminal-coding-system coding-system &optional terminal
  This command specifies @var{coding-system} as the coding system to use
-for encoding terminal output.  If @var{coding-system} is @code{nil},
-that means do not encode terminal output.
+for encoding terminal output from @var{terminal}.  If
+@var{coding-system} is @code{nil}, terminal output is not encoded.  If
+@var{terminal} is a frame, it means that frame's terminal; if it is
+@code{nil}, that means the currently selected frame's terminal.
  @end deffn
  
  @node MS-DOS File Types