(Text Representations, Converting Representations, Character Sets,

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi

index 16f70f57b9d7250e8fb2737b4156c6c84ddb446f..eab748bab8d1063415ae5a1c9f0a3a7712b31744 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -1,7 +1,7 @@
  @c -*-texinfo-*-
  @c This is part of the GNU Emacs Lisp Reference Manual.
  @c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c   2005, 2006, 2007  Free Software Foundation, Inc.
+@c   2005, 2006, 2007, 2008  Free Software Foundation, Inc.
  @c See the file elisp.texi for copying conditions.
  @setfilename ../../info/characters
  @node Non-ASCII Characters, Searching and Matching, Text, Top
@@ -10,19 +10,17 @@
  @cindex characters, multi-byte
  @cindex non-@acronym{ASCII} characters
  
-  This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+  This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
  
  @menu
-* Text Representations::    Unibyte and multibyte representations
+* Text Representations::    How Emacs represents text.
  * Converting Representations::  Converting unibyte to multibyte and vice versa.
  * Selecting a Representation::  Treating a byte sequence as unibyte or multi.
  * Character Codes::         How unibyte and multibyte relate to
                                  codes of individual characters.
  * Character Sets::          The space of possible character codes
                                  is divided into various character sets.
-* Chars and Bytes::         More information about multibyte encodings.
-* Splitting Characters::    Converting a character to its byte sequence.
  * Scanning Charsets::       Which character sets are used in a buffer?
  * Translation of Characters::   Translation tables are used for conversion.
  * Coding Systems::          Coding systems are conversions for saving files.
@@ -33,41 +31,64 @@ characters and how they are stored in strings and buffers.
  
  @node Text Representations
  @section Text Representations
-@cindex text representations
-
-  Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer.  These are called @dfn{unibyte} and
-@dfn{multibyte}.  Each string, and each buffer, uses one of these two
-representations.  For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate.  Occasionally in Lisp programming you will need to pay
-attention to the difference.
+@cindex text representation
+
+  Emacs buffers and strings support a large repertoire of characters
+from many different scripts, so that users can type and display text
+in almost any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+  To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}.  The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive.  Emacs
+extends this range with codepoints in the range @code{110000..3FFFFF},
+which it uses for representing characters that are not unified with
+Unicode and raw 8-bit bytes that cannot be interpreted as characters
+(the latter occupy the range @code{3FFF80..3FFFFF}).  Thus, a
+character codepoint in Emacs is a 22-bit integer number.
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
+that are codepoints of text characters within buffers and strings.
+Rather, Emacs uses a variable-length internal representation of
+characters that stores each character as a sequence of 1 to 5 8-bit
+bytes, depending on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}.
+For example, any @acronym{ASCII} character takes up only 1 byte, a
+Latin-1 character takes up 2 bytes, etc.  We call this representation
+of text @dfn{multibyte}, because a single character may occupy
+several bytes.
+
+  Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc.  Emacs converts
+between these external encodings and the internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+  Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffers or strings.  For example, when
+Emacs visits a file, it first reads the file's text verbatim into a
+buffer, and only then converts it to the internal representation.
+Before the conversion, the buffer holds encoded text.
  
  @cindex unibyte text
-  In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255.  Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
-  In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored.  The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237).  These values are called
-@dfn{leading codes}.  The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-
-  Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed.  But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences.  All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+  Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes.  We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes.  In particular,
+Emacs usually displays unibyte buffers and strings as octal codes such
+as @code{\237}.  We recommend that you never use unibyte buffers and
+strings except for manipulating encoded text or binary non-text data.
  
    In a buffer, the buffer-local value of the variable
  @code{enable-multibyte-characters} specifies the representation used.
@@ -77,7 +98,7 @@ when the string is constructed.
  @defvar enable-multibyte-characters
  This variable specifies the current buffer's text representation.
  If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
  
  You cannot set this variable directly; instead, use the function
  @code{set-buffer-multibyte} to change a buffer's representation.
@@ -96,20 +117,28 @@ default value to @code{nil} early in startup.
  @end defvar
  
  @defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units.  This function
+returns the byte-position corresponding to buffer position
  @var{position} in the current buffer.  This is 1 at the start of the
  buffer, and counts upward in bytes.  If @var{position} is out of
  range, the value is @code{nil}.
  @end defun
  
  @defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
+Return the buffer position, in character units, corresponding to given
  @var{byte-position} in the current buffer.  If @var{byte-position} is
-out of range, the value is @code{nil}.
+out of range, the value is @code{nil}.  In a multibyte buffer, an
+arbitrary value of @var{byte-position} may fall not on a character
+boundary but inside the multibyte sequence representing a single
+character; in this case, this function returns the buffer position of
+the character whose multibyte sequence includes @var{byte-position}.
+In other words, the value does not change for all byte positions that
+belong to the same character.
  @end defun
  
  @defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
  @end defun
  
  @defun string-bytes string
@@ -119,14 +148,20 @@ If @var{string} is a multibyte string, this can be greater than
  @code{(length @var{string})}.
  @end defun
  
+@defun unibyte-string &rest bytes
+This function concatenates all of its argument @var{bytes} and returns
+the result as a unibyte string.
+@end defun
+
  @node Converting Representations
  @section Converting Text Representations
  
    Emacs can convert unibyte text to multibyte; it can also convert
-multibyte text to unibyte, though this conversion loses information.  In
-general these conversions happen when inserting text into a buffer, or
-when putting text from several strings together in one string.  You can
-also explicitly convert a string's contents to either representation.
+multibyte text to unibyte, provided that the multibyte text contains
+only @acronym{ASCII} and 8-bit raw bytes.  In general, these
+conversions happen when inserting text into a buffer, or when putting
+text from several strings together in one string.  You can also
+explicitly convert a string's contents to either representation.
  
    Emacs chooses the representation for a string based on the text that
  it is constructed from.  The general rule is to convert unibyte text to
@@ -145,89 +180,47 @@ acceptable because the buffer's representation is a choice made by the
  user that cannot be overridden automatically.
  
    Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and likewise character codes 128 through 159.  It converts
-the non-@acronym{ASCII} codes 160 through 255 by adding the value
-@code{nonascii-insert-offset} to each character code.  By setting this
-variable, you specify which character set the unibyte characters
-correspond to (@pxref{Character Sets}).  For example, if
-@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
-'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
-correspond to Latin 1.  If it is 2688, which is @code{(- (make-char
-'greek-iso8859-7) 128)}, then they correspond to Greek letters.
-
-  Converting multibyte text to unibyte is simpler: it discards all but
-the low 8 bits of each character code.  If @code{nonascii-insert-offset}
-has a reasonable value, corresponding to the beginning of some character
-set, this conversion is the inverse of the other: converting unibyte
-text to multibyte and back to unibyte reproduces the original unibyte
-text.
+unchanged, and converts bytes with codes 128 through 255 to the
+multibyte representation of raw eight-bit bytes.
  
-@defvar nonascii-insert-offset
-This variable specifies the amount to add to a non-@acronym{ASCII} character
-when converting unibyte text to multibyte.  It also applies when
-@code{self-insert-command} inserts a character in the unibyte
-non-@acronym{ASCII} range, 128 through 255.  However, the functions
-@code{insert} and @code{insert-char} do not perform this conversion.
-
-The right value to use to select character set @var{cs} is @code{(-
-(make-char @var{cs}) 128)}.  If the value of
-@code{nonascii-insert-offset} is zero, then conversion actually uses the
-value for the Latin 1 character set, rather than zero.
-@end defvar
+  Converting multibyte text to unibyte converts all @acronym{ASCII}
+and eight-bit characters to their single-byte form, but loses
+information for non-@acronym{ASCII} characters by discarding all but
+the low 8 bits of each character's codepoint.  Converting unibyte text
+to multibyte and back to unibyte reproduces the original unibyte text.
  
-@defvar nonascii-translation-table
-This variable provides a more general alternative to
-@code{nonascii-insert-offset}.  You can use it to specify independently
-how to translate each code in the range of 128 through 255 into a
-multibyte character.  The value should be a char-table, or @code{nil}.
-If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
-@end defvar
-
-The next three functions either return the argument @var{string}, or a
+The next two functions either return the argument @var{string}, or a
  newly created string with no text properties.
  
-@defun string-make-unibyte string
-This function converts the text of @var{string} to unibyte
-representation, if it isn't already, and returns the result.  If
-@var{string} is a unibyte string, it is returned unchanged.  Multibyte
-character codes are converted to unibyte according to
-@code{nonascii-translation-table} or, if that is @code{nil}, using
-@code{nonascii-insert-offset}.  If the lookup in the translation table
-fails, this function takes just the low 8 bits of each character.
-@end defun
-
-@defun string-make-multibyte string
-This function converts the text of @var{string} to multibyte
-representation, if it isn't already, and returns the result.  If
-@var{string} is a multibyte string or consists entirely of
-@acronym{ASCII} characters, it is returned unchanged.  In particular,
-if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
-string is unibyte.  (When the characters are all @acronym{ASCII},
-Emacs primitives will treat the string the same way whether it is
-unibyte or multibyte.)  If @var{string} is unibyte and contains
-non-@acronym{ASCII} characters, the function
-@code{unibyte-char-to-multibyte} is used to convert each unibyte
-character to a multibyte character.
-@end defun
-
  @defun string-to-multibyte string
  This function returns a multibyte string containing the same sequence
-of character codes as @var{string}.  Unlike
-@code{string-make-multibyte}, this function unconditionally returns a
-multibyte string.  If @var{string} is a multibyte string, it is
-returned unchanged.
+of characters as @var{string}.  If @var{string} is a multibyte string,
+it is returned unchanged.  The function assumes that @var{string}
+includes only @acronym{ASCII} characters and raw 8-bit bytes; the
+latter are converted to their multibyte representation corresponding
+to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
+Representations, codepoints}).
+@end defun
+
+@defun string-to-unibyte string
+This function returns a unibyte string containing the same sequence of
+characters as @var{string}.  It signals an error if @var{string}
+contains a character that is neither @acronym{ASCII} nor an eight-bit
+raw byte.  If @var{string} is a unibyte string, it is returned
+unchanged.  Use this function for @var{string} arguments that contain
+only @acronym{ASCII} and eight-bit characters.
  @end defun
  
  @defun multibyte-char-to-unibyte char
  This convert the multibyte character @var{char} to a unibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+character.  If @var{char} is a character that is neither
+@acronym{ASCII} nor eight-bit, the value is -1.
  @end defun
  
  @defun unibyte-char-to-multibyte char
  This convert the unibyte character @var{char} to a multibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+character, assuming @var{char} is either an @acronym{ASCII} character
+or a raw 8-bit byte.
  @end defun
  
  @node Selecting a Representation
@@ -242,13 +235,13 @@ is non-@code{nil}, the buffer becomes multibyte.  If @var{multibyte}
  is @code{nil}, the buffer becomes unibyte.
  
  This function leaves the buffer contents unchanged when viewed as a
-sequence of bytes.  As a consequence, it can change the contents viewed
-as characters; a sequence of two bytes which is treated as one character
-in multibyte representation will count as two characters in unibyte
-representation.  Character codes 128 through 159 are an exception.  They
-are represented by one byte in a unibyte buffer, but when the buffer is
-set to multibyte, they are converted to two-byte sequences, and vice
-versa.
+sequence of bytes.  As a consequence, it can change the contents
+viewed as characters; a sequence of three bytes which is treated as
+one character in multibyte representation will count as three
+characters in unibyte representation.  Eight-bit characters
+representing raw bytes are an exception.  They are represented by one
+byte in a unibyte buffer, but when the buffer is set to multibyte,
+they are converted to two-byte sequences, and vice versa.
  
  This function sets @code{enable-multibyte-characters} to record which
  representation is in use.  It also adjusts various data in the buffer
@@ -263,81 +256,96 @@ base buffer.
  @defun string-as-unibyte string
  This function returns a string with the same bytes as @var{string} but
  treating each byte as a character.  This means that the value may have
-more characters than @var{string} has.
+more characters than @var{string} has.  Eight-bit characters
+representing raw bytes are an exception: each one of them is converted
+to a single byte.
  
  If @var{string} is already a unibyte string, then the value is
  @var{string} itself.  Otherwise it is a newly created string, with no
-text properties.  If @var{string} is multibyte, any characters it
-contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
-are converted to the corresponding single byte.
+text properties.
  @end defun
  
  @defun string-as-multibyte string
  This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character.  This means that the
-value may have fewer characters than @var{string} has.
+treating each multibyte sequence as one character.  This means that
+the value may have fewer characters than @var{string} has.  If a byte
+sequence in @var{string} is invalid as a multibyte representation of a
+single character, each byte in the sequence is treated as a raw 8-bit
+byte.
  
  If @var{string} is already a multibyte string, then the value is
  @var{string} itself.  Otherwise it is a newly created string, with no
-text properties.  If @var{string} is unibyte and contains any individual
-8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
-the corresponding multibyte character of charset @code{eight-bit-control}
-or @code{eight-bit-graphic}.
+text properties.
  @end defun
  
  @node Character Codes
  @section Character Codes
  @cindex character codes
  
-  The unibyte and multibyte text representations use different character
-codes.  The valid character codes for unibyte representation range from
-0 to 255---the values that can fit in one byte.  The valid character
-codes for multibyte representation range from 0 to 524287, but not all
-values in that range are valid.  The values 128 through 255 are not
-entirely proper in multibyte text, but they can occur if you do explicit
-encoding and decoding (@pxref{Explicit Encoding}).  Some other character
-codes cannot occur at all in multibyte text.  Only the @acronym{ASCII} codes
-0 through 127 are completely legitimate in both representations.
-
-@defun char-valid-p charcode &optional genericp
-This returns @code{t} if @var{charcode} is valid (either for unibyte
-text or for multibyte text).
+  The unibyte and multibyte text representations use different
+character codes.  The valid character codes for unibyte representation
+range from 0 to 255---the values that can fit in one byte.  The valid
+character codes for multibyte representation range from 0 to 4194303
+(#x3FFFFF).  In this code space, values 0 through 127 are for
+@acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F)
+are for non-@acronym{ASCII} characters.  Values 0 through 1114111
+(#x10FFFF) correspond to Unicode characters of the same codepoint,
+while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
+representing eight-bit raw bytes.
+
+@defun characterp charcode
+This returns @code{t} if @var{charcode} is a valid character, and
+@code{nil} otherwise.
  
  @example
-(char-valid-p 65)
+(characterp 65)
       @result{} t
-(char-valid-p 256)
-     @result{} nil
-(char-valid-p 2248)
+(characterp 4194303)
       @result{} t
+(characterp 4194304)
+     @result{} nil
  @end example
+@end defun
  
-If the optional argument @var{genericp} is non-@code{nil}, this
-function also returns @code{t} if @var{charcode} is a generic
-character (@pxref{Splitting Characters}).
+@defun get-byte pos &optional string
+This function returns the byte at character position @var{pos} in the
+current buffer.  If the current buffer is unibyte, this is literally the
+byte at that position.  If the buffer is multibyte, byte values of
+@acronym{ASCII} characters are the same as character codepoints,
+whereas eight-bit raw bytes are converted to their 8-bit codes.  The
+function signals an error if the character at @var{pos} is
+non-@acronym{ASCII}.
+
+The optional argument @var{string} means to get a byte value from that
+string instead of the current buffer.
  @end defun
  
  @node Character Sets
  @section Character Sets
  @cindex character sets
  
-  Emacs classifies characters into various @dfn{character sets}, each of
-which has a name which is a symbol.  Each character belongs to one and
-only one character set.
-
-  In general, there is one character set for each distinct script.  For
-example, @code{latin-iso8859-1} is one character set,
-@code{greek-iso8859-7} is another, and @code{ascii} is another.  An
-Emacs character set can hold at most 9025 characters; therefore, in some
-cases, characters that would logically be grouped together are split
-into several character sets.  For example, one set of Chinese
-characters, generally known as Big 5, is divided into two Emacs
-character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
-
-  @acronym{ASCII} characters are in character set @code{ascii}.  The
-non-@acronym{ASCII} characters 128 through 159 are in character set
-@code{eight-bit-control}, and codes 160 through 255 are in character set
-@code{eight-bit-graphic}.
+@cindex charset
+@cindex coded character set
+An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
+in which each character is assigned a numeric code point.  (The
+Unicode standard calls this a @dfn{coded character set}.)  Each Emacs
+charset has a name which is a symbol.  A single character can belong
+to any number of different character sets, but it will generally have
+a different code point in each charset.  Examples of character sets
+include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
+@code{windows-1255}.  The code point assigned to a character in a
+charset is usually different from its code point used in Emacs buffers
+and strings.
+
+@cindex @code{emacs}, a charset
+@cindex @code{unicode}, a charset
+@cindex @code{eight-bit}, a charset
+  Emacs defines several special character sets.  The character set
+@code{unicode} includes all the characters whose Emacs code points are
+in the range @code{0..10FFFF}.  The character set @code{emacs}
+includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
+Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
+Emacs uses it to represent raw bytes encountered in text.
  
  @defun charsetp object
  Returns @code{t} if @var{object} is a symbol that names a character set,
@@ -348,155 +356,93 @@ Returns @code{t} if @var{object} is a symbol that names a character set,
  The value is a list of all defined character set names.
  @end defvar
  
-@defun charset-list
-This function returns the value of @code{charset-list}.  It is only
-provided for backward compatibility.
+@defun charset-priority-list &optional highestp
+This function returns a list of all defined character sets ordered by
+their priority.  If @var{highestp} is non-@code{nil}, the function
+returns a single character set of the highest priority.
+@end defun
+
+@defun set-charset-priority &rest charsets
+This function makes @var{charsets} the highest priority character sets.
  @end defun
  
  @defun char-charset character
-This function returns the name of the character set that @var{character}
-belongs to, or the symbol @code{unknown} if @var{character} is not a
-valid character.
+This function returns the name of the character set of highest
+priority that @var{character} belongs to.  @acronym{ASCII} characters
+are an exception: for them, this function always returns @code{ascii}.
  @end defun
  
  @defun charset-plist charset
-This function returns the charset property list of the character set
-@var{charset}.  Although @var{charset} is a symbol, this is not the same
-as the property list of that symbol.  Charset properties are used for
-special purposes within Emacs.
+This function returns the property list of the character set
+@var{charset}.  Although @var{charset} is a symbol, this is not the
+same as the property list of that symbol.  Charset properties include
+important information about the charset, such as its documentation
+string, short name, etc.
  @end defun
  
-@deffn Command list-charset-chars charset
-This command displays a list of characters in the character set
-@var{charset}.
-@end deffn
-
-@node Chars and Bytes
-@section Characters and Bytes
-@cindex bytes and characters
-
-@cindex introduction sequence (of character)
-@cindex dimension (of character set)
-  In multibyte representation, each character occupies one or more
-bytes.  Each character set has an @dfn{introduction sequence}, which is
-normally one or two bytes long.  (Exception: the @code{ascii} character
-set and the @code{eight-bit-graphic} character set have a zero-length
-introduction sequence.)  The introduction sequence is the beginning of
-the byte sequence for any character in the character set.  The rest of
-the character's bytes distinguish it from the other characters in the
-same character set.  Depending on the character set, there are either
-one or two distinguishing bytes; the number of such bytes is called the
-@dfn{dimension} of the character set.
-
-@defun charset-dimension charset
-This function returns the dimension of @var{charset}; at present, the
-dimension is always 1 or 2.
+@defun put-charset-property charset propname value
+This function sets the @var{propname} property of @var{charset} to the
+given @var{value}.
  @end defun
  
-@defun charset-bytes charset
-This function returns the number of bytes used to represent a character
-in character set @var{charset}.
+@defun get-charset-property charset propname
+This function returns the value of @var{charset}'s property
+@var{propname}.
  @end defun
  
-  This is the simplest way to determine the byte length of a character
-set's introduction sequence:
-
-@example
-(- (charset-bytes @var{charset})
-   (charset-dimension @var{charset}))
-@end example
-
-@node Splitting Characters
-@section Splitting Characters
-@cindex character as bytes
-
-  The functions in this section convert between characters and the byte
-values used to represent them.  For most purposes, there is no need to
-be concerned with the sequence of bytes used to represent a character,
-because Emacs translates automatically when necessary.
-
-@defun split-char character
-Return a list containing the name of the character set of
-@var{character}, followed by one or two byte values (integers) which
-identify @var{character} within that character set.  The number of byte
-values is the character set's dimension.
-
-If @var{character} is invalid as a character code, @code{split-char}
-returns a list consisting of the symbol @code{unknown} and @var{character}.
+@deffn Command list-charset-chars charset
+This command displays a list of characters in the character set
+@var{charset}.
+@end deffn
  
-@example
-(split-char 2248)
-     @result{} (latin-iso8859-1 72)
-(split-char 65)
-     @result{} (ascii 65)
-(split-char 128)
-     @result{} (eight-bit-control 128)
-@end example
+  Emacs can convert between its internal representation of a character
+and the character's codepoint in a specific charset.  The following
+two functions support these conversions.
+
+@c FIXME: decode-char and encode-char accept and ignore an additional
+@c argument @var{restriction}.  When that argument actually makes a
+@c difference, it should be documented here.
+@defun decode-char charset code-point
+This function decodes a character that is assigned a @var{code-point}
+in @var{charset}, to the corresponding Emacs character, and returns
+it.  If @var{charset} doesn't contain a character of that code point,
+the value is @code{nil}.  If @var{code-point} doesn't fit in a Lisp
+integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
+specified as a cons cell @code{(@var{high} . @var{low})}, where
+@var{low} is the lower 16 bits of the value and @var{high} is the
+high 16 bits.
  @end defun
  
-@cindex generate characters in charsets
-@defun make-char charset &optional code1 code2
-This function returns the character in character set @var{charset} whose
-position codes are @var{code1} and @var{code2}.  This is roughly the
-inverse of @code{split-char}.  Normally, you should specify either one
-or both of @var{code1} and @var{code2} according to the dimension of
-@var{charset}.  For example,
-
-@example
-(make-char 'latin-iso8859-1 72)
-     @result{} 2248
-@end example
-
-Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
-before they are used to index @var{charset}.  Thus you may use, for
-instance, an ISO 8859 character code rather than subtracting 128, as
-is necessary to index the corresponding Emacs charset.
+@defun encode-char char charset
+This function returns the code point assigned to the character
+@var{char} in @var{charset}.  If the result does not fit in a Lisp
+integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
+that fits the second argument of @code{decode-char} above.  If
+@var{charset} doesn't have a codepoint for @var{char}, the value is
+@code{nil}.
  @end defun
  
-@cindex generic characters
-  If you call @code{make-char} with no @var{byte-values}, the result is
-a @dfn{generic character} which stands for @var{charset}.  A generic
-character is an integer, but it is @emph{not} valid for insertion in the
-buffer as a character.  It can be used in @code{char-table-range} to
-refer to the whole character set (@pxref{Char-Tables}).
-@code{char-valid-p} returns @code{nil} for generic characters.
-For example:
-
-@example
-(make-char 'latin-iso8859-1)
-     @result{} 2176
-(char-valid-p 2176)
-     @result{} nil
-(char-valid-p 2176 t)
-     @result{} t
-(split-char 2176)
-     @result{} (latin-iso8859-1 0)
-@end example
-
-The character sets @code{ascii}, @code{eight-bit-control}, and
-@code{eight-bit-graphic} don't have corresponding generic characters.  If
-@var{charset} is one of them and you don't supply @var{code1},
-@code{make-char} returns the character code corresponding to the
-smallest code in @var{charset}.
-
  @node Scanning Charsets
  @section Scanning for Character Sets
  
-  Sometimes it is useful to find out which character sets appear in a
-part of a buffer or a string.  One use for this is in determining which
-coding systems (@pxref{Coding Systems}) are capable of representing all
-of the text in question.
+  Sometimes it is useful to find out which character sets the
+characters in a certain part of a buffer or a string belong to.  One
+use for this is in determining which coding systems (@pxref{Coding
+Systems}) are capable of representing all of the text in question;
+another is to determine the font(s) for displaying that
+text.
  
  @defun charset-after &optional pos
-This function return the charset of a character in the current buffer
-at position @var{pos}.  If @var{pos} is omitted or @code{nil}, it
-defaults to the current value of point.  If @var{pos} is out of range,
-the value is @code{nil}.
+This function returns the charset of highest priority containing the
+character in the current buffer at position @var{pos}.  If @var{pos}
+is omitted or @code{nil}, it defaults to the current value of point.
+If @var{pos} is out of range, the value is @code{nil}.
  @end defun
  
  @defun find-charset-region beg end &optional translation
-This function returns a list of the character sets that appear in the
-current buffer between positions @var{beg} and @var{end}.
+This function returns a list of the character sets of highest priority
+that contain characters in the current buffer between positions
+@var{beg} and @var{end}.
  
  The optional argument @var{translation} specifies a translation table to
  be used in scanning the text (@pxref{Translation of Characters}).  If it
@@ -506,10 +452,10 @@ characters instead of the characters actually in the buffer.
  @end defun
  
  @defun find-charset-string string &optional translation
-This function returns a list of the character sets that appear in the
-string @var{string}.  It is just like @code{find-charset-region}, except
-that it applies to the contents of @var{string} instead of part of the
-current buffer.
+This function returns a list of the character sets of highest priority
+that contain characters in @var{string}.  It is just like
+@code{find-charset-region}, except that it applies to the contents of
+@var{string} instead of part of the current buffer.
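+
+For instance, a string containing only @acronym{ASCII} characters
+yields just the @code{ascii} charset; for other text the result
+depends on the current charset priorities:
+
+@example
+(find-charset-string "abc")
+     @result{} (ascii)
+@end example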
  @end defun
  
  @node Translation of Characters
@@ -517,19 +463,18 @@ current buffer.
  @cindex character translation tables
  @cindex translation tables
  
-  A @dfn{translation table} is a char-table that specifies a mapping
-of characters into characters.  These tables are used in encoding and
-decoding, and for other purposes.  Some coding systems specify their
-own particular translation tables; there are also default translation
-tables which apply to all other coding systems.
+  A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
+specifies a mapping of characters into characters.  These tables are
+used in encoding and decoding, and for other purposes.  Some coding
+systems specify their own particular translation tables; there are
+also default translation tables which apply to all other coding
+systems.
  
-  For instance, the coding-system @code{utf-8} has a translation table
-that maps characters of various charsets (e.g.,
-@code{latin-iso8859-@var{x}}) into Unicode character sets.  This way,
-it can encode Latin-2 characters into UTF-8.  Meanwhile,
-@code{unify-8859-on-decoding-mode} operates by specifying
-@code{standard-translation-table-for-decode} to translate
-Latin-@var{x} characters into corresponding Unicode characters.
+  A translation table has two extra slots.  The first is either
+@code{nil} or a translation table that performs the reverse
+translation; the second is the maximum number of characters to look up
+for translating sequences of characters (see the description of
+@code{make-translation-table-from-alist} below).
  
  @defun make-translation-table &rest translations
  This function returns a translation table based on the argument
@@ -541,58 +486,68 @@ The arguments and the forms in each argument are processed in order,
  and if a previous form already translates @var{to} to some other
  character, say @var{to-alt}, @var{from} is also translated to
  @var{to-alt}.
-
-You can also map one whole character set into another character set with
-the same dimension.  To do this, you specify a generic character (which
-designates a character set) for @var{from} (@pxref{Splitting Characters}).
-In this case, if @var{to} is also a generic character, its character
-set should have the same dimension as @var{from}'s.  Then the
-translation table translates each character of @var{from}'s character
-set into the corresponding character of @var{to}'s character set.  If
-@var{from} is a generic character and @var{to} is an ordinary
-character, then the translation table translates every character of
-@var{from}'s character set into @var{to}.
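+
+For illustration, a minimal sketch (the variable name @code{tbl} is
+hypothetical).  Since a translation table is a char-table, a
+character's translation can be read back with @code{aref}; the second
+form shows an earlier translation of @var{to} being carried over to a
+later @var{from}:
+
+@example
+(let ((tbl (make-translation-table '((?a . ?b)))))
+  (aref tbl ?a))
+     @result{} 98    ; the character code of ?b
+
+(let ((tbl (make-translation-table '((?b . ?c)) '((?a . ?b)))))
+  (aref tbl ?a))
+     @result{} 99    ; ?c, because ?b already translates to ?c
+@end example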
  @end defun
  
-  In decoding, the translation table's translations are applied to the
-characters that result from ordinary decoding.  If a coding system has
-property @code{translation-table-for-decode}, that specifies the
-translation table to use.  (This is a property of the coding system,
-as returned by @code{coding-system-get}, not a property of the symbol
-that is the coding system's name. @xref{Coding System Basics,, Basic
-Concepts of Coding Systems}.)  Otherwise, if
-@code{standard-translation-table-for-decode} is non-@code{nil},
-decoding uses that table.
-
-  In encoding, the translation table's translations are applied to the
-characters in the buffer, and the result of translation is actually
-encoded.  If a coding system has property
-@code{translation-table-for-encode}, that specifies the translation
-table to use.  Otherwise the variable
-@code{standard-translation-table-for-encode} specifies the translation
-table.
+  During decoding, the translation table's translations are applied to
+the characters that result from ordinary decoding.  If a coding system
+has property @code{:decode-translation-table}, that specifies the
+translation table to use, or a list of translation tables to apply in
+sequence.  (This is a property of the coding system, as returned by
+@code{coding-system-get}, not a property of the symbol that is the
+coding system's name.  @xref{Coding System Basics,, Basic Concepts of
+Coding Systems}.)  Finally, if
+@code{standard-translation-table-for-decode} is non-@code{nil}, the
+resulting characters are translated by that table.
+
+  During encoding, the translation table's translations are applied to
+the characters in the buffer, and the result of translation is
+actually encoded.  If a coding system has property
+@code{:encode-translation-table}, that specifies the translation table
+to use, or a list of translation tables to apply in sequence.  In
+addition, if the variable @code{standard-translation-table-for-encode}
+is non-@code{nil}, it specifies the translation table to use for
+translating the result.
  
  @defvar standard-translation-table-for-decode
-This is the default translation table for decoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for decoding.  If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
  @end defvar
  
  @defvar standard-translation-table-for-encode
-This is the default translation table for encoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for encoding.  If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
  @end defvar
  
-@defvar translation-table-for-input
-Self-inserting characters are translated through this translation
-table before they are inserted.  Search commands also translate their
-input through this table, so they can compare more reliably with
-what's in the buffer.
+@defun make-translation-table-from-vector vec
+This function returns a translation table made from @var{vec}, an
+array of 256 elements that maps byte values 0 through 255 to
+characters.  Elements may be @code{nil} for untranslated bytes.  The
+returned table has a translation table for reverse mapping in the
+first extra slot, and the value @code{1} in the second extra slot.
+
+This function provides an easy way to make a private coding system
+that maps each byte to a specific character.  You can specify the
+returned table and the reverse translation table using the properties
+@code{:decode-translation-table} and @code{:encode-translation-table}
+respectively in the @var{props} argument to
+@code{define-coding-system}.
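+
+A minimal sketch of building such a table, assuming an identity
+mapping for every byte except @code{#x80}, which is mapped to the
+Euro sign (the names @code{vec} and @code{table} are hypothetical):
+
+@example
+(let ((vec (make-vector 256 nil)))
+  ;; Map each byte to the character with the same code ...
+  (dotimes (i 256) (aset vec i i))
+  ;; ... except byte #x80, which becomes U+20AC.
+  (aset vec #x80 #x20AC)
+  (let ((table (make-translation-table-from-vector vec)))
+    ;; Forward lookup, then reverse lookup via the first extra slot.
+    (list (aref table #x80)
+          (aref (char-table-extra-slot table 0) #x20AC))))
+@end example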
+@end defun
  
-@code{set-buffer-file-coding-system} sets this variable so that your
-keyboard input gets translated into the character sets that the buffer
-is likely to contain.  This variable automatically becomes
-buffer-local when set.
-@end defvar
+@defun make-translation-table-from-alist alist
+This function is similar to @code{make-translation-table} but returns
+a complex translation table rather than a simple one-to-one mapping.
+Each element of @var{alist} is of the form @code{(@var{from}
+. @var{to})}, where @var{from} and @var{to} are either a character or
+a vector specifying a sequence of characters.  If @var{from} is a
+character, that character is translated to @var{to} (i.e.@: to a
+character or a character sequence).  If @var{from} is a vector of
+characters, that sequence is translated to @var{to}.  The returned
+table has a translation table for reverse mapping in the first extra
+slot, and the maximum length of all the @var{from} character sequences
+in the second extra slot.
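+
+As a sketch, the following builds a table that translates the
+two-character sequence @samp{ae} to a single character and a single
+character to the sequence @samp{ss}, then reads the maximum
+@var{from} length from the second extra slot (the particular
+characters chosen are only an illustration):
+
+@example
+(let ((table (make-translation-table-from-alist
+              '(([?a ?e] . #xE6)
+                (#xDF . [?s ?s])))))
+  (char-table-extra-slot table 1))
+@end example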
+@end defun
  
  @node Coding Systems
  @section Coding Systems