@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c 2005, 2006, 2007 Free Software Foundation, Inc.
+@c 2005, 2006, 2007, 2008 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters
- This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+ This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
@menu
-* Text Representations:: Unibyte and multibyte representations
+* Text Representations:: How Emacs represents text.
* Converting Representations:: Converting unibyte to multibyte and vice versa.
* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
* Character Codes:: How unibyte and multibyte relate to
codes of individual characters.
* Character Sets:: The space of possible character codes
is divided into various character sets.
-* Chars and Bytes:: More information about multibyte encodings.
-* Splitting Characters:: Converting a character to its byte sequence.
* Scanning Charsets:: Which character sets are used in a buffer?
* Translation of Characters:: Translation tables are used for conversion.
* Coding Systems:: Coding systems are conversions for saving files.
@node Text Representations
@section Text Representations
-@cindex text representations
-
- Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer. These are called @dfn{unibyte} and
-@dfn{multibyte}. Each string, and each buffer, uses one of these two
-representations. For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate. Occasionally in Lisp programming you will need to pay
-attention to the difference.
+@cindex text representation
+
+ Emacs buffers and strings support a large repertoire of characters
+from many different scripts, allowing users to type and display text
+in almost any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+ To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
+extends this range with codepoints in the range @code{110000..3FFFFF},
+which it uses for representing characters that are not unified with
+Unicode and raw 8-bit bytes that cannot be interpreted as characters
+(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a
+character codepoint in Emacs is a 22-bit integer number.
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+ To conserve memory, Emacs does not store each character in buffers
+and strings as a fixed-length 22-bit codepoint. Rather, Emacs uses a
+variable-length internal representation of characters that stores
+each character as a sequence of 1 to 5 8-bit bytes, depending on the
+magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}.
+For example, any @acronym{ASCII} character takes up only 1 byte, a
+Latin-1 character takes up 2 bytes, etc. We call this representation
+of text @dfn{multibyte}, because a single character can occupy
+several bytes.
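+
+ As an informal illustration (a sketch only, assuming an Emacs that
+uses the UTF-8-based representation described above), comparing
+@code{length}, which counts characters, with @code{string-bytes},
+which counts bytes, shows how the size of a character grows with its
+codepoint. The particular codepoints are arbitrary choices:
+
+@example
+;; `string' builds a multibyte string when given a non-ASCII character.
+(string-bytes (string ?A)) ; ASCII
+ @result{} 1
+(string-bytes (string #xE9)) ; Latin-1 e-acute
+ @result{} 2
+(string-bytes (string #x20AC)) ; euro sign
+ @result{} 3
+(length (string #x20AC)) ; still a single character
+ @result{} 1
+@end example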
+
+ Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
+between these external encodings and the internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+ Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffers or strings. For example, when
+Emacs visits a file, it first reads the file's text verbatim into a
+buffer, and only then converts it to the internal representation.
+Before the conversion, the buffer holds encoded text.
@cindex unibyte text
- In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255. Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
- In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored. The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237). These values are called
-@dfn{leading codes}. The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-
- Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed. But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences. All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+ Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes. We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes. In particular,
+Emacs usually displays unibyte buffers and strings as octal codes such
+as @code{\237}. We recommend that you never use unibyte buffers and
+strings except for manipulating encoded text or binary non-text data.
In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar
@defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units. This function
+returns the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun
@defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
+Return the buffer position, in character units, corresponding to the given
@var{byte-position} in the current buffer. If @var{byte-position} is
-out of range, the value is @code{nil}.
+out of range, the value is @code{nil}. In a multibyte buffer, an
+arbitrary value of @var{byte-position} may fall not on a character
+boundary but inside the multibyte sequence representing a single
+character; in this case, this function returns the buffer position of
+the character whose multibyte sequence includes @var{byte-position}.
+In other words, the value does not change for all byte positions that
+belong to the same character.
@end defun
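+
+ The following sketch (an informal illustration, assuming a
+multibyte temporary buffer, which is what @code{with-temp-buffer}
+normally creates) shows how character positions and byte positions
+diverge after a non-@acronym{ASCII} character:
+
+@example
+(with-temp-buffer
+ ;; One 2-byte character followed by one 1-byte character.
+ (insert (string #xE9) "a")
+ (list (position-bytes 1)
+ (position-bytes 2)
+ (position-bytes 3)
+ (byte-to-position 3)))
+ @result{} (1 3 4 2)
+@end example
+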
@defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
@end defun
@defun string-bytes string
@code{(length @var{string})}.
@end defun
+@defun unibyte-string &rest bytes
+This function concatenates all of its @var{bytes} arguments and makes
+the result a unibyte string.
+@end defun
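+
+ For instance (a minimal sketch; the byte values 65 and 200 are
+arbitrary), the result is unibyte and holds one byte per argument:
+
+@example
+(multibyte-string-p (unibyte-string 65 200))
+ @result{} nil
+(string-bytes (unibyte-string 65 200))
+ @result{} 2
+@end example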
+
@node Converting Representations
@section Converting Text Representations
Emacs can convert unibyte text to multibyte; it can also convert
-multibyte text to unibyte, though this conversion loses information. In
-general these conversions happen when inserting text into a buffer, or
-when putting text from several strings together in one string. You can
-also explicitly convert a string's contents to either representation.
+multibyte text to unibyte, provided that the multibyte text contains
+only @acronym{ASCII} and 8-bit raw bytes. In general, these
+conversions happen when inserting text into a buffer, or when putting
+text from several strings together in one string. You can also
+explicitly convert a string's contents to either representation.
Emacs chooses the representation for a string based on the text that
it is constructed from. The general rule is to convert unibyte text to
user that cannot be overridden automatically.
Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and likewise character codes 128 through 159. It converts
-the non-@acronym{ASCII} codes 160 through 255 by adding the value
-@code{nonascii-insert-offset} to each character code. By setting this
-variable, you specify which character set the unibyte characters
-correspond to (@pxref{Character Sets}). For example, if
-@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
-'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
-correspond to Latin 1. If it is 2688, which is @code{(- (make-char
-'greek-iso8859-7) 128)}, then they correspond to Greek letters.
-
- Converting multibyte text to unibyte is simpler: it discards all but
-the low 8 bits of each character code. If @code{nonascii-insert-offset}
-has a reasonable value, corresponding to the beginning of some character
-set, this conversion is the inverse of the other: converting unibyte
-text to multibyte and back to unibyte reproduces the original unibyte
-text.
+unchanged, and converts bytes with codes 128 through 255 to the
+multibyte representation of raw eight-bit bytes.
-@defvar nonascii-insert-offset
-This variable specifies the amount to add to a non-@acronym{ASCII} character
-when converting unibyte text to multibyte. It also applies when
-@code{self-insert-command} inserts a character in the unibyte
-non-@acronym{ASCII} range, 128 through 255. However, the functions
-@code{insert} and @code{insert-char} do not perform this conversion.
-
-The right value to use to select character set @var{cs} is @code{(-
-(make-char @var{cs}) 128)}. If the value of
-@code{nonascii-insert-offset} is zero, then conversion actually uses the
-value for the Latin 1 character set, rather than zero.
-@end defvar
+ Converting multibyte text to unibyte converts all @acronym{ASCII}
+and eight-bit characters to their single-byte form, but loses
+information for non-@acronym{ASCII} characters by discarding all but
+the low 8 bits of each character's codepoint. Converting unibyte text
+to multibyte and back to unibyte reproduces the original unibyte text.
-@defvar nonascii-translation-table
-This variable provides a more general alternative to
-@code{nonascii-insert-offset}. You can use it to specify independently
-how to translate each code in the range of 128 through 255 into a
-multibyte character. The value should be a char-table, or @code{nil}.
-If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
-@end defvar
-
-The next three functions either return the argument @var{string}, or a
+The next two functions either return the argument @var{string}, or a
newly created string with no text properties.
-@defun string-make-unibyte string
-This function converts the text of @var{string} to unibyte
-representation, if it isn't already, and returns the result. If
-@var{string} is a unibyte string, it is returned unchanged. Multibyte
-character codes are converted to unibyte according to
-@code{nonascii-translation-table} or, if that is @code{nil}, using
-@code{nonascii-insert-offset}. If the lookup in the translation table
-fails, this function takes just the low 8 bits of each character.
-@end defun
-
-@defun string-make-multibyte string
-This function converts the text of @var{string} to multibyte
-representation, if it isn't already, and returns the result. If
-@var{string} is a multibyte string or consists entirely of
-@acronym{ASCII} characters, it is returned unchanged. In particular,
-if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
-string is unibyte. (When the characters are all @acronym{ASCII},
-Emacs primitives will treat the string the same way whether it is
-unibyte or multibyte.) If @var{string} is unibyte and contains
-non-@acronym{ASCII} characters, the function
-@code{unibyte-char-to-multibyte} is used to convert each unibyte
-character to a multibyte character.
-@end defun
-
@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
-of character codes as @var{string}. Unlike
-@code{string-make-multibyte}, this function unconditionally returns a
-multibyte string. If @var{string} is a multibyte string, it is
-returned unchanged.
+of characters as @var{string}. If @var{string} is a multibyte string,
+it is returned unchanged. The function assumes that @var{string}
+includes only @acronym{ASCII} characters and raw 8-bit bytes; the
+latter are converted to their multibyte representation corresponding
+to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
+Representations, codepoints}).
+@end defun
+
+@defun string-to-unibyte string
+This function returns a unibyte string containing the same sequence of
+characters as @var{string}. It signals an error if @var{string}
+contains a non-@acronym{ASCII} character that is not an eight-bit raw
+byte. If @var{string} is a unibyte string, it is returned unchanged.
+Use this function for @var{string} arguments that contain only
+@acronym{ASCII} and eight-bit characters.
@end defun
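+
+ As a rough sketch of how these two functions relate (assuming a
+unibyte string built with @code{unibyte-string}; the byte value 200 is
+arbitrary), a raw byte converted to multibyte becomes a character in
+the @code{3FFF80..3FFFFF} range, and converting back recovers the
+original byte:
+
+@example
+(let* ((raw (unibyte-string 200))
+ (multi (string-to-multibyte raw)))
+ (list (multibyte-string-p multi)
+ (= (aref multi 0) #x3FFFC8) ; #x3FFF00 + 200
+ (equal (string-to-unibyte multi) raw)))
+ @result{} (t t t)
+@end example
+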
@defun multibyte-char-to-unibyte char
This function converts the multibyte character @var{char} to a unibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+character. If @var{char} is a character that is neither
+@acronym{ASCII} nor eight-bit, the value is -1.
@end defun
@defun unibyte-char-to-multibyte char
This function converts the unibyte character @var{char} to a multibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+character, assuming @var{char} is either an @acronym{ASCII} character
+or a raw 8-bit byte.
@end defun
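+
+ A short sketch of the per-character conversions (the values 200 and
+@code{#x20AC} are arbitrary; the results assume the behavior described
+above):
+
+@example
+(= (unibyte-char-to-multibyte 200) #x3FFFC8)
+ @result{} t
+(multibyte-char-to-unibyte #x3FFFC8)
+ @result{} 200
+(multibyte-char-to-unibyte #x20AC) ; neither ASCII nor eight-bit
+ @result{} -1
+@end example
+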
@node Selecting a Representation
is @code{nil}, the buffer becomes unibyte.
This function leaves the buffer contents unchanged when viewed as a
-sequence of bytes. As a consequence, it can change the contents viewed
-as characters; a sequence of two bytes which is treated as one character
-in multibyte representation will count as two characters in unibyte
-representation. Character codes 128 through 159 are an exception. They
-are represented by one byte in a unibyte buffer, but when the buffer is
-set to multibyte, they are converted to two-byte sequences, and vice
-versa.
+sequence of bytes. As a consequence, it can change the contents
+viewed as characters; a sequence of three bytes which is treated as
+one character in multibyte representation will count as three
+characters in unibyte representation. Eight-bit characters
+representing raw bytes are an exception. They are represented by one
+byte in a unibyte buffer, but when the buffer is set to multibyte,
+they are converted to two-byte sequences, and vice versa.
This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character. This means that the value may have
-more characters than @var{string} has.
+more characters than @var{string} has. Eight-bit characters
+representing raw bytes are an exception: each one of them is converted
+to a single byte.
If @var{string} is already a unibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
-text properties. If @var{string} is multibyte, any characters it
-contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
-are converted to the corresponding single byte.
+text properties.
@end defun
@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character. This means that the
-value may have fewer characters than @var{string} has.
+treating each multibyte sequence as one character. This means that
+the value may have fewer characters than @var{string} has. If a byte
+sequence in @var{string} is invalid as a multibyte representation of a
+single character, each byte in the sequence is treated as a raw 8-bit
+byte.
If @var{string} is already a multibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
-text properties. If @var{string} is unibyte and contains any individual
-8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
-the corresponding multibyte character of charset @code{eight-bit-control}
-or @code{eight-bit-graphic}.
+text properties.
@end defun
@node Character Codes
@section Character Codes
@cindex character codes
- The unibyte and multibyte text representations use different character
-codes. The valid character codes for unibyte representation range from
-0 to 255---the values that can fit in one byte. The valid character
-codes for multibyte representation range from 0 to 524287, but not all
-values in that range are valid. The values 128 through 255 are not
-entirely proper in multibyte text, but they can occur if you do explicit
-encoding and decoding (@pxref{Explicit Encoding}). Some other character
-codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes
-0 through 127 are completely legitimate in both representations.
-
-@defun char-valid-p charcode &optional genericp
-This returns @code{t} if @var{charcode} is valid (either for unibyte
-text or for multibyte text).
+ The unibyte and multibyte text representations use different
+character codes. The valid character codes for unibyte representation
+range from 0 to 255---the values that can fit in one byte. The valid
+character codes for multibyte representation range from 0 to 4194303
+(#x3FFFFF). In this code space, values 0 through 127 are for
+@acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F)
+are for non-@acronym{ASCII} characters. Values 0 through 1114111
+(#x10FFFF) correspond to Unicode characters of the same codepoint,
+while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
+representing eight-bit raw bytes.
+
+@defun characterp charcode
+This returns @code{t} if @var{charcode} is a valid character, and
+@code{nil} otherwise.
@example
-(char-valid-p 65)
+(characterp 65)
@result{} t
-(char-valid-p 256)
- @result{} nil
-(char-valid-p 2248)
+(characterp 4194303)
@result{} t
+(characterp 4194304)
+ @result{} nil
@end example
+@end defun
-If the optional argument @var{genericp} is non-@code{nil}, this
-function also returns @code{t} if @var{charcode} is a generic
-character (@pxref{Splitting Characters}).
+@defun get-byte pos &optional string
+This function returns the byte at character position @var{pos} in the
+current buffer. If the current buffer is unibyte, this is literally
+the byte at that position. If the buffer is multibyte, byte values of
+@acronym{ASCII} characters are the same as character codepoints,
+whereas eight-bit raw bytes are converted to their 8-bit codes. The
+function signals an error if the character at @var{pos} is neither
+@acronym{ASCII} nor an eight-bit raw byte.
+
+The optional argument @var{string} means to get a byte value from that
+string instead of the current buffer.
@end defun
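+
+ A minimal sketch of both forms of the call (note, as an assumption
+not stated above, that a position in a string is a zero-based index,
+whereas buffer positions start at 1):
+
+@example
+(with-temp-buffer
+ (insert "A")
+ (get-byte 1))
+ @result{} 65
+;; With a string argument, the position indexes the string from 0.
+(get-byte 0 (unibyte-string 200))
+ @result{} 200
+@end example
+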
@node Character Sets
@section Character Sets
@cindex character sets
- Emacs classifies characters into various @dfn{character sets}, each of
-which has a name which is a symbol. Each character belongs to one and
-only one character set.
-
- In general, there is one character set for each distinct script. For
-example, @code{latin-iso8859-1} is one character set,
-@code{greek-iso8859-7} is another, and @code{ascii} is another. An
-Emacs character set can hold at most 9025 characters; therefore, in some
-cases, characters that would logically be grouped together are split
-into several character sets. For example, one set of Chinese
-characters, generally known as Big 5, is divided into two Emacs
-character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
-
- @acronym{ASCII} characters are in character set @code{ascii}. The
-non-@acronym{ASCII} characters 128 through 159 are in character set
-@code{eight-bit-control}, and codes 160 through 255 are in character set
-@code{eight-bit-graphic}.
+@cindex charset
+@cindex coded character set
+An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
+in which each character is assigned a numeric code point. (The
+Unicode standard calls this a @dfn{coded character set}.) Each Emacs
+charset has a name which is a symbol. A single character can belong
+to any number of different character sets, but it will generally have
+a different code point in each charset. Examples of character sets
+include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
+@code{windows-1255}. The code point assigned to a character in a
+charset is usually different from its code point used in Emacs buffers
+and strings.
+
+@cindex @code{emacs}, a charset
+@cindex @code{unicode}, a charset
+@cindex @code{eight-bit}, a charset
+ Emacs defines several special character sets. The character set
+@code{unicode} includes all the characters whose Emacs code points are
+in the range @code{0..10FFFF}. The character set @code{emacs}
+includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
+Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
+Emacs uses it to represent raw bytes encountered in text.
@defun charsetp object
Returns @code{t} if @var{object} is a symbol that names a character set,
The value is a list of all defined character set names.
@end defvar
-@defun charset-list
-This function returns the value of @code{charset-list}. It is only
-provided for backward compatibility.
+@defun charset-priority-list &optional highestp
+This function returns a list of all defined character sets ordered by
+their priority. If @var{highestp} is non-@code{nil}, the function
+returns a single character set of the highest priority.
+@end defun
+
+@defun set-charset-priority &rest charsets
+This function makes @var{charsets} the highest priority character sets.
@end defun
@defun char-charset character
-This function returns the name of the character set that @var{character}
-belongs to, or the symbol @code{unknown} if @var{character} is not a
-valid character.
+This function returns the name of the character set of highest
+priority that @var{character} belongs to. @acronym{ASCII} characters
+are an exception: for them, this function always returns @code{ascii}.
@end defun
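+
+ For example (a sketch; the charset reported for a
+non-@acronym{ASCII} character depends on the current priority list, so
+only its type is checked here):
+
+@example
+(char-charset ?a)
+ @result{} ascii
+(charsetp (char-charset #x20AC))
+ @result{} t
+@end example
+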
@defun charset-plist charset
-This function returns the charset property list of the character set
-@var{charset}. Although @var{charset} is a symbol, this is not the same
-as the property list of that symbol. Charset properties are used for
-special purposes within Emacs.
+This function returns the property list of the character set
+@var{charset}. Although @var{charset} is a symbol, this is not the
+same as the property list of that symbol. Charset properties include
+important information about the charset, such as its documentation
+string, short name, etc.
@end defun
-@deffn Command list-charset-chars charset
-This command displays a list of characters in the character set
-@var{charset}.
-@end deffn
-
-@node Chars and Bytes
-@section Characters and Bytes
-@cindex bytes and characters
-
-@cindex introduction sequence (of character)
-@cindex dimension (of character set)
- In multibyte representation, each character occupies one or more
-bytes. Each character set has an @dfn{introduction sequence}, which is
-normally one or two bytes long. (Exception: the @code{ascii} character
-set and the @code{eight-bit-graphic} character set have a zero-length
-introduction sequence.) The introduction sequence is the beginning of
-the byte sequence for any character in the character set. The rest of
-the character's bytes distinguish it from the other characters in the
-same character set. Depending on the character set, there are either
-one or two distinguishing bytes; the number of such bytes is called the
-@dfn{dimension} of the character set.
-
-@defun charset-dimension charset
-This function returns the dimension of @var{charset}; at present, the
-dimension is always 1 or 2.
+@defun put-charset-property charset propname value
+This function sets the @var{propname} property of @var{charset} to the
+given @var{value}.
@end defun
-@defun charset-bytes charset
-This function returns the number of bytes used to represent a character
-in character set @var{charset}.
+@defun get-charset-property charset propname
+This function returns the value of @var{charset}'s property
+@var{propname}.
@end defun
- This is the simplest way to determine the byte length of a character
-set's introduction sequence:
-
-@example
-(- (charset-bytes @var{charset})
- (charset-dimension @var{charset}))
-@end example
-
-@node Splitting Characters
-@section Splitting Characters
-@cindex character as bytes
-
- The functions in this section convert between characters and the byte
-values used to represent them. For most purposes, there is no need to
-be concerned with the sequence of bytes used to represent a character,
-because Emacs translates automatically when necessary.
-
-@defun split-char character
-Return a list containing the name of the character set of
-@var{character}, followed by one or two byte values (integers) which
-identify @var{character} within that character set. The number of byte
-values is the character set's dimension.
-
-If @var{character} is invalid as a character code, @code{split-char}
-returns a list consisting of the symbol @code{unknown} and @var{character}.
+@deffn Command list-charset-chars charset
+This command displays a list of characters in the character set
+@var{charset}.
+@end deffn
-@example
-(split-char 2248)
- @result{} (latin-iso8859-1 72)
-(split-char 65)
- @result{} (ascii 65)
-(split-char 128)
- @result{} (eight-bit-control 128)
-@end example
+ Emacs can convert between its internal representation of a character
+and the character's codepoint in a specific charset. The following
+two functions support these conversions.
+
+@c FIXME: decode-char and encode-char accept and ignore an additional
+@c argument @var{restriction}. When that argument actually makes a
+@c difference, it should be documented here.
+@defun decode-char charset code-point
+This function decodes a character that is assigned a @var{code-point}
+in @var{charset}, to the corresponding Emacs character, and returns
+it. If @var{charset} doesn't contain a character of that code point,
+the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp
+integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
+specified as a cons cell @code{(@var{high} . @var{low})}, where
+@var{low} is the lower 16 bits of the value and @var{high} is the
+high 16 bits.
@end defun
-@cindex generate characters in charsets
-@defun make-char charset &optional code1 code2
-This function returns the character in character set @var{charset} whose
-position codes are @var{code1} and @var{code2}. This is roughly the
-inverse of @code{split-char}. Normally, you should specify either one
-or both of @var{code1} and @var{code2} according to the dimension of
-@var{charset}. For example,
-
-@example
-(make-char 'latin-iso8859-1 72)
- @result{} 2248
-@end example
-
-Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
-before they are used to index @var{charset}. Thus you may use, for
-instance, an ISO 8859 character code rather than subtracting 128, as
-is necessary to index the corresponding Emacs charset.
+@defun encode-char char charset
+This function returns the code point assigned to the character
+@var{char} in @var{charset}. If the result does not fit in a Lisp
+integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
+that fits the second argument of @code{decode-char} above. If
+@var{charset} doesn't have a codepoint for @var{char}, the value is
+@code{nil}.
@end defun
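+
+ The following sketch illustrates both directions, using the
+@code{iso-8859-1} and @code{ascii} charsets mentioned earlier (the
+code points chosen are arbitrary):
+
+@example
+(decode-char 'iso-8859-1 #xE9)
+ @result{} 233
+(encode-char ?A 'ascii)
+ @result{} 65
+(encode-char #x20AC 'ascii) ; ASCII has no euro sign
+ @result{} nil
+@end example
+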
-@cindex generic characters
- If you call @code{make-char} with no @var{byte-values}, the result is
-a @dfn{generic character} which stands for @var{charset}. A generic
-character is an integer, but it is @emph{not} valid for insertion in the
-buffer as a character. It can be used in @code{char-table-range} to
-refer to the whole character set (@pxref{Char-Tables}).
-@code{char-valid-p} returns @code{nil} for generic characters.
-For example:
-
-@example
-(make-char 'latin-iso8859-1)
- @result{} 2176
-(char-valid-p 2176)
- @result{} nil
-(char-valid-p 2176 t)
- @result{} t
-(split-char 2176)
- @result{} (latin-iso8859-1 0)
-@end example
-
-The character sets @code{ascii}, @code{eight-bit-control}, and
-@code{eight-bit-graphic} don't have corresponding generic characters. If
-@var{charset} is one of them and you don't supply @var{code1},
-@code{make-char} returns the character code corresponding to the
-smallest code in @var{charset}.
-
@node Scanning Charsets
@section Scanning for Character Sets
- Sometimes it is useful to find out which character sets appear in a
-part of a buffer or a string. One use for this is in determining which
-coding systems (@pxref{Coding Systems}) are capable of representing all
-of the text in question.
+ Sometimes it is useful to find out, for characters that appear in a
+certain part of a buffer or a string, to which character sets they
+belong. One use for this is in determining which coding systems
+(@pxref{Coding Systems}) are capable of representing all of the text
+in question; another is to determine the font(s) for displaying that
+text.
@defun charset-after &optional pos
-This function return the charset of a character in the current buffer
-at position @var{pos}. If @var{pos} is omitted or @code{nil}, it
-defaults to the current value of point. If @var{pos} is out of range,
-the value is @code{nil}.
+This function returns the charset of highest priority containing the
+character in the current buffer at position @var{pos}. If @var{pos}
+is omitted or @code{nil}, it defaults to the current value of point.
+If @var{pos} is out of range, the value is @code{nil}.
@end defun
@defun find-charset-region beg end &optional translation
-This function returns a list of the character sets that appear in the
-current buffer between positions @var{beg} and @var{end}.
+This function returns a list of the character sets of highest priority
+that contain characters in the current buffer between positions
+@var{beg} and @var{end}.
The optional argument @var{translation} specifies a translation table to
be used in scanning the text (@pxref{Translation of Characters}). If it
@end defun
@defun find-charset-string string &optional translation
-This function returns a list of the character sets that appear in the
-string @var{string}. It is just like @code{find-charset-region}, except
-that it applies to the contents of @var{string} instead of part of the
-current buffer.
+This function returns a list of the character sets of highest priority
+that contain characters in @var{string}. It is just like
+@code{find-charset-region}, except that it applies to the contents of
+@var{string} instead of part of the current buffer.
@end defun
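+
+ For example, a string containing only @acronym{ASCII} characters
+yields a single charset, and an empty string yields an empty list.
+This is only a sketch; for text with non-@acronym{ASCII} characters
+the result depends on the current charset priorities, so no value is
+shown for that case:
+
+@example
+(find-charset-string "abc")
+     @result{} (ascii)
+(find-charset-string "")
+     @result{} nil
+@end example
+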
@node Translation of Characters
@cindex character translation tables
@cindex translation tables
- A @dfn{translation table} is a char-table that specifies a mapping
-of characters into characters. These tables are used in encoding and
-decoding, and for other purposes. Some coding systems specify their
-own particular translation tables; there are also default translation
-tables which apply to all other coding systems.
+ A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
+specifies a mapping of characters into characters. These tables are
+used in encoding and decoding, and for other purposes. Some coding
+systems specify their own particular translation tables; there are
+also default translation tables which apply to all other coding
+systems.
- For instance, the coding-system @code{utf-8} has a translation table
-that maps characters of various charsets (e.g.,
-@code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
-it can encode Latin-2 characters into UTF-8. Meanwhile,
-@code{unify-8859-on-decoding-mode} operates by specifying
-@code{standard-translation-table-for-decode} to translate
-Latin-@var{x} characters into corresponding Unicode characters.
+ A translation table has two extra slots. The first is either
+@code{nil} or a translation table that performs the reverse
+translation; the second is the maximum number of characters to look up
+for translating sequences of characters (see the description of
+@code{make-translation-table-from-alist} below).
@defun make-translation-table &rest translations
This function returns a translation table based on the argument
and if a previous form already translates @var{to} to some other
character, say @var{to-alt}, @var{from} is also translated to
@var{to-alt}.
-
-You can also map one whole character set into another character set with
-the same dimension. To do this, you specify a generic character (which
-designates a character set) for @var{from} (@pxref{Splitting Characters}).
-In this case, if @var{to} is also a generic character, its character
-set should have the same dimension as @var{from}'s. Then the
-translation table translates each character of @var{from}'s character
-set into the corresponding character of @var{to}'s character set. If
-@var{from} is a generic character and @var{to} is an ordinary
-character, then the translation table translates every character of
-@var{from}'s character set into @var{to}.
@end defun
- In decoding, the translation table's translations are applied to the
-characters that result from ordinary decoding. If a coding system has
-property @code{translation-table-for-decode}, that specifies the
-translation table to use. (This is a property of the coding system,
-as returned by @code{coding-system-get}, not a property of the symbol
-that is the coding system's name. @xref{Coding System Basics,, Basic
-Concepts of Coding Systems}.) Otherwise, if
-@code{standard-translation-table-for-decode} is non-@code{nil},
-decoding uses that table.
-
- In encoding, the translation table's translations are applied to the
-characters in the buffer, and the result of translation is actually
-encoded. If a coding system has property
-@code{translation-table-for-encode}, that specifies the translation
-table to use. Otherwise the variable
-@code{standard-translation-table-for-encode} specifies the translation
-table.
+ During decoding, the translation table's translations are applied to
+the characters that result from ordinary decoding. If a coding system
+has property @code{:decode-translation-table}, that specifies the
+translation table to use, or a list of translation tables to apply in
+sequence. (This is a property of the coding system, as returned by
+@code{coding-system-get}, not a property of the symbol that is the
+coding system's name. @xref{Coding System Basics,, Basic Concepts of
+Coding Systems}.) Finally, if
+@code{standard-translation-table-for-decode} is non-@code{nil}, the
+resulting characters are translated by that table.
+
+ During encoding, the translation table's translations are applied to
+the characters in the buffer, and the result of translation is
+actually encoded. If a coding system has property
+@code{:encode-translation-table}, that specifies the translation table
+to use, or a list of translation tables to apply in sequence. In
+addition, if the variable @code{standard-translation-table-for-encode}
+is non-@code{nil}, it specifies the translation table to use for
+translating the result.
@defvar standard-translation-table-for-decode
-This is the default translation table for decoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for decoding. If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
@end defvar
@defvar standard-translation-table-for-encode
-This is the default translation table for encoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for encoding. If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
@end defvar
-@defvar translation-table-for-input
-Self-inserting characters are translated through this translation
-table before they are inserted. Search commands also translate their
-input through this table, so they can compare more reliably with
-what's in the buffer.
+@defun make-translation-table-from-vector vec
+This function returns a translation table made from @var{vec}, an
+array of 256 elements that maps byte values 0 through 255 to
+characters. Elements may be @code{nil} for untranslated bytes. The
+returned table has a translation table for reverse mapping in the
+first extra slot, and the value @code{1} in the second extra slot.
+
+This function provides an easy way to make a private coding system
+that maps each byte to a specific character. You can specify the
+returned table and the reverse translation table using the properties
+@code{:decode-translation-table} and @code{:encode-translation-table}
+respectively in the @var{props} argument to
+@code{define-coding-system}.
+@end defun
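+
+ The following sketch builds such a table and inspects it. The
+choice of byte 128 and of the character with code @code{#x20AC} is an
+assumption made only for illustration:
+
+@example
+(let ((vec (make-vector 256 nil)))
+  ;; Map byte 128 to the character with code #x20AC;
+  ;; the remaining elements stay nil (untranslated).
+  (aset vec 128 #x20AC)
+  (let ((table (make-translation-table-from-vector vec)))
+    (list (= (aref table 128) #x20AC)
+          ;; First extra slot: the reverse translation table.
+          (char-table-p (char-table-extra-slot table 0))
+          ;; Second extra slot: the value 1.
+          (char-table-extra-slot table 1))))
+     @result{} (t t 1)
+@end example
+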
-@code{set-buffer-file-coding-system} sets this variable so that your
-keyboard input gets translated into the character sets that the buffer
-is likely to contain. This variable automatically becomes
-buffer-local when set.
-@end defvar
+@defun make-translation-table-from-alist alist
+This function is similar to @code{make-translation-table} but returns
+a complex translation table rather than a simple one-to-one mapping.
+Each element of @var{alist} is of the form @code{(@var{from}
+. @var{to})}, where @var{from} and @var{to} are either a character or
+a vector specifying a sequence of characters. If @var{from} is a
+character, that character is translated to @var{to} (i.e.@: to a
+character or a character sequence). If @var{from} is a vector of
+characters, that sequence is translated to @var{to}. The returned
+table has a translation table for reverse mapping in the first extra
+slot, and the maximum length of all the @var{from} character sequences
+in the second extra slot.
+@end defun
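+
+ As an illustration of the description above, this sketch mixes a
+character-to-character mapping with a sequence-to-character mapping.
+The particular characters are arbitrary, and only the extra slot and
+the simple character mapping are examined:
+
+@example
+(let ((table (make-translation-table-from-alist
+              '((?a . ?b)             ; character to character
+                ([?x ?y] . ?z)))))    ; two-character sequence to character
+  (list (aref table ?a)
+        ;; Length of the longest FROM sequence (second extra slot).
+        (char-table-extra-slot table 1)))
+     @result{} (98 2)
+@end example
+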
@node Coding Systems
@section Coding Systems