@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c 2005, 2006, 2007, 2008 Free Software Foundation, Inc.
+@c 2005, 2006, 2007, 2008, 2009, 2010 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters
- This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+ This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
@menu
-* Text Representations:: Unibyte and multibyte representations
+* Text Representations:: How Emacs represents text.
* Converting Representations:: Converting unibyte to multibyte and vice versa.
* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
* Character Codes:: How unibyte and multibyte relate to
codes of individual characters.
+* Character Properties:: Character attributes that define their
+ behavior and handling.
* Character Sets:: The space of possible character codes
is divided into various character sets.
-* Chars and Bytes:: More information about multibyte encodings.
-* Splitting Characters:: Converting a character to its byte sequence.
* Scanning Charsets:: Which character sets are used in a buffer?
* Translation of Characters:: Translation tables are used for conversion.
* Coding Systems:: Coding systems are conversions for saving files.
@node Text Representations
@section Text Representations
-@cindex text representations
-
- Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer. These are called @dfn{unibyte} and
-@dfn{multibyte}. Each string, and each buffer, uses one of these two
-representations. For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate. Occasionally in Lisp programming you will need to pay
-attention to the difference.
+@cindex text representation
+
+ Emacs buffers and strings support a large repertoire of characters
+from many different scripts, allowing users to type and display text
+in almost any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+ To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
+inclusive. Emacs extends this range with codepoints in the range
+@code{#x110000..#x3FFFFF}, which it uses for representing characters
+that are not unified with Unicode and @dfn{raw 8-bit bytes} that
+cannot be interpreted as characters. Thus, a character codepoint in
+Emacs is a 22-bit integer number.
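+
+ As a small sketch of these ranges, the following expressions check
+the boundaries quoted above. They rely on @code{characterp} and
+@code{max-char}, which are described later in this chapter
+(@pxref{Character Codes}):
+
+@example
+@group
+(characterp #x10FFFF)        ; last Unicode codepoint
+ @result{} t
+@end group
+@group
+(= (max-char) #x3FFFFF)      ; last Emacs codepoint
+ @result{} t
+@end group
+@group
+(< (max-char) (expt 2 22))   ; fits in 22 bits
+ @result{} t
+@end group
+@end example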
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+ To conserve memory, Emacs does not store the codepoints of text
+characters in buffers and strings as fixed-length 22-bit numbers.
+Rather, Emacs uses a variable-length internal representation of
+characters that stores each character as a sequence of 1 to 5 8-bit
+bytes, depending on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}. For example, any @acronym{ASCII} character takes up only 1
+byte, a Latin-1 character takes up 2 bytes, etc. We call this
+representation of text @dfn{multibyte}.
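+
+ For illustration, this sketch compares character counts with byte
+counts using @code{length} and @code{string-bytes}. It assumes the
+strings are written with @samp{\u} escapes, which yields multibyte
+strings:
+
+@example
+@group
+(string-bytes "a")            ; ASCII
+ @result{} 1
+@end group
+@group
+(string-bytes "\u00e9")       ; Latin-1 e with acute accent
+ @result{} 2
+@end group
+@group
+(list (length "\u20ac") (string-bytes "\u20ac")) ; euro sign
+ @result{} (1 3)
+@end group
+@end example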
+
+ Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
+between these external encodings and its internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+ Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffers or strings. For example, when
+Emacs visits a file, it first reads the file's text verbatim into a
+buffer, and only then converts it to the internal representation.
+Before the conversion, the buffer holds encoded text.
@cindex unibyte text
- In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255. Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
- In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored. The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237). These values are called
-@dfn{leading codes}. The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-
- Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed. But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences. All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+ Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes. We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes. Usually, Emacs
+displays unibyte buffers and strings as octal codes such as
+@code{\237}. We recommend that you never use unibyte buffers and
+strings except for manipulating encoded text or binary non-text data.
In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar
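+
+ A minimal sketch of how @code{enable-multibyte-characters} tracks the
+buffer's representation. It assumes the default value of the variable
+is non-@code{nil}, so that a fresh temporary buffer starts out
+multibyte:
+
+@example
+@group
+(with-temp-buffer
+  (let ((before enable-multibyte-characters))
+    (set-buffer-multibyte nil)    ; switch the buffer to unibyte
+    (list before enable-multibyte-characters)))
+ @result{} (t nil)
+@end group
+@end example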
-@defvar default-enable-multibyte-characters
-This variable's value is entirely equivalent to @code{(default-value
-'enable-multibyte-characters)}, and setting this variable changes that
-default value. Setting the local binding of
-@code{enable-multibyte-characters} in a specific buffer is not allowed,
-but changing the default value is supported, and it is a reasonable
-thing to do, because it has no effect on existing buffers.
-
-The @samp{--unibyte} command line option does its job by setting the
-default value to @code{nil} early in startup.
-@end defvar
-
@defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units. This function
+returns the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun
@defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
+Return the buffer position, in character units, corresponding to the given
@var{byte-position} in the current buffer. If @var{byte-position} is
-out of range, the value is @code{nil}.
+out of range, the value is @code{nil}. In a multibyte buffer, an
+arbitrary value of @var{byte-position} might not fall on a character
+boundary, but inside a multibyte sequence representing a single
+character; in this case, this function returns the buffer position of
+the character whose multibyte sequence includes @var{byte-position}.
+In other words, the value does not change for all byte positions that
+belong to the same character.
@end defun
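+
+ The following sketch illustrates @code{position-bytes} and
+@code{byte-to-position}. It assumes a temporary multibyte buffer whose
+text is a two-byte character followed by an @acronym{ASCII} letter:
+
+@example
+@group
+(with-temp-buffer
+  (insert "\u00e9a")           ; e-acute is 2 bytes, "a" is 1
+  (list (position-bytes 2)     ; byte position of "a"
+        (byte-to-position 2)   ; byte 2 lies inside e-acute
+        (byte-to-position 3))) ; byte 3 starts "a"
+ @result{} (3 1 2)
+@end group
+@end example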
@defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
@end defun
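+
+ As a brief illustration of @code{multibyte-string-p}, the expressions
+below rely on the fact that an all-@acronym{ASCII} string literal is
+read as unibyte, whereas a literal containing a @samp{\u} escape is
+read as multibyte:
+
+@example
+@group
+(multibyte-string-p "abc")
+ @result{} nil
+@end group
+@group
+(multibyte-string-p "\u00e9")
+ @result{} t
+@end group
+@group
+(multibyte-string-p (string-to-multibyte "abc"))
+ @result{} t
+@end group
+@end example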
@defun string-bytes string
@code{(length @var{string})}.
@end defun
+@defun unibyte-string &rest bytes
+This function concatenates all of its arguments, @var{bytes}, and
+returns the result as a unibyte string.
+@end defun
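+
+ A short sketch of @code{unibyte-string}; the byte values are
+arbitrary and were chosen only for illustration:
+
+@example
+@group
+(unibyte-string 72 105)       ; the bytes of "Hi"
+ @result{} "Hi"
+@end group
+@group
+(let ((s (unibyte-string 200 201)))
+  (list (multibyte-string-p s) (string-bytes s)))
+ @result{} (nil 2)
+@end group
+@end example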
+
@node Converting Representations
@section Converting Text Representations
Emacs can convert unibyte text to multibyte; it can also convert
-multibyte text to unibyte, though this conversion loses information. In
-general these conversions happen when inserting text into a buffer, or
-when putting text from several strings together in one string. You can
-also explicitly convert a string's contents to either representation.
-
- Emacs chooses the representation for a string based on the text that
-it is constructed from. The general rule is to convert unibyte text to
-multibyte text when combining it with other multibyte text, because the
-multibyte representation is more general and can hold whatever
+multibyte text to unibyte, provided that the multibyte text contains
+only @acronym{ASCII} and 8-bit raw bytes. In general, these
+conversions happen when inserting text into a buffer, or when putting
+text from several strings together in one string. You can also
+explicitly convert a string's contents to either representation.
+
+ Emacs chooses the representation for a string based on the text from
+which it is constructed. The general rule is to convert unibyte text
+to multibyte text when combining it with other multibyte text, because
+the multibyte representation is more general and can hold whatever
characters the unibyte text has.
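+
+ The rule can be observed with @code{concat}. In this sketch, the
+unibyte literal @code{"abc"} is combined first with another unibyte
+string and then with a multibyte one:
+
+@example
+@group
+(multibyte-string-p (concat "abc" "def"))
+ @result{} nil
+@end group
+@group
+(multibyte-string-p (concat "abc" "\u00e9"))
+ @result{} t
+@end group
+@end example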
When inserting text into a buffer, Emacs converts the text to the
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.
- Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and likewise character codes 128 through 159. It converts
-the non-@acronym{ASCII} codes 160 through 255 by adding the value
-@code{nonascii-insert-offset} to each character code. By setting this
-variable, you specify which character set the unibyte characters
-correspond to (@pxref{Character Sets}). For example, if
-@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
-'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
-correspond to Latin 1. If it is 2688, which is @code{(- (make-char
-'greek-iso8859-7) 128)}, then they correspond to Greek letters.
-
- Converting multibyte text to unibyte is simpler: it discards all but
-the low 8 bits of each character code. If @code{nonascii-insert-offset}
-has a reasonable value, corresponding to the beginning of some character
-set, this conversion is the inverse of the other: converting unibyte
-text to multibyte and back to unibyte reproduces the original unibyte
-text.
-
-@defvar nonascii-insert-offset
-This variable specifies the amount to add to a non-@acronym{ASCII} character
-when converting unibyte text to multibyte. It also applies when
-@code{self-insert-command} inserts a character in the unibyte
-non-@acronym{ASCII} range, 128 through 255. However, the functions
-@code{insert} and @code{insert-char} do not perform this conversion.
-
-The right value to use to select character set @var{cs} is @code{(-
-(make-char @var{cs}) 128)}. If the value of
-@code{nonascii-insert-offset} is zero, then conversion actually uses the
-value for the Latin 1 character set, rather than zero.
-@end defvar
+ Converting unibyte text to multibyte text leaves @acronym{ASCII}
+characters unchanged, and converts bytes with codes 128 through 159 to
+the multibyte representation of raw eight-bit bytes.
-@defvar nonascii-translation-table
-This variable provides a more general alternative to
-@code{nonascii-insert-offset}. You can use it to specify independently
-how to translate each code in the range of 128 through 255 into a
-multibyte character. The value should be a char-table, or @code{nil}.
-If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
-@end defvar
+ Converting multibyte text to unibyte converts all @acronym{ASCII}
+and eight-bit characters to their single-byte form, but loses
+information for non-@acronym{ASCII} characters by discarding all but
+the low 8 bits of each character's codepoint. Converting unibyte text
+to multibyte and back to unibyte reproduces the original unibyte text.
-The next three functions either return the argument @var{string}, or a
+The next two functions either return the argument @var{string}, or a
newly created string with no text properties.
-@defun string-make-unibyte string
-This function converts the text of @var{string} to unibyte
-representation, if it isn't already, and returns the result. If
-@var{string} is a unibyte string, it is returned unchanged. Multibyte
-character codes are converted to unibyte according to
-@code{nonascii-translation-table} or, if that is @code{nil}, using
-@code{nonascii-insert-offset}. If the lookup in the translation table
-fails, this function takes just the low 8 bits of each character.
-@end defun
-
-@defun string-make-multibyte string
-This function converts the text of @var{string} to multibyte
-representation, if it isn't already, and returns the result. If
-@var{string} is a multibyte string or consists entirely of
-@acronym{ASCII} characters, it is returned unchanged. In particular,
-if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
-string is unibyte. (When the characters are all @acronym{ASCII},
-Emacs primitives will treat the string the same way whether it is
-unibyte or multibyte.) If @var{string} is unibyte and contains
-non-@acronym{ASCII} characters, the function
-@code{unibyte-char-to-multibyte} is used to convert each unibyte
-character to a multibyte character.
-@end defun
-
@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
-of character codes as @var{string}. Unlike
-@code{string-make-multibyte}, this function unconditionally returns a
-multibyte string. If @var{string} is a multibyte string, it is
-returned unchanged.
+of characters as @var{string}. If @var{string} is a multibyte string,
+it is returned unchanged. The function assumes that @var{string}
+includes only @acronym{ASCII} characters and raw 8-bit bytes; the
+latter are converted to their multibyte representation corresponding
+to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
+(@pxref{Text Representations, codepoints}).
+@end defun
+
+@defun string-to-unibyte string
+This function returns a unibyte string containing the same sequence of
+characters as @var{string}. It signals an error if @var{string}
+contains a non-@acronym{ASCII} character. If @var{string} is a
+unibyte string, it is returned unchanged. Use this function for
+@var{string} arguments that contain only @acronym{ASCII} and eight-bit
+characters.
@end defun
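+
+ The following sketch exercises @code{string-to-multibyte} and
+@code{string-to-unibyte} on a one-byte unibyte string. The byte value
+@code{\300} (192) is an arbitrary choice:
+
+@example
+@group
+(aref (string-to-multibyte "\300") 0)   ; #x3FFF80 + (192 - 128)
+ @result{} 4194240
+@end group
+@group
+(equal (string-to-unibyte (string-to-multibyte "\300")) "\300")
+ @result{} t
+@end group
+@group
+(condition-case nil
+    (string-to-unibyte "\u00e9")        ; not ASCII, not a raw byte
+  (error 'failed))
+ @result{} failed
+@end group
+@end example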
@defun multibyte-char-to-unibyte char
-This convert the multibyte character @var{char} to a unibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+This converts the multibyte character @var{char} to a unibyte
+character, and returns that character. If @var{char} is neither
+@acronym{ASCII} nor eight-bit, the function returns -1.
@end defun
@defun unibyte-char-to-multibyte char
This converts the unibyte character @var{char} to a multibyte
-character, based on @code{nonascii-translation-table} and
-@code{nonascii-insert-offset}.
+character, assuming @var{char} is either @acronym{ASCII} or a raw 8-bit
+byte.
@end defun
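+
+ As a hedged illustration of the two character-level conversions, the
+sketch below uses the raw-byte codepoint @code{#x3FFFC0} (byte 192)
+and the euro sign, both picked arbitrarily:
+
+@example
+@group
+(unibyte-char-to-multibyte ?a)          ; ASCII is unchanged
+ @result{} 97
+@end group
+@group
+(multibyte-char-to-unibyte #x3FFFC0)    ; eight-bit character
+ @result{} 192
+@end group
+@group
+(multibyte-char-to-unibyte ?\u20ac)     ; neither ASCII nor eight-bit
+ @result{} -1
+@end group
+@end example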
@node Selecting a Representation
is @code{nil}, the buffer becomes unibyte.
This function leaves the buffer contents unchanged when viewed as a
-sequence of bytes. As a consequence, it can change the contents viewed
-as characters; a sequence of two bytes which is treated as one character
-in multibyte representation will count as two characters in unibyte
-representation. Character codes 128 through 159 are an exception. They
-are represented by one byte in a unibyte buffer, but when the buffer is
-set to multibyte, they are converted to two-byte sequences, and vice
-versa.
+sequence of bytes. As a consequence, it can change the contents
+viewed as characters; for instance, a sequence of three bytes which is
+treated as one character in multibyte representation will count as
+three characters in unibyte representation. Eight-bit characters
+representing raw bytes are an exception. They are represented by one
+byte in a unibyte buffer, but when the buffer is set to multibyte,
+they are converted to two-byte sequences, and vice versa.
This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
@end defun
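+
+ A minimal sketch of the effect of @code{set-buffer-multibyte} on the
+character count. It assumes a temporary buffer that starts out
+multibyte and holds a single three-byte character:
+
+@example
+@group
+(with-temp-buffer
+  (insert "\u20ac")               ; one character, three bytes
+  (let ((before (buffer-size)))
+    (set-buffer-multibyte nil)    ; same bytes, now three characters
+    (list before (buffer-size))))
+ @result{} (1 3)
+@end group
+@end example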
@defun string-as-unibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each byte as a character. This means that the value may have
-more characters than @var{string} has.
-
-If @var{string} is already a unibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
-text properties. If @var{string} is multibyte, any characters it
-contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
-are converted to the corresponding single byte.
+If @var{string} is already a unibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each byte as a separate character
+(so that the value may have more characters than @var{string}); as an
+exception, each eight-bit character representing a raw byte is
+converted into a single byte. The newly-created string contains no
+text properties.
@end defun
@defun string-as-multibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character. This means that the
-value may have fewer characters than @var{string} has.
-
-If @var{string} is already a multibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
-text properties. If @var{string} is unibyte and contains any individual
-8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
-the corresponding multibyte character of charset @code{eight-bit-control}
-or @code{eight-bit-graphic}.
+If @var{string} is a multibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each multibyte sequence as one
+character. This means that the value may have fewer characters than
+@var{string} has. If a byte sequence in @var{string} is invalid as a
+multibyte representation of a single character, each byte in the
+sequence is treated as a raw 8-bit byte. The newly-created string
+contains no text properties.
@end defun
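+
+ The sketch below shows @code{string-as-unibyte} and
+@code{string-as-multibyte} reinterpreting the same two bytes. Later
+Emacs releases mark both functions obsolete, so treat it as an
+illustration only:
+
+@example
+@group
+(length (string-as-unibyte "\u00e9"))       ; two separate bytes
+ @result{} 2
+@end group
+@group
+(length (string-as-multibyte "\303\251"))   ; one character
+ @result{} 1
+@end group
+@group
+(equal (string-as-multibyte "\303\251") "\u00e9")
+ @result{} t
+@end group
+@end example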
@node Character Codes
@section Character Codes
@cindex character codes
- The unibyte and multibyte text representations use different character
-codes. The valid character codes for unibyte representation range from
-0 to 255---the values that can fit in one byte. The valid character
-codes for multibyte representation range from 0 to 524287, but not all
-values in that range are valid. The values 128 through 255 are not
-entirely proper in multibyte text, but they can occur if you do explicit
-encoding and decoding (@pxref{Explicit Encoding}). Some other character
-codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes
-0 through 127 are completely legitimate in both representations.
-
-@defun char-valid-p charcode &optional genericp
-This returns @code{t} if @var{charcode} is valid (either for unibyte
-text or for multibyte text).
+ The unibyte and multibyte text representations use different
+character codes. The valid character codes for unibyte representation
+range from 0 to @code{#xFF} (255)---the values that can fit in one
+byte. The valid character codes for multibyte representation range
+from 0 to @code{#x3FFFFF}. In this code space, values 0 through
+@code{#x7F} (127) are for @acronym{ASCII} characters, and values
+@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
+non-@acronym{ASCII} characters.
+
+ Emacs character codes are a superset of the Unicode standard.
+Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
+characters of the same codepoint; values @code{#x110000} (1114112)
+through @code{#x3FFF7F} (4194175) represent characters that are not
+unified with Unicode; and values @code{#x3FFF80} (4194176) through
+@code{#x3FFFFF} (4194303) represent eight-bit raw bytes.
+
+@defun characterp charcode
+This returns @code{t} if @var{charcode} is a valid character, and
+@code{nil} otherwise.
@example
-(char-valid-p 65)
+@group
+(characterp 65)
@result{} t
-(char-valid-p 256)
+@end group
+@group
+(characterp 4194303)
+ @result{} t
+@end group
+@group
+(characterp 4194304)
@result{} nil
-(char-valid-p 2248)
+@end group
+@end example
+@end defun
+
+@cindex maximum value of character codepoint
+@cindex codepoint, largest value
+@defun max-char
+This function returns the largest value that a valid character
+codepoint can have.
+
+@example
+@group
+(characterp (max-char))
@result{} t
+@end group
+@group
+(characterp (1+ (max-char)))
+ @result{} nil
+@end group
@end example
+@end defun
-If the optional argument @var{genericp} is non-@code{nil}, this
-function also returns @code{t} if @var{charcode} is a generic
-character (@pxref{Splitting Characters}).
+@defun get-byte &optional pos string
+This function returns the byte at character position @var{pos} in the
+current buffer. If the current buffer is unibyte, this is literally
+the byte at that position. If the buffer is multibyte, byte values of
+@acronym{ASCII} characters are the same as character codepoints,
+whereas eight-bit raw bytes are converted to their 8-bit codes. The
+function signals an error if the character at @var{pos} is
+non-@acronym{ASCII}.
+
+The optional argument @var{string} means to get a byte value from that
+string instead of the current buffer.
@end defun
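+
+ For illustration, this sketch calls @code{get-byte} on a temporary
+multibyte buffer holding an @acronym{ASCII} letter and one raw byte,
+and then on a string whose only character is non-@acronym{ASCII}. Note
+that positions in a string count from zero:
+
+@example
+@group
+(with-temp-buffer
+  (insert "a" (string-to-multibyte "\300"))
+  (list (get-byte 1) (get-byte 2)))
+ @result{} (97 192)
+@end group
+@group
+(condition-case nil
+    (get-byte 0 "\u00e9")         ; non-ASCII character
+  (error 'failed))
+ @result{} failed
+@end group
+@end example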
-@node Character Sets
-@section Character Sets
-@cindex character sets
+@node Character Properties
+@section Character Properties
+@cindex character properties
+A @dfn{character property} is a named attribute of a character that
+specifies how the character behaves and how it should be handled
+during text processing and display. Thus, character properties are an
+important part of specifying the character's semantics.
+
+ On the whole, Emacs follows the Unicode Standard in its implementation
+of character properties. In particular, Emacs supports the
+@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
+Model}, and the Emacs character property database is derived from the
+Unicode Character Database (@acronym{UCD}). See the
+@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
+Properties chapter of the Unicode Standard}, for a detailed
+description of Unicode character properties and their meaning. This
+section assumes you are already familiar with that chapter of the
+Unicode Standard, and want to apply that knowledge to Emacs Lisp
+programs.
- Emacs classifies characters into various @dfn{character sets}, each of
-which has a name which is a symbol. Each character belongs to one and
-only one character set.
+ In Emacs, each property has a name, which is a symbol, and a set of
+possible values, whose types depend on the property; if a character
+does not have a certain property, the value is @code{nil}. As a
+general rule, the names of character properties in Emacs are produced
+from the corresponding Unicode properties by downcasing them and
+replacing each @samp{_} character with a dash @samp{-}. For example,
+@code{Canonical_Combining_Class} becomes
+@code{canonical-combining-class}. However, sometimes we shorten the
+names to make their use easier.
- In general, there is one character set for each distinct script. For
-example, @code{latin-iso8859-1} is one character set,
-@code{greek-iso8859-7} is another, and @code{ascii} is another. An
-Emacs character set can hold at most 9025 characters; therefore, in some
-cases, characters that would logically be grouped together are split
-into several character sets. For example, one set of Chinese
-characters, generally known as Big 5, is divided into two Emacs
-character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
+ Here is the full list of value types for all the character
+properties that Emacs knows about:
- @acronym{ASCII} characters are in character set @code{ascii}. The
-non-@acronym{ASCII} characters 128 through 159 are in character set
-@code{eight-bit-control}, and codes 160 through 255 are in character set
-@code{eight-bit-graphic}.
+@table @code
+@item name
+This property corresponds to the Unicode @code{Name} property. The
+value is a string consisting of upper-case Latin letters A to Z,
+digits, spaces, and hyphen @samp{-} characters.
+
+@cindex unicode general category
+@item general-category
+This property corresponds to the Unicode @code{General_Category}
+property. The value is a symbol whose name is a 2-letter abbreviation
+of the character's classification.
+
+@item canonical-combining-class
+Corresponds to the Unicode @code{Canonical_Combining_Class} property.
+The value is an integer number.
+
+@item bidi-class
+Corresponds to the Unicode @code{Bidi_Class} property. The value is a
+symbol whose name is the Unicode @dfn{directional type} of the
+character.
+
+@item decomposition
+Corresponds to the Unicode @code{Decomposition_Type} and
+@code{Decomposition_Value} properties. The value is a list, whose
+first element may be a symbol representing a compatibility formatting
+tag, such as @code{small}@footnote{
+Note that the Unicode spec writes these tag names inside
+@samp{<..>} brackets. The tag names in Emacs do not include the
+brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
+@samp{small}.
+}; the other elements are characters that give the compatibility
+decomposition sequence of this character.
+
+@item decimal-digit-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Decimal}. The value is an
+integer number.
+
+@item digit-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Digit}. The value is
+an integer number. Examples of such characters include compatibility
+subscript and superscript digits, for which the value is the
+corresponding number.
+
+@item numeric-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
+this property is an integer or a floating-point number. Examples of
+characters that have this property include fractions, subscripts,
+superscripts, Roman numerals, currency numerators, and encircled
+numbers. For example, the value of this property for the character
+@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}.
+
+@item mirrored
+Corresponds to the Unicode @code{Bidi_Mirrored} property. The value
+of this property is a symbol, either @code{Y} or @code{N}.
+
+@item old-name
+Corresponds to the Unicode @code{Unicode_1_Name} property. The value
+is a string.
+
+@item iso-10646-comment
+Corresponds to the Unicode @code{ISO_Comment} property. The value is
+a string.
+
+@item uppercase
+Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
+The value of this property is a single character.
+
+@item lowercase
+Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
+The value of this property is a single character.
+
+@item titlecase
+Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
+@dfn{Title case} is a special form of a character used when the first
+character of a word needs to be capitalized. The value of this
+property is a single character.
+@end table
-@defun charsetp object
-Returns @code{t} if @var{object} is a symbol that names a character set,
-@code{nil} otherwise.
+@defun get-char-code-property char propname
+This function returns the value of @var{char}'s @var{propname} property.
+
+@example
+@group
+(get-char-code-property ?\s 'general-category) ; space
+ @result{} Zs
+@end group
+@group
+(get-char-code-property ?1 'general-category)
+ @result{} Nd
+@end group
+@group
+(get-char-code-property ?\u2084 'digit-value) ; subscript 4
+ @result{} 4
+@end group
+@group
+(get-char-code-property ?\u2155 'numeric-value) ; one fifth
+ @result{} 0.2
+@end group
+@group
+(get-char-code-property ?\u2163 'numeric-value) ; Roman IV
+ @result{} 4
+@end group
+@end example
@end defun
-@defvar charset-list
-The value is a list of all defined character set names.
-@end defvar
+@defun char-code-property-description prop value
+This function returns the description string of property @var{prop}'s
+@var{value}, or @code{nil} if @var{value} has no description.
-@defun charset-list
-This function returns the value of @code{charset-list}. It is only
-provided for backward compatibility.
+@example
+@group
+(char-code-property-description 'general-category 'Zs)
+ @result{} "Separator, Space"
+@end group
+@group
+(char-code-property-description 'general-category 'Nd)
+ @result{} "Number, Decimal Digit"
+@end group
+@group
+(char-code-property-description 'numeric-value '1/5)
+ @result{} nil
+@end group
+@end example
@end defun
-@defun char-charset character
-This function returns the name of the character set that @var{character}
-belongs to, or the symbol @code{unknown} if @var{character} is not a
-valid character.
+@defun put-char-code-property char propname value
+This function stores @var{value} as the value of the property
+@var{propname} for the character @var{char}.
@end defun
-@defun charset-plist charset
-This function returns the charset property list of the character set
-@var{charset}. Although @var{charset} is a symbol, this is not the same
-as the property list of that symbol. Charset properties are used for
-special purposes within Emacs.
-@end defun
+@defvar unicode-category-table
+The value of this variable is a char-table (@pxref{Char-Tables}) that
+specifies, for each character, its Unicode @code{General_Category}
+property as a symbol.
+@end defvar
-@deffn Command list-charset-chars charset
-This command displays a list of characters in the character set
-@var{charset}.
-@end deffn
+@defvar char-script-table
+The value of this variable is a char-table that specifies, for each
+character, a symbol whose name is the script to which the character
+belongs, according to the Unicode Standard classification of the
+Unicode code space into script-specific blocks. This char-table has a
+single extra slot whose value is the list of all script symbols.
+@end defvar
-@node Chars and Bytes
-@section Characters and Bytes
-@cindex bytes and characters
+@defvar char-width-table
+The value of this variable is a char-table that specifies the width of
+each character in columns that it will occupy on the screen.
+@end defvar
-@cindex introduction sequence (of character)
-@cindex dimension (of character set)
- In multibyte representation, each character occupies one or more
-bytes. Each character set has an @dfn{introduction sequence}, which is
-normally one or two bytes long. (Exception: the @code{ascii} character
-set and the @code{eight-bit-graphic} character set have a zero-length
-introduction sequence.) The introduction sequence is the beginning of
-the byte sequence for any character in the character set. The rest of
-the character's bytes distinguish it from the other characters in the
-same character set. Depending on the character set, there are either
-one or two distinguishing bytes; the number of such bytes is called the
-@dfn{dimension} of the character set.
+@defvar printable-chars
+The value of this variable is a char-table that specifies, for each
+character, whether it is printable or not. That is, if evaluating
+@code{(aref printable-chars char)} results in @code{t}, the character
+is printable, and if it results in @code{nil}, it is not.
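+
+For instance, with the default table:
+
+@example
+@group
+(aref printable-chars ?a)
+     @result{} t
+;; The NUL control character is not printable.
+(aref printable-chars 0)
+     @result{} nil
+@end group
+@end example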
+@end defvar
-@defun charset-dimension charset
-This function returns the dimension of @var{charset}; at present, the
-dimension is always 1 or 2.
-@end defun
+@node Character Sets
+@section Character Sets
+@cindex character sets
+
+@cindex charset
+@cindex coded character set
+An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
+in which each character is assigned a numeric code point. (The
+Unicode Standard calls this a @dfn{coded character set}.) Each Emacs
+charset has a name which is a symbol. A single character can belong
+to any number of different character sets, but it will generally have
+a different code point in each charset. Examples of character sets
+include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
+@code{windows-1255}. The code point assigned to a character in a
+charset is usually different from its code point used in Emacs buffers
+and strings.
+
+@cindex @code{emacs}, a charset
+@cindex @code{unicode}, a charset
+@cindex @code{eight-bit}, a charset
+ Emacs defines several special character sets. The character set
+@code{unicode} includes all the characters whose Emacs code points are
+in the range @code{0..#x10FFFF}. The character set @code{emacs}
+includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
+Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
+Emacs uses it to represent raw bytes encountered in text.
-@defun charset-bytes charset
-This function returns the number of bytes used to represent a character
-in character set @var{charset}.
+@defun charsetp object
+This function returns @code{t} if @var{object} is a symbol that names
+a character set, and @code{nil} otherwise.
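+
+For example (the symbol @code{no-such-charset} is hypothetical and
+assumed not to name any charset):
+
+@example
+@group
+(charsetp 'ascii)
+     @result{} t
+(charsetp 'no-such-charset)
+     @result{} nil
+@end group
+@end example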
@end defun
- This is the simplest way to determine the byte length of a character
-set's introduction sequence:
+@defvar charset-list
+The value is a list of all defined character set names.
+@end defvar
-@example
-(- (charset-bytes @var{charset})
- (charset-dimension @var{charset}))
-@end example
+@defun charset-priority-list &optional highestp
+This function returns a list of all defined character sets ordered by
+their priority. If @var{highestp} is non-@code{nil}, the function
+returns a single character set of the highest priority.
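+
+The actual priority order depends on the language environment, so the
+following sketch only checks how the two calling conventions relate:
+
+@example
+@group
+;; The single charset returned with HIGHESTP non-nil is the
+;; first element of the full priority list.
+(eq (charset-priority-list t)
+    (car (charset-priority-list)))
+     @result{} t
+@end group
+@end example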
+@end defun
-@node Splitting Characters
-@section Splitting Characters
-@cindex character as bytes
+@defun set-charset-priority &rest charsets
+This function makes @var{charsets} the highest priority character sets.
+@end defun
- The functions in this section convert between characters and the byte
-values used to represent them. For most purposes, there is no need to
-be concerned with the sequence of bytes used to represent a character,
-because Emacs translates automatically when necessary.
+@defun char-charset character &optional restriction
+This function returns the name of the character set of highest
+priority that @var{character} belongs to. @acronym{ASCII} characters
+are an exception: for them, this function always returns @code{ascii}.
-@defun split-char character
-Return a list containing the name of the character set of
-@var{character}, followed by one or two byte values (integers) which
-identify @var{character} within that character set. The number of byte
-values is the character set's dimension.
+If @var{restriction} is non-@code{nil}, it should be a list of
+charsets to search. Alternatively, it can be a coding system, in
+which case the returned charset must be supported by that coding
+system (@pxref{Coding Systems}).
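+
+A short sketch of both calling conventions, assuming the
+@code{iso-8859-1} charset that Emacs defines by default:
+
+@example
+@group
+(char-charset ?a)
+     @result{} ascii
+@end group
+@group
+;; Restrict the search to one charset that contains
+;; character #xE9 (LATIN SMALL LETTER E WITH ACUTE).
+(char-charset #xE9 '(iso-8859-1))
+     @result{} iso-8859-1
+@end group
+@end example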
+@end defun
-If @var{character} is invalid as a character code, @code{split-char}
-returns a list consisting of the symbol @code{unknown} and @var{character}.
+@defun charset-plist charset
+This function returns the property list of the character set
+@var{charset}. Although @var{charset} is a symbol, this is not the
+same as the property list of that symbol. Charset properties include
+important information about the charset, such as its documentation
+string, short name, etc.
+@end defun
-@example
-(split-char 2248)
- @result{} (latin-iso8859-1 72)
-(split-char 65)
- @result{} (ascii 65)
-(split-char 128)
- @result{} (eight-bit-control 128)
-@end example
+@defun put-charset-property charset propname value
+This function sets the @var{propname} property of @var{charset} to the
+given @var{value}.
@end defun
-@cindex generate characters in charsets
-@defun make-char charset &optional code1 code2
-This function returns the character in character set @var{charset} whose
-position codes are @var{code1} and @var{code2}. This is roughly the
-inverse of @code{split-char}. Normally, you should specify either one
-or both of @var{code1} and @var{code2} according to the dimension of
-@var{charset}. For example,
+@defun get-charset-property charset propname
+This function returns the value of @var{charset}'s property
+@var{propname}.
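+
+As a sketch, the property name @code{my-note} below is hypothetical
+and is used only to show that a value stored with
+@code{put-charset-property} can be read back:
+
+@example
+@group
+(put-charset-property 'ascii 'my-note "7-bit codes")
+(get-charset-property 'ascii 'my-note)
+     @result{} "7-bit codes"
+@end group
+@end example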
+@end defun
-@example
-(make-char 'latin-iso8859-1 72)
- @result{} 2248
-@end example
+@deffn Command list-charset-chars charset
+This command displays a list of characters in the character set
+@var{charset}.
+@end deffn
-Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
-before they are used to index @var{charset}. Thus you may use, for
-instance, an ISO 8859 character code rather than subtracting 128, as
-is necessary to index the corresponding Emacs charset.
+ Emacs can convert between its internal representation of a character
+and the character's codepoint in a specific charset. The following
+two functions support these conversions.
+
+@c FIXME: decode-char and encode-char accept and ignore an additional
+@c argument @var{restriction}. When that argument actually makes a
+@c difference, it should be documented here.
+@defun decode-char charset code-point
+This function decodes a character that is assigned a @var{code-point}
+in @var{charset}, to the corresponding Emacs character, and returns
+it. If @var{charset} doesn't contain a character of that code point,
+the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp
+integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
+specified as a cons cell @code{(@var{high} . @var{low})}, where
+@var{low} are the lower 16 bits of the value and @var{high} are the
+high 16 bits.
@end defun
-@cindex generic characters
- If you call @code{make-char} with no @var{byte-values}, the result is
-a @dfn{generic character} which stands for @var{charset}. A generic
-character is an integer, but it is @emph{not} valid for insertion in the
-buffer as a character. It can be used in @code{char-table-range} to
-refer to the whole character set (@pxref{Char-Tables}).
-@code{char-valid-p} returns @code{nil} for generic characters.
-For example:
-
-@example
-(make-char 'latin-iso8859-1)
- @result{} 2176
-(char-valid-p 2176)
- @result{} nil
-(char-valid-p 2176 t)
- @result{} t
-(split-char 2176)
- @result{} (latin-iso8859-1 0)
-@end example
+@defun encode-char char charset
+This function returns the code point assigned to the character
+@var{char} in @var{charset}. If the result does not fit in a Lisp
+integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
+that fits the second argument of @code{decode-char} above. If
+@var{charset} doesn't have a codepoint for @var{char}, the value is
+@code{nil}.
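+
+As a rough illustration, assuming the @code{iso-8859-1} and
+@code{ascii} charsets that Emacs defines by default, the two
+functions invert each other for a character the charset contains:
+
+@example
+@group
+;; Code point #xE9 of ISO-8859-1 is LATIN SMALL LETTER E WITH
+;; ACUTE, whose Emacs character code is also #xE9 (233).
+(decode-char 'iso-8859-1 #xE9)
+     @result{} 233
+(encode-char 233 'iso-8859-1)
+     @result{} 233
+@end group
+@group
+;; The ascii charset has no code point for that character.
+(encode-char 233 'ascii)
+     @result{} nil
+@end group
+@end example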
+@end defun
-The character sets @code{ascii}, @code{eight-bit-control}, and
-@code{eight-bit-graphic} don't have corresponding generic characters. If
-@var{charset} is one of them and you don't supply @var{code1},
-@code{make-char} returns the character code corresponding to the
-smallest code in @var{charset}.
+ The following function comes in handy for applying a certain
+function to all or part of the characters in a charset:
+
+@defun map-charset-chars function charset &optional arg from-code to-code
+This function calls @var{function} for characters in @var{charset}.
+@var{function} is called with two arguments. The first one is a cons
+cell @code{(@var{from} . @var{to})}, where @var{from} and @var{to}
+indicate a range of characters contained in @var{charset}. The second
+argument passed to @var{function} is @var{arg}.
+
+By default, the range of codepoints passed to @var{function} includes
+all the characters in @var{charset}, but optional arguments
+@var{from-code} and @var{to-code} limit that to the range of
+characters between these two codepoints of @var{charset}. If either
+of them is @code{nil}, it defaults to the first or last codepoint of
+@var{charset}, respectively.
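+
+The following sketch collects the ranges reported for the
+@code{ascii} charset. The variable name @code{ranges} is chosen only
+for illustration, and no assumption is made here about how many
+ranges a given charset yields.
+
+@example
+@group
+(let (ranges)
+  (map-charset-chars
+   ;; Called with a cons (FROM . TO) and ARG (here nil).
+   (lambda (range arg) (push range ranges))
+   'ascii)
+  (nreverse ranges))
+@end group
+@end example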
+@end defun
@node Scanning Charsets
@section Scanning for Character Sets
- Sometimes it is useful to find out which character sets appear in a
-part of a buffer or a string. One use for this is in determining which
-coding systems (@pxref{Coding Systems}) are capable of representing all
-of the text in question.
+ Sometimes it is useful to find out which character set a particular
+character belongs to. One use for this is in determining which coding
+systems (@pxref{Coding Systems}) are capable of representing all of
+the text in question; another is to determine the font(s) for
+displaying that text.
@defun charset-after &optional pos
-This function return the charset of a character in the current buffer
-at position @var{pos}. If @var{pos} is omitted or @code{nil}, it
-defaults to the current value of point. If @var{pos} is out of range,
-the value is @code{nil}.
+This function returns the charset of highest priority containing the
+character at position @var{pos} in the current buffer. If @var{pos}
+is omitted or @code{nil}, it defaults to the current value of point.
+If @var{pos} is out of range, the value is @code{nil}.
@end defun
@defun find-charset-region beg end &optional translation
-This function returns a list of the character sets that appear in the
-current buffer between positions @var{beg} and @var{end}.
+This function returns a list of the character sets of highest priority
+that contain characters in the current buffer between positions
+@var{beg} and @var{end}.
-The optional argument @var{translation} specifies a translation table to
-be used in scanning the text (@pxref{Translation of Characters}). If it
-is non-@code{nil}, then each character in the region is translated
+The optional argument @var{translation} specifies a translation table
+to use for scanning the text (@pxref{Translation of Characters}). If
+it is non-@code{nil}, then each character in the region is translated
through this table, and the value returned describes the translated
characters instead of the characters actually in the buffer.
@end defun
@defun find-charset-string string &optional translation
-This function returns a list of the character sets that appear in the
-string @var{string}. It is just like @code{find-charset-region}, except
-that it applies to the contents of @var{string} instead of part of the
-current buffer.
+This function returns a list of character sets of highest priority
+that contain characters in @var{string}. It is just like
+@code{find-charset-region}, except that it applies to the contents of
+@var{string} instead of part of the current buffer.
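+
+For instance, a string that contains only @acronym{ASCII} characters
+yields a one-element list:
+
+@example
+@group
+(find-charset-string "abc")
+     @result{} (ascii)
+@end group
+@end example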
@end defun
@node Translation of Characters
@cindex character translation tables
@cindex translation tables
- A @dfn{translation table} is a char-table that specifies a mapping
-of characters into characters. These tables are used in encoding and
-decoding, and for other purposes. Some coding systems specify their
-own particular translation tables; there are also default translation
-tables which apply to all other coding systems.
+ A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
+specifies a mapping of characters into characters. These tables are
+used in encoding and decoding, and for other purposes. Some coding
+systems specify their own particular translation tables; there are
+also default translation tables which apply to all other coding
+systems.
- For instance, the coding-system @code{utf-8} has a translation table
-that maps characters of various charsets (e.g.,
-@code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
-it can encode Latin-2 characters into UTF-8. Meanwhile,
-@code{unify-8859-on-decoding-mode} operates by specifying
-@code{standard-translation-table-for-decode} to translate
-Latin-@var{x} characters into corresponding Unicode characters.
+ A translation table has two extra slots. The first is either
+@code{nil} or a translation table that performs the reverse
+translation; the second is the maximum number of characters to look up
+for translating sequences of characters (see the description of
+@code{make-translation-table-from-alist} below).
@defun make-translation-table &rest translations
This function returns a translation table based on the argument
and if a previous form already translates @var{to} to some other
character, say @var{to-alt}, @var{from} is also translated to
@var{to-alt}.
+@end defun
-You can also map one whole character set into another character set with
-the same dimension. To do this, you specify a generic character (which
-designates a character set) for @var{from} (@pxref{Splitting Characters}).
-In this case, if @var{to} is also a generic character, its character
-set should have the same dimension as @var{from}'s. Then the
-translation table translates each character of @var{from}'s character
-set into the corresponding character of @var{to}'s character set. If
-@var{from} is a generic character and @var{to} is an ordinary
-character, then the translation table translates every character of
-@var{from}'s character set into @var{to}.
-@end defun
-
- In decoding, the translation table's translations are applied to the
-characters that result from ordinary decoding. If a coding system has
-property @code{translation-table-for-decode}, that specifies the
-translation table to use. (This is a property of the coding system,
-as returned by @code{coding-system-get}, not a property of the symbol
-that is the coding system's name. @xref{Coding System Basics,, Basic
-Concepts of Coding Systems}.) Otherwise, if
-@code{standard-translation-table-for-decode} is non-@code{nil},
-decoding uses that table.
-
- In encoding, the translation table's translations are applied to the
-characters in the buffer, and the result of translation is actually
-encoded. If a coding system has property
-@code{translation-table-for-encode}, that specifies the translation
-table to use. Otherwise the variable
-@code{standard-translation-table-for-encode} specifies the translation
-table.
+ During decoding, the translation table's translations are applied to
+the characters that result from ordinary decoding. If a coding system
+has the property @code{:decode-translation-table}, that specifies the
+translation table to use, or a list of translation tables to apply in
+sequence. (This is a property of the coding system, as returned by
+@code{coding-system-get}, not a property of the symbol that is the
+coding system's name. @xref{Coding System Basics,, Basic Concepts of
+Coding Systems}.) Finally, if
+@code{standard-translation-table-for-decode} is non-@code{nil}, the
+resulting characters are translated by that table.
+
+ During encoding, the translation table's translations are applied to
+the characters in the buffer, and the result of translation is
+actually encoded. If a coding system has property
+@code{:encode-translation-table}, that specifies the translation table
+to use, or a list of translation tables to apply in sequence. In
+addition, if the variable @code{standard-translation-table-for-encode}
+is non-@code{nil}, it specifies the translation table to use for
+translating the result.
@defvar standard-translation-table-for-decode
-This is the default translation table for decoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for decoding. If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
@end defvar
@defvar standard-translation-table-for-encode
-This is the default translation table for encoding, for
-coding systems that don't specify any other translation table.
+This is the default translation table for encoding. If a coding
+system specifies its own translation tables, the table that is the
+value of this variable, if non-@code{nil}, is applied after them.
@end defvar
+@defvar translation-table-for-input
+Self-inserting characters are translated through this translation
+table before they are inserted. Search commands also translate their
+input through this table, so they can compare more reliably with
+what's in the buffer.
+
+This variable automatically becomes buffer-local when set.
+@end defvar
+
+@defun make-translation-table-from-vector vec
+This function returns a translation table made from @var{vec}, an
+array of 256 elements that maps bytes (values 0 through #xFF) to
+characters. Elements may be @code{nil} for untranslated bytes. The
+returned table has a translation table for reverse mapping in the
+first extra slot, and the value @code{1} in the second extra slot.
+
+This function provides an easy way to make a private coding system
+that maps each byte to a specific character. You can specify the
+returned table and the reverse translation table using the properties
+@code{:decode-translation-table} and @code{:encode-translation-table}
+respectively in the @var{props} argument to
+@code{define-coding-system}.
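+
+Here is a minimal sketch. The choice of remapping byte #x80 to the
+Euro sign is an arbitrary assumption for illustration, and the vector
+is filled with the identity mapping rather than relying on @code{nil}
+elements:
+
+@example
+@group
+(let ((vec (make-vector 256 nil))
+      table)
+  ;; Start from the identity mapping, then remap byte #x80
+  ;; to the Euro sign (U+20AC).
+  (dotimes (i 256) (aset vec i i))
+  (aset vec #x80 #x20AC)
+  (setq table (make-translation-table-from-vector vec))
+  (list (aref table #x80)
+        ;; The reverse table is in the first extra slot.
+        (aref (char-table-extra-slot table 0) #x20AC)))
+     @result{} (8364 128)
+@end group
+@end example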
+@end defun
+
+@defun make-translation-table-from-alist alist
+This function is similar to @code{make-translation-table} but returns
+a complex translation table rather than a simple one-to-one mapping.
+Each element of @var{alist} is of the form @code{(@var{from}
+. @var{to})}, where @var{from} and @var{to} are either characters or
+vectors specifying a sequence of characters. If @var{from} is a
+character, that character is translated to @var{to} (i.e.@: to a
+character or a character sequence). If @var{from} is a vector of
+characters, that sequence is translated to @var{to}. The returned
+table has a translation table for reverse mapping in the first extra
+slot, and the maximum length of all the @var{from} character sequences
+in the second extra slot.
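+
+As a small sketch covering only the simplest case, a
+character-to-character entry can be looked up directly with
+@code{aref}:
+
+@example
+@group
+(let ((table (make-translation-table-from-alist
+              '((?a . ?b)))))
+  ;; The entry for ?a is the character ?b.
+  (eq (aref table ?a) ?b))
+     @result{} t
+@end group
+@end example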
+@end defun
+
@node Coding Systems
@section Coding Systems
@subsection Basic Concepts of Coding Systems
@cindex character code conversion
- @dfn{Character code conversion} involves conversion between the encoding
-used inside Emacs and some other encoding. Emacs supports many
-different encodings, in that it can convert to and from them. For
-example, it can convert text to or from encodings such as Latin 1, Latin
-2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
-cases, Emacs supports several alternative encodings for the same
-characters; for example, there are three coding systems for the Cyrillic
-(Russian) alphabet: ISO, Alternativnyj, and KOI8.
-
- Most coding systems specify a particular character code for
-conversion, but some of them leave the choice unspecified---to be chosen
-heuristically for each file, based on the data.
+ @dfn{Character code conversion} involves conversion between the
+internal representation of characters used inside Emacs and some other
+encoding. Emacs supports many different encodings, in that it can
+convert to and from them. For example, it can convert text to or from
+encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and
+several variants of ISO 2022. In some cases, Emacs supports several
+alternative encodings for the same characters; for example, there are
+three coding systems for the Cyrillic (Russian) alphabet: ISO,
+Alternativnyj, and KOI8.
+
+ Every coding system specifies a particular set of character code
+conversions, but the coding system @code{undecided} is special: it
+leaves the choice unspecified, to be chosen heuristically for each
+file, based on the file's data.
In general, a coding system doesn't guarantee roundtrip identity:
decoding a byte sequence using coding system, then encoding the
resulting text in the same coding system, can produce a different byte
-sequence. However, the following coding systems do guarantee that the
-byte sequence will be the same as what you originally decoded:
+sequence. But some coding systems do guarantee that the byte sequence
+will be the same as what you originally decoded. Here are a few
+examples:
@quotation
-chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
-greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
-iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
-japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
+iso-8859-1, utf-8, big5, shift_jis, euc-jp
@end quotation
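+
+ As an illustration of such a round trip, the following sketch
+decodes the two-byte UTF-8 sequence for character #xE9 and encodes the
+result again; the variable name @code{bytes} is used only for this
+example:
+
+@example
+@group
+(let ((bytes "\303\251"))
+  ;; Decode the unibyte string, then encode the text again.
+  (string= (encode-coding-string
+            (decode-coding-string bytes 'utf-8)
+            'utf-8)
+           bytes))
+     @result{} t
+@end group
+@end example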
Encoding buffer text and then decoding the result can also fail to
-reproduce the original text. For instance, if you encode Latin-2
-characters with @code{utf-8} and decode the result using the same
-coding system, you'll get Unicode characters (of charset
-@code{mule-unicode-0100-24ff}). If you encode Unicode characters with
-@code{iso-latin-2} and decode the result with the same coding system,
-you'll get Latin-2 characters.
+reproduce the original text. For instance, if you encode a character
+with a coding system which does not support that character, the result
+is unpredictable, and thus decoding it using the same coding system
+may produce a different text. Currently, Emacs can't report errors
+that result from encoding unsupported characters.
@cindex EOL conversion
@cindex end-of-line conversion
@cindex line end conversion
- @dfn{End of line conversion} handles three different conventions used
-on various systems for representing end of line in files. The Unix
-convention is to use the linefeed character (also called newline). The
-DOS convention is to use a carriage-return and a linefeed at the end of
-a line. The Mac convention is to use just carriage-return.
+ @dfn{End of line conversion} handles three different conventions
+used on various systems for representing end of line in files. The
+Unix convention, used on GNU and Unix systems, is to use the linefeed
+character (also called newline). The DOS convention, used on
+MS-Windows and MS-DOS systems, is to use a carriage-return and a
+linefeed at the end of a line. The Mac convention is to use just
+carriage-return.
@cindex base coding system
@cindex variant coding system
well. Most base coding systems have three corresponding variants whose
names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
+@vindex raw-text@r{ coding system}
The coding system @code{raw-text} is special in that it prevents
-character code conversion, and causes the buffer visited with that
-coding system to be a unibyte buffer. It does not specify the
-end-of-line conversion, allowing that to be determined as usual by the
-data, and has the usual three variants which specify the end-of-line
-conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
-it specifies no conversion of either character codes or end-of-line.
-
- The coding system @code{emacs-mule} specifies that the data is
-represented in the internal Emacs encoding. This is like
-@code{raw-text} in that no code conversion happens, but different in
-that the result is multibyte data.
+character code conversion, and causes the buffer visited with this
+coding system to be a unibyte buffer. For historical reasons, you can
+save both unibyte and multibyte text with this coding system. When
+you use @code{raw-text} to encode multibyte text, it does perform one
+character code conversion: it converts eight-bit characters to their
+single-byte external representation. @code{raw-text} does not specify
+the end-of-line conversion, allowing that to be determined as usual by
+the data, and has the usual three variants which specify the
+end-of-line conversion.
+
+@vindex no-conversion@r{ coding system}
+@vindex binary@r{ coding system}
+ @code{no-conversion} (and its alias @code{binary}) is equivalent to
+@code{raw-text-unix}: it specifies no conversion of either character
+codes or end-of-line.
+
+@vindex emacs-internal@r{ coding system}
+@vindex utf-8-emacs@r{ coding system}
+ The coding system @code{utf-8-emacs} specifies that the data is
+represented in the internal Emacs encoding (@pxref{Text
+Representations}). This is like @code{raw-text} in that no code
+conversion happens, but different in that the result is multibyte
+data. The name @code{emacs-internal} is an alias for
+@code{utf-8-emacs}.
@defun coding-system-get coding-system property
This function returns the specified property of the coding system
@var{coding-system}. Most coding system properties exist for internal
-purposes, but one that you might find useful is @code{mime-charset}.
+purposes, but one that you might find useful is @code{:mime-charset}.
That property's value is the name used in MIME for the character coding
which this coding system can read and write. Examples:
@example
-(coding-system-get 'iso-latin-1 'mime-charset)
+(coding-system-get 'iso-latin-1 :mime-charset)
@result{} iso-8859-1
-(coding-system-get 'iso-2022-cn 'mime-charset)
+(coding-system-get 'iso-2022-cn :mime-charset)
@result{} iso-2022-cn
-(coding-system-get 'cyrillic-koi8 'mime-charset)
+(coding-system-get 'cyrillic-koi8 :mime-charset)
@result{} koi8-r
@end example
-The value of the @code{mime-charset} property is also defined
+The value of the @code{:mime-charset} property is also defined
as an alias for the coding system.
@end defun
+@defun coding-system-aliases coding-system
+This function returns the list of aliases of @var{coding-system}.
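+
+For example, since @code{binary} is an alias of @code{no-conversion},
+one would expect it to appear in the returned list. The exact
+contents and order of the list are not assumed here:
+
+@example
+@group
+(and (memq 'binary (coding-system-aliases 'no-conversion))
+     t)
+     @result{} t
+@end group
+@end example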
+@end defun
+
@node Encoding and I/O
@subsection Encoding and I/O
The principal purpose of coding systems is for use in reading and
-writing files. The function @code{insert-file-contents} uses
-a coding system for decoding the file data, and @code{write-region}
-uses one to encode the buffer contents.
+writing files. The function @code{insert-file-contents} uses a coding
+system to decode the file data, and @code{write-region} uses one to
+encode the buffer contents.
You can specify the coding system to use either explicitly
(@pxref{Specifying Coding Systems}), or implicitly using a default
Here are the Lisp facilities for working with coding systems:
+@cindex list all coding systems
@defun coding-system-list &optional base-only
This function returns a list of all coding system names (symbols). If
@var{base-only} is non-@code{nil}, the value includes only the
name or @code{nil}.
@end defun
+@cindex validity of coding system
+@cindex coding system, validity check
@defun check-coding-system coding-system
-This function checks the validity of @var{coding-system}.
-If that is valid, it returns @var{coding-system}.
-Otherwise it signals an error with condition @code{coding-system-error}.
+This function checks the validity of @var{coding-system}. If that is
+valid, it returns @var{coding-system}. If @var{coding-system} is
+@code{nil}, the function returns @code{nil}. For any other value, it
+signals an error whose @code{error-symbol} is @code{coding-system-error}
+(@pxref{Signaling Errors, signal}).
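+
+The three cases can be sketched as follows. The symbol
+@code{no-such-coding-system} is hypothetical and assumed not to name
+a coding system:
+
+@example
+@group
+(check-coding-system 'utf-8)
+     @result{} utf-8
+(check-coding-system nil)
+     @result{} nil
+@end group
+@group
+;; Catch the error signaled for an invalid name.
+(condition-case nil
+    (check-coding-system 'no-such-coding-system)
+  (coding-system-error 'invalid))
+     @result{} invalid
+@end group
+@end example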
@end defun
+@cindex eol type of coding system
@defun coding-system-eol-type coding-system
This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
conversion used by @var{coding-system}. If @var{coding-system}
eol conversion is set to match it (e.g., DOS-style CRLF format will
imply @code{dos} eol conversion). For encoding, the eol conversion is
taken from the appropriate default coding system (e.g.,
-@code{default-buffer-file-coding-system} for
+the default value of @code{buffer-file-coding-system} for
@code{buffer-file-coding-system}), or from the default eol conversion
appropriate for the underlying platform.
@end defun
+@cindex eol conversion of coding system
@defun coding-system-change-eol-conversion coding-system eol-type
This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @code{eol-type}.
@code{dos} and @code{mac}, respectively.
@end defun
+@cindex text conversion of coding system
@defun coding-system-change-text-conversion eol-coding text-coding
This function returns a coding system which uses the end-of-line
conversion of @var{eol-coding}, and the text conversion of
@code{undecided}, or one of its variants according to @var{eol-coding}.
@end defun
+@cindex safely encode region
+@cindex coding systems for encoding region
@defun find-coding-systems-region from to
This function returns a list of coding systems that could be used to
encode a text between @var{from} and @var{to}. All coding systems in
list @code{(undecided)}.
@end defun
+@cindex safely encode a string
+@cindex coding systems for encoding a string
@defun find-coding-systems-string string
This function returns a list of coding systems that could be used to
encode the text of @var{string}. All coding systems in the list can
@code{(undecided)}.
@end defun
+@cindex charset, coding systems to encode
+@cindex safely encode characters in a charset
@defun find-coding-systems-for-charsets charsets
This function returns a list of coding systems that could be used to
encode all the character sets in the list @var{charsets}.
@end defun
+@defun check-coding-systems-region start end coding-system-list
+This function checks whether coding systems in the list
+@var{coding-system-list} can encode all the characters in the region
+between @var{start} and @var{end}. If all of the coding systems in
+the list can encode the specified text, the function returns
+@code{nil}. If some coding systems cannot encode some of the
+characters, the value is an alist, each element of which has the form
+@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning
+that @var{coding-system1} cannot encode characters at buffer positions
+@var{pos1}, @var{pos2}, @enddots{}.
+
+@var{start} may be a string, in which case @var{end} is ignored and
+the returned value references string indices instead of buffer
+positions.
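+
+A brief sketch using string arguments. The first call returns
+@code{nil} because every listed coding system can encode plain
+@acronym{ASCII} text; the second returns a non-@code{nil} alist,
+since @code{iso-latin-1} cannot encode the Euro sign:
+
+@example
+@group
+(check-coding-systems-region "abc" nil '(utf-8 iso-latin-1))
+     @result{} nil
+@end group
+@group
+;; The value names iso-latin-1 and the offending string index.
+(check-coding-systems-region "a\u20ACb" nil '(iso-latin-1))
+@end group
+@end example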
+@end defun
+
@defun detect-coding-region start end &optional highest
This function chooses a plausible coding system for decoding the text
-from @var{start} to @var{end}. This text should be a byte sequence
-(@pxref{Explicit Encoding}).
+from @var{start} to @var{end}. This text should be a byte sequence,
+i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and
+eight-bit characters (@pxref{Explicit Encoding}).
Normally this function returns a list of coding systems that could
handle decoding the text that was scanned. They are listed in order of
ISO-2022 control characters ISO-2022 as @code{ESC}, the value is
@code{undecided} or @code{(undecided)}, or a variant specifying
end-of-line conversion, if that can be deduced from the text.
+
+If the region contains null bytes, the value is @code{no-conversion},
+even if the region contains text encoded in some coding system.
@end defun
@defun detect-coding-string string &optional highest
This function is like @code{detect-coding-region} except that it
operates on the contents of @var{string} instead of bytes in the buffer.
+@end defun
+
+@cindex null bytes, and decoding text
+@defvar inhibit-null-byte-detection
+If this variable has a non-@code{nil} value, null bytes are ignored
+when detecting the encoding of a region or a string. This allows
+Emacs to correctly detect the encoding of text that contains null
+bytes, such as Info files with Index nodes.
+@end defvar
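+
+For illustration only (the sample string is made up), the effect of
+this variable can be seen by running the same detection twice, once
+with the default behavior and once with null bytes ignored:
+
+@example
+@group
+;; By default, a null byte makes detection give up.
+(detect-coding-string "foo\0bar" t)
+     @result{} no-conversion
+@end group
+
+@group
+;; With the variable bound to a non-nil value, the null byte is
+;; ignored and detection proceeds as for ordinary text.
+(let ((inhibit-null-byte-detection t))
+  (detect-coding-string "foo\0bar" t))
+@end group
+@end example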
+
+@defvar inhibit-iso-escape-detection
+If this variable has a non-@code{nil} value, ISO-2022 escape sequences
+are ignored when detecting the encoding of a region or a string. The
+result is that no text is ever detected as encoded in some ISO-2022
+encoding, and all escape sequences become visible in a buffer.
+@strong{Warning:} @emph{Use this variable with extreme caution,
+because many files in the Emacs distribution use ISO-2022 encoding.}
+@end defvar
+
+@cindex charsets supported by a coding system
+@defun coding-system-charset-list coding-system
+This function returns the list of character sets (@pxref{Character
+Sets}) supported by @var{coding-system}. Some coding systems that
+support too many character sets to list them all yield special values:
+@itemize @bullet
+@item
+If @var{coding-system} supports all the ISO-2022 charsets, the value
+is @code{iso-2022}.
+@item
+If @var{coding-system} supports all Emacs characters, the value is
+@code{(emacs)}.
+@item
+If @var{coding-system} supports all emacs-mule characters, the value
+is @code{emacs-mule}.
+@item
+If @var{coding-system} supports all Unicode characters, the value is
+@code{(unicode)}.
+@end itemize
@end defun
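+
+As a small illustration (the two coding systems queried here are
+arbitrary picks), a coding system tied to a single charset yields an
+ordinary list, while one that covers all of Unicode yields one of the
+special values listed above:
+
+@example
+@group
+(coding-system-charset-list 'iso-latin-1)
+     @result{} (iso-8859-1)
+(coding-system-charset-list 'utf-8)
+     @result{} (unicode)
+@end group
+@end example
+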
@xref{Coding systems for a subprocess,, Process Information}, in
@var{from} is a string, the string specifies the text to encode, and
@var{to} is ignored.
+If the specified text includes raw bytes (@pxref{Text
+Representations}), @code{select-safe-coding-system} suggests
+@code{raw-text} for its encoding.
+
If @var{default-coding-system} is non-@code{nil}, that is the first
coding system to try; if that can handle the text,
@code{select-safe-coding-system} returns that coding system. It can
also be a list of coding systems; then the function tries each of them
one by one. After trying all of them, it next tries the current
buffer's value of @code{buffer-file-coding-system} (if it is not
-@code{undecided}), then the value of
-@code{default-buffer-file-coding-system} and finally the user's most
+@code{undecided}), then the default value of
+@code{buffer-file-coding-system} and finally the user's most
preferred coding system, which the user can set using the command
@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
Coding Systems, emacs, The GNU Emacs Manual}).
@vindex select-safe-coding-system-accept-default-p
If the variable @code{select-safe-coding-system-accept-default-p} is
-non-@code{nil}, its value overrides the value of
-@var{accept-default-p}.
+non-@code{nil}, it should be a function taking a single argument.
+It is used in place of @var{accept-default-p}, overriding any
+value supplied for this argument.
As a final step, before returning the chosen coding system,
@code{select-safe-coding-system} checks whether that coding system is
@node Default Coding Systems
@subsection Default Coding Systems
+@cindex default coding system
+@cindex coding system, automatically determined
This section describes variables that specify the default coding
system for certain files or when running certain subprograms, and the
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}).
-@defvar auto-coding-regexp-alist
+@cindex file contents, and default coding system
+@defopt auto-coding-regexp-alist
This variable is an alist of text patterns and corresponding coding
systems. Each element has the form @code{(@var{regexp}
. @var{coding-system})}; a file whose first few kilobytes match
@code{file-coding-system-alist} (see below). The default value is set
so that Emacs automatically recognizes mail files in Babyl format and
reads them with no code conversions.
-@end defvar
+@end defopt
-@defvar file-coding-system-alist
+@cindex file name, and default coding system
+@defopt file-coding-system-alist
This variable is an alist that specifies the coding systems to use for
reading and writing particular files. Each element has the form
@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
If @var{coding} (or what is returned by the above function) is
@code{undecided}, the normal code-detection is performed.
-@end defvar
+@end defopt
+
+@defopt auto-coding-alist
+This variable is an alist that specifies the coding systems to use for
+reading and writing particular files. Its form is like that of
+@code{file-coding-system-alist}, but, unlike the latter, this variable
+takes priority over any @code{coding:} tags in the file.
+@end defopt
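+
+As a hedged sketch of how an entry might be added (the @file{.mydat}
+extension is a hypothetical example, not something Emacs defines),
+the following makes files with that extension be read and written
+with no conversion, even if they contain a @code{coding:} tag:
+
+@example
+@group
+;; Hypothetical rule: treat *.mydat files as binary data.
+(add-to-list 'auto-coding-alist
+             '("\\.mydat\\'" . no-conversion))
+@end group
+@end example
+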
+@cindex program name, and default coding system
@defvar process-coding-system-alist
This variable is an alist specifying which coding systems to use for a
subprocess, depending on which program is running in the subprocess. It
the end of line conversion---that is, one like @code{latin-1-unix},
rather than @code{undecided} or @code{latin-1}.
+@cindex port number, and default coding system
+@cindex network service name, and default coding system
@defvar network-coding-system-alist
This variable is an alist that specifies the coding system to use for
network streams. It works much like @code{file-coding-system-alist},
the subprocess, and @var{output-coding} applies to output to it.
@end defvar
-@defvar auto-coding-functions
+@cindex default coding system, functions to determine
+@defopt auto-coding-functions
This variable holds a list of functions that try to determine a
coding system for a file based on its undecoded contents.
If a file has a @samp{coding:} tag, that takes precedence, so these
functions won't be called.
-@end defvar
+@end defopt
+
+@defun find-auto-coding filename size
+This function tries to determine a suitable coding system for
+@var{filename}. It examines the buffer visiting the named file, using
+the variables documented above in sequence, until it finds a match for
+one of the rules specified by these variables. It then returns a cons
+cell of the form @code{(@var{coding} . @var{source})}, where
+@var{coding} is the coding system to use and @var{source} is a symbol,
+one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist},
+@code{:coding}, or @code{auto-coding-functions}, indicating which one
+supplied the matching rule. The value @code{:coding} means the coding
+system was specified by the @code{coding:} tag in the file
+(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}).
+The order of looking for a matching rule is @code{auto-coding-alist}
+first, then @code{auto-coding-regexp-alist}, then the @code{coding:}
+tag, and lastly @code{auto-coding-functions}. If no matching rule was
+found, the function returns @code{nil}.
+
+The second argument @var{size} is the size of text, in characters,
+following point. The function examines text only within @var{size}
+characters after point. Normally, the buffer should be positioned at
+the beginning when this function is called, because one of the places
+for the @code{coding:} tag is the first one or two lines of the file;
+in that case, @var{size} should be the size of the buffer.
+@end defun
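+
+The following sketch (the file name and buffer contents are invented
+for illustration, and the default values of the variables described
+above are assumed) shows the value returned when the only matching
+rule is a @code{coding:} tag on the first line:
+
+@example
+@group
+(with-temp-buffer
+  (insert ";; -*- coding: latin-1 -*-\n")
+  ;; Point must be at the start so that the tag is examined.
+  (goto-char (point-min))
+  (find-auto-coding "sample.el" (buffer-size)))
+     @result{} (latin-1 . :coding)
+@end group
+@end example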
+
+@defun set-auto-coding filename size
+This function returns a suitable coding system for file
+@var{filename}. It uses @code{find-auto-coding} to find the coding
+system. If no coding system could be determined, the function returns
+@code{nil}. The argument @var{size} has the same meaning as in
+@code{find-auto-coding}.
+@end defun
@defun find-operation-coding-system operation &rest arguments
This function returns the coding system to use (by default) for
affect it.
@end defvar
-@defvar inhibit-eol-conversion
+@defopt inhibit-eol-conversion
When this variable is non-@code{nil}, no end-of-line conversion is done,
no matter which coding system is specified. This applies to all the
Emacs I/O and subprocess primitives, and to the explicit encoding and
decoding functions (@pxref{Explicit Encoding}).
-@end defvar
+@end defopt
+
+@cindex priority order of coding systems
+@cindex coding systems, priority
+ Sometimes, you need to prefer several coding systems for some
+operation, rather than fix a single one. Emacs lets you specify a
+priority order for using coding systems. This ordering affects the
+sorting of lists of coding systems returned by functions such as
+@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
+
+@defun coding-system-priority-list &optional highestp
+This function returns the list of coding systems in the order of their
+current priorities. Optional argument @var{highestp}, if
+non-@code{nil}, means return only the highest priority coding system.
+@end defun
+
+@defun set-coding-system-priority &rest coding-systems
+This function puts @var{coding-systems} at the beginning of the
+priority list for coding systems, thus making their priority higher
+than all the rest.
+@end defun
+
+@defmac with-coding-priority coding-systems &rest body@dots{}
+This macro executes @var{body}, like @code{progn} does
+(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of
+the priority list for coding systems. @var{coding-systems} should be
+a list of coding systems to prefer during execution of @var{body}.
+@end defmac
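+
+For example (a minimal sketch, with @code{utf-8} chosen arbitrarily),
+the temporary change in priority can be observed from inside the
+body:
+
+@example
+@group
+;; Inside the body, utf-8 heads the priority list.
+(with-coding-priority '(utf-8)
+  (coding-system-priority-list t))
+     @result{} utf-8
+@end group
+@end example
+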
@node Explicit Encoding
@subsection Explicit Encoding and Decoding
The result of encoding, and the input to decoding, are not ordinary
text. They logically consist of a series of byte values; that is, a
-series of characters whose codes are in the range 0 through 255. In a
-multibyte buffer or string, character codes 128 through 159 are
-represented by multibyte sequences, but this is invisible to Lisp
-programs.
+series of @acronym{ASCII} and eight-bit characters. In unibyte
+buffers and strings, these characters have codes in the range 0
+through #xFF (255). In a multibyte buffer or string, eight-bit
+characters have character codes higher than #xFF (@pxref{Text
+Representations}), but Emacs transparently converts them to their
+single-byte values when you encode or decode such text.
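+
+As an illustrative sketch (the byte value is an arbitrary choice),
+converting a unibyte string to multibyte makes the difference between
+the two representations of the same raw byte visible:
+
+@example
+@group
+;; In a unibyte string, the byte #xFC is a character with code 252.
+(aref "\374" 0)
+     @result{} 252
+;; In a multibyte string, the same byte is an eight-bit character
+;; whose code lies above #xFF.
+(aref (string-to-multibyte "\374") 0)
+     @result{} 4194300
+@end group
+@end example
+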
The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
Here are the functions to perform explicit encoding or decoding. The
encoding functions produce sequences of bytes; the decoding functions
are meant to operate on sequences of bytes. All of these functions
-discard text properties.
+discard text properties. They also set @code{last-coding-system-used}
+to the precise coding system they used.
-@deffn Command encode-coding-region start end coding-system
+@deffn Command encode-coding-region start end coding-system &optional destination
This command encodes the text from @var{start} to @var{end} according
-to coding system @var{coding-system}. The encoded text replaces the
-original text in the buffer. The result of encoding is logically a
-sequence of bytes, but the buffer remains multibyte if it was multibyte
-before.
-
-This command returns the length of the encoded text.
+to coding system @var{coding-system}. Normally, the encoded text
+replaces the original text in the buffer, but the optional argument
+@var{destination} can change that. If @var{destination} is a buffer,
+the encoded text is inserted in that buffer after point (point does
+not move); if it is @code{t}, the command returns the encoded text as
+a unibyte string without inserting it.
+
+If encoded text is inserted in some buffer, this command returns the
+length of the encoded text.
+
+The result of encoding is logically a sequence of bytes, but the
+buffer remains multibyte if it was multibyte before, and any 8-bit
+bytes are converted to their multibyte representation (@pxref{Text
+Representations}).
+
+@cindex @code{undecided} coding-system, when encoding
+Do @emph{not} use @code{undecided} for @var{coding-system} when
+encoding text, since that may lead to unexpected results. Instead,
+use @code{select-safe-coding-system} (@pxref{User-Chosen Coding
+Systems, select-safe-coding-system}) to suggest a suitable encoding,
+if there's no obvious pertinent value for @var{coding-system}.
@end deffn
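+
+As a brief sketch (the text and the choice of @code{utf-8} are
+arbitrary), passing @code{t} as @var{destination} returns the encoded
+bytes as a unibyte string instead of replacing the text in the
+buffer:
+
+@example
+@group
+(with-temp-buffer
+  (insert "Gr\u00fcss Gott")
+  (encode-coding-region (point-min) (point-max) 'utf-8 t))
+     @result{} "Gr\303\274ss Gott"
+@end group
+@end example
+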
-@defun encode-coding-string string coding-system &optional nocopy
+@defun encode-coding-string string coding-system &optional nocopy buffer
This function encodes the text in @var{string} according to coding
system @var{coding-system}. It returns a new string containing the
encoded text, except when @var{nocopy} is non-@code{nil}, in which
operation is trivial. The result of encoding is a unibyte string.
@end defun
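+
+For instance (a minimal sketch with an arbitrary sample string),
+encoding a string that holds one non-@acronym{ASCII} character in
+@code{latin-1} gives a unibyte string in which that character
+occupies a single byte:
+
+@example
+@group
+(encode-coding-string "Gr\u00fcss Gott" 'latin-1)
+     @result{} "Gr\374ss Gott"
+@end group
+@end example
+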
-@deffn Command decode-coding-region start end coding-system
+@deffn Command decode-coding-region start end coding-system &optional destination
This command decodes the text from @var{start} to @var{end} according
-to coding system @var{coding-system}. The decoded text replaces the
-original text in the buffer. To make explicit decoding useful, the text
-before decoding ought to be a sequence of byte values, but both
-multibyte and unibyte buffers are acceptable.
-
-This command returns the length of the decoded text.
+to coding system @var{coding-system}. To make explicit decoding
+useful, the text before decoding ought to be a sequence of byte
+values, but both multibyte and unibyte buffers are acceptable (in the
+multibyte case, the raw byte values should be represented as eight-bit
+characters). Normally, the decoded text replaces the original text in
+the buffer, but the optional argument @var{destination} can change
+that. If @var{destination} is a buffer, the decoded text is inserted
+in that buffer after point (point does not move); if it is @code{t},
+the command returns the decoded text as a multibyte string without
+inserting it.
+
+If decoded text is inserted in some buffer, this command returns the
+length of the decoded text.
+
+This command puts a @code{charset} text property on the decoded text.
+The value of the property states the character set used to decode the
+original text.
@end deffn
-@defun decode-coding-string string coding-system &optional nocopy
-This function decodes the text in @var{string} according to coding
-system @var{coding-system}. It returns a new string containing the
-decoded text, except when @var{nocopy} is non-@code{nil}, in which
-case the function may return @var{string} itself if the decoding
-operation is trivial. To make explicit decoding useful, the contents
-of @var{string} ought to be a sequence of byte values, but a multibyte
-string is acceptable.
+@defun decode-coding-string string coding-system &optional nocopy buffer
+This function decodes the text in @var{string} according to
+@var{coding-system}. It returns a new string containing the decoded
+text, except when @var{nocopy} is non-@code{nil}, in which case the
+function may return @var{string} itself if the decoding operation is
+trivial. To make explicit decoding useful, the contents of
+@var{string} ought to be a unibyte string with a sequence of byte
+values, but a multibyte string is also acceptable (assuming it
+contains 8-bit bytes in their multibyte form).
+
+If optional argument @var{buffer} specifies a buffer, the decoded text
+is inserted in that buffer after point (point does not move). In this
+case, the return value is the length of the decoded text.
+
+@cindex @code{charset}, text property
+This function puts a @code{charset} text property on the decoded text.
+The value of the property states the character set used to decode the
+original text:
+
+@example
+@group
+(decode-coding-string "Gr\374ss Gott" 'latin-1)
+ @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1))
+@end group
+@end example
@end defun
@defun decode-coding-inserted-region from to filename &optional visit beg end replace
@subsection Terminal I/O Encoding
Emacs can decode keyboard input using a coding system, and encode
-terminal output. This is useful for terminals that transmit or display
-text using a particular encoding such as Latin-1. Emacs does not set
-@code{last-coding-system-used} for encoding or decoding for the
-terminal.
+terminal output. This is useful for terminals that transmit or
+display text using a particular encoding such as Latin-1. Emacs does
+not set @code{last-coding-system-used} for encoding or decoding of
+terminal I/O.
-@defun keyboard-coding-system
+@defun keyboard-coding-system &optional terminal
This function returns the coding system that is in use for decoding
-keyboard input---or @code{nil} if no coding system is to be used.
+keyboard input from @var{terminal}---or @code{nil} if no coding system
+is to be used for that terminal. If @var{terminal} is omitted or
+@code{nil}, it means the selected frame's terminal. @xref{Multiple
+Terminals}.
@end defun
-@deffn Command set-keyboard-coding-system coding-system
-This command specifies @var{coding-system} as the coding system to
-use for decoding keyboard input. If @var{coding-system} is @code{nil},
-that means do not decode keyboard input.
+@deffn Command set-keyboard-coding-system coding-system &optional terminal
+This command specifies @var{coding-system} as the coding system to use
+for decoding keyboard input from @var{terminal}. If
+@var{coding-system} is @code{nil}, that means do not decode keyboard
+input. If @var{terminal} is a frame, it means that frame's terminal;
+if it is @code{nil}, that means the currently selected frame's
+terminal. @xref{Multiple Terminals}.
@end deffn
-@defun terminal-coding-system
+@defun terminal-coding-system &optional terminal
This function returns the coding system that is in use for encoding
-terminal output---or @code{nil} for no encoding.
+terminal output from @var{terminal}---or @code{nil} if the output is
+not encoded. If @var{terminal} is a frame, it means that frame's
+terminal; if it is @code{nil}, that means the currently selected
+frame's terminal.
@end defun
-@deffn Command set-terminal-coding-system coding-system
+@deffn Command set-terminal-coding-system coding-system &optional terminal
This command specifies @var{coding-system} as the coding system to use
-for encoding terminal output. If @var{coding-system} is @code{nil},
-that means do not encode terminal output.
+for encoding terminal output from @var{terminal}. If
+@var{coding-system} is @code{nil}, terminal output is not encoded. If
+@var{terminal} is a frame, it means that frame's terminal; if it is
+@code{nil}, that means the currently selected frame's terminal.
@end deffn
@node MS-DOS File Types
Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
+
+Its default value is used to decide how to handle files for which
+@code{file-name-buffer-file-type-alist} says nothing about the type:
+If the default value is non-@code{nil}, then these files are treated as
+binary: the coding system @code{no-conversion} is used. Otherwise,
+nothing special is done for them---the coding system is deduced solely
+from the file contents, in the usual Emacs fashion.
@end defvar
@defopt file-name-buffer-file-type-alist
is used.
If no element in this alist matches a given file name, then
-@code{default-buffer-file-type} says how to treat the file.
-@end defopt
-
-@defopt default-buffer-file-type
-This variable says how to handle files for which
-@code{file-name-buffer-file-type-alist} says nothing about the type.
-
-If this variable is non-@code{nil}, then these files are treated as
-binary: the coding system @code{no-conversion} is used. Otherwise,
-nothing special is done for them---the coding system is deduced solely
-from the file contents, in the usual Emacs fashion.
+the default value of @code{buffer-file-type} says how to treat the file.
@end defopt
@node Input Methods