@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c 2005, 2006, 2007, 2008, 2009 Free Software Foundation, Inc.
+@c 2005, 2006, 2007, 2008, 2009, 2010 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
Emacs buffers and strings support a large repertoire of characters
from many different scripts, allowing users to type and display text
-in most any known written language.
+in almost any known written language.
@cindex character codepoint
@cindex codespace
follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
unique number, called a @dfn{codepoint}, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
-@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
-extends this range with codepoints in the range @code{110000..3FFFFF},
-which it uses for representing characters that are not unified with
-Unicode and raw 8-bit bytes that cannot be interpreted as characters
-(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a
-character codepoint in Emacs is a 22-bit integer number.
+@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
+inclusive. Emacs extends this range with codepoints in the range
+@code{#x110000..#x3FFFFF}, which it uses for representing characters
+that are not unified with Unicode and @dfn{raw 8-bit bytes} that
+cannot be interpreted as characters. Thus, a character codepoint in
+Emacs is a 22-bit integer number.
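+
+ As a quick check of the 22-bit claim, the following sketch uses only
+integer arithmetic and the standard function @code{logb}; nothing in
+it is specific to Emacs characters beyond the constant @code{#x3FFFFF}:
+
+@example
+;; @r{The largest Emacs codepoint has all of its low 22 bits set.}
+(= #x3FFFFF (1- (expt 2 22)))
+     @result{} t
+;; @r{The index of its highest set bit is 21, so 22 bits suffice.}
+(logb #x3FFFFF)
+     @result{} 21
+@end example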
@cindex internal representation of characters
@cindex characters, representation in buffers and strings
You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
-@end defvar
-
-@defvar default-enable-multibyte-characters
-This variable's value is entirely equivalent to @code{(default-value
-'enable-multibyte-characters)}, and setting this variable changes that
-default value. Setting the local binding of
-@code{enable-multibyte-characters} in a specific buffer is not allowed,
-but changing the default value is supported, and it is a reasonable
-thing to do, because it has no effect on existing buffers.
The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
it is returned unchanged. The function assumes that @var{string}
includes only @acronym{ASCII} characters and raw 8-bit bytes; the
latter are converted to their multibyte representation corresponding
-to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
-Representations, codepoints}).
+to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
+(@pxref{Text Representations, codepoints}).
@end defun
@defun string-to-unibyte string
The unibyte and multibyte text representations use different
character codes. The valid character codes for unibyte representation
-range from 0 to 255---the values that can fit in one byte. The valid
-character codes for multibyte representation range from 0 to 4194303
-(#x3FFFFF). In this code space, values 0 through 127 are for
-@acronym{ASCII} charcters, and values 129 through 4194175 (#x3FFF7F)
-are for non-@acronym{ASCII} characters. Values 0 through 1114111
-(#10FFFF) correspond to Unicode characters of the same codepoint;
-values 1114112 (#110000) through 4194175 (#x3FFF7F) represent
-characters that are not unified with Unicode; and values 4194176
-(#x3FFF80) through 4194303 (#x3FFFFF) represent eight-bit raw bytes.
+range from 0 to @code{#xFF} (255)---the values that can fit in one
+byte. The valid character codes for multibyte representation range
+from 0 to @code{#x3FFFFF}. In this code space, values 0 through
+@code{#x7F} (127) are for @acronym{ASCII} characters, and values
+@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
+non-@acronym{ASCII} characters.
+
+ The range of Emacs character codes is a superset of the Unicode
+codespace.
+Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
+characters of the same codepoint; values @code{#x110000} (1114112)
+through @code{#x3FFF7F} (4194175) represent characters that are not
+unified with Unicode; and values @code{#x3FFF80} (4194176) through
+@code{#x3FFFFF} (4194303) represent eight-bit raw bytes.
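+
+ For instance, assuming the conversion functions
+@code{unibyte-char-to-multibyte} and @code{multibyte-char-to-unibyte}
+map raw bytes as described elsewhere in this chapter, a non-@acronym{ASCII}
+byte lands in the eight-bit range and converts back unchanged:
+
+@example
+;; @r{Byte @code{#x80} becomes the first eight-bit raw byte character.}
+(unibyte-char-to-multibyte #x80)
+     @result{} 4194176
+;; @r{Converting @code{#x3FFF80} back yields the original byte value.}
+(multibyte-char-to-unibyte #x3FFF80)
+     @result{} 128
+@end example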
@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@end example
@end defun
-@defun get-byte pos &optional string
+@defun get-byte &optional pos string
This function returns the byte at character position @var{pos} in the
current buffer. If the current buffer is unibyte, this is literally
the byte at that position. If the buffer is multibyte, byte values of
during text processing and display. Thus, character properties are an
important part of specifying the character's semantics.
- Emacs generally follows the Unicode Standard in its implementation
+ On the whole, Emacs follows the Unicode Standard in its implementation
of character properties. In particular, Emacs supports the
@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
Model}, and the Emacs character property database is derived from the
value is a string consisting of upper-case Latin letters A to Z,
digits, spaces, and hyphen @samp{-} characters.
+@cindex unicode general category
@item general-category
This property corresponds to the Unicode @code{General_Category}
property. The value is a symbol whose name is a 2-letter abbreviation
@var{propname} for the character @var{char}.
@end defun
-@defvar char-script-table
+@defvar unicode-category-table
The value of this variable is a char-table (@pxref{Char-Tables}) that
-specifies, for each character, a symbol whose name is the script to
-which the character belongs, according to the Unicode Standard
-classification of the Unicode code space into script-specific blocks.
-This char-table has a single extra slot whose value is the list of all
-script symbols.
+specifies, for each character, its Unicode @code{General_Category}
+property as a symbol.
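+
+For example (a sketch; the results shown assume the standard Unicode
+data shipped with Emacs, in which @samp{A} is an upper-case letter):
+
+@example
+;; @r{Look up the general category of @samp{A} directly in the table.}
+(aref unicode-category-table ?A)
+     @result{} Lu
+;; @r{The same information through the property interface.}
+(get-char-code-property ?A 'general-category)
+     @result{} Lu
+@end example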
+@end defvar
+
+@defvar char-script-table
+The value of this variable is a char-table that specifies, for each
+character, a symbol whose name is the script to which the character
+belongs, according to the Unicode Standard classification of the
+Unicode code space into script-specific blocks. This char-table has a
+single extra slot whose value is the list of all script symbols.
@end defvar
@defvar char-width-table
@cindex coded character set
An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
in which each character is assigned a numeric code point. (The
-Unicode standard calls this a @dfn{coded character set}.) Each Emacs
+Unicode Standard calls this a @dfn{coded character set}.) Each Emacs
charset has a name which is a symbol. A single character can belong
to any number of different character sets, but it will generally have
a different code point in each charset. Examples of character sets
@cindex @code{eight-bit}, a charset
Emacs defines several special character sets. The character set
@code{unicode} includes all the characters whose Emacs code points are
-in the range @code{0..10FFFF}. The character set @code{emacs}
+in the range @code{0..#x10FFFF}. The character set @code{emacs}
includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
Emacs uses it to represent raw bytes encountered in text.
This function makes @var{charsets} the highest priority character sets.
@end defun
-@defun char-charset character
+@defun char-charset character &optional restriction
This function returns the name of the character set of highest
priority that @var{character} belongs to. @acronym{ASCII} characters
are an exception: for them, this function always returns @code{ascii}.
+
+If @var{restriction} is non-@code{nil}, it should be a list of
+charsets to search. Alternatively, it can be a coding system, in
+which case the returned charset must be supported by that coding
+system (@pxref{Coding Systems}).
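+
+A brief sketch of both call forms. The charset name
+@code{iso-8859-1} in the restriction list is only an illustrative
+choice, and no result is shown for the second call because it depends
+on the charsets defined in the running Emacs:
+
+@example
+(char-charset ?a)
+     @result{} ascii
+;; @r{Search only the listed charsets for character @code{#xE9}.}
+(char-charset #xE9 '(iso-8859-1))
+@end example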
@end defun
@defun charset-plist charset
The following function comes in handy for applying a certain
function to all or part of the characters in a charset:
-@defun map-charset-chars function charset &optional arg from to
+@defun map-charset-chars function charset &optional arg from-code to-code
Call @var{function} for characters in @var{charset}. @var{function}
is called with two arguments. The first one is a cons cell
@code{(@var{from} . @var{to})}, where @var{from} and @var{to}
indicate a range of characters contained in charset. The second
-argument is the optional argument @var{arg}.
+argument passed to @var{function} is @var{arg}.
By default, the range of codepoints passed to @var{function} includes
-all the characters in @var{charset}, but optional arguments @var{from}
-and @var{to} limit that to the range of characters between these two
-codepoints. If either of them is @code{nil}, it defaults to the first
-or last codepoint of @var{charset}, respectively.
+all the characters in @var{charset}, but optional arguments
+@var{from-code} and @var{to-code} limit that to the range of
+characters between these two codepoints of @var{charset}. If either
+of them is @code{nil}, it defaults to the first or last codepoint of
+@var{charset}, respectively.
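+
+The following sketch collects the ranges reported for a charset into
+a list. The variable name @code{ranges} and the choice of the
+@code{ascii} charset are illustrative only:
+
+@example
+(let (ranges)
+  ;; @r{Each call receives a range cons cell and @var{arg} (here @code{nil}).}
+  (map-charset-chars
+   (lambda (range arg) (push range ranges))
+   'ascii)
+  ;; @r{Return the ranges in the order they were reported.}
+  (nreverse ranges))
+@end example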
@end defun
@node Scanning Charsets
@defun make-translation-table-from-vector vec
This function returns a translation table made from @var{vec} that is
-an array of 256 elements to map byte values 0 through 255 to
+an array of 256 elements to map bytes (values 0 through #xFF) to
characters. Elements may be @code{nil} for untranslated bytes. The
returned table has a translation table for reverse mapping in the
first extra slot, and the value @code{1} in the second extra slot.
Here are the Lisp facilities for working with coding systems:
+@cindex list all coding systems
@defun coding-system-list &optional base-only
This function returns a list of all coding system names (symbols). If
@var{base-only} is non-@code{nil}, the value includes only the
name or @code{nil}.
@end defun
+@cindex validity of coding system
+@cindex coding system, validity check
@defun check-coding-system coding-system
This function checks the validity of @var{coding-system}. If that is
valid, it returns @var{coding-system}. If @var{coding-system} is
(@pxref{Signaling Errors, signal}).
@end defun
+@cindex eol type of coding system
@defun coding-system-eol-type coding-system
This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
conversion used by @var{coding-system}. If @var{coding-system}
eol conversion is set to match it (e.g., DOS-style CRLF format will
imply @code{dos} eol conversion). For encoding, the eol conversion is
taken from the appropriate default coding system (e.g.,
-@code{default-buffer-file-coding-system} for
+the default value of @code{buffer-file-coding-system} for
@code{buffer-file-coding-system}), or from the default eol conversion
appropriate for the underlying platform.
@end defun
+@cindex eol conversion of coding system
@defun coding-system-change-eol-conversion coding-system eol-type
This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @code{eol-type}.
@code{dos} and @code{mac}, respectively.
@end defun
+@cindex text conversion of coding system
@defun coding-system-change-text-conversion eol-coding text-coding
This function returns a coding system which uses the end-of-line
conversion of @var{eol-coding}, and the text conversion of
@code{undecided}, or one of its variants according to @var{eol-coding}.
@end defun
+@cindex safely encode region
+@cindex coding systems for encoding region
@defun find-coding-systems-region from to
This function returns a list of coding systems that could be used to
encode a text between @var{from} and @var{to}. All coding systems in
list @code{(undecided)}.
@end defun
+@cindex safely encode a string
+@cindex coding systems for encoding a string
@defun find-coding-systems-string string
This function returns a list of coding systems that could be used to
encode the text of @var{string}. All coding systems in the list can
@code{(undecided)}.
@end defun
+@cindex charset, coding systems to encode
+@cindex safely encode characters in a charset
@defun find-coding-systems-for-charsets charsets
This function returns a list of coding systems that could be used to
encode all the character sets in the list @var{charsets}.
operates on the contents of @var{string} instead of bytes in the buffer.
@end defun
+@cindex null bytes, and decoding text
@defvar inhibit-null-byte-detection
If this variable has a non-@code{nil} value, null bytes are ignored
when detecting the encoding of a region or a string. This allows to
because many files in the Emacs distribution use ISO-2022 encoding.}
@end defvar
+@cindex charsets supported by a coding system
@defun coding-system-charset-list coding-system
This function returns the list of character sets (@pxref{Character
Sets}) supported by @var{coding-system}. Some coding systems that
also be a list of coding systems; then the function tries each of them
one by one. After trying all of them, it next tries the current
buffer's value of @code{buffer-file-coding-system} (if it is not
-@code{undecided}), then the value of
-@code{default-buffer-file-coding-system} and finally the user's most
+@code{undecided}), then the default value of
+@code{buffer-file-coding-system} and finally the user's most
preferred coding system, which the user can set using the command
@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
Coding Systems, emacs, The GNU Emacs Manual}).
@vindex select-safe-coding-system-accept-default-p
If the variable @code{select-safe-coding-system-accept-default-p} is
-non-@code{nil}, its value overrides the value of
-@var{accept-default-p}.
+non-@code{nil}, it should be a function taking a single argument.
+It is used in place of @var{accept-default-p}, overriding any
+value supplied for this argument.
As a final step, before returning the chosen coding system,
@code{select-safe-coding-system} checks whether that coding system is
@node Default Coding Systems
@subsection Default Coding Systems
+@cindex default coding system
+@cindex coding system, automatically determined
This section describes variables that specify the default coding
system for certain files or when running certain subprograms, and the
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}).
-@defvar auto-coding-regexp-alist
+@cindex file contents, and default coding system
+@defopt auto-coding-regexp-alist
This variable is an alist of text patterns and corresponding coding
systems. Each element has the form @code{(@var{regexp}
. @var{coding-system})}; a file whose first few kilobytes match
@code{file-coding-system-alist} (see below). The default value is set
so that Emacs automatically recognizes mail files in Babyl format and
reads them with no code conversions.
-@end defvar
+@end defopt
-@defvar file-coding-system-alist
+@cindex file name, and default coding system
+@defopt file-coding-system-alist
This variable is an alist that specifies the coding systems to use for
reading and writing particular files. Each element has the form
@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
If @var{coding} (or what returned by the above function) is
@code{undecided}, the normal code-detection is performed.
-@end defvar
+@end defopt
+
+@defopt auto-coding-alist
+This variable is an alist that specifies the coding systems to use for
+reading and writing particular files. Its form is like that of
+@code{file-coding-system-alist}, but, unlike the latter, this variable
+takes priority over any @code{coding:} tags in the file.
+@end defopt
+@cindex program name, and default coding system
@defvar process-coding-system-alist
This variable is an alist specifying which coding systems to use for a
subprocess, depending on which program is running in the subprocess. It
the end of line conversion---that is, one like @code{latin-1-unix},
rather than @code{undecided} or @code{latin-1}.
+@cindex port number, and default coding system
+@cindex network service name, and default coding system
@defvar network-coding-system-alist
This variable is an alist that specifies the coding system to use for
network streams. It works much like @code{file-coding-system-alist},
the subprocess, and @var{output-coding} applies to output to it.
@end defvar
-@defvar auto-coding-functions
+@cindex default coding system, functions to determine
+@defopt auto-coding-functions
This variable holds a list of functions that try to determine a
coding system for a file based on its undecoded contents.
If a file has a @samp{coding:} tag, that takes precedence, so these
functions won't be called.
-@end defvar
+@end defopt
+
+@defun find-auto-coding filename size
+This function tries to determine a suitable coding system for
+@var{filename}. It examines the buffer visiting the named file, using
+the variables documented above in sequence, until it finds a match for
+one of the rules specified by these variables. It then returns a cons
+cell of the form @code{(@var{coding} . @var{source})}, where
+@var{coding} is the coding system to use and @var{source} is a symbol,
+one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist},
+@code{:coding}, or @code{auto-coding-functions}, indicating which one
+supplied the matching rule. The value @code{:coding} means the coding
+system was specified by the @code{coding:} tag in the file
+(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}).
+The order of looking for a matching rule is @code{auto-coding-alist}
+first, then @code{auto-coding-regexp-alist}, then the @code{coding:}
+tag, and lastly @code{auto-coding-functions}. If no matching rule was
+found, the function returns @code{nil}.
+
+The second argument @var{size} is the size of text, in characters,
+following point. The function examines text only within @var{size}
+characters after point. Normally, the buffer should be positioned at
+the beginning when this function is called, because one of the places
+for the @code{coding:} tag is the first one or two lines of the file;
+in that case, @var{size} should be the size of the buffer.
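+
+A minimal sketch of a call, using a temporary buffer in place of a
+real visited file; the file name @file{foo.el} is hypothetical.
+According to the description above, a match on the @code{coding:} tag
+is reported with @code{:coding} as the source:
+
+@example
+(with-temp-buffer
+  (insert ";; -*- coding: latin-1 -*-\n")
+  ;; @r{Position point at the beginning, where the tag is looked for.}
+  (goto-char (point-min))
+  (find-auto-coding "foo.el" (buffer-size)))
+@end example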
+@end defun
+
+@defun set-auto-coding filename size
+This function returns a suitable coding system for file
+@var{filename}. It uses @code{find-auto-coding} to find the coding
+system. If no coding system could be determined, the function returns
+@code{nil}. The argument @var{size} has the same meaning as in
+@code{find-auto-coding}.
+@end defun
@defun find-operation-coding-system operation &rest arguments
This function returns the coding system to use (by default) for
affect it.
@end defvar
-@defvar inhibit-eol-conversion
+@defopt inhibit-eol-conversion
When this variable is non-@code{nil}, no end-of-line conversion is done,
no matter which coding system is specified. This applies to all the
Emacs I/O and subprocess primitives, and to the explicit encoding and
decoding functions (@pxref{Explicit Encoding}).
-@end defvar
+@end defopt
@cindex priority order of coding systems
@cindex coding systems, priority
text. They logically consist of a series of byte values; that is, a
series of @acronym{ASCII} and eight-bit characters. In unibyte
buffers and strings, these characters have codes in the range 0
-through 255. In a multibyte buffer or string, eight-bit characters
-have character codes higher than 255 (@pxref{Text Representations}),
-but Emacs transparently converts them to their single-byte values when
-you encode or decode such text.
+through #xFF (255). In a multibyte buffer or string, eight-bit
+characters have character codes higher than #xFF (@pxref{Text
+Representations}), but Emacs transparently converts them to their
+single-byte values when you encode or decode such text.
The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
operation is trivial. The result of encoding is a unibyte string.
@end defun
-@deffn Command decode-coding-region start end coding-system destination
+@deffn Command decode-coding-region start end coding-system &optional destination
This command decodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}. To make explicit decoding
useful, the text before decoding ought to be a sequence of byte
terminal. @xref{Multiple Terminals}.
@end deffn
-@defun terminal-coding-system
+@defun terminal-coding-system &optional terminal
This function returns the coding system that is in use for encoding
-terminal output---or @code{nil} for no encoding.
+terminal output from @var{terminal}---or @code{nil} if the output is
+not encoded. If @var{terminal} is a frame, it means that frame's
+terminal; if it is @code{nil}, that means the currently selected
+frame's terminal.
@end defun
-@deffn Command set-terminal-coding-system coding-system
+@deffn Command set-terminal-coding-system coding-system &optional terminal
This command specifies @var{coding-system} as the coding system to use
-for encoding terminal output. If @var{coding-system} is @code{nil},
-that means do not encode terminal output.
+for encoding terminal output from @var{terminal}. If
+@var{coding-system} is @code{nil}, terminal output is not encoded. If
+@var{terminal} is a frame, it means that frame's terminal; if it is
+@code{nil}, that means the currently selected frame's terminal.
@end deffn
@node MS-DOS File Types
Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
+
+Its default value is used to decide how to handle files for which
+@code{file-name-buffer-file-type-alist} says nothing about the type.
+If the default value is non-@code{nil}, then these files are treated as
+binary: the coding system @code{no-conversion} is used. Otherwise,
+nothing special is done for them---the coding system is deduced solely
+from the file contents, in the usual Emacs fashion.
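+
+For example, to treat every file not mentioned in
+@code{file-name-buffer-file-type-alist} as binary (a sketch; this
+variable is meaningful only on the MS-DOS and MS-Windows systems that
+this section describes):
+
+@example
+;; @r{Set the default value, not the buffer-local one.}
+(setq-default buffer-file-type t)
+@end example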
@end defvar
@defopt file-name-buffer-file-type-alist
is used.
If no element in this alist matches a given file name, then
-@code{default-buffer-file-type} says how to treat the file.
-@end defopt
-
-@defopt default-buffer-file-type
-This variable says how to handle files for which
-@code{file-name-buffer-file-type-alist} says nothing about the type.
-
-If this variable is non-@code{nil}, then these files are treated as
-binary: the coding system @code{no-conversion} is used. Otherwise,
-nothing special is done for them---the coding system is deduced solely
-from the file contents, in the usual Emacs fashion.
+the default value of @code{buffer-file-type} says how to treat the file.
@end defopt
@node Input Methods