X-Git-Url: https://code.delx.au/gnu-emacs/blobdiff_plain/02eccf6bfb2babe6f4a71342450690bdfcb00899..60dd06a08276422871cd3d491a44d10d4bdc690c:/doc/lispref/nonascii.texi
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 2ac927d82c..00a1dffed6 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -1,7 +1,7 @@
 @c -*-texinfo-*-
 @c This is part of the GNU Emacs Lisp Reference Manual.
 @c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c 2005, 2006, 2007, 2008, 2009 Free Software Foundation, Inc.
+@c 2005, 2006, 2007, 2008, 2009, 2010 Free Software Foundation, Inc.
 @c See the file elisp.texi for copying conditions.
 @setfilename ../../info/characters
 @node Non-ASCII Characters, Searching and Matching, Text, Top
@@ -36,8 +36,8 @@ how they are stored in strings and buffers.
 @cindex text representation
 Emacs buffers and strings support a large repertoire of characters
-from many different scripts. This is so users could type and display
-text in most any known written language.
+from many different scripts, allowing users to type and display text
+in almost any known written language.
 @cindex character codepoint
 @cindex codespace
@@ -46,12 +46,12 @@ text in most any known written language. follows the @dfn{Unicode Standard}. The Unicode Standard assigns a unique number, called a @dfn{codepoint}, to each and every character. The range of codepoints defined by Unicode, or the Unicode
-@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
-extends this range with codepoints in the range @code{110000..3FFFFF},
-which it uses for representing characters that are not unified with
-Unicode and raw 8-bit bytes that cannot be interpreted as characters
-(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a
-character codepoint in Emacs is a 22-bit integer number.
+@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
+inclusive.
Emacs extends this range with codepoints in the range
+@code{#x110000..#x3FFFFF}, which it uses for representing characters
+that are not unified with Unicode and @dfn{raw 8-bit bytes} that
+cannot be interpreted as characters. Thus, a character codepoint in
+Emacs is a 22-bit integer number.
 @cindex internal representation of characters
 @cindex characters, representation in buffers and strings
@@ -65,15 +65,13 @@ This internal representation is based on one of the encodings defined by the Unicode Standard, called @dfn{UTF-8}, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with
-Unicode.}.
-For example, any @acronym{ASCII} character takes up only 1 byte, a
-Latin-1 character takes up 2 bytes, etc. We call this representation
-of text @dfn{multibyte}, because it uses several bytes for each
-character.
+Unicode.}. For example, any @acronym{ASCII} character takes up only 1
+byte, a Latin-1 character takes up 2 bytes, etc. We call this
+representation of text @dfn{multibyte}.
 Outside Emacs, characters can be represented in many different encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
-between these external encodings and the internal representation, as
+between these external encodings and its internal representation, as
 appropriate, when it reads text into a buffer or a string, or when it writes text to a disk file or passes it to some other process.
@@ -87,9 +85,9 @@ Before the conversion, the buffer holds encoded text. Encoded text is not really text, as far as Emacs is concerned, but rather a sequence of raw 8-bit bytes. We call buffers and strings that hold encoded text @dfn{unibyte} buffers and strings, because
-Emacs treats them as a sequence of individual bytes. In particular,
-Emacs usually displays unibyte buffers and strings as octal codes such
-as @code{\237}.
We recommend that you never use unibyte buffers and
+Emacs treats them as a sequence of individual bytes. Usually, Emacs
+displays unibyte buffers and strings as octal codes such as
+@code{\237}. We recommend that you never use unibyte buffers and
 strings except for manipulating encoded text or binary non-text data. In a buffer, the buffer-local value of the variable
@@ -104,15 +102,6 @@ it contains unibyte encoded text or binary non-text data. You cannot set this variable directly; instead, use the function @code{set-buffer-multibyte} to change a buffer's representation.
-@end defvar
-
-@defvar default-enable-multibyte-characters
-This variable's value is entirely equivalent to @code{(default-value
-'enable-multibyte-characters)}, and setting this variable changes that
-default value. Setting the local binding of
-@code{enable-multibyte-characters} in a specific buffer is not allowed,
-but changing the default value is supported, and it is a reasonable
-thing to do, because it has no effect on existing buffers.
 The @samp{--unibyte} command line option does its job by setting the default value to @code{nil} early in startup.
@@ -165,10 +154,10 @@ conversions happen when inserting text into a buffer, or when putting text from several strings together in one string. You can also explicitly convert a string's contents to either representation.
- Emacs chooses the representation for a string based on the text that
-it is constructed from. The general rule is to convert unibyte text to
-multibyte text when combining it with other multibyte text, because the
-multibyte representation is more general and can hold whatever
+ Emacs chooses the representation for a string based on the text from
+which it is constructed. The general rule is to convert unibyte text
+to multibyte text when combining it with other multibyte text, because
+the multibyte representation is more general and can hold whatever
 characters the unibyte text has.
When inserting text into a buffer, Emacs converts the text to the
@@ -181,9 +170,9 @@ alternative, to convert the buffer contents to multibyte, is not acceptable because the buffer's representation is a choice made by the user that cannot be overridden automatically.
- Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and converts bytes with codes 128 through 159 to the
-multibyte representation of raw eight-bit bytes.
+ Converting unibyte text to multibyte text leaves @acronym{ASCII}
+characters unchanged, and converts bytes with codes 128 through 159 to
+the multibyte representation of raw eight-bit bytes.
 Converting multibyte text to unibyte converts all @acronym{ASCII} and eight-bit characters to their single-byte form, but loses
@@ -200,8 +189,8 @@ of characters as @var{string}. If @var{string} is a multibyte string, it is returned unchanged. The function assumes that @var{string} includes only @acronym{ASCII} characters and raw 8-bit bytes; the latter are converted to their multibyte representation corresponding
-to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
-Representations, codepoints}).
+to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
+(@pxref{Text Representations, codepoints}).
 @end defun
 @defun string-to-unibyte string
@@ -214,9 +203,9 @@ characters.
 @end defun
 @defun multibyte-char-to-unibyte char
-This convert the multibyte character @var{char} to a unibyte
-character. If @var{char} is a character that is neither
-@acronym{ASCII} nor eight-bit, the value is -1.
+This converts the multibyte character @var{char} to a unibyte
+character, and returns that character. If @var{char} is neither
+@acronym{ASCII} nor eight-bit, the function returns -1.
 @end defun
 @defun unibyte-char-to-multibyte char
@@ -238,9 +227,9 @@ is @code{nil}, the buffer becomes unibyte. This function leaves the buffer contents unchanged when viewed as a sequence of bytes.
As a consequence, it can change the contents
-viewed as characters; a sequence of three bytes which is treated as
-one character in multibyte representation will count as three
-characters in unibyte representation. Eight-bit characters
+viewed as characters; for instance, a sequence of three bytes which is
+treated as one character in multibyte representation will count as
+three characters in unibyte representation. Eight-bit characters
 representing raw bytes are an exception. They are represented by one byte in a unibyte buffer, but when the buffer is set to multibyte, they are converted to two-byte sequences, and vice versa.
@@ -256,28 +245,24 @@ base buffer.
 @end defun
 @defun string-as-unibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each byte as a character. This means that the value may have
-more characters than @var{string} has. Eight-bit characters
-representing raw bytes are an exception: each one of them is converted
-to a single byte.
-
-If @var{string} is already a unibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
+If @var{string} is already a unibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each byte as a separate character
+(so that the value may have more characters than @var{string}); as an
+exception, each eight-bit character representing a raw byte is
+converted into a single byte. The newly-created string contains no
 text properties.
 @end defun
 @defun string-as-multibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character. This means that
-the value may have fewer characters than @var{string} has. If a byte
-sequence in @var{string} is invalid as a multibyte representation of a
-single character, each byte in the sequence is treated as raw 8-bit
-byte.
-
-If @var{string} is already a multibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
-text properties.
+If @var{string} is a multibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each multibyte sequence as one
+character. This means that the value may have fewer characters than
+@var{string} has. If a byte sequence in @var{string} is invalid as a
+multibyte representation of a single character, each byte in the
+sequence is treated as a raw 8-bit byte. The newly-created string
+contains no text properties.
 @end defun
 @node Character Codes
@@ -286,14 +271,19 @@ text properties. The unibyte and multibyte text representations use different character codes. The valid character codes for unibyte representation
-range from 0 to 255---the values that can fit in one byte. The valid
-character codes for multibyte representation range from 0 to 4194303
-(#x3FFFFF). In this code space, values 0 through 127 are for
-@acronym{ASCII} charcters, and values 129 through 4194175 (#x3FFF7F)
-are for non-@acronym{ASCII} characters. Values 0 through 1114111
-(#10FFFF) corresponds to Unicode characters of the same codepoint,
-while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
-representing eight-bit raw bytes.
+range from 0 to @code{#xFF} (255)---the values that can fit in one
+byte. The valid character codes for multibyte representation range
+from 0 to @code{#x3FFFFF}. In this code space, values 0 through
+@code{#x7F} (127) are for @acronym{ASCII} characters, and values
+@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
+non-@acronym{ASCII} characters.
+
+ Emacs character codes are a superset of the Unicode standard.
+Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
+characters of the same codepoint; values @code{#x110000} (1114112)
+through @code{#x3FFF7F} (4194175) represent characters that are not
+unified with Unicode; and values @code{#x3FFF80} (4194176) through
+@code{#x3FFFFF} (4194303) represent eight-bit raw bytes.
 @defun characterp charcode
 This returns @code{t} if @var{charcode} is a valid character, and
@@ -333,10 +323,10 @@ codepoint can have.
 @end example
 @end defun
-@defun get-byte pos &optional string
-This function returns the byte at current buffer's character position
-@var{pos}. If the current buffer is unibyte, this is literally the
-byte at that position. If the buffer is multibyte, byte values of
+@defun get-byte &optional pos string
+This function returns the byte at character position @var{pos} in the
+current buffer. If the current buffer is unibyte, this is literally
+the byte at that position. If the buffer is multibyte, byte values of
 @acronym{ASCII} characters are the same as character codepoints, whereas eight-bit raw bytes are converted to their 8-bit codes. The function signals an error if the character at @var{pos} is
@@ -354,19 +344,17 @@ specifies how the character behaves and how it should be handled during text processing and display. Thus, character properties are an important part of specifying the character's semantics.
- Emacs generally follows the Unicode Standard in its implementation
+ On the whole, Emacs follows the Unicode Standard in its implementation
 of character properties. In particular, Emacs supports the @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property Model}, and the Emacs character property database is derived from the Unicode Character Database (@acronym{UCD}). See the @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
-Properties chapter of the Unicode Standard}, for detailed description
-of Unicode character properties and their meaning.
This section
-assumes you are already familiar with that chapter of the Unicode
-Standard, and want to apply that knowledge to Emacs Lisp programs.
-
- The facilities documented in this section are useful for setting and
-retrieving properties of characters.
+Properties chapter of the Unicode Standard}, for a detailed
+description of Unicode character properties and their meaning. This
+section assumes you are already familiar with that chapter of the
+Unicode Standard, and want to apply that knowledge to Emacs Lisp
+programs.
 In Emacs, each property has a name, which is a symbol, and a set of possible values, whose types depend on the property; if a character
@@ -378,8 +366,8 @@ replacing each @samp{_} character with a dash @samp{-}. For example, @code{canonical-combining-class}. However, sometimes we shorten the names to make their use easier.
- Here's the full list of value types for all the character properties
-that Emacs knows about:
+ Here is the full list of value types for all the character
+properties that Emacs knows about:
 @table @code
 @item name
@@ -387,6 +375,7 @@ This property corresponds to the Unicode @code{Name} property. The value is a string consisting of upper-case Latin letters A to Z, digits, spaces, and hyphen @samp{-} characters.
+@cindex unicode general category
 @item general-category
 This property corresponds to the Unicode @code{General_Category} property. The value is a symbol whose name is a 2-letter abbreviation
@@ -428,7 +417,7 @@ corresponding number.
 @item numeric-value
 Corresponds to the Unicode @code{Numeric_Value} property for characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
-this property is an integer of a floating-point number. Examples of
+this property is an integer or a floating-point number. Examples of
 characters that have this property include fractions, subscripts, superscripts, Roman numerals, currency numerators, and encircled numbers.
For example, the value of this property for the character
@@ -513,13 +502,18 @@ This function stores @var{value} as the value of the property @var{propname} for the character @var{char}.
 @end defun
-@defvar char-script-table
+@defvar unicode-category-table
 The value of this variable is a char-table (@pxref{Char-Tables}) that
-specifies, for each character, a symbol whose name is the script to
-which the character belongs, according to the Unicode Standard
-classification of the Unicode code space into script-specific blocks.
-This char-table has a single extra slot whose value is the list of all
-script symbols.
+specifies, for each character, its Unicode @code{General_Category}
+property as a symbol.
+@end defvar
+
+@defvar char-script-table
+The value of this variable is a char-table that specifies, for each
+character, a symbol whose name is the script to which the character
+belongs, according to the Unicode Standard classification of the
+Unicode code space into script-specific blocks. This char-table has a
+single extra slot whose value is the list of all script symbols.
 @end defvar
 @defvar char-width-table
@@ -542,7 +536,7 @@ is printable, and if it results in @code{nil}, it is not.
 @cindex coded character set
 An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters in which each character is assigned a numeric code point. (The
-Unicode standard calls this a @dfn{coded character set}.) Each Emacs
+Unicode Standard calls this a @dfn{coded character set}.) Each Emacs
 charset has a name which is a symbol. A single character can belong to any number of different character sets, but it will generally have a different code point in each charset. Examples of character sets
@@ -556,7 +550,7 @@ and strings.
 @cindex @code{eight-bit}, a charset
 Emacs defines several special character sets. The character set @code{unicode} includes all the characters whose Emacs code points are
-in the range @code{0..10FFFF}.
The character set @code{emacs}
+in the range @code{0..#x10FFFF}. The character set @code{emacs}
 includes all @acronym{ASCII} and non-@acronym{ASCII} characters. Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered in text.
@@ -580,10 +574,15 @@ returns a single character set of the highest priority. This function makes @var{charsets} the highest priority character sets.
 @end defun
-@defun char-charset character
+@defun char-charset character &optional restriction
 This function returns the name of the character set of highest priority that @var{character} belongs to. @acronym{ASCII} characters are an exception: for them, this function always returns @code{ascii}.
+
+If @var{restriction} is non-@code{nil}, it should be a list of
+charsets to search. Alternatively, it can be a coding system, in
+which case the returned charset must be supported by that coding
+system (@pxref{Coding Systems}).
 @end defun
 @defun charset-plist charset
@@ -639,33 +638,33 @@ that fits the second argument of @code{decode-char} above. If The following function comes in handy for applying a certain function to all or part of the characters in a charset:
-@defun map-charset-chars function charset &optional arg from to
+@defun map-charset-chars function charset &optional arg from-code to-code
 Call @var{function} for characters in @var{charset}. @var{function} is called with two arguments. The first one is a cons cell @code{(@var{from} . @var{to})}, where @var{from} and @var{to} indicate a range of characters contained in charset. The second
-argument is the optional argument @var{arg}.
+argument passed to @var{function} is @var{arg}.
 By default, the range of codepoints passed to @var{function} includes
-all the characters in @var{charset}, but optional arguments @var{from}
-and @var{to} limit that to the range of characters between these two
-codepoints.
If either of them is @code{nil}, it defaults to the first
-or last codepoint of @var{charset}, respectively.
+all the characters in @var{charset}, but optional arguments
+@var{from-code} and @var{to-code} limit that to the range of
+characters between these two codepoints of @var{charset}. If either
+of them is @code{nil}, it defaults to the first or last codepoint of
+@var{charset}, respectively.
 @end defun
 @node Scanning Charsets
 @section Scanning for Character Sets
- Sometimes it is useful to find out, for characters that appear in a
-certain part of a buffer or a string, to which character sets they
-belong. One use for this is in determining which coding systems
-(@pxref{Coding Systems}) are capable of representing all of the text
-in question; another is to determine the font(s) for displaying that
-text.
+ Sometimes it is useful to find out which character set a particular
+character belongs to. One use for this is in determining which coding
+systems (@pxref{Coding Systems}) are capable of representing all of
+the text in question; another is to determine the font(s) for
+displaying that text.
 @defun charset-after &optional pos
 This function returns the charset of highest priority containing the
-character in the current buffer at position @var{pos}. If @var{pos}
+character at position @var{pos} in the current buffer. If @var{pos}
 is omitted or @code{nil}, it defaults to the current value of point. If @var{pos} is out of range, the value is @code{nil}.
 @end defun
@@ -675,15 +674,15 @@ This function returns a list of the character sets of highest priority that contain characters in the current buffer between positions @var{beg} and @var{end}.
-The optional argument @var{translation} specifies a translation table to
-be used in scanning the text (@pxref{Translation of Characters}).
If it
-is non-@code{nil}, then each character in the region is translated
+The optional argument @var{translation} specifies a translation table
+to use for scanning the text (@pxref{Translation of Characters}). If
+it is non-@code{nil}, then each character in the region is translated
 through this table, and the value returned describes the translated characters instead of the characters actually in the buffer.
 @end defun
 @defun find-charset-string string &optional translation
-This function returns a list of the character sets of highest priority
+This function returns a list of character sets of highest priority
 that contain characters in @var{string}. It is just like @code{find-charset-region}, except that it applies to the contents of @var{string} instead of part of the current buffer.
@@ -721,7 +720,7 @@ character, say @var{to-alt}, @var{from} is also translated to During decoding, the translation table's translations are applied to the characters that result from ordinary decoding. If a coding system
-has property @code{:decode-translation-table}, that specifies the
+has the property @code{:decode-translation-table}, that specifies the
 translation table to use, or a list of translation tables to apply in sequence. (This is a property of the coding system, as returned by @code{coding-system-get}, not a property of the symbol that is the
@@ -751,9 +750,18 @@ systems specifies its own translation tables, the table that is the value of this variable, if non-@code{nil}, is applied after them.
 @end defvar
+@defvar translation-table-for-input
+Self-inserting characters are translated through this translation
+table before they are inserted. Search commands also translate their
+input through this table, so they can compare more reliably with
+what's in the buffer.
+
+This variable automatically becomes buffer-local when set.
+@end defvar
+
 @defun make-translation-table-from-vector vec
 This function returns a translation table made from @var{vec} that is
-an array of 256 elements to map byte values 0 through 255 to
+an array of 256 elements to map bytes (values 0 through #xFF) to
 characters. Elements may be @code{nil} for untranslated bytes. The returned table has a translation table for reverse mapping in the first extra slot, and the value @code{1} in the second extra slot.
@@ -770,8 +778,8 @@ respectively in the @var{props} argument to This function is similar to @code{make-translation-table} but returns a complex translation table rather than a simple one-to-one mapping. Each element of @var{alist} is of the form @code{(@var{from}
-. @var{to})}, where @var{from} and @var{to} are either a character or
-a vector specifying a sequence of characters. If @var{from} is a
+. @var{to})}, where @var{from} and @var{to} are either characters or
+vectors specifying a sequence of characters. If @var{from} is a
 character, that character is translated to @var{to} (i.e.@: to a character or a character sequence). If @var{from} is a vector of characters, that sequence is translated to @var{to}. The returned
@@ -882,10 +890,13 @@ end-of-line conversion. codes or end-of-line.
 @vindex emacs-internal@r{ coding system}
- The coding system @code{emacs-internal} specifies that the data is
-represented in the internal Emacs encoding. This is like
-@code{raw-text} in that no code conversion happens, but different in
-that the result is multibyte data.
+@vindex utf-8-emacs@r{ coding system}
+ The coding system @code{utf-8-emacs} specifies that the data is
+represented in the internal Emacs encoding (@pxref{Text
+Representations}). This is like @code{raw-text} in that no code
+conversion happens, but different in that the result is multibyte
+data. The name @code{emacs-internal} is an alias for
+@code{utf-8-emacs}.
 @defun coding-system-get coding-system property
 This function returns the specified property of the coding system
@@ -915,9 +926,9 @@ This function returns the list of aliases of @var{coding-system}.
 @subsection Encoding and I/O
 The principal purpose of coding systems is for use in reading and
-writing files. The function @code{insert-file-contents} uses
-a coding system for decoding the file data, and @code{write-region}
-uses one to encode the buffer contents.
+writing files. The function @code{insert-file-contents} uses a coding
+system to decode the file data, and @code{write-region} uses one to
+encode the buffer contents.
 You can specify the coding system to use either explicitly (@pxref{Specifying Coding Systems}), or implicitly using a default
@@ -997,6 +1008,7 @@ new file name for that buffer. Here are the Lisp facilities for working with coding systems:
+@cindex list all coding systems
 @defun coding-system-list &optional base-only
 This function returns a list of all coding system names (symbols). If @var{base-only} is non-@code{nil}, the value includes only the
@@ -1009,6 +1021,8 @@ This function returns @code{t} if @var{object} is a coding system name or @code{nil}.
 @end defun
+@cindex validity of coding system
+@cindex coding system, validity check
 @defun check-coding-system coding-system
 This function checks the validity of @var{coding-system}. If that is valid, it returns @var{coding-system}. If @var{coding-system} is
@@ -1017,6 +1031,7 @@ signals an error whose @code{error-symbol} is @code{coding-system-error} (@pxref{Signaling Errors, signal}).
 @end defun
+@cindex eol type of coding system
 @defun coding-system-eol-type coding-system
 This function returns the type of end-of-line (a.k.a.@: @dfn{eol}) conversion used by @var{coding-system}.
If @var{coding-system}
@@ -1038,11 +1053,12 @@ decoding, the end-of-line format of the text is auto-detected, and the eol conversion is set to match it (e.g., DOS-style CRLF format will imply @code{dos} eol conversion). For encoding, the eol conversion is taken from the appropriate default coding system (e.g.,
-@code{default-buffer-file-coding-system} for
+default value of @code{buffer-file-coding-system} for
 @code{buffer-file-coding-system}), or from the default eol conversion appropriate for the underlying platform.
 @end defun
+@cindex eol conversion of coding system
 @defun coding-system-change-eol-conversion coding-system eol-type
 This function returns a coding system which is like @var{coding-system} except for its eol conversion, which is specified by @code{eol-type}.
@@ -1054,6 +1070,7 @@ the end-of-line conversion from the data. @code{dos} and @code{mac}, respectively.
 @end defun
+@cindex text conversion of coding system
 @defun coding-system-change-text-conversion eol-coding text-coding
 This function returns a coding system which uses the end-of-line conversion of @var{eol-coding}, and the text conversion of
@@ -1061,6 +1078,8 @@ conversion of @var{eol-coding}, and the text conversion of @code{undecided}, or one of its variants according to @var{eol-coding}.
 @end defun
+@cindex safely encode region
+@cindex coding systems for encoding region
 @defun find-coding-systems-region from to
 This function returns a list of coding systems that could be used to encode a text between @var{from} and @var{to}. All coding systems in
@@ -1071,6 +1090,8 @@ If the text contains no multibyte characters, the function returns the list @code{(undecided)}.
 @end defun
+@cindex safely encode a string
+@cindex coding systems for encoding a string
 @defun find-coding-systems-string string
 This function returns a list of coding systems that could be used to encode the text of @var{string}.
All coding systems in the list can
@@ -1079,6 +1100,8 @@ contains no multibyte characters, this returns the list @code{(undecided)}.
 @end defun
+@cindex charset, coding systems to encode
+@cindex safely encode characters in a charset
 @defun find-coding-systems-for-charsets charsets
 This function returns a list of coding systems that could be used to encode all the character sets in the list @var{charsets}.
@@ -1126,6 +1149,7 @@ This function is like @code{detect-coding-region} except that it operates on the contents of @var{string} instead of bytes in the buffer.
 @end defun
+@cindex null bytes, and decoding text
 @defvar inhibit-null-byte-detection
 If this variable has a non-@code{nil} value, null bytes are ignored when detecting the encoding of a region or a string. This allows to
@@ -1142,6 +1166,7 @@ encoding, and all escape sequences become visible in a buffer. because many files in the Emacs distribution use ISO-2022 encoding.}
 @end defvar
+@cindex charsets supported by a coding system
 @defun coding-system-charset-list coding-system
 This function returns the list of character sets (@pxref{Character Sets}) supported by @var{coding-system}. Some coding systems that
@@ -1178,14 +1203,18 @@ is the text in the current buffer between @var{from} and @var{to}. If @var{from} is a string, the string specifies the text to encode, and @var{to} is ignored.
+If the specified text includes raw bytes (@pxref{Text
+Representations}), @code{select-safe-coding-system} suggests
+@code{raw-text} for its encoding.
+
 If @var{default-coding-system} is non-@code{nil}, that is the first coding system to try; if that can handle the text, @code{select-safe-coding-system} returns that coding system. It can also be a list of coding systems; then the function tries each of them one by one.
After trying all of them, it next tries the current buffer's value of @code{buffer-file-coding-system} (if it is not
-@code{undecided}), then the value of
-@code{default-buffer-file-coding-system} and finally the user's most
+@code{undecided}), then the default value of
+@code{buffer-file-coding-system} and finally the user's most
 preferred coding system, which the user can set using the command @code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing Coding Systems, emacs, The GNU Emacs Manual}).
@@ -1212,8 +1241,9 @@ possible candidates.
 @vindex select-safe-coding-system-accept-default-p
 If the variable @code{select-safe-coding-system-accept-default-p} is
-non-@code{nil}, its value overrides the value of
-@var{accept-default-p}.
+non-@code{nil}, it should be a function taking a single argument.
+It is used in place of @var{accept-default-p}, overriding any
+value supplied for this argument.
 As a final step, before returning the chosen coding system, @code{select-safe-coding-system} checks whether that coding system is
@@ -1247,6 +1277,8 @@ the user tries to enter null input, it asks the user to try again.
 @node Default Coding Systems
 @subsection Default Coding Systems
+@cindex default coding system
+@cindex coding system, automatically determined
 This section describes variables that specify the default coding system for certain files or when running certain subprograms, and the
@@ -1259,7 +1291,8 @@ don't change these variables; instead, override them using @code{coding-system-for-read} and @code{coding-system-for-write} (@pxref{Specifying Coding Systems}).
-@defvar auto-coding-regexp-alist
+@cindex file contents, and default coding system
+@defopt auto-coding-regexp-alist
 This variable is an alist of text patterns and corresponding coding systems. Each element has the form @code{(@var{regexp} . @var{coding-system})}; a file whose first few kilobytes match
@@ -1269,9 +1302,10 @@ read into a buffer.
The settings in this alist take priority over @code{file-coding-system-alist} (see below). The default value is set so that Emacs automatically recognizes mail files in Babyl format and reads them with no code conversions. -@end defvar +@end defopt -@defvar file-coding-system-alist +@cindex file name, and default coding system +@defopt file-coding-system-alist This variable is an alist that specifies the coding systems to use for reading and writing particular files. Each element has the form @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular @@ -1294,8 +1328,16 @@ meaning as described above. If @var{coding} (or what returned by the above function) is @code{undecided}, the normal code-detection is performed. -@end defvar +@end defopt + +@defopt auto-coding-alist +This variable is an alist that specifies the coding systems to use for +reading and writing particular files. Its form is like that of +@code{file-coding-system-alist}, but, unlike the latter, this variable +takes priority over any @code{coding:} tags in the file. +@end defopt +@cindex program name, and default coding system @defvar process-coding-system-alist This variable is an alist specifying which coding systems to use for a subprocess, depending on which program is running in the subprocess. It @@ -1319,6 +1361,8 @@ coding system which determines both the character code conversion and the end of line conversion---that is, one like @code{latin-1-unix}, rather than @code{undecided} or @code{latin-1}. +@cindex port number, and default coding system +@cindex network service name, and default coding system @defvar network-coding-system-alist This variable is an alist that specifies the coding system to use for network streams. It works much like @code{file-coding-system-alist}, @@ -1338,7 +1382,8 @@ The value should be a cons cell of the form @code{(@var{input-coding} the subprocess, and @var{output-coding} applies to output to it. 
@end defvar -@defvar auto-coding-functions +@cindex default coding system, functions to determine +@defopt auto-coding-functions This variable holds a list of functions that try to determine a coding system for a file based on its undecoded contents. @@ -1352,7 +1397,40 @@ Otherwise, it should return @code{nil}. If a file has a @samp{coding:} tag, that takes precedence, so these functions won't be called. -@end defvar +@end defopt + +@defun find-auto-coding filename size +This function tries to determine a suitable coding system for +@var{filename}. It examines the buffer visiting the named file, using +the variables documented above in sequence, until it finds a match for +one of the rules specified by these variables. It then returns a cons +cell of the form @code{(@var{coding} . @var{source})}, where +@var{coding} is the coding system to use and @var{source} is a symbol, +one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist}, +@code{:coding}, or @code{auto-coding-functions}, indicating which one +supplied the matching rule. The value @code{:coding} means the coding +system was specified by the @code{coding:} tag in the file +(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}). +The order of looking for a matching rule is @code{auto-coding-alist} +first, then @code{auto-coding-regexp-alist}, then the @code{coding:} +tag, and lastly @code{auto-coding-functions}. If no matching rule was +found, the function returns @code{nil}. + +The second argument @var{size} is the size of text, in characters, +following point. The function examines text only within @var{size} +characters after point. Normally, the buffer should be positioned at +the beginning when this function is called, because one of the places +for the @code{coding:} tag is the first one or two lines of the file; +in that case, @var{size} should be the size of the buffer. 
+@end defun + +@defun set-auto-coding filename size +This function returns a suitable coding system for file +@var{filename}. It uses @code{find-auto-coding} to find the coding +system. If no coding system could be determined, the function returns +@code{nil}. The meaning of the argument @var{size} is like in +@code{find-auto-coding}. +@end defun @defun find-operation-coding-system operation &rest arguments This function returns the coding system to use (by default) for @@ -1446,12 +1524,12 @@ When a single operation does both input and output, as do affect it. @end defvar -@defvar inhibit-eol-conversion +@defopt inhibit-eol-conversion When this variable is non-@code{nil}, no end-of-line conversion is done, no matter which coding system is specified. This applies to all the Emacs I/O and subprocess primitives, and to the explicit encoding and decoding functions (@pxref{Explicit Encoding}). -@end defvar +@end defopt @cindex priority order of coding systems @cindex coding systems, priority @@ -1494,10 +1572,10 @@ in this section. text. They logically consist of a series of byte values; that is, a series of @acronym{ASCII} and eight-bit characters. In unibyte buffers and strings, these characters have codes in the range 0 -through 255. In a multibyte buffer or string, eight-bit characters -have character codes higher than 255 (@pxref{Text Representations}), -but Emacs transparently converts them to their single-byte values when -you encode or decode such text. +through #xFF (255). In a multibyte buffer or string, eight-bit +characters have character codes higher than #xFF (@pxref{Text +Representations}), but Emacs transparently converts them to their +single-byte values when you encode or decode such text. 
The usual way to read a file into a buffer as a sequence of bytes, so you can decode the contents explicitly, is with @@ -1534,6 +1612,13 @@ The result of encoding is logically a sequence of bytes, but the buffer remains multibyte if it was multibyte before, and any 8-bit bytes are converted to their multibyte representation (@pxref{Text Representations}). + +@cindex @code{undecided} coding-system, when encoding +Do @emph{not} use @code{undecided} for @var{coding-system} when +encoding text, since that may lead to unexpected results. Instead, +use @code{select-safe-coding-system} (@pxref{User-Chosen Coding +Systems, select-safe-coding-system}) to suggest a suitable encoding, +if there's no obvious pertinent value for @var{coding-system}. @end deffn @defun encode-coding-string string coding-system &optional nocopy buffer @@ -1544,7 +1629,7 @@ case the function may return @var{string} itself if the encoding operation is trivial. The result of encoding is a unibyte string. @end defun -@deffn Command decode-coding-region start end coding-system destination +@deffn Command decode-coding-region start end coding-system &optional destination This command decodes the text from @var{start} to @var{end} according to coding system @var{coding-system}. To make explicit decoding useful, the text before decoding ought to be a sequence of byte @@ -1559,6 +1644,10 @@ inserting it. If decoded text is inserted in some buffer, this command returns the length of the decoded text. + +This command puts a @code{charset} text property on the decoded text. +The value of the property states the character set used to decode the +original text. @end deffn @defun decode-coding-string string coding-system &optional nocopy buffer @@ -1574,6 +1663,18 @@ contains 8-bit bytes in their multibyte form). If optional argument @var{buffer} specifies a buffer, the decoded text is inserted in that buffer after point (point does not move). In this case, the return value is the length of the decoded text. 
+ +@cindex @code{charset}, text property +This function puts a @code{charset} text property on the decoded text. +The value of the property states the character set used to decode the +original text: + +@example +@group +(decode-coding-string "Gr\374ss Gott" 'latin-1) + @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1)) +@end group +@end example @end defun @defun decode-coding-inserted-region from to filename &optional visit beg end replace @@ -1596,26 +1697,37 @@ display text using a particular encoding such as Latin-1. Emacs does not set @code{last-coding-system-used} for encoding or decoding of terminal I/O. -@defun keyboard-coding-system +@defun keyboard-coding-system &optional terminal This function returns the coding system that is in use for decoding -keyboard input---or @code{nil} if no coding system is to be used. +keyboard input from @var{terminal}---or @code{nil} if no coding system +is to be used for that terminal. If @var{terminal} is omitted or +@code{nil}, it means the selected frame's terminal. @xref{Multiple +Terminals}. @end defun -@deffn Command set-keyboard-coding-system coding-system -This command specifies @var{coding-system} as the coding system to -use for decoding keyboard input. If @var{coding-system} is @code{nil}, -that means do not decode keyboard input. +@deffn Command set-keyboard-coding-system coding-system &optional terminal +This command specifies @var{coding-system} as the coding system to use +for decoding keyboard input from @var{terminal}. If +@var{coding-system} is @code{nil}, that means do not decode keyboard +input. If @var{terminal} is a frame, it means that frame's terminal; +if it is @code{nil}, that means the currently selected frame's +terminal. @xref{Multiple Terminals}. @end deffn -@defun terminal-coding-system +@defun terminal-coding-system &optional terminal This function returns the coding system that is in use for encoding -terminal output---or @code{nil} for no encoding. 
+terminal output from @var{terminal}---or @code{nil} if the output is +not encoded. If @var{terminal} is a frame, it means that frame's +terminal; if it is @code{nil}, that means the currently selected +frame's terminal. @end defun -@deffn Command set-terminal-coding-system coding-system +@deffn Command set-terminal-coding-system coding-system &optional terminal This command specifies @var{coding-system} as the coding system to use -for encoding terminal output. If @var{coding-system} is @code{nil}, -that means do not encode terminal output. +for encoding terminal output from @var{terminal}. If +@var{coding-system} is @code{nil}, terminal output is not encoded. If +@var{terminal} is a frame, it means that frame's terminal; if it is +@code{nil}, that means the currently selected frame's terminal. @end deffn @node MS-DOS File Types @@ -1648,6 +1760,13 @@ Otherwise, @code{undecided-dos} is used. Normally this variable is set by visiting a file; it is set to @code{nil} if the file was visited without any actual conversion. + +Its default value is used to decide how to handle files for which +@code{file-name-buffer-file-type-alist} says nothing about the type: +If the default value is non-@code{nil}, then these files are treated as +binary: the coding system @code{no-conversion} is used. Otherwise, +nothing special is done for them---the coding system is deduced solely +from the file contents, in the usual Emacs fashion. @end defvar @defopt file-name-buffer-file-type-alist @@ -1664,17 +1783,7 @@ which coding system to use when reading a file. For a text file, is used. If no element in this alist matches a given file name, then -@code{default-buffer-file-type} says how to treat the file. -@end defopt - -@defopt default-buffer-file-type -This variable says how to handle files for which -@code{file-name-buffer-file-type-alist} says nothing about the type. 
- -If this variable is non-@code{nil}, then these files are treated as -binary: the coding system @code{no-conversion} is used. Otherwise, -nothing special is done for them---the coding system is deduced solely -from the file contents, in the usual Emacs fashion. +the default value of @code{buffer-file-type} says how to treat the file. @end defopt @node Input Methods