X-Git-Url: https://code.delx.au/gnu-emacs/blobdiff_plain/e5e76c04310d287a56675876dd83e1089faba215..233ba4d924933cb56129bd7511e6137b7c0b8e3e:/doc/lispref/nonascii.texi diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 7c504aef2c..409ecc7e20 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi @@ -1,7 +1,6 @@ @c -*-texinfo-*- @c This is part of the GNU Emacs Lisp Reference Manual. -@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004, -@c 2005, 2006, 2007 Free Software Foundation, Inc. +@c Copyright (C) 1998-1999, 2001-2011 Free Software Foundation, Inc. @c See the file elisp.texi for copying conditions. @setfilename ../../info/characters @node Non-ASCII Characters, Searching and Matching, Text, Top @@ -10,19 +9,19 @@ @cindex characters, multi-byte @cindex non-@acronym{ASCII} characters - This chapter covers the special issues relating to non-@acronym{ASCII} -characters and how they are stored in strings and buffers. + This chapter covers the special issues relating to characters and +how they are stored in strings and buffers. @menu -* Text Representations:: Unibyte and multibyte representations +* Text Representations:: How Emacs represents text. * Converting Representations:: Converting unibyte to multibyte and vice versa. * Selecting a Representation:: Treating a byte sequence as unibyte or multi. * Character Codes:: How unibyte and multibyte relate to codes of individual characters. +* Character Properties:: Character attributes that define their + behavior and handling. * Character Sets:: The space of possible character codes is divided into various character sets. -* Chars and Bytes:: More information about multibyte encodings. -* Splitting Characters:: Converting a character to its byte sequence. * Scanning Charsets:: Which character sets are used in a buffer? * Translation of Characters:: Translation tables are used for conversion. * Coding Systems:: Coding systems are conversions for saving files. 
@@ -33,41 +32,62 @@ characters and how they are stored in strings and buffers. @node Text Representations @section Text Representations -@cindex text representations - - Emacs has two @dfn{text representations}---two ways to represent text -in a string or buffer. These are called @dfn{unibyte} and -@dfn{multibyte}. Each string, and each buffer, uses one of these two -representations. For most purposes, you can ignore the issue of -representations, because Emacs converts text between them as -appropriate. Occasionally in Lisp programming you will need to pay -attention to the difference. +@cindex text representation + + Emacs buffers and strings support a large repertoire of characters +from many different scripts, allowing users to type and display text +in almost any known written language. + +@cindex character codepoint +@cindex codespace +@cindex Unicode + To support this multitude of characters and scripts, Emacs closely +follows the @dfn{Unicode Standard}. The Unicode Standard assigns a +unique number, called a @dfn{codepoint}, to each and every character. +The range of codepoints defined by Unicode, or the Unicode +@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation), +inclusive. Emacs extends this range with codepoints in the range +@code{#x110000..#x3FFFFF}, which it uses for representing characters +that are not unified with Unicode and @dfn{raw 8-bit bytes} that +cannot be interpreted as characters. Thus, a character codepoint in +Emacs is a 22-bit integer number. + +@cindex internal representation of characters +@cindex characters, representation in buffers and strings +@cindex multibyte text + To conserve memory, Emacs does not hold fixed-length 22-bit numbers +that are codepoints of text characters within buffers and strings. 
+Rather, Emacs uses a variable-length internal representation of +characters, that stores each character as a sequence of 1 to 5 8-bit +bytes, depending on the magnitude of its codepoint@footnote{ +This internal representation is based on one of the encodings defined +by the Unicode Standard, called @dfn{UTF-8}, for representing any +Unicode codepoint, but Emacs extends UTF-8 to represent the additional +codepoints it uses for raw 8-bit bytes and characters not unified with +Unicode.}. For example, any @acronym{ASCII} character takes up only 1 +byte, a Latin-1 character takes up 2 bytes, etc. We call this +representation of text @dfn{multibyte}. + + Outside Emacs, characters can be represented in many different +encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts +between these external encodings and its internal representation, as +appropriate, when it reads text into a buffer or a string, or when it +writes text to a disk file or passes it to some other process. + + Occasionally, Emacs needs to hold and manipulate encoded text or +binary non-text data in its buffers or strings. For example, when +Emacs visits a file, it first reads the file's text verbatim into a +buffer, and only then converts it to the internal representation. +Before the conversion, the buffer holds encoded text. @cindex unibyte text - In unibyte representation, each character occupies one byte and -therefore the possible character codes range from 0 to 255. Codes 0 -through 127 are @acronym{ASCII} characters; the codes from 128 through 255 -are used for one non-@acronym{ASCII} character set (you can choose which -character set by setting the variable @code{nonascii-insert-offset}). - -@cindex leading code -@cindex multibyte text -@cindex trailing codes - In multibyte representation, a character may occupy more than one -byte, and as a result, the full range of Emacs character codes can be -stored. 
The first byte of a multibyte character is always in the range -128 through 159 (octal 0200 through 0237). These values are called -@dfn{leading codes}. The second and subsequent bytes of a multibyte -character are always in the range 160 through 255 (octal 0240 through -0377); these values are @dfn{trailing codes}. - - Some sequences of bytes are not valid in multibyte text: for example, -a single isolated byte in the range 128 through 159 is not allowed. But -character codes 128 through 159 can appear in multibyte text, -represented as two-byte sequences. All the character codes 128 through -255 are possible (though slightly abnormal) in multibyte text; they -appear in multibyte buffers and strings when you do explicit encoding -and decoding (@pxref{Explicit Encoding}). + Encoded text is not really text, as far as Emacs is concerned, but +rather a sequence of raw 8-bit bytes. We call buffers and strings +that hold encoded text @dfn{unibyte} buffers and strings, because +Emacs treats them as a sequence of individual bytes. Usually, Emacs +displays unibyte buffers and strings as octal codes such as +@code{\237}. We recommend that you never use unibyte buffers and +strings except for manipulating encoded text or binary non-text data. In a buffer, the buffer-local value of the variable @code{enable-multibyte-characters} specifies the representation used. @@ -77,39 +97,35 @@ when the string is constructed. @defvar enable-multibyte-characters This variable specifies the current buffer's text representation. If it is non-@code{nil}, the buffer contains multibyte text; otherwise, -it contains unibyte text. +it contains unibyte encoded text or binary non-text data. You cannot set this variable directly; instead, use the function @code{set-buffer-multibyte} to change a buffer's representation. 
@end defvar -@defvar default-enable-multibyte-characters -This variable's value is entirely equivalent to @code{(default-value -'enable-multibyte-characters)}, and setting this variable changes that -default value. Setting the local binding of -@code{enable-multibyte-characters} in a specific buffer is not allowed, -but changing the default value is supported, and it is a reasonable -thing to do, because it has no effect on existing buffers. - -The @samp{--unibyte} command line option does its job by setting the -default value to @code{nil} early in startup. -@end defvar - @defun position-bytes position -Return the byte-position corresponding to buffer position +Buffer positions are measured in character units. This function +returns the byte-position corresponding to buffer position @var{position} in the current buffer. This is 1 at the start of the buffer, and counts upward in bytes. If @var{position} is out of range, the value is @code{nil}. @end defun @defun byte-to-position byte-position -Return the buffer position corresponding to byte-position +Return the buffer position, in character units, corresponding to given @var{byte-position} in the current buffer. If @var{byte-position} is -out of range, the value is @code{nil}. +out of range, the value is @code{nil}. In a multibyte buffer, an +arbitrary value of @var{byte-position} can be not at character +boundary, but inside a multibyte sequence representing a single +character; in this case, this function returns the buffer position of +the character whose multibyte sequence includes @var{byte-position}. +In other words, the value does not change for all byte positions that +belong to the same character. @end defun @defun multibyte-string-p string -Return @code{t} if @var{string} is a multibyte string. +Return @code{t} if @var{string} is a multibyte string, @code{nil} +otherwise. 
@end defun @defun string-bytes string @@ -119,19 +135,25 @@ If @var{string} is a multibyte string, this can be greater than @code{(length @var{string})}. @end defun +@defun unibyte-string &rest bytes +This function concatenates all its argument @var{bytes} and makes the +result a unibyte string. +@end defun + @node Converting Representations @section Converting Text Representations Emacs can convert unibyte text to multibyte; it can also convert -multibyte text to unibyte, though this conversion loses information. In -general these conversions happen when inserting text into a buffer, or -when putting text from several strings together in one string. You can -also explicitly convert a string's contents to either representation. - - Emacs chooses the representation for a string based on the text that -it is constructed from. The general rule is to convert unibyte text to -multibyte text when combining it with other multibyte text, because the -multibyte representation is more general and can hold whatever +multibyte text to unibyte, provided that the multibyte text contains +only @acronym{ASCII} and 8-bit raw bytes. In general, these +conversions happen when inserting text into a buffer, or when putting +text from several strings together in one string. You can also +explicitly convert a string's contents to either representation. + + Emacs chooses the representation for a string based on the text from +which it is constructed. The general rule is to convert unibyte text +to multibyte text when combining it with other multibyte text, because +the multibyte representation is more general and can hold whatever characters the unibyte text has. When inserting text into a buffer, Emacs converts the text to the @@ -144,90 +166,55 @@ alternative, to convert the buffer contents to multibyte, is not acceptable because the buffer's representation is a choice made by the user that cannot be overridden automatically. 
- Converting unibyte text to multibyte text leaves @acronym{ASCII} characters -unchanged, and likewise character codes 128 through 159. It converts -the non-@acronym{ASCII} codes 160 through 255 by adding the value -@code{nonascii-insert-offset} to each character code. By setting this -variable, you specify which character set the unibyte characters -correspond to (@pxref{Character Sets}). For example, if -@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char -'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters -correspond to Latin 1. If it is 2688, which is @code{(- (make-char -'greek-iso8859-7) 128)}, then they correspond to Greek letters. - - Converting multibyte text to unibyte is simpler: it discards all but -the low 8 bits of each character code. If @code{nonascii-insert-offset} -has a reasonable value, corresponding to the beginning of some character -set, this conversion is the inverse of the other: converting unibyte -text to multibyte and back to unibyte reproduces the original unibyte -text. - -@defvar nonascii-insert-offset -This variable specifies the amount to add to a non-@acronym{ASCII} character -when converting unibyte text to multibyte. It also applies when -@code{self-insert-command} inserts a character in the unibyte -non-@acronym{ASCII} range, 128 through 255. However, the functions -@code{insert} and @code{insert-char} do not perform this conversion. - -The right value to use to select character set @var{cs} is @code{(- -(make-char @var{cs}) 128)}. If the value of -@code{nonascii-insert-offset} is zero, then conversion actually uses the -value for the Latin 1 character set, rather than zero. -@end defvar + Converting unibyte text to multibyte text leaves @acronym{ASCII} +characters unchanged, and converts bytes with codes 128 through 159 to +the multibyte representation of raw eight-bit bytes. 
-@defvar nonascii-translation-table -This variable provides a more general alternative to -@code{nonascii-insert-offset}. You can use it to specify independently -how to translate each code in the range of 128 through 255 into a -multibyte character. The value should be a char-table, or @code{nil}. -If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}. -@end defvar + Converting multibyte text to unibyte converts all @acronym{ASCII} +and eight-bit characters to their single-byte form, but loses +information for non-@acronym{ASCII} characters by discarding all but +the low 8 bits of each character's codepoint. Converting unibyte text +to multibyte and back to unibyte reproduces the original unibyte text. -The next three functions either return the argument @var{string}, or a +The next two functions either return the argument @var{string}, or a newly created string with no text properties. -@defun string-make-unibyte string -This function converts the text of @var{string} to unibyte -representation, if it isn't already, and returns the result. If -@var{string} is a unibyte string, it is returned unchanged. Multibyte -character codes are converted to unibyte according to -@code{nonascii-translation-table} or, if that is @code{nil}, using -@code{nonascii-insert-offset}. If the lookup in the translation table -fails, this function takes just the low 8 bits of each character. -@end defun - -@defun string-make-multibyte string -This function converts the text of @var{string} to multibyte -representation, if it isn't already, and returns the result. If -@var{string} is a multibyte string or consists entirely of -@acronym{ASCII} characters, it is returned unchanged. In particular, -if @var{string} is unibyte and entirely @acronym{ASCII}, the returned -string is unibyte. (When the characters are all @acronym{ASCII}, -Emacs primitives will treat the string the same way whether it is -unibyte or multibyte.) 
 If @var{string} is unibyte and contains -non-@acronym{ASCII} characters, the function -@code{unibyte-char-to-multibyte} is used to convert each unibyte -character to a multibyte character. -@end defun - @defun string-to-multibyte string This function returns a multibyte string containing the same sequence -of character codes as @var{string}. Unlike -@code{string-make-multibyte}, this function unconditionally returns a -multibyte string. If @var{string} is a multibyte string, it is -returned unchanged. +of characters as @var{string}. If @var{string} is a multibyte string, +it is returned unchanged. The function assumes that @var{string} +includes only @acronym{ASCII} characters and raw 8-bit bytes; the +latter are converted to their multibyte representation corresponding +to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive +(@pxref{Text Representations, codepoints}). +@end defun + +@defun string-to-unibyte string +This function returns a unibyte string containing the same sequence of +characters as @var{string}. It signals an error if @var{string} +contains a non-@acronym{ASCII} character. If @var{string} is a +unibyte string, it is returned unchanged. Use this function for +@var{string} arguments that contain only @acronym{ASCII} and eight-bit +characters. +@end defun + +@defun byte-to-string byte +@cindex byte to string +This function returns a unibyte string containing a single byte of +character data, @var{byte}. It signals an error if +@var{byte} is not an integer between 0 and 255. @end defun @defun multibyte-char-to-unibyte char -This convert the multibyte character @var{char} to a unibyte -character, based on @code{nonascii-translation-table} and -@code{nonascii-insert-offset}. +This converts the multibyte character @var{char} to a unibyte +character, and returns that character. If @var{char} is neither +@acronym{ASCII} nor eight-bit, the function returns -1. 
@end defun @defun unibyte-char-to-multibyte char This converts the unibyte character @var{char} to a multibyte -character, based on @code{nonascii-translation-table} and -@code{nonascii-insert-offset}. +character, assuming @var{char} is either @acronym{ASCII} or a raw 8-bit +byte. @end defun @node Selecting a Representation @@ -242,13 +229,13 @@ is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte} is @code{nil}, the buffer becomes unibyte. This function leaves the buffer contents unchanged when viewed as a -sequence of bytes. As a consequence, it can change the contents viewed -as characters; a sequence of two bytes which is treated as one character -in multibyte representation will count as two characters in unibyte -representation. Character codes 128 through 159 are an exception. They -are represented by one byte in a unibyte buffer, but when the buffer is -set to multibyte, they are converted to two-byte sequences, and vice -versa. +sequence of bytes. As a consequence, it can change the contents +viewed as characters; for instance, a sequence of three bytes which is +treated as one character in multibyte representation will count as +three characters in unibyte representation. Eight-bit characters +representing raw bytes are an exception. They are represented by one +byte in a unibyte buffer, but when the buffer is set to multibyte, +they are converted to two-byte sequences, and vice versa. This function sets @code{enable-multibyte-characters} to record which representation is in use. It also adjusts various data in the buffer @@ -261,255 +248,447 @@ base buffer. @end defun @defun string-as-unibyte string -This function returns a string with the same bytes as @var{string} but -treating each byte as a character. This means that the value may have -more characters than @var{string} has. - -If @var{string} is already a unibyte string, then the value is -@var{string} itself. Otherwise it is a newly created string, with no -text properties. 
If @var{string} is multibyte, any characters it -contains of charset @code{eight-bit-control} or @code{eight-bit-graphic} -are converted to the corresponding single byte. +If @var{string} is already a unibyte string, this function returns +@var{string} itself. Otherwise, it returns a new string with the same +bytes as @var{string}, but treating each byte as a separate character +(so that the value may have more characters than @var{string}); as an +exception, each eight-bit character representing a raw byte is +converted into a single byte. The newly-created string contains no +text properties. @end defun @defun string-as-multibyte string -This function returns a string with the same bytes as @var{string} but -treating each multibyte sequence as one character. This means that the -value may have fewer characters than @var{string} has. - -If @var{string} is already a multibyte string, then the value is -@var{string} itself. Otherwise it is a newly created string, with no -text properties. If @var{string} is unibyte and contains any individual -8-bit bytes (i.e.@: not part of a multibyte form), they are converted to -the corresponding multibyte character of charset @code{eight-bit-control} -or @code{eight-bit-graphic}. +If @var{string} is a multibyte string, this function returns +@var{string} itself. Otherwise, it returns a new string with the same +bytes as @var{string}, but treating each multibyte sequence as one +character. This means that the value may have fewer characters than +@var{string} has. If a byte sequence in @var{string} is invalid as a +multibyte representation of a single character, each byte in the +sequence is treated as a raw 8-bit byte. The newly-created string +contains no text properties. @end defun @node Character Codes @section Character Codes @cindex character codes - The unibyte and multibyte text representations use different character -codes. 
The valid character codes for unibyte representation range from -0 to 255---the values that can fit in one byte. The valid character -codes for multibyte representation range from 0 to 524287, but not all -values in that range are valid. The values 128 through 255 are not -entirely proper in multibyte text, but they can occur if you do explicit -encoding and decoding (@pxref{Explicit Encoding}). Some other character -codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes -0 through 127 are completely legitimate in both representations. - -@defun char-valid-p charcode &optional genericp -This returns @code{t} if @var{charcode} is valid (either for unibyte -text or for multibyte text). + The unibyte and multibyte text representations use different +character codes. The valid character codes for unibyte representation +range from 0 to @code{#xFF} (255)---the values that can fit in one +byte. The valid character codes for multibyte representation range +from 0 to @code{#x3FFFFF}. In this code space, values 0 through +@code{#x7F} (127) are for @acronym{ASCII} characters, and values +@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for +non-@acronym{ASCII} characters. + + Emacs character codes are a superset of the Unicode standard. +Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode +characters of the same codepoint; values @code{#x110000} (1114112) +through @code{#x3FFF7F} (4194175) represent characters that are not +unified with Unicode; and values @code{#x3FFF80} (4194176) through +@code{#x3FFFFF} (4194303) represent eight-bit raw bytes. + +@defun characterp charcode +This returns @code{t} if @var{charcode} is a valid character, and +@code{nil} otherwise. 
@example -(char-valid-p 65) +@group +(characterp 65) + @result{} t +@end group +@group +(characterp 4194303) @result{} t -(char-valid-p 256) +@end group +@group +(characterp 4194304) @result{} nil -(char-valid-p 2248) +@end group +@end example +@end defun + +@cindex maximum value of character codepoint +@cindex codepoint, largest value +@defun max-char +This function returns the largest value that a valid character +codepoint can have. + +@example +@group +(characterp (max-char)) @result{} t +@end group +@group +(characterp (1+ (max-char))) + @result{} nil +@end group @end example +@end defun -If the optional argument @var{genericp} is non-@code{nil}, this -function also returns @code{t} if @var{charcode} is a generic -character (@pxref{Splitting Characters}). +@defun get-byte &optional pos string +This function returns the byte at character position @var{pos} in the +current buffer. If the current buffer is unibyte, this is literally +the byte at that position. If the buffer is multibyte, byte values of +@acronym{ASCII} characters are the same as character codepoints, +whereas eight-bit raw bytes are converted to their 8-bit codes. The +function signals an error if the character at @var{pos} is +non-@acronym{ASCII}. + +The optional argument @var{string} means to get a byte value from that +string instead of the current buffer. @end defun -@node Character Sets -@section Character Sets -@cindex character sets +@node Character Properties +@section Character Properties +@cindex character properties +A @dfn{character property} is a named attribute of a character that +specifies how the character behaves and how it should be handled +during text processing and display. Thus, character properties are an +important part of specifying the character's semantics. + + On the whole, Emacs follows the Unicode Standard in its implementation +of character properties. 
In particular, Emacs supports the +@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property +Model}, and the Emacs character property database is derived from the +Unicode Character Database (@acronym{UCD}). See the +@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character +Properties chapter of the Unicode Standard}, for a detailed +description of Unicode character properties and their meaning. This +section assumes you are already familiar with that chapter of the +Unicode Standard, and want to apply that knowledge to Emacs Lisp +programs. - Emacs classifies characters into various @dfn{character sets}, each of -which has a name which is a symbol. Each character belongs to one and -only one character set. + In Emacs, each property has a name, which is a symbol, and a set of +possible values, whose types depend on the property; if a character +does not have a certain property, the value is @code{nil}. As a +general rule, the names of character properties in Emacs are produced +from the corresponding Unicode properties by downcasing them and +replacing each @samp{_} character with a dash @samp{-}. For example, +@code{Canonical_Combining_Class} becomes +@code{canonical-combining-class}. However, sometimes we shorten the +names to make their use easier. - In general, there is one character set for each distinct script. For -example, @code{latin-iso8859-1} is one character set, -@code{greek-iso8859-7} is another, and @code{ascii} is another. An -Emacs character set can hold at most 9025 characters; therefore, in some -cases, characters that would logically be grouped together are split -into several character sets. For example, one set of Chinese -characters, generally known as Big 5, is divided into two Emacs -character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. + Here is the full list of value types for all the character +properties that Emacs knows about: - @acronym{ASCII} characters are in character set @code{ascii}. 
 The -non-@acronym{ASCII} characters 128 through 159 are in character set -@code{eight-bit-control}, and codes 160 through 255 are in character set -@code{eight-bit-graphic}. +@table @code +@item name +This property corresponds to the Unicode @code{Name} property. The +value is a string consisting of upper-case Latin letters A to Z, +digits, spaces, and hyphen @samp{-} characters. + +@cindex unicode general category +@item general-category +This property corresponds to the Unicode @code{General_Category} +property. The value is a symbol whose name is a 2-letter abbreviation +of the character's classification. + +@item canonical-combining-class +Corresponds to the Unicode @code{Canonical_Combining_Class} property. +The value is an integer number. + +@item bidi-class +Corresponds to the Unicode @code{Bidi_Class} property. The value is a +symbol whose name is the Unicode @dfn{directional type} of the +character. + +@item decomposition +Corresponds to the Unicode @code{Decomposition_Type} and +@code{Decomposition_Value} properties. The value is a list, whose +first element may be a symbol representing a compatibility formatting +tag, such as @code{small}@footnote{ +Note that the Unicode spec writes these tag names inside +@samp{<..>} brackets. The tag names in Emacs do not include the +brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses +@samp{small}. +}; the other elements are characters that give the compatibility +decomposition sequence of this character. + +@item decimal-digit-value +Corresponds to the Unicode @code{Numeric_Value} property for +characters whose @code{Numeric_Type} is @samp{Digit}. The value is an +integer number. + +@item digit +Corresponds to the Unicode @code{Numeric_Value} property for +characters whose @code{Numeric_Type} is @samp{Decimal}. The value is +an integer number. Examples of such characters include compatibility +subscript and superscript digits, for which the value is the +corresponding number. 
+ +@item numeric-value +Corresponds to the Unicode @code{Numeric_Value} property for +characters whose @code{Numeric_Type} is @samp{Numeric}. The value of +this property is an integer or a floating-point number. Examples of +characters that have this property include fractions, subscripts, +superscripts, Roman numerals, currency numerators, and encircled +numbers. For example, the value of this property for the character +@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}. + +@item mirrored +Corresponds to the Unicode @code{Bidi_Mirrored} property. The value +of this property is a symbol, either @code{Y} or @code{N}. + +@item old-name +Corresponds to the Unicode @code{Unicode_1_Name} property. The value +is a string. + +@item iso-10646-comment +Corresponds to the Unicode @code{ISO_Comment} property. The value is +a string. + +@item uppercase +Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property. +The value of this property is a single character. + +@item lowercase +Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property. +The value of this property is a single character. + +@item titlecase +Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property. +@dfn{Title case} is a special form of a character used when the first +character of a word needs to be capitalized. The value of this +property is a single character. +@end table -@defun charsetp object -Returns @code{t} if @var{object} is a symbol that names a character set, -@code{nil} otherwise. +@defun get-char-code-property char propname +This function returns the value of @var{char}'s @var{propname} property. + +@example +@group +(get-char-code-property ? 
'general-category) + @result{} Zs +@end group +@group +(get-char-code-property ?1 'general-category) + @result{} Nd +@end group +@group +(get-char-code-property ?\u2084 'digit-value) ; subscript 4 + @result{} 4 +@end group +@group +(get-char-code-property ?\u2155 'numeric-value) ; one fifth + @result{} 1/5 +@end group +@group +(get-char-code-property ?\u2163 'numeric-value) ; Roman IV + @result{} \4 +@end group +@end example @end defun -@defvar charset-list -The value is a list of all defined character set names. -@end defvar +@defun char-code-property-description prop value +This function returns the description string of property @var{prop}'s +@var{value}, or @code{nil} if @var{value} has no description. -@defun charset-list -This function returns the value of @code{charset-list}. It is only -provided for backward compatibility. +@example +@group +(char-code-property-description 'general-category 'Zs) + @result{} "Separator, Space" +@end group +@group +(char-code-property-description 'general-category 'Nd) + @result{} "Number, Decimal Digit" +@end group +@group +(char-code-property-description 'numeric-value '1/5) + @result{} nil +@end group +@end example @end defun -@defun char-charset character -This function returns the name of the character set that @var{character} -belongs to, or the symbol @code{unknown} if @var{character} is not a -valid character. +@defun put-char-code-property char propname value +This function stores @var{value} as the value of the property +@var{propname} for the character @var{char}. @end defun -@defun charset-plist charset -This function returns the charset property list of the character set -@var{charset}. Although @var{charset} is a symbol, this is not the same -as the property list of that symbol. Charset properties are used for -special purposes within Emacs. 
-@end defun +@defvar unicode-category-table +The value of this variable is a char-table (@pxref{Char-Tables}) that +specifies, for each character, its Unicode @code{General_Category} +property as a symbol. +@end defvar -@deffn Command list-charset-chars charset -This command displays a list of characters in the character set -@var{charset}. -@end deffn +@defvar char-script-table +The value of this variable is a char-table that specifies, for each +character, a symbol whose name is the script to which the character +belongs, according to the Unicode Standard classification of the +Unicode code space into script-specific blocks. This char-table has a +single extra slot whose value is the list of all script symbols. +@end defvar -@node Chars and Bytes -@section Characters and Bytes -@cindex bytes and characters +@defvar char-width-table +The value of this variable is a char-table that specifies the width of +each character in columns that it will occupy on the screen. +@end defvar -@cindex introduction sequence (of character) -@cindex dimension (of character set) - In multibyte representation, each character occupies one or more -bytes. Each character set has an @dfn{introduction sequence}, which is -normally one or two bytes long. (Exception: the @code{ascii} character -set and the @code{eight-bit-graphic} character set have a zero-length -introduction sequence.) The introduction sequence is the beginning of -the byte sequence for any character in the character set. The rest of -the character's bytes distinguish it from the other characters in the -same character set. Depending on the character set, there are either -one or two distinguishing bytes; the number of such bytes is called the -@dfn{dimension} of the character set. +@defvar printable-chars +The value of this variable is a char-table that specifies, for each +character, whether it is printable or not. 
That is, if evaluating +@code{(aref printable-chars char)} results in @code{t}, the character +is printable, and if it results in @code{nil}, it is not. +@end defvar -@defun charset-dimension charset -This function returns the dimension of @var{charset}; at present, the -dimension is always 1 or 2. -@end defun +@node Character Sets +@section Character Sets +@cindex character sets + +@cindex charset +@cindex coded character set +An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters +in which each character is assigned a numeric code point. (The +Unicode Standard calls this a @dfn{coded character set}.) Each Emacs +charset has a name which is a symbol. A single character can belong +to any number of different character sets, but it will generally have +a different code point in each charset. Examples of character sets +include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and +@code{windows-1255}. The code point assigned to a character in a +charset is usually different from its code point used in Emacs buffers +and strings. + +@cindex @code{emacs}, a charset +@cindex @code{unicode}, a charset +@cindex @code{eight-bit}, a charset + Emacs defines several special character sets. The character set +@code{unicode} includes all the characters whose Emacs code points are +in the range @code{0..#x10FFFF}. The character set @code{emacs} +includes all @acronym{ASCII} and non-@acronym{ASCII} characters. +Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; +Emacs uses it to represent raw bytes encountered in text. -@defun charset-bytes charset -This function returns the number of bytes used to represent a character -in character set @var{charset}. +@defun charsetp object +Returns @code{t} if @var{object} is a symbol that names a character set, +@code{nil} otherwise. 
@end defun - This is the simplest way to determine the byte length of a character -set's introduction sequence: +@defvar charset-list +The value is a list of all defined character set names. +@end defvar -@example -(- (charset-bytes @var{charset}) - (charset-dimension @var{charset})) -@end example +@defun charset-priority-list &optional highestp +This function returns a list of all defined character sets ordered by +their priority. If @var{highestp} is non-@code{nil}, the function +returns a single character set of the highest priority. +@end defun -@node Splitting Characters -@section Splitting Characters -@cindex character as bytes +@defun set-charset-priority &rest charsets +This function makes @var{charsets} the highest priority character sets. +@end defun - The functions in this section convert between characters and the byte -values used to represent them. For most purposes, there is no need to -be concerned with the sequence of bytes used to represent a character, -because Emacs translates automatically when necessary. +@defun char-charset character &optional restriction +This function returns the name of the character set of highest +priority that @var{character} belongs to. @acronym{ASCII} characters +are an exception: for them, this function always returns @code{ascii}. -@defun split-char character -Return a list containing the name of the character set of -@var{character}, followed by one or two byte values (integers) which -identify @var{character} within that character set. The number of byte -values is the character set's dimension. +If @var{restriction} is non-@code{nil}, it should be a list of +charsets to search. Alternatively, it can be a coding system, in +which case the returned charset must be supported by that coding +system (@pxref{Coding Systems}). +@end defun -If @var{character} is invalid as a character code, @code{split-char} -returns a list consisting of the symbol @code{unknown} and @var{character}. 
+@defun charset-plist charset +This function returns the property list of the character set +@var{charset}. Although @var{charset} is a symbol, this is not the +same as the property list of that symbol. Charset properties include +important information about the charset, such as its documentation +string, short name, etc. +@end defun -@example -(split-char 2248) - @result{} (latin-iso8859-1 72) -(split-char 65) - @result{} (ascii 65) -(split-char 128) - @result{} (eight-bit-control 128) -@end example +@defun put-charset-property charset propname value +This function sets the @var{propname} property of @var{charset} to the +given @var{value}. @end defun -@cindex generate characters in charsets -@defun make-char charset &optional code1 code2 -This function returns the character in character set @var{charset} whose -position codes are @var{code1} and @var{code2}. This is roughly the -inverse of @code{split-char}. Normally, you should specify either one -or both of @var{code1} and @var{code2} according to the dimension of -@var{charset}. For example, +@defun get-charset-property charset propname +This function returns the value of @var{charset}'s property +@var{propname}. +@end defun -@example -(make-char 'latin-iso8859-1 72) - @result{} 2248 -@end example +@deffn Command list-charset-chars charset +This command displays a list of characters in the character set +@var{charset}. +@end deffn -Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed -before they are used to index @var{charset}. Thus you may use, for -instance, an ISO 8859 character code rather than subtracting 128, as -is necessary to index the corresponding Emacs charset. + Emacs can convert between its internal representation of a character +and the character's codepoint in a specific charset. The following +two functions support these conversions. + +@c FIXME: decode-char and encode-char accept and ignore an additional +@c argument @var{restriction}. 
When that argument actually makes a +@c difference, it should be documented here. +@defun decode-char charset code-point +This function decodes a character that is assigned a @var{code-point} +in @var{charset}, to the corresponding Emacs character, and returns +it. If @var{charset} doesn't contain a character of that code point, +the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp +integer (@pxref{Integer Basics, most-positive-fixnum}), it can be +specified as a cons cell @code{(@var{high} . @var{low})}, where +@var{low} are the lower 16 bits of the value and @var{high} are the +high 16 bits. @end defun -@cindex generic characters - If you call @code{make-char} with no @var{byte-values}, the result is -a @dfn{generic character} which stands for @var{charset}. A generic -character is an integer, but it is @emph{not} valid for insertion in the -buffer as a character. It can be used in @code{char-table-range} to -refer to the whole character set (@pxref{Char-Tables}). -@code{char-valid-p} returns @code{nil} for generic characters. -For example: - -@example -(make-char 'latin-iso8859-1) - @result{} 2176 -(char-valid-p 2176) - @result{} nil -(char-valid-p 2176 t) - @result{} t -(split-char 2176) - @result{} (latin-iso8859-1 0) -@end example +@defun encode-char char charset +This function returns the code point assigned to the character +@var{char} in @var{charset}. If the result does not fit in a Lisp +integer, it is returned as a cons cell @code{(@var{high} . @var{low})} +that fits the second argument of @code{decode-char} above. If +@var{charset} doesn't have a codepoint for @var{char}, the value is +@code{nil}. +@end defun -The character sets @code{ascii}, @code{eight-bit-control}, and -@code{eight-bit-graphic} don't have corresponding generic characters. If -@var{charset} is one of them and you don't supply @var{code1}, -@code{make-char} returns the character code corresponding to the -smallest code in @var{charset}. 
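As a rough illustration of @code{decode-char} and @code{encode-char},
the sketch below round-trips one character through the
@code{iso-8859-1} charset mentioned earlier.  The characters chosen are
ours, not part of the patch, and the results assume a stock Emacs in
which the @code{iso-8859-1} code points coincide with the first 256
Unicode codepoints.

@example
@group
;; Code point #xE9 in iso-8859-1 is e with acute accent (U+00E9).
(decode-char 'iso-8859-1 #xE9)
     @result{} 233
@end group
@group
;; The reverse direction yields the same number for this charset.
(encode-char ?\u00E9 'iso-8859-1)
     @result{} 233
@end group
@group
;; Greek small alpha (U+03B1) has no code point in iso-8859-1.
(encode-char ?\u03B1 'iso-8859-1)
     @result{} nil
@end group
@end example

Checking for a @code{nil} result from @code{encode-char} is one simple
way to test whether a charset contains a given character.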
+ The following function comes in handy for applying a certain +function to all or part of the characters in a charset: + +@defun map-charset-chars function charset &optional arg from-code to-code +Call @var{function} for characters in @var{charset}. @var{function} +is called with two arguments. The first one is a cons cell +@code{(@var{from} . @var{to})}, where @var{from} and @var{to} +indicate a range of characters contained in charset. The second +argument passed to @var{function} is @var{arg}. + +By default, the range of codepoints passed to @var{function} includes +all the characters in @var{charset}, but optional arguments +@var{from-code} and @var{to-code} limit that to the range of +characters between these two codepoints of @var{charset}. If either +of them is @code{nil}, it defaults to the first or last codepoint of +@var{charset}, respectively. +@end defun @node Scanning Charsets @section Scanning for Character Sets - Sometimes it is useful to find out which character sets appear in a -part of a buffer or a string. One use for this is in determining which -coding systems (@pxref{Coding Systems}) are capable of representing all -of the text in question. + Sometimes it is useful to find out which character set a particular +character belongs to. One use for this is in determining which coding +systems (@pxref{Coding Systems}) are capable of representing all of +the text in question; another is to determine the font(s) for +displaying that text. @defun charset-after &optional pos -This function return the charset of a character in the current buffer -at position @var{pos}. If @var{pos} is omitted or @code{nil}, it -defaults to the current value of point. If @var{pos} is out of range, -the value is @code{nil}. +This function returns the charset of highest priority containing the +character at position @var{pos} in the current buffer. If @var{pos} +is omitted or @code{nil}, it defaults to the current value of point. 
+If @var{pos} is out of range, the value is @code{nil}. @end defun @defun find-charset-region beg end &optional translation -This function returns a list of the character sets that appear in the -current buffer between positions @var{beg} and @var{end}. +This function returns a list of the character sets of highest priority +that contain characters in the current buffer between positions +@var{beg} and @var{end}. -The optional argument @var{translation} specifies a translation table to -be used in scanning the text (@pxref{Translation of Characters}). If it -is non-@code{nil}, then each character in the region is translated +The optional argument @var{translation} specifies a translation table +to use for scanning the text (@pxref{Translation of Characters}). If +it is non-@code{nil}, then each character in the region is translated through this table, and the value returned describes the translated characters instead of the characters actually in the buffer. @end defun @defun find-charset-string string &optional translation -This function returns a list of the character sets that appear in the -string @var{string}. It is just like @code{find-charset-region}, except -that it applies to the contents of @var{string} instead of part of the -current buffer. +This function returns a list of character sets of highest priority +that contain characters in @var{string}. It is just like +@code{find-charset-region}, except that it applies to the contents of +@var{string} instead of part of the current buffer. @end defun @node Translation of Characters @@ -517,19 +696,18 @@ current buffer. @cindex character translation tables @cindex translation tables - A @dfn{translation table} is a char-table that specifies a mapping -of characters into characters. These tables are used in encoding and -decoding, and for other purposes. Some coding systems specify their -own particular translation tables; there are also default translation -tables which apply to all other coding systems. 
+ A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that +specifies a mapping of characters into characters. These tables are +used in encoding and decoding, and for other purposes. Some coding +systems specify their own particular translation tables; there are +also default translation tables which apply to all other coding +systems. - For instance, the coding-system @code{utf-8} has a translation table -that maps characters of various charsets (e.g., -@code{latin-iso8859-@var{x}}) into Unicode character sets. This way, -it can encode Latin-2 characters into UTF-8. Meanwhile, -@code{unify-8859-on-decoding-mode} operates by specifying -@code{standard-translation-table-for-decode} to translate -Latin-@var{x} characters into corresponding Unicode characters. + A translation table has two extra slots. The first is either +@code{nil} or a translation table that performs the reverse +translation; the second is the maximum number of characters to look up +for translating sequences of characters (see the description of +@code{make-translation-table-from-alist} below). @defun make-translation-table &rest translations This function returns a translation table based on the argument @@ -541,45 +719,38 @@ The arguments and the forms in each argument are processed in order, and if a previous form already translates @var{to} to some other character, say @var{to-alt}, @var{from} is also translated to @var{to-alt}. +@end defun -You can also map one whole character set into another character set with -the same dimension. To do this, you specify a generic character (which -designates a character set) for @var{from} (@pxref{Splitting Characters}). -In this case, if @var{to} is also a generic character, its character -set should have the same dimension as @var{from}'s. Then the -translation table translates each character of @var{from}'s character -set into the corresponding character of @var{to}'s character set. 
If -@var{from} is a generic character and @var{to} is an ordinary -character, then the translation table translates every character of -@var{from}'s character set into @var{to}. -@end defun - - In decoding, the translation table's translations are applied to the -characters that result from ordinary decoding. If a coding system has -property @code{translation-table-for-decode}, that specifies the -translation table to use. (This is a property of the coding system, -as returned by @code{coding-system-get}, not a property of the symbol -that is the coding system's name. @xref{Coding System Basics,, Basic -Concepts of Coding Systems}.) Otherwise, if -@code{standard-translation-table-for-decode} is non-@code{nil}, -decoding uses that table. - - In encoding, the translation table's translations are applied to the -characters in the buffer, and the result of translation is actually -encoded. If a coding system has property -@code{translation-table-for-encode}, that specifies the translation -table to use. Otherwise the variable -@code{standard-translation-table-for-encode} specifies the translation -table. + During decoding, the translation table's translations are applied to +the characters that result from ordinary decoding. If a coding system +has the property @code{:decode-translation-table}, that specifies the +translation table to use, or a list of translation tables to apply in +sequence. (This is a property of the coding system, as returned by +@code{coding-system-get}, not a property of the symbol that is the +coding system's name. @xref{Coding System Basics,, Basic Concepts of +Coding Systems}.) Finally, if +@code{standard-translation-table-for-decode} is non-@code{nil}, the +resulting characters are translated by that table. + + During encoding, the translation table's translations are applied to +the characters in the buffer, and the result of translation is +actually encoded. 
If a coding system has property +@code{:encode-translation-table}, that specifies the translation table +to use, or a list of translation tables to apply in sequence. In +addition, if the variable @code{standard-translation-table-for-encode} +is non-@code{nil}, it specifies the translation table to use for +translating the result. @defvar standard-translation-table-for-decode -This is the default translation table for decoding, for -coding systems that don't specify any other translation table. +This is the default translation table for decoding. If a coding +system specifies its own translation tables, the table that is the +value of this variable, if non-@code{nil}, is applied after them. @end defvar @defvar standard-translation-table-for-encode -This is the default translation table for encoding, for -coding systems that don't specify any other translation table. +This is the default translation table for encoding. If a coding +system specifies its own translation tables, the table that is the +value of this variable, if non-@code{nil}, is applied after them. @end defvar @defvar translation-table-for-input @@ -588,12 +759,38 @@ table before they are inserted. Search commands also translate their input through this table, so they can compare more reliably with what's in the buffer. -@code{set-buffer-file-coding-system} sets this variable so that your -keyboard input gets translated into the character sets that the buffer -is likely to contain. This variable automatically becomes -buffer-local when set. +This variable automatically becomes buffer-local when set. @end defvar +@defun make-translation-table-from-vector vec +This function returns a translation table made from @var{vec} that is +an array of 256 elements to map bytes (values 0 through #xFF) to +characters. Elements may be @code{nil} for untranslated bytes. The +returned table has a translation table for reverse mapping in the +first extra slot, and the value @code{1} in the second extra slot. 
+ +This function provides an easy way to make a private coding system +that maps each byte to a specific character. You can specify the +returned table and the reverse translation table using the properties +@code{:decode-translation-table} and @code{:encode-translation-table} +respectively in the @var{props} argument to +@code{define-coding-system}. +@end defun + +@defun make-translation-table-from-alist alist +This function is similar to @code{make-translation-table} but returns +a complex translation table rather than a simple one-to-one mapping. +Each element of @var{alist} is of the form @code{(@var{from} +. @var{to})}, where @var{from} and @var{to} are either characters or +vectors specifying a sequence of characters. If @var{from} is a +character, that character is translated to @var{to} (i.e.@: to a +character or a character sequence). If @var{from} is a vector of +characters, that sequence is translated to @var{to}. The returned +table has a translation table for reverse mapping in the first extra +slot, and the maximum length of all the @var{from} character sequences +in the second extra slot. +@end defun + @node Coding Systems @section Coding Systems @@ -624,48 +821,49 @@ documented here. @subsection Basic Concepts of Coding Systems @cindex character code conversion - @dfn{Character code conversion} involves conversion between the encoding -used inside Emacs and some other encoding. Emacs supports many -different encodings, in that it can convert to and from them. For -example, it can convert text to or from encodings such as Latin 1, Latin -2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some -cases, Emacs supports several alternative encodings for the same -characters; for example, there are three coding systems for the Cyrillic -(Russian) alphabet: ISO, Alternativnyj, and KOI8. 
- - Most coding systems specify a particular character code for -conversion, but some of them leave the choice unspecified---to be chosen -heuristically for each file, based on the data. + @dfn{Character code conversion} involves conversion between the +internal representation of characters used inside Emacs and some other +encoding. Emacs supports many different encodings, in that it can +convert to and from them. For example, it can convert text to or from +encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and +several variants of ISO 2022. In some cases, Emacs supports several +alternative encodings for the same characters; for example, there are +three coding systems for the Cyrillic (Russian) alphabet: ISO, +Alternativnyj, and KOI8. + + Every coding system specifies a particular set of character code +conversions, but the coding system @code{undecided} is special: it +leaves the choice unspecified, to be chosen heuristically for each +file, based on the file's data. In general, a coding system doesn't guarantee roundtrip identity: decoding a byte sequence using coding system, then encoding the resulting text in the same coding system, can produce a different byte -sequence. However, the following coding systems do guarantee that the -byte sequence will be the same as what you originally decoded: +sequence. But some coding systems do guarantee that the byte sequence +will be the same as what you originally decoded. Here are a few +examples: @quotation -chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule -greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3 -iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe -japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text +iso-8859-1, utf-8, big5, shift_jis, euc-jp @end quotation Encoding buffer text and then decoding the result can also fail to -reproduce the original text. 
For instance, if you encode Latin-2 -characters with @code{utf-8} and decode the result using the same -coding system, you'll get Unicode characters (of charset -@code{mule-unicode-0100-24ff}). If you encode Unicode characters with -@code{iso-latin-2} and decode the result with the same coding system, -you'll get Latin-2 characters. +reproduce the original text. For instance, if you encode a character +with a coding system which does not support that character, the result +is unpredictable, and thus decoding it using the same coding system +may produce a different text. Currently, Emacs can't report errors +that result from encoding unsupported characters. @cindex EOL conversion @cindex end-of-line conversion @cindex line end conversion - @dfn{End of line conversion} handles three different conventions used -on various systems for representing end of line in files. The Unix -convention is to use the linefeed character (also called newline). The -DOS convention is to use a carriage-return and a linefeed at the end of -a line. The Mac convention is to use just carriage-return. + @dfn{End of line conversion} handles three different conventions +used on various systems for representing end of line in files. The +Unix convention, used on GNU and Unix systems, is to use the linefeed +character (also called newline). The DOS convention, used on +MS-Windows and MS-DOS systems, is to use a carriage-return and a +linefeed at the end of a line. The Mac convention is to use just +carriage-return. @cindex base coding system @cindex variant coding system @@ -676,46 +874,64 @@ coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and well. Most base coding systems have three corresponding variants whose names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}. 
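The relation between a base coding system and its end-of-line variants
can be checked from Lisp.  The sketch below uses @code{utf-8} purely as
a convenient example; that choice is an assumption of this
illustration, and any base coding system with the usual three variants
should behave the same way.

@example
@group
;; Derive the DOS (CRLF) variant from the base coding system.
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
@end group
@group
;; Recover the base coding system from a variant.
(coding-system-base 'utf-8-dos)
     @result{} utf-8
@end group
@end example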
+@vindex raw-text@r{ coding system} The coding system @code{raw-text} is special in that it prevents -character code conversion, and causes the buffer visited with that -coding system to be a unibyte buffer. It does not specify the -end-of-line conversion, allowing that to be determined as usual by the -data, and has the usual three variants which specify the end-of-line -conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}: -it specifies no conversion of either character codes or end-of-line. - - The coding system @code{emacs-mule} specifies that the data is -represented in the internal Emacs encoding. This is like -@code{raw-text} in that no code conversion happens, but different in -that the result is multibyte data. +character code conversion, and causes the buffer visited with this +coding system to be a unibyte buffer. For historical reasons, you can +save both unibyte and multibyte text with this coding system. When +you use @code{raw-text} to encode multibyte text, it does perform one +character code conversion: it converts eight-bit characters to their +single-byte external representation. @code{raw-text} does not specify +the end-of-line conversion, allowing that to be determined as usual by +the data, and has the usual three variants which specify the +end-of-line conversion. + +@vindex no-conversion@r{ coding system} +@vindex binary@r{ coding system} + @code{no-conversion} (and its alias @code{binary}) is equivalent to +@code{raw-text-unix}: it specifies no conversion of either character +codes or end-of-line. + +@vindex emacs-internal@r{ coding system} +@vindex utf-8-emacs@r{ coding system} + The coding system @code{utf-8-emacs} specifies that the data is +represented in the internal Emacs encoding (@pxref{Text +Representations}). This is like @code{raw-text} in that no code +conversion happens, but different in that the result is multibyte +data. The name @code{emacs-internal} is an alias for +@code{utf-8-emacs}. 
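The difference between @code{raw-text} and @code{no-conversion}
described above shows up in their end-of-line types.  The following is
a small sketch, not taken from the patch; the sample string is
arbitrary.

@example
@group
;; raw-text leaves the eol convention open, so it has three variants.
(coding-system-eol-type 'raw-text)
     @result{} [raw-text-unix raw-text-dos raw-text-mac]
@end group
@group
;; no-conversion fixes the Unix convention (eol type 0).
(coding-system-eol-type 'no-conversion)
     @result{} 0
@end group
@group
;; A variant changes only the line endings of ASCII text.
(encode-coding-string "a\nb" 'raw-text-dos)
     @result{} "a\r\nb"
@end group
@end example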
@defun coding-system-get coding-system property This function returns the specified property of the coding system @var{coding-system}. Most coding system properties exist for internal -purposes, but one that you might find useful is @code{mime-charset}. +purposes, but one that you might find useful is @code{:mime-charset}. That property's value is the name used in MIME for the character coding which this coding system can read and write. Examples: @example -(coding-system-get 'iso-latin-1 'mime-charset) +(coding-system-get 'iso-latin-1 :mime-charset) @result{} iso-8859-1 -(coding-system-get 'iso-2022-cn 'mime-charset) +(coding-system-get 'iso-2022-cn :mime-charset) @result{} iso-2022-cn -(coding-system-get 'cyrillic-koi8 'mime-charset) +(coding-system-get 'cyrillic-koi8 :mime-charset) @result{} koi8-r @end example -The value of the @code{mime-charset} property is also defined +The value of the @code{:mime-charset} property is also defined as an alias for the coding system. @end defun +@defun coding-system-aliases coding-system +This function returns the list of aliases of @var{coding-system}. +@end defun + @node Encoding and I/O @subsection Encoding and I/O The principal purpose of coding systems is for use in reading and -writing files. The function @code{insert-file-contents} uses -a coding system for decoding the file data, and @code{write-region} -uses one to encode the buffer contents. +writing files. The function @code{insert-file-contents} uses a coding +system to decode the file data, and @code{write-region} uses one to +encode the buffer contents. You can specify the coding system to use either explicitly (@pxref{Specifying Coding Systems}), or implicitly using a default @@ -727,15 +943,15 @@ operation finishes the job of choosing a coding system. Very often you will want to find out afterwards which coding system was chosen. @defvar buffer-file-coding-system -This buffer-local variable records the coding system that was used to visit -the current buffer. 
It is used for saving the buffer, and for writing part -of the buffer with @code{write-region}. If the text to be written -cannot be safely encoded using the coding system specified by this -variable, these operations select an alternative encoding by calling -the function @code{select-safe-coding-system} (@pxref{User-Chosen -Coding Systems}). If selecting a different encoding requires to ask -the user to specify a coding system, @code{buffer-file-coding-system} -is updated to the newly selected coding system. +This buffer-local variable records the coding system used for saving the +buffer and for writing part of the buffer with @code{write-region}. If +the text to be written cannot be safely encoded using the coding system +specified by this variable, these operations select an alternative +encoding by calling the function @code{select-safe-coding-system} +(@pxref{User-Chosen Coding Systems}). If selecting a different encoding +requires asking the user to specify a coding system, +@code{buffer-file-coding-system} is updated to the newly selected coding +system. @code{buffer-file-coding-system} does @emph{not} affect sending text to a subprocess. @@ -795,6 +1011,7 @@ new file name for that buffer. Here are the Lisp facilities for working with coding systems: +@cindex list all coding systems @defun coding-system-list &optional base-only This function returns a list of all coding system names (symbols). If @var{base-only} is non-@code{nil}, the value includes only the @@ -807,12 +1024,17 @@ This function returns @code{t} if @var{object} is a coding system name or @code{nil}. @end defun +@cindex validity of coding system +@cindex coding system, validity check @defun check-coding-system coding-system -This function checks the validity of @var{coding-system}. -If that is valid, it returns @var{coding-system}. -Otherwise it signals an error with condition @code{coding-system-error}. +This function checks the validity of @var{coding-system}. 
If that is +valid, it returns @var{coding-system}. If @var{coding-system} is +@code{nil}, the function returns @code{nil}. For any other value, it +signals an error whose @code{error-symbol} is @code{coding-system-error} +(@pxref{Signaling Errors, signal}). @end defun +@cindex eol type of coding system @defun coding-system-eol-type coding-system This function returns the type of end-of-line (a.k.a.@: @dfn{eol}) conversion used by @var{coding-system}. If @var{coding-system} @@ -834,11 +1056,12 @@ decoding, the end-of-line format of the text is auto-detected, and the eol conversion is set to match it (e.g., DOS-style CRLF format will imply @code{dos} eol conversion). For encoding, the eol conversion is taken from the appropriate default coding system (e.g., -@code{default-buffer-file-coding-system} for +default value of @code{buffer-file-coding-system} for @code{buffer-file-coding-system}), or from the default eol conversion appropriate for the underlying platform. @end defun +@cindex eol conversion of coding system @defun coding-system-change-eol-conversion coding-system eol-type This function returns a coding system which is like @var{coding-system} except for its eol conversion, which is specified by @code{eol-type}. @@ -850,6 +1073,7 @@ the end-of-line conversion from the data. @code{dos} and @code{mac}, respectively. @end defun +@cindex text conversion of coding system @defun coding-system-change-text-conversion eol-coding text-coding This function returns a coding system which uses the end-of-line conversion of @var{eol-coding}, and the text conversion of @@ -857,6 +1081,8 @@ conversion of @var{eol-coding}, and the text conversion of @code{undecided}, or one of its variants according to @var{eol-coding}. @end defun +@cindex safely encode region +@cindex coding systems for encoding region @defun find-coding-systems-region from to This function returns a list of coding systems that could be used to encode a text between @var{from} and @var{to}. 
All coding systems in @@ -867,6 +1093,8 @@ If the text contains no multibyte characters, the function returns the list @code{(undecided)}. @end defun +@cindex safely encode a string +@cindex coding systems for encoding a string @defun find-coding-systems-string string This function returns a list of coding systems that could be used to encode the text of @var{string}. All coding systems in the list can @@ -875,15 +1103,34 @@ contains no multibyte characters, this returns the list @code{(undecided)}. @end defun +@cindex charset, coding systems to encode +@cindex safely encode characters in a charset @defun find-coding-systems-for-charsets charsets This function returns a list of coding systems that could be used to encode all the character sets in the list @var{charsets}. @end defun +@defun check-coding-systems-region start end coding-system-list +This function checks whether coding systems in the list +@code{coding-system-list} can encode all the characters in the region +between @var{start} and @var{end}. If all of the coding systems in +the list can encode the specified text, the function returns +@code{nil}. If some coding systems cannot encode some of the +characters, the value is an alist, each element of which has the form +@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning +that @var{coding-system1} cannot encode characters at buffer positions +@var{pos1}, @var{pos2}, @enddots{}. + +@var{start} may be a string, in which case @var{end} is ignored and +the returned value references string indices instead of buffer +positions. +@end defun + @defun detect-coding-region start end &optional highest This function chooses a plausible coding system for decoding the text -from @var{start} to @var{end}. This text should be a byte sequence -(@pxref{Explicit Encoding}). +from @var{start} to @var{end}. 
This text should be a byte sequence, +i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and +eight-bit characters (@pxref{Explicit Encoding}). Normally this function returns a list of coding systems that could handle decoding the text that was scanned. They are listed in order of @@ -895,11 +1142,52 @@ If the region contains only @acronym{ASCII} characters except for such ISO-2022 control characters ISO-2022 as @code{ESC}, the value is @code{undecided} or @code{(undecided)}, or a variant specifying end-of-line conversion, if that can be deduced from the text. + +If the region contains null bytes, the value is @code{no-conversion}, +even if the region contains text encoded in some coding system. @end defun @defun detect-coding-string string &optional highest This function is like @code{detect-coding-region} except that it operates on the contents of @var{string} instead of bytes in the buffer. +@end defun + +@cindex null bytes, and decoding text +@defvar inhibit-null-byte-detection +If this variable has a non-@code{nil} value, null bytes are ignored +when detecting the encoding of a region or a string. This allows to +correctly detect the encoding of text that contains null bytes, such +as Info files with Index nodes. +@end defvar + +@defvar inhibit-iso-escape-detection +If this variable has a non-@code{nil} value, ISO-2022 escape sequences +are ignored when detecting the encoding of a region or a string. The +result is that no text is ever detected as encoded in some ISO-2022 +encoding, and all escape sequences become visible in a buffer. +@strong{Warning:} @emph{Use this variable with extreme caution, +because many files in the Emacs distribution use ISO-2022 encoding.} +@end defvar + +@cindex charsets supported by a coding system +@defun coding-system-charset-list coding-system +This function returns the list of character sets (@pxref{Character +Sets}) supported by @var{coding-system}. 
Some coding systems that +support too many character sets to list them all yield special values: +@itemize @bullet +@item +If @var{coding-system} supports all the ISO-2022 charsets, the value +is @code{iso-2022}. +@item +If @var{coding-system} supports all Emacs characters, the value is +@code{(emacs)}. +@item +If @var{coding-system} supports all emacs-mule characters, the value +is @code{emacs-mule}. +@item +If @var{coding-system} supports all Unicode characters, the value is +@code{(unicode)}. +@end itemize @end defun @xref{Coding systems for a subprocess,, Process Information}, in @@ -918,14 +1206,18 @@ is the text in the current buffer between @var{from} and @var{to}. If @var{from} is a string, the string specifies the text to encode, and @var{to} is ignored. +If the specified text includes raw bytes (@pxref{Text +Representations}), @code{select-safe-coding-system} suggests +@code{raw-text} for its encoding. + If @var{default-coding-system} is non-@code{nil}, that is the first coding system to try; if that can handle the text, @code{select-safe-coding-system} returns that coding system. It can also be a list of coding systems; then the function tries each of them one by one. After trying all of them, it next tries the current buffer's value of @code{buffer-file-coding-system} (if it is not -@code{undecided}), then the value of -@code{default-buffer-file-coding-system} and finally the user's most +@code{undecided}), then the default value of +@code{buffer-file-coding-system} and finally the user's most preferred coding system, which the user can set using the command @code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing Coding Systems, emacs, The GNU Emacs Manual}). @@ -952,8 +1244,9 @@ possible candidates. @vindex select-safe-coding-system-accept-default-p If the variable @code{select-safe-coding-system-accept-default-p} is -non-@code{nil}, its value overrides the value of -@var{accept-default-p}. 
+non-@code{nil}, it should be a function taking a single argument. +It is used in place of @var{accept-default-p}, overriding any +value supplied for this argument. As a final step, before returning the chosen coding system, @code{select-safe-coding-system} checks whether that coding system is @@ -987,6 +1280,8 @@ the user tries to enter null input, it asks the user to try again. @node Default Coding Systems @subsection Default Coding Systems +@cindex default coding system +@cindex coding system, automatically determined This section describes variables that specify the default coding system for certain files or when running certain subprograms, and the @@ -999,7 +1294,8 @@ don't change these variables; instead, override them using @code{coding-system-for-read} and @code{coding-system-for-write} (@pxref{Specifying Coding Systems}). -@defvar auto-coding-regexp-alist +@cindex file contents, and default coding system +@defopt auto-coding-regexp-alist This variable is an alist of text patterns and corresponding coding systems. Each element has the form @code{(@var{regexp} . @var{coding-system})}; a file whose first few kilobytes match @@ -1009,9 +1305,10 @@ read into a buffer. The settings in this alist take priority over @code{file-coding-system-alist} (see below). The default value is set so that Emacs automatically recognizes mail files in Babyl format and reads them with no code conversions. -@end defvar +@end defopt -@defvar file-coding-system-alist +@cindex file name, and default coding system +@defopt file-coding-system-alist This variable is an alist that specifies the coding systems to use for reading and writing particular files. Each element has the form @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular @@ -1034,8 +1331,16 @@ meaning as described above. If @var{coding} (or what returned by the above function) is @code{undecided}, the normal code-detection is performed. 
-@end defvar +@end defopt + +@defopt auto-coding-alist +This variable is an alist that specifies the coding systems to use for +reading and writing particular files. Its form is like that of +@code{file-coding-system-alist}, but, unlike the latter, this variable +takes priority over any @code{coding:} tags in the file. +@end defopt +@cindex program name, and default coding system @defvar process-coding-system-alist This variable is an alist specifying which coding systems to use for a subprocess, depending on which program is running in the subprocess. It @@ -1059,6 +1364,8 @@ coding system which determines both the character code conversion and the end of line conversion---that is, one like @code{latin-1-unix}, rather than @code{undecided} or @code{latin-1}. +@cindex port number, and default coding system +@cindex network service name, and default coding system @defvar network-coding-system-alist This variable is an alist that specifies the coding system to use for network streams. It works much like @code{file-coding-system-alist}, @@ -1078,7 +1385,8 @@ The value should be a cons cell of the form @code{(@var{input-coding} the subprocess, and @var{output-coding} applies to output to it. @end defvar -@defvar auto-coding-functions +@cindex default coding system, functions to determine +@defopt auto-coding-functions This variable holds a list of functions that try to determine a coding system for a file based on its undecoded contents. @@ -1092,7 +1400,40 @@ Otherwise, it should return @code{nil}. If a file has a @samp{coding:} tag, that takes precedence, so these functions won't be called. -@end defvar +@end defopt + +@defun find-auto-coding filename size +This function tries to determine a suitable coding system for +@var{filename}. It examines the buffer visiting the named file, using +the variables documented above in sequence, until it finds a match for +one of the rules specified by these variables. It then returns a cons +cell of the form @code{(@var{coding} . 
@var{source})}, where +@var{coding} is the coding system to use and @var{source} is a symbol, +one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist}, +@code{:coding}, or @code{auto-coding-functions}, indicating which one +supplied the matching rule. The value @code{:coding} means the coding +system was specified by the @code{coding:} tag in the file +(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}). +The order of looking for a matching rule is @code{auto-coding-alist} +first, then @code{auto-coding-regexp-alist}, then the @code{coding:} +tag, and lastly @code{auto-coding-functions}. If no matching rule was +found, the function returns @code{nil}. + +The second argument @var{size} is the size of text, in characters, +following point. The function examines text only within @var{size} +characters after point. Normally, the buffer should be positioned at +the beginning when this function is called, because one of the places +for the @code{coding:} tag is the first one or two lines of the file; +in that case, @var{size} should be the size of the buffer. +@end defun + +@defun set-auto-coding filename size +This function returns a suitable coding system for file +@var{filename}. It uses @code{find-auto-coding} to find the coding +system. If no coding system could be determined, the function returns +@code{nil}. The meaning of the argument @var{size} is like in +@code{find-auto-coding}. +@end defun @defun find-operation-coding-system operation &rest arguments This function returns the coding system to use (by default) for @@ -1186,12 +1527,39 @@ When a single operation does both input and output, as do affect it. @end defvar -@defvar inhibit-eol-conversion +@defopt inhibit-eol-conversion When this variable is non-@code{nil}, no end-of-line conversion is done, no matter which coding system is specified. This applies to all the Emacs I/O and subprocess primitives, and to the explicit encoding and decoding functions (@pxref{Explicit Encoding}). 
-@end defvar +@end defopt + +@cindex priority order of coding systems +@cindex coding systems, priority + Sometimes, you need to prefer several coding systems for some +operation, rather than fix a single one. Emacs lets you specify a +priority order for using coding systems. This ordering affects the +sorting of lists of coding systems returned by functions such as +@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}). + +@defun coding-system-priority-list &optional highestp +This function returns the list of coding systems in the order of their +current priorities. Optional argument @var{highestp}, if +non-@code{nil}, means return only the highest priority coding system. +@end defun + +@defun set-coding-system-priority &rest coding-systems +This function puts @var{coding-systems} at the beginning of the +priority list for coding systems, thus making their priority higher +than all the rest. +@end defun + +@defmac with-coding-priority coding-systems &rest body@dots{} +This macro executes @var{body}, like @code{progn} does +(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of +the priority list for coding systems. @var{coding-systems} should be +a list of coding systems to prefer during execution of @var{body}. +@end defmac @node Explicit Encoding @subsection Explicit Encoding and Decoding @@ -1205,10 +1573,12 @@ in this section. The result of encoding, and the input to decoding, are not ordinary text. They logically consist of a series of byte values; that is, a -series of characters whose codes are in the range 0 through 255. In a -multibyte buffer or string, character codes 128 through 159 are -represented by multibyte sequences, but this is invisible to Lisp -programs. +series of @acronym{ASCII} and eight-bit characters. In unibyte +buffers and strings, these characters have codes in the range 0 +through #xFF (255).
In a multibyte buffer or string, eight-bit +characters have character codes higher than #xFF (@pxref{Text +Representations}), but Emacs transparently converts them to their +single-byte values when you encode or decode such text. The usual way to read a file into a buffer as a sequence of bytes, so you can decode the contents explicitly, is with @@ -1226,19 +1596,35 @@ encoding by binding @code{coding-system-for-write} to Here are the functions to perform explicit encoding or decoding. The encoding functions produce sequences of bytes; the decoding functions are meant to operate on sequences of bytes. All of these functions -discard text properties. +discard text properties. They also set @code{last-coding-system-used} +to the precise coding system they used. -@deffn Command encode-coding-region start end coding-system +@deffn Command encode-coding-region start end coding-system &optional destination This command encodes the text from @var{start} to @var{end} according -to coding system @var{coding-system}. The encoded text replaces the -original text in the buffer. The result of encoding is logically a -sequence of bytes, but the buffer remains multibyte if it was multibyte -before. - -This command returns the length of the encoded text. +to coding system @var{coding-system}. Normally, the encoded text +replaces the original text in the buffer, but the optional argument +@var{destination} can change that. If @var{destination} is a buffer, +the encoded text is inserted in that buffer after point (point does +not move); if it is @code{t}, the command returns the encoded text as +a unibyte string without inserting it. + +If encoded text is inserted in some buffer, this command returns the +length of the encoded text. + +The result of encoding is logically a sequence of bytes, but the +buffer remains multibyte if it was multibyte before, and any 8-bit +bytes are converted to their multibyte representation (@pxref{Text +Representations}). 
+ +@cindex @code{undecided} coding-system, when encoding +Do @emph{not} use @code{undecided} for @var{coding-system} when +encoding text, since that may lead to unexpected results. Instead, +use @code{select-safe-coding-system} (@pxref{User-Chosen Coding +Systems, select-safe-coding-system}) to suggest a suitable encoding, +if there's no obvious pertinent value for @var{coding-system}. @end deffn -@defun encode-coding-string string coding-system &optional nocopy +@defun encode-coding-string string coding-system &optional nocopy buffer This function encodes the text in @var{string} according to coding system @var{coding-system}. It returns a new string containing the encoded text, except when @var{nocopy} is non-@code{nil}, in which @@ -1246,24 +1632,52 @@ case the function may return @var{string} itself if the encoding operation is trivial. The result of encoding is a unibyte string. @end defun -@deffn Command decode-coding-region start end coding-system +@deffn Command decode-coding-region start end coding-system &optional destination This command decodes the text from @var{start} to @var{end} according -to coding system @var{coding-system}. The decoded text replaces the -original text in the buffer. To make explicit decoding useful, the text -before decoding ought to be a sequence of byte values, but both -multibyte and unibyte buffers are acceptable. - -This command returns the length of the decoded text. +to coding system @var{coding-system}. To make explicit decoding +useful, the text before decoding ought to be a sequence of byte +values, but both multibyte and unibyte buffers are acceptable (in the +multibyte case, the raw byte values should be represented as eight-bit +characters). Normally, the decoded text replaces the original text in +the buffer, but the optional argument @var{destination} can change +that. 
If @var{destination} is a buffer, the decoded text is inserted +in that buffer after point (point does not move); if it is @code{t}, +the command returns the decoded text as a multibyte string without +inserting it. + +If decoded text is inserted in some buffer, this command returns the +length of the decoded text. + +This command puts a @code{charset} text property on the decoded text. +The value of the property states the character set used to decode the +original text. @end deffn -@defun decode-coding-string string coding-system &optional nocopy -This function decodes the text in @var{string} according to coding -system @var{coding-system}. It returns a new string containing the -decoded text, except when @var{nocopy} is non-@code{nil}, in which -case the function may return @var{string} itself if the decoding -operation is trivial. To make explicit decoding useful, the contents -of @var{string} ought to be a sequence of byte values, but a multibyte -string is acceptable. +@defun decode-coding-string string coding-system &optional nocopy buffer +This function decodes the text in @var{string} according to +@var{coding-system}. It returns a new string containing the decoded +text, except when @var{nocopy} is non-@code{nil}, in which case the +function may return @var{string} itself if the decoding operation is +trivial. To make explicit decoding useful, the contents of +@var{string} ought to be a unibyte string with a sequence of byte +values, but a multibyte string is also acceptable (assuming it +contains 8-bit bytes in their multibyte form). + +If optional argument @var{buffer} specifies a buffer, the decoded text +is inserted in that buffer after point (point does not move). In this +case, the return value is the length of the decoded text. + +@cindex @code{charset}, text property +This function puts a @code{charset} text property on the decoded text. 
+The value of the property states the character set used to decode the +original text: + +@example +@group +(decode-coding-string "Gr\374ss Gott" 'latin-1) + @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1)) +@end group +@end example @end defun @defun decode-coding-inserted-region from to filename &optional visit beg end replace @@ -1281,31 +1695,42 @@ decoding, you can call this function. @subsection Terminal I/O Encoding Emacs can decode keyboard input using a coding system, and encode -terminal output. This is useful for terminals that transmit or display -text using a particular encoding such as Latin-1. Emacs does not set -@code{last-coding-system-used} for encoding or decoding for the -terminal. +terminal output. This is useful for terminals that transmit or +display text using a particular encoding such as Latin-1. Emacs does +not set @code{last-coding-system-used} for encoding or decoding of +terminal I/O. -@defun keyboard-coding-system +@defun keyboard-coding-system &optional terminal This function returns the coding system that is in use for decoding -keyboard input---or @code{nil} if no coding system is to be used. +keyboard input from @var{terminal}---or @code{nil} if no coding system +is to be used for that terminal. If @var{terminal} is omitted or +@code{nil}, it means the selected frame's terminal. @xref{Multiple +Terminals}. @end defun -@deffn Command set-keyboard-coding-system coding-system -This command specifies @var{coding-system} as the coding system to -use for decoding keyboard input. If @var{coding-system} is @code{nil}, -that means do not decode keyboard input. +@deffn Command set-keyboard-coding-system coding-system &optional terminal +This command specifies @var{coding-system} as the coding system to use +for decoding keyboard input from @var{terminal}. If +@var{coding-system} is @code{nil}, that means do not decode keyboard +input. 
If @var{terminal} is a frame, it means that frame's terminal; +if it is @code{nil}, that means the currently selected frame's +terminal. @xref{Multiple Terminals}. @end deffn -@defun terminal-coding-system +@defun terminal-coding-system &optional terminal This function returns the coding system that is in use for encoding -terminal output---or @code{nil} for no encoding. +terminal output from @var{terminal}---or @code{nil} if the output is +not encoded. If @var{terminal} is a frame, it means that frame's +terminal; if it is @code{nil}, that means the currently selected +frame's terminal. @end defun -@deffn Command set-terminal-coding-system coding-system +@deffn Command set-terminal-coding-system coding-system &optional terminal This command specifies @var{coding-system} as the coding system to use -for encoding terminal output. If @var{coding-system} is @code{nil}, -that means do not encode terminal output. +for encoding terminal output from @var{terminal}. If +@var{coding-system} is @code{nil}, terminal output is not encoded. If +@var{terminal} is a frame, it means that frame's terminal; if it is +@code{nil}, that means the currently selected frame's terminal. @end deffn @node MS-DOS File Types @@ -1338,6 +1763,13 @@ Otherwise, @code{undecided-dos} is used. Normally this variable is set by visiting a file; it is set to @code{nil} if the file was visited without any actual conversion. + +Its default value is used to decide how to handle files for which +@code{file-name-buffer-file-type-alist} says nothing about the type: +If the default value is non-@code{nil}, then these files are treated as +binary: the coding system @code{no-conversion} is used. Otherwise, +nothing special is done for them---the coding system is deduced solely +from the file contents, in the usual Emacs fashion. @end defvar @defopt file-name-buffer-file-type-alist @@ -1354,17 +1786,7 @@ which coding system to use when reading a file. For a text file, is used. 
If no element in this alist matches a given file name, then -@code{default-buffer-file-type} says how to treat the file. -@end defopt - -@defopt default-buffer-file-type -This variable says how to handle files for which -@code{file-name-buffer-file-type-alist} says nothing about the type. - -If this variable is non-@code{nil}, then these files are treated as -binary: the coding system @code{no-conversion} is used. Otherwise, -nothing special is done for them---the coding system is deduced solely -from the file contents, in the usual Emacs fashion. +the default value of @code{buffer-file-type} says how to treat the file. @end defopt @node Input Methods @@ -1498,7 +1920,3 @@ strings in the return value are decoded using @code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual}, for more information about locales and locale items. @end defun - -@ignore - arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb -@end ignore