X-Git-Url: https://code.delx.au/gnu-emacs/blobdiff_plain/2846c6e3607995ce250435e5998ea6a08f60dd89..8b80cdf500c514dc9c448b4fe37265cf16127ae5:/doc/lispref/nonascii.texi diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 233fe59e1b..eab748bab8 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi @@ -10,19 +10,17 @@ @cindex characters, multi-byte @cindex non-@acronym{ASCII} characters - This chapter covers the special issues relating to non-@acronym{ASCII} -characters and how they are stored in strings and buffers. + This chapter covers the special issues relating to characters and +how they are stored in strings and buffers. @menu -* Text Representations:: Unibyte and multibyte representations +* Text Representations:: How Emacs represents text. * Converting Representations:: Converting unibyte to multibyte and vice versa. * Selecting a Representation:: Treating a byte sequence as unibyte or multi. * Character Codes:: How unibyte and multibyte relate to codes of individual characters. * Character Sets:: The space of possible character codes is divided into various character sets. -* Chars and Bytes:: More information about multibyte encodings. -* Splitting Characters:: Converting a character to its byte sequence. * Scanning Charsets:: Which character sets are used in a buffer? * Translation of Characters:: Translation tables are used for conversion. * Coding Systems:: Coding systems are conversions for saving files. @@ -33,41 +31,64 @@ characters and how they are stored in strings and buffers. @node Text Representations @section Text Representations -@cindex text representations - - Emacs has two @dfn{text representations}---two ways to represent text -in a string or buffer. These are called @dfn{unibyte} and -@dfn{multibyte}. Each string, and each buffer, uses one of these two -representations. For most purposes, you can ignore the issue of -representations, because Emacs converts text between them as -appropriate. 
Occasionally in Lisp programming you will need to pay -attention to the difference. +@cindex text representation + + Emacs buffers and strings support a large repertoire of characters +from many different scripts. This is so that users can type and display +text in almost any known written language. + +@cindex character codepoint +@cindex codespace +@cindex Unicode + To support this multitude of characters and scripts, Emacs closely +follows the @dfn{Unicode Standard}. The Unicode Standard assigns a +unique number, called a @dfn{codepoint}, to each and every character. +The range of codepoints defined by Unicode, or the Unicode +@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs +extends this range with codepoints in the range @code{110000..3FFFFF}, +which it uses for representing characters that are not unified with +Unicode and raw 8-bit bytes that cannot be interpreted as characters +(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a +character codepoint in Emacs is a 22-bit integer number. + +@cindex internal representation of characters +@cindex characters, representation in buffers and strings +@cindex multibyte text + To conserve memory, Emacs does not hold fixed-length 22-bit numbers +that are codepoints of text characters within buffers and strings. +Rather, Emacs uses a variable-length internal representation of +characters that stores each character as a sequence of 1 to 5 8-bit +bytes, depending on the magnitude of its codepoint@footnote{ +This internal representation is based on one of the encodings defined +by the Unicode Standard, called @dfn{UTF-8}, for representing any +Unicode codepoint, but Emacs extends UTF-8 to represent the additional +codepoints it uses for raw 8-bit bytes and characters not unified with +Unicode.}. +For example, any @acronym{ASCII} character takes up only 1 byte, a +Latin-1 character takes up 2 bytes, etc.
We call this representation +of text @dfn{multibyte}, because it uses several bytes for each +character. + + Outside Emacs, characters can be represented in many different +encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts +between these external encodings and the internal representation, as +appropriate, when it reads text into a buffer or a string, or when it +writes text to a disk file or passes it to some other process. + + Occasionally, Emacs needs to hold and manipulate encoded text or +binary non-text data in its buffers or strings. For example, when +Emacs visits a file, it first reads the file's text verbatim into a +buffer, and only then converts it to the internal representation. +Before the conversion, the buffer holds encoded text. @cindex unibyte text - In unibyte representation, each character occupies one byte and -therefore the possible character codes range from 0 to 255. Codes 0 -through 127 are @acronym{ASCII} characters; the codes from 128 through 255 -are used for one non-@acronym{ASCII} character set (you can choose which -character set by setting the variable @code{nonascii-insert-offset}). - -@cindex leading code -@cindex multibyte text -@cindex trailing codes - In multibyte representation, a character may occupy more than one -byte, and as a result, the full range of Emacs character codes can be -stored. The first byte of a multibyte character is always in the range -128 through 159 (octal 0200 through 0237). These values are called -@dfn{leading codes}. The second and subsequent bytes of a multibyte -character are always in the range 160 through 255 (octal 0240 through -0377); these values are @dfn{trailing codes}. - - Some sequences of bytes are not valid in multibyte text: for example, -a single isolated byte in the range 128 through 159 is not allowed. But -character codes 128 through 159 can appear in multibyte text, -represented as two-byte sequences. 
All the character codes 128 through -255 are possible (though slightly abnormal) in multibyte text; they -appear in multibyte buffers and strings when you do explicit encoding -and decoding (@pxref{Explicit Encoding}). + Encoded text is not really text, as far as Emacs is concerned, but +rather a sequence of raw 8-bit bytes. We call buffers and strings +that hold encoded text @dfn{unibyte} buffers and strings, because +Emacs treats them as a sequence of individual bytes. In particular, +Emacs usually displays unibyte buffers and strings as octal codes such +as @code{\237}. We recommend that you never use unibyte buffers and +strings except for manipulating encoded text or binary non-text data. In a buffer, the buffer-local value of the variable @code{enable-multibyte-characters} specifies the representation used. @@ -77,7 +98,7 @@ when the string is constructed. @defvar enable-multibyte-characters This variable specifies the current buffer's text representation. If it is non-@code{nil}, the buffer contains multibyte text; otherwise, -it contains unibyte text. +it contains unibyte encoded text or binary non-text data. You cannot set this variable directly; instead, use the function @code{set-buffer-multibyte} to change a buffer's representation. @@ -96,20 +117,28 @@ default value to @code{nil} early in startup. @end defvar @defun position-bytes position -Return the byte-position corresponding to buffer position +Buffer positions are measured in character units. This function +returns the byte-position corresponding to buffer position @var{position} in the current buffer. This is 1 at the start of the buffer, and counts upward in bytes. If @var{position} is out of range, the value is @code{nil}. @end defun @defun byte-to-position byte-position -Return the buffer position corresponding to byte-position +Return the buffer position, in character units, corresponding to given @var{byte-position} in the current buffer. 
If @var{byte-position} is -out of range, the value is @code{nil}. +out of range, the value is @code{nil}. In a multibyte buffer, an +arbitrary value of @var{byte-position} might not be at a character +boundary, but inside a multibyte sequence representing a single +character; in this case, this function returns the buffer position of +the character whose multibyte sequence includes @var{byte-position}. +In other words, the value is the same for all byte positions that +belong to the same character. @end defun @defun multibyte-string-p string -Return @code{t} if @var{string} is a multibyte string. +Return @code{t} if @var{string} is a multibyte string, @code{nil} +otherwise. @end defun @defun string-bytes string @@ -119,14 +148,20 @@ If @var{string} is a multibyte string, this can be greater than @code{(length @var{string})}. @end defun +@defun unibyte-string &rest bytes +This function concatenates all of its argument @var{bytes} and makes the +result a unibyte string. +@end defun + @node Converting Representations @section Converting Text Representations Emacs can convert unibyte text to multibyte; it can also convert -multibyte text to unibyte, though this conversion loses information. In -general these conversions happen when inserting text into a buffer, or -when putting text from several strings together in one string. You can -also explicitly convert a string's contents to either representation. +multibyte text to unibyte, provided that the multibyte text contains +only @acronym{ASCII} and 8-bit raw bytes. In general, these +conversions happen when inserting text into a buffer, or when putting +text from several strings together in one string. You can also +explicitly convert a string's contents to either representation. Emacs chooses the representation for a string based on the text that it is constructed from.
The general rule is to convert unibyte text to @@ -145,89 +180,47 @@ acceptable because the buffer's representation is a choice made by the user that cannot be overridden automatically. Converting unibyte text to multibyte text leaves @acronym{ASCII} characters -unchanged, and likewise character codes 128 through 159. It converts -the non-@acronym{ASCII} codes 160 through 255 by adding the value -@code{nonascii-insert-offset} to each character code. By setting this -variable, you specify which character set the unibyte characters -correspond to (@pxref{Character Sets}). For example, if -@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char -'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters -correspond to Latin 1. If it is 2688, which is @code{(- (make-char -'greek-iso8859-7) 128)}, then they correspond to Greek letters. - - Converting multibyte text to unibyte is simpler: it discards all but -the low 8 bits of each character code. If @code{nonascii-insert-offset} -has a reasonable value, corresponding to the beginning of some character -set, this conversion is the inverse of the other: converting unibyte -text to multibyte and back to unibyte reproduces the original unibyte -text. +unchanged, and converts bytes with codes 128 through 159 to the +multibyte representation of raw eight-bit bytes. -@defvar nonascii-insert-offset -This variable specifies the amount to add to a non-@acronym{ASCII} character -when converting unibyte text to multibyte. It also applies when -@code{self-insert-command} inserts a character in the unibyte -non-@acronym{ASCII} range, 128 through 255. However, the functions -@code{insert} and @code{insert-char} do not perform this conversion. - -The right value to use to select character set @var{cs} is @code{(- -(make-char @var{cs}) 128)}. If the value of -@code{nonascii-insert-offset} is zero, then conversion actually uses the -value for the Latin 1 character set, rather than zero. 
-@end defvar + Converting multibyte text to unibyte converts all @acronym{ASCII} +and eight-bit characters to their single-byte form, but loses +information for non-@acronym{ASCII} characters by discarding all but +the low 8 bits of each character's codepoint. Converting unibyte text +to multibyte and back to unibyte reproduces the original unibyte text. -@defvar nonascii-translation-table -This variable provides a more general alternative to -@code{nonascii-insert-offset}. You can use it to specify independently -how to translate each code in the range of 128 through 255 into a -multibyte character. The value should be a char-table, or @code{nil}. -If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}. -@end defvar - -The next three functions either return the argument @var{string}, or a +The next two functions either return the argument @var{string}, or a newly created string with no text properties. -@defun string-make-unibyte string -This function converts the text of @var{string} to unibyte -representation, if it isn't already, and returns the result. If -@var{string} is a unibyte string, it is returned unchanged. Multibyte -character codes are converted to unibyte according to -@code{nonascii-translation-table} or, if that is @code{nil}, using -@code{nonascii-insert-offset}. If the lookup in the translation table -fails, this function takes just the low 8 bits of each character. -@end defun - -@defun string-make-multibyte string -This function converts the text of @var{string} to multibyte -representation, if it isn't already, and returns the result. If -@var{string} is a multibyte string or consists entirely of -@acronym{ASCII} characters, it is returned unchanged. In particular, -if @var{string} is unibyte and entirely @acronym{ASCII}, the returned -string is unibyte. (When the characters are all @acronym{ASCII}, -Emacs primitives will treat the string the same way whether it is -unibyte or multibyte.) 
If @var{string} is unibyte and contains -non-@acronym{ASCII} characters, the function -@code{unibyte-char-to-multibyte} is used to convert each unibyte -character to a multibyte character. -@end defun - @defun string-to-multibyte string This function returns a multibyte string containing the same sequence -of character codes as @var{string}. Unlike -@code{string-make-multibyte}, this function unconditionally returns a -multibyte string. If @var{string} is a multibyte string, it is -returned unchanged. +of characters as @var{string}. If @var{string} is a multibyte string, +it is returned unchanged. The function assumes that @var{string} +includes only @acronym{ASCII} characters and raw 8-bit bytes; the +latter are converted to their multibyte representation corresponding +to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text +Representations, codepoints}). +@end defun + +@defun string-to-unibyte string +This function returns a unibyte string containing the same sequence of +characters as @var{string}. It signals an error if @var{string} +contains a non-@acronym{ASCII} character. If @var{string} is a +unibyte string, it is returned unchanged. Use this function for +@var{string} arguments that contain only @acronym{ASCII} and eight-bit +characters. @end defun @defun multibyte-char-to-unibyte char This convert the multibyte character @var{char} to a unibyte -character, based on @code{nonascii-translation-table} and -@code{nonascii-insert-offset}. +character. If @var{char} is a character that is neither +@acronym{ASCII} nor eight-bit, the value is -1. @end defun @defun unibyte-char-to-multibyte char This convert the unibyte character @var{char} to a multibyte -character, based on @code{nonascii-translation-table} and -@code{nonascii-insert-offset}. +character, assuming @var{char} is either @acronym{ASCII} or raw 8-bit +byte. @end defun @node Selecting a Representation @@ -242,13 +235,13 @@ is non-@code{nil}, the buffer becomes multibyte. 
If @var{multibyte} is @code{nil}, the buffer becomes unibyte. This function leaves the buffer contents unchanged when viewed as a -sequence of bytes. As a consequence, it can change the contents viewed -as characters; a sequence of two bytes which is treated as one character -in multibyte representation will count as two characters in unibyte -representation. Character codes 128 through 159 are an exception. They -are represented by one byte in a unibyte buffer, but when the buffer is -set to multibyte, they are converted to two-byte sequences, and vice -versa. +sequence of bytes. As a consequence, it can change the contents +viewed as characters; a sequence of three bytes which is treated as +one character in multibyte representation will count as three +characters in unibyte representation. Eight-bit characters +representing raw bytes are an exception. They are represented by one +byte in a unibyte buffer, but when the buffer is set to multibyte, +they are converted to two-byte sequences, and vice versa. This function sets @code{enable-multibyte-characters} to record which representation is in use. It also adjusts various data in the buffer @@ -263,81 +256,96 @@ base buffer. @defun string-as-unibyte string This function returns a string with the same bytes as @var{string} but treating each byte as a character. This means that the value may have -more characters than @var{string} has. +more characters than @var{string} has. Eight-bit characters +representing raw bytes are an exception: each one of them is converted +to a single byte. If @var{string} is already a unibyte string, then the value is @var{string} itself. Otherwise it is a newly created string, with no -text properties. If @var{string} is multibyte, any characters it -contains of charset @code{eight-bit-control} or @code{eight-bit-graphic} -are converted to the corresponding single byte. +text properties. 
@end defun @defun string-as-multibyte string This function returns a string with the same bytes as @var{string} but -treating each multibyte sequence as one character. This means that the -value may have fewer characters than @var{string} has. +treating each multibyte sequence as one character. This means that +the value may have fewer characters than @var{string} has. If a byte +sequence in @var{string} is invalid as a multibyte representation of a +single character, each byte in the sequence is treated as raw 8-bit +byte. If @var{string} is already a multibyte string, then the value is @var{string} itself. Otherwise it is a newly created string, with no -text properties. If @var{string} is unibyte and contains any individual -8-bit bytes (i.e.@: not part of a multibyte form), they are converted to -the corresponding multibyte character of charset @code{eight-bit-control} -or @code{eight-bit-graphic}. +text properties. @end defun @node Character Codes @section Character Codes @cindex character codes - The unibyte and multibyte text representations use different character -codes. The valid character codes for unibyte representation range from -0 to 255---the values that can fit in one byte. The valid character -codes for multibyte representation range from 0 to 524287, but not all -values in that range are valid. The values 128 through 255 are not -entirely proper in multibyte text, but they can occur if you do explicit -encoding and decoding (@pxref{Explicit Encoding}). Some other character -codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes -0 through 127 are completely legitimate in both representations. - -@defun char-valid-p charcode &optional genericp -This returns @code{t} if @var{charcode} is valid (either for unibyte -text or for multibyte text). + The unibyte and multibyte text representations use different +character codes. 
The valid character codes for unibyte representation +range from 0 to 255---the values that can fit in one byte. The valid +character codes for multibyte representation range from 0 to 4194303 +(#x3FFFFF). In this code space, values 0 through 127 are for +@acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F) +are for non-@acronym{ASCII} characters. Values 0 through 1114111 +(#x10FFFF) correspond to Unicode characters of the same codepoint, +while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for +representing eight-bit raw bytes. + +@defun characterp charcode +This returns @code{t} if @var{charcode} is a valid character, and +@code{nil} otherwise. @example -(char-valid-p 65) +(characterp 65) @result{} t -(char-valid-p 256) - @result{} nil -(char-valid-p 2248) +(characterp 4194303) @result{} t +(characterp 4194304) + @result{} nil @end example +@end defun -If the optional argument @var{genericp} is non-@code{nil}, this -function also returns @code{t} if @var{charcode} is a generic -character (@pxref{Splitting Characters}). +@defun get-byte pos &optional string +This function returns the byte at the current buffer's character position +@var{pos}. If the current buffer is unibyte, this is literally the +byte at that position. If the buffer is multibyte, byte values of +@acronym{ASCII} characters are the same as character codepoints, +whereas eight-bit raw bytes are converted to their 8-bit codes. The +function signals an error if the character at @var{pos} is +non-@acronym{ASCII}. + +The optional argument @var{string} means to get a byte value from that +string instead of the current buffer. @end defun @node Character Sets @section Character Sets @cindex character sets - Emacs classifies characters into various @dfn{character sets}, each of -which has a name which is a symbol. Each character belongs to one and -only one character set. - - In general, there is one character set for each distinct script.
For -example, @code{latin-iso8859-1} is one character set, -@code{greek-iso8859-7} is another, and @code{ascii} is another. An -Emacs character set can hold at most 9025 characters; therefore, in some -cases, characters that would logically be grouped together are split -into several character sets. For example, one set of Chinese -characters, generally known as Big 5, is divided into two Emacs -character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. - - @acronym{ASCII} characters are in character set @code{ascii}. The -non-@acronym{ASCII} characters 128 through 159 are in character set -@code{eight-bit-control}, and codes 160 through 255 are in character set -@code{eight-bit-graphic}. +@cindex charset +@cindex coded character set +An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters +in which each character is assigned a numeric code point. (The +Unicode standard calls this a @dfn{coded character set}.) Each Emacs +charset has a name which is a symbol. A single character can belong +to any number of different character sets, but it will generally have +a different code point in each charset. Examples of character sets +include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and +@code{windows-1255}. The code point assigned to a character in a +charset is usually different from its code point used in Emacs buffers +and strings. + +@cindex @code{emacs}, a charset +@cindex @code{unicode}, a charset +@cindex @code{eight-bit}, a charset + Emacs defines several special character sets. The character set +@code{unicode} includes all the characters whose Emacs code points are +in the range @code{0..10FFFF}. The character set @code{emacs} +includes all @acronym{ASCII} and non-@acronym{ASCII} characters. +Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; +Emacs uses it to represent raw bytes encountered in text. 
@defun charsetp object Returns @code{t} if @var{object} is a symbol that names a character set, @@ -348,155 +356,93 @@ Returns @code{t} if @var{object} is a symbol that names a character set, The value is a list of all defined character set names. @end defvar -@defun charset-list -This function returns the value of @code{charset-list}. It is only -provided for backward compatibility. +@defun charset-priority-list &optional highestp +This function returns a list of all defined character sets ordered by +their priority. If @var{highestp} is non-@code{nil}, the function +returns a single character set of the highest priority. +@end defun + +@defun set-charset-priority &rest charsets +This function makes @var{charsets} the highest priority character sets. @end defun @defun char-charset character -This function returns the name of the character set that @var{character} -belongs to, or the symbol @code{unknown} if @var{character} is not a -valid character. +This function returns the name of the character set of highest +priority that @var{character} belongs to. @acronym{ASCII} characters +are an exception: for them, this function always returns @code{ascii}. @end defun @defun charset-plist charset -This function returns the charset property list of the character set -@var{charset}. Although @var{charset} is a symbol, this is not the same -as the property list of that symbol. Charset properties are used for -special purposes within Emacs. +This function returns the property list of the character set +@var{charset}. Although @var{charset} is a symbol, this is not the +same as the property list of that symbol. Charset properties include +important information about the charset, such as its documentation +string, short name, etc. @end defun -@deffn Command list-charset-chars charset -This command displays a list of characters in the character set -@var{charset}.
-@end deffn - -@node Chars and Bytes -@section Characters and Bytes -@cindex bytes and characters - -@cindex introduction sequence (of character) -@cindex dimension (of character set) - In multibyte representation, each character occupies one or more -bytes. Each character set has an @dfn{introduction sequence}, which is -normally one or two bytes long. (Exception: the @code{ascii} character -set and the @code{eight-bit-graphic} character set have a zero-length -introduction sequence.) The introduction sequence is the beginning of -the byte sequence for any character in the character set. The rest of -the character's bytes distinguish it from the other characters in the -same character set. Depending on the character set, there are either -one or two distinguishing bytes; the number of such bytes is called the -@dfn{dimension} of the character set. - -@defun charset-dimension charset -This function returns the dimension of @var{charset}; at present, the -dimension is always 1 or 2. +@defun put-charset-property charset propname value +This function sets the @var{propname} property of @var{charset} to the +given @var{value}. @end defun -@defun charset-bytes charset -This function returns the number of bytes used to represent a character -in character set @var{charset}. +@defun get-charset-property charset propname +This function returns the value of @var{charset}'s property +@var{propname}. @end defun - This is the simplest way to determine the byte length of a character -set's introduction sequence: - -@example -(- (charset-bytes @var{charset}) - (charset-dimension @var{charset})) -@end example - -@node Splitting Characters -@section Splitting Characters -@cindex character as bytes - - The functions in this section convert between characters and the byte -values used to represent them. For most purposes, there is no need to -be concerned with the sequence of bytes used to represent a character, -because Emacs translates automatically when necessary.
- -@defun split-char character -Return a list containing the name of the character set of -@var{character}, followed by one or two byte values (integers) which -identify @var{character} within that character set. The number of byte -values is the character set's dimension. - -If @var{character} is invalid as a character code, @code{split-char} -returns a list consisting of the symbol @code{unknown} and @var{character}. +@deffn Command list-charset-chars charset +This command displays a list of characters in the character set +@var{charset}. +@end deffn -@example -(split-char 2248) - @result{} (latin-iso8859-1 72) -(split-char 65) - @result{} (ascii 65) -(split-char 128) - @result{} (eight-bit-control 128) -@end example + Emacs can convert between its internal representation of a character +and the character's codepoint in a specific charset. The following +two functions support these conversions. + +@c FIXME: decode-char and encode-char accept and ignore an additional +@c argument @var{restriction}. When that argument actually makes a +@c difference, it should be documented here. +@defun decode-char charset code-point +This function decodes a character that is assigned a @var{code-point} +in @var{charset}, to the corresponding Emacs character, and returns +it. If @var{charset} doesn't contain a character of that code point, +the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp +integer (@pxref{Integer Basics, most-positive-fixnum}), it can be +specified as a cons cell @code{(@var{high} . @var{low})}, where +@var{low} are the lower 16 bits of the value and @var{high} are the +high 16 bits. @end defun -@cindex generate characters in charsets -@defun make-char charset &optional code1 code2 -This function returns the character in character set @var{charset} whose -position codes are @var{code1} and @var{code2}. This is roughly the -inverse of @code{split-char}. 
Normally, you should specify either one -or both of @var{code1} and @var{code2} according to the dimension of -@var{charset}. For example, - -@example -(make-char 'latin-iso8859-1 72) - @result{} 2248 -@end example - -Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed -before they are used to index @var{charset}. Thus you may use, for -instance, an ISO 8859 character code rather than subtracting 128, as -is necessary to index the corresponding Emacs charset. +@defun encode-char char charset +This function returns the code point assigned to the character +@var{char} in @var{charset}. If the result does not fit in a Lisp +integer, it is returned as a cons cell @code{(@var{high} . @var{low})} +that fits the second argument of @code{decode-char} above. If +@var{charset} doesn't have a codepoint for @var{char}, the value is +@code{nil}. @end defun -@cindex generic characters - If you call @code{make-char} with no @var{byte-values}, the result is -a @dfn{generic character} which stands for @var{charset}. A generic -character is an integer, but it is @emph{not} valid for insertion in the -buffer as a character. It can be used in @code{char-table-range} to -refer to the whole character set (@pxref{Char-Tables}). -@code{char-valid-p} returns @code{nil} for generic characters. -For example: - -@example -(make-char 'latin-iso8859-1) - @result{} 2176 -(char-valid-p 2176) - @result{} nil -(char-valid-p 2176 t) - @result{} t -(split-char 2176) - @result{} (latin-iso8859-1 0) -@end example - -The character sets @code{ascii}, @code{eight-bit-control}, and -@code{eight-bit-graphic} don't have corresponding generic characters. If -@var{charset} is one of them and you don't supply @var{code1}, -@code{make-char} returns the character code corresponding to the -smallest code in @var{charset}. - @node Scanning Charsets @section Scanning for Character Sets - Sometimes it is useful to find out which character sets appear in a -part of a buffer or a string. 
One use for this is in determining which -coding systems (@pxref{Coding Systems}) are capable of representing all -of the text in question. + Sometimes it is useful to find out, for characters that appear in a +certain part of a buffer or a string, to which character sets they +belong. One use for this is in determining which coding systems +(@pxref{Coding Systems}) are capable of representing all of the text +in question; another is to determine the font(s) for displaying that +text. @defun charset-after &optional pos -This function return the charset of a character in the current buffer -at position @var{pos}. If @var{pos} is omitted or @code{nil}, it -defaults to the current value of point. If @var{pos} is out of range, -the value is @code{nil}. +This function returns the charset of highest priority containing the +character in the current buffer at position @var{pos}. If @var{pos} +is omitted or @code{nil}, it defaults to the current value of point. +If @var{pos} is out of range, the value is @code{nil}. @end defun @defun find-charset-region beg end &optional translation -This function returns a list of the character sets that appear in the -current buffer between positions @var{beg} and @var{end}. +This function returns a list of the character sets of highest priority +that contain characters in the current buffer between positions +@var{beg} and @var{end}. The optional argument @var{translation} specifies a translation table to be used in scanning the text (@pxref{Translation of Characters}). If it @@ -506,10 +452,10 @@ characters instead of the characters actually in the buffer. @end defun @defun find-charset-string string &optional translation -This function returns a list of the character sets that appear in the -string @var{string}. It is just like @code{find-charset-region}, except -that it applies to the contents of @var{string} instead of part of the -current buffer. 
+This function returns a list of the character sets of highest priority +that contain characters in @var{string}. It is just like +@code{find-charset-region}, except that it applies to the contents of +@var{string} instead of part of the current buffer. @end defun @node Translation of Characters @@ -517,19 +463,18 @@ current buffer. @cindex character translation tables @cindex translation tables - A @dfn{translation table} is a char-table that specifies a mapping -of characters into characters. These tables are used in encoding and -decoding, and for other purposes. Some coding systems specify their -own particular translation tables; there are also default translation -tables which apply to all other coding systems. + A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that +specifies a mapping of characters into characters. These tables are +used in encoding and decoding, and for other purposes. Some coding +systems specify their own particular translation tables; there are +also default translation tables which apply to all other coding +systems. - For instance, the coding-system @code{utf-8} has a translation table -that maps characters of various charsets (e.g., -@code{latin-iso8859-@var{x}}) into Unicode character sets. This way, -it can encode Latin-2 characters into UTF-8. Meanwhile, -@code{unify-8859-on-decoding-mode} operates by specifying -@code{standard-translation-table-for-decode} to translate -Latin-@var{x} characters into corresponding Unicode characters. + A translation table has two extra slots. The first is either +@code{nil} or a translation table that performs the reverse +translation; the second is the maximum number of characters to look up +for translating sequences of characters (see the description of +@code{make-translation-table-from-alist} below). 
@defun make-translation-table &rest translations This function returns a translation table based on the argument @@ -541,47 +486,69 @@ The arguments and the forms in each argument are processed in order, and if a previous form already translates @var{to} to some other character, say @var{to-alt}, @var{from} is also translated to @var{to-alt}. - -You can also map one whole character set into another character set with -the same dimension. To do this, you specify a generic character (which -designates a character set) for @var{from} (@pxref{Splitting Characters}). -In this case, if @var{to} is also a generic character, its character -set should have the same dimension as @var{from}'s. Then the -translation table translates each character of @var{from}'s character -set into the corresponding character of @var{to}'s character set. If -@var{from} is a generic character and @var{to} is an ordinary -character, then the translation table translates every character of -@var{from}'s character set into @var{to}. @end defun - In decoding, the translation table's translations are applied to the -characters that result from ordinary decoding. If a coding system has -property @code{translation-table-for-decode}, that specifies the -translation table to use. (This is a property of the coding system, -as returned by @code{coding-system-get}, not a property of the symbol -that is the coding system's name. @xref{Coding System Basics,, Basic -Concepts of Coding Systems}.) Otherwise, if -@code{standard-translation-table-for-decode} is non-@code{nil}, -decoding uses that table. - - In encoding, the translation table's translations are applied to the -characters in the buffer, and the result of translation is actually -encoded. If a coding system has property -@code{translation-table-for-encode}, that specifies the translation -table to use. Otherwise the variable -@code{standard-translation-table-for-encode} specifies the translation -table. 
+ During decoding, the translation table's translations are applied to +the characters that result from ordinary decoding. If a coding system +has property @code{:decode-translation-table}, that specifies the +translation table to use, or a list of translation tables to apply in +sequence. (This is a property of the coding system, as returned by +@code{coding-system-get}, not a property of the symbol that is the +coding system's name. @xref{Coding System Basics,, Basic Concepts of +Coding Systems}.) Finally, if +@code{standard-translation-table-for-decode} is non-@code{nil}, the +resulting characters are translated by that table. + + During encoding, the translation table's translations are applied to +the characters in the buffer, and the result of translation is +actually encoded. If a coding system has property +@code{:encode-translation-table}, that specifies the translation table +to use, or a list of translation tables to apply in sequence. In +addition, if the variable @code{standard-translation-table-for-encode} +is non-@code{nil}, it specifies the translation table to use for +translating the result. @defvar standard-translation-table-for-decode -This is the default translation table for decoding, for -coding systems that don't specify any other translation table. +This is the default translation table for decoding. If a coding +system specifies its own translation tables, the table that is the +value of this variable, if non-@code{nil}, is applied after them. @end defvar @defvar standard-translation-table-for-encode -This is the default translation table for encoding, for -coding systems that don't specify any other translation table. +This is the default translation table for encoding. If a coding +system specifies its own translation tables, the table that is the +value of this variable, if non-@code{nil}, is applied after them.
@end defvar +@defun make-translation-table-from-vector vec +This function returns a translation table made from @var{vec} that is +an array of 256 elements to map byte values 0 through 255 to +characters. Elements may be @code{nil} for untranslated bytes. The +returned table has a translation table for reverse mapping in the +first extra slot, and the value @code{1} in the second extra slot. + +This function provides an easy way to make a private coding system +that maps each byte to a specific character. You can specify the +returned table and the reverse translation table using the properties +@code{:decode-translation-table} and @code{:encode-translation-table} +respectively in the @var{props} argument to +@code{define-coding-system}. +@end defun + +@defun make-translation-table-from-alist alist +This function is similar to @code{make-translation-table} but returns +a complex translation table rather than a simple one-to-one mapping. +Each element of @var{alist} is of the form @code{(@var{from} +. @var{to})}, where @var{from} and @var{to} are either a character or +a vector specifying a sequence of characters. If @var{from} is a +character, that character is translated to @var{to} (i.e.@: to a +character or a character sequence). If @var{from} is a vector of +characters, that sequence is translated to @var{to}. The returned +table has a translation table for reverse mapping in the first extra +slot, and the maximum length of all the @var{from} character sequences +in the second extra slot. +@end defun + @node Coding Systems @section Coding Systems