Merge from emacs-24; up to 2012-12-17T11:17:34Z!rgm@gnu.org

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index a3f25af47194688e296c2ff5f9dadba2b33d8680..e462c3b4ce40624d62874af31c62131ef4ab9160 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -1,9 +1,8 @@
  @c -*-texinfo-*-
  @c This is part of the GNU Emacs Lisp Reference Manual.
-@c Copyright (C) 1998-1999, 2001-2011  Free Software Foundation, Inc.
+@c Copyright (C) 1998-1999, 2001-2013 Free Software Foundation, Inc.
  @c See the file elisp.texi for copying conditions.
-@setfilename ../../info/characters
-@node Non-ASCII Characters, Searching and Matching, Text, Top
+@node Non-ASCII Characters
  @chapter Non-@acronym{ASCII} Characters
  @cindex multibyte characters
  @cindex characters, multi-byte
@@ -201,7 +200,7 @@ characters.
  @defun byte-to-string byte
  @cindex byte to string
  This function returns a unibyte string containing a single byte of
-character data, @var{character}.  It signals a error if
+character data, @var{character}.  It signals an error if
  @var{character} is not an integer between 0 and 255.
  @end defun
  
@@ -242,8 +241,12 @@ representation is in use.  It also adjusts various data in the buffer
  (including overlays, text properties and markers) so that they cover the
  same text as they did before.
  
-You cannot use @code{set-buffer-multibyte} on an indirect buffer,
-because indirect buffers always inherit the representation of the
+This function signals an error if the buffer is narrowed, since the
+narrowing might have occurred in the middle of multibyte character
+sequences.
+
+This function also signals an error if the buffer is an indirect
+buffer.  An indirect buffer always inherits the representation of its
  base buffer.
  @end defun
  
@@ -369,53 +372,69 @@ replacing each @samp{_} character with a dash @samp{-}.  For example,
  @code{canonical-combining-class}.  However, sometimes we shorten the
  names to make their use easier.
  
+@cindex unassigned character codepoints
+  Some codepoints are left @dfn{unassigned} by the
+@acronym{UCD}---they don't correspond to any character.  The Unicode
+Standard defines default values of properties for such codepoints;
+they are mentioned below for each property.
+
    Here is the full list of value types for all the character
  properties that Emacs knows about:
  
  @table @code
  @item name
-This property corresponds to the Unicode @code{Name} property.  The
-value is a string consisting of upper-case Latin letters A to Z,
-digits, spaces, and hyphen @samp{-} characters.
+Corresponds to the @code{Name} Unicode property.  The value is a
+string consisting of upper-case Latin letters A to Z, digits, spaces,
+and hyphen @samp{-} characters.  For unassigned codepoints, the value
+is an empty string.
  
  @cindex unicode general category
  @item general-category
-This property corresponds to the Unicode @code{General_Category}
-property.  The value is a symbol whose name is a 2-letter abbreviation
-of the character's classification.
+Corresponds to the @code{General_Category} Unicode property.  The
+value is a symbol whose name is a 2-letter abbreviation of the
+character's classification.  For unassigned codepoints, the value
+is @code{Cn}.
  
  @item canonical-combining-class
-Corresponds to the Unicode @code{Canonical_Combining_Class} property.
-The value is an integer number.
+Corresponds to the @code{Canonical_Combining_Class} Unicode property.
+The value is an integer number.  For unassigned codepoints, the value
+is zero.
  
+@cindex bidirectional class of characters
  @item bidi-class
  Corresponds to the Unicode @code{Bidi_Class} property.  The value is a
  symbol whose name is the Unicode @dfn{directional type} of the
-character.
+character.  Emacs uses this property when it reorders bidirectional
+text for display (@pxref{Bidirectional Display}).  For unassigned
+codepoints, the value depends on the code blocks to which the
+codepoint belongs: most unassigned codepoints get the value of
+@code{L} (strong L), but some get values of @code{AL} (Arabic letter)
+or @code{R} (strong R).
  
  @item decomposition
-Corresponds to the Unicode @code{Decomposition_Type} and
-@code{Decomposition_Value} properties.  The value is a list, whose
-first element may be a symbol representing a compatibility formatting
-tag, such as @code{small}@footnote{
-Note that the Unicode spec writes these tag names inside
-@samp{<..>} brackets.  The tag names in Emacs do not include the
-brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
-@samp{small}.
-}; the other elements are characters that give the compatibility
-decomposition sequence of this character.
+Corresponds to the Unicode properties @code{Decomposition_Type} and
+@code{Decomposition_Value}.  The value is a list, whose first element
+may be a symbol representing a compatibility formatting tag, such as
+@code{small}@footnote{The Unicode specification writes these tag names
+inside @samp{<..>} brackets, but the tag names in Emacs do not include
+the brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
+@samp{small}.  }; the other elements are characters that give the
+compatibility decomposition sequence of this character.  For
+unassigned codepoints, the value is the character itself.
  
  @item decimal-digit-value
  Corresponds to the Unicode @code{Numeric_Value} property for
  characters whose @code{Numeric_Type} is @samp{Digit}.  The value is an
-integer number.
+integer number.  For unassigned codepoints, the value is @code{nil},
+which means @acronym{NaN}, or ``not-a-number''.
  
  @item digit-value
  Corresponds to the Unicode @code{Numeric_Value} property for
  characters whose @code{Numeric_Type} is @samp{Decimal}.  The value is
  an integer number.  Examples of such characters include compatibility
  subscript and superscript digits, for which the value is the
-corresponding number.
+corresponding number.  For unassigned codepoints, the value is
+@code{nil}, which means @acronym{NaN}.
  
  @item numeric-value
  Corresponds to the Unicode @code{Numeric_Value} property for
@@ -424,33 +443,53 @@ this property is an integer or a floating-point number.  Examples of
  characters that have this property include fractions, subscripts,
  superscripts, Roman numerals, currency numerators, and encircled
  numbers.  For example, the value of this property for the character
-@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}.
+@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}.  For
+unassigned codepoints, the value is @code{nil}, which means
+@acronym{NaN}.
  
+@cindex mirroring of characters
  @item mirrored
  Corresponds to the Unicode @code{Bidi_Mirrored} property.  The value
-of this property is a symbol, either @code{Y} or @code{N}.
+of this property is a symbol, either @code{Y} or @code{N}.  For
+unassigned codepoints, the value is @code{N}.
+
+@item mirroring
+Corresponds to the Unicode @code{Bidi_Mirroring_Glyph} property.  The
+value of this property is a character whose glyph represents the
+mirror image of the character's glyph, or @code{nil} if there's no
+defined mirroring glyph.  All the characters whose @code{mirrored}
+property is @code{N} have @code{nil} as their @code{mirroring}
+property; however, some characters whose @code{mirrored} property is
+@code{Y} also have @code{nil} for @code{mirroring}, because no
+appropriate characters exist with mirrored glyphs.  Emacs uses this
+property to display mirror images of characters when appropriate
+(@pxref{Bidirectional Display}).  For unassigned codepoints, the value
+is @code{nil}.
  
  @item old-name
  Corresponds to the Unicode @code{Unicode_1_Name} property.  The value
-is a string.
+is a string.  For unassigned codepoints, the value is an empty string.
  
  @item iso-10646-comment
  Corresponds to the Unicode @code{ISO_Comment} property.  The value is
-a string.
+a string.  For unassigned codepoints, the value is an empty string.
  
  @item uppercase
  Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
-The value of this property is a single character.
+The value of this property is a single character.  For unassigned
+codepoints, the value is @code{nil}, which means the character itself.
  
  @item lowercase
  Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
-The value of this property is a single character.
+The value of this property is a single character.  For unassigned
+codepoints, the value is @code{nil}, which means the character itself.
  
  @item titlecase
  Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
  @dfn{Title case} is a special form of a character used when the first
  character of a word needs to be capitalized.  The value of this
-property is a single character.
+property is a single character.  For unassigned codepoints, the value
+is @code{nil}, which means the character itself.
  @end table
  
  @defun get-char-code-property char propname
@@ -466,15 +505,18 @@ This function returns the value of @var{char}'s @var{propname} property.
       @result{} Nd
  @end group
  @group
-(get-char-code-property ?\u2084 'digit-value) ; subscript 4
+;; subscript 4
+(get-char-code-property ?\u2084 'digit-value)
       @result{} 4
  @end group
  @group
-(get-char-code-property ?\u2155 'numeric-value) ; one fifth
+;; one fifth
+(get-char-code-property ?\u2155 'numeric-value)
       @result{} 0.2
  @end group
  @group
-(get-char-code-property ?\u2163 'numeric-value) ; Roman IV
+;; Roman IV
+(get-char-code-property ?\u2163 'numeric-value)
       @result{} 4
  @end group
  @end example
@@ -568,7 +610,7 @@ The value is a list of all defined character set names.
  @end defvar
  
  @defun charset-priority-list &optional highestp
-This functions returns a list of all defined character sets ordered by
+This function returns a list of all defined character sets ordered by
  their priority.  If @var{highestp} is non-@code{nil}, the function
  returns a single character set of the highest priority.
  @end defun
@@ -783,7 +825,7 @@ a complex translation table rather than a simple one-to-one mapping.
  Each element of @var{alist} is of the form @code{(@var{from}
  . @var{to})}, where @var{from} and @var{to} are either characters or
  vectors specifying a sequence of characters.  If @var{from} is a
-character, that character is translated to @var{to} (i.e.@: to a
+character, that character is translated to @var{to} (i.e., to a
  character or a character sequence).  If @var{from} is a vector of
  characters, that sequence is translated to @var{to}.  The returned
  table has a translation table for reverse mapping in the first extra
@@ -813,8 +855,6 @@ documented here.
                                      for a single file operation.
  * Explicit Encoding::           Encoding or decoding text without doing I/O.
  * Terminal I/O Encoding::       Use of encoding for terminal I/O.
-* MS-DOS File Types::           How DOS "text" and "binary" files
-                                    relate to coding systems.
  @end menu
  
  @node Coding System Basics
@@ -1129,7 +1169,7 @@ positions.
  @defun detect-coding-region start end &optional highest
  This function chooses a plausible coding system for decoding the text
  from @var{start} to @var{end}.  This text should be a byte sequence,
-i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and
+i.e., unibyte text or multibyte text with only @acronym{ASCII} and
  eight-bit characters (@pxref{Explicit Encoding}).
  
  Normally this function returns a list of coding systems that could
@@ -1449,11 +1489,11 @@ for decoding (in case @var{operation} does decoding), and
  @var{encoding-system} is the coding system for encoding (in case
  @var{operation} does encoding).
  
-The argument @var{operation} is a symbol, one of @code{write-region},
-@code{start-process}, @code{call-process}, @code{call-process-region},
-@code{insert-file-contents}, or @code{open-network-stream}.  These are
-the names of the Emacs I/O primitives that can do character code and
-eol conversion.
+The argument @var{operation} is a symbol; it should be one of
+@code{write-region}, @code{start-process}, @code{call-process},
+@code{call-process-region}, @code{insert-file-contents}, or
+@code{open-network-stream}.  These are the names of the Emacs I/O
+primitives that can do character code and eol conversion.
  
  The remaining arguments should be the same arguments that might be given
  to the corresponding I/O primitive.  Depending on the primitive, one
@@ -1539,7 +1579,7 @@ decoding functions (@pxref{Explicit Encoding}).
    Sometimes, you need to prefer several coding systems for some
  operation, rather than fix a single one.  Emacs lets you specify a
  priority order for using coding systems.  This ordering affects the
-sorting of lists of coding sysems returned by functions such as
+sorting of lists of coding systems returned by functions such as
  @code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
  
  @defun coding-system-priority-list &optional highestp
@@ -1733,62 +1773,6 @@ for encoding terminal output from @var{terminal}.  If
  @code{nil}, that means the currently selected frame's terminal.
  @end deffn
  
-@node MS-DOS File Types
-@subsection MS-DOS File Types
-@cindex DOS file types
-@cindex MS-DOS file types
-@cindex Windows file types
-@cindex file types on MS-DOS and Windows
-@cindex text files and binary files
-@cindex binary files and text files
-
-  On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
-end-of-line conversion for a file by looking at the file's name.  This
-feature classifies files as @dfn{text files} and @dfn{binary files}.  By
-``binary file'' we mean a file of literal byte values that are not
-necessarily meant to be characters; Emacs does no end-of-line conversion
-and no character code conversion for them.  On the other hand, the bytes
-in a text file are intended to represent characters; when you create a
-new file whose name implies that it is a text file, Emacs uses DOS
-end-of-line conversion.
-
-@defvar buffer-file-type
-This variable, automatically buffer-local in each buffer, records the
-file type of the buffer's visited file.  When a buffer does not specify
-a coding system with @code{buffer-file-coding-system}, this variable is
-used to determine which coding system to use when writing the contents
-of the buffer.  It should be @code{nil} for text, @code{t} for binary.
-If it is @code{t}, the coding system is @code{no-conversion}.
-Otherwise, @code{undecided-dos} is used.
-
-Normally this variable is set by visiting a file; it is set to
-@code{nil} if the file was visited without any actual conversion.
-
-Its default value is used to decide how to handle files for which
-@code{file-name-buffer-file-type-alist} says nothing about the type:
-If the default value is non-@code{nil}, then these files are treated as
-binary: the coding system @code{no-conversion} is used.  Otherwise,
-nothing special is done for them---the coding system is deduced solely
-from the file contents, in the usual Emacs fashion.
-@end defvar
-
-@defopt file-name-buffer-file-type-alist
-This variable holds an alist for recognizing text and binary files.
-Each element has the form (@var{regexp} . @var{type}), where
-@var{regexp} is matched against the file name, and @var{type} may be
-@code{nil} for text, @code{t} for binary, or a function to call to
-compute which.  If it is a function, then it is called with a single
-argument (the file name) and should return @code{t} or @code{nil}.
-
-When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
-which coding system to use when reading a file.  For a text file,
-@code{undecided-dos} is used.  For a binary file, @code{no-conversion}
-is used.
-
-If no element in this alist matches a given file name, then
-the default value of @code{buffer-file-type} says how to treat the file.
-@end defopt
-
  @node Input Methods
  @section Input Methods
  @cindex input methods