@c This is part of the Emacs manual.
@c Copyright (C) 1997, 1999, 2000, 2001, 2002, 2003, 2004,
-@c 2005, 2006, 2007 Free Software Foundation, Inc.
+@c 2005, 2006, 2007, 2008, 2009 Free Software Foundation, Inc.
@c See file emacs.texi for copying conditions.
@node International, Major Modes, Frames, Top
@chapter International Character Set Support
+@c This node is referenced in the tutorial. When renaming or deleting
+@c it, the tutorial needs to be adjusted. (TUTORIAL.de)
@cindex MULE
@cindex international scripts
@cindex multibyte characters
* Fontsets:: Fontsets are collections of fonts
that cover the whole spectrum of characters.
* Defining Fontsets:: Defining a new fontset.
+* Modifying Fontsets:: Modifying an existing fontset.
* Undisplayable Characters:: When characters don't display.
* Unibyte Mode:: You can pick one European character set
to use without multibyte characters.
The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
to multibyte characters, coding systems, and input methods.
+@kindex C-x =
+@findex what-cursor-position
+ The command @kbd{C-x =} (@code{what-cursor-position}) shows
+information about the character at point. In addition to the
+character position, which was described in @ref{Position Info}, this
+command displays how the character is encoded. For instance, it
+displays the following line in the echo area for the character
+@samp{c}:
+
+@smallexample
+Char: c (99, #o143, #x63) point=28062 of 36168 (78%) column=53
+@end smallexample
+
+ The four values after @samp{Char:} describe the character that
+follows point, first by showing it and then by giving its character
+code in decimal, octal and hex. For a non-@acronym{ASCII} multibyte
+character, these are followed by @samp{file} and the character's
+representation, in hex, in the buffer's coding system, if that coding
+system encodes the character safely and with a single byte
+(@pxref{Coding Systems}). If the character's encoding is longer than
+one byte, Emacs shows @samp{file ...}.
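+
+  The three numeric forms are the same character code written in
+different bases.  As a small sketch (using the standard Lisp function
+@code{format}, which is not otherwise discussed here), you can
+reproduce them for @samp{c} by evaluating:
+
+@lisp
+;; ?c is the read syntax for the character code of `c'.
+(format "%d, #o%o, #x%x" ?c ?c ?c)
+     @result{} "99, #o143, #x63"
+@end lisp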
+
+ However, if the character displayed is in the range 0200 through
+0377 octal, it may actually stand for an invalid UTF-8 byte read from
+a file. In Emacs, that byte is represented as a sequence of 8-bit
+characters, but all of them together display as the original invalid
+byte, in octal code. In this case, @kbd{C-x =} shows @samp{part of
+display ...} instead of @samp{file}.
+
+@cindex character set of character at point
+@cindex font of character at point
+@cindex text properties at point
+@cindex face at point
+ With a prefix argument (@kbd{C-u C-x =}), this command displays a
+detailed description of the character in a window:
+
+@itemize @bullet
+@item
+The character set name, and the codes that identify the character
+within that character set; @acronym{ASCII} characters are identified
+as belonging to the @code{ascii} character set.
+
+@item
+The character's syntax and categories.
+
+@item
+The character's encodings, both internally in the buffer, and externally
+if you were to save the file.
+
+@item
+What keys to type to input the character in the current input method
+(if it supports the character).
+
+@item
+If you are running Emacs on a graphical display, the font name and
+glyph code for the character. If you are running Emacs on a text-only
+terminal, the code(s) sent to the terminal.
+
+@item
+The character's text properties (@pxref{Text Properties,,,
+elisp, the Emacs Lisp Reference Manual}), including any non-default
+faces used to display the character, and any overlays containing it
+(@pxref{Overlays,,, elisp, the same manual}).
+@end itemize
+
+ Here's an example showing the Latin-1 character A with grave accent,
+in a buffer whose coding system is @code{utf-8-unix}:
+
+@smallexample
+ character: @`A (192, #o300, #xc0)
+preferred charset: unicode (Unicode (ISO10646))
+ code point: 0xC0
+ syntax: w which means: word
+ category: j:Japanese l:Latin v:Vietnamese
+ buffer code: #xC3 #x80
+ file code: not encodable by coding system undecided-unix
+ display: by this font (glyph code)
+ xft:-unknown-DejaVu Sans Mono-normal-normal-normal-*-13-*-*-*-m-0-iso10646-1 (#x82)
+
+Character code properties: customize what to show
+ name: LATIN CAPITAL LETTER A WITH GRAVE
+ general-category: Lu (Letter, Uppercase)
+ decomposition: (65 768) ('A' '̀')
+ old-name: LATIN CAPITAL LETTER A GRAVE
+
+There are text properties here:
+ auto-composed t
+@end smallexample
+
@node Enabling Multibyte
@section Enabling Multibyte Characters
This sets the default input method to be @code{chinese-tonepy}
whenever you choose a Chinese-GB language environment.
+You can instruct Emacs to activate a certain input method
+automatically. For example:
+
+@lisp
+(add-hook 'text-mode-hook
+ (lambda () (set-input-method "german-prefix")))
+@end lisp
+
+@noindent
+This activates the input method ``german-prefix'' automatically in
+Text mode.
+
@findex quail-set-keyboard-layout
Some input methods for alphabetic scripts work by (in effect)
remapping the keyboard to emulate various keyboard layouts commonly used
codepage. You can use these encodings just like any other coding
system; for example, to visit a file encoded in codepage 850, type
@kbd{C-x @key{RET} c cp850 @key{RET} C-x C-f @var{filename}
-@key{RET}}@footnote{
-In the MS-DOS port of Emacs, you need to create a @code{cp@var{nnn}}
-coding system with @kbd{M-x codepage-setup}, before you can use it.
-@iftex
-@xref{MS-DOS and MULE,,,emacs-extra,Specialized Emacs Features}.
-@end iftex
-@ifnottex
-@xref{MS-DOS and MULE}.
-@end ifnottex
-}.
+@key{RET}}.
In addition to converting various representations of non-@acronym{ASCII}
characters, a coding system can perform end-of-line conversion. Emacs
@key{RET} X} (@code{set-next-selection-coding-system}) specifies the
coding system for the next selection made in Emacs or read by Emacs.
+@vindex x-select-request-type
+ The variable @code{x-select-request-type} specifies the data type to
+request from the X Window System for receiving text selections from
+other applications. If the value is @code{nil} (the default), Emacs
+tries @code{COMPOUND_TEXT} and @code{UTF8_STRING}, in this order, and
+uses various heuristics to choose the more appropriate of the two
+results; if neither request succeeds, Emacs falls back on @code{STRING}.
+If the value of @code{x-select-request-type} is one of the symbols
+@code{COMPOUND_TEXT}, @code{UTF8_STRING}, @code{STRING}, or
+@code{TEXT}, Emacs uses only that request type. If the value is a
+list of some of these symbols, Emacs tries only the request types in
+the list, in order, until one of them succeeds, or until the list is
+exhausted.
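+
+  For instance, the following line, placed in an init file, is a
+minimal sketch of the list form described above; the particular
+ordering is only an illustration, not a recommendation:
+
+@lisp
+;; Illustrative choice: try UTF8_STRING first, then COMPOUND_TEXT.
+(setq x-select-request-type '(UTF8_STRING COMPOUND_TEXT))
+@end lisp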
+
@kindex C-x RET p
@findex set-buffer-process-coding-system
The command @kbd{C-x @key{RET} p} (@code{set-buffer-process-coding-system})
specified above, whose value is nonempty is the one that determines
the text representation.)
+@vindex x-select-request-type
+ The variable @code{x-select-request-type} specifies the selection
+data type to request from the X server. The default value is
+@code{nil}, which means Emacs tries @code{COMPOUND_TEXT} and
+@code{UTF8_STRING}, and uses whichever result seems more appropriate.
+You can explicitly specify the data type by setting the variable to
+one of the symbols @code{COMPOUND_TEXT}, @code{UTF8_STRING},
+@code{STRING} and @code{TEXT}.
+
@node File Name Coding
@section Coding Systems for File Names
A font typically defines shapes for a single alphabet or script.
Therefore, displaying the entire range of scripts that Emacs supports
requires a collection of many fonts. In Emacs, such a collection is
-called a @dfn{fontset}. A fontset is defined by a list of fonts, each
-assigned to handle a range of character codes.
+called a @dfn{fontset}. A fontset is defined by a list of font specs,
+each assigned to handle a range of character codes, and may fall back
+on another fontset for characters which are not covered by the fonts
+it specifies.
Each fontset has a name, like a font. However, while fonts are
stored in the system and the available font names are defined by the
installation instructions have information on additional font
support.}
- Emacs creates two fontsets automatically: the @dfn{standard fontset}
-and the @dfn{startup fontset}. The standard fontset is most likely to
-have fonts for a wide variety of non-@acronym{ASCII} characters;
-however, this is not the default for Emacs to use. (By default, Emacs
-tries to find a font that has bold and italic variants.) You can
+ Emacs creates three fontsets automatically: the @dfn{standard
+fontset}, the @dfn{startup fontset} and the @dfn{default fontset}.
+The default fontset is most likely to have fonts for a wide variety of
+non-@acronym{ASCII} characters. It is the default fallback for the
+other two fontsets, and also when you set a default font rather than a
+fontset. However, it does not specify font family names, so results
+can be somewhat unpredictable if you use it directly. The standard
+fontset merely falls back on the default fontset without defining any
+modifications of its own, and is defined for backwards compatibility. You can
specify use of the standard fontset with the @samp{-fn} option. For
example,
@section Defining fontsets
@vindex standard-fontset-spec
+@vindex w32-standard-fontset-spec
+@vindex ns-standard-fontset-spec
@cindex standard fontset
- Emacs creates a standard fontset automatically according to the value
+ When running on X, Emacs creates a standard fontset automatically
+according to the value
of @code{standard-fontset-spec}. This fontset's name is
@example
@noindent
or just @samp{fontset-standard} for short.
+ On GNUstep and Mac, @samp{fontset-standard} is created using the
+value of @code{ns-standard-fontset-spec}, and on MS-Windows it is
+created using the value of @code{w32-standard-fontset-spec}.
+
Bold, italic, and bold-italic variants of the standard fontset are
created automatically. Their names have @samp{bold} instead of
@samp{medium}, or @samp{i} instead of @samp{r}, or both.
@cindex startup fontset
- If you specify a default @acronym{ASCII} font with the @samp{Font} resource or
-the @samp{-fn} argument, Emacs generates a fontset from it
-automatically. This is the @dfn{startup fontset} and its name is
-@code{fontset-startup}. It does this by replacing the @var{foundry},
-@var{family}, @var{add_style}, and @var{average_width} fields of the
-font name with @samp{*}, replacing @var{charset_registry} field with
-@samp{fontset}, and replacing @var{charset_encoding} field with
-@samp{startup}, then using the resulting string to specify a fontset.
+ Emacs generates a fontset automatically, based on any default
+@acronym{ASCII} font that you specify with the @samp{Font} resource or
+the @samp{-fn} argument, or the default font that Emacs found when it
+started. This is the @dfn{startup fontset} and its name is
+@code{fontset-startup}. Emacs builds its name by replacing the
+@var{charset_registry} field of the font name with @samp{fontset}, and
+replacing the @var{charset_encoding} field with @samp{startup}, then
+using the resulting string to specify a fontset.
For instance, if you start Emacs this way,
window frame:
@example
--*-*-medium-r-normal-*-14-140-*-*-*-*-fontset-startup
+-*-courier-medium-r-normal-*-14-140-*-*-*-*-fontset-startup
@end example
+ The startup fontset will use the font that you specify, or a variant
+with a different registry and encoding, for all the characters which
+are supported by that font, and fall back on @samp{fontset-default}
+for other characters.
+
With the X resource @samp{Emacs.Font}, you can specify a fontset name
just like an actual font name. But be careful not to specify a fontset
name in a wildcard resource like @samp{Emacs*Font}---that wildcard
@xref{Font X}, for more information about font naming in X.
+@node Modifying Fontsets
+@section Modifying Fontsets
+@cindex fontsets, modifying
+@findex set-fontset-font
+
+ Fontsets do not always have to be created from scratch. If only
+minor changes are required, it may be easier to modify an existing
+fontset. Modifying @samp{fontset-default} will also affect other
+fontsets that use it as a fallback, so it can be an effective way of
+fixing problems with the fonts that Emacs chooses for a particular
+script.
+
+Fontsets can be modified using the function @code{set-fontset-font},
+specifying a character, a charset, a script, or a range of characters
+to modify the font for, and a font-spec for the font to be used. Some
+examples are:
+
+@example
+;; Use Liberation Mono for latin-3 charset.
+(set-fontset-font "fontset-default" 'iso-8859-3 "Liberation Mono")
+
+;; Prefer a big5 font for han characters
+(set-fontset-font "fontset-default" 'han (font-spec :registry "big5")
+ nil 'prepend)
+
+;; Use DejaVu Sans Mono as a fallback in fontset-startup before
+;; resorting to fontset-default.
+(set-fontset-font "fontset-startup" nil "DejaVu Sans Mono" nil 'append)
+
+;; Use MyPrivateFont for the Unicode private use area.
+(set-fontset-font "fontset-default" '(#xe000 . #xf8ff) "MyPrivateFont")
+
+@end example
+
+
@node Undisplayable Characters
@section Undisplayable Characters