+@c -*- coding: utf-8 -*-
@c This is part of the Emacs manual.
-@c Copyright (C) 1997, 1999-2013 Free Software Foundation, Inc.
+@c Copyright (C) 1997, 1999-2016 Free Software Foundation, Inc.
@c See file emacs.texi for copying conditions.
@node International
@chapter International Character Set Support
@c This node is referenced in the tutorial. When renaming or deleting
@c it, the tutorial needs to be adjusted. (TUTORIAL.de)
-@cindex MULE
@cindex international scripts
@cindex multibyte characters
@cindex encoding of characters
-@cindex Celtic
+@cindex Arabic
+@cindex Bengali
@cindex Chinese
@cindex Cyrillic
-@cindex Czech
-@cindex Devanagari
+@cindex Han
@cindex Hindi
-@cindex Marathi
@cindex Ethiopic
-@cindex German
+@cindex Georgian
@cindex Greek
+@cindex Hangul
@cindex Hebrew
@cindex IPA
@cindex Japanese
@cindex Korean
-@cindex Lao
@cindex Latin
-@cindex Polish
-@cindex Romanian
-@cindex Slovak
-@cindex Slovenian
@cindex Thai
-@cindex Tibetan
-@cindex Turkish
@cindex Vietnamese
-@cindex Dutch
-@cindex Spanish
Emacs supports a wide variety of international character sets,
including European and Vietnamese variants of the Latin alphabet, as
-well as Cyrillic, Devanagari (for Hindi and Marathi), Ethiopic, Greek,
-Han (for Chinese and Japanese), Hangul (for Korean), Hebrew, IPA,
-Kannada, Lao, Malayalam, Tamil, Thai, Tibetan, and Vietnamese scripts.
+well as Arabic scripts, Brahmic scripts (for languages such as
+Bengali, Hindi, and Thai), Cyrillic, Ethiopic, Georgian, Greek, Han
+(for Chinese and Japanese), Hangul (for Korean), Hebrew, and IPA@.
Emacs also supports various encodings of these characters that are used by
other internationalized software, such as word processors and mailers.
@menu
* International Chars:: Basic concepts of multibyte characters.
-* Disabling Multibyte:: Controlling whether to use multibyte characters.
* Language Environments:: Setting things up for the language you use.
* Input Methods:: Entering text characters not on your keyboard.
* Select Input Method:: Specifying your choice of input methods.
Keyboards, even in the countries where these character sets are
used, generally don't have keys for all the characters in them. You
can insert characters that your keyboard does not support, using
-@kbd{C-q} (@code{quoted-insert}) or @kbd{C-x 8 @key{RET}}
-(@code{insert-char}). @xref{Inserting Text}. Emacs also supports
+@kbd{C-x 8 @key{RET}} (@code{insert-char}). @xref{Inserting Text}.
+Shorthands are available for some common characters; for example, you
+can insert a left single quotation mark @t{‘} by typing @kbd{C-x 8
+[}, or in Electric Quote mode often by simply typing @kbd{`}.
+@xref{Quotation Marks}. Emacs also supports
various @dfn{input methods}, typically one for each script or
language, which make it easier to type characters in the script.
@xref{Input Methods}.
one byte, Emacs shows @samp{file ...}.
As a special case, if the character lies in the range 128 (0200
-octal) through 159 (0237 octal), it stands for a ``raw'' byte that
+octal) through 159 (0237 octal), it stands for a raw byte that
does not correspond to any specific displayable character. Such a
-``character'' lies within the @code{eight-bit-control} character set,
+character lies within the @code{eight-bit-control} character set,
and is displayed as an escaped octal character code. In this case,
@kbd{C-x =} shows @samp{part of display ...} instead of @samp{file}.
as belonging to the @code{ascii} character set.
@item
-The character's syntax and categories.
-
-@item
-The character's encodings, both internally in the buffer, and externally
-if you were to save the file.
+The character's script, syntax and categories.
@item
What keys to type to input the character in the current input method
(if it supports the character).
+@item
+The character's encodings, both internally in the buffer, and externally
+if you were to save the file.
+
@item
If you are running Emacs on a graphical display, the font name and
glyph code for the character. If you are running Emacs on a text
(@pxref{Overlays,,, elisp, the same manual}).
@end itemize
- Here's an example showing the Latin-1 character A with grave accent,
-in a buffer whose coding system is @code{utf-8-unix}:
+ Here's an example, with some lines folded to fit into this manual:
@smallexample
position: 1 of 1 (0%), column: 0
- character: @`A (displayed as @`A) (codepoint 192, #o300, #xc0)
+ character: ê (displayed as ê) (codepoint 234, #o352, #xea)
preferred charset: unicode (Unicode (ISO10646))
-code point in charset: 0xC0
- syntax: w which means: word
- category: .:Base, L:Left-to-right (strong),
+code point in charset: 0xEA
+ script: latin
+ syntax: w which means: word
+ category: .:Base, L:Left-to-right (strong), c:Chinese,
j:Japanese, l:Latin, v:Viet
- buffer code: #xC3 #x80
- file code: not encodable by coding system undecided-unix
+ to input: type "C-x 8 RET ea" or
+ "C-x 8 RET LATIN SMALL LETTER E WITH CIRCUMFLEX"
+ buffer code: #xC3 #xAA
+ file code: #xC3 #xAA (encoded by coding system utf-8-unix)
display: by this font (glyph code)
- xft:-unknown-DejaVu Sans Mono-normal-normal-
- normal-*-13-*-*-*-m-0-iso10646-1 (#x82)
+ xft:-PfEd-DejaVu Sans Mono-normal-normal-
+ normal-*-15-*-*-*-m-0-iso10646-1 (#xAC)
Character code properties: customize what to show
- name: LATIN CAPITAL LETTER A WITH GRAVE
- old-name: LATIN CAPITAL LETTER A GRAVE
- general-category: Lu (Letter, Uppercase)
- decomposition: (65 768) ('A' '`')
+ name: LATIN SMALL LETTER E WITH CIRCUMFLEX
+ old-name: LATIN SMALL LETTER E CIRCUMFLEX
+ general-category: Ll (Letter, Lowercase)
+ decomposition: (101 770) ('e' '^')
@end smallexample
-@c FIXME? Does this section even belong in the user manual?
-@c Seems more appropriate to the lispref?
-@node Disabling Multibyte
-@section Disabling Multibyte Characters
-
- By default, Emacs starts in multibyte mode: it stores the contents
-of buffers and strings using an internal encoding that represents
-non-@acronym{ASCII} characters using multi-byte sequences. Multibyte
-mode allows you to use all the supported languages and scripts without
-limitations.
-
-@cindex turn multibyte support on or off
- Under very special circumstances, you may want to disable multibyte
-character support, for a specific buffer.
-When multibyte characters are disabled in a buffer, we call
-that @dfn{unibyte mode}. In unibyte mode, each character in the
-buffer has a character code ranging from 0 through 255 (0377 octal); 0
-through 127 (0177 octal) represent @acronym{ASCII} characters, and 128
-(0200 octal) through 255 (0377 octal) represent non-@acronym{ASCII}
-characters.
-
- To edit a particular file in unibyte representation, visit it using
-@code{find-file-literally}. @xref{Visiting}. You can convert a
-multibyte buffer to unibyte by saving it to a file, killing the
-buffer, and visiting the file again with @code{find-file-literally}.
-Alternatively, you can use @kbd{C-x @key{RET} c}
-(@code{universal-coding-system-argument}) and specify @samp{raw-text}
-as the coding system with which to visit or save a file. @xref{Text
-Coding}. Unlike @code{find-file-literally}, finding a file as
-@samp{raw-text} doesn't disable format conversion, uncompression, or
-auto mode selection.
-
-@c Not a single file in Emacs uses this feature. Is it really worth
-@c mentioning in the _user_ manual? Also, this duplicates somewhat
-@c "Loading Non-ASCII" from the lispref.
-@cindex Lisp files, and multibyte operation
-@cindex multibyte operation, and Lisp files
-@cindex unibyte operation, and Lisp files
-@cindex init file, and non-@acronym{ASCII} characters
- Emacs normally loads Lisp files as multibyte.
-This includes the Emacs initialization
-file, @file{.emacs}, and the initialization files of packages
-such as Gnus. However, you can specify unibyte loading for a
-particular Lisp file, by adding an entry @samp{coding: raw-text} in a file
-local variables section. @xref{Specify Coding}.
-Then that file is always loaded as unibyte text.
-@ignore
-@c I don't see the point of this statement:
-The motivation for these conventions is that it is more reliable to
-always load any particular Lisp file in the same way.
-@end ignore
-You can also load a Lisp file as unibyte, on any one occasion, by
-typing @kbd{C-x @key{RET} c raw-text @key{RET}} immediately before
-loading it.
-
-@c See http://debbugs.gnu.org/11226 for lack of unibyte tooltip.
-@vindex enable-multibyte-characters
-The buffer-local variable @code{enable-multibyte-characters} is
-non-@code{nil} in multibyte buffers, and @code{nil} in unibyte ones.
-The mode line also indicates whether a buffer is multibyte or not.
-@xref{Mode Line}. With a graphical display, in a multibyte buffer,
-the portion of the mode line that indicates the character set has a
-tooltip that (amongst other things) says that the buffer is multibyte.
-In a unibyte buffer, the character set indicator is absent. Thus, in
-a unibyte buffer (when using a graphical display) there is normally
-nothing before the indication of the visited file's end-of-line
-convention (colon, backslash, etc.), unless you are using an input
-method.
-
-@findex toggle-enable-multibyte-characters
-You can turn off multibyte support in a specific buffer by invoking the
-command @code{toggle-enable-multibyte-characters} in that buffer.
-
@node Language Environments
@section Language Environments
@cindex language environments
@code{current-language-environment} or use the command @kbd{M-x
set-language-environment}. It makes no difference which buffer is
current when you use this command, because the effects apply globally
-to the Emacs session. The supported language environments
-(see the variable @code{language-info-alist}) include:
-
-@cindex Euro sign
-@cindex UTF-8
+to the Emacs session. See the variable @code{language-info-alist} for
+the list of supported language environments, and use the command
+@kbd{C-h L @var{lang-env} @key{RET}} (@code{describe-language-environment})
+for more information about the language environment @var{lang-env}.
+Supported language environments include:
+
+@c @cindex entries below are split between portions of the list to
+@c make them more accurate, i.e., land on the line that mentions the
+@c language. However, makeinfo 4.x doesn't fill inside @quotation
+@c lines that follow a @cindex entry and whose text has no whitespace.
+@c To work around, we group the language environments together, so
+@c that the blank that separates them triggers refill.
@quotation
-ASCII, Belarusian, Bengali, Brazilian Portuguese, Bulgarian, Cham,
-Chinese-BIG5, Chinese-CNS, Chinese-EUC-TW, Chinese-GB, Chinese-GBK,
-Chinese-GB18030, Croatian, Cyrillic-ALT, Cyrillic-ISO, Cyrillic-KOI8,
-Czech, Devanagari, Dutch, English, Esperanto, Ethiopic, French,
-Georgian, German, Greek, Gujarati, Hebrew, IPA, Italian, Japanese,
-Kannada, Khmer, Korean, Lao, Latin-1, Latin-2, Latin-3, Latin-4,
-Latin-5, Latin-6, Latin-7, Latin-8 (Celtic), Latin-9 (updated Latin-1
-with the Euro sign), Latvian, Lithuanian, Malayalam, Oriya, Polish,
-Punjabi, Romanian, Russian, Sinhala, Slovak, Slovenian, Spanish,
-Swedish, TaiViet, Tajik, Tamil, Telugu, Thai, Tibetan, Turkish, UTF-8
-(for a setup which prefers Unicode characters and files encoded in
-UTF-8), Ukrainian, Vietnamese, Welsh, and Windows-1255 (for a setup
-which prefers Cyrillic characters and files encoded in Windows-1255).
+@cindex ASCII
+@cindex Arabic
+ASCII, Arabic,
+@cindex Belarusian
+@cindex Bengali
+Belarusian, Bengali,
+@cindex Brazilian Portuguese
+@cindex Bulgarian
+Brazilian Portuguese, Bulgarian,
+@cindex Burmese
+@cindex Cham
+Burmese, Cham,
+@cindex Chinese
+Chinese-BIG5, Chinese-CNS, Chinese-EUC-TW, Chinese-GB,
+Chinese-GB18030, Chinese-GBK,
+@cindex Croatian
+@cindex Cyrillic
+Croatian, Cyrillic-ALT, Cyrillic-ISO, Cyrillic-KOI8,
+@cindex Czech
+@cindex Devanagari
+Czech, Devanagari,
+@cindex Dutch
+@cindex English
+Dutch, English,
+@cindex Esperanto
+@cindex Ethiopic
+Esperanto, Ethiopic,
+@cindex French
+@cindex Georgian
+French, Georgian,
+@cindex German
+@cindex Greek
+@cindex Gujarati
+German, Greek, Gujarati,
+@cindex Hebrew
+@cindex IPA
+Hebrew, IPA,
+@cindex Italian
+Italian,
+@cindex Japanese
+@cindex Kannada
+Japanese, Kannada,
+@cindex Khmer
+@cindex Korean
+@cindex Lao
+Khmer, Korean, Lao,
+@cindex Latin
+Latin-1, Latin-2, Latin-3, Latin-4, Latin-5, Latin-6, Latin-7,
+Latin-8, Latin-9,
+@cindex Latvian
+@cindex Lithuanian
+Latvian, Lithuanian,
+@cindex Malayalam
+@cindex Oriya
+Malayalam, Oriya,
+@cindex Persian
+@cindex Polish
+Persian, Polish,
+@cindex Punjabi
+@cindex Romanian
+Punjabi, Romanian,
+@cindex Russian
+@cindex Sinhala
+Russian, Sinhala,
+@cindex Slovak
+@cindex Slovenian
+@cindex Spanish
+Slovak, Slovenian, Spanish,
+@cindex Swedish
+@cindex TaiViet
+Swedish, TaiViet,
+@cindex Tajik
+@cindex Tamil
+Tajik, Tamil,
+@cindex Telugu
+@cindex Thai
+Telugu, Thai,
+@cindex Tibetan
+@cindex Turkish
+Tibetan, Turkish,
+@cindex UTF-8
+@cindex Ukrainian
+UTF-8, Ukrainian,
+@cindex Vietnamese
+@cindex Welsh
+Vietnamese, Welsh,
+@cindex Windows-1255
+and Windows-1255.
@end quotation
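+
+  As a minimal sketch, you could select one of these environments
+from your init file.  The choice of @samp{UTF-8} below is only an
+assumption for illustration; substitute any name from the list above.
+
+@example
+;; Select a language environment by name; the effect is global
+;; to the Emacs session.
+(set-language-environment "UTF-8")
+@end example
+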
To display the script(s) used by your language environment on a
also shows some sample text to illustrate scripts used in this
language environment. If you give an empty input for @var{lang-env},
this command describes the chosen language environment.
-@anchor{Describe Language Environment}
@vindex set-language-environment-hook
You can customize any language environment with the normal hook
To find out how to input the character after point using the current
input method, type @kbd{C-u C-x =}. @xref{Position Info}.
+@c TODO: document complex-only/default/t of
+@c @code{input-method-verbose-flag}
@vindex input-method-verbose-flag
@vindex input-method-highlight-flag
The variables @code{input-method-highlight-flag} and
@end lisp
@noindent
-This automatically activates the input method ``german-prefix'' in
+This automatically activates the input method @code{german-prefix} in
Text mode.
@findex quail-set-keyboard-layout
In addition to converting various representations of non-@acronym{ASCII}
characters, a coding system can perform end-of-line conversion. Emacs
handles three different conventions for how to separate lines in a file:
-newline (``unix''), carriage-return linefeed (``dos''), and just
-carriage-return (``mac'').
+newline (Unix), carriage-return linefeed (DOS), and just
+carriage-return (Mac).
@table @kbd
@item C-h C @var{coding} @key{RET}
Unlike the previous two, this variable does not override any
@samp{-*-coding:-*-} tag.
-@c FIXME? This seems somewhat out of place. Move to the Rmail section?
-@vindex rmail-file-coding-system
- When you get new mail in Rmail, each message is translated
-automatically from the coding system it is written in, as if it were a
-separate file. This uses the priority list of coding systems that you
-have specified. If a MIME message specifies a character set, Rmail
-obeys that specification. For reading and saving Rmail files
-themselves, Emacs uses the coding system specified by the variable
-@code{rmail-file-coding-system}. The default value is @code{nil},
-which means that Rmail files are not translated (they are read and
-written in the Emacs internal character code).
-
@node Specify Coding
@section Specifying a File's Coding System
to use when encoding and decoding system strings such as system error
messages and @code{format-time-string} formats and time stamps. That
coding system is also used for decoding non-@acronym{ASCII} keyboard
-input on the X Window System. You should choose a coding system that is compatible
+input on the X Window System and for encoding text sent to the
+standard output and error streams when in batch mode. You should
+choose a coding system that is compatible
with the underlying system's text representation, which is normally
specified by one of the environment variables @env{LC_ALL},
@env{LC_CTYPE}, and @env{LANG}. (The first one, in the order
file names are not encoded specially; they appear in the file system
using the internal Emacs representation.
+@cindex file-name encoding, MS-Windows
+@vindex w32-unicode-filenames
+ When Emacs runs on MS-Windows versions that are descendants of the
+NT family (Windows 2000, XP, Vista, Windows 7, and Windows 8), the
+value of @code{file-name-coding-system} is largely ignored, as Emacs
+by default uses APIs that allow passing Unicode file names directly.
+By contrast, on Windows 9X, file names are encoded using
+@code{file-name-coding-system}, which should be set to the codepage
+(@pxref{Coding Systems, codepage}) pertinent for the current system
+locale. The value of the variable @code{w32-unicode-filenames}
+controls whether Emacs uses the Unicode APIs when it calls OS
+functions that accept file names. This variable is set by the startup
+code to @code{nil} on Windows 9X, and to @code{t} on newer versions of
+MS-Windows.
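+
+  To see which mode is in effect, a sketch along the following lines
+can be evaluated.  The @code{boundp} guard is an assumption added for
+illustration, so that the form is harmless on builds that do not
+define the variable.
+
+@example
+;; Report whether Emacs uses the Unicode file-name APIs.
+(if (boundp 'w32-unicode-filenames)
+    (message "Unicode file-name APIs: %s" w32-unicode-filenames)
+  (message "w32-unicode-filenames is not defined in this build"))
+@end example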
+
@strong{Warning:} if you change @code{file-name-coding-system} (or the
language environment) in the middle of an Emacs session, problems can
result if you have already visited files whose names were encoded using
@end example
+@cindex ignore font
+@cindex fonts, how to ignore
+@vindex face-ignored-fonts
+ Some fonts installed on your system might be broken, or produce
+unpleasant results for characters for which they are used, and you may
+wish to instruct Emacs to completely ignore them while searching for a
+suitable font required to display a character. You can do that by
+adding the offending fonts to the value of the @code{face-ignored-fonts}
+variable, which is a list. Here's an example to put in your
+@file{~/.emacs}:
+
+@example
+(add-to-list 'face-ignored-fonts "Some Bad Font")
+@end example
@node Undisplayable Characters
@section Undisplayable Characters
accented letters and punctuation needed by various European languages
(and some non-European ones). Note that Emacs considers bytes with
codes in this range as raw bytes, not as characters, even in a unibyte
-buffer, i.e., if you disable multibyte characters. However, Emacs
-can still handle these character codes as if they belonged to
-@emph{one} of the single-byte character sets at a time. To specify
-@emph{which} of these codes to use, invoke @kbd{M-x
-set-language-environment} and specify a suitable language environment
-such as @samp{Latin-@var{n}}.
-
- For more information about unibyte operation, see
-@ref{Disabling Multibyte}.
+buffer, i.e., if you disable multibyte characters. However, Emacs can
+still handle these character codes as if they belonged to @emph{one}
+of the single-byte character sets at a time. To specify @emph{which}
+of these codes to use, invoke @kbd{M-x set-language-environment} and
+specify a suitable language environment such as @samp{Latin-@var{n}}.
+@xref{Disabling Multibyte, , Disabling Multibyte Characters, elisp,
+GNU Emacs Lisp Reference Manual}.
@vindex unibyte-display-via-language-environment
Emacs can also display bytes in the range 160 to 255 as readable
@cindex 8-bit display
Normally non-ISO-8859 characters (decimal codes between 128 and 159
inclusive) are displayed as octal escapes. You can change this for
-non-standard ``extended'' versions of ISO-8859 character sets by using the
+non-standard extended versions of ISO-8859 character sets by using the
function @code{standard-display-8bit} in the @code{disp-table} library.
There are two ways to input single-byte non-@acronym{ASCII}
should use the command @code{M-x set-keyboard-coding-system} or customize the
variable @code{keyboard-coding-system} to specify which coding system
your keyboard uses (@pxref{Terminal Coding}). Enabling this feature
-will probably require you to use @kbd{ESC} to type Meta characters;
+will probably require you to use @key{ESC} to type Meta characters;
however, on a console terminal or in @code{xterm}, you can arrange for
-Meta to be converted to @kbd{ESC} and still be able type 8-bit
-characters present directly on the keyboard or using @kbd{Compose} or
-@kbd{AltGr} keys. @xref{User Input}.
+Meta to be converted to @key{ESC} and still be able to type 8-bit
+characters present directly on the keyboard or using @key{Compose} or
+@key{AltGr} keys. @xref{User Input}.
@kindex C-x 8
@cindex @code{iso-transl} library
@cindex compose character
@cindex dead character
@item
-For Latin-1 only, you can use the key @kbd{C-x 8} as a ``compose
-character'' prefix for entry of non-@acronym{ASCII} Latin-1 printing
+You can use the key @kbd{C-x 8} as a compose-character prefix for
+entry of non-@acronym{ASCII} Latin-1 and a few other printing
characters. @kbd{C-x 8} is good for insertion (in the minibuffer as
well as other buffers), for searching, and in any other context where
a key sequence is allowed.
@kbd{C-x 8} works by loading the @code{iso-transl} library. Once that
-library is loaded, the @key{ALT} modifier key, if the keyboard has
-one, serves the same purpose as @kbd{C-x 8}: use @key{ALT} together
+library is loaded, the @key{Alt} modifier key, if the keyboard has
+one, serves the same purpose as @kbd{C-x 8}: use @key{Alt} together
with an accent character to modify the following letter. In addition,
-if the keyboard has keys for the Latin-1 ``dead accent characters'',
+if the keyboard has keys for the Latin-1 dead accent characters,
they too are defined to compose with the following character, once
@code{iso-transl} is loaded.
@code{unicode-bmp}, and @code{eight-bit}). All supported characters
belong to one or more charsets.
- Emacs normally ``does the right thing'' with respect to charsets, so
+ Emacs normally does the right thing with respect to charsets, so
that you don't have to worry about them. However, it is sometimes
helpful to know some of the underlying details about charsets.
One example is font selection (@pxref{Fonts}). Each language
-environment (@pxref{Language Environments}) defines a ``priority
-list'' for the various charsets. When searching for a font, Emacs
+environment (@pxref{Language Environments}) defines a priority
+list for the various charsets. When searching for a font, Emacs
initially attempts to find one that can display the highest-priority
charsets. For instance, in the Japanese language environment, the
charset @code{japanese-jisx0208} has the highest priority, so Emacs
@findex list-character-sets
@kbd{M-x list-character-sets} displays a list of all supported
charsets. The list gives the names of charsets and additional
-information to identity each charset; see the
-@url{http://www.itscj.ipsj.or.jp/ISO-IR/, International Register of
-Coded Character Sets} for more details. In this list,
+information to identify each charset; for more details, see the
+@url{https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf,
+ISO International Register of Coded Character Sets to be Used with
+Escape Sequences (ISO-IR)} maintained by
+the @url{https://www.itscj.ipsj.or.jp/itscj_english/,
+Information Processing Society of Japan/Information Technology
+Standards Commission of Japan (IPSJ/ITSCJ)}. In this list,
charsets are divided into two categories: @dfn{normal charsets} are
listed first, followed by @dfn{supplementary charsets}. A
supplementary charset is one that is used to define another charset
The special character @code{RIGHT-TO-LEFT MARK}, or @sc{rlm}, forces
the right-to-left direction on the following paragraph, while
@code{LEFT-TO-RIGHT MARK}, or @sc{lrm} forces the left-to-right
-direction. (You can use @kbd{C-x 8 RET} to insert these characters.)
+direction. (You can use @kbd{C-x 8 @key{RET}} to insert these characters.)
In a GUI session, the @sc{lrm} and @sc{rlm} characters display as very
thin blank characters; on text terminals they display as blanks.
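+
+  As an illustrative sketch, the same characters can also be inserted
+from Lisp by code point: U+200F is @sc{rlm} and U+200E is @sc{lrm}.
+Calling @code{insert-char} directly, as here, is just one possible way
+to do this.
+
+@example
+(insert-char #x200F)  ; RIGHT-TO-LEFT MARK
+(insert-char #x200E)  ; LEFT-TO-RIGHT MARK
+@end example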