-@cindex text representations
-
- Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer. These are called @dfn{unibyte} and
-@dfn{multibyte}. Each string, and each buffer, uses one of these two
-representations. For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate. Occasionally in Lisp programming you will need to pay
-attention to the difference.
+@cindex text representation
+
+ Emacs buffers and strings support a large repertoire of characters
+from many different scripts, allowing users to type and display text
+in almost any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+ To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
+extends this range with codepoints in the range @code{110000..3FFFFF},
+which it uses for representing characters that are not unified with
+Unicode and raw 8-bit bytes that cannot be interpreted as characters
+(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a
+character codepoint in Emacs is a 22-bit integer number.
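+
+ As a rough check of these limits (a sketch that assumes the
+@code{max-char} function is available in your Emacs), the largest
+codepoint Emacs accepts should equal the top of the extended range:
+
+@example
+;; #x3FFFFF is the upper end of the extended codespace.
+(= (max-char) #x3FFFFF)
+     @result{} t
+@end example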
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+ To conserve memory, Emacs does not store the codepoints of text
+characters in buffers and strings as fixed-length 22-bit numbers.
+Instead, it uses a variable-length internal representation that
+stores each character as a sequence of 1 to 5 8-bit bytes, depending
+on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}. For example, any @acronym{ASCII} character takes up only 1
+byte, a non-@acronym{ASCII} Latin-1 character takes up 2 bytes, and so
+on. We call this representation of text @dfn{multibyte}.
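+
+ These byte counts can be observed with @code{string-bytes}, which
+reports how many bytes a string occupies. The following is a minimal
+sketch; it assumes that a string literal written with a @samp{\u}
+escape is held in the multibyte representation:
+
+@example
+;; An ASCII character occupies a single byte.
+(string-bytes "a")
+     @result{} 1
+;; U+00E9, a non-ASCII Latin-1 character, occupies two bytes.
+(string-bytes "\u00e9")
+     @result{} 2
+@end example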
+
+ Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, and Big-5. Emacs converts
+between these external encodings and its internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+ Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffers or strings. For example, when
+Emacs visits a file, it first reads the file's text verbatim into a
+buffer, and only then converts it to the internal representation.
+Before the conversion, the buffer holds encoded text.