@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995, 1998, 1999, 2001,
-@c 2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.
+@c 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010
+@c Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/strings
@node Strings and Characters, Lists, Numbers, Top
* String Conversion:: Converting to and from characters and strings.
* Formatting Strings:: @code{format}: Emacs's analogue of @code{printf}.
* Case Conversion:: Case conversion functions.
-* Case Tables:: Customizing case conversion.
+* Case Tables:: Customizing case conversion.
@end menu
@node String Basics
Characters are represented in Emacs Lisp as integers;
whether an integer is a character or not is determined only by how it is
-used. Thus, strings really contain integers.
+used. Thus, strings really contain integers. @xref{Character Codes},
+for details about character representation in Emacs.
The length of a string (like any array) is fixed, and cannot be
altered once the string exists. Strings in Lisp are @emph{not}
There are two text representations for non-@acronym{ASCII} characters in
Emacs strings (and in buffers): unibyte and multibyte (@pxref{Text
-Representations}). An @acronym{ASCII} character always occupies one byte in a
-string; in fact, when a string is all @acronym{ASCII}, there is no real
-difference between the unibyte and multibyte representations.
-For most Lisp programming, you don't need to be concerned with these two
-representations.
-
- Sometimes key sequences are represented as strings. When a string is
-a key sequence, string elements in the range 128 to 255 represent meta
-characters (which are large integers) rather than character
-codes in the range 128 to 255.
-
- Strings cannot hold characters that have the hyper, super or alt
-modifiers; they can hold @acronym{ASCII} control characters, but no other
-control characters. They do not distinguish case in @acronym{ASCII} control
-characters. If you want to store such characters in a sequence, such as
-a key sequence, you must use a vector instead of a string.
-@xref{Character Type}, for more information about the representation of meta
-and other modifiers for keyboard input characters.
+Representations}). For most Lisp programming, you don't need to be
+concerned with these two representations.
+
+ Sometimes key sequences are represented as unibyte strings. When a
+unibyte string is a key sequence, string elements in the range 128 to
+255 represent meta characters (which are large integers) rather than
+character codes in the range 128 to 255. Strings cannot hold
+characters that have the hyper, super or alt modifiers; they can hold
+@acronym{ASCII} control characters, but no other control characters.
+They do not distinguish case in @acronym{ASCII} control characters.
+If you want to store such characters in a sequence, such as a key
+sequence, you must use a vector instead of a string. @xref{Character
+Type}, for more information about keyboard input characters.
Strings are useful for holding regular expressions. You can also
match regular expressions against strings with @code{string-match}
@end defun
@defun string-or-null-p object
-This function returns @code{t} if @var{object} is a string or nil,
-@code{nil} otherwise.
+This function returns @code{t} if @var{object} is a string or
+@code{nil}. It returns @code{nil} otherwise.
@end defun
@defun char-or-string-p object
@result{} ""
@end example
- Other functions to compare with this one include @code{char-to-string}
-(@pxref{String Conversion}), @code{make-vector} (@pxref{Vectors}), and
-@code{make-list} (@pxref{Building Lists}).
+ Other functions to compare with this one include @code{make-vector}
+(@pxref{Vectors}) and @code{make-list} (@pxref{Building Lists}).
@end defun
@defun string &rest characters
@end example
@noindent
-Here the index for @samp{a} is 0, the index for @samp{b} is 1, and the
-index for @samp{c} is 2. Thus, three letters, @samp{abc}, are copied
-from the string @code{"abcdefg"}. The index 3 marks the character
-position up to which the substring is copied. The character whose index
-is 3 is actually the fourth character in the string.
+In the above example, the index for @samp{a} is 0, the index for
+@samp{b} is 1, and the index for @samp{c} is 2. The index 3---that of
+the fourth character in the string---marks the character position
+up to which the substring is copied. Thus, @samp{abc} is copied from
+the string @code{"abcdefg"}.
A negative number counts from the end of the string, so that @minus{}1
signifies the index of the last character of the string. For example:
@end example
@noindent
-The @code{concat} function always constructs a new string that is
-not @code{eq} to any existing string, except when the result is empty
-(since empty strings are canonicalized to save space).
-
-In Emacs versions before 21, when an argument was an integer (not a
-sequence of integers), it was converted to a string of digits making up
-the decimal printed representation of the integer. This obsolete usage
-no longer works. The proper way to convert an integer to its decimal
-printed form is with @code{format} (@pxref{Formatting Strings}) or
-@code{number-to-string} (@pxref{String Conversion}).
+This function always constructs a new string that is not @code{eq} to
+any existing string, except when the result is the empty string (to
+save space, Emacs makes only one empty multibyte string).
For information about other concatenation functions, see the
description of @code{mapconcat} in @ref{Mapping Functions},
@code{vconcat} in @ref{Vector Functions}, and @code{append} in @ref{Building
-Lists}.
+Lists}. For concatenating individual command-line arguments into a
+string to be used as a shell command, see @ref{Shell Arguments,
+combine-and-quote-strings}.
@end defun
@defun split-string string &optional separators omit-nulls
-This function splits @var{string} into substrings at matches for the
-regular expression @var{separators}. Each match for @var{separators}
-defines a splitting point; the substrings between the splitting points
-are made into a list, which is the value returned by
-@code{split-string}.
+This function splits @var{string} into substrings based on the regular
+expression @var{separators} (@pxref{Regular Expressions}). Each match
+for @var{separators} defines a splitting point; the substrings between
+splitting points are made into a list, which is returned.
-If @var{omit-nulls} is @code{nil}, the result contains null strings
-whenever there are two consecutive matches for @var{separators}, or a
-match is adjacent to the beginning or end of @var{string}. If
-@var{omit-nulls} is @code{t}, these null strings are omitted from the
-result.
+If @var{omit-nulls} is @code{nil} (or omitted), the result contains
+null strings whenever there are two consecutive matches for
+@var{separators}, or a match is adjacent to the beginning or end of
+@var{string}. If @var{omit-nulls} is @code{t}, these null strings are
+omitted from the result.
-If @var{separators} is @code{nil} (or omitted),
-the default is the value of @code{split-string-default-separators}.
+If @var{separators} is @code{nil} (or omitted), the default is the
+value of @code{split-string-default-separators}.
As a special case, when @var{separators} is @code{nil} (or omitted),
null strings are always omitted from the result. Thus:
(split-string "ooo" "\\|o+" t)
@result{} ("o" "o" "o")
@end example
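+
+As a further sketch of the effect of @var{omit-nulls} (the input
+string here is arbitrary and chosen only for illustration):
+
+@example
+(split-string "a,b,,c" ",")
+     @result{} ("a" "b" "" "c")
+(split-string "a,b,,c" "," t)
+     @result{} ("a" "b" "c")
+@end example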
+
+If you need to split a string that is a shell command, where
+individual arguments could be quoted, see @ref{Shell Arguments,
+split-string-and-unquote}.
@end defun
@defvar split-string-default-separators
@code{equal} if and only if they contain the same sequence of
character codes and all these codes are either in the range 0 through
127 (@acronym{ASCII}) or 160 through 255 (@code{eight-bit-graphic}).
-However, when a unibyte string gets converted to a multibyte string,
-all characters with codes in the range 160 through 255 get converted
-to characters with higher codes, whereas @acronym{ASCII} characters
+However, when a unibyte string is converted to a multibyte string, all
+characters with codes in the range 160 through 255 are converted to
+characters with higher codes, whereas @acronym{ASCII} characters
remain unchanged. Thus, a unibyte string and its conversion to
multibyte are only @code{equal} if the string is all @acronym{ASCII}.
Character codes 160 through 255 are not entirely proper in multibyte
@xref{Association Lists}.
@end defun
- See also the @code{compare-buffer-substrings} function in
+ See also the function @code{compare-buffer-substrings} in
@ref{Comparing Text}, for a way to compare text in buffers. The
function @code{string-match}, which matches a regular expression
against a string, can be used for a kind of string comparison; see
@section Conversion of Characters and Strings
@cindex conversion of strings
- This section describes functions for conversions between characters,
-strings and integers. @code{format} (@pxref{Formatting Strings})
-and @code{prin1-to-string}
-(@pxref{Output Functions}) can also convert Lisp objects into strings.
-@code{read-from-string} (@pxref{Input Functions}) can ``convert'' a
-string representation of a Lisp object into an object. The functions
-@code{string-make-multibyte} and @code{string-make-unibyte} convert the
-text representation of a string (@pxref{Converting Representations}).
+ This section describes functions for converting between characters,
+strings and integers. @code{format} (@pxref{Formatting Strings}) and
+@code{prin1-to-string} (@pxref{Output Functions}) can also convert
+Lisp objects into strings. @code{read-from-string} (@pxref{Input
+Functions}) can ``convert'' a string representation of a Lisp object
+into an object. The functions @code{string-make-multibyte} and
+@code{string-make-unibyte} convert the text representation of a string
+(@pxref{Converting Representations}).
@xref{Documentation}, for functions that produce textual descriptions
of text characters and general input events
(@code{single-key-description} and @code{text-char-description}). These
are used primarily for making help messages.
-@defun char-to-string character
-@cindex character to string
-This function returns a new string containing one character,
-@var{character}. This function is semi-obsolete because the function
-@code{string} is more general. @xref{Creating Strings}.
-@end defun
-
-@defun string-to-char string
-@cindex string to character
- This function returns the first character in @var{string}. If the
-string is empty, the function returns 0. The value is also 0 when the
-first character of @var{string} is the null character, @acronym{ASCII} code
-0.
-
-@example
-(string-to-char "ABC")
- @result{} 65
-
-(string-to-char "xyz")
- @result{} 120
-(string-to-char "")
- @result{} 0
-@group
-(string-to-char "\000")
- @result{} 0
-@end group
-@end example
-
-This function may be eliminated in the future if it does not seem useful
-enough to retain.
-@end defun
-
@defun number-to-string number
@cindex integer to string
@cindex integer to decimal
@findex string-to-int
@code{string-to-int} is an obsolete alias for this function.
+@end defun
+
+@defun char-to-string character
+@cindex character to string
+This function returns a new string containing one character,
+@var{character}. This function is semi-obsolete because the function
+@code{string} is more general. @xref{Creating Strings}.
+@end defun
+
+@defun string-to-char string
+ This function returns the first character in @var{string}. This is
+mostly identical to @code{(aref string 0)}, except that it returns 0
+if the string is empty. (The value is also 0 when the first character
+of @var{string} is the null character, @acronym{ASCII} code 0.) This
+function may be eliminated in the future if it does not seem useful
+enough to retain.
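+
+The following sketch illustrates the difference from @code{aref};
+the sample strings are arbitrary:
+
+@example
+(string-to-char "ABC")
+     @result{} 65
+(string-to-char "")
+     @result{} 0
+(aref "" 0)
+     @error{} Args out of range: "", 0
+@end example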
@end defun
Here are some other functions that can convert to or from a string:
@table @code
@item concat
-@code{concat} can convert a vector or a list into a string.
+This function converts a vector or a list into a string.
@xref{Creating Strings}.
@item vconcat
-@code{vconcat} can convert a string into a vector. @xref{Vector
+This function converts a string into a vector. @xref{Vector
Functions}.
@item append
-@code{append} can convert a string into a list. @xref{Building Lists}.
+This function converts a string into a list. @xref{Building Lists}.
+
+@item byte-to-string
+This function converts a byte of character data into a unibyte string.
+@xref{Converting Representations}.
@end table
@node Formatting Strings
@cindex formatting strings
@cindex strings, formatting them
- @dfn{Formatting} means constructing a string by substitution of
-computed values at various places in a constant string. This constant string
-controls how the other values are printed, as well as where they appear;
-it is called a @dfn{format string}.
+ @dfn{Formatting} means constructing a string by substituting
+computed values at various places in a constant string. This constant
+string controls how the other values are printed, as well as where
+they appear; it is called a @dfn{format string}.
Formatting is often useful for computing messages to be displayed. In
fact, the functions @code{message} and @code{error} provide the same
@cindex field width
@cindex padding
- A specification can have a @dfn{width}, which is a signed decimal
-number between the @samp{%} and the specification character. If the
-printed representation of the object contains fewer characters than
-this width, @code{format} extends it with padding. The padding goes
-on the left if the width is positive (or starts with zero) and on the
-right if the width is negative. The padding character is normally a
-space, but it's @samp{0} if the width starts with a zero.
-
- Some of these conventions are ignored for specification characters
-for which they do not make sense. That is, @samp{%s}, @samp{%S} and
-@samp{%c} accept a width starting with 0, but still pad with
-@emph{spaces} on the left. Also, @samp{%%} accepts a width, but
-ignores it. Here are some examples of padding:
+ A specification can have a @dfn{width}, which is a decimal number
+between the @samp{%} and the specification character. If the printed
+representation of the object contains fewer characters than this
+width, @code{format} extends it with padding. The width specifier is
+ignored for the @samp{%%} specification. Any padding introduced by
+the width specifier normally consists of spaces inserted on the left:
@example
-(format "%06d is padded on the left with zeros" 123)
- @result{} "000123 is padded on the left with zeros"
-
-(format "%-6d is padded on the right" 123)
- @result{} "123 is padded on the right"
+(format "%5d is padded on the left with spaces" 123)
+ @result{} " 123 is padded on the left with spaces"
@end example
@noindent
If the width is too small, @code{format} does not truncate the
object's printed representation. Thus, you can use a width to specify
a minimum spacing between columns with no risk of losing information.
+In the following two examples, @samp{%7s} specifies a minimum width
+of 7. In the first case, the string inserted in place of @samp{%7s}
+has only 3 letters, and needs 4 blank spaces as padding. In the
+second case, the string @code{"specification"} is 13 letters wide but
+is not truncated.
- In the following three examples, @samp{%7s} specifies a minimum
-width of 7. In the first case, the string inserted in place of
-@samp{%7s} has only 3 letters, it needs 4 blank spaces as padding. In
-the second case, the string @code{"specification"} is 13 letters wide
-but is not truncated. In the third case, the padding is on the right.
-
-@smallexample
+@example
@group
(format "The word `%7s' actually has %d letters in it."
"foo" (length "foo"))
@result{} "The word ` foo' actually has 3 letters in it."
-@end group
-
-@group
(format "The word `%7s' actually has %d letters in it."
"specification" (length "specification"))
@result{} "The word `specification' actually has 13 letters in it."
@end group
+@end example
+
+@cindex flags in format specifications
+ Immediately after the @samp{%} and before the optional width
+specifier, you can also put certain @dfn{flag characters}.
+
+ The flag @samp{+} inserts a plus sign before a positive number, so
+that it always has a sign. A space character as flag inserts a space
+before a positive number. (Otherwise, positive numbers start with the
+first digit.) These flags are useful for ensuring that positive
+numbers and negative numbers use the same number of columns. They are
+ignored except for @samp{%d}, @samp{%e}, @samp{%f} and @samp{%g}; if
+both flags are used, @samp{+} takes precedence.
+
+ The flag @samp{#} specifies an ``alternate form'' which depends on
+the format in use. For @samp{%o}, it ensures that the result begins
+with a @samp{0}. For @samp{%x} and @samp{%X}, it prefixes the result
+with @samp{0x} or @samp{0X}. For @samp{%e}, @samp{%f}, and @samp{%g},
+the @samp{#} flag means include a decimal point even if the precision
+is zero.
+ The flag @samp{-} causes the padding inserted by the width
+specifier, if any, to be inserted on the right rather than the left.
+The flag @samp{0} ensures that the padding consists of @samp{0}
+characters instead of spaces, inserted on the left. These flags are
+ignored for specification characters for which they do not make sense:
+@samp{%s}, @samp{%S} and @samp{%c} accept the @samp{0} flag, but still
+pad with @emph{spaces} on the left. If both @samp{-} and @samp{0} are
+present and valid, @samp{-} takes precedence.
+
+@example
@group
+(format "%06d is padded on the left with zeros" 123)
+ @result{} "000123 is padded on the left with zeros"
+
+(format "%-6d is padded on the right" 123)
+ @result{} "123 is padded on the right"
+
(format "The word `%-7s' actually has %d letters in it."
"foo" (length "foo"))
@result{} "The word `foo ' actually has 3 letters in it."
@end group
-@end smallexample
+@end example
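+
+ As a brief sketch of the @samp{+}, space and @samp{#} flags (the
+argument values are arbitrary):
+
+@example
+(format "%+d and % d" 5 5)
+     @result{} "+5 and  5"
+(format "%#x and %#o" 255 8)
+     @result{} "0xff and 010"
+@end example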
@cindex precision in format specifications
All the specification characters allow an optional @dfn{precision}
@var{object}. Precision has no effect for other specification
characters.
-@cindex flags in format specifications
- Immediately after the @samp{%} and before the optional width and
-precision, you can put certain ``flag'' characters.
-
- @samp{+} as a flag inserts a plus sign before a positive number, so
-that it always has a sign. A space character as flag inserts a space
-before a positive number. (Otherwise, positive numbers start with the
-first digit.) Either of these two flags ensures that positive numbers
-and negative numbers use the same number of columns. These flags are
-ignored except for @samp{%d}, @samp{%e}, @samp{%f}, @samp{%g}, and if
-both flags are used, the @samp{+} takes precedence.
-
- The flag @samp{#} specifies an ``alternate form'' which depends on
-the format in use. For @samp{%o} it ensures that the result begins
-with a @samp{0}. For @samp{%x} and @samp{%X}, it prefixes the result
-with @samp{0x} or @samp{0X}. For @samp{%e}, @samp{%f}, and @samp{%g},
-the @samp{#} flag means include a decimal point even if the precision
-is zero.
-
@node Case Conversion
@comment node-name, next, previous, up
@section Case Conversion in Lisp
@acronym{ASCII} codes 88 and 120 respectively.
@defun downcase string-or-char
-This function converts a character or a string to lower case.
+This function converts @var{string-or-char}, which should be either a
+character or a string, to lower case.
-When the argument to @code{downcase} is a string, the function creates
-and returns a new string in which each letter in the argument that is
-upper case is converted to lower case. When the argument to
-@code{downcase} is a character, @code{downcase} returns the
-corresponding lower case character. This value is an integer. If the
-original character is lower case, or is not a letter, then the value
-equals the original character.
+When @var{string-or-char} is a string, this function returns a new
+string in which each letter in the argument that is upper case is
+converted to lower case. When @var{string-or-char} is a character,
+this function returns the corresponding lower case character (an
+integer); if the original character is lower case, or is not a letter,
+the return value is equal to the original character.
@example
(downcase "The cat in the hat")
@end defun
@defun upcase string-or-char
-This function converts a character or a string to upper case.
-
-When the argument to @code{upcase} is a string, the function creates
-and returns a new string in which each letter in the argument that is
-lower case is converted to upper case.
+This function converts @var{string-or-char}, which should be either a
+character or a string, to upper case.
-When the argument to @code{upcase} is a character, @code{upcase}
-returns the corresponding upper case character. This value is an integer.
-If the original character is upper case, or is not a letter, then the
-value returned equals the original character.
+When @var{string-or-char} is a string, this function returns a new
+string in which each letter in the argument that is lower case is
+converted to upper case. When @var{string-or-char} is a character,
+this function returns the corresponding upper case character (an
+integer); if the original character is upper case, or is not a letter,
+the return value is equal to the original character.
@example
(upcase "The cat in the hat")
@defun capitalize string-or-char
@cindex capitalization
This function capitalizes strings or characters. If
-@var{string-or-char} is a string, the function creates and returns a new
-string, whose contents are a copy of @var{string-or-char} in which each
-word has been capitalized. This means that the first character of each
+@var{string-or-char} is a string, the function returns a new string
+whose contents are a copy of @var{string-or-char} in which each word
+has been capitalized. This means that the first character of each
word is converted to upper case, and the rest are converted to lower
case.
are assigned to the word constituent syntax class in the current syntax
table (@pxref{Syntax Class Table}).
-When the argument to @code{capitalize} is a character, @code{capitalize}
-has the same result as @code{upcase}.
+When @var{string-or-char} is a character, this function does the same
+thing as @code{upcase}.
@example
@group
@samp{A} and @samp{A} into @samp{a}, and likewise for each set of
equivalent characters.)
- When you construct a case table, you can provide @code{nil} for
+ When constructing a case table, you can provide @code{nil} for
@var{canonicalize}; then Emacs fills in this slot from the lower case
and upper case mappings. You can also provide @code{nil} for
@var{equivalences}; then Emacs fills in this slot from
@var{canonicalize}. In a case table that is actually in use, those
-components are non-@code{nil}. Do not try to specify @var{equivalences}
-without also specifying @var{canonicalize}.
+components are non-@code{nil}. Do not try to specify
+@var{equivalences} without also specifying @var{canonicalize}.
Here are the functions for working with case tables:
Exits}).
@end defmac
- Some language environments may modify the case conversions of
+ Some language environments modify the case conversions of
@acronym{ASCII} characters; for example, in the Turkish language
environment, the @acronym{ASCII} character @samp{I} is downcased into
a Turkish ``dotless i''. This can interfere with code that requires