@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995, 1998, 1999, 2002, 2003,
-@c 2004, 2005 Free Software Foundation, Inc.
+@c 2004, 2005, 2006 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../info/searching
@node Searching and Matching, Syntax Tables, Non-ASCII Characters, Top
@end menu
The @samp{skip-chars@dots{}} functions also perform a kind of searching.
-@xref{Skipping Characters}.
+@xref{Skipping Characters}. To search for changes in character
+properties, see @ref{Property Search}.
@node String Search
@section Searching for Strings
return the new position of point in that case, but some existing
programs may depend on a value of @code{nil}.)
+The argument @var{noerror} only affects valid searches which fail to
+find a match. Invalid arguments cause errors regardless of
+@var{noerror}.
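+
+As a small sketch of this distinction (the buffer contents and search
+arguments below are illustrative only), a valid search that fails
+returns @code{nil} when @var{noerror} is @code{t}, whereas a
+non-string argument still signals an error:
+
+@example
+@group
+(with-temp-buffer
+  (insert "foo bar")
+  (goto-char (point-min))
+  ;; Valid search that finds nothing: no error is signaled.
+  (search-forward "baz" nil t))
+     @result{} nil
+@end group
+
+@group
+(with-temp-buffer
+  ;; Invalid argument (a number, not a string): the error is
+  ;; signaled even though noerror is t.
+  (condition-case err
+      (search-forward 42 nil t)
+    (wrong-type-argument (car err))))
+     @result{} wrong-type-argument
+@end group
+@end example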
+
If @var{repeat} is supplied (it must be a positive number), then the
search is repeated that many times (each time starting at the end of the
previous time's match). If these successive searches succeed, the
Regular expressions have a syntax in which a few characters are
special constructs and the rest are @dfn{ordinary}. An ordinary
-character is a simple regular expression that matches that character and
-nothing else. The special characters are @samp{.}, @samp{*}, @samp{+},
-@samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
-special characters will be defined in the future. Any other character
-appearing in a regular expression is ordinary, unless a @samp{\}
-precedes it.
+character is a simple regular expression that matches that character
+and nothing else. The special characters are @samp{.}, @samp{*},
+@samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
+special characters will be defined in the future. The character
+@samp{]} is special if it ends a character alternative (see later).
+The character @samp{-} is special inside a character alternative. A
+@samp{[:} and balancing @samp{:]} enclose a character class inside a
+character alternative. Any other character appearing in a regular
+expression is ordinary, unless a @samp{\} precedes it.
For example, @samp{f} is not a special character, so it is ordinary, and
therefore @samp{f} is a regular expression that matches the string
first tries to match all three @samp{a}s; but the rest of the pattern is
@samp{ar} and there is only @samp{r} left to match, so this try fails.
The next alternative is for @samp{a*} to match only two @samp{a}s. With
-this choice, the rest of the regexp matches successfully.@refill
+this choice, the rest of the regexp matches successfully.
-Nested repetition operators take a long time, or even forever, if they
+@strong{Warning:} Nested repetition operators take a long time,
+or even forever, if they
lead to ambiguous matching. For example, trying to match the regular
expression @samp{\(x+y*\)*a} against the string
@samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz} could take hours before it
@samp{x}s before concluding that none of them can work. Even worse,
@samp{\(x*\)*} can match the null string in infinitely many ways, so
it causes an infinite loop. To avoid these problems, check nested
-repetitions carefully.
+repetitions carefully, to make sure that they do not cause combinatorial
+explosions in backtracking.
@item @samp{+}
@cindex @samp{+} in regexp
beginning of the string or after a newline character.
For historical compatibility reasons, @samp{^} can be used only at the
-beginning of the regular expression, or after @samp{\(} or @samp{\|}.
+beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
+or @samp{\|}.
@item @samp{$}
@cindex @samp{$} in regexp
can act. It is poor practice to depend on this behavior; quote the
special character anyway, regardless of where it appears.@refill
+Because @samp{\} is not special inside a character alternative, it can
+never remove the special meaning of @samp{-} or @samp{]}. So you
+should not quote these characters where they have no special meaning
+either. Quoting them would not clarify anything, since backslashes can
+legitimately precede these characters where they @emph{have} special
+meaning, as in @samp{[^\]} (@code{"[^\\]"} for Lisp string syntax),
+which matches any single character except a backslash.
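+
+For instance, the following sketch (the sample strings are illustrative
+only) applies that regexp to a two-character string consisting of a
+backslash followed by @samp{a}. The match starts at index 1, because
+the backslash at index 0 is excluded:
+
+@example
+;; The Lisp string "\\a" contains two characters: \ and a.
+(string-match "[^\\]" "\\a")
+     @result{} 1
+;; Inside a character alternative, \ is an ordinary member of the set.
+(string-match "[\\]" "a\\b")
+     @result{} 1
+@end example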
+
+In practice, most @samp{]} that occur in regular expressions close a
+character alternative and hence are special. However, occasionally a
+regular expression may try to match a complex pattern of literal
+@samp{[} and @samp{]}. In such situations, it may sometimes be
+necessary to parse the regexp carefully from the start to determine
+which square brackets enclose a character alternative. For example,
+@samp{[^][]]} consists of the complemented character alternative
+@samp{[^][]} (which matches any single character that is not a square
+bracket), followed by a literal @samp{]}.
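+
+A brief sketch of how this regexp behaves, using made-up input strings:
+
+@example
+;; Index 2 holds the first non-bracket character followed by "]".
+(string-match "[^][]]" "[]a]")
+     @result{} 2
+(match-string 0 "[]a]")
+     @result{} "a]"
+;; Here no literal "]" follows a non-bracket character.
+(string-match "[^][]]" "ab")
+     @result{} nil
+@end example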
+
+The exact rules are that at the beginning of a regexp, @samp{[} is
+special and @samp{]} is not. This lasts until the first unquoted
+@samp{[}, after which we are in a character alternative; @samp{[} is
+no longer special (except when it starts a character class) but @samp{]}
+is special, unless it immediately follows the special @samp{[} or that
+@samp{[} followed by a @samp{^}. This lasts until the next special
+@samp{]} that does not end a character class. This ends the character
+alternative and restores the ordinary syntax of regular expressions;
+an unquoted @samp{[} is special again and a @samp{]} is not.
+
@node Char Classes
@subsubsection Character Classes
@cindex character classes in regexp
@table @samp
@item [:ascii:]
-This matches any @acronym{ASCII} (unibyte) character.
+This matches any @acronym{ASCII} character (codes 0--127).
@item [:alnum:]
This matches any letter or digit. (At present, for multibyte
characters, it matches anything that has word syntax.)
@item [:lower:]
This matches any lower-case letter, as determined by
the current case table (@pxref{Case Tables}).
+@item [:multibyte:]
+This matches any multibyte character (@pxref{Text Representations}).
@item [:nonascii:]
-This matches any non-@acronym{ASCII} (multibyte) character.
+This matches any non-@acronym{ASCII} character.
@item [:print:]
This matches printing characters---everything except @acronym{ASCII} control
characters and the delete character.
@item [:space:]
This matches any character that has whitespace syntax
(@pxref{Syntax Class Table}).
+@item [:unibyte:]
+This matches any unibyte character (@pxref{Text Representations}).
@item [:upper:]
This matches any upper-case letter, as determined by
the current case table (@pxref{Case Tables}).
@kindex invalid-regexp
Not every string is a valid regular expression. For example, a string
-with unbalanced square brackets is invalid (with a few exceptions, such
-as @samp{[]]}), and so is a string that ends with a single @samp{\}. If
+that ends inside a character alternative without terminating @samp{]}
+is invalid, and so is a string that ends with a single @samp{\}. If
an invalid regular expression is passed to any of the search functions,
an @code{invalid-regexp} error is signaled.
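+
+A minimal sketch of both cases, catching the error with
+@code{condition-case} (the regexps and the target string are
+illustrative only):
+
+@example
+;; Character alternative with no terminating "]".
+(condition-case err
+    (string-match "[a-z" "abc")
+  (invalid-regexp (car err)))
+     @result{} invalid-regexp
+;; Regexp that ends with a single backslash.
+(condition-case err
+    (string-match "\\" "abc")
+  (invalid-regexp (car err)))
+     @result{} invalid-regexp
+@end example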
expressions can have subexpressions---after a simple string search, the
only information available is about the entire match.
+ Every successful search sets the match data. Therefore, you should
+query the match data immediately after searching, before calling any
+other function that might perform another search. Alternatively, you
+may save and restore the match data (@pxref{Saving Match Data}) around
+the call to functions that could perform another search.
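+
+As a sketch of the second approach (the strings are illustrative only),
+@code{save-match-data} keeps an intervening search from clobbering the
+match data left by the first one:
+
+@example
+(let ((str "foo-bar"))
+  (string-match "\\(foo\\)-\\(bar\\)" str)
+  ;; Without save-match-data, this inner search would overwrite
+  ;; the match data from the search above.
+  (save-match-data
+    (string-match "x" "xyz"))
+  (match-string 2 str))
+     @result{} "bar"
+@end example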
+
A search which fails may or may not alter the match data. In the
past, a failing search did not do this, but we may change it in the
future. So don't try to rely on the value of the match data after