diff --git a/lispref/searching.texi b/lispref/searching.texi

index ec082152aad1a29740237109784aad129fcd263b..9c0d4a22af249ce18dc2f34e98085d2acadcf094 100644
--- a/lispref/searching.texi
+++ b/lispref/searching.texi
@@ -17,6 +17,7 @@ portions of it.
  * String Search::         Search for an exact match.
  * Regular Expressions::   Describing classes of strings.
  * Regexp Search::         Searching for a match for a regexp.
+* POSIX Regexps::         Searching POSIX-style for the longest match.
  * Search and Replace::   Internals of @code{query-replace}.
  * Match Data::            Finding out which part of the text matched
                              various parts of a regexp, after regexp search.
@@ -204,15 +205,14 @@ matches any three-character string that begins with @samp{a} and ends with
  
  @item *
  @cindex @samp{*} in regexp
-is not a construct by itself; it is a suffix operator that means to
-repeat the preceding regular expression as many times as possible.  In
-@samp{fo*}, the @samp{*} applies to the @samp{o}, so @samp{fo*} matches
-one @samp{f} followed by any number of @samp{o}s.  The case of zero
-@samp{o}s is allowed: @samp{fo*} does match @samp{f}.@refill
+is not a construct by itself; it is a postfix operator that means to
+match the preceding regular expression repetitively as many times as
+possible.  Thus, @samp{o*} matches any number of @samp{o}s (including no
+@samp{o}s).
  
  @samp{*} always applies to the @emph{smallest} possible preceding
-expression.  Thus, @samp{fo*} has a repeating @samp{o}, not a
-repeating @samp{fo}.@refill
+expression.  Thus, @samp{fo*} has a repeating @samp{o}, not a repeating
+@samp{fo}.  It matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
  
  The matcher processes a @samp{*} construct by matching, immediately,
  as many repetitions as can be found.  Then it continues with the rest
@@ -226,72 +226,72 @@ The next alternative is for @samp{a*} to match only two @samp{a}s.
  With this choice, the rest of the regexp matches successfully.@refill
  
  Nested repetition operators can be extremely slow if they specify
-backtracking loops.  For example, @samp{\(x+y*\)*a} could take hours to
-match the sequence @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}.  The
-slowness is because Emacs must try each imaginable way of grouping the
-35 @samp{x}'s before concluding that none of them can work.  To make
-sure your regular expressions run fast, check nested repetitions
-carefully.
+backtracking loops.  For example, it could take hours for the regular
+expression @samp{\(x+y*\)*a} to match the sequence
+@samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz}.  The slowness is because
+Emacs must try each imaginable way of grouping the 35 @samp{x}'s before
+concluding that none of them can work.  To make sure your regular
+expressions run fast, check nested repetitions carefully.
  
  @item +
  @cindex @samp{+} in regexp
-is a suffix operator similar to @samp{*} except that the preceding
-expression must match at least once.  So, for example, @samp{ca+r}
+is a postfix operator, similar to @samp{*} except that it must match
+the preceding expression at least once.  So, for example, @samp{ca+r}
  matches the strings @samp{car} and @samp{caaaar} but not the string
  @samp{cr}, whereas @samp{ca*r} matches all three strings.
  
  @item ?
  @cindex @samp{?} in regexp
-is a suffix operator similar to @samp{*} except that the preceding
-expression can match either once or not at all.  For example,
-@samp{ca?r} matches @samp{car} or @samp{cr}, but does not match anyhing
-else.
+is a postfix operator, similar to @samp{*} except that it can match the
+preceding expression either once or not at all.  For example,
+@samp{ca?r} matches @samp{car} or @samp{cr}, and nothing else.
  
  @item [ @dots{} ]
  @cindex character set (in regexp)
  @cindex @samp{[} in regexp
  @cindex @samp{]} in regexp
-@samp{[} begins a @dfn{character set}, which is terminated by a
-@samp{]}.  In the simplest case, the characters between the two brackets
-form the set.  Thus, @samp{[ad]} matches either one @samp{a} or one
-@samp{d}, and @samp{[ad]*} matches any string composed of just @samp{a}s
-and @samp{d}s (including the empty string), from which it follows that
-@samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr},
-@samp{caddaar}, etc.@refill
-
-The usual regular expression special characters are not special inside a
+is a @dfn{character set}, which begins with @samp{[} and is terminated
+by @samp{]}.  In the simplest case, the characters between the two
+brackets are what this set can match.
+
+Thus, @samp{[ad]} matches either one @samp{a} or one @samp{d}, and
+@samp{[ad]*} matches any string composed of just @samp{a}s and @samp{d}s
+(including the empty string), from which it follows that @samp{c[ad]*r}
+matches @samp{cr}, @samp{car}, @samp{cdr}, @samp{caddaar}, etc.
+
+You can also include character ranges in a character set, by writing the
+starting and ending characters with a @samp{-} between them.  Thus,
+@samp{[a-z]} matches any lower-case ASCII letter.  Ranges may be
+intermixed freely with individual characters, as in @samp{[a-z$%.]},
+which matches any lower case ASCII letter or @samp{$}, @samp{%} or
+period.
+
+Note that the usual regexp special characters are not special inside a
  character set.  A completely different set of special characters exists
-inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill
-
-@samp{-} is used for ranges of characters.  To write a range, write two
-characters with a @samp{-} between them.  Thus, @samp{[a-z]} matches any
-lower case letter.  Ranges may be intermixed freely with individual
-characters, as in @samp{[a-z$%.]}, which matches any lower case letter
-or @samp{$}, @samp{%}, or a period.@refill
-
-To include a @samp{]} in a character set, make it the first character.
-For example, @samp{[]a]} matches @samp{]} or @samp{a}.  To include a
-@samp{-}, write @samp{-} as the first character in the set, or put it
-immediately after a range.  (You can replace one individual character
-@var{c} with the range @samp{@var{c}-@var{c}} to make a place to put the
-@samp{-}.)  There is no way to write a set containing just @samp{-} and
-@samp{]}.
+inside character sets: @samp{]}, @samp{-} and @samp{^}.
+
+To include a @samp{]} in a character set, you must make it the first
+character.  For example, @samp{[]a]} matches @samp{]} or @samp{a}.  To
+include a @samp{-}, write @samp{-} as the first or last character of the
+set, or put it after a range.  Thus, @samp{[]-]} matches both @samp{]}
+and @samp{-}.
  
  To include @samp{^} in a set, put it anywhere but at the beginning of
  the set.
  
  @item [^ @dots{} ]
  @cindex @samp{^} in regexp
-@samp{[^} begins a @dfn{complement character set}, which matches any
-character except the ones specified.  Thus, @samp{[^a-z0-9A-Z]}
-matches all characters @emph{except} letters and digits.@refill
+@samp{[^} begins a @dfn{complemented character set}, which matches any
+character except the ones specified.  Thus, @samp{[^a-z0-9A-Z]} matches
+all characters @emph{except} letters and digits.
  
  @samp{^} is not special in a character set unless it is the first
  character.  The character following the @samp{^} is treated as if it
-were first (thus, @samp{-} and @samp{]} are not special there).
+were first (in other words, @samp{-} and @samp{]} are not special there).
  
-Note that a complement character set can match a newline, unless
-newline is mentioned as one of the characters not to match.
+A complemented character set can match a newline, unless newline is
+mentioned as one of the characters not to match.  This is in contrast to
+the handling of regexps in programs such as @code{grep}.
  
  @item ^
  @cindex @samp{^} in regexp
@@ -338,10 +338,10 @@ can act.  It is poor practice to depend on this behavior; quote the
  special character anyway, regardless of where it appears.@refill
  
  For the most part, @samp{\} followed by any character matches only
-that character.  However, there are several exceptions: characters
-that, when preceded by @samp{\}, are special constructs.  Such
-characters are always ordinary when encountered on their own.  Here
-is a table of @samp{\} constructs:
+that character.  However, there are several exceptions: two-character
+sequences starting with @samp{\} which have special meanings.  The
+second character in the sequence is always an ordinary character when
+used on its own.  Here is a table of @samp{\} constructs.
  
  @table @kbd
  @item \|
@@ -369,13 +369,15 @@ is a grouping construct that serves three purposes:
  
  @enumerate
  @item
-To enclose a set of @samp{\|} alternatives for other operations.
-Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}.
+To enclose a set of @samp{\|} alternatives for other operations.  Thus,
+the regular expression @samp{\(foo\|bar\)x} matches either @samp{foox}
+or @samp{barx}.
  
  @item
-To enclose an expression for a suffix operator such as @samp{*} to act
-on.  Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any
-(zero or more) number of @samp{na} strings.@refill
+To enclose a complicated expression for the postfix operators @samp{*},
+@samp{+} and @samp{?} to operate on.  Thus, @samp{ba\(na\)*} matches
+@samp{bananana}, etc., with any (zero or more) number of @samp{na}
+strings.@refill
  
  @item
  To record a matched substring for future reference.
@@ -391,7 +393,7 @@ Here is an explanation of this feature:
  matches the same text that matched the @var{digit}th occurrence of a
  @samp{\( @dots{} \)} construct.
  
-In other words, after the end of a @samp{\( @dots{} \)} construct.  the
+In other words, after the end of a @samp{\( @dots{} \)} construct, the
  matcher remembers the beginning and end of the text matched by that
  construct.  Then, later on in the regular expression, you can use
  @samp{\} followed by @var{digit} to match that same text, whatever it
@@ -422,8 +424,9 @@ matches any character that is not a word constituent.
  matches any character whose syntax is @var{code}.  Here @var{code} is a
  character that represents a syntax code: thus, @samp{w} for word
  constituent, @samp{-} for whitespace, @samp{(} for open parenthesis,
-etc.  @xref{Syntax Tables}, for a list of syntax codes and the
-characters that stand for them.
+etc.  Represent a character of whitespace (which can be a newline) by
+either @samp{-} or a space character.  @xref{Syntax Tables}, for a list
+of syntax codes and the characters that stand for them.
  
  @item \S@var{code}
  @cindex @samp{\S} in regexp
@@ -457,6 +460,9 @@ end of a word.  Thus, @samp{\bfoo\b} matches any occurrence of
  @samp{foo} as a separate word.  @samp{\bballs?\b} matches
  @samp{ball} or @samp{balls} as a separate word.@refill
  
+@samp{\b} matches at the beginning or end of the buffer
+regardless of what text appears next to it.
+
  @item \B
  @cindex @samp{\B} in regexp
  matches the empty string, but @emph{not} at the beginning or
@@ -465,10 +471,14 @@ end of a word.
  @item \<
  @cindex @samp{\<} in regexp
  matches the empty string, but only at the beginning of a word.
+@samp{\<} matches at the beginning of the buffer only if a
+word-constituent character follows.
  
  @item \>
  @cindex @samp{\>} in regexp
-matches the empty string, but only at the end of a word.
+matches the empty string, but only at the end of a word.  @samp{\>}
+matches at the end of the buffer only if the contents end with a
+word-constituent character.
  @end table
  
  @kindex invalid-regexp
@@ -715,6 +725,48 @@ comes back" twice.
  @end example
  @end defun
  
+@node POSIX Regexps
+@section POSIX Regular Expression Searching
+
+  The usual regular expression functions do backtracking when necessary
+to handle the @samp{\|} and repetition constructs, but they continue
+this only until they find @emph{some} match.  Then they succeed and
+report the first match found.
+
+  This section describes alternative search functions which perform the
+full backtracking specified by the POSIX standard for regular expression
+matching.  They continue backtracking until they have tried all
+possibilities and found all matches, so they can report the longest
+match, as required by POSIX.  This is much slower, so use these
+functions only when you really need the longest match.
+
+  In Emacs versions prior to 19.29, these functions did not exist, and
+the functions described above implemented full POSIX backtracking.
+
+@defun posix-search-forward regexp &optional limit noerror repeat
+This is like @code{re-search-forward} except that it performs the full
+backtracking specified by the POSIX standard for regular expression
+matching.
+@end defun
+
+@defun posix-search-backward regexp &optional limit noerror repeat
+This is like @code{re-search-backward} except that it performs the full
+backtracking specified by the POSIX standard for regular expression
+matching.
+@end defun
+
+@defun posix-looking-at regexp
+This is like @code{looking-at} except that it performs the full
+backtracking specified by the POSIX standard for regular expression
+matching.
+@end defun
+
+@defun posix-string-match regexp string &optional start
+This is like @code{string-match} except that it performs the full
+backtracking specified by the POSIX standard for regular expression
+matching.
+@end defun
+
  @ignore
  @deffn Command delete-matching-lines regexp
  This function is identical to @code{delete-non-matching-lines}, save
@@ -807,9 +859,9 @@ The argument @var{replacements} specifies what to replace occurrences
  with.  If it is a string, that string is used.  It can also be a list of
  strings, to be used in cyclic order.
  
-If @var{repeat-count} is non-@code{nil}, it should be an integer, the
-number of occurrences to consider.  In this case, @code{perform-replace}
-returns after considering that many occurrences.
+If @var{repeat-count} is non-@code{nil}, it should be an integer.  Then
+it specifies how many times to use each of the strings in the
+@var{replacements} list before advancing cyclically to the next one.
  
  Normally, the keymap @code{query-replace-map} defines the possible user
  responses for queries.  The argument @var{map}, if non-@code{nil}, is a
@@ -909,34 +961,57 @@ match data around it, to prevent it from being overwritten.
  @node Simple Match Data
  @subsection Simple Match Data Access
  
-  This section explains how to use the match data to find the starting
-point or ending point of the text that was matched by a particular
-search, or by a particular parenthetical subexpression of a regular
-expression.
+  This section explains how to use the match data to find out what was
+matched by the last search or match operation.
+
+  You can ask about the entire matching text, or about a particular
+parenthetical subexpression of a regular expression.  The @var{count}
+argument in the functions below specifies which.  If @var{count} is
+zero, you are asking about the entire match.  If @var{count} is
+positive, it specifies which subexpression you want.
+
+  Recall that the subexpressions of a regular expression are those
+expressions grouped with escaped parentheses, @samp{\(@dots{}\)}.  The
+@var{count}th subexpression is found by counting occurrences of
+@samp{\(} from the beginning of the whole regular expression.  The first
+subexpression is numbered 1, the second 2, and so on.  Only regular
+expressions can have subexpressions---after a simple string search, the
+only information available is about the entire match.
+
+@defun match-string count &optional in-string
+This function returns, as a string, the text matched in the last search
+or match operation.  It returns the entire text if @var{count} is zero,
+or just the portion corresponding to the @var{count}th parenthetical
+subexpression, if @var{count} is positive.  If @var{count} is out of
+range, or if that subexpression didn't match anything, the value is
+@code{nil}.
+
+If the last such operation was done against a string with
+@code{string-match}, then you should pass the same string as the
+argument @var{in-string}.  Otherwise, after a buffer search or match,
+you should omit @var{in-string} or pass @code{nil} for it; but you
+should make sure that the current buffer when you call
+@code{match-string} is the one in which you did the searching or
+matching.
+@end defun
  
  @defun match-beginning count
  This function returns the position of the start of text matched by the
  last regular expression searched for, or a subexpression of it.
  
  If @var{count} is zero, then the value is the position of the start of
-the text matched by the whole regexp.  Otherwise, @var{count}, specifies
-a subexpression in the regular expresion.  The value of the function is
-the starting position of the match for that subexpression.
-
-Subexpressions of a regular expression are those expressions grouped
-with escaped parentheses, @samp{\(@dots{}\)}.  The @var{count}th
-subexpression is found by counting occurrences of @samp{\(} from the
-beginning of the whole regular expression.  The first subexpression is
-numbered 1, the second 2, and so on.
-
-The value is @code{nil} for a subexpression inside a
-@samp{\|} alternative that wasn't used in the match.
+the entire match.  Otherwise, @var{count} specifies a subexpression in
+the regular expression, and the value of the function is the starting
+position of the match for that subexpression.
+
+The value is @code{nil} for a subexpression inside a @samp{\|}
+alternative that wasn't used in the match.
  @end defun
  
  @defun match-end count
-This function returns the position of the end of the text that matched
-the last regular expression searched for, or a subexpression of it.
-This function is otherwise similar to @code{match-beginning}.
+This function is like @code{match-beginning} except that it returns the
+position of the end of the match, rather than the position of the
+beginning.
  @end defun
  
    Here is an example of using the match data, with a comment showing the
@@ -950,6 +1025,15 @@ positions within the text:
       @result{} 4
  @end group
  
+@group
+(match-string 0 "The quick fox jumped quickly.")
+     @result{} "quick"
+(match-string 1 "The quick fox jumped quickly.")
+     @result{} "qu"
+(match-string 2 "The quick fox jumped quickly.")
+     @result{} "ick"
+@end group
+
  @group
  (match-beginning 1)       ; @r{The beginning of the match}
       @result{} 4                 ;   @r{with @samp{qu} is at index 4.}
@@ -1004,11 +1088,19 @@ character of the buffer counts as 1.)
  @var{replacement}.
  
  @cindex case in replacements
-@defun replace-match replacement &optional fixedcase literal
-This function replaces the buffer text matched by the last search, with
-@var{replacement}.  It applies only to buffers; you can't use
-@code{replace-match} to replace a substring found with
-@code{string-match}.
+@defun replace-match replacement &optional fixedcase literal string subexp
+This function replaces the text in the buffer (or in @var{string}) that
+was matched by the last search.  It replaces that text with
+@var{replacement}.
+
+If you did the last search in a buffer, you should specify @code{nil}
+for @var{string}.  Then @code{replace-match} does the replacement by
+editing the buffer; it leaves point at the end of the replacement text,
+and returns @code{t}.
+
+If you did the search in a string, pass the same string as @var{string}.
+Then @code{replace-match} does the replacement by constructing and
+returning a new string.
  
  If @var{fixedcase} is non-@code{nil}, then the case of the replacement
  text is not changed; otherwise, the replacement text is converted to a
@@ -1045,8 +1137,11 @@ Subexpressions are those expressions grouped inside @samp{\(@dots{}\)}.
  @samp{\\} stands for a single @samp{\} in the replacement text.
  @end table
  
-@code{replace-match} leaves point at the end of the replacement text,
-and returns @code{t}.
+If @var{subexp} is non-@code{nil}, @code{replace-match} replaces just
+subexpression number @var{subexp} of the regexp that was matched, not
+the entire match.  For example, after matching @samp{foo \(ba*r\)},
+calling @code{replace-match} with 1 as @var{subexp} means to replace
+just the text that matched @samp{\(ba*r\)}.
  @end defun
  
  @node Entire Match Data
@@ -1132,10 +1227,10 @@ that shows the problem that arises if you fail to save the match data:
  
    You can save and restore the match data with @code{save-match-data}:
  
-@defspec save-match-data body@dots{}
+@defmac save-match-data body@dots{}
  This special form executes @var{body}, saving and restoring the match
  data around it.
-@end defspec
+@end defmac
  
    You can use @code{set-match-data} together with @code{match-data} to
  imitate the effect of the special form @code{save-match-data}.  This is
@@ -1239,19 +1334,28 @@ default value is @code{"^\014"} (i.e., @code{"^^L"} or @code{"^\C-l"});
  this matches a line that starts with a formfeed character.
  @end defvar
  
+  The following two regular expressions should @emph{not} assume the
+match always starts at the beginning of a line; they should not use
+@samp{^} to anchor the match.  Most often, the paragraph commands do
+check for a match only at the beginning of a line, which means that
+@samp{^} would be superfluous.  When there is a nonzero left margin,
+they accept matches that start after the left margin.  In that case, a
+@samp{^} would be incorrect.  However, a @samp{^} is harmless in modes
+where a left margin is never used.
+
  @defvar paragraph-separate
  This is the regular expression for recognizing the beginning of a line
  that separates paragraphs.  (If you change this, you may have to
  change @code{paragraph-start} also.)  The default value is
-@w{@code{"^[@ \t\f]*$"}}, which matches a line that consists entirely of
-spaces, tabs, and form feeds.
+@w{@code{"[@ \t\f]*$"}}, which matches a line that consists entirely of
+spaces, tabs, and form feeds (after its left margin).
  @end defvar
  
  @defvar paragraph-start
  This is the regular expression for recognizing the beginning of a line
  that starts @emph{or} separates paragraphs.  The default value is
-@w{@code{"^[@ \t\n\f]"}}, which matches a line starting with a space, tab,
-newline, or form feed.
+@w{@code{"[@ \t\n\f]"}}, which matches a line starting with a space, tab,
+newline, or form feed (after its left margin).
  @end defvar
  
  @defvar sentence-end