'\"
'\" Copyright (c) 1998 Sun Microsystems, Inc.
'\" Copyright (c) 1999 Scriptics Corporation
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
'\" RCS: @(#) $Id: re_syntax.n,v 1.18 2007/12/13 15:22:33 dgp Exp $
'\"
.so man.macros
.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
.BS
.SH NAME
re_syntax \- Syntax of Tcl regular expressions
.BE
.SH DESCRIPTION
.PP
A \fIregular expression\fR describes strings of characters.
It's a pattern that matches certain strings and does not match others.
.SH "DIFFERENT FLAVORS OF REs"
Regular expressions
.PQ RE s ,
as defined by POSIX, come in two flavors: \fIextended\fR REs
.PQ ERE s
and \fIbasic\fR REs
.PQ BRE s .
EREs are roughly those of the traditional \fIegrep\fR, while BREs are
roughly those of the traditional \fIed\fR. This implementation adds
a third flavor, \fIadvanced\fR REs
.PQ ARE s ,
basically EREs with some significant extensions.
.PP
This manual page primarily describes AREs. BREs mostly exist for
backward compatibility in some old programs; they will be discussed at
the end. POSIX EREs are almost an exact subset of AREs. Features of
AREs that are not present in EREs will be indicated.
.SH "REGULAR EXPRESSION SYNTAX"
.PP
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
the Perl5 extensions (thanks, Henry!). Much of the description of
regular expressions below is copied verbatim from his manual entry.
.PP
An ARE is one or more \fIbranches\fR,
separated by
.QW \fB|\fR ,
matching anything that matches any of the branches.
.PP
A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
concatenated.
It matches a match for the first, followed by a match for the second, etc.;
an empty branch matches the empty string.
.SS QUANTIFIERS
A quantified atom is an \fIatom\fR possibly followed
by a single \fIquantifier\fR.
Without a quantifier, it matches a single match for the atom.
The quantifiers,
and what a so-quantified atom matches, are:
.RS 2
.TP 6
\fB*\fR
.
a sequence of 0 or more matches of the atom
.TP
\fB+\fR
.
a sequence of 1 or more matches of the atom
.TP
\fB?\fR
.
a sequence of 0 or 1 matches of the atom
.TP
\fB{\fIm\fB}\fR
.
a sequence of exactly \fIm\fR matches of the atom
.TP
\fB{\fIm\fB,}\fR
.
a sequence of \fIm\fR or more matches of the atom
.TP
\fB{\fIm\fB,\fIn\fB}\fR
.
a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
\fIm\fR may not exceed \fIn\fR
.TP
\fB*?  +?  ??  {\fIm\fB}?  {\fIm\fB,}?  {\fIm\fB,\fIn\fB}?\fR
.
\fInon-greedy\fR quantifiers, which match the same possibilities,
but prefer the smallest number rather than the largest number
of matches (see \fBMATCHING\fR)
.RE
.PP
The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
permissible values from 0 to 255 inclusive.
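.PP
As a rough illustration of the quantifiers above, the following Tcl
sketch applies a greedy quantifier, its non-greedy form, and a bound to
the same input. The input string and variable names are made up for
this example; the \fBregexp\fR command itself is described in regexp(n).

```tcl
# Sample input, chosen only for illustration.
set s "aaaa"

regexp {a+}     $s greedy   ;# greedy: prefers the most repetitions
regexp {a+?}    $s lazy     ;# non-greedy: prefers the fewest repetitions
regexp {a{2,3}} $s bounded  ;# bound: two through three repetitions

puts "greedy=$greedy lazy=$lazy bounded=$bounded"
# prints: greedy=aaaa lazy=a bounded=aaa
```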
.SS ATOMS
An atom is one of:
.RS 2
.IP \fB(\fIre\fB)\fR 6
matches a match for \fIre\fR (\fIre\fR is any regular expression) with
the match noted for possible reporting
.IP \fB(?:\fIre\fB)\fR
as previous, but does no reporting (a
.QW non-capturing
set of parentheses)
.IP \fB()\fR
matches an empty string, noted for possible reporting
.IP \fB(?:)\fR
matches an empty string, without reporting
.IP \fB[\fIchars\fB]\fR
a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
\fBBRACKET EXPRESSIONS\fR for more detail)
.IP \fB.\fR
matches any single character
.IP \fB\e\fIk\fR
matches the non-alphanumeric character \fIk\fR
taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
character
.IP \fB\e\fIc\fR
where \fIc\fR is alphanumeric (possibly followed by other characters),
an \fIescape\fR (AREs only), see \fBESCAPES\fR below
.IP \fB{\fR
when followed by a character other than a digit, matches the
left-brace character
.QW \fB{\fR ;
when followed by a digit, it is the beginning of a \fIbound\fR (see above)
.IP \fIx\fR
where \fIx\fR is a single character with no other significance,
matches that character.
.RE
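.PP
The difference between capturing and non-capturing parentheses can be
sketched in Tcl as follows. The date string and variable names are
assumptions made for the example only.

```tcl
# Sample input, chosen only for illustration.
set date "2007-12-13"

# Three parenthesized groups, but the middle one is non-capturing,
# so only two submatch variables are filled in after the whole match.
regexp {(\d+)-(?:\d+)-(\d+)} $date whole year day

puts "$whole $year $day"
# prints: 2007-12-13 2007 13
```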
.SS CONSTRAINTS
A \fIconstraint\fR matches an empty string when specific conditions
are met. A constraint may not be followed by a quantifier. The
simple constraints are as follows; some more constraints are described
later, under \fBESCAPES\fR.
.RS 2
.TP 8
\fB^\fR
.
matches at the beginning of a line
.TP
\fB$\fR
.
matches at the end of a line
.TP
\fB(?=\fIre\fB)\fR
.
\fIpositive lookahead\fR (AREs only), matches at any point where a
substring matching \fIre\fR begins
.TP
\fB(?!\fIre\fB)\fR
.
\fInegative lookahead\fR (AREs only), matches at any point where no
substring matching \fIre\fR begins
.RE
.PP
The lookahead constraints may not contain back references (see later),
and all parentheses within them are considered non-capturing.
.PP
An RE may not end with
.QW \fB\e\fR .
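.PP
A minimal Tcl sketch of the two lookahead constraints follows. The
sample strings and variable names are invented for illustration; the
\fB\-indices\fR switch of \fBregexp\fR is used only to show where the
second match lands.

```tcl
# Positive lookahead: a run of word characters, but only where a colon
# follows. The colon is checked, not consumed.
regexp {\w+(?=:)} "key: value" name
puts $name
# prints: key

# Negative lookahead: "foo" that is not followed by "bar".
# The first "foo" is rejected, so the match is the one in "foobaz".
regexp -indices {foo(?!bar)} "foobar foobaz" range
puts $range
# prints: 7 9
```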
.SH "BRACKET EXPRESSIONS"
A \fIbracket expression\fR is a list of characters enclosed in
.QW \fB[\|]\fR .
It normally matches any single character from the list
(but see below). If the list begins with
.QW \fB^\fR ,
it matches any single character (but see below) \fInot\fR from the
rest of the list.
.PP
If two characters in the list are separated by
.QW \fB\-\fR ,
this is shorthand for the full \fIrange\fR of characters between those two
(inclusive) in the collating sequence, e.g.
.QW \fB[0\-9]\fR
in Unicode matches any conventional decimal digit. Two ranges may not share an
endpoint, so e.g.
.QW \fBa\-c\-e\fR
is illegal. Ranges in Tcl always use the
Unicode collating sequence, but other programs may use other collating
sequences and this can be a source of incompatibility between programs.
.PP
To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
collating element (see below). Alternatively, make it the first
character (following a possible
.QW \fB^\fR ),
or (AREs only) precede it with
.QW \fB\e\fR .
Alternatively, for
.QW \fB\-\fR ,
make it the last character, or the second endpoint of a range. To use
a literal \fB\-\fR as the first endpoint of a range, make it a
collating element or (AREs only) precede it with
.QW \fB\e\fR .
With the exception of
these, some combinations using \fB[\fR (see next paragraphs), and
escapes, all other special characters lose their special significance
within a bracket expression.
.SS "CHARACTER CLASSES"
Within a bracket expression, the name of a \fIcharacter class\fR
enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
characters (not all collating elements!) belonging to that class.
Standard character classes are:
.IP \fBalpha\fR 8
A letter.
.IP \fBupper\fR 8
An upper-case letter.
.IP \fBlower\fR 8
A lower-case letter.
.IP \fBdigit\fR 8
A decimal digit.
.IP \fBxdigit\fR 8
A hexadecimal digit.
.IP \fBalnum\fR 8
An alphanumeric (letter or digit).
.IP \fBprint\fR 8
A "printable" (same as graph, except also including space).
.IP \fBblank\fR 8
A space or tab character.
.IP \fBspace\fR 8
A character producing white space in displayed text.
.IP \fBpunct\fR 8
A punctuation character.
.IP \fBgraph\fR 8
A character with a visible representation (includes both alnum and punct).
.IP \fBcntrl\fR 8
A control character.
.PP
A locale may provide others. A character class may not be used as an endpoint
of a range.
.RS
.PP
(\fINote:\fR the current Tcl implementation has only one locale, the Unicode
locale, which supports exactly the above classes.)
.RE
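.PP
The bracket-expression rules above can be tried with a short Tcl
sketch. The sample strings and the variable name are assumptions for
illustration; the \fB\-all\fR and \fB\-inline\fR switches of
\fBregexp\fR simply return every match as a list.

```tcl
# A character class inside a bracket expression: runs of letters.
puts [regexp -all -inline {[[:alpha:]]+} "ab12cd"]
# prints: ab cd

# A complemented range: runs of anything that is not a decimal digit.
puts [regexp -all -inline {[^0-9]+} "ab12cd"]
# prints: ab cd

# A literal right bracket placed first and a literal hyphen placed last.
regexp {[]x-]+} "a]-x]b" run
puts $run
# prints: ]-x]
```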
.SS "BRACKETED CONSTRAINTS"
There are two special cases of bracket expressions: the bracket
expressions
.QW \fB[[:<:]]\fR
and
.QW \fB[[:>:]]\fR
are constraints, matching empty strings at the beginning and end of a word
respectively.
.\" note, discussion of escapes below references this definition of word
A word is defined as a sequence of word characters that is neither preceded
nor followed by word characters. A word character is an \fIalnum\fR character
or an underscore
.PQ \fB_\fR "" .
These special bracket expressions are deprecated; users of AREs should use
constraint escapes instead (see below).
.SS "COLLATING ELEMENTS"
Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single
character, or a collating-sequence name for either) enclosed in
\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
collating element. The sequence is a single element of the bracket
expression's list. A bracket expression in a locale that has
multi-character collating elements can thus match more than one
character. So (insidiously), a bracket expression that starts with
\fB^\fR can match multi-character collating elements even if none of
them appear in the bracket expression!
.RS
.PP
(\fINote:\fR Tcl has no multi-character collating elements. This information
is only for illustration.)
.RE
.PP
For example, assume the collating sequence includes a \fBch\fR multi-character
collating element. Then the RE
.QW \fB[[.ch.]]*c\fR
(zero or more
.QW \fBch\fRs
followed by
.QW \fBc\fR )
matches the first five characters of
.QW \fBchchcc\fR .
Also, the RE
.QW \fB[^c]b\fR
matches all of
.QW \fBchb\fR
(because
.QW \fB[^c]\fR
matches the multi-character
.QW \fBch\fR ).
.SS "EQUIVALENCE CLASSES"
Within a bracket expression, a collating element enclosed in \fB[=\fR
and \fB=]\fR is an equivalence class, standing for the sequences of
characters of all collating elements equivalent to that one, including
itself. (If there are no other equivalent collating elements, the
treatment is as if the enclosing delimiters were
.QW \fB[.\fR \&
and
.QW \fB.]\fR .)
For example, if \fBo\fR and \fB\N'244'\fR are the members of an
equivalence class, then
.QW \fB[[=o=]]\fR ,
.QW \fB[[=\N'244'=]]\fR ,
and
.QW \fB[o\N'244']\fR \&
are all synonymous. An equivalence class may not be an endpoint of a range.
.RS
.PP
(\fINote:\fR Tcl implements only the Unicode locale. It does not define any
equivalence classes. The examples above are just illustrations.)
.RE
.SH ESCAPES
Escapes (AREs only), which begin with a \fB\e\fR followed by an
alphanumeric character, come in several varieties: character entry,
class shorthands, constraint escapes, and back references. A \fB\e\fR
followed by an alphanumeric character but not constituting a valid
escape is illegal in AREs. In EREs, there are no escapes: outside a
bracket expression, a \fB\e\fR followed by an alphanumeric character
merely stands for that character as an ordinary character, and inside
a bracket expression, \fB\e\fR is an ordinary character. (The latter
is the one actual incompatibility between EREs and AREs.)
.SS "CHARACTER-ENTRY ESCAPES"
Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:
.RS 2
.TP 5
\fB\ea\fR
.
alert (bell) character, as in C
.TP
\fB\eb\fR
.
backspace, as in C
.TP
\fB\eB\fR
.
synonym for \fB\e\fR to help reduce backslash doubling in some
applications where there are multiple levels of backslash processing
.TP
\fB\ec\fIX\fR
.
(where \fIX\fR is any character) the character whose low-order 5 bits
are the same as those of \fIX\fR, and whose other bits are all zero
.TP
\fB\ee\fR
.
the character whose collating-sequence name is
.QW \fBESC\fR ,
or failing that, the character with octal value 033
.TP
\fB\ef\fR
.
formfeed, as in C
.TP
\fB\en\fR
.
newline, as in C
.TP
\fB\er\fR
.
carriage return, as in C
.TP
\fB\et\fR
.
horizontal tab, as in C
.TP
\fB\eu\fIwxyz\fR
.
(where \fIwxyz\fR is exactly four hexadecimal digits) the Unicode
character \fBU+\fIwxyz\fR in the local byte ordering
.TP
\fB\eU\fIstuvwxyz\fR
.
(where \fIstuvwxyz\fR is exactly eight hexadecimal digits) reserved
for a somewhat-hypothetical Unicode extension to 32 bits
.TP
\fB\ev\fR
.
vertical tab, as in C
.TP
\fB\ex\fIhhh\fR
.
(where \fIhhh\fR is any sequence of hexadecimal digits) the character
whose hexadecimal value is \fB0x\fIhhh\fR (a single character no
matter how many hexadecimal digits are used).
.TP
\fB\e0\fR
.
the character whose value is \fB0\fR
.TP
\fB\e\fIxy\fR
.
(where \fIxy\fR is exactly two octal digits, and is not a \fIback
reference\fR (see below)) the character whose octal value is
\fB0\fIxy\fR
.TP
\fB\e\fIxyz\fR
.
(where \fIxyz\fR is exactly three octal digits, and is not a back
reference (see below)) the character whose octal value is
\fB0\fIxyz\fR
.RE
.PP
Hexadecimal digits are
.QR \fB0\fR \fB9\fR ,
.QR \fBa\fR \fBf\fR ,
and
.QR \fBA\fR \fBF\fR .
Octal digits are
.QR \fB0\fR \fB7\fR .
.PP
The character-entry escapes are always taken as ordinary characters.
For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
not terminate a bracket expression. Beware, however, that some
applications (e.g., C compilers and the Tcl interpreter if the regular
expression is not quoted with braces) interpret such sequences
themselves before the regular-expression package gets to see them,
which may require doubling (quadrupling, etc.) the
.QW \fB\e\fR .
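.PP
The warning about doubling the \fB\e\fR can be made concrete with a
small Tcl sketch. It compares a brace-quoted RE with a double-quoted
one; the variable names are hypothetical and exist only for this
example.

```tcl
# Braces hand the backslash to the regular-expression package untouched.
set braced {\t}

# In double quotes the Tcl parser processes backslashes first, so the
# backslash has to be doubled to survive that first pass.
set quoted "\\t"

puts [string equal $braced $quoted]
# prints: 1

# The RE package sees backslash-t in both cases and matches a tab.
puts [regexp $braced "a\tb"]
# prints: 1
puts [regexp $quoted "a\tb"]
# prints: 1
```

Brace quoting is usually the simpler choice, since it leaves only one
level of backslash processing to reason about.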
.SS "CLASS-SHORTHAND ESCAPES"
Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:
.RS 2
.TP 10
\fB\ed\fR
.
\fB[[:digit:]]\fR
.TP
\fB\es\fR
.
\fB[[:space:]]\fR
.TP
\fB\ew\fR
.
\fB[[:alnum:]_]\fR (note underscore)
.TP
\fB\eD\fR
.
\fB[^[:digit:]]\fR
.TP
\fB\eS\fR
.
\fB[^[:space:]]\fR
.TP
\fB\eW\fR
.
\fB[^[:alnum:]_]\fR (note underscore)
.RE
.PP
Within bracket expressions,
.QW \fB\ed\fR ,
.QW \fB\es\fR ,
and
.QW \fB\ew\fR \&
lose their outer brackets, and
.QW \fB\eD\fR ,
.QW \fB\eS\fR ,
and
.QW \fB\eW\fR \&
are illegal. (So, for example,
.QW \fB[a-c\ed]\fR
is equivalent to
.QW \fB[a-c[:digit:]]\fR .
Also,
.QW \fB[a-c\eD]\fR ,
which is equivalent to
.QW \fB[a-c^[:digit:]]\fR ,
is illegal.)
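.PP
A brief Tcl sketch of the class shorthands follows, both on their own
and inside a bracket expression. The sample strings and the variable
name are made up for illustration.

```tcl
# Runs of digits, and runs of non-space characters.
puts [regexp -all -inline {\d+} "a1 b22 c333"]
# prints: 1 22 333
puts [regexp -all -inline {\S+} "a1 b22 c333"]
# prints: a1 b22 c333

# Inside a bracket expression the shorthand loses its outer brackets,
# so this is the same list as a-c plus the digit class.
regexp {[a-c\d]+} "xxb2a9zz" run
puts $run
# prints: b2a9
```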
.SS "CONSTRAINT ESCAPES"
A constraint escape (AREs only) is a constraint, matching the empty
string if specific conditions are met, written as an escape:
.RS 2
.TP 6
\fB\eA\fR
.
matches only at the beginning of the string (see \fBMATCHING\fR,
below, for how this differs from
.QW \fB^\fR )
.TP
\fB\em\fR
.
matches only at the beginning of a word
.TP
\fB\eM\fR
.
matches only at the end of a word
.TP
\fB\ey\fR
.
matches only at the beginning or end of a word
.TP
\fB\eY\fR
.
matches only at a point that is not the beginning or end of a word
.TP
\fB\eZ\fR
.
matches only at the end of the string (see \fBMATCHING\fR, below, for
how this differs from
.QW \fB$\fR )
.TP
\fB\e\fIm\fR
.
(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
.TP
\fB\e\fImnn\fR
.
(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
and the decimal value \fImnn\fR is not greater than the number of
closing capturing parentheses seen so far) a \fIback reference\fR, see
below
.RE
.PP
A word is defined as in the specification of
.QW \fB[[:<:]]\fR
and
.QW \fB[[:>:]]\fR
above. Constraint escapes are illegal within bracket expressions.
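.PP
The word-boundary escapes can be sketched in Tcl as follows. The
sample text and variable name are assumptions for the example; without
\fB\-inline\fR, \fBregexp \-all\fR returns the number of matches.

```tcl
# Sample text, chosen only for illustration.
set text "cat concat cats cat"

# Whole words only: beginning-of-word and end-of-word constraints.
puts [regexp -all -inline {\mcat\M} $text]
# prints: cat cat

# Words that begin with "cat": cat, cats, cat.
puts [regexp -all {\mcat} $text]
# prints: 3

# "cat" starting somewhere that is not a word boundary: only in concat.
puts [regexp -all {\Ycat} $text]
# prints: 1
```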
.SS "BACK REFERENCES"
A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
.QW \fB([bc])\e1\fR
matches
.QW \fBbb\fR
or
.QW \fBcc\fR
but not
.QW \fBbc\fR .
The subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.
.PP
There is an inherent historical ambiguity between octal
character-entry escapes and back references, which is resolved by
heuristics, as hinted at above. A leading zero always indicates an
octal escape. A single non-zero digit, not followed by another digit,
is always taken as a back reference. A multi-digit sequence not
starting with a zero is taken as a back reference if it comes after a
suitable subexpression (i.e. the number is in the legal range for a
back reference), and otherwise is taken as octal.
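.PP
The back-reference example above, plus a doubled-word search, can be
sketched in Tcl like this. The sample sentence and variable names are
hypothetical.

```tcl
# The example from the text: the same character twice.
puts [regexp {([bc])\1} "bb"]
# prints: 1
puts [regexp {([bc])\1} "bc"]
# prints: 0

# A doubled word: a word, white space, then the same word again.
regexp {\m(\w+)\s+\1\M} "it is is here" pair word
puts "$pair / $word"
# prints: is is / is
```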
.SH "METASYNTAX"
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.
.PP
Normally the flavor of RE being used is specified by
application-dependent means. However, this can be overridden by a
\fIdirector\fR. If an RE of any flavor begins with
.QW \fB***:\fR ,
the rest of the RE is an ARE. If an RE of any flavor begins with
.QW \fB***=\fR ,
the rest of the RE is taken to be a literal string, with
all characters considered ordinary characters.
.PP
An ARE may begin with \fIembedded options\fR: a sequence
\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
characters) specifies options affecting the rest of the RE. These
supplement, and can override, any options specified by the
application. The available option letters are:
.RS 2
.TP 3
\fBb\fR
.
rest of RE is a BRE
.TP 3
\fBc\fR
.
case-sensitive matching (usual default)
.TP 3
\fBe\fR
.
rest of RE is an ERE
.TP 3
\fBi\fR
.
case-insensitive matching (see \fBMATCHING\fR, below)
.TP 3
\fBm\fR
.
historical synonym for \fBn\fR
.TP 3
\fBn\fR
.
newline-sensitive matching (see \fBMATCHING\fR, below)
.TP 3
\fBp\fR
.
partial newline-sensitive matching (see \fBMATCHING\fR, below)
.TP 3
\fBq\fR
.
rest of RE is a literal
.PQ quoted
string, all ordinary characters
.TP 3
\fBs\fR
.
non-newline-sensitive matching (usual default)
.TP 3
\fBt\fR
.
tight syntax (usual default; see below)
.TP 3
\fBw\fR
.
inverse partial newline-sensitive
.PQ weird
matching (see \fBMATCHING\fR, below)
.TP 3
\fBx\fR
.
expanded syntax (see below)
.RE
.PP
Embedded options take effect at the \fB)\fR terminating the sequence.
They are available only at the start of an ARE, and may not be used
later within it.
.PP
In addition to the usual (\fItight\fR) RE syntax, in which all
characters are significant, there is an \fIexpanded\fR syntax,
available in all flavors of RE with the \fB\-expanded\fR switch, or in
AREs with the embedded x option. In the expanded syntax, white-space
characters are ignored and all characters between a \fB#\fR and the
following newline (or the end of the RE) are ignored, permitting
paragraphing and commenting a complex RE. There are three exceptions
to that basic rule:
.IP \(bu 3
a white-space character or
.QW \fB#\fR
preceded by
.QW \fB\e\fR
is retained
.IP \(bu 3
white space or
.QW \fB#\fR
within a bracket expression is retained
.IP \(bu 3
white space and comments are illegal within multi-character symbols
like the ARE
.QW \fB(?:\fR
or the BRE
.QW \fB\e(\fR
.PP
Expanded-syntax white-space characters are blank, tab, newline, and
any character that belongs to the \fIspace\fR character class.
.PP
Finally, in an ARE, outside bracket expressions, the sequence
.QW \fB(?#\fIttt\fB)\fR
(where \fIttt\fR is any text not containing a
.QW \fB)\fR )
is a comment, completely ignored. Again, this is not
allowed between the characters of multi-character symbols like
.QW \fB(?:\fR .
Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.
.PP
\fINone\fR of these metasyntax extensions is available if the
application (or an initial
.QW \fB***=\fR
director) has specified that the
user's input be treated as a literal string rather than as an RE.
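.PP
The following Tcl sketch exercises an embedded option, the literal
director, and the expanded syntax. The sample strings and the variable
names are assumptions made for this example.

```tcl
# Embedded option i: case-insensitive matching.
puts [regexp {(?i)hello} "HeLLo"]
# prints: 1

# The ***= director: the rest is a literal string, so the dot is an
# ordinary character rather than "any single character".
puts [regexp {***=a.c} "abc"]
# prints: 0
puts [regexp {***=a.c} "xa.cx"]
# prints: 1

# Embedded option x: expanded syntax, with white space ignored and
# everything from # to the end of the line treated as a comment.
set re {(?x)
    \d+     # first number
    -       # a literal hyphen
    \d+     # second number
}
regexp $re "12-34" found
puts $found
# prints: 12-34
```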
.SH MATCHING
In the event that an RE could match more than one substring of a given
string, the RE matches the one starting earliest in the string. If
the RE could match more than one substring starting at that point, its
choice is determined by its \fIpreference\fR: either the longest
substring, or the shortest.
.PP
Most atoms, and all constraints, have no preference. A parenthesized
RE has the same preference (possibly none) as the RE. A quantified
atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
preference (possibly none) as the atom itself. A quantified atom with
other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
\fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
the same preference as the first quantified atom in it which has a
preference. An RE consisting of two or more branches connected by the
\fB|\fR operator prefers longest match.
.PP
Subject to the constraints imposed by the rules for matching the whole
RE, subexpressions also match the longest or shortest possible
substrings, based on their preferences, with subexpressions starting
earlier in the RE taking priority over ones starting later. Note that
outer subexpressions thus take priority over their component
subexpressions.
.PP
Note that the quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
force longest and shortest preference, respectively, on a
subexpression or a whole RE.
.PP
Match lengths are measured in characters, not collating elements. An
empty string is considered longer than no match at all. For example,
.QW \fBbb*\fR
matches the three middle characters of
.QW \fBabbbc\fR ,
.QW \fB(week|wee)(night|knights)\fR
matches all ten characters of
.QW \fBweeknights\fR ,
when
.QW \fB(.*).*\fR
is matched against
.QW \fBabc\fR
the parenthesized subexpression matches all three characters, and when
.QW \fB(a*)*\fR
is matched against
.QW \fBbc\fR
both the whole RE and the parenthesized subexpression match an empty string.
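.PP
The four examples in the preceding paragraph can be reproduced with a
short Tcl sketch. Only the variable names are invented here; the REs
and input strings are the ones given above.

```tcl
# bb* takes the three middle characters of abbbc.
regexp {bb*} "abbbc" middle
puts $middle
# prints: bbb

# The whole RE prefers the longest match, which forces the first
# group to settle for "wee" so that "knights" can follow.
regexp {(week|wee)(night|knights)} "weeknights" all first second
puts "$all $first $second"
# prints: weeknights wee knights

# The earlier subexpression gets priority and takes everything.
regexp {(.*).*} "abc" whole sub
puts $sub
# prints: abc

# Both the whole RE and the subexpression match an empty string.
puts [regexp {(a*)*} "bc" empty emptysub]
# prints: 1
puts "[string length $empty] [string length $emptysub]"
# prints: 0 0
```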
.PP
If case-independent matching is specified, the effect is much as if
all case distinctions had vanished from the alphabet. When an
alphabetic that exists in multiple cases appears as an ordinary
character outside a bracket expression, it is effectively transformed
into a bracket expression containing both cases, so that \fBx\fR
becomes
.QW \fB[xX]\fR .
When it appears inside a bracket expression,
all case counterparts of it are added to the bracket expression, so
that
.QW \fB[x]\fR
becomes
.QW \fB[xX]\fR
and
.QW \fB[^x]\fR
becomes
.QW \fB[^xX]\fR .
.PP
If newline-sensitive matching is specified, \fB.\fR and bracket
expressions using \fB^\fR will never match the newline character (so
that matches will never cross newlines unless the RE explicitly
arranges it) and \fB^\fR and \fB$\fR will match the empty string after
and before a newline respectively, in addition to matching at
beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
continue to match beginning or end of string \fIonly\fR.
.PP
If partial newline-sensitive matching is specified, this affects
\fB.\fR and bracket expressions as with newline-sensitive matching,
but not \fB^\fR and \fB$\fR.
.PP
If inverse partial newline-sensitive matching is specified, this
affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
not \fB.\fR and bracket expressions. This is not very useful but is
provided for symmetry.
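.PP
A minimal Tcl sketch of newline-sensitive matching follows, using the
embedded \fBn\fR option described under \fBMETASYNTAX\fR. The two-line
sample string and the variable name are assumptions for illustration.

```tcl
# Two lines separated by a newline; chosen only for illustration.
set text "one\ntwo"

# Default matching: ^ and $ anchor to the string only, so no single
# run of word characters spans the whole string.
puts [llength [regexp -all -inline {^\w+$} $text]]
# prints: 0

# Newline-sensitive matching: ^ and $ also match around the newline.
puts [regexp -all -inline {(?n)^\w+$} $text]
# prints: one two

# The dot stops matching the newline character.
puts [regexp {one.two} $text]
# prints: 1
puts [regexp {(?n)one.two} $text]
# prints: 0

# The \A escape still matches only at the beginning of the string.
puts [regexp {(?n)^two} $text]
# prints: 1
puts [regexp {(?n)\Atwo} $text]
# prints: 0
```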
.SH "LIMITS AND COMPATIBILITY"
No particular limit is imposed on the length of REs. Programs
intended to be highly portable should not employ REs longer than 256
bytes, as a POSIX-compliant implementation can refuse to accept such
REs.
.PP
The only feature of AREs that is actually incompatible with POSIX EREs
is that \fB\e\fR does not lose its special significance inside bracket
expressions. All other ARE features use syntax which is illegal or
has undefined or unspecified effects in POSIX EREs; the \fB***\fR
syntax of directors likewise is outside the POSIX syntax for both BREs
and EREs.
.PP
Many of the ARE extensions are borrowed from Perl, but some have been
changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include
.QW \fB\eb\fR ,
.QW \fB\eB\fR ,
the lack of special treatment for a trailing newline, the addition of
complemented bracket expressions to the things affected by
newline-sensitive matching, the restrictions on parentheses and back
references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.
.PP
The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this
package. (The new rules are much simpler and cleaner, but do not work
as hard at guessing the user's real intentions.)
.PP
Henry Spencer's original 1986 \fIregexp\fR package, still in
widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
early version of today's EREs. There are four incompatibilities
between \fIregexp\fR's near-EREs
.PQ RREs " for short"
and AREs. In roughly increasing order of significance:
.IP \(bu 3
In AREs, \fB\e\fR followed by an alphanumeric character is either an
escape or an error, while in RREs, it was just another way of writing
the alphanumeric. This should not be a problem because there was no
reason to write such a sequence in RREs.
.IP \(bu 3
\fB{\fR followed by a digit in an ARE is the beginning of a bound,
while in RREs, \fB{\fR was always an ordinary character. Such
sequences should be rare, and will often result in an error because
following characters will not look like a valid bound.
.IP \(bu 3
In AREs, \fB\e\fR remains a special character within
.QW \fB[\|]\fR ,
so a literal \fB\e\fR within \fB[\|]\fR must be written
.QW \fB\e\e\fR .
\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
but only truly paranoid programmers routinely doubled the backslash.
.IP \(bu 3
AREs report the longest/shortest match for the RE, rather than the
first found in a specified search order. This may affect some RREs
which were written in the expectation that the first match would be
reported. (The careful crafting of RREs to optimize the search order
for fast matching is obsolete, since AREs examine all possible matches
in parallel and their performance is largely insensitive to their
complexity, but cases where the search order was exploited to
deliberately find a match which was \fInot\fR the longest/shortest
will need rewriting.)
.SH "BASIC REGULAR EXPRESSIONS"
BREs differ from EREs in several respects.
.QW \fB|\fR ,
.QW \fB+\fR ,
and \fB?\fR are ordinary characters and there is no equivalent for their
functionality. The delimiters for bounds are \fB\e{\fR and
.QW \fB\e}\fR ,
with \fB{\fR and \fB}\fR by themselves ordinary characters. The
parentheses for nested subexpressions are \fB\e(\fR and
.QW \fB\e)\fR ,
with \fB(\fR and \fB)\fR by themselves ordinary
characters. \fB^\fR is an ordinary character except at the beginning
of the RE or the beginning of a parenthesized subexpression, \fB$\fR
is an ordinary character except at the end of the RE or the end of a
parenthesized subexpression, and \fB*\fR is an ordinary character if
it appears at the beginning of the RE or the beginning of a
parenthesized subexpression (after a possible leading
.QW \fB^\fR ).
Finally, single-digit back references are available, and \fB\e<\fR and
\fB\e>\fR are synonyms for
.QW \fB[[:<:]]\fR
and
.QW \fB[[:>:]]\fR
respectively; no other escapes are available.
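.PP
As a hedged sketch of the BRE differences, the Tcl fragment below
selects BRE syntax with the embedded \fBb\fR option described under
\fBMETASYNTAX\fR. The sample strings are assumptions for illustration.

```tcl
# In a BRE the plus sign is an ordinary character.
puts [regexp {(?b)a+} "aaa"]
# prints: 0
puts [regexp {(?b)a+} "a+"]
# prints: 1

# Subexpressions and bounds need backslashed delimiters.
puts [regexp {(?b)^\(ab\)\{2\}$} "abab"]
# prints: 1

# Unescaped parentheses are ordinary characters.
puts [regexp {(?b)(ab)} "(ab)"]
# prints: 1

# Single-digit back references are available.
puts [regexp {(?b)\([bc]\)\1} "cc"]
# prints: 1
```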
.SH "SEE ALSO"
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
.SH KEYWORDS
match, regular expression, string
.\" Local Variables:
.\" mode: nroff
.\" End: