1 | '\" |
---|
2 | '\" Copyright (c) 1998 Sun Microsystems, Inc. |
---|
3 | '\" Copyright (c) 1999 Scriptics Corporation |
---|
4 | '\" |
---|
5 | '\" See the file "license.terms" for information on usage and redistribution |
---|
6 | '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES. |
---|
7 | '\" |
---|
8 | '\" RCS: @(#) $Id: re_syntax.n,v 1.18 2007/12/13 15:22:33 dgp Exp $ |
---|
9 | '\" |
---|
10 | .so man.macros |
---|
11 | .TH re_syntax n "8.1" Tcl "Tcl Built-In Commands" |
---|
12 | .BS |
---|
13 | .SH NAME |
---|
14 | re_syntax \- Syntax of Tcl regular expressions |
---|
15 | .BE |
---|
16 | .SH DESCRIPTION |
---|
17 | .PP |
---|
18 | A \fIregular expression\fR describes strings of characters. |
---|
19 | It's a pattern that matches certain strings and does not match others. |
---|
20 | .SH "DIFFERENT FLAVORS OF REs" |
---|
21 | Regular expressions |
---|
22 | .PQ RE s , |
---|
23 | as defined by POSIX, come in two flavors: \fIextended\fR REs |
---|
24 | .PQ ERE s |
---|
25 | and \fIbasic\fR REs |
---|
26 | .PQ BRE s . |
---|
27 | EREs are roughly those of the traditional \fIegrep\fR, while BREs are |
---|
28 | roughly those of the traditional \fIed\fR. This implementation adds |
---|
29 | a third flavor, \fIadvanced\fR REs |
---|
30 | .PQ ARE s , |
---|
31 | basically EREs with some significant extensions. |
---|
32 | .PP |
---|
33 | This manual page primarily describes AREs. BREs mostly exist for |
---|
34 | backward compatibility in some old programs; they will be discussed at |
---|
35 | the end. POSIX EREs are almost an exact subset of AREs. Features of |
---|
36 | AREs that are not present in EREs will be indicated. |
---|
37 | .SH "REGULAR EXPRESSION SYNTAX" |
---|
38 | .PP |
---|
39 | Tcl regular expressions are implemented using the package written by |
---|
40 | Henry Spencer, based on the 1003.2 spec and some (not quite all) of |
---|
41 | the Perl5 extensions (thanks, Henry!). Much of the description of |
---|
42 | regular expressions below is copied verbatim from his manual entry. |
---|
43 | .PP |
---|
44 | An ARE is one or more \fIbranches\fR, |
---|
45 | separated by |
---|
46 | .QW \fB|\fR , |
---|
47 | matching anything that matches any of the branches. |
---|
48 | .PP |
---|
49 | A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR, |
---|
50 | concatenated. |
---|
51 | It matches a match for the first, followed by a match for the second, etc.; |
---|
52 | an empty branch matches the empty string. |
---|
53 | .SS QUANTIFIERS |
---|
54 | A quantified atom is an \fIatom\fR possibly followed |
---|
55 | by a single \fIquantifier\fR. |
---|
56 | Without a quantifier, it matches a single match for the atom. |
---|
57 | The quantifiers, |
---|
58 | and what a so-quantified atom matches, are: |
---|
59 | .RS 2 |
---|
60 | .TP 6 |
---|
61 | \fB*\fR |
---|
62 | . |
---|
63 | a sequence of 0 or more matches of the atom |
---|
64 | .TP |
---|
65 | \fB+\fR |
---|
66 | . |
---|
67 | a sequence of 1 or more matches of the atom |
---|
68 | .TP |
---|
69 | \fB?\fR |
---|
70 | . |
---|
71 | a sequence of 0 or 1 matches of the atom |
---|
72 | .TP |
---|
73 | \fB{\fIm\fB}\fR |
---|
74 | . |
---|
75 | a sequence of exactly \fIm\fR matches of the atom |
---|
76 | .TP |
---|
77 | \fB{\fIm\fB,}\fR |
---|
78 | . |
---|
79 | a sequence of \fIm\fR or more matches of the atom |
---|
80 | .TP |
---|
81 | \fB{\fIm\fB,\fIn\fB}\fR |
---|
82 | . |
---|
83 | a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom; |
---|
84 | \fIm\fR may not exceed \fIn\fR |
---|
85 | .TP |
---|
86 | \fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR |
---|
87 | . |
---|
88 | \fInon-greedy\fR quantifiers, which match the same possibilities, |
---|
89 | but prefer the smallest number rather than the largest number |
---|
90 | of matches (see \fBMATCHING\fR) |
---|
91 | .RE |
---|
92 | .PP |
---|
93 | The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The |
---|
94 | numbers \fIm\fR and \fIn\fR are unsigned decimal integers with |
---|
95 | permissible values from 0 to 255 inclusive. |
---|
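As a minimal sketch of the quantifiers above, assuming Tcl's \fBregexp\fR command with its \fB\-inline\fR switch (the variable names are illustrative only):

```tcl
# Greedy "+" consumes as many "a"s as possible.
set greedy  [regexp -inline {a+} baaad]       ;# aaa
# Non-greedy "+?" consumes as few "a"s as possible.
set lazy    [regexp -inline {a+?} baaad]      ;# a
# The bound {1,2} allows one or two repetitions; the greedy form takes two.
set bounded [regexp -inline {a{1,2}} baaad]   ;# aa
# The bound {3} requires exactly three repetitions.
set exact   [regexp {^a{3}$} aaa]             ;# 1
```

Bracing the pattern keeps the Tcl parser from touching the RE before the regular-expression package sees it.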
96 | .SS ATOMS |
---|
97 | An atom is one of: |
---|
98 | .RS 2 |
---|
99 | .IP \fB(\fIre\fB)\fR 6 |
---|
100 | matches a match for \fIre\fR (\fIre\fR is any regular expression) with |
---|
101 | the match noted for possible reporting |
---|
102 | .IP \fB(?:\fIre\fB)\fR |
---|
103 | as previous, but does no reporting (a |
---|
104 | .QW non-capturing |
---|
105 | set of parentheses) |
---|
106 | .IP \fB()\fR |
---|
107 | matches an empty string, noted for possible reporting |
---|
108 | .IP \fB(?:)\fR |
---|
109 | matches an empty string, without reporting |
---|
110 | .IP \fB[\fIchars\fB]\fR |
---|
111 | a \fIbracket expression\fR, matching any one of the \fIchars\fR (see |
---|
112 | \fBBRACKET EXPRESSIONS\fR for more detail) |
---|
113 | .IP \fB.\fR |
---|
114 | matches any single character |
---|
115 | .IP \fB\e\fIk\fR |
---|
116 | matches the non-alphanumeric character \fIk\fR |
---|
117 | taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash |
---|
118 | character |
---|
119 | .IP \fB\e\fIc\fR |
---|
120 | where \fIc\fR is alphanumeric (possibly followed by other characters), |
---|
121 | an \fIescape\fR (AREs only), see \fBESCAPES\fR below |
---|
122 | .IP \fB{\fR |
---|
123 | when followed by a character other than a digit, matches the |
---|
124 | left-brace character |
---|
125 | .QW \fB{\fR ; |
---|
126 | when followed by a digit, it is the beginning of a \fIbound\fR (see above) |
---|
127 | .IP \fIx\fR |
---|
128 | where \fIx\fR is a single character with no other significance, |
---|
129 | matches that character. |
---|
130 | .RE |
---|
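The difference between capturing and non-capturing parentheses can be illustrated with a short sketch, assuming the \fBregexp\fR command and its \fB\-inline\fR switch (variable names are illustrative only):

```tcl
# Two capturing groups and one non-capturing group:
# only "a" and "c" are reported after the overall match.
set result [regexp -inline {(a)(?:b)(c)} abc]   ;# abc a c

# "." matches any single character; "\." matches only a literal period.
set anyChar [regexp {^a.c$} a-c]                ;# 1
set literal [regexp {^a\.c$} a-c]               ;# 0
```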
131 | .SS CONSTRAINTS |
---|
132 | A \fIconstraint\fR matches an empty string when specific conditions |
---|
133 | are met. A constraint may not be followed by a quantifier. The |
---|
134 | simple constraints are as follows; some more constraints are described |
---|
135 | later, under \fBESCAPES\fR. |
---|
136 | .RS 2 |
---|
137 | .TP 8 |
---|
138 | \fB^\fR |
---|
139 | . |
---|
140 | matches at the beginning of a line |
---|
141 | .TP |
---|
142 | \fB$\fR |
---|
143 | . |
---|
144 | matches at the end of a line |
---|
145 | .TP |
---|
146 | \fB(?=\fIre\fB)\fR |
---|
147 | . |
---|
148 | \fIpositive lookahead\fR (AREs only), matches at any point where a |
---|
149 | substring matching \fIre\fR begins |
---|
150 | .TP |
---|
151 | \fB(?!\fIre\fB)\fR |
---|
152 | . |
---|
153 | \fInegative lookahead\fR (AREs only), matches at any point where no |
---|
154 | substring matching \fIre\fR begins |
---|
155 | .RE |
---|
156 | .PP |
---|
157 | The lookahead constraints may not contain back references (see later), |
---|
158 | and all parentheses within them are considered non-capturing. |
---|
159 | .PP |
---|
160 | An RE may not end with |
---|
161 | .QW \fB\e\fR . |
---|
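A brief sketch of the constraints above, assuming the \fBregexp\fR command (variable names are illustrative only). Lookahead tests what follows without consuming it:

```tcl
# Positive lookahead: "foo" matches only when "bar" follows,
# and "bar" is not part of the reported match.
set pos  [regexp -inline {foo(?=bar)} foobar]   ;# foo

# Negative lookahead: "foo" matches only when "bar" does not follow.
set neg1 [regexp {foo(?!bar)} foobar]           ;# 0
set neg2 [regexp {foo(?!bar)} foobaz]           ;# 1

# "^" and "$" anchor the match to the ends of the string.
set start [regexp {^bar} foobar]                ;# 0
set end   [regexp {bar$} foobar]                ;# 1
```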
162 | .SH "BRACKET EXPRESSIONS" |
---|
163 | A \fIbracket expression\fR is a list of characters enclosed in |
---|
164 | .QW \fB[\|]\fR . |
---|
165 | It normally matches any single character from the list |
---|
166 | (but see below). If the list begins with |
---|
167 | .QW \fB^\fR , |
---|
168 | it matches any single character (but see below) \fInot\fR from the |
---|
169 | rest of the list. |
---|
170 | .PP |
---|
171 | If two characters in the list are separated by |
---|
172 | .QW \fB\-\fR , |
---|
173 | this is shorthand for the full \fIrange\fR of characters between those two |
---|
174 | (inclusive) in the collating sequence, e.g. |
---|
175 | .QW \fB[0\-9]\fR |
---|
176 | in Unicode matches any conventional decimal digit. Two ranges may not share an |
---|
177 | endpoint, so e.g. |
---|
178 | .QW \fBa\-c\-e\fR |
---|
179 | is illegal. Ranges in Tcl always use the |
---|
180 | Unicode collating sequence, but other programs may use other collating |
---|
181 | sequences and this can be a source of incompatibility between programs. |
---|
182 | .PP |
---|
183 | To include a literal \fB]\fR or \fB\-\fR in the list, the simplest |
---|
184 | method is to enclose it in \fB[.\fR and \fB.]\fR to make it a |
---|
185 | collating element (see below). Alternatively, make it the first |
---|
186 | character (following a possible |
---|
187 | .QW \fB^\fR ), |
---|
188 | or (AREs only) precede it with |
---|
189 | .QW \fB\e\fR . |
---|
190 | Alternatively, for |
---|
191 | .QW \fB\-\fR , |
---|
192 | make it the last character, or the second endpoint of a range. To use |
---|
193 | a literal \fB\-\fR as the first endpoint of a range, make it a |
---|
194 | collating element or (AREs only) precede it with |
---|
195 | .QW \fB\e\fR . |
---|
196 | With the exception of |
---|
197 | these, some combinations using \fB[\fR (see next paragraphs), and |
---|
198 | escapes, all other special characters lose their special significance |
---|
199 | within a bracket expression. |
---|
200 | .SS "CHARACTER CLASSES" |
---|
201 | Within a bracket expression, the name of a \fIcharacter class\fR |
---|
202 | enclosed in \fB[:\fR and \fB:]\fR stands for the list of all |
---|
203 | characters (not all collating elements!) belonging to that class. |
---|
204 | Standard character classes are: |
---|
205 | .IP \fBalpha\fR 8 |
---|
206 | A letter. |
---|
207 | .IP \fBupper\fR 8 |
---|
208 | An upper-case letter. |
---|
209 | .IP \fBlower\fR 8 |
---|
210 | A lower-case letter. |
---|
211 | .IP \fBdigit\fR 8 |
---|
212 | A decimal digit. |
---|
213 | .IP \fBxdigit\fR 8 |
---|
214 | A hexadecimal digit. |
---|
215 | .IP \fBalnum\fR 8 |
---|
216 | An alphanumeric (letter or digit). |
---|
217 | .IP \fBprint\fR 8 |
---|
218 | A "printable" (same as graph, except also including space). |
---|
219 | .IP \fBblank\fR 8 |
---|
220 | A space or tab character. |
---|
221 | .IP \fBspace\fR 8 |
---|
222 | A character producing white space in displayed text. |
---|
223 | .IP \fBpunct\fR 8 |
---|
224 | A punctuation character. |
---|
225 | .IP \fBgraph\fR 8 |
---|
226 | A character with a visible representation (includes both alnum and punct). |
---|
227 | .IP \fBcntrl\fR 8 |
---|
228 | A control character. |
---|
229 | .PP |
---|
230 | A locale may provide others. A character class may not be used as an endpoint |
---|
231 | of a range. |
---|
232 | .RS |
---|
233 | .PP |
---|
234 | (\fINote:\fR the current Tcl implementation has only one locale, the Unicode |
---|
235 | locale, which supports exactly the above classes.) |
---|
236 | .RE |
---|
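The following sketch shows ranges, complemented lists, a literal hyphen, and character classes inside bracket expressions. It assumes the \fBregexp\fR command; the variable names are illustrative only.

```tcl
# A character class inside a bracket expression.
set digits   [regexp -all -inline {[[:digit:]]+} a12b345]   ;# 12 345

# A leading "^" complements the list.
set notDigit [regexp -inline {[^0-9]+} 12ab34]              ;# ab

# A "-" placed last in the list is taken literally.
set hyphen   [regexp {^[a-c-]+$} a-b]                       ;# 1

# A class may be combined with ordinary characters in one list.
set ident    [regexp {^[[:alpha:]_]+$} foo_bar]             ;# 1
```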
237 | .SS "BRACKETED CONSTRAINTS" |
---|
238 | There are two special cases of bracket expressions: the bracket |
---|
239 | expressions |
---|
240 | .QW \fB[[:<:]]\fR |
---|
241 | and |
---|
242 | .QW \fB[[:>:]]\fR |
---|
243 | are constraints, matching empty strings at the beginning and end of a word |
---|
244 | respectively. |
---|
245 | .\" note, discussion of escapes below references this definition of word |
---|
246 | A word is defined as a sequence of word characters that is neither preceded |
---|
247 | nor followed by word characters. A word character is an \fIalnum\fR character |
---|
248 | or an underscore |
---|
249 | .PQ \fB_\fR "" . |
---|
250 | These special bracket expressions are deprecated; users of AREs should use |
---|
251 | constraint escapes instead (see below). |
---|
252 | .SS "COLLATING ELEMENTS" |
---|
253 | Within a bracket expression, a collating element (a character, a |
---|
254 | multi-character sequence that collates as if it were a single |
---|
255 | character, or a collating-sequence name for either) enclosed in |
---|
256 | \fB[.\fR and \fB.]\fR stands for the sequence of characters of that |
---|
257 | collating element. The sequence is a single element of the bracket |
---|
258 | expression's list. A bracket expression in a locale that has |
---|
259 | multi-character collating elements can thus match more than one |
---|
260 | character. So (insidiously), a bracket expression that starts with |
---|
261 | \fB^\fR can match multi-character collating elements even if none of |
---|
262 | them appear in the bracket expression! |
---|
263 | .RS |
---|
264 | .PP |
---|
265 | (\fINote:\fR Tcl has no multi-character collating elements. This information |
---|
266 | is only for illustration.) |
---|
267 | .RE |
---|
268 | .PP |
---|
269 | For example, assume the collating sequence includes a \fBch\fR multi-character |
---|
270 | collating element. Then the RE |
---|
271 | .QW \fB[[.ch.]]*c\fR |
---|
272 | (zero or more |
---|
273 | .QW \fBch\fRs |
---|
274 | followed by |
---|
275 | .QW \fBc\fR ) |
---|
276 | matches the first five characters of |
---|
277 | .QW \fBchchcc\fR . |
---|
278 | Also, the RE |
---|
279 | .QW \fB[^c]b\fR |
---|
280 | matches all of |
---|
281 | .QW \fBchb\fR |
---|
282 | (because |
---|
283 | .QW \fB[^c]\fR |
---|
284 | matches the multi-character |
---|
285 | .QW \fBch\fR ). |
---|
286 | .SS "EQUIVALENCE CLASSES" |
---|
287 | Within a bracket expression, a collating element enclosed in \fB[=\fR |
---|
288 | and \fB=]\fR is an equivalence class, standing for the sequences of |
---|
289 | characters of all collating elements equivalent to that one, including |
---|
290 | itself. (If there are no other equivalent collating elements, the |
---|
291 | treatment is as if the enclosing delimiters were |
---|
292 | .QW \fB[.\fR \& |
---|
293 | and |
---|
294 | .QW \fB.]\fR .) |
---|
295 | For example, if \fBo\fR and \fB\N'244'\fR are the members of an |
---|
296 | equivalence class, then |
---|
297 | .QW \fB[[=o=]]\fR , |
---|
298 | .QW \fB[[=\N'244'=]]\fR , |
---|
299 | and |
---|
300 | .QW \fB[o\N'244']\fR \& |
---|
301 | are all synonymous. An equivalence class may not be an endpoint of a range. |
---|
302 | .RS |
---|
303 | .PP |
---|
304 | (\fINote:\fR Tcl implements only the Unicode locale. It does not define any |
---|
305 | equivalence classes. The examples above are just illustrations.) |
---|
306 | .RE |
---|
307 | .SH ESCAPES |
---|
308 | Escapes (AREs only), which begin with a \fB\e\fR followed by an |
---|
309 | alphanumeric character, come in several varieties: character entry, |
---|
310 | class shorthands, constraint escapes, and back references. A \fB\e\fR |
---|
311 | followed by an alphanumeric character but not constituting a valid |
---|
312 | escape is illegal in AREs. In EREs, there are no escapes: outside a |
---|
313 | bracket expression, a \fB\e\fR followed by an alphanumeric character |
---|
314 | merely stands for that character as an ordinary character, and inside |
---|
315 | a bracket expression, \fB\e\fR is an ordinary character. (The latter |
---|
316 | is the one actual incompatibility between EREs and AREs.) |
---|
317 | .SS "CHARACTER-ENTRY ESCAPES" |
---|
318 | Character-entry escapes (AREs only) exist to make it easier to specify |
---|
319 | non-printing and otherwise inconvenient characters in REs: |
---|
320 | .RS 2 |
---|
321 | .TP 5 |
---|
322 | \fB\ea\fR |
---|
323 | . |
---|
324 | alert (bell) character, as in C |
---|
325 | .TP |
---|
326 | \fB\eb\fR |
---|
327 | . |
---|
328 | backspace, as in C |
---|
329 | .TP |
---|
330 | \fB\eB\fR |
---|
331 | . |
---|
332 | synonym for \fB\e\fR to help reduce backslash doubling in some |
---|
333 | applications where there are multiple levels of backslash processing |
---|
334 | .TP |
---|
335 | \fB\ec\fIX\fR |
---|
336 | . |
---|
337 | (where \fIX\fR is any character) the character whose low-order 5 bits |
---|
338 | are the same as those of \fIX\fR, and whose other bits are all zero |
---|
339 | .TP |
---|
340 | \fB\ee\fR |
---|
341 | . |
---|
342 | the character whose collating-sequence name is |
---|
343 | .QW \fBESC\fR , |
---|
344 | or failing that, the character with octal value 033 |
---|
345 | .TP |
---|
346 | \fB\ef\fR |
---|
347 | . |
---|
348 | formfeed, as in C |
---|
349 | .TP |
---|
350 | \fB\en\fR |
---|
351 | . |
---|
352 | newline, as in C |
---|
353 | .TP |
---|
354 | \fB\er\fR |
---|
355 | . |
---|
356 | carriage return, as in C |
---|
357 | .TP |
---|
358 | \fB\et\fR |
---|
359 | . |
---|
360 | horizontal tab, as in C |
---|
361 | .TP |
---|
362 | \fB\eu\fIwxyz\fR |
---|
363 | . |
---|
364 | (where \fIwxyz\fR is exactly four hexadecimal digits) the Unicode |
---|
365 | character \fBU+\fIwxyz\fR in the local byte ordering |
---|
366 | .TP |
---|
367 | \fB\eU\fIstuvwxyz\fR |
---|
368 | . |
---|
369 | (where \fIstuvwxyz\fR is exactly eight hexadecimal digits) reserved |
---|
370 | for a somewhat-hypothetical Unicode extension to 32 bits |
---|
371 | .TP |
---|
372 | \fB\ev\fR |
---|
373 | . |
---|
374 | vertical tab, as in C |
---|
375 | .TP |
---|
376 | \fB\ex\fIhhh\fR |
---|
377 | . |
---|
378 | (where \fIhhh\fR is any sequence of hexadecimal digits) the character |
---|
379 | whose hexadecimal value is \fB0x\fIhhh\fR (a single character no |
---|
380 | matter how many hexadecimal digits are used). |
---|
381 | .TP |
---|
382 | \fB\e0\fR |
---|
383 | . |
---|
384 | the character whose value is \fB0\fR |
---|
385 | .TP |
---|
386 | \fB\e\fIxy\fR |
---|
387 | . |
---|
388 | (where \fIxy\fR is exactly two octal digits, and is not a \fIback |
---|
389 | reference\fR (see below)) the character whose octal value is |
---|
390 | \fB0\fIxy\fR |
---|
391 | .TP |
---|
392 | \fB\e\fIxyz\fR |
---|
393 | . |
---|
394 | (where \fIxyz\fR is exactly three octal digits, and is not a back |
---|
395 | reference (see below)) the character whose octal value is |
---|
396 | \fB0\fIxyz\fR |
---|
397 | .RE |
---|
398 | .PP |
---|
399 | Hexadecimal digits are |
---|
400 | .QR \fB0\fR \fB9\fR , |
---|
401 | .QR \fBa\fR \fBf\fR , |
---|
402 | and |
---|
403 | .QR \fBA\fR \fBF\fR . |
---|
404 | Octal digits are |
---|
405 | .QR \fB0\fR \fB7\fR . |
---|
406 | .PP |
---|
407 | The character-entry escapes are always taken as ordinary characters. |
---|
408 | For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does |
---|
409 | not terminate a bracket expression. Beware, however, that some |
---|
410 | applications (e.g., C compilers and the Tcl interpreter if the regular |
---|
411 | expression is not quoted with braces) interpret such sequences |
---|
412 | themselves before the regular-expression package gets to see them, |
---|
413 | which may require doubling (quadrupling, etc.) the |
---|
414 | .QW \fB\e\fR . |
---|
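As a small sketch of character-entry escapes, assuming the \fBregexp\fR command (variable names are illustrative only). The last line shows the backslash doubling needed when the RE is not quoted with braces:

```tcl
# Braced patterns reach the RE package unchanged, so the package
# itself interprets \x41, \u0041 and \t.
set hex [regexp {^\x41$} A]         ;# 1 (hexadecimal 41 is "A")
set uni [regexp {^\u0041$} A]       ;# 1 (U+0041 is "A")
set tab [regexp {a\tb} "a\tb"]      ;# 1

# In a double-quoted word the Tcl parser performs backslash substitution
# first, so "\\t" is what delivers \t to the RE package.
set doubled [regexp "a\\tb" "a\tb"] ;# 1
```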
415 | .SS "CLASS-SHORTHAND ESCAPES" |
---|
416 | Class-shorthand escapes (AREs only) provide shorthands for certain |
---|
417 | commonly-used character classes: |
---|
418 | .RS 2 |
---|
419 | .TP 10 |
---|
420 | \fB\ed\fR |
---|
421 | . |
---|
422 | \fB[[:digit:]]\fR |
---|
423 | .TP |
---|
424 | \fB\es\fR |
---|
425 | . |
---|
426 | \fB[[:space:]]\fR |
---|
427 | .TP |
---|
428 | \fB\ew\fR |
---|
429 | . |
---|
430 | \fB[[:alnum:]_]\fR (note underscore) |
---|
431 | .TP |
---|
432 | \fB\eD\fR |
---|
433 | . |
---|
434 | \fB[^[:digit:]]\fR |
---|
435 | .TP |
---|
436 | \fB\eS\fR |
---|
437 | . |
---|
438 | \fB[^[:space:]]\fR |
---|
439 | .TP |
---|
440 | \fB\eW\fR |
---|
441 | . |
---|
442 | \fB[^[:alnum:]_]\fR (note underscore) |
---|
443 | .RE |
---|
444 | .PP |
---|
445 | Within bracket expressions, |
---|
446 | .QW \fB\ed\fR , |
---|
447 | .QW \fB\es\fR , |
---|
448 | and |
---|
449 | .QW \fB\ew\fR \& |
---|
450 | lose their outer brackets, and |
---|
451 | .QW \fB\eD\fR , |
---|
452 | .QW \fB\eS\fR , |
---|
453 | and |
---|
454 | .QW \fB\eW\fR \& |
---|
455 | are illegal. (So, for example, |
---|
456 | .QW \fB[a-c\ed]\fR |
---|
457 | is equivalent to |
---|
458 | .QW \fB[a-c[:digit:]]\fR . |
---|
459 | Also, |
---|
460 | .QW \fB[a-c\eD]\fR , |
---|
461 | which is equivalent to |
---|
462 | .QW \fB[a-c^[:digit:]]\fR , |
---|
463 | is illegal.) |
---|
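A short sketch of the class-shorthand escapes, assuming the \fBregexp\fR command with its \fB\-all\fR and \fB\-inline\fR switches (variable names are illustrative only):

```tcl
# \d is shorthand for [[:digit:]].
set nums  [regexp -all -inline {\d+} a12b345]       ;# 12 345

# \w is shorthand for [[:alnum:]_], so the underscore is included.
set words [regexp -all -inline {\w+} "foo_1 bar"]   ;# foo_1 bar

# \S is shorthand for [^[:space:]].
set runs  [regexp -all -inline {\S+} " x  yz "]     ;# x yz

# Inside a bracket expression \d loses its outer brackets.
set mixed [regexp {^[a-c\d]+$} ab12]                ;# 1
```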
464 | .SS "CONSTRAINT ESCAPES" |
---|
465 | A constraint escape (AREs only) is a constraint, matching the empty |
---|
466 | string if specific conditions are met, written as an escape: |
---|
467 | .RS 2 |
---|
468 | .TP 6 |
---|
469 | \fB\eA\fR |
---|
470 | . |
---|
471 | matches only at the beginning of the string (see \fBMATCHING\fR, |
---|
472 | below, for how this differs from |
---|
473 | .QW \fB^\fR ) |
---|
474 | .TP |
---|
475 | \fB\em\fR |
---|
476 | . |
---|
477 | matches only at the beginning of a word |
---|
478 | .TP |
---|
479 | \fB\eM\fR |
---|
480 | . |
---|
481 | matches only at the end of a word |
---|
482 | .TP |
---|
483 | \fB\ey\fR |
---|
484 | . |
---|
485 | matches only at the beginning or end of a word |
---|
486 | .TP |
---|
487 | \fB\eY\fR |
---|
488 | . |
---|
489 | matches only at a point that is not the beginning or end of a word |
---|
490 | .TP |
---|
491 | \fB\eZ\fR |
---|
492 | . |
---|
493 | matches only at the end of the string (see \fBMATCHING\fR, below, for |
---|
494 | how this differs from |
---|
495 | .QW \fB$\fR ) |
---|
496 | .TP |
---|
497 | \fB\e\fIm\fR |
---|
498 | . |
---|
499 | (where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below |
---|
500 | .TP |
---|
501 | \fB\e\fImnn\fR |
---|
502 | . |
---|
503 | (where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits, |
---|
504 | and the decimal value \fImnn\fR is not greater than the number of |
---|
505 | closing capturing parentheses seen so far) a \fIback reference\fR, see |
---|
506 | below |
---|
507 | .RE |
---|
508 | .PP |
---|
509 | A word is defined as in the specification of |
---|
510 | .QW \fB[[:<:]]\fR |
---|
511 | and |
---|
512 | .QW \fB[[:>:]]\fR |
---|
513 | above. Constraint escapes are illegal within bracket expressions. |
---|
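The word-boundary constraint escapes can be sketched as follows, assuming the \fBregexp\fR command, where \fB\-all\fR without \fB\-inline\fR returns the number of matches (variable names are illustrative only):

```tcl
set text "cat concat"

# \m ... \M requires "cat" to be a whole word.
set whole  [regexp -all {\mcat\M} $text]   ;# 1

# \m alone: "cat" must start a word (only the first word qualifies).
set starts [regexp -all {\mcat} $text]     ;# 1

# \M alone: "cat" must end a word (both words qualify).
set ends   [regexp -all {cat\M} $text]     ;# 2

# \Y: "cat" must begin at a point that is not a word boundary.
set inner  [regexp -all {\Ycat} $text]     ;# 1
```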
514 | .SS "BACK REFERENCES" |
---|
515 | A back reference (AREs only) matches the same string matched by the |
---|
516 | parenthesized subexpression specified by the number, so that (e.g.) |
---|
517 | .QW \fB([bc])\e1\fR |
---|
518 | matches |
---|
519 | .QW \fBbb\fR |
---|
520 | or |
---|
521 | .QW \fBcc\fR |
---|
522 | but not |
---|
523 | .QW \fBbc\fR . |
---|
524 | The subexpression must entirely precede the back reference in the RE. |
---|
525 | Subexpressions are numbered in the order of their leading parentheses. |
---|
526 | Non-capturing parentheses do not define subexpressions. |
---|
527 | .PP |
---|
528 | There is an inherent historical ambiguity between octal |
---|
529 | character-entry escapes and back references, which is resolved by |
---|
530 | heuristics, as hinted at above. A leading zero always indicates an |
---|
531 | octal escape. A single non-zero digit, not followed by another digit, |
---|
532 | is always taken as a back reference. A multi-digit sequence not |
---|
533 | starting with a zero is taken as a back reference if it comes after a |
---|
534 | suitable subexpression (i.e. the number is in the legal range for a |
---|
535 | back reference), and otherwise is taken as octal. |
---|
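A minimal sketch of back references, assuming the \fBregexp\fR command (variable names are illustrative only):

```tcl
# \1 must repeat exactly what the first subexpression matched.
set bb [regexp {^([bc])\1$} bb]                        ;# 1
set bc [regexp {^([bc])\1$} bc]                        ;# 0

# Detecting a doubled word: the match is "the the", the capture is "the".
set dup [regexp -inline {(\w+) \1} "the the cat"]

# Subexpressions are numbered by their leading parentheses:
# 1 is "ab", 2 is "a", 3 is "b", so \3 requires a second "b".
set nested [regexp -inline {((a)(b))\3} abb]           ;# abb ab a b
```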
536 | .SH "METASYNTAX" |
---|
537 | In addition to the main syntax described above, there are some special |
---|
538 | forms and miscellaneous syntactic facilities available. |
---|
539 | .PP |
---|
540 | Normally the flavor of RE being used is specified by |
---|
541 | application-dependent means. However, this can be overridden by a |
---|
542 | \fIdirector\fR. If an RE of any flavor begins with |
---|
543 | .QW \fB***:\fR , |
---|
544 | the rest of the RE is an ARE. If an RE of any flavor begins with |
---|
545 | .QW \fB***=\fR , |
---|
546 | the rest of the RE is taken to be a literal string, with |
---|
547 | all characters considered ordinary characters. |
---|
548 | .PP |
---|
549 | An ARE may begin with \fIembedded options\fR: a sequence |
---|
550 | \fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic |
---|
551 | characters) specifies options affecting the rest of the RE. These |
---|
552 | supplement, and can override, any options specified by the |
---|
553 | application. The available option letters are: |
---|
554 | .RS 2 |
---|
555 | .TP 3 |
---|
556 | \fBb\fR |
---|
557 | . |
---|
558 | rest of RE is a BRE |
---|
559 | .TP 3 |
---|
560 | \fBc\fR |
---|
561 | . |
---|
562 | case-sensitive matching (usual default) |
---|
563 | .TP 3 |
---|
564 | \fBe\fR |
---|
565 | . |
---|
566 | rest of RE is an ERE |
---|
567 | .TP 3 |
---|
568 | \fBi\fR |
---|
569 | . |
---|
570 | case-insensitive matching (see \fBMATCHING\fR, below) |
---|
571 | .TP 3 |
---|
572 | \fBm\fR |
---|
573 | . |
---|
574 | historical synonym for \fBn\fR |
---|
575 | .TP 3 |
---|
576 | \fBn\fR |
---|
577 | . |
---|
578 | newline-sensitive matching (see \fBMATCHING\fR, below) |
---|
579 | .TP 3 |
---|
580 | \fBp\fR |
---|
581 | . |
---|
582 | partial newline-sensitive matching (see \fBMATCHING\fR, below) |
---|
583 | .TP 3 |
---|
584 | \fBq\fR |
---|
585 | . |
---|
586 | rest of RE is a literal |
---|
587 | .PQ quoted |
---|
588 | string, all ordinary characters |
---|
589 | .TP 3 |
---|
590 | \fBs\fR |
---|
591 | . |
---|
592 | non-newline-sensitive matching (usual default) |
---|
593 | .TP 3 |
---|
594 | \fBt\fR |
---|
595 | . |
---|
596 | tight syntax (usual default; see below) |
---|
597 | .TP 3 |
---|
598 | \fBw\fR |
---|
599 | . |
---|
600 | inverse partial newline-sensitive |
---|
601 | .PQ weird |
---|
602 | matching (see \fBMATCHING\fR, below) |
---|
603 | .TP 3 |
---|
604 | \fBx\fR |
---|
605 | . |
---|
606 | expanded syntax (see below) |
---|
607 | .RE |
---|
608 | .PP |
---|
609 | Embedded options take effect at the \fB)\fR terminating the sequence. |
---|
610 | They are available only at the start of an ARE, and may not be used |
---|
611 | later within it. |
---|
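Directors and embedded options can be sketched as follows, assuming the \fBregexp\fR command and its \fB\-nocase\fR switch (variable names are illustrative only):

```tcl
# The ***= director makes the rest of the RE a literal string,
# so "." matches only a period.
set lit1 [regexp {***=a.c} a.c]              ;# 1
set lit2 [regexp {***=a.c} abc]              ;# 0

# The embedded option (?i) requests case-insensitive matching,
# much like the application-level -nocase switch.
set embedded  [regexp {(?i)hello} HELLO]     ;# 1
set viaSwitch [regexp -nocase {hello} HELLO] ;# 1
set default   [regexp {hello} HELLO]         ;# 0
```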
612 | .PP |
---|
613 | In addition to the usual (\fItight\fR) RE syntax, in which all |
---|
614 | characters are significant, there is an \fIexpanded\fR syntax, |
---|
615 | available in all flavors of RE with the \fB\-expanded\fR switch, or in |
---|
616 | AREs with the embedded x option. In the expanded syntax, white-space |
---|
617 | characters are ignored and all characters between a \fB#\fR and the |
---|
618 | following newline (or the end of the RE) are ignored, permitting |
---|
619 | paragraphing and commenting a complex RE. There are three exceptions |
---|
620 | to that basic rule: |
---|
621 | .IP \(bu 3 |
---|
622 | a white-space character or |
---|
623 | .QW \fB#\fR |
---|
624 | preceded by |
---|
625 | .QW \fB\e\fR |
---|
626 | is retained |
---|
627 | .IP \(bu 3 |
---|
628 | white space or |
---|
629 | .QW \fB#\fR |
---|
630 | within a bracket expression is retained |
---|
631 | .IP \(bu 3 |
---|
632 | white space and comments are illegal within multi-character symbols |
---|
633 | like the ARE |
---|
634 | .QW \fB(?:\fR |
---|
635 | or the BRE |
---|
636 | .QW \fB\e(\fR |
---|
637 | .PP |
---|
638 | Expanded-syntax white-space characters are blank, tab, newline, and |
---|
639 | any character that belongs to the \fIspace\fR character class. |
---|
640 | .PP |
---|
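As a sketch of the expanded syntax, assuming the \fBregexp\fR command with the \fB\-expanded\fR switch (the pattern, input, and variable names are illustrative only):

```tcl
# White space and "#" comments are ignored, except inside a bracket
# expression or after a backslash.
set re {
    (\d{4})   # year
    -
    (\d{2})   # month
    [ ]       # a space inside a bracket expression is retained
    \#        # an escaped "#" is retained
}
set ok [regexp -expanded $re "2007-12 #" all year month]
# ok is 1, year is 2007, month is 12
```

The same pattern in tight syntax would be \fB(\ed{4})\-(\ed{2})[ ]\e#\fR.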
641 | Finally, in an ARE, outside bracket expressions, the sequence |
---|
642 | .QW \fB(?#\fIttt\fB)\fR |
---|
643 | (where \fIttt\fR is any text not containing a |
---|
644 | .QW \fB)\fR ) |
---|
645 | is a comment, completely ignored. Again, this is not |
---|
646 | allowed between the characters of multi-character symbols like |
---|
647 | .QW \fB(?:\fR . |
---|
648 | Such comments are more a historical artifact than a useful facility, |
---|
649 | and their use is deprecated; use the expanded syntax instead. |
---|
650 | .PP |
---|
651 | \fINone\fR of these metasyntax extensions is available if the |
---|
652 | application (or an initial |
---|
653 | .QW \fB***=\fR |
---|
654 | director) has specified that the |
---|
655 | user's input be treated as a literal string rather than as an RE. |
---|
656 | .SH MATCHING |
---|
657 | In the event that an RE could match more than one substring of a given |
---|
658 | string, the RE matches the one starting earliest in the string. If |
---|
659 | the RE could match more than one substring starting at that point, its |
---|
660 | choice is determined by its \fIpreference\fR: either the longest |
---|
661 | substring, or the shortest. |
---|
662 | .PP |
---|
663 | Most atoms, and all constraints, have no preference. A parenthesized |
---|
664 | RE has the same preference (possibly none) as the RE. A quantified |
---|
665 | atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same |
---|
666 | preference (possibly none) as the atom itself. A quantified atom with |
---|
667 | other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with |
---|
668 | \fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom |
---|
669 | with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR |
---|
670 | with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has |
---|
671 | the same preference as the first quantified atom in it which has a |
---|
672 | preference. An RE consisting of two or more branches connected by the |
---|
673 | \fB|\fR operator prefers longest match. |
---|
674 | .PP |
---|
675 | Subject to the constraints imposed by the rules for matching the whole |
---|
676 | RE, subexpressions also match the longest or shortest possible |
---|
677 | substrings, based on their preferences, with subexpressions starting |
---|
678 | earlier in the RE taking priority over ones starting later. Note that |
---|
679 | outer subexpressions thus take priority over their component |
---|
680 | subexpressions. |
---|
681 | .PP |
---|
682 | Note that the quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to |
---|
683 | force longest and shortest preference, respectively, on a |
---|
684 | subexpression or a whole RE. |
---|
685 | .PP |
---|
686 | Match lengths are measured in characters, not collating elements. An |
---|
687 | empty string is considered longer than no match at all. For example, |
---|
688 | .QW \fBbb*\fR |
---|
689 | matches the three middle characters of |
---|
690 | .QW \fBabbbc\fR , |
---|
691 | .QW \fB(week|wee)(night|knights)\fR |
---|
692 | matches all ten characters of |
---|
693 | .QW \fBweeknights\fR , |
---|
694 | when |
---|
695 | .QW \fB(.*).*\fR |
---|
696 | is matched against |
---|
697 | .QW \fBabc\fR |
---|
698 | the parenthesized subexpression matches all three characters, and when |
---|
699 | .QW \fB(a*)*\fR |
---|
700 | is matched against |
---|
701 | .QW \fBbc\fR |
---|
702 | both the whole RE and the parenthesized subexpression match an empty string. |
---|
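Some of the examples above can be reproduced with a short sketch, assuming the \fBregexp\fR command and its \fB\-inline\fR switch (variable names are illustrative only):

```tcl
# Earliest start, then longest match: the three middle characters.
set m1 [regexp -inline {bb*} abbbc]      ;# bbb

# The whole RE prefers the longest match, so "wee" + "knights"
# (ten characters) wins over "week" + "night" (nine characters).
set m2 [regexp -inline {(week|wee)(night|knights)} weeknights]

# The earlier subexpression takes priority and matches all of "abc".
set m3 [regexp -inline {(.*).*} abc]     ;# abc abc

# A non-greedy quantifier prefers the shortest match.
set m4 [regexp -inline {b+?} abbbc]      ;# b
```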
703 | .PP |
---|
704 | If case-independent matching is specified, the effect is much as if |
---|
705 | all case distinctions had vanished from the alphabet. When an |
---|
706 | alphabetic character that exists in multiple cases appears as an ordinary |
---|
707 | character outside a bracket expression, it is effectively transformed |
---|
708 | into a bracket expression containing both cases, so that \fBx\fR |
---|
709 | becomes |
---|
710 | .QW \fB[xX]\fR . |
---|
711 | When it appears inside a bracket expression, |
---|
712 | all case counterparts of it are added to the bracket expression, so |
---|
713 | that |
---|
714 | .QW \fB[x]\fR |
---|
715 | becomes |
---|
716 | .QW \fB[xX]\fR |
---|
717 | and |
---|
718 | .QW \fB[^x]\fR |
---|
719 | becomes |
---|
720 | .QW \fB[^xX]\fR . |
---|
.PP
If newline-sensitive matching is specified, \fB.\fR and bracket
expressions using \fB^\fR will never match the newline character (so
that matches will never cross newlines unless the RE explicitly
arranges it) and \fB^\fR and \fB$\fR will match the empty string after
and before a newline respectively, in addition to matching at
beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
continue to match beginning or end of string \fIonly\fR.
.PP
If partial newline-sensitive matching is specified, this affects
\fB.\fR and bracket expressions as with newline-sensitive matching,
but not \fB^\fR and \fB$\fR.
.PP
If inverse partial newline-sensitive matching is specified, this
affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
not \fB.\fR and bracket expressions. This is not very useful but is
provided for symmetry.
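.PP
As an illustration, the sketch below assumes the \fB\-line\fR,
\fB\-linestop\fR, and \fB\-lineanchor\fR switches of \fBregexp\fR
(see \fBregexp\fR(n)) are used to select newline-sensitive, partial
newline-sensitive, and inverse partial newline-sensitive matching
respectively; the variable names and sample text are hypothetical.
.CS
set text "first line\ensecond line"

# Default: ^ matches only at the start of the string.
regexp {^s.*$} $text                      ;# returns 0

# Newline-sensitive: ^ also matches just after the newline,
# and . cannot cross it.
regexp \-line {^s.*$} $text whole          ;# returns 1
puts $whole                               ;# prints "second line"

# \eA still matches only at the very beginning of the string.
regexp \-line {\eAs} $text                 ;# returns 0

# Partial newline-sensitive: . stops at the newline, but ^ and $
# keep their default meaning.
regexp \-linestop {^s.*$} $text            ;# returns 0
regexp \-linestop {^f.*} $text whole       ;# returns 1
puts $whole                               ;# prints "first line"

# Inverse partial newline-sensitive: ^ and $ match at line
# boundaries, but . may still cross the newline.
regexp \-lineanchor {^s.*$} $text          ;# returns 1
.CE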
.SH "LIMITS AND COMPATIBILITY"
No particular limit is imposed on the length of REs. Programs
intended to be highly portable should not employ REs longer than 256
bytes, as a POSIX-compliant implementation can refuse to accept such
REs.
.PP
The only feature of AREs that is actually incompatible with POSIX EREs
is that \fB\e\fR does not lose its special significance inside bracket
expressions. All other ARE features use syntax which is illegal or
has undefined or unspecified effects in POSIX EREs; the \fB***\fR
syntax of directors likewise is outside the POSIX syntax for both BREs
and EREs.
.PP
Many of the ARE extensions are borrowed from Perl, but some have been
changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include
.QW \fB\eb\fR ,
.QW \fB\eB\fR ,
the lack of special treatment for a trailing newline, the addition of
complemented bracket expressions to the things affected by
newline-sensitive matching, the restrictions on parentheses and back
references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.
.PP
The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this
package. (The new rules are much simpler and cleaner, but do not work
as hard at guessing the user's real intentions.)
.PP
Henry Spencer's original 1986 \fIregexp\fR package, still in
widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
early version of today's EREs. There are four incompatibilities
between \fIregexp\fR's near-EREs
.PQ RREs " for short"
and AREs. In roughly increasing order of significance:
.IP \(bu 3
In AREs, \fB\e\fR followed by an alphanumeric character is either an
escape or an error, while in RREs, it was just another way of writing
the alphanumeric. This should not be a problem because there was no
reason to write such a sequence in RREs.
.IP \(bu 3
\fB{\fR followed by a digit in an ARE is the beginning of a bound,
while in RREs, \fB{\fR was always an ordinary character. Such
sequences should be rare, and will often result in an error because
following characters will not look like a valid bound.
.IP \(bu 3
In AREs, \fB\e\fR remains a special character within
.QW \fB[\|]\fR ,
so a literal \fB\e\fR within \fB[\|]\fR must be written
.QW \fB\e\e\fR .
\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
but only truly paranoid programmers routinely doubled the backslash.
.IP \(bu 3
AREs report the longest/shortest match for the RE, rather than the
first found in a specified search order. This may affect some RREs
which were written in the expectation that the first match would be
reported. (The careful crafting of RREs to optimize the search order
for fast matching is obsolete, since AREs examine all possible matches
in parallel and their performance is largely insensitive to their
complexity, but cases where the search order was exploited to
deliberately find a match which was \fInot\fR the longest/shortest
will need rewriting.)
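.PP
The last two incompatibilities can be seen in a short sketch using
\fBregexp\fR with its default ARE flavor; the sample strings and the
variable name \fBm\fR are illustrative only.
.CS
# AREs report the longest match, not the first alternative
# that happens to succeed.
regexp {a|ab} "abc" m
puts $m                        ;# prints "ab"

# Inside a bracket expression, \e is still special in AREs:
# \ed keeps its meaning as a digit class ...
regexp {^[\ed]+$} "2024"        ;# returns 1
regexp {^[\ed]+$} "dd"          ;# returns 0

# ... so a literal backslash must be doubled.
regexp {^[\e\e]$} "\e\e"          ;# returns 1
.CE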
.SH "BASIC REGULAR EXPRESSIONS"
BREs differ from EREs in several respects.
.QW \fB|\fR ,
.QW \fB+\fR ,
and \fB?\fR are ordinary characters and there is no equivalent for their
functionality. The delimiters for bounds are \fB\e{\fR and
.QW \fB\e}\fR ,
with \fB{\fR and \fB}\fR by themselves ordinary characters. The
parentheses for nested subexpressions are \fB\e(\fR and
.QW \fB\e)\fR ,
with \fB(\fR and \fB)\fR by themselves ordinary
characters. \fB^\fR is an ordinary character except at the beginning
of the RE or the beginning of a parenthesized subexpression, \fB$\fR
is an ordinary character except at the end of the RE or the end of a
parenthesized subexpression, and \fB*\fR is an ordinary character if
it appears at the beginning of the RE or the beginning of a
parenthesized subexpression (after a possible leading
.QW \fB^\fR ).
Finally, single-digit back references are available, and \fB\e<\fR and
\fB\e>\fR are synonyms for
.QW \fB[[:<:]]\fR
and
.QW \fB[[:>:]]\fR
respectively; no other escapes are available.
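.PP
A minimal sketch of these differences, assuming the BRE flavor is
selected with the \fB(?b)\fR embedded option at the start of the RE;
the sample strings and variable names are illustrative only.
.CS
# | and + are ordinary characters in a BRE.
regexp {(?b)^a|b+$} "a|b+"                    ;# returns 1
regexp {(?b)^a|b+$} "a"                       ;# returns 0

# Subexpressions are written \e( \e) and bounds \e{ \e}.
regexp {(?b)^\e(ab\e)\e{2\e}$} "abab" whole sub   ;# returns 1
puts $sub                                     ;# prints "ab"

# Bare parentheses and braces match themselves.
regexp {(?b)^(a{2})$} "(a{2})"                ;# returns 1

# Single-digit back references are available.
regexp {(?b)^\e(a*\e)b\e1$} "aabaa"              ;# returns 1
.CE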
.SH "SEE ALSO"
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
.SH KEYWORDS
match, regular expression, string
.\" Local Variables:
.\" mode: nroff
.\" End: