| 1 | '\" |
|---|
| 2 | '\" Copyright (c) 1998 Sun Microsystems, Inc. |
|---|
| 3 | '\" Copyright (c) 1999 Scriptics Corporation |
|---|
| 4 | '\" |
|---|
| 5 | '\" See the file "license.terms" for information on usage and redistribution |
|---|
| 6 | '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES. |
|---|
| 7 | '\" |
|---|
| 8 | '\" RCS: @(#) $Id: re_syntax.n,v 1.18 2007/12/13 15:22:33 dgp Exp $ |
|---|
| 9 | '\" |
|---|
| 10 | .so man.macros |
|---|
| 11 | .TH re_syntax n "8.1" Tcl "Tcl Built-In Commands" |
|---|
| 12 | .BS |
|---|
| 13 | .SH NAME |
|---|
| 14 | re_syntax \- Syntax of Tcl regular expressions |
|---|
| 15 | .BE |
|---|
| 16 | .SH DESCRIPTION |
|---|
| 17 | .PP |
|---|
| 18 | A \fIregular expression\fR describes strings of characters. |
|---|
| 19 | It's a pattern that matches certain strings and does not match others. |
|---|
| 20 | .SH "DIFFERENT FLAVORS OF REs" |
|---|
| 21 | Regular expressions |
|---|
| 22 | .PQ RE s , |
|---|
| 23 | as defined by POSIX, come in two flavors: \fIextended\fR REs |
|---|
| 24 | .PQ ERE s |
|---|
| 25 | and \fIbasic\fR REs |
|---|
| 26 | .PQ BRE s . |
|---|
| 27 | EREs are roughly those of the traditional \fIegrep\fR, while BREs are |
|---|
| 28 | roughly those of the traditional \fIed\fR. This implementation adds |
|---|
| 29 | a third flavor, \fIadvanced\fR REs |
|---|
| 30 | .PQ ARE s , |
|---|
| 31 | basically EREs with some significant extensions. |
|---|
| 32 | .PP |
|---|
| 33 | This manual page primarily describes AREs. BREs mostly exist for |
|---|
| 34 | backward compatibility in some old programs; they will be discussed at |
|---|
| 35 | the end. POSIX EREs are almost an exact subset of AREs. Features of |
|---|
| 36 | AREs that are not present in EREs will be indicated. |
|---|
| 37 | .SH "REGULAR EXPRESSION SYNTAX" |
|---|
| 38 | .PP |
|---|
| 39 | Tcl regular expressions are implemented using the package written by |
|---|
| 40 | Henry Spencer, based on the 1003.2 spec and some (not quite all) of |
|---|
| 41 | the Perl5 extensions (thanks, Henry!). Much of the description of |
|---|
| 42 | regular expressions below is copied verbatim from his manual entry. |
|---|
| 43 | .PP |
|---|
| 44 | An ARE is one or more \fIbranches\fR, |
|---|
| 45 | separated by |
|---|
| 46 | .QW \fB|\fR , |
|---|
| 47 | matching anything that matches any of the branches. |
|---|
| 48 | .PP |
|---|
| 49 | A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR, |
|---|
| 50 | concatenated. |
|---|
| 51 | It matches a match for the first, followed by a match for the second, and so on; |
|---|
| 52 | an empty branch matches the empty string. |
|---|
| 53 | .SS QUANTIFIERS |
|---|
| 54 | A quantified atom is an \fIatom\fR possibly followed |
|---|
| 55 | by a single \fIquantifier\fR. |
|---|
| 56 | Without a quantifier, it matches a single match for the atom. |
|---|
| 57 | The quantifiers, |
|---|
| 58 | and what a so-quantified atom matches, are: |
|---|
| 59 | .RS 2 |
|---|
| 60 | .TP 6 |
|---|
| 61 | \fB*\fR |
|---|
| 62 | . |
|---|
| 63 | a sequence of 0 or more matches of the atom |
|---|
| 64 | .TP |
|---|
| 65 | \fB+\fR |
|---|
| 66 | . |
|---|
| 67 | a sequence of 1 or more matches of the atom |
|---|
| 68 | .TP |
|---|
| 69 | \fB?\fR |
|---|
| 70 | . |
|---|
| 71 | a sequence of 0 or 1 matches of the atom |
|---|
| 72 | .TP |
|---|
| 73 | \fB{\fIm\fB}\fR |
|---|
| 74 | . |
|---|
| 75 | a sequence of exactly \fIm\fR matches of the atom |
|---|
| 76 | .TP |
|---|
| 77 | \fB{\fIm\fB,}\fR |
|---|
| 78 | . |
|---|
| 79 | a sequence of \fIm\fR or more matches of the atom |
|---|
| 80 | .TP |
|---|
| 81 | \fB{\fIm\fB,\fIn\fB}\fR |
|---|
| 82 | . |
|---|
| 83 | a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom; |
|---|
| 84 | \fIm\fR may not exceed \fIn\fR |
|---|
| 85 | .TP |
|---|
| 86 | \fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR |
|---|
| 87 | . |
|---|
| 88 | \fInon-greedy\fR quantifiers, which match the same possibilities, |
|---|
| 89 | but prefer the smallest number rather than the largest number |
|---|
| 90 | of matches (see \fBMATCHING\fR) |
|---|
| 91 | .RE |
|---|
| 92 | .PP |
|---|
| 93 | The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The |
|---|
| 94 | numbers \fIm\fR and \fIn\fR are unsigned decimal integers with |
|---|
| 95 | permissible values from 0 to 255 inclusive. |
|---|
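The difference between greedy and non-greedy quantifiers, and the effect of a bound, can be sketched with Tcl's \fBregexp\fR command. This is an illustrative sketch only: it assumes the standard \fB\-inline\fR switch, and the variable names and sample strings are made up for the example.

```tcl
set s {<b>x</b>}

# Greedy: .+ takes as many characters as it can before the final >.
set greedy [lindex [regexp -inline {<.+>} $s] 0]            ;# <b>x</b>

# Non-greedy: .+? takes as few characters as it can.
set lazy [lindex [regexp -inline {<.+?>} $s] 0]             ;# <b>

# Bounds: between 2 and 3 repetitions, longest versus shortest.
set bound [lindex [regexp -inline {a{2,3}} caaaab] 0]       ;# aaa
set lazyBound [lindex [regexp -inline {a{2,3}?} caaaab] 0]  ;# aa

puts [list $greedy $lazy $bound $lazyBound]
```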
| 96 | .SS ATOMS |
|---|
| 97 | An atom is one of: |
|---|
| 98 | .RS 2 |
|---|
| 99 | .IP \fB(\fIre\fB)\fR 6 |
|---|
| 100 | matches a match for \fIre\fR (\fIre\fR is any regular expression) with |
|---|
| 101 | the match noted for possible reporting |
|---|
| 102 | .IP \fB(?:\fIre\fB)\fR |
|---|
| 103 | as previous, but does no reporting (a |
|---|
| 104 | .QW non-capturing |
|---|
| 105 | set of parentheses) |
|---|
| 106 | .IP \fB()\fR |
|---|
| 107 | matches an empty string, noted for possible reporting |
|---|
| 108 | .IP \fB(?:)\fR |
|---|
| 109 | matches an empty string, without reporting |
|---|
| 110 | .IP \fB[\fIchars\fB]\fR |
|---|
| 111 | a \fIbracket expression\fR, matching any one of the \fIchars\fR (see |
|---|
| 112 | \fBBRACKET EXPRESSIONS\fR for more detail) |
|---|
| 113 | .IP \fB.\fR |
|---|
| 114 | matches any single character |
|---|
| 115 | .IP \fB\e\fIk\fR |
|---|
| 116 | matches the non-alphanumeric character \fIk\fR |
|---|
| 117 | taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash |
|---|
| 118 | character |
|---|
| 119 | .IP \fB\e\fIc\fR |
|---|
| 120 | where \fIc\fR is alphanumeric (possibly followed by other characters), |
|---|
| 121 | an \fIescape\fR (AREs only), see \fBESCAPES\fR below |
|---|
| 122 | .IP \fB{\fR |
|---|
| 123 | when followed by a character other than a digit, matches the |
|---|
| 124 | left-brace character |
|---|
| 125 | .QW \fB{\fR ; |
|---|
| 126 | when followed by a digit, it is the beginning of a \fIbound\fR (see above) |
|---|
| 127 | .IP \fIx\fR |
|---|
| 128 | where \fIx\fR is a single character with no other significance, |
|---|
| 129 | matches that character. |
|---|
| 130 | .RE |
|---|
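A short sketch of how capturing and non-capturing parentheses differ in what is reported, and of a backslash turning a special character into an ordinary one. It assumes the standard \fBregexp\fR command with its \fB\-inline\fR switch; the variable names and inputs are illustrative.

```tcl
# (a|b) is reported as a submatch; (?:c|d) is matched but not reported.
set reported [regexp -inline {(a|b)(?:c|d)} xbd]   ;# bd b

# \. is an ordinary dot, so it does not match an arbitrary character.
set literalDot [regexp {^a\.b$} a.b]               ;# 1
set otherChar  [regexp {^a\.b$} axb]               ;# 0

# An unescaped . matches any single character.
set anyChar [regexp {^a.b$} axb]                   ;# 1

puts [list $reported $literalDot $otherChar $anyChar]
```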
| 131 | .SS CONSTRAINTS |
|---|
| 132 | A \fIconstraint\fR matches an empty string when specific conditions |
|---|
| 133 | are met. A constraint may not be followed by a quantifier. The |
|---|
| 134 | simple constraints are as follows; some more constraints are described |
|---|
| 135 | later, under \fBESCAPES\fR. |
|---|
| 136 | .RS 2 |
|---|
| 137 | .TP 8 |
|---|
| 138 | \fB^\fR |
|---|
| 139 | . |
|---|
| 140 | matches at the beginning of a line |
|---|
| 141 | .TP |
|---|
| 142 | \fB$\fR |
|---|
| 143 | . |
|---|
| 144 | matches at the end of a line |
|---|
| 145 | .TP |
|---|
| 146 | \fB(?=\fIre\fB)\fR |
|---|
| 147 | . |
|---|
| 148 | \fIpositive lookahead\fR (AREs only), matches at any point where a |
|---|
| 149 | substring matching \fIre\fR begins |
|---|
| 150 | .TP |
|---|
| 151 | \fB(?!\fIre\fB)\fR |
|---|
| 152 | . |
|---|
| 153 | \fInegative lookahead\fR (AREs only), matches at any point where no |
|---|
| 154 | substring matching \fIre\fR begins |
|---|
| 155 | .RE |
|---|
| 156 | .PP |
|---|
| 157 | The lookahead constraints may not contain back references (see later), |
|---|
| 158 | and all parentheses within them are considered non-capturing. |
|---|
| 159 | .PP |
|---|
| 160 | An RE may not end with |
|---|
| 161 | .QW \fB\e\fR . |
|---|
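The lookahead constraints can be illustrated as follows. This is a sketch under the assumption that the standard \fBregexp\fR switches \fB\-inline\fR and \fB\-indices\fR are available; the sample strings are invented for the example. Note that the lookahead text itself is never part of the reported match.

```tcl
# Positive lookahead: "foo" only where "bar" follows; "bar" is not consumed.
set pos [lindex [regexp -inline {foo(?=bar)} foobar] 0]    ;# foo

# Negative lookahead: the first "foo" is followed by "bar", so it is skipped
# and the second "foo" (characters 7 through 9) is the one that matches.
set neg [lindex [regexp -indices -inline {foo(?!bar)} {foobar foobaz}] 0]  ;# 7 9

# No match at all when every "foo" is followed by "bar".
set none [regexp {foo(?!bar)} foobar]                      ;# 0

puts [list $pos $neg $none]
```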
| 162 | .SH "BRACKET EXPRESSIONS" |
|---|
| 163 | A \fIbracket expression\fR is a list of characters enclosed in |
|---|
| 164 | .QW \fB[\|]\fR . |
|---|
| 165 | It normally matches any single character from the list |
|---|
| 166 | (but see below). If the list begins with |
|---|
| 167 | .QW \fB^\fR , |
|---|
| 168 | it matches any single character (but see below) \fInot\fR from the |
|---|
| 169 | rest of the list. |
|---|
| 170 | .PP |
|---|
| 171 | If two characters in the list are separated by |
|---|
| 172 | .QW \fB\-\fR , |
|---|
| 173 | this is shorthand for the full \fIrange\fR of characters between those two |
|---|
| 174 | (inclusive) in the collating sequence, e.g. |
|---|
| 175 | .QW \fB[0\-9]\fR |
|---|
| 176 | in Unicode matches any conventional decimal digit. Two ranges may not share an |
|---|
| 177 | endpoint, so e.g. |
|---|
| 178 | .QW \fBa\-c\-e\fR |
|---|
| 179 | is illegal. Ranges in Tcl always use the |
|---|
| 180 | Unicode collating sequence, but other programs may use other collating |
|---|
| 181 | sequences and this can be a source of incompatibility between programs. |
|---|
| 182 | .PP |
|---|
| 183 | To include a literal \fB]\fR or \fB\-\fR in the list, the simplest |
|---|
| 184 | method is to enclose it in \fB[.\fR and \fB.]\fR to make it a |
|---|
| 185 | collating element (see below). Alternatively, make it the first |
|---|
| 186 | character (following a possible |
|---|
| 187 | .QW \fB^\fR ), |
|---|
| 188 | or (AREs only) precede it with |
|---|
| 189 | .QW \fB\e\fR . |
|---|
| 190 | Alternatively, for |
|---|
| 191 | .QW \fB\-\fR , |
|---|
| 192 | make it the last character, or the second endpoint of a range. To use |
|---|
| 193 | a literal \fB\-\fR as the first endpoint of a range, make it a |
|---|
| 194 | collating element or (AREs only) precede it with |
|---|
| 195 | .QW \fB\e\fR . |
|---|
| 196 | With the exception of |
|---|
| 197 | these, some combinations using \fB[\fR (see next paragraphs), and |
|---|
| 198 | escapes, all other special characters lose their special significance |
|---|
| 199 | within a bracket expression. |
|---|
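The rules above for ranges, negation, and literal \fB]\fR and \fB\-\fR can be tried out with a few small patterns. This is a minimal sketch that assumes the standard \fBregexp\fR switches \fB\-all\fR and \fB\-inline\fR; the inputs are made up.

```tcl
# A range: every run of decimal digits.
set digits [regexp -all -inline {[0-9]+} a12b345]             ;# 12 345

# A negated list: the first run of characters other than a, b, or c.
set notAbc [lindex [regexp -inline {[^a-c]+} abcxyzabc] 0]    ;# xyz

# A literal ] placed first in the list.
set close [lindex [regexp -inline {[]x]+} {a]x]b}] 0]         ;# ]x]

# A literal - placed last in the list.
set dash [lindex [regexp -inline {[a-]+} b-a-c] 0]            ;# -a-

puts [list $digits $notAbc $close $dash]
```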
| 200 | .SS "CHARACTER CLASSES" |
|---|
| 201 | Within a bracket expression, the name of a \fIcharacter class\fR |
|---|
| 202 | enclosed in \fB[:\fR and \fB:]\fR stands for the list of all |
|---|
| 203 | characters (not all collating elements!) belonging to that class. |
|---|
| 204 | Standard character classes are: |
|---|
| 205 | .IP \fBalpha\fR 8 |
|---|
| 206 | A letter. |
|---|
| 207 | .IP \fBupper\fR 8 |
|---|
| 208 | An upper-case letter. |
|---|
| 209 | .IP \fBlower\fR 8 |
|---|
| 210 | A lower-case letter. |
|---|
| 211 | .IP \fBdigit\fR 8 |
|---|
| 212 | A decimal digit. |
|---|
| 213 | .IP \fBxdigit\fR 8 |
|---|
| 214 | A hexadecimal digit. |
|---|
| 215 | .IP \fBalnum\fR 8 |
|---|
| 216 | An alphanumeric (letter or digit). |
|---|
| 217 | .IP \fBprint\fR 8 |
|---|
| 218 | A "printable" (same as graph, except also including space). |
|---|
| 219 | .IP \fBblank\fR 8 |
|---|
| 220 | A space or tab character. |
|---|
| 221 | .IP \fBspace\fR 8 |
|---|
| 222 | A character producing white space in displayed text. |
|---|
| 223 | .IP \fBpunct\fR 8 |
|---|
| 224 | A punctuation character. |
|---|
| 225 | .IP \fBgraph\fR 8 |
|---|
| 226 | A character with a visible representation (includes both alnum and punct). |
|---|
| 227 | .IP \fBcntrl\fR 8 |
|---|
| 228 | A control character. |
|---|
| 229 | .PP |
|---|
| 230 | A locale may provide others. A character class may not be used as an endpoint |
|---|
| 231 | of a range. |
|---|
| 232 | .RS |
|---|
| 233 | .PP |
|---|
| 234 | (\fINote:\fR the current Tcl implementation has only one locale, the Unicode |
|---|
| 235 | locale, which supports exactly the above classes.) |
|---|
| 236 | .RE |
|---|
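As an illustration, character classes can be used alone or combined with each other inside one bracket expression. The sketch below assumes the standard \fBregexp\fR switches \fB\-all\fR and \fB\-inline\fR; the sample text is arbitrary.

```tcl
set text {abc 123 def}

# Runs of letters and runs of digits.
set words [regexp -all -inline {[[:alpha:]]+} $text]     ;# abc def
set nums  [regexp -all -inline {[[:digit:]]+} $text]     ;# 123

# Two classes in the same bracket expression: upper-case letters or digits.
set mixed [regexp -all -inline {[[:upper:][:digit:]]+} {ab CD12 ef}]  ;# CD12

puts [list $words $nums $mixed]
```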
| 237 | .SS "BRACKETED CONSTRAINTS" |
|---|
| 238 | There are two special cases of bracket expressions: the bracket |
|---|
| 239 | expressions |
|---|
| 240 | .QW \fB[[:<:]]\fR |
|---|
| 241 | and |
|---|
| 242 | .QW \fB[[:>:]]\fR |
|---|
| 243 | are constraints, matching empty strings at the beginning and end of a word |
|---|
| 244 | respectively. |
|---|
| 245 | .\" note, discussion of escapes below references this definition of word |
|---|
| 246 | A word is defined as a sequence of word characters that is neither preceded |
|---|
| 247 | nor followed by word characters. A word character is an \fIalnum\fR character |
|---|
| 248 | or an underscore |
|---|
| 249 | .PQ \fB_\fR "" . |
|---|
| 250 | These special bracket expressions are deprecated; users of AREs should use |
|---|
| 251 | constraint escapes instead (see below). |
|---|
| 252 | .SS "COLLATING ELEMENTS" |
|---|
| 253 | Within a bracket expression, a collating element (a character, a |
|---|
| 254 | multi-character sequence that collates as if it were a single |
|---|
| 255 | character, or a collating-sequence name for either) enclosed in |
|---|
| 256 | \fB[.\fR and \fB.]\fR stands for the sequence of characters of that |
|---|
| 257 | collating element. The sequence is a single element of the bracket |
|---|
| 258 | expression's list. A bracket expression in a locale that has |
|---|
| 259 | multi-character collating elements can thus match more than one |
|---|
| 260 | character. So (insidiously), a bracket expression that starts with |
|---|
| 261 | \fB^\fR can match multi-character collating elements even if none of |
|---|
| 262 | them appear in the bracket expression! |
|---|
| 263 | .RS |
|---|
| 264 | .PP |
|---|
| 265 | (\fINote:\fR Tcl has no multi-character collating elements. This information |
|---|
| 266 | is only for illustration.) |
|---|
| 267 | .RE |
|---|
| 268 | .PP |
|---|
| 269 | For example, assume the collating sequence includes a \fBch\fR multi-character |
|---|
| 270 | collating element. Then the RE |
|---|
| 271 | .QW \fB[[.ch.]]*c\fR |
|---|
| 272 | (zero or more |
|---|
| 273 | .QW \fBch\fRs |
|---|
| 274 | followed by |
|---|
| 275 | .QW \fBc\fR ) |
|---|
| 276 | matches the first five characters of |
|---|
| 277 | .QW \fBchchcc\fR . |
|---|
| 278 | Also, the RE |
|---|
| 279 | .QW \fB[^c]b\fR |
|---|
| 280 | matches all of |
|---|
| 281 | .QW \fBchb\fR |
|---|
| 282 | (because |
|---|
| 283 | .QW \fB[^c]\fR |
|---|
| 284 | matches the multi-character |
|---|
| 285 | .QW \fBch\fR ). |
|---|
| 286 | .SS "EQUIVALENCE CLASSES" |
|---|
| 287 | Within a bracket expression, a collating element enclosed in \fB[=\fR |
|---|
| 288 | and \fB=]\fR is an equivalence class, standing for the sequences of |
|---|
| 289 | characters of all collating elements equivalent to that one, including |
|---|
| 290 | itself. (If there are no other equivalent collating elements, the |
|---|
| 291 | treatment is as if the enclosing delimiters were |
|---|
| 292 | .QW \fB[.\fR \& |
|---|
| 293 | and |
|---|
| 294 | .QW \fB.]\fR .) |
|---|
| 295 | For example, if \fBo\fR and \fB\N'244'\fR are the members of an |
|---|
| 296 | equivalence class, then |
|---|
| 297 | .QW \fB[[=o=]]\fR , |
|---|
| 298 | .QW \fB[[=\N'244'=]]\fR , |
|---|
| 299 | and |
|---|
| 300 | .QW \fB[o\N'244']\fR \& |
|---|
| 301 | are all synonymous. An equivalence class may not be an endpoint of a range. |
|---|
| 302 | .RS |
|---|
| 303 | .PP |
|---|
| 304 | (\fINote:\fR Tcl implements only the Unicode locale. It does not define any |
|---|
| 305 | equivalence classes. The examples above are just illustrations.) |
|---|
| 306 | .RE |
|---|
| 307 | .SH ESCAPES |
|---|
| 308 | Escapes (AREs only), which begin with a \fB\e\fR followed by an |
|---|
| 309 | alphanumeric character, come in several varieties: character entry, |
|---|
| 310 | class shorthands, constraint escapes, and back references. A \fB\e\fR |
|---|
| 311 | followed by an alphanumeric character but not constituting a valid |
|---|
| 312 | escape is illegal in AREs. In EREs, there are no escapes: outside a |
|---|
| 313 | bracket expression, a \fB\e\fR followed by an alphanumeric character |
|---|
| 314 | merely stands for that character as an ordinary character, and inside |
|---|
| 315 | a bracket expression, \fB\e\fR is an ordinary character. (The latter |
|---|
| 316 | is the one actual incompatibility between EREs and AREs.) |
|---|
| 317 | .SS "CHARACTER-ENTRY ESCAPES" |
|---|
| 318 | Character-entry escapes (AREs only) exist to make it easier to specify |
|---|
| 319 | non-printing and otherwise inconvenient characters in REs: |
|---|
| 320 | .RS 2 |
|---|
| 321 | .TP 5 |
|---|
| 322 | \fB\ea\fR |
|---|
| 323 | . |
|---|
| 324 | alert (bell) character, as in C |
|---|
| 325 | .TP |
|---|
| 326 | \fB\eb\fR |
|---|
| 327 | . |
|---|
| 328 | backspace, as in C |
|---|
| 329 | .TP |
|---|
| 330 | \fB\eB\fR |
|---|
| 331 | . |
|---|
| 332 | synonym for \fB\e\fR to help reduce backslash doubling in some |
|---|
| 333 | applications where there are multiple levels of backslash processing |
|---|
| 334 | .TP |
|---|
| 335 | \fB\ec\fIX\fR |
|---|
| 336 | . |
|---|
| 337 | (where \fIX\fR is any character) the character whose low-order 5 bits |
|---|
| 338 | are the same as those of \fIX\fR, and whose other bits are all zero |
|---|
| 339 | .TP |
|---|
| 340 | \fB\ee\fR |
|---|
| 341 | . |
|---|
| 342 | the character whose collating-sequence name is |
|---|
| 343 | .QW \fBESC\fR , |
|---|
| 344 | or failing that, the character with octal value 033 |
|---|
| 345 | .TP |
|---|
| 346 | \fB\ef\fR |
|---|
| 347 | . |
|---|
| 348 | formfeed, as in C |
|---|
| 349 | .TP |
|---|
| 350 | \fB\en\fR |
|---|
| 351 | . |
|---|
| 352 | newline, as in C |
|---|
| 353 | .TP |
|---|
| 354 | \fB\er\fR |
|---|
| 355 | . |
|---|
| 356 | carriage return, as in C |
|---|
| 357 | .TP |
|---|
| 358 | \fB\et\fR |
|---|
| 359 | . |
|---|
| 360 | horizontal tab, as in C |
|---|
| 361 | .TP |
|---|
| 362 | \fB\eu\fIwxyz\fR |
|---|
| 363 | . |
|---|
| 364 | (where \fIwxyz\fR is exactly four hexadecimal digits) the Unicode |
|---|
| 365 | character \fBU+\fIwxyz\fR in the local byte ordering |
|---|
| 366 | .TP |
|---|
| 367 | \fB\eU\fIstuvwxyz\fR |
|---|
| 368 | . |
|---|
| 369 | (where \fIstuvwxyz\fR is exactly eight hexadecimal digits) reserved |
|---|
| 370 | for a somewhat-hypothetical Unicode extension to 32 bits |
|---|
| 371 | .TP |
|---|
| 372 | \fB\ev\fR |
|---|
| 373 | . |
|---|
| 374 | vertical tab, as in C |
|---|
| 375 | .TP |
|---|
| 376 | \fB\ex\fIhhh\fR |
|---|
| 377 | . |
|---|
| 378 | (where \fIhhh\fR is any sequence of hexadecimal digits) the character |
|---|
| 379 | whose hexadecimal value is \fB0x\fIhhh\fR (a single character no |
|---|
| 380 | matter how many hexadecimal digits are used). |
|---|
| 381 | .TP |
|---|
| 382 | \fB\e0\fR |
|---|
| 383 | . |
|---|
| 384 | the character whose value is \fB0\fR |
|---|
| 385 | .TP |
|---|
| 386 | \fB\e\fIxy\fR |
|---|
| 387 | . |
|---|
| 388 | (where \fIxy\fR is exactly two octal digits, and is not a \fIback |
|---|
| 389 | reference\fR (see below)) the character whose octal value is |
|---|
| 390 | \fB0\fIxy\fR |
|---|
| 391 | .TP |
|---|
| 392 | \fB\e\fIxyz\fR |
|---|
| 393 | . |
|---|
| 394 | (where \fIxyz\fR is exactly three octal digits, and is not a back |
|---|
| 395 | reference (see below)) the character whose octal value is |
|---|
| 396 | \fB0\fIxyz\fR |
|---|
| 397 | .RE |
|---|
| 398 | .PP |
|---|
| 399 | Hexadecimal digits are |
|---|
| 400 | .QR \fB0\fR \fB9\fR , |
|---|
| 401 | .QR \fBa\fR \fBf\fR , |
|---|
| 402 | and |
|---|
| 403 | .QR \fBA\fR \fBF\fR . |
|---|
| 404 | Octal digits are |
|---|
| 405 | .QR \fB0\fR \fB7\fR . |
|---|
| 406 | .PP |
|---|
| 407 | The character-entry escapes are always taken as ordinary characters. |
|---|
| 408 | For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does |
|---|
| 409 | not terminate a bracket expression. Beware, however, that some |
|---|
| 410 | applications (e.g., C compilers and the Tcl interpreter if the regular |
|---|
| 411 | expression is not quoted with braces) interpret such sequences |
|---|
| 412 | themselves before the regular-expression package gets to see them, |
|---|
| 413 | which may require doubling (quadrupling, etc.) the |
|---|
| 414 | .QW \fB\e\fR . |
|---|
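The following sketch shows a few character-entry escapes and the quoting issue described above. It assumes the standard \fBregexp\fR command; the variable names are illustrative. Patterns written in braces reach the regular-expression package unchanged, while patterns in double quotes are first processed by the Tcl interpreter.

```tcl
# \x41 is "A", \t is a tab, \u0042 is "B".
set entry [regexp {^\x41\t\u0042$} "A\tB"]       ;# 1

# \135 is "]" but is taken as an ordinary character,
# so it does not terminate the bracket expression.
set octal [regexp {^[\135x]+$} {x]x}]            ;# 1

# In double quotes the interpreter removes the single backslash first,
# so the package sees "d+" and looks for the letter d.
set unquoted [regexp "\d+" abc123]               ;# 0
set doubled  [regexp "\\d+" abc123]              ;# 1
set braced   [regexp {\d+} abc123]               ;# 1

puts [list $entry $octal $unquoted $doubled $braced]
```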
| 415 | .SS "CLASS-SHORTHAND ESCAPES" |
|---|
| 416 | Class-shorthand escapes (AREs only) provide shorthands for certain |
|---|
| 417 | commonly-used character classes: |
|---|
| 418 | .RS 2 |
|---|
| 419 | .TP 10 |
|---|
| 420 | \fB\ed\fR |
|---|
| 421 | . |
|---|
| 422 | \fB[[:digit:]]\fR |
|---|
| 423 | .TP |
|---|
| 424 | \fB\es\fR |
|---|
| 425 | . |
|---|
| 426 | \fB[[:space:]]\fR |
|---|
| 427 | .TP |
|---|
| 428 | \fB\ew\fR |
|---|
| 429 | . |
|---|
| 430 | \fB[[:alnum:]_]\fR (note underscore) |
|---|
| 431 | .TP |
|---|
| 432 | \fB\eD\fR |
|---|
| 433 | . |
|---|
| 434 | \fB[^[:digit:]]\fR |
|---|
| 435 | .TP |
|---|
| 436 | \fB\eS\fR |
|---|
| 437 | . |
|---|
| 438 | \fB[^[:space:]]\fR |
|---|
| 439 | .TP |
|---|
| 440 | \fB\eW\fR |
|---|
| 441 | . |
|---|
| 442 | \fB[^[:alnum:]_]\fR (note underscore) |
|---|
| 443 | .RE |
|---|
| 444 | .PP |
|---|
| 445 | Within bracket expressions, |
|---|
| 446 | .QW \fB\ed\fR , |
|---|
| 447 | .QW \fB\es\fR , |
|---|
| 448 | and |
|---|
| 449 | .QW \fB\ew\fR \& |
|---|
| 450 | lose their outer brackets, and |
|---|
| 451 | .QW \fB\eD\fR , |
|---|
| 452 | .QW \fB\eS\fR , |
|---|
| 453 | and |
|---|
| 454 | .QW \fB\eW\fR \& |
|---|
| 455 | are illegal. (So, for example, |
|---|
| 456 | .QW \fB[a-c\ed]\fR |
|---|
| 457 | is equivalent to |
|---|
| 458 | .QW \fB[a-c[:digit:]]\fR . |
|---|
| 459 | Also, |
|---|
| 460 | .QW \fB[a-c\eD]\fR , |
|---|
| 461 | which is equivalent to |
|---|
| 462 | .QW \fB[a-c^[:digit:]]\fR , |
|---|
| 463 | is illegal.) |
|---|
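A small sketch of the class shorthands, both on their own and inside a bracket expression. It assumes the standard \fBregexp\fR switches \fB\-all\fR and \fB\-inline\fR; the inputs are invented for the example.

```tcl
# \d is [[:digit:]]; \w is [[:alnum:]_], so the underscore is included.
set nums  [regexp -all -inline {\d+} {a1 b22}]           ;# 1 22
set words [regexp -all -inline {\w+} {foo_bar baz!}]     ;# foo_bar baz

# \S is the complement of \s: runs of non-space characters.
set fields [regexp -all -inline {\S+} {x  y z}]          ;# x y z

# Inside a bracket expression \d stands for [:digit:].
set inBracket [lindex [regexp -inline {[a-c\d]+} xxb2a9zz] 0]   ;# b2a9

puts [list $nums $words $fields $inBracket]
```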
| 464 | .SS "CONSTRAINT ESCAPES" |
|---|
| 465 | A constraint escape (AREs only) is a constraint, matching the empty |
|---|
| 466 | string if specific conditions are met, written as an escape: |
|---|
| 467 | .RS 2 |
|---|
| 468 | .TP 6 |
|---|
| 469 | \fB\eA\fR |
|---|
| 470 | . |
|---|
| 471 | matches only at the beginning of the string (see \fBMATCHING\fR, |
|---|
| 472 | below, for how this differs from |
|---|
| 473 | .QW \fB^\fR ) |
|---|
| 474 | .TP |
|---|
| 475 | \fB\em\fR |
|---|
| 476 | . |
|---|
| 477 | matches only at the beginning of a word |
|---|
| 478 | .TP |
|---|
| 479 | \fB\eM\fR |
|---|
| 480 | . |
|---|
| 481 | matches only at the end of a word |
|---|
| 482 | .TP |
|---|
| 483 | \fB\ey\fR |
|---|
| 484 | . |
|---|
| 485 | matches only at the beginning or end of a word |
|---|
| 486 | .TP |
|---|
| 487 | \fB\eY\fR |
|---|
| 488 | . |
|---|
| 489 | matches only at a point that is not the beginning or end of a word |
|---|
| 490 | .TP |
|---|
| 491 | \fB\eZ\fR |
|---|
| 492 | . |
|---|
| 493 | matches only at the end of the string (see \fBMATCHING\fR, below, for |
|---|
| 494 | how this differs from |
|---|
| 495 | .QW \fB$\fR ) |
|---|
| 496 | .TP |
|---|
| 497 | \fB\e\fIm\fR |
|---|
| 498 | . |
|---|
| 499 | (where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below |
|---|
| 500 | .TP |
|---|
| 501 | \fB\e\fImnn\fR |
|---|
| 502 | . |
|---|
| 503 | (where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits, |
|---|
| 504 | and the decimal value \fImnn\fR is not greater than the number of |
|---|
| 505 | closing capturing parentheses seen so far) a \fIback reference\fR, see |
|---|
| 506 | below |
|---|
| 507 | .RE |
|---|
| 508 | .PP |
|---|
| 509 | A word is defined as in the specification of |
|---|
| 510 | .QW \fB[[:<:]]\fR |
|---|
| 511 | and |
|---|
| 512 | .QW \fB[[:>:]]\fR |
|---|
| 513 | above. Constraint escapes are illegal within bracket expressions. |
|---|
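The word-boundary constraint escapes can be sketched as follows, assuming the standard \fBregexp\fR switches \fB\-all\fR, \fB\-inline\fR, and \fB\-indices\fR. The sample strings are illustrative.

```tcl
set text {cat concat cats cat}

# \m and \M: "cat" only as a whole word (the first and last words here).
set whole [regexp -all {\mcat\M} $text]                   ;# 2

# Without the constraints every occurrence counts.
set any [regexp -all {cat} $text]                         ;# 4

# \Y: "at" only where it does not start at a word boundary,
# i.e. the "at" inside "cat" (characters 4 through 5).
set inner [lindex [regexp -indices -inline {\Yat} {at cat}] 0]   ;# 4 5

puts [list $whole $any $inner]
```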
| 514 | .SS "BACK REFERENCES" |
|---|
| 515 | A back reference (AREs only) matches the same string matched by the |
|---|
| 516 | parenthesized subexpression specified by the number, so that (e.g.) |
|---|
| 517 | .QW \fB([bc])\e1\fR |
|---|
| 518 | matches |
|---|
| 519 | .QW \fBbb\fR |
|---|
| 520 | or |
|---|
| 521 | .QW \fBcc\fR |
|---|
| 522 | but not |
|---|
| 523 | .QW \fBbc\fR . |
|---|
| 524 | The subexpression must entirely precede the back reference in the RE. |
|---|
| 525 | Subexpressions are numbered in the order of their leading parentheses. |
|---|
| 526 | Non-capturing parentheses do not define subexpressions. |
|---|
| 527 | .PP |
|---|
| 528 | There is an inherent historical ambiguity between octal |
|---|
| 529 | character-entry escapes and back references, which is resolved by |
|---|
| 530 | heuristics, as hinted at above. A leading zero always indicates an |
|---|
| 531 | octal escape. A single non-zero digit, not followed by another digit, |
|---|
| 532 | is always taken as a back reference. A multi-digit sequence not |
|---|
| 533 | starting with a zero is taken as a back reference if it comes after a |
|---|
| 534 | suitable subexpression (i.e. the number is in the legal range for a |
|---|
| 535 | back reference), and otherwise is taken as octal. |
|---|
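Back references can be illustrated with the example from the text and with a common use, finding a doubled word. This is a sketch that assumes the standard \fBregexp\fR command and its \fB\-inline\fR switch; the second pattern and its input are made up for the example.

```tcl
# ([bc])\1 needs the same character twice.
set bb [regexp {([bc])\1} bb]     ;# 1
set cc [regexp {([bc])\1} cc]     ;# 1
set bc [regexp {([bc])\1} bc]     ;# 0

# A word, a space, and the same word again.
set doubled [regexp -inline {\m(\w+) \1\M} {this is is a test}]
# $doubled is a two-element list: the whole match "is is" and the submatch "is"

puts [list $bb $cc $bc $doubled]
```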
| 536 | .SH "METASYNTAX" |
|---|
| 537 | In addition to the main syntax described above, there are some special |
|---|
| 538 | forms and miscellaneous syntactic facilities available. |
|---|
| 539 | .PP |
|---|
| 540 | Normally the flavor of RE being used is specified by |
|---|
| 541 | application-dependent means. However, this can be overridden by a |
|---|
| 542 | \fIdirector\fR. If an RE of any flavor begins with |
|---|
| 543 | .QW \fB***:\fR , |
|---|
| 544 | the rest of the RE is an ARE. If an RE of any flavor begins with |
|---|
| 545 | .QW \fB***=\fR , |
|---|
| 546 | the rest of the RE is taken to be a literal string, with |
|---|
| 547 | all characters considered ordinary characters. |
|---|
| 548 | .PP |
|---|
| 549 | An ARE may begin with \fIembedded options\fR: a sequence |
|---|
| 550 | \fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic |
|---|
| 551 | characters) specifies options affecting the rest of the RE. These |
|---|
| 552 | supplement, and can override, any options specified by the |
|---|
| 553 | application. The available option letters are: |
|---|
| 554 | .RS 2 |
|---|
| 555 | .TP 3 |
|---|
| 556 | \fBb\fR |
|---|
| 557 | . |
|---|
| 558 | rest of RE is a BRE |
|---|
| 559 | .TP 3 |
|---|
| 560 | \fBc\fR |
|---|
| 561 | . |
|---|
| 562 | case-sensitive matching (usual default) |
|---|
| 563 | .TP 3 |
|---|
| 564 | \fBe\fR |
|---|
| 565 | . |
|---|
| 566 | rest of RE is an ERE |
|---|
| 567 | .TP 3 |
|---|
| 568 | \fBi\fR |
|---|
| 569 | . |
|---|
| 570 | case-insensitive matching (see \fBMATCHING\fR, below) |
|---|
| 571 | .TP 3 |
|---|
| 572 | \fBm\fR |
|---|
| 573 | . |
|---|
| 574 | historical synonym for \fBn\fR |
|---|
| 575 | .TP 3 |
|---|
| 576 | \fBn\fR |
|---|
| 577 | . |
|---|
| 578 | newline-sensitive matching (see \fBMATCHING\fR, below) |
|---|
| 579 | .TP 3 |
|---|
| 580 | \fBp\fR |
|---|
| 581 | . |
|---|
| 582 | partial newline-sensitive matching (see \fBMATCHING\fR, below) |
|---|
| 583 | .TP 3 |
|---|
| 584 | \fBq\fR |
|---|
| 585 | . |
|---|
| 586 | rest of RE is a literal |
|---|
| 587 | .PQ quoted |
|---|
| 588 | string, all ordinary characters |
|---|
| 589 | .TP 3 |
|---|
| 590 | \fBs\fR |
|---|
| 591 | . |
|---|
| 592 | non-newline-sensitive matching (usual default) |
|---|
| 593 | .TP 3 |
|---|
| 594 | \fBt\fR |
|---|
| 595 | . |
|---|
| 596 | tight syntax (usual default; see below) |
|---|
| 597 | .TP 3 |
|---|
| 598 | \fBw\fR |
|---|
| 599 | . |
|---|
| 600 | inverse partial newline-sensitive |
|---|
| 601 | .PQ weird |
|---|
| 602 | matching (see \fBMATCHING\fR, below) |
|---|
| 603 | .TP 3 |
|---|
| 604 | \fBx\fR |
|---|
| 605 | . |
|---|
| 606 | expanded syntax (see below) |
|---|
| 607 | .RE |
|---|
| 608 | .PP |
|---|
| 609 | Embedded options take effect at the \fB)\fR terminating the sequence. |
|---|
| 610 | They are available only at the start of an ARE, and may not be used |
|---|
| 611 | later within it. |
|---|
| 612 | .PP |
|---|
| 613 | In addition to the usual (\fItight\fR) RE syntax, in which all |
|---|
| 614 | characters are significant, there is an \fIexpanded\fR syntax, |
|---|
| 615 | available in all flavors of RE with the \fB\-expanded\fR switch, or in |
|---|
| 616 | AREs with the embedded x option. In the expanded syntax, white-space |
|---|
| 617 | characters are ignored and all characters between a \fB#\fR and the |
|---|
| 618 | following newline (or the end of the RE) are ignored, permitting |
|---|
| 619 | paragraphing and commenting a complex RE. There are three exceptions |
|---|
| 620 | to that basic rule: |
|---|
| 621 | .IP \(bu 3 |
|---|
| 622 | a white-space character or |
|---|
| 623 | .QW \fB#\fR |
|---|
| 624 | preceded by |
|---|
| 625 | .QW \fB\e\fR |
|---|
| 626 | is retained |
|---|
| 627 | .IP \(bu 3 |
|---|
| 628 | white space or |
|---|
| 629 | .QW \fB#\fR |
|---|
| 630 | within a bracket expression is retained |
|---|
| 631 | .IP \(bu 3 |
|---|
| 632 | white space and comments are illegal within multi-character symbols |
|---|
| 633 | like the ARE |
|---|
| 634 | .QW \fB(?:\fR |
|---|
| 635 | or the BRE |
|---|
| 636 | .QW \fB\e(\fR |
|---|
| 637 | .PP |
|---|
| 638 | Expanded-syntax white-space characters are blank, tab, newline, and |
|---|
| 639 | any character that belongs to the \fIspace\fR character class. |
|---|
| 640 | .PP |
|---|
| 641 | Finally, in an ARE, outside bracket expressions, the sequence |
|---|
| 642 | .QW \fB(?#\fIttt\fB)\fR |
|---|
| 643 | (where \fIttt\fR is any text not containing a |
|---|
| 644 | .QW \fB)\fR ) |
|---|
| 645 | is a comment, completely ignored. Again, this is not |
|---|
| 646 | allowed between the characters of multi-character symbols like |
|---|
| 647 | .QW \fB(?:\fR . |
|---|
| 648 | Such comments are more a historical artifact than a useful facility, |
|---|
| 649 | and their use is deprecated; use the expanded syntax instead. |
|---|
| 650 | .PP |
|---|
| 651 | \fINone\fR of these metasyntax extensions is available if the |
|---|
| 652 | application (or an initial |
|---|
| 653 | .QW \fB***=\fR |
|---|
| 654 | director) has specified that the |
|---|
| 655 | user's input be treated as a literal string rather than as an RE. |
|---|
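A brief sketch of a director, an embedded option, and the expanded syntax, assuming the standard \fBregexp\fR command. The patterns and inputs are illustrative only.

```tcl
# ***= makes the rest of the RE a literal string, so . is an ordinary dot.
set literalHit  [regexp {***=a.c} xa.cx]     ;# 1
set literalMiss [regexp {***=a.c} abc]       ;# 0

# (?i) selects case-insensitive matching from inside the RE.
set noCase [regexp {(?i)hello} HELLO]        ;# 1

# (?x) selects the expanded syntax: white space and # comments are ignored.
set expanded [regexp {(?x)
    \d+     # first number
    -       # separator
    \d+     # second number
} 12-34]                                     ;# 1

puts [list $literalHit $literalMiss $noCase $expanded]
```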
| 656 | .SH MATCHING |
|---|
| 657 | In the event that an RE could match more than one substring of a given |
|---|
| 658 | string, the RE matches the one starting earliest in the string. If |
|---|
| 659 | the RE could match more than one substring starting at that point, its |
|---|
| 660 | choice is determined by its \fIpreference\fR: either the longest |
|---|
| 661 | substring, or the shortest. |
|---|
| 662 | .PP |
|---|
| 663 | Most atoms, and all constraints, have no preference. A parenthesized |
|---|
| 664 | RE has the same preference (possibly none) as the RE. A quantified |
|---|
| 665 | atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same |
|---|
| 666 | preference (possibly none) as the atom itself. A quantified atom with |
|---|
| 667 | other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with |
|---|
| 668 | \fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom |
|---|
| 669 | with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR |
|---|
| 670 | with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has |
|---|
| 671 | the same preference as the first quantified atom in it which has a |
|---|
| 672 | preference. An RE consisting of two or more branches connected by the |
|---|
| 673 | \fB|\fR operator prefers longest match. |
|---|
| 674 | .PP |
|---|
| 675 | Subject to the constraints imposed by the rules for matching the whole |
|---|
| 676 | RE, subexpressions also match the longest or shortest possible |
|---|
| 677 | substrings, based on their preferences, with subexpressions starting |
|---|
| 678 | earlier in the RE taking priority over ones starting later. Note that |
|---|
| 679 | outer subexpressions thus take priority over their component |
|---|
| 680 | subexpressions. |
|---|
| 681 | .PP |
|---|
| 682 | Note that the quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to |
|---|
| 683 | force longest and shortest preference, respectively, on a |
|---|
| 684 | subexpression or a whole RE. |
|---|
| 685 | .PP |
|---|
| 686 | Match lengths are measured in characters, not collating elements. An |
|---|
| 687 | empty string is considered longer than no match at all. For example, |
|---|
| 688 | .QW \fBbb*\fR |
|---|
| 689 | matches the three middle characters of |
|---|
| 690 | .QW \fBabbbc\fR ; |
|---|
| 691 | .QW \fB(week|wee)(night|knights)\fR |
|---|
| 692 | matches all ten characters of |
|---|
| 693 | .QW \fBweeknights\fR ; |
|---|
| 694 | when |
|---|
| 695 | .QW \fB(.*).*\fR |
|---|
| 696 | is matched against |
|---|
| 697 | .QW \fBabc\fR , |
|---|
| 698 | the parenthesized subexpression matches all three characters; and when |
|---|
| 699 | .QW \fB(a*)*\fR |
|---|
| 700 | is matched against |
|---|
| 701 | .QW \fBbc\fR , |
|---|
| 702 | both the whole RE and the parenthesized subexpression match an empty string. |
|---|
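The four examples in the preceding paragraph can be checked as follows. This is only a sketch: the variable names are illustrative, and \fB\-inline\fR (see regexp(n)) is used so that each call returns the overall match followed by the subexpression matches.

```tcl
# bb* takes the three middle characters of abbbc.
set m1 [regexp -inline {bb*} abbbc]

# The longest overall match wins, so the first branch settles for "wee".
set m2 [regexp -inline {(week|wee)(night|knights)} weeknights]

# The earlier subexpression takes priority over the trailing .*
set m3 [regexp -inline {(.*).*} abc]

# Both the whole RE and the subexpression match an empty string.
set m4 [regexp -inline {(a*)*} bc]

puts [list $m1 $m2 $m3 $m4]
```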
| 703 | .PP |
|---|
| 704 | If case-independent matching is specified, the effect is much as if |
|---|
| 705 | all case distinctions had vanished from the alphabet. When an |
|---|
| 706 | alphabetic character that exists in multiple cases appears as an ordinary |
|---|
| 707 | character outside a bracket expression, it is effectively transformed |
|---|
| 708 | into a bracket expression containing both cases, so that \fBx\fR |
|---|
| 709 | becomes |
|---|
| 710 | .QW \fB[xX]\fR . |
|---|
| 711 | When it appears inside a bracket expression, |
|---|
| 712 | all case counterparts of it are added to the bracket expression, so |
|---|
| 713 | that |
|---|
| 714 | .QW \fB[x]\fR |
|---|
| 715 | becomes |
|---|
| 716 | .QW \fB[xX]\fR |
|---|
| 717 | and |
|---|
| 718 | .QW \fB[^x]\fR |
|---|
| 719 | becomes |
|---|
| 720 | .QW \fB[^xX]\fR . |
|---|
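A short sketch of the transformations described above, assuming case-independent matching is requested with the \fB\-nocase\fR switch of regexp(n); the variable names are illustrative.

```tcl
set r1 [regexp -nocase {x} X]      ;# x behaves like [xX]
set r2 [regexp -nocase {[x]} X]    ;# [x] behaves like [xX]
set r3 [regexp -nocase {[^x]} X]   ;# [^x] behaves like [^xX], so X is excluded
set r4 [regexp {[^x]} X]           ;# case-sensitive, for comparison

puts "$r1 $r2 $r3 $r4"   ;# prints: 1 1 0 1
```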
| 721 | .PP |
|---|
| 722 | If newline-sensitive matching is specified, \fB.\fR and bracket |
|---|
| 723 | expressions using \fB^\fR will never match the newline character (so |
|---|
| 724 | that matches will never cross newlines unless the RE explicitly |
|---|
| 725 | arranges it) and \fB^\fR and \fB$\fR will match the empty string after |
|---|
| 726 | and before a newline respectively, in addition to matching at |
|---|
| 727 | beginning and end of string respectively. In AREs, \fB\eA\fR and \fB\eZ\fR |
|---|
| 728 | continue to match at the beginning or end of the string \fIonly\fR. |
|---|
| 729 | .PP |
|---|
| 730 | If partial newline-sensitive matching is specified, this affects |
|---|
| 731 | \fB.\fR and bracket expressions as with newline-sensitive matching, |
|---|
| 732 | but not \fB^\fR and \fB$\fR. |
|---|
| 733 | .PP |
|---|
| 734 | If inverse partial newline-sensitive matching is specified, this |
|---|
| 735 | affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but |
|---|
| 736 | not \fB.\fR and bracket expressions. This is not very useful but is |
|---|
| 737 | provided for symmetry. |
|---|
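The three newline-related modes can be compared side by side. The sketch below assumes the \fB\-line\fR, \fB\-linestop\fR, and \fB\-lineanchor\fR switches of regexp(n) select newline-sensitive, partial newline-sensitive, and inverse partial newline-sensitive matching respectively; the test string and variable names are illustrative.

```tcl
set s "ab\ncd"

# Default: "." may match a newline, and ^ matches only at the start of the string.
set d1 [regexp {a.*d} $s]              ;# 1
set d2 [regexp {^cd} $s]               ;# 0

# -line: full newline-sensitive matching.
set l1 [regexp -line {a.*d} $s]        ;# 0, "." stops at the newline
set l2 [regexp -line {^cd} $s]         ;# 1, ^ matches after the newline
set l3 [regexp -line {\Acd} $s]        ;# 0, \A is still start of string only

# -linestop: only "." and bracket expressions are affected.
set p1 [regexp -linestop {a.*d} $s]    ;# 0
set p2 [regexp -linestop {^cd} $s]     ;# 0

# -lineanchor: only ^ and $ are affected.
set i1 [regexp -lineanchor {a.*d} $s]  ;# 1
set i2 [regexp -lineanchor {^cd} $s]   ;# 1

puts "$d1 $d2 $l1 $l2 $l3 $p1 $p2 $i1 $i2"
```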
| 738 | .SH "LIMITS AND COMPATIBILITY" |
|---|
| 739 | No particular limit is imposed on the length of REs. Programs |
|---|
| 740 | intended to be highly portable should not employ REs longer than 256 |
|---|
| 741 | bytes, as a POSIX-compliant implementation can refuse to accept such |
|---|
| 742 | REs. |
|---|
| 743 | .PP |
|---|
| 744 | The only feature of AREs that is actually incompatible with POSIX EREs |
|---|
| 745 | is that \fB\e\fR does not lose its special significance inside bracket |
|---|
| 746 | expressions. All other ARE features use syntax which is illegal or |
|---|
| 747 | has undefined or unspecified effects in POSIX EREs; the \fB***\fR |
|---|
| 748 | syntax of directors likewise is outside the POSIX syntax for both BREs |
|---|
| 749 | and EREs. |
|---|
| 750 | .PP |
|---|
| 751 | Many of the ARE extensions are borrowed from Perl, but some have been |
|---|
| 752 | changed to clean them up, and a few Perl extensions are not present. |
|---|
| 753 | Incompatibilities of note include |
|---|
| 754 | .QW \fB\eb\fR , |
|---|
| 755 | .QW \fB\eB\fR , |
|---|
| 756 | the lack of special treatment for a trailing newline, the addition of |
|---|
| 757 | complemented bracket expressions to the things affected by |
|---|
| 758 | newline-sensitive matching, the restrictions on parentheses and back |
|---|
| 759 | references in lookahead constraints, and the longest/shortest-match |
|---|
| 760 | (rather than first-match) matching semantics. |
|---|
| 761 | .PP |
|---|
| 762 | The matching rules for REs containing both normal and non-greedy |
|---|
| 763 | quantifiers have changed since early beta-test versions of this |
|---|
| 764 | package. (The new rules are much simpler and cleaner, but do not work |
|---|
| 765 | as hard at guessing the user's real intentions.) |
|---|
| 766 | .PP |
|---|
| 767 | Henry Spencer's original 1986 \fIregexp\fR package, still in |
|---|
| 768 | widespread use (e.g., in pre-8.1 releases of Tcl), implemented an |
|---|
| 769 | early version of today's EREs. There are four incompatibilities |
|---|
| 770 | between \fIregexp\fR's near-EREs |
|---|
| 771 | .PQ RREs " for short" |
|---|
| 772 | and AREs. In roughly increasing order of significance: |
|---|
| 773 | .IP \(bu 3 |
|---|
| 774 | In AREs, \fB\e\fR followed by an alphanumeric character is either an |
|---|
| 775 | escape or an error, while in RREs, it was just another way of writing |
|---|
| 776 | the alphanumeric. This should not be a problem because there was no |
|---|
| 777 | reason to write such a sequence in RREs. |
|---|
| 778 | .IP \(bu 3 |
|---|
| 779 | \fB{\fR followed by a digit in an ARE is the beginning of a bound, |
|---|
| 780 | while in RREs, \fB{\fR was always an ordinary character. Such |
|---|
| 781 | sequences should be rare, and will often result in an error because |
|---|
| 782 | following characters will not look like a valid bound. |
|---|
| 783 | .IP \(bu 3 |
|---|
| 784 | In AREs, \fB\e\fR remains a special character within |
|---|
| 785 | .QW \fB[\|]\fR , |
|---|
| 786 | so a literal \fB\e\fR within \fB[\|]\fR must be written |
|---|
| 787 | .QW \fB\e\e\fR . |
|---|
| 788 | \fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs, |
|---|
| 789 | but only truly paranoid programmers routinely doubled the backslash. |
|---|
| 790 | .IP \(bu 3 |
|---|
| 791 | AREs report the longest/shortest match for the RE, rather than the |
|---|
| 792 | first found in a specified search order. This may affect some RREs |
|---|
| 793 | which were written in the expectation that the first match would be |
|---|
| 794 | reported. (The careful crafting of RREs to optimize the search order |
|---|
| 795 | for fast matching is obsolete, since AREs examine all possible matches in |
|---|
| 796 | parallel and their performance is largely insensitive to their |
|---|
| 797 | complexity, but cases where the search order was exploited to |
|---|
| 798 | deliberately find a match which was \fInot\fR the longest/shortest |
|---|
| 799 | will need rewriting.) |
|---|
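Three of the incompatibilities listed above are easy to see from the ARE side. The following sketch shows only the ARE behavior (no RRE implementation is assumed to be available), and the variable names are illustrative.

```tcl
# An opening brace followed by a digit starts a bound, so a{2} means "aa".
set b1 [regexp {a{2}} aa]        ;# 1
set b2 [regexp {a{2}} "a{2}"]    ;# 0, the braces are not matched literally

# Inside a bracket expression, a literal backslash must be doubled.
set backslash "\\"
set k1 [regexp {[\\]} $backslash]   ;# 1

# Longest match rather than first match: an engine that tried the
# branches left to right and stopped at the first success would report "a".
set longest [regexp -inline {a|ab} ab]

puts "$b1 $b2 $k1 $longest"   ;# prints: 1 0 1 ab
```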
| 800 | .SH "BASIC REGULAR EXPRESSIONS" |
|---|
| 801 | BREs differ from EREs in several respects. |
|---|
| 802 | .QW \fB|\fR , |
|---|
| 803 | .QW \fB+\fR , |
|---|
| 804 | and \fB?\fR are ordinary characters and there is no equivalent for their |
|---|
| 805 | functionality. The delimiters for bounds are \fB\e{\fR and |
|---|
| 806 | .QW \fB\e}\fR , |
|---|
| 807 | with \fB{\fR and \fB}\fR by themselves ordinary characters. The |
|---|
| 808 | parentheses for nested subexpressions are \fB\e(\fR and |
|---|
| 809 | .QW \fB\e)\fR , |
|---|
| 810 | with \fB(\fR and \fB)\fR by themselves ordinary |
|---|
| 811 | characters. \fB^\fR is an ordinary character except at the beginning |
|---|
| 812 | of the RE or the beginning of a parenthesized subexpression, \fB$\fR |
|---|
| 813 | is an ordinary character except at the end of the RE or the end of a |
|---|
| 814 | parenthesized subexpression, and \fB*\fR is an ordinary character if |
|---|
| 815 | it appears at the beginning of the RE or the beginning of a |
|---|
| 816 | parenthesized subexpression (after a possible leading |
|---|
| 817 | .QW \fB^\fR ). |
|---|
| 818 | Finally, single-digit back references are available, and \fB\e<\fR and |
|---|
| 819 | \fB\e>\fR are synonyms for |
|---|
| 820 | .QW \fB[[:<:]]\fR |
|---|
| 821 | and |
|---|
| 822 | .QW \fB[[:>:]]\fR |
|---|
| 823 | respectively; no other escapes are available. |
|---|
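Because Tcl treats REs as AREs by default, trying out BRE syntax requires switching flavor explicitly. The sketch below assumes the \fB(?b)\fR embedded option, described with the other embedded options elsewhere in this manual page, for that purpose; the sample strings and variable names are illustrative.

```tcl
# In a BRE, | and + are ordinary characters.
set o1 [regexp {(?b)a|b} "a|b"]   ;# matches the literal text a|b
set o2 [regexp {(?b)a|b} a]       ;# no alternation, so "a" alone fails
set o3 [regexp {(?b)a+} "a+"]     ;# matches the literal text a+

# Bounds and subexpressions use backslashed delimiters.
set g1 [regexp -inline {(?b)\(ab\)\{2\}} abab]

# Single-digit back references are available.
set g2 [regexp {(?b)\(a\)\1} aa]
set g3 [regexp {(?b)\(a\)\1} ab]

puts "$o1 $o2 $o3 $g1 $g2 $g3"
```

Following the rules in this section, the first and third calls should succeed, the second should fail, and the bounded subexpression should match all of abab.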
| 824 | .SH "SEE ALSO" |
|---|
| 825 | RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n) |
|---|
| 826 | .SH KEYWORDS |
|---|
| 827 | match, regular expression, string |
|---|
| 828 | .\" Local Variables: |
|---|
| 829 | .\" mode: nroff |
|---|
| 830 | .\" End: |
|---|