1	'\"
2	'\" Copyright (c) 1998 Sun Microsystems, Inc.
3	'\" Copyright (c) 1999 Scriptics Corporation
4	'\"
5	'\" See the file "license.terms" for information on usage and redistribution
6	'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
7	'\"
8	'\" RCS: @(#) $Id: re_syntax.n,v 1.18 2007/12/13 15:22:33 dgp Exp $
9	'\"
10	.so man.macros
11	.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
12	.BS
13	.SH NAME
14	re_syntax \- Syntax of Tcl regular expressions
15	.BE
16	.SH DESCRIPTION
17	.PP
18	A \fIregular expression\fR describes strings of characters.
19	It's a pattern that matches certain strings and does not match others.
20	.SH "DIFFERENT FLAVORS OF REs"
21	Regular expressions
22	.PQ RE s ,
23	as defined by POSIX, come in two flavors: \fIextended\fR REs
24	.PQ ERE s
25	and \fIbasic\fR REs
26	.PQ BRE s .
27	EREs are roughly those of the traditional \fIegrep\fR, while BREs are
28	roughly those of the traditional \fIed\fR. This implementation adds
29	a third flavor, \fIadvanced\fR REs
30	.PQ ARE s ,
31	basically EREs with some significant extensions.
32	.PP
33	This manual page primarily describes AREs. BREs mostly exist for
34	backward compatibility in some old programs; they will be discussed at
35	the end. POSIX EREs are almost an exact subset of AREs. Features of
36	AREs that are not present in EREs will be indicated.
37	.SH "REGULAR EXPRESSION SYNTAX"
38	.PP
39	Tcl regular expressions are implemented using the package written by
40	Henry Spencer, based on the 1003.2 spec and some (not quite all) of
41	the Perl5 extensions (thanks, Henry!). Much of the description of
42	regular expressions below is copied verbatim from his manual entry.
43	.PP
44	An ARE is one or more \fIbranches\fR,
45	separated by
46	.QW \fB|\fR ,
47	matching anything that matches any of the branches.
48	.PP
49	A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
50	concatenated.
51	It matches a match for the first, followed by a match for the second, etc;
52	an empty branch matches the empty string.
53	.SS QUANTIFIERS
54	A quantified atom is an \fIatom\fR possibly followed
55	by a single \fIquantifier\fR.
56	Without a quantifier, it matches a single match for the atom.
57	The quantifiers,
58	and what a so-quantified atom matches, are:
59	.RS 2
60	.TP 6
61	\fB*\fR
62	.
63	a sequence of 0 or more matches of the atom
64	.TP
65	\fB+\fR
66	.
67	a sequence of 1 or more matches of the atom
68	.TP
69	\fB?\fR
70	.
71	a sequence of 0 or 1 matches of the atom
72	.TP
73	\fB{\fIm\fB}\fR
74	.
75	a sequence of exactly \fIm\fR matches of the atom
76	.TP
77	\fB{\fIm\fB,}\fR
78	.
79	a sequence of \fIm\fR or more matches of the atom
80	.TP
81	\fB{\fIm\fB,\fIn\fB}\fR
82	.
83	a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
84	\fIm\fR may not exceed \fIn\fR
85	.TP
86	\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR
87	.
88	\fInon-greedy\fR quantifiers, which match the same possibilities,
89	but prefer the smallest number rather than the largest number
90	of matches (see \fBMATCHING\fR)
91	.RE
92	.PP
93	The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
94	numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
95	permissible values from 0 to 255 inclusive.
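.PP
As a rough illustration of the quantifiers and bounds listed above, the
following sketch uses the \fBregexp\fR command with its \fB\-inline\fR
switch (both described in \fBregexp\fR(n), not on this page).
The sample string is arbitrary.
.CS
regexp \-inline {a+} baaac        ;# greedy: returns aaa
regexp \-inline {a+?} baaac       ;# non-greedy: returns a
regexp \-inline {a{2}} baaac      ;# exactly two: returns aa
regexp \-inline {a{2,}} baaac     ;# two or more: returns aaa
regexp \-inline {a{1,2}?} baaac   ;# non-greedy bound: returns a
.CE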
96	.SS ATOMS
97	An atom is one of:
98	.RS 2
99	.IP \fB(\fIre\fB)\fR 6
100	matches a match for \fIre\fR (\fIre\fR is any regular expression) with
101	the match noted for possible reporting
102	.IP \fB(?:\fIre\fB)\fR
103	as previous, but does no reporting (a
104	.QW non-capturing
105	set of parentheses)
106	.IP \fB()\fR
107	matches an empty string, noted for possible reporting
108	.IP \fB(?:)\fR
109	matches an empty string, without reporting
110	.IP \fB[\fIchars\fB]\fR
111	a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
112	\fBBRACKET EXPRESSIONS\fR for more detail)
113	.IP \fB.\fR
114	matches any single character
115	.IP \fB\e\fIk\fR
116	matches the non-alphanumeric character \fIk\fR
117	taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
118	character
119	.IP \fB\e\fIc\fR
120	where \fIc\fR is alphanumeric (possibly followed by other characters),
121	an \fIescape\fR (AREs only), see \fBESCAPES\fR below
122	.IP \fB{\fR
123	when followed by a character other than a digit, matches the
124	left-brace character
125	.QW \fB{\fR ;
126	when followed by a digit, it is the beginning of a \fIbound\fR (see above)
127	.IP \fIx\fR
128	where \fIx\fR is a single character with no other significance,
129	matches that character.
130	.RE
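.PP
The difference between capturing and non-capturing parentheses, and
between an escaped and an unescaped
.QW \fB.\fR ,
might be checked along these lines.
This is only a sketch: it assumes the \fBregexp\fR command and its
\fB\-inline\fR switch from \fBregexp\fR(n), and the sample strings are
invented.
.CS
regexp \-inline {(a)(?:b)(c)} abc   ;# returns abc a c
regexp {a.b} axb                   ;# returns 1: . matches any character
regexp {a\e.b} axb                  ;# returns 0: \e. is a literal dot
regexp {a\e.b} a.b                  ;# returns 1
.CE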
131	.SS CONSTRAINTS
132	A \fIconstraint\fR matches an empty string when specific conditions
133	are met. A constraint may not be followed by a quantifier. The
134	simple constraints are as follows; some more constraints are described
135	later, under \fBESCAPES\fR.
136	.RS 2
137	.TP 8
138	\fB^\fR
139	.
140	matches at the beginning of a line
141	.TP
142	\fB$\fR
143	.
144	matches at the end of a line
145	.TP
146	\fB(?=\fIre\fB)\fR
147	.
148	\fIpositive lookahead\fR (AREs only), matches at any point where a
149	substring matching \fIre\fR begins
150	.TP
151	\fB(?!\fIre\fB)\fR
152	.
153	\fInegative lookahead\fR (AREs only), matches at any point where no
154	substring matching \fIre\fR begins
155	.RE
156	.PP
157	The lookahead constraints may not contain back references (see later),
158	and all parentheses within them are considered non-capturing.
159	.PP
160	An RE may not end with
161	.QW \fB\e\fR .
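.PP
A minimal sketch of the constraints described above, assuming the
\fBregexp\fR command from \fBregexp\fR(n); the sample strings are made
up for illustration.
.CS
regexp {^abc$} abc                    ;# returns 1
regexp {^abc$} xabc                   ;# returns 0
regexp \-inline {foo(?=bar)} foobar   ;# returns foo
regexp {foo(?!bar)} foobar            ;# returns 0
regexp {foo(?!bar)} foobaz            ;# returns 1
.CE
.PP
In the third command the lookahead tests for
.QW \fBbar\fR
without making it part of the match.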
162	.SH "BRACKET EXPRESSIONS"
163	A \fIbracket expression\fR is a list of characters enclosed in
164	.QW \fB[\|]\fR .
165	It normally matches any single character from the list
166	(but see below). If the list begins with
167	.QW \fB^\fR ,
168	it matches any single character (but see below) \fInot\fR from the
169	rest of the list.
170	.PP
171	If two characters in the list are separated by
172	.QW \fB\-\fR ,
173	this is shorthand for the full \fIrange\fR of characters between those two
174	(inclusive) in the collating sequence, e.g.
175	.QW \fB[0\-9]\fR
176	in Unicode matches any conventional decimal digit. Two ranges may not share an
177	endpoint, so e.g.
178	.QW \fBa\-c\-e\fR
179	is illegal. Ranges in Tcl always use the
180	Unicode collating sequence, but other programs may use other collating
181	sequences and this can be a source of incompatibility between programs.
182	.PP
183	To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
184	method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
185	collating element (see below). Alternatively, make it the first
186	character (following a possible
187	.QW \fB^\fR ),
188	or (AREs only) precede it with
189	.QW \fB\e\fR .
190	Alternatively, for
191	.QW \fB\-\fR ,
192	make it the last character, or the second endpoint of a range. To use
193	a literal \fB\-\fR as the first endpoint of a range, make it a
194	collating element or (AREs only) precede it with
195	.QW \fB\e\fR .
196	With the exception of
197	these, some combinations using \fB[\fR (see next paragraphs), and
198	escapes, all other special characters lose their special significance
199	within a bracket expression.
200	.SS "CHARACTER CLASSES"
201	Within a bracket expression, the name of a \fIcharacter class\fR
202	enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
203	characters (not all collating elements!) belonging to that class.
204	Standard character classes are:
205	.IP \fBalpha\fR 8
206	A letter.
207	.IP \fBupper\fR 8
208	An upper-case letter.
209	.IP \fBlower\fR 8
210	A lower-case letter.
211	.IP \fBdigit\fR 8
212	A decimal digit.
213	.IP \fBxdigit\fR 8
214	A hexadecimal digit.
215	.IP \fBalnum\fR 8
216	An alphanumeric (letter or digit).
217	.IP \fBprint\fR 8
218	A "printable" (same as graph, except also including space).
219	.IP \fBblank\fR 8
220	A space or tab character.
221	.IP \fBspace\fR 8
222	A character producing white space in displayed text.
223	.IP \fBpunct\fR 8
224	A punctuation character.
225	.IP \fBgraph\fR 8
226	A character with a visible representation (includes both alnum and punct).
227	.IP \fBcntrl\fR 8
228	A control character.
229	.PP
230	A locale may provide others. A character class may not be used as an endpoint
231	of a range.
232	.RS
233	.PP
234	(\fINote:\fR the current Tcl implementation has only one locale, the Unicode
235	locale, which supports exactly the above classes.)
236	.RE
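.PP
Ranges and character classes inside bracket expressions can be tried
out as follows.
The commands are a sketch that assumes the \fBregexp\fR command and its
\fB\-inline\fR switch from \fBregexp\fR(n); the sample strings are
arbitrary.
.CS
regexp \-inline {[a\-c]+} xabcabd                       ;# returns abcab
regexp \-inline {[[:alpha:]]+} {42 apples}             ;# returns apples
regexp \-inline {[^[:digit:][:space:]]+} {42 apples}   ;# returns apples
regexp \-inline {[[:upper:]][[:lower:]]*} {see Tcl run} ;# returns Tcl
.CE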
237	.SS "BRACKETED CONSTRAINTS"
238	There are two special cases of bracket expressions: the bracket
239	expressions
240	.QW \fB[[:<:]]\fR
241	and
242	.QW \fB[[:>:]]\fR
243	are constraints, matching empty strings at the beginning and end of a word
244	respectively.
245	.\" note, discussion of escapes below references this definition of word
246	A word is defined as a sequence of word characters that is neither preceded
247	nor followed by word characters. A word character is an \fIalnum\fR character
248	or an underscore
249	.PQ \fB_\fR "" .
250	These special bracket expressions are deprecated; users of AREs should use
251	constraint escapes instead (see below).
252	.SS "COLLATING ELEMENTS"
253	Within a bracket expression, a collating element (a character, a
254	multi-character sequence that collates as if it were a single
255	character, or a collating-sequence name for either) enclosed in
256	\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
257	collating element. The sequence is a single element of the bracket
258	expression's list. A bracket expression in a locale that has
259	multi-character collating elements can thus match more than one
260	character. So (insidiously), a bracket expression that starts with
261	\fB^\fR can match multi-character collating elements even if none of
262	them appear in the bracket expression!
263	.RS
264	.PP
265	(\fINote:\fR Tcl has no multi-character collating elements. This information
266	is only for illustration.)
267	.RE
268	.PP
269	For example, assume the collating sequence includes a \fBch\fR multi-character
270	collating element. Then the RE
271	.QW \fB[[.ch.]]*c\fR
272	(zero or more
273	.QW \fBch\fRs
274	followed by
275	.QW \fBc\fR )
276	matches the first five characters of
277	.QW \fBchchcc\fR .
278	Also, the RE
279	.QW \fB[^c]b\fR
280	matches all of
281	.QW \fBchb\fR
282	(because
283	.QW \fB[^c]\fR
284	matches the multi-character
285	.QW \fBch\fR ).
286	.SS "EQUIVALENCE CLASSES"
287	Within a bracket expression, a collating element enclosed in \fB[=\fR
288	and \fB=]\fR is an equivalence class, standing for the sequences of
289	characters of all collating elements equivalent to that one, including
290	itself. (If there are no other equivalent collating elements, the
291	treatment is as if the enclosing delimiters were
292	.QW \fB[.\fR \&
293	and
294	.QW \fB.]\fR .)
295	For example, if \fBo\fR and \fB\N'244'\fR are the members of an
296	equivalence class, then
297	.QW \fB[[=o=]]\fR ,
298	.QW \fB[[=\N'244'=]]\fR ,
299	and
300	.QW \fB[o\N'244']\fR \&
301	are all synonymous. An equivalence class may not be an endpoint of a range.
302	.RS
303	.PP
304	(\fINote:\fR Tcl implements only the Unicode locale. It does not define any
305	equivalence classes. The examples above are just illustrations.)
306	.RE
307	.SH ESCAPES
308	Escapes (AREs only), which begin with a \fB\e\fR followed by an
309	alphanumeric character, come in several varieties: character entry,
310	class shorthands, constraint escapes, and back references. A \fB\e\fR
311	followed by an alphanumeric character but not constituting a valid
312	escape is illegal in AREs. In EREs, there are no escapes: outside a
313	bracket expression, a \fB\e\fR followed by an alphanumeric character
314	merely stands for that character as an ordinary character, and inside
315	a bracket expression, \fB\e\fR is an ordinary character. (The latter
316	is the one actual incompatibility between EREs and AREs.)
317	.SS "CHARACTER-ENTRY ESCAPES"
318	Character-entry escapes (AREs only) exist to make it easier to specify
319	non-printing and otherwise inconvenient characters in REs:
320	.RS 2
321	.TP 5
322	\fB\ea\fR
323	.
324	alert (bell) character, as in C
325	.TP
326	\fB\eb\fR
327	.
328	backspace, as in C
329	.TP
330	\fB\eB\fR
331	.
332	synonym for \fB\e\fR to help reduce backslash doubling in some
333	applications where there are multiple levels of backslash processing
334	.TP
335	\fB\ec\fIX\fR
336	.
337	(where \fIX\fR is any character) the character whose low-order 5 bits
338	are the same as those of \fIX\fR, and whose other bits are all zero
339	.TP
340	\fB\ee\fR
341	.
342	the character whose collating-sequence name is
343	.QW \fBESC\fR ,
344	or failing that, the character with octal value 033
345	.TP
346	\fB\ef\fR
347	.
348	formfeed, as in C
349	.TP
350	\fB\en\fR
351	.
352	newline, as in C
353	.TP
354	\fB\er\fR
355	.
356	carriage return, as in C
357	.TP
358	\fB\et\fR
359	.
360	horizontal tab, as in C
361	.TP
362	\fB\eu\fIwxyz\fR
363	.
364	(where \fIwxyz\fR is exactly four hexadecimal digits) the Unicode
365	character \fBU+\fIwxyz\fR in the local byte ordering
366	.TP
367	\fB\eU\fIstuvwxyz\fR
368	.
369	(where \fIstuvwxyz\fR is exactly eight hexadecimal digits) reserved
370	for a somewhat-hypothetical Unicode extension to 32 bits
371	.TP
372	\fB\ev\fR
373	.
374	vertical tab, as in C
375	.TP
376	\fB\ex\fIhhh\fR
377	.
378	(where \fIhhh\fR is any sequence of hexadecimal digits) the character
379	whose hexadecimal value is \fB0x\fIhhh\fR (a single character no
380	matter how many hexadecimal digits are used).
381	.TP
382	\fB\e0\fR
383	.
384	the character whose value is \fB0\fR
385	.TP
386	\fB\e\fIxy\fR
387	.
388	(where \fIxy\fR is exactly two octal digits, and is not a \fIback
389	reference\fR (see below)) the character whose octal value is
390	\fB0\fIxy\fR
391	.TP
392	\fB\e\fIxyz\fR
393	.
394	(where \fIxyz\fR is exactly three octal digits, and is not a back
395	reference (see below)) the character whose octal value is
396	\fB0\fIxyz\fR
397	.RE
398	.PP
399	Hexadecimal digits are
400	.QR \fB0\fR \fB9\fR ,
401	.QR \fBa\fR \fBf\fR ,
402	and
403	.QR \fBA\fR \fBF\fR .
404	Octal digits are
405	.QR \fB0\fR \fB7\fR .
406	.PP
407	The character-entry escapes are always taken as ordinary characters.
408	For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
409	not terminate a bracket expression. Beware, however, that some
410	applications (e.g., C compilers and the Tcl interpreter if the regular
411	expression is not quoted with braces) interpret such sequences
412	themselves before the regular-expression package gets to see them,
413	which may require doubling (quadrupling, etc.) the
414	.QW \fB\e\fR .
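.PP
The following sketch shows a few character-entry escapes and the
effect of Tcl quoting on the backslash.
It assumes the \fBregexp\fR command from \fBregexp\fR(n); the values
are chosen only for illustration.
.CS
regexp {\ex41} A         ;# returns 1: hexadecimal 41 is A
regexp {\eu0041} A       ;# returns 1: U+0041 is A
regexp {a\etb} "a\etb"    ;# returns 1: a, tab, b on both sides
regexp "\e\ed+" 42        ;# returns 1: in double quotes the \e is doubled
.CE
.PP
Enclosing the RE in braces, as in the first three commands, passes the
backslash to the regular-expression package unchanged.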
415	.SS "CLASS-SHORTHAND ESCAPES"
416	Class-shorthand escapes (AREs only) provide shorthands for certain
417	commonly-used character classes:
418	.RS 2
419	.TP 10
420	\fB\ed\fR
421	.
422	\fB[[:digit:]]\fR
423	.TP
424	\fB\es\fR
425	.
426	\fB[[:space:]]\fR
427	.TP
428	\fB\ew\fR
429	.
430	\fB[[:alnum:]_]\fR (note underscore)
431	.TP
432	\fB\eD\fR
433	.
434	\fB[^[:digit:]]\fR
435	.TP
436	\fB\eS\fR
437	.
438	\fB[^[:space:]]\fR
439	.TP
440	\fB\eW\fR
441	.
442	\fB[^[:alnum:]_]\fR (note underscore)
443	.RE
444	.PP
445	Within bracket expressions,
446	.QW \fB\ed\fR ,
447	.QW \fB\es\fR ,
448	and
449	.QW \fB\ew\fR \&
450	lose their outer brackets, and
451	.QW \fB\eD\fR ,
452	.QW \fB\eS\fR ,
453	and
454	.QW \fB\eW\fR \&
455	are illegal. (So, for example,
456	.QW \fB[a-c\ed]\fR
457	is equivalent to
458	.QW \fB[a-c[:digit:]]\fR .
459	Also,
460	.QW \fB[a-c\eD]\fR ,
461	which is equivalent to
462	.QW \fB[a-c^[:digit:]]\fR ,
463	is illegal.)
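.PP
As a small sketch of the class shorthands (assuming the \fBregexp\fR
command and its \fB\-inline\fR switch from \fBregexp\fR(n), with
invented sample strings):
.CS
regexp \-inline {\ed+} {room 101}      ;# returns 101
regexp \-inline {\ew+} {foo_bar baz}   ;# returns foo_bar
regexp \-inline {\eS+} {  padded  }    ;# returns padded
regexp \-inline {[a\-c\ed]+} xa1b2z     ;# returns a1b2
.CE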
464	.SS "CONSTRAINT ESCAPES"
465	A constraint escape (AREs only) is a constraint, matching the empty
466	string if specific conditions are met, written as an escape:
467	.RS 2
468	.TP 6
469	\fB\eA\fR
470	.
471	matches only at the beginning of the string (see \fBMATCHING\fR,
472	below, for how this differs from
473	.QW \fB^\fR )
474	.TP
475	\fB\em\fR
476	.
477	matches only at the beginning of a word
478	.TP
479	\fB\eM\fR
480	.
481	matches only at the end of a word
482	.TP
483	\fB\ey\fR
484	.
485	matches only at the beginning or end of a word
486	.TP
487	\fB\eY\fR
488	.
489	matches only at a point that is not the beginning or end of a word
490	.TP
491	\fB\eZ\fR
492	.
493	matches only at the end of the string (see \fBMATCHING\fR, below, for
494	how this differs from
495	.QW \fB$\fR )
496	.TP
497	\fB\e\fIm\fR
498	.
499	(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
500	.TP
501	\fB\e\fImnn\fR
502	.
503	(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
504	and the decimal value \fImnn\fR is not greater than the number of
505	closing capturing parentheses seen so far) a \fIback reference\fR, see
506	below
507	.RE
508	.PP
509	A word is defined as in the specification of
510	.QW \fB[[:<:]]\fR
511	and
512	.QW \fB[[:>:]]\fR
513	above. Constraint escapes are illegal within bracket expressions.
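.PP
The word-boundary escapes might be exercised like this.
The sketch assumes the \fBregexp\fR command with its \fB\-all\fR and
\fB\-inline\fR switches from \fBregexp\fR(n); the sample strings are
arbitrary.
.CS
regexp \-all \-inline {\emcat\eM} {cat concat cats cat}  ;# returns cat cat
regexp \-all {\eycat} {cat concat cats}                ;# returns 2
regexp {\eYcat} concat                                ;# returns 1
.CE
.PP
In the first command
.QW \fBconcat\fR
fails the \fB\em\fR test and
.QW \fBcats\fR
fails the \fB\eM\fR test, so only the two standalone words match.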
514	.SS "BACK REFERENCES"
515	A back reference (AREs only) matches the same string matched by the
516	parenthesized subexpression specified by the number, so that (e.g.)
517	.QW \fB([bc])\e1\fR
518	matches
519	.QW \fBbb\fR
520	or
521	.QW \fBcc\fR
522	but not
523	.QW \fBbc\fR .
524	The subexpression must entirely precede the back reference in the RE.
525	Subexpressions are numbered in the order of their leading parentheses.
526	Non-capturing parentheses do not define subexpressions.
527	.PP
528	There is an inherent historical ambiguity between octal
529	character-entry escapes and back references, which is resolved by
530	heuristics, as hinted at above. A leading zero always indicates an
531	octal escape. A single non-zero digit, not followed by another digit,
532	is always taken as a back reference. A multi-digit sequence not
533	starting with a zero is taken as a back reference if it comes after a
534	suitable subexpression (i.e. the number is in the legal range for a
535	back reference), and otherwise is taken as octal.
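.PP
A short sketch of back references, assuming the \fBregexp\fR command
and its \fB\-inline\fR switch from \fBregexp\fR(n); the last sample
string is invented.
.CS
regexp {([bc])\e1} bb                            ;# returns 1
regexp {([bc])\e1} bc                            ;# returns 0
regexp \-inline {(\ew+) \e1} {say hello hello}     ;# returns {hello hello} hello
.CE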
536	.SH "METASYNTAX"
537	In addition to the main syntax described above, there are some special
538	forms and miscellaneous syntactic facilities available.
539	.PP
540	Normally the flavor of RE being used is specified by
541	application-dependent means. However, this can be overridden by a
542	\fIdirector\fR. If an RE of any flavor begins with
543	.QW \fB***:\fR ,
544	the rest of the RE is an ARE. If an RE of any flavor begins with
545	.QW \fB***=\fR ,
546	the rest of the RE is taken to be a literal string, with
547	all characters considered ordinary characters.
548	.PP
549	An ARE may begin with \fIembedded options\fR: a sequence
550	\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
551	characters) specifies options affecting the rest of the RE. These
552	supplement, and can override, any options specified by the
553	application. The available option letters are:
554	.RS 2
555	.TP 3
556	\fBb\fR
557	.
558	rest of RE is a BRE
559	.TP 3
560	\fBc\fR
561	.
562	case-sensitive matching (usual default)
563	.TP 3
564	\fBe\fR
565	.
566	rest of RE is an ERE
567	.TP 3
568	\fBi\fR
569	.
570	case-insensitive matching (see \fBMATCHING\fR, below)
571	.TP 3
572	\fBm\fR
573	.
574	historical synonym for \fBn\fR
575	.TP 3
576	\fBn\fR
577	.
578	newline-sensitive matching (see \fBMATCHING\fR, below)
579	.TP 3
580	\fBp\fR
581	.
582	partial newline-sensitive matching (see \fBMATCHING\fR, below)
583	.TP 3
584	\fBq\fR
585	.
586	rest of RE is a literal
587	.PQ quoted
588	string, all ordinary characters
589	.TP 3
590	\fBs\fR
591	.
592	non-newline-sensitive matching (usual default)
593	.TP 3
594	\fBt\fR
595	.
596	tight syntax (usual default; see below)
597	.TP 3
598	\fBw\fR
599	.
600	inverse partial newline-sensitive
601	.PQ weird
602	matching (see \fBMATCHING\fR, below)
603	.TP 3
604	\fBx\fR
605	.
606	expanded syntax (see below)
607	.RE
608	.PP
609	Embedded options take effect at the \fB)\fR terminating the sequence.
610	They are available only at the start of an ARE, and may not be used
611	later within it.
612	.PP
613	In addition to the usual (\fItight\fR) RE syntax, in which all
614	characters are significant, there is an \fIexpanded\fR syntax,
615	available in all flavors of RE with the \fB\-expanded\fR switch, or in
616	AREs with the embedded x option. In the expanded syntax, white-space
617	characters are ignored and all characters between a \fB#\fR and the
618	following newline (or the end of the RE) are ignored, permitting
619	paragraphing and commenting a complex RE. There are three exceptions
620	to that basic rule:
621	.IP \(bu 3
622	a white-space character or
623	.QW \fB#\fR
624	preceded by
625	.QW \fB\e\fR
626	is retained
627	.IP \(bu 3
628	white space or
629	.QW \fB#\fR
630	within a bracket expression is retained
631	.IP \(bu 3
632	white space and comments are illegal within multi-character symbols
633	like the ARE
634	.QW \fB(?:\fR
635	or the BRE
636	.QW \fB\e(\fR
637	.PP
638	Expanded-syntax white-space characters are blank, tab, newline, and
639	any character that belongs to the \fIspace\fR character class.
640	.PP
641	Finally, in an ARE, outside bracket expressions, the sequence
642	.QW \fB(?#\fIttt\fB)\fR
643	(where \fIttt\fR is any text not containing a
644	.QW \fB)\fR )
645	is a comment, completely ignored. Again, this is not
646	allowed between the characters of multi-character symbols like
647	.QW \fB(?:\fR .
648	Such comments are more a historical artifact than a useful facility,
649	and their use is deprecated; use the expanded syntax instead.
650	.PP
651	\fINone\fR of these metasyntax extensions is available if the
652	application (or an initial
653	.QW \fB***=\fR
654	director) has specified that the
655	user's input be treated as a literal string rather than as an RE.
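.PP
The director and embedded-option forms can be sketched as follows,
assuming the \fBregexp\fR command from \fBregexp\fR(n).
The telephone-style pattern is purely illustrative.
.CS
regexp {***=a.b} a.b      ;# returns 1: the RE is a literal string
regexp {***=a.b} axb      ;# returns 0: . is an ordinary character here
regexp {(?i)tcl} TCL      ;# returns 1: case-insensitive matching
regexp {(?x)
    \ed{3}    # three digits
    \-        # a literal hyphen
    \ed{4}    # four digits
} 555\-1234                ;# returns 1: white space and comments are ignored
.CE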
656	.SH MATCHING
657	In the event that an RE could match more than one substring of a given
658	string, the RE matches the one starting earliest in the string. If
659	the RE could match more than one substring starting at that point, its
660	choice is determined by its \fIpreference\fR: either the longest
661	substring, or the shortest.
662	.PP
663	Most atoms, and all constraints, have no preference. A parenthesized
664	RE has the same preference (possibly none) as the RE. A quantified
665	atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
666	preference (possibly none) as the atom itself. A quantified atom with
667	other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
668	\fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
669	with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
670	with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
671	the same preference as the first quantified atom in it which has a
672	preference. An RE consisting of two or more branches connected by the
673	\fB|\fR operator prefers longest match.
674	.PP
675	Subject to the constraints imposed by the rules for matching the whole
676	RE, subexpressions also match the longest or shortest possible
677	substrings, based on their preferences, with subexpressions starting
678	earlier in the RE taking priority over ones starting later. Note that
679	outer subexpressions thus take priority over their component
680	subexpressions.
681	.PP
682	Note that the quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
683	force longest and shortest preference, respectively, on a
684	subexpression or a whole RE.
685	.PP
686	Match lengths are measured in characters, not collating elements. An
687	empty string is considered longer than no match at all. For example,
688	.QW \fBbb*\fR
689	matches the three middle characters of
690	.QW \fBabbbc\fR ,
691	.QW \fB(week|wee)(night|knights)\fR
692	matches all ten characters of
693	.QW \fBweeknights\fR ,
694	when
695	.QW \fB(.*).*\fR
696	is matched against
697	.QW \fBabc\fR
698	the parenthesized subexpression matches all three characters, and when
699	.QW \fB(a*)*\fR
700	is matched against
701	.QW \fBbc\fR
702	both the whole RE and the parenthesized subexpression match an empty string.
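.PP
Longest-match behavior of this kind can be observed with the
\fB\-inline\fR switch of the \fBregexp\fR command (see
\fBregexp\fR(n)), which returns the overall match followed by each
parenthesized subexpression.
The commands below are a sketch, not part of the specification.
.CS
regexp \-inline {bb*} abbbc                             ;# returns bbb
regexp \-inline {(week|wee)(night|knights)} weeknights  ;# returns weeknights wee knights
regexp \-inline {(.*).*} abc                            ;# returns abc abc
regexp \-inline {(a*)*} bc                              ;# returns {} {}
.CE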
703	.PP
704	If case-independent matching is specified, the effect is much as if
705	all case distinctions had vanished from the alphabet. When an
706	alphabetic that exists in multiple cases appears as an ordinary
707	character outside a bracket expression, it is effectively transformed
708	into a bracket expression containing both cases, so that \fBx\fR
709	becomes
710	.QW \fB[xX]\fR .
711	When it appears inside a bracket expression,
712	all case counterparts of it are added to the bracket expression, so
713	that
714	.QW \fB[x]\fR
715	becomes
716	.QW \fB[xX]\fR
717	and
718	.QW \fB[^x]\fR
719	becomes
720	.QW \fB[^xX]\fR .
721	.PP
722	If newline-sensitive matching is specified, \fB.\fR and bracket
723	expressions using \fB^\fR will never match the newline character (so
724	that matches will never cross newlines unless the RE explicitly
725	arranges it) and \fB^\fR and \fB$\fR will match the empty string after
726	and before a newline respectively, in addition to matching at
727	beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
728	continue to match beginning or end of string \fIonly\fR.
729	.PP
730	If partial newline-sensitive matching is specified, this affects
731	\fB.\fR and bracket expressions as with newline-sensitive matching,
732	but not \fB^\fR and \fB$\fR.
733	.PP
734	If inverse partial newline-sensitive matching is specified, this
735	affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
736	not \fB.\fR and bracket expressions. This is not very useful but is
737	provided for symmetry.
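.PP
The newline-sensitivity modes can be compared with the embedded
options \fBn\fR, \fBp\fR and \fBw\fR.
The following is a sketch that assumes the \fBregexp\fR command and its
\fB\-inline\fR switch from \fBregexp\fR(n); the variable \fBtext\fR and
its contents are invented for the example.
.CS
set text "first line\ensecond line"
regexp {^second} $text             ;# returns 0: ^ matches only at string start
regexp {(?n)^second} $text         ;# returns 1: ^ also matches after a newline
regexp {(?n)\eAsecond} $text        ;# returns 0: \eA is string start only
regexp \-inline {(?n).+} $text      ;# returns {first line}
regexp \-inline {(?p).+} $text      ;# returns {first line}
regexp {(?p)^second} $text         ;# returns 0: p leaves ^ and $ unchanged
regexp {(?w)^second} $text         ;# returns 1: w affects ^ and $ only
.CE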
738	.SH "LIMITS AND COMPATIBILITY"
739	No particular limit is imposed on the length of REs. Programs
740	intended to be highly portable should not employ REs longer than 256
741	bytes, as a POSIX-compliant implementation can refuse to accept such
742	REs.
743	.PP
744	The only feature of AREs that is actually incompatible with POSIX EREs
745	is that \fB\e\fR does not lose its special significance inside bracket
746	expressions. All other ARE features use syntax which is illegal or
747	has undefined or unspecified effects in POSIX EREs; the \fB***\fR
748	syntax of directors likewise is outside the POSIX syntax for both BREs
749	and EREs.
750	.PP
751	Many of the ARE extensions are borrowed from Perl, but some have been
752	changed to clean them up, and a few Perl extensions are not present.
753	Incompatibilities of note include
754	.QW \fB\eb\fR ,
755	.QW \fB\eB\fR ,
756	the lack of special treatment for a trailing newline, the addition of
757	complemented bracket expressions to the things affected by
758	newline-sensitive matching, the restrictions on parentheses and back
759	references in lookahead constraints, and the longest/shortest-match
760	(rather than first-match) matching semantics.
761	.PP
762	The matching rules for REs containing both normal and non-greedy
763	quantifiers have changed since early beta-test versions of this
764	package. (The new rules are much simpler and cleaner, but do not work
765	as hard at guessing the user's real intentions.)
766	.PP
767	Henry Spencer's original 1986 \fIregexp\fR package, still in
768	widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
769	early version of today's EREs. There are four incompatibilities
770	between \fIregexp\fR's near-EREs
771	.PQ RREs " for short"
772	and AREs. In roughly increasing order of significance:
773	.IP \(bu 3
774	In AREs, \fB\e\fR followed by an alphanumeric character is either an
775	escape or an error, while in RREs, it was just another way of writing
776	the alphanumeric. This should not be a problem because there was no
777	reason to write such a sequence in RREs.
778	.IP \(bu 3
779	\fB{\fR followed by a digit in an ARE is the beginning of a bound,
780	while in RREs, \fB{\fR was always an ordinary character. Such
781	sequences should be rare, and will often result in an error because
782	following characters will not look like a valid bound.
783	.IP \(bu 3
784	In AREs, \fB\e\fR remains a special character within
785	.QW \fB[\|]\fR ,
786	so a literal \fB\e\fR within \fB[\|]\fR must be written
787	.QW \fB\e\e\fR .
788	\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
789	but only truly paranoid programmers routinely doubled the backslash.
790	.IP \(bu 3
791	AREs report the longest/shortest match for the RE, rather than the
792	first found in a specified search order. This may affect some RREs
793	which were written in the expectation that the first match would be
794	reported. (The careful crafting of RREs to optimize the search order
795	for fast matching is obsolete (AREs examine all possible matches in
796	parallel, and their performance is largely insensitive to their
797	complexity) but cases where the search order was exploited to
798	deliberately find a match which was \fInot\fR the longest/shortest
799	will need rewriting.)
800	.SH "BASIC REGULAR EXPRESSIONS"
801	BREs differ from EREs in several respects.
802	.QW \fB|\fR ,
803	.QW \fB+\fR ,
804	and \fB?\fR are ordinary characters and there is no equivalent for their
805	functionality. The delimiters for bounds are \fB\e{\fR and
806	.QW \fB\e}\fR ,
807	with \fB{\fR and \fB}\fR by themselves ordinary characters. The
808	parentheses for nested subexpressions are \fB\e(\fR and
809	.QW \fB\e)\fR ,
810	with \fB(\fR and \fB)\fR by themselves ordinary
811	characters. \fB^\fR is an ordinary character except at the beginning
812	of the RE or the beginning of a parenthesized subexpression, \fB$\fR
813	is an ordinary character except at the end of the RE or the end of a
814	parenthesized subexpression, and \fB*\fR is an ordinary character if
815	it appears at the beginning of the RE or the beginning of a
816	parenthesized subexpression (after a possible leading
817	.QW \fB^\fR ).
818	Finally, single-digit back references are available, and \fB\e<\fR and
819	\fB\e>\fR are synonyms for
820	.QW \fB[[:<:]]\fR
821	and
822	.QW \fB[[:>:]]\fR
823	respectively; no other escapes are available.
824	.SH "SEE ALSO"
825	RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
826	.SH KEYWORDS
827	match, regular expression, string
828	.\" Local Variables:
829	.\" mode: nroff
830	.\" End: