source: downloads/boost_1_33_1/libs/regex/doc/introduction.html @ 20

Last change on this file since 20 was 12, checked in by landauf, 18 years ago
added boost
File size: 11.2 KB

Rev	Line
[12]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
	2	<html>
	3	<head>
	4	<title>Boost.Regex: Introduction</title>
	5	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
	6	<link rel="stylesheet" type="text/css" href="../../../boost.css">
	7	</head>
	8	<body>
	9	<P>
	10	<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
	11	<TR>
	12	<td valign="top" width="300">
	13	<h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../boost.png" border="0"></a></h3>
	14	</td>
	15	<TD width="353">
	16	<H1 align="center">Boost.Regex</H1>
	17	<H2 align="center">Introduction</H2>
	18	</TD>
	19	<td width="50">
	20	<h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3>
	21	</td>
	22	</TR>
	23	</TABLE>
	24	</P>
	25	<HR>
	26	<p></p>
<P>Regular expressions are a form of pattern matching often used in text
processing; many users will be familiar with the Unix utilities <I>grep</I>, <I>sed</I>
and <I>awk</I>, and the programming language <I>Perl</I>, each of which makes
extensive use of regular expressions. Traditionally, C++ users have been limited
to the POSIX C APIs for manipulating regular expressions, and while regex++
does provide these APIs, they do not represent the best way to use the
library. For example, regex++ can cope with wide character strings and with search
and replace operations (in a manner analogous to either sed or Perl), neither of
which traditional C libraries can do.</P>
<P>The class <A href="basic_regex.html">boost::basic_regex</A> is the key class in
this library; it represents a "machine readable" regular expression and is
very closely modeled on std::basic_string. Think of it as a string plus the
actual state machine required by the regular expression algorithms. As with
std::basic_string, there are two typedefs that are almost always the means by
which this class is referenced:</P>
<pre><B>namespace </B>boost{

<B>template</B> &lt;<B>class</B> charT,
          <B>class</B> traits = regex_traits&lt;charT&gt; &gt;
<B>class</B> basic_regex;

<B>typedef</B> basic_regex&lt;<B>char</B>&gt; regex;
<B>typedef</B> basic_regex&lt;<B>wchar_t</B>&gt; wregex;

}</pre>
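<P>As a rough illustration of the two typedefs, the sketch below runs the same
check on narrow and wide strings. It is written against the standard library's
<code>std::regex</code> and <code>std::wregex</code> (an assumption made so that it
builds with a C++11 compiler alone); the function name <code>is_four_digits</code>
is hypothetical and not part of the library.</P>

```cpp
#include <regex>
#include <string>

// Narrow-character check: std::regex is basic_regex<char>.
bool is_four_digits(const std::string& s)
{
    static const std::regex e("\\d{4}");
    return std::regex_match(s, e);
}

// Wide-character check: std::wregex is basic_regex<wchar_t>.
bool is_four_digits(const std::wstring& s)
{
    static const std::wregex e(L"\\d{4}");
    return std::regex_match(s, e);
}
```

<P>The two overloads differ only in the character type, which is the point of
having a single class template behind both typedefs.</P>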
<P>To see how this library can be used, imagine that we are writing a credit card
processing application. Credit card numbers generally come as a string of
16 digits, split into groups of 4 digits separated by either a space
or a hyphen. Before storing a credit card number in a database (not necessarily
something your customers will appreciate!), we may want to verify that the
number is in the correct format. To match any digit we could use the regular
expression [0-9]; however, ranges of characters like this are actually locale
dependent. Instead we should use the POSIX standard form [[:digit:]], or the
regex++ and Perl shorthand for this, \d. (Many older libraries tended
to be hard-coded to the C locale, so this was not an issue for them.)
That leaves us with the following regular expression to validate credit card
number formats:</P>
	64	<PRE>(\d{4}[- ]){3}\d{4}</PRE>
<P>Here the parentheses act to group (and mark for future reference)
sub-expressions, and the {4} means "repeat exactly 4 times". This is an example
of the extended regular expression syntax used by Perl, awk and egrep. Regex++
also supports the older "basic" syntax used by sed and grep, but this is
generally less useful unless you already have some basic regular expressions
that you need to reuse.</P>
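<P>To make the three digit spellings concrete, the following sketch builds the
card-layout expression from whichever spelling is passed in. It assumes
<code>std::regex</code> with its default grammar as a stand-in for the library, and
<code>matches_card_layout</code> is a hypothetical helper. For plain ASCII input in
the default locale, all three spellings should accept the same strings.</P>

```cpp
#include <regex>
#include <string>

// Build "(D{4}[- ]){3}D{4}", where D is the chosen spelling of "one digit",
// and test whether the whole of s has that layout.
bool matches_card_layout(const std::string& s, const std::string& digit)
{
    const std::regex e("(" + digit + "{4}[- ]){3}" + digit + "{4}");
    return std::regex_match(s, e);
}
```

<P>Note that the expression allows a different separator after each group, so
an input such as "1234-5678 9012-3456" is also accepted.</P>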
	71	<P>Now let's take that expression and place it in some C++ code to validate the
	72	format of a credit card number:</P>
<PRE><B>bool</B> validate_card_format(<B>const</B> std::string&amp; s)
{
   <B>static</B> <B>const</B> <A href="basic_regex.html">boost::regex</A> e("(\\d{4}[- ]){3}\\d{4}");
   <B>return</B> <A href="regex_match.html">regex_match</A>(s, e);
}</PRE>
<P>Note how we had to add some extra escapes to the expression: the
escape is seen once by the C++ compiler before it gets to be seen by the
regular expression engine, so escapes in regular expressions have to
be doubled up when embedding them in C/C++ code. Also note that all the
examples assume that your compiler supports Koenig lookup; if yours doesn't
(for example VC6), then you will have to add some boost:: prefixes to some of
the function calls in the examples.</P>
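<P>If a C++11 compiler is available (an assumption; this page predates that
standard), a raw string literal avoids the doubled escapes. The sketch below
uses <code>std::regex</code> as a stand-in and writes the same expression both
ways; the function names are hypothetical.</P>

```cpp
#include <regex>
#include <string>

// Raw string literal: the text between R"( and )" reaches the regex engine
// unchanged, so \d is written with a single backslash.
bool validate_card_format_raw(const std::string& s)
{
    static const std::regex e(R"((\d{4}[- ]){3}\d{4})");
    return std::regex_match(s, e);
}

// Ordinary string literal: each backslash must be doubled for the compiler.
bool validate_card_format_escaped(const std::string& s)
{
    static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
    return std::regex_match(s, e);
}
```

<P>Both functions compile the same pattern, so they accept and reject the same
inputs.</P>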
<P>Those of you who are familiar with credit card processing will have realized
that while the format used above is suitable for human readable card numbers,
it does not represent the format required by online credit card systems; these
require the number as a string of 16 (or possibly 15) digits, without any
intervening spaces. What we need is a means to convert easily between the two
formats, and this is where search and replace comes in. Those who are familiar
with <I>sed</I> and <I>Perl</I> will already be ahead here; we
need two strings: one a regular expression, the other a "<A href="format_syntax.html">format
string</A>" that provides a description of the text to replace the match
with. In regex++ this search and replace operation is performed with the
algorithm <A href="regex_replace.html">regex_replace</A>. For our credit card
example we can write two functions like this to provide the format
conversions:</P>
<PRE><I>// match any format with the regular expression:
</I><B>const</B> boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
<B>const</B> std::string machine_format("\\1\\2\\3\\4");
<B>const</B> std::string human_format("\\1-\\2-\\3-\\4");

std::string machine_readable_card_number(<B>const</B> std::string&amp; s)
{
   <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, machine_format, boost::match_default | boost::format_sed);
}

std::string human_readable_card_number(<B>const</B> std::string&amp; s)
{
   <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, human_format, boost::match_default | boost::format_sed);
}</PRE>
<P>Here we've used marked sub-expressions in the regular expression to split out
the four parts of the card number as separate fields; the format string then
uses the sed-like syntax to replace the matched text with the reformatted
version.</P>
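<P>For comparison, here is a sketch of the same two conversions using the
default <code>$n</code> placeholders instead of the sed-style <code>\n</code> form.
It assumes <code>std::regex</code> as a stand-in; because that grammar has no
<code>\A</code> or <code>\z</code>, the anchors <code>^</code> and <code>$</code> are
swapped in. The function names are hypothetical.</P>

```cpp
#include <regex>
#include <string>

namespace {
// Four marked groups, with an optional space or hyphen between them.
const std::regex card_expr(
    "^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");
}

// Strip the separators: "$1$2$3$4" concatenates the four groups.
std::string machine_readable(const std::string& s)
{
    return std::regex_replace(s, card_expr, "$1$2$3$4");
}

// Insert hyphens between the four groups.
std::string human_readable(const std::string& s)
{
    return std::regex_replace(s, card_expr, "$1-$2-$3-$4");
}
```

<P>Input that does not match the expression is returned unchanged, so a caller
that needs validation should check the format separately.</P>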
<P>In the examples above, we haven't directly manipulated the results of a regular
expression match; in general, however, the result of a match contains a number of
sub-expression matches in addition to the overall match. When the library needs
to report a regular expression match, it does so using an instance of the class
<A href="match_results.html">match_results</A>. As before, there are typedefs of
this class for the most common cases:
</P>
<PRE><B>namespace </B>boost{
<B>typedef</B> match_results&lt;<B>const</B> <B>char</B>*&gt; cmatch;
<B>typedef</B> match_results&lt;<B>const</B> <B>wchar_t</B>*&gt; wcmatch;
<STRONG>typedef</STRONG> match_results&lt;std::string::const_iterator&gt; smatch;
<STRONG>typedef</STRONG> match_results&lt;std::wstring::const_iterator&gt; wsmatch;
}</PRE>
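<P>A minimal sketch of reading sub-expression matches out of a match result
follows. It assumes <code>std::regex</code> and <code>std::smatch</code> as
stand-ins for the typedefs above, and <code>card_groups</code> is a hypothetical
helper name.</P>

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Return the four digit groups of a formatted card number, or an empty
// vector when the text does not have that layout.
std::vector<std::string> card_groups(const std::string& s)
{
    static const std::regex e(
        "(\\d{4})[- ](\\d{4})[- ](\\d{4})[- ](\\d{4})");
    std::smatch what;
    std::vector<std::string> groups;
    if (std::regex_match(s, what, e)) {
        // what[0] is the overall match; what[1] onwards are the marked
        // sub-expressions, in the order their parentheses open.
        for (std::size_t i = 1; i < what.size(); ++i)
            groups.push_back(what[i].str());
    }
    return groups;
}
```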
<P>The algorithms <A href="regex_search.html">regex_search</A> and <A href="regex_match.html">regex_match</A>
make use of match_results to report what matched; the difference between these
algorithms is that <A href="regex_match.html">regex_match</A> will only find
matches that consume <EM>all</EM> of the input text, whereas
<A href="regex_search.html">regex_search</A> will <EM>search</EM> for a match
anywhere within the text being matched.</P>
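<P>The difference can be sketched with one expression and two hypothetical
helper functions, here using <code>std::regex</code> as a stand-in for the
library:</P>

```cpp
#include <regex>
#include <string>

namespace {
const std::regex number_expr("\\d+");
}

// True only when the entire text is a run of digits.
bool whole_text_is_number(const std::string& s)
{
    return std::regex_match(s, number_expr);
}

// True when a run of digits occurs anywhere in the text.
bool text_contains_number(const std::string& s)
{
    return std::regex_search(s, number_expr);
}
```

<P>Given "order 12345", the first function returns false because the leading
word is not consumed by the expression, while the second returns true.</P>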
<P>Note that these algorithms are not restricted to searching regular C-strings:
any bidirectional iterator type can be searched, allowing for the possibility
of seamlessly searching almost any kind of data.
</P>
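<P>As a sketch of searching something other than a string, the function below
looks for the first run of digits in a <code>std::list&lt;char&gt;</code>, whose
iterators are bidirectional but not random access. It assumes
<code>std::regex</code> as a stand-in; <code>first_number</code> is a hypothetical
name.</P>

```cpp
#include <list>
#include <regex>
#include <string>

// Return the first run of digits found in the list, or an empty string.
std::string first_number(const std::list<char>& text)
{
    typedef std::list<char>::const_iterator iter;
    static const std::regex e("\\d+");
    // match_results is parameterised on the iterator type being searched.
    std::match_results<iter> what;
    if (std::regex_search(text.cbegin(), text.cend(), what, e))
        return what[0].str();
    return std::string();
}
```

<P>None of the predefined match typedefs fits a list iterator, so the
match_results instantiation is spelled out explicitly.</P>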
	139	<P>For search and replace operations, in addition to the algorithm <A href="regex_replace.html">
	140	regex_replace</A> that we have already seen, the <A href="match_results.html">match_results</A>
	141	class has a format member that takes the result of a match and a format string,
	142	and produces a new string by merging the two.</P>
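<P>A small sketch of the format member, assuming <code>std::smatch</code> as a
stand-in and the default <code>$n</code> placeholder syntax;
<code>describe_card</code> is a hypothetical helper:</P>

```cpp
#include <regex>
#include <string>

// Produce a short description from the last group of a formatted card
// number, or an empty string when the text does not match.
std::string describe_card(const std::string& s)
{
    static const std::regex e(
        "(\\d{4})[- ](\\d{4})[- ](\\d{4})[- ](\\d{4})");
    std::smatch what;
    if (!std::regex_match(s, what, e))
        return std::string();
    // format() merges the stored sub-matches into the format string.
    return what.format("ends in $4");
}
```

<P>Unlike regex_replace, format works on a match that has already been found,
which is convenient when the same match is inspected and reformatted.</P>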
<P>For iterating through all occurrences of an expression within a text, there are
two iterator types: <A href="regex_iterator.html">regex_iterator</A> will
enumerate over the <A href="match_results.html">match_results</A> objects
found, while <A href="regex_token_iterator.html">regex_token_iterator</A> will
enumerate a series of strings (similar to Perl-style split operations).</P>
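<P>The two iterator types might be used along these lines. The sketch assumes
the standard library's <code>std::sregex_iterator</code> and
<code>std::sregex_token_iterator</code> as stand-ins, and both function names are
hypothetical.</P>

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every run of digits: each step of the iterator yields one
// match_results object.
std::vector<std::string> all_numbers(const std::string& text)
{
    static const std::regex e("\\d+");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), e), end;
         it != end; ++it)
        out.push_back(it->str());
    return out;
}

// Split on commas with optional surrounding whitespace. Sub-match index -1
// selects the text between matches rather than the matches themselves.
std::vector<std::string> split_on_commas(const std::string& text)
{
    static const std::regex sep("\\s*,\\s*");
    std::vector<std::string> out;
    for (std::sregex_token_iterator it(text.begin(), text.end(), sep, -1), end;
         it != end; ++it)
        out.push_back(it->str());
    return out;
}
```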
<P>For those who dislike templates, there is a high level wrapper class RegEx
that is an encapsulation of the lower level template code. It provides a
simplified interface for those who don't need the full power of the library,
and supports only narrow characters and the "extended" regular expression
syntax. This class is now deprecated as it does not form part of the regular
expressions C++ standard library proposal.
</P>
<P>The <A href="posix_api.html">POSIX API</A> functions regcomp, regexec, regfree
and regerror are available in both narrow character and Unicode versions, and
are provided for those who need compatibility with these APIs.
</P>
	159	<P>Finally, note that the library now has run-time <A href="localisation.html">localization</A>
	160	support, and recognizes the full POSIX regular expression syntax - including
	161	advanced features like multi-character collating elements and equivalence
	162	classes - as well as providing compatibility with other regular expression
	163	libraries including GNU and BSD4 regex packages, and to a more limited extent
	164	Perl 5.
	165	</P>
	166	<P>
	167	<HR>
	168	<P></P>
	169	<p>Revised
	170	<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->
	171	24 Oct 2003
	172	<!--webbot bot="Timestamp" endspan i-checksum="39359" --></p>
	173	<p><i>© Copyright John Maddock 1998-
	174	<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y" startspan -->
	175	2003<!--webbot bot="Timestamp" endspan i-checksum="39359" --></i></p>
	176	<P><I>Use, modification and distribution are subject to the Boost Software License,
	177	Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A>
	178	or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P>
	179	</body>
	180	</html>
