| [12] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> | 
|---|
 | 2 | <html> | 
|---|
 | 3 |    <head> | 
|---|
 | 4 |       <title>Boost.Regex: Introduction</title> | 
|---|
 | 5 |       <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> | 
|---|
 | 6 |       <link rel="stylesheet" type="text/css" href="../../../boost.css"> | 
|---|
 | 7 |    </head> | 
|---|
 | 8 |    <body> | 
|---|
 | 9 |       <P> | 
|---|
 | 10 |          <TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0"> | 
|---|
 | 11 |             <TR> | 
|---|
 | 12 |                <td valign="top" width="300"> | 
|---|
 | 13 |                   <h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../boost.png" border="0"></a></h3> | 
|---|
 | 14 |                </td> | 
|---|
 | 15 |                <TD width="353"> | 
|---|
 | 16 |                   <H1 align="center">Boost.Regex</H1> | 
|---|
 | 17 |                   <H2 align="center">Introduction</H2> | 
|---|
 | 18 |                </TD> | 
|---|
 | 19 |                <td width="50"> | 
|---|
 | 20 |                   <h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3> | 
|---|
 | 21 |                </td> | 
|---|
 | 22 |             </TR> | 
|---|
 | 23 |          </TABLE> | 
|---|
 | 24 |       </P> | 
|---|
 | 25 |       <HR> | 
|---|
 | 26 |       <p></p> | 
|---|
 | 27 |       <P>Regular expressions are a form of pattern-matching that is often used in text  | 
|---|
 | 28 |          processing; many users will be familiar with the Unix utilities <I>grep</I>, <I>sed</I> | 
|---|
 | 29 |          and <I>awk</I>, and the programming language <I>Perl</I>, each of which makes  | 
|---|
 | 30 |          extensive use of regular expressions. Traditionally C++ users have been limited  | 
|---|
 | 31 |          to the POSIX C APIs for manipulating regular expressions, and while regex++  | 
|---|
 | 32 |          does provide these APIs, they do not represent the best way to use the  | 
|---|
 | 33 |          library. For example, regex++ can cope with wide character strings, or search  | 
|---|
 | 34 |          and replace operations (in a manner analogous to either sed or Perl), something  | 
|---|
 | 35 |          that traditional C libraries cannot do.</P> | 
|---|
 | 36 |       <P>The class <A href="basic_regex.html">boost::basic_regex</A> is the key class in  | 
|---|
 | 37 |          this library; it represents a "machine readable" regular expression, and is  | 
|---|
 | 38 |          very closely modeled on std::basic_string; think of it as a string plus the  | 
|---|
 | 39 |          actual state-machine required by the regular expression algorithms. As with  | 
|---|
 | 40 |          std::basic_string, there are two typedefs that are almost always the means by  | 
|---|
 | 41 |          which this class is referenced:</P> | 
|---|
 | 42 |       <pre><B>namespace </B>boost{ | 
|---|
 | 43 |  | 
|---|
 | 44 | <B>template</B> &lt;<B>class</B> charT,  | 
|---|
 | 45 | <B>          class</B> traits = regex_traits&lt;charT&gt; &gt; | 
|---|
 | 46 | <B>class</B> basic_regex; | 
|---|
 | 47 |  | 
|---|
 | 48 | <B>typedef</B> basic_regex&lt;<B>char</B>&gt; regex; | 
|---|
 | 49 | <B>typedef</B> basic_regex&lt;<B>wchar_t</B>&gt; wregex; | 
|---|
 | 50 |  | 
|---|
 | 51 | }</pre> | 
|---|
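<P>A minimal sketch of the two typedefs in use. It assumes std::regex and std::wregex from the C++11 standard library as stand-ins for boost::regex and boost::wregex, so that it builds without Boost installed; the helper names is_word and is_wide_word are made up for illustration.</P>

```cpp
#include <regex>
#include <string>

// Narrow characters: basic_regex<char>, reached through the regex typedef.
bool is_word(const std::string& s)
{
   static const std::regex e("[[:alpha:]]+");
   return std::regex_match(s, e);
}

// Wide characters: basic_regex<wchar_t>, reached through the wregex typedef.
bool is_wide_word(const std::wstring& s)
{
   static const std::wregex e(L"[[:alpha:]]+");
   return std::regex_match(s, e);
}
```

<P>The only differences between the two functions are the character type of the string and the typedef chosen to match it.</P>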
 | 52 |       <P>To see how this library can be used, imagine that we are writing a credit card  | 
|---|
 | 53 |          processing application. Credit card numbers generally come as a string of  | 
|---|
 | 54 |          16 digits, split into groups of 4 digits that are separated by either a space  | 
|---|
 | 55 |          or a hyphen. Before storing a credit card number in a database (not necessarily  | 
|---|
 | 56 |          something your customers will appreciate!), we may want to verify that the  | 
|---|
 | 57 |          number is in the correct format. To match any digit we could use the regular  | 
|---|
 | 58 |          expression [0-9]; however, ranges of characters like this are actually locale  | 
|---|
 | 59 |          dependent. Instead we should use the POSIX standard form [[:digit:]], or the  | 
|---|
 | 60 |          regex++ and Perl shorthand for this, \d (note that many older libraries tended  | 
|---|
 | 61 |          to be hard-coded to the C locale; consequently, this was not an issue for them).  | 
|---|
 | 62 |          That leaves us with the following regular expression to validate credit card  | 
|---|
 | 63 |          number formats:</P> | 
|---|
 | 64 |       <PRE>(\d{4}[- ]){3}\d{4}</PRE> | 
|---|
 | 65 |       <P>Here the parentheses act to group (and mark for future reference)  | 
|---|
 | 66 |          sub-expressions, and the {4} means "repeat exactly 4 times". This is an example  | 
|---|
 | 67 |          of the extended regular expression syntax used by Perl, awk and egrep. Regex++  | 
|---|
 | 68 |          also supports the older "basic" syntax used by sed and grep, but this is  | 
|---|
 | 69 |          generally less useful, unless you already have some basic regular expressions  | 
|---|
 | 70 |          that you need to reuse.</P> | 
|---|
 | 71 |       <P>Now let's take that expression and place it in some C++ code to validate the  | 
|---|
 | 72 |          format of a credit card number:</P> | 
|---|
 | 73 |       <PRE><B>bool</B> validate_card_format(<B>const</B> std::string s) | 
|---|
 | 74 | { | 
|---|
 | 75 |    <B>static</B> <B>const</B> <A href="basic_regex.html">boost::regex</A> e("(\\d{4}[- ]){3}\\d{4}"); | 
|---|
 | 76 |    <B>return</B> <A href="regex_match.html">regex_match</A>(s, e); | 
|---|
 | 77 | }</PRE> | 
|---|
 | 78 |       <P>Note how we had to add some extra escapes to the expression: remember that the  | 
|---|
 | 79 |          escape is seen once by the C++ compiler, before it gets to be seen by the  | 
|---|
 | 80 |          regular expression engine; consequently, escapes in regular expressions have to  | 
|---|
 | 81 |          be doubled up when embedding them in C/C++ code. Also note that all the  | 
|---|
 | 82 |          examples assume that your compiler supports Koenig lookup; if yours doesn't  | 
|---|
 | 83 |          (for example VC6), then you will have to add some boost:: prefixes to some of  | 
|---|
 | 84 |          the function calls in the examples.</P> | 
|---|
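<P>The doubling of escapes can be checked directly. The sketch below assumes a C++11 or later compiler, where a raw string literal carries the same characters without doubling, and it uses std::regex as a stand-in for boost::regex so that it builds without Boost. The names escaped_pattern, raw_pattern, posix_pattern and matches_format are hypothetical.</P>

```cpp
#include <regex>
#include <string>

// The doubled-escape form: the compiler turns each \\ into a single backslash.
const std::string escaped_pattern = "(\\d{4}[- ]){3}\\d{4}";

// The same characters written as a raw string literal, with no doubling.
const std::string raw_pattern = R"((\d{4}[- ]){3}\d{4})";

// The POSIX character class form contains no backslashes at all.
const std::string posix_pattern = "([[:digit:]]{4}[- ]){3}[[:digit:]]{4}";

// True if the whole of s matches the given pattern.
bool matches_format(const std::string& pattern, const std::string& s)
{
   return std::regex_match(s, std::regex(pattern));
}
```

<P>The first two strings compare equal, which is the point of the exercise: the regular expression engine only ever sees a single backslash before each d.</P>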
 | 85 |       <P>Those of you who are familiar with credit card processing will have realized  | 
|---|
 | 86 |          that while the format used above is suitable for human readable card numbers,  | 
|---|
 | 87 |          it does not represent the format required by online credit card systems; these  | 
|---|
 | 88 |          require the number as a string of 16 (or possibly 15) digits, without any  | 
|---|
 | 89 |          intervening spaces. What we need is a means to convert easily between the two  | 
|---|
 | 90 |          formats, and this is where search and replace comes in. Those who are familiar  | 
|---|
 | 91 |          with the utilities <I>sed</I> and <I>Perl</I> will already be ahead here; we  | 
|---|
 | 92 |          need two strings - one a regular expression - the other a "<A href="format_syntax.html">format  | 
|---|
 | 93 |             string</A>" that provides a description of the text to replace the match  | 
|---|
 | 94 |          with. In regex++ this search and replace operation is performed with the  | 
|---|
 | 95 |          algorithm <A href="regex_replace.html">regex_replace</A>; for our credit card  | 
|---|
 | 96 |          example we can write two algorithms like this to provide the format  | 
|---|
 | 97 |          conversions:</P> | 
|---|
 | 98 |       <PRE><I>// match any format with the regular expression: | 
|---|
 | 99 | </I><B>const</B> boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"); | 
|---|
 | 100 | <B>const</B> std::string machine_format("\\1\\2\\3\\4"); | 
|---|
 | 101 | <B>const</B> std::string human_format("\\1-\\2-\\3-\\4"); | 
|---|
 | 102 |  | 
|---|
 | 103 | std::string machine_readable_card_number(<B>const</B> std::string s) | 
|---|
 | 104 | { | 
|---|
 | 105 |    <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, machine_format, boost::match_default | boost::format_sed); | 
|---|
 | 106 | } | 
|---|
 | 107 |  | 
|---|
 | 108 | std::string human_readable_card_number(<B>const</B> std::string s) | 
|---|
 | 109 | { | 
|---|
 | 110 |    <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, human_format, boost::match_default | boost::format_sed); | 
|---|
 | 111 | }</PRE> | 
|---|
 | 112 |       <P>Here we've used marked sub-expressions in the regular expression to split out  | 
|---|
 | 113 |          the four parts of the card number as separate fields; the format string then  | 
|---|
 | 114 |          uses the sed-like syntax to replace the matched text with the reformatted  | 
|---|
 | 115 |          version.</P> | 
|---|
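<P>A self-contained variant of the two conversions, as a sketch only. It assumes std::regex in place of boost::regex so that it builds without Boost; because the standard library's default grammar has no \A or \z, the anchors ^ and $ are swapped in. The names card, machine_readable and human_readable are made up for this example.</P>

```cpp
#include <regex>
#include <string>

// ^ and $ stand in for the \A and \z anchors used in the text.
const std::regex card("^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");

// Strips the separators: each \N in the sed-style format string is
// replaced by the text of marked sub-expression N.
std::string machine_readable(const std::string& s)
{
   return std::regex_replace(s, card, "\\1\\2\\3\\4",
                             std::regex_constants::format_sed);
}

// Rejoins the four groups with hyphens.
std::string human_readable(const std::string& s)
{
   return std::regex_replace(s, card, "\\1-\\2-\\3-\\4",
                             std::regex_constants::format_sed);
}
```

<P>Text that does not match the expression is copied through unchanged, so a caller that needs to reject bad input should validate it first.</P>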
 | 116 |       <P>In the examples above, we haven't directly manipulated the results of a regular  | 
|---|
 | 117 |          expression match; however, in general the result of a match contains a number of  | 
|---|
 | 118 |          sub-expression matches in addition to the overall match. When the library needs  | 
|---|
 | 119 |          to report a regular expression match it does so using an instance of the class <A href="match_results.html"> | 
|---|
 | 120 |             match_results</A>; as before, there are typedefs of this class for the most  | 
|---|
 | 121 |          common cases: | 
|---|
 | 122 |       </P> | 
|---|
 | 123 |       <PRE><B>namespace </B>boost{ | 
|---|
 | 124 | <B>typedef</B> match_results&lt;<B>const</B> <B>char</B>*&gt; cmatch; | 
|---|
 | 125 | <B>typedef</B> match_results&lt;<B>const</B> <B>wchar_t</B>*&gt; wcmatch; | 
|---|
 | 126 | <STRONG>typedef</STRONG> match_results&lt;std::string::const_iterator&gt; smatch; | 
|---|
 | 127 | <STRONG>typedef</STRONG> match_results&lt;std::wstring::const_iterator&gt; wsmatch;  | 
|---|
 | 128 | }</PRE> | 
|---|
 | 129 |       <P>The algorithms <A href="regex_search.html">regex_search</A> and <A href="regex_match.html">regex_match</A> | 
|---|
 | 130 |          make use of match_results to report what matched; the difference between these  | 
|---|
 | 131 |          algorithms is that <A href="regex_match.html">regex_match</A> will only find  | 
|---|
 | 132 |          matches that consume <EM>all</EM> of the input text, whereas <A href="regex_search.html"> | 
|---|
 | 133 |             regex_search</A> will <EM>search</EM> for a match anywhere within the text  | 
|---|
 | 134 |          being matched.</P> | 
|---|
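<P>The difference between the two algorithms, and the way sub-expression matches are read out of the result, might be sketched as follows. std::regex and std::smatch are assumed as stand-ins for the boost:: names so that the code builds without Boost; number, whole_text_matches and first_group_anywhere are hypothetical names.</P>

```cpp
#include <regex>
#include <string>

const std::regex number("(\\d{4})[- ](\\d{4})");

// regex_match: succeeds only if the expression consumes all of s.
bool whole_text_matches(const std::string& s)
{
   std::smatch what;
   return std::regex_match(s, what, number);
}

// regex_search: finds the first match anywhere in s and returns its first
// marked sub-expression, or an empty string if nothing matched.
std::string first_group_anywhere(const std::string& s)
{
   std::smatch what;
   if (std::regex_search(s, what, number))
      return what[1].str();   // what[0] would be the overall match
   return std::string();
}
```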
 | 135 |       <P>Note that these algorithms are not restricted to searching regular C-strings;  | 
|---|
 | 136 |          any bidirectional iterator type can be searched, allowing for the possibility  | 
|---|
 | 137 |          of seamlessly searching almost any kind of data. | 
|---|
 | 138 |       </P> | 
|---|
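<P>As an illustration of searching something other than a string, the sketch below runs a search over a std::list of characters, whose iterators are bidirectional but not random access. It assumes std::regex and std::match_results in place of the boost:: equivalents; find_digits is a made-up helper.</P>

```cpp
#include <list>
#include <regex>
#include <string>

// Returns the first run of digits found in the list, or an empty string.
std::string find_digits(const std::list<char>& text)
{
   typedef std::list<char>::const_iterator iter;
   static const std::regex e("\\d+");

   // match_results is parameterized on the iterator type being searched.
   std::match_results<iter> what;
   if (std::regex_search(text.begin(), text.end(), what, e))
      return what[0].str();
   return std::string();
}
```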
 | 139 |       <P>For search and replace operations, in addition to the algorithm <A href="regex_replace.html"> | 
|---|
 | 140 |             regex_replace</A> that we have already seen, the <A href="match_results.html">match_results</A> | 
|---|
 | 141 |          class has a format member that takes the result of a match and a format string,  | 
|---|
 | 142 |          and produces a new string by merging the two.</P> | 
|---|
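<P>A short sketch of the format member, assuming std::smatch in place of boost::smatch so that it builds without Boost. The format string uses the $N placeholders of the default format syntax; swap_pair is a hypothetical helper.</P>

```cpp
#include <regex>
#include <string>

// Rewrites "key=value" as "value <- key" by merging the match with a
// format string. Input of any other shape is returned unchanged.
std::string swap_pair(const std::string& s)
{
   static const std::regex e("(\\w+)=(\\w+)");
   std::smatch what;
   if (std::regex_match(s, what, e))
      return what.format("$2 <- $1");
   return s;
}
```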
 | 143 |       <P>For iterating through all occurrences of an expression within a text, there are  | 
|---|
 | 144 |          two iterator types: <A href="regex_iterator.html">regex_iterator</A> will  | 
|---|
 | 145 |          enumerate over the <A href="match_results.html">match_results</A> objects  | 
|---|
 | 146 |          found, while <A href="regex_token_iterator.html">regex_token_iterator</A> will  | 
|---|
 | 147 |          enumerate a series of strings (similar to Perl-style split operations).</P> | 
|---|
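<P>The two iterator types can be sketched side by side. The code assumes std::sregex_iterator and std::sregex_token_iterator as stand-ins for the boost:: typedefs of the same names; all_numbers and split_on_commas are hypothetical helpers.</P>

```cpp
#include <regex>
#include <string>
#include <vector>

// regex_iterator: each dereference yields a match_results object.
std::vector<std::string> all_numbers(const std::string& s)
{
   static const std::regex e("\\d+");
   std::vector<std::string> out;
   for (std::sregex_iterator i(s.begin(), s.end(), e), end; i != end; ++i)
      out.push_back(i->str());
   return out;
}

// regex_token_iterator: sub-match index -1 selects the text between
// matches, which gives a split operation.
std::vector<std::string> split_on_commas(const std::string& s)
{
   static const std::regex comma(",");
   std::vector<std::string> out;
   for (std::sregex_token_iterator i(s.begin(), s.end(), comma, -1), end;
        i != end; ++i)
      out.push_back(i->str());
   return out;
}
```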
 | 148 |       <P>For those who dislike templates, there is a high level wrapper class RegEx  | 
|---|
 | 149 |          that is an encapsulation of the lower level template code - it provides a  | 
|---|
 | 150 |          simplified interface for those who don't need the full power of the library,  | 
|---|
 | 151 |          and supports only narrow characters, and the "extended" regular expression  | 
|---|
 | 152 |          syntax. This class is now deprecated as it does not form part of the regular  | 
|---|
 | 153 |          expressions C++ standard library proposal. | 
|---|
 | 154 |       </P> | 
|---|
 | 155 |       <P>The <A href="posix_api.html">POSIX API</A> functions: regcomp, regexec, regfree  | 
|---|
 | 156 |          and regerror, are available in both narrow character and Unicode versions, and  | 
|---|
 | 157 |          are provided for those who need compatibility with these APIs. | 
|---|
 | 158 |       </P> | 
|---|
 | 159 |       <P>Finally, note that the library now has run-time <A href="localisation.html">localization</A> | 
|---|
 | 160 |          support, and recognizes the full POSIX regular expression syntax - including  | 
|---|
 | 161 |          advanced features like multi-character collating elements and equivalence  | 
|---|
 | 162 |          classes - as well as providing compatibility with other regular expression  | 
|---|
 | 163 |          libraries including GNU and BSD4 regex packages, and to a more limited extent  | 
|---|
 | 164 |          Perl 5. | 
|---|
 | 165 |       </P> | 
|---|
 | 166 |       <P> | 
|---|
 | 167 |          <HR> | 
|---|
 | 168 |       <P></P> | 
|---|
 | 169 |       <p>Revised  | 
|---|
 | 170 |          <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->  | 
|---|
 | 171 |          24 Oct 2003  | 
|---|
 | 172 |          <!--webbot bot="Timestamp" endspan i-checksum="39359" --></p> | 
|---|
 | 173 |       <p><i>© Copyright John Maddock 1998-  | 
|---|
 | 174 |             <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y" startspan -->  | 
|---|
 | 175 |             2003<!--webbot bot="Timestamp" endspan i-checksum="39359" --></i></p> | 
|---|
 | 176 |       <P><I>Use, modification and distribution are subject to the Boost Software License,  | 
|---|
 | 177 |             Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A> | 
|---|
 | 178 |             or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P> | 
|---|
 | 179 |    </body> | 
|---|
 | 180 | </html> | 
|---|