tokenization.qbk @ 29

Last change on this file since 29 was 29, checked in by landauf, 17 years ago
updated boost from 1_33_1 to 1_34_1
[section String Splitting and Tokenization]

_regex_token_iterator_ is the Ginsu knife of the text manipulation world. It slices! It dices! This section describes
how to use the highly configurable _regex_token_iterator_ to chop up input sequences.

[h2 Overview]

You initialize a _regex_token_iterator_ with an input sequence, a regex, and some optional configuration parameters.
The _regex_token_iterator_ will use _regex_search_ to find the first place in the sequence that the regex matches. When
dereferenced, the _regex_token_iterator_ returns a ['token] in the form of a `std::basic_string<>`. Which string it returns
depends on the configuration parameters. By default it returns a string corresponding to the full match, but it could also
return a string corresponding to a particular marked sub-expression, or even the part of the sequence that ['didn't] match.
When you increment the _regex_token_iterator_, it will move to the next token. Which token is next depends on the configuration
parameters. It could simply be a different marked sub-expression in the current match, or it could be part or all of the
next match. Or it could be the part that ['didn't] match.

As you can see, _regex_token_iterator_ can do a lot. That makes it hard to describe, but some examples should make it clear.

[h2 Example 1: Simple Tokenization]

This example uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words.

    std::string input("This is his face");
    sregex re = +_w; // find a word

    // iterate over all the words in the input
    sregex_token_iterator begin( input.begin(), input.end(), re ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 2: Simple Tokenization, Reloaded]

This example also uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words,
but it uses the regex as a delimiter. When we pass a `-1` as the last parameter to the _regex_token_iterator_
constructor, it instructs the token iterator to consider as tokens those parts of the input that ['didn't]
match the regex.

    std::string input("This is his face");
    sregex re = +_s; // find white space

    // iterate over all non-white space in the input. Note the -1 below:
    sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 3: Simple Tokenization, Revolutions]

This example also uses _regex_token_iterator_ to chop a sequence containing a bunch of dates into a series of
tokens consisting of just the years. When we pass a positive integer [^['N]] as the last parameter to the
_regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens only the [^['N]]-th
marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
    sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

    // write all the years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
2003
1999
1981
]

[h2 Example 4: Not-So-Simple Tokenization]

This example is like the previous one, except that instead of tokenizing just the years, this program
turns the days, months and years into tokens. When we pass an array of integers [^['{I,J,...}]] as the last
parameter to the _regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens the
[^['I]]-th, [^['J]]-th, etc. marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over the days, months and years in the input
    int const sub_matches[] = { 2, 1, 3 }; // day, month, year
    sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

    // write all the tokens to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
02
01
2003
23
04
1999
13
11
1981
]

The `sub_matches` array instructs the _regex_token_iterator_ to first take the value of the 2nd sub-match, then
the 1st sub-match, and finally the 3rd. Incrementing the iterator again instructs it to use _regex_search_ again
to find the next match. At that point, the process repeats -- the token iterator takes the value of the 2nd
sub-match, then the 1st, et cetera.

[endsect]