source: downloads/boost_1_33_1/libs/program_options/doc/design.xml @ 20

Last change on this file since 20 was 12, checked in by landauf, 18 years ago
added boost
File size: 9.0 KB

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN"
     "http://www.boost.org/tools/boostbook/dtd/boostbook.dtd"
[
    <!ENTITY % entities SYSTEM "program_options.ent" >
    %entities;
]>
<section id="program_options.design">
  <title>Design Discussion</title>

  <para>This section focuses on some of the design questions.
  </para>

  <section id="program_options.design.unicode">

    <title>Unicode Support</title>

    <para>Unicode support was one of the features specifically requested
      during the formal review. Throughout this document, "Unicode support" is
      a synonym for "wchar_t" support, assuming that "wchar_t" always uses
      Unicode encoding. Also, "ascii" (in lowercase) does not mean strict
      7-bit ASCII encoding, but rather "char" strings in the local
      8-bit encoding.
    </para>

    <para>
      Generally, "Unicode support" can mean
      many things, but for the program_options library it means that:

      <itemizedlist>
        <listitem>
          <para>Each parser should accept either <code>char*</code>
            or <code>wchar_t*</code>, correctly split the input into option
            names and option values, and return the data.
          </para>
        </listitem>
        <listitem>
          <para>For each option, it should be possible to specify whether the conversion
            from string to value uses ascii or Unicode.
          </para>
        </listitem>
        <listitem>
          <para>The library guarantees that:
            <itemizedlist>
              <listitem>
                <para>ascii input is passed to an ascii value without change
                </para>
              </listitem>
              <listitem>
                <para>Unicode input is passed to a Unicode value without change</para>
              </listitem>
              <listitem>
                <para>ascii input passed to a Unicode value, and Unicode input
                  passed to an ascii value, will be converted using a codecvt
                  facet (which may be specified by the user)
                </para>
              </listitem>
            </itemizedlist>
          </para>
        </listitem>
      </itemizedlist>
    </para>

    <para>The important point is that it's possible to have some "ascii
      options" together with "Unicode options". There are two reasons for
      this. First, for a given type you might not have the code to extract the
      value from a Unicode string, and it's not good to require that such code be written.
      Second, imagine a reusable library which has some options and exposes
      its options description in its interface. If <emphasis>all</emphasis>
      options must be either ascii or Unicode, and the library does not use any
      Unicode strings, then the author will likely use ascii options, which
      would make the library unusable inside Unicode
      applications. Essentially, it would be necessary to provide two versions
      of the library -- ascii and Unicode.
    </para>

    <para>Another important point is that ascii strings are passed through
      without modification. In other words, it's not possible to just convert
      ascii to Unicode and process the Unicode further. The problem is that the
      default conversion mechanism -- the <code>codecvt</code> facet -- might
      not work with 8-bit input without additional setup.
    </para>

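    <para>As an illustration of that conversion mechanism, the following is a minimal
      sketch (not the library's implementation) that widens a <code>char</code>
      string through the standard <code>codecvt</code> facet of a locale. The helper
      name <code>widen</code> and the use of the classic locale as the default are
      assumptions made for this example.
    </para>
<programlisting><![CDATA[
```cpp
#include <cassert>
#include <cstddef>    // std::size_t
#include <cwchar>     // std::mbstate_t
#include <locale>     // std::locale, std::codecvt, std::use_facet
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a local 8-bit string to wchar_t using the
// codecvt facet of 'loc'. Throws if the facet cannot convert the input.
std::wstring widen(const std::string& in,
                   const std::locale& loc = std::locale::classic())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    std::mbstate_t state = std::mbstate_t();
    // At most one wchar_t is produced per input byte, so this is large enough.
    std::wstring out(in.size(), L'\0');
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv)
        return std::wstring(in.begin(), in.end()); // facet converts nothing: widen each char
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("codecvt facet could not convert the input");

    assert(to_next >= &out[0]);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```
]]></programlisting>
    <para>With the classic locale, plain 7-bit input such as <code>"--name=value"</code>
      converts as expected. Whether bytes above 0x7F convert or make the function
      throw depends on the platform and on the locale passed in, which is the
      kind of additional setup mentioned above.
    </para>
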
    <para>The Unicode support outlined above is not complete. For example, we
      don't plan to allow Unicode in option names. Unicode support is hard and
      requires a Boost-wide solution. Even comparing two arbitrary Unicode
      strings is non-trivial. Finally, using Unicode in option names is
      related to internationalization, which has its own
      complexities. For example, if option names depend on the current locale, then
      every part of the program, and every other component that uses the names, must be
      internationalized too.
    </para>

    <para>The primary question in implementing the Unicode support is whether
      to use templates and <code>std::basic_string</code> or to use some
      internal encoding and convert between internal and external encodings on
      the interface boundaries.
    </para>

    <para>The choice, mostly, is between code size and execution
      speed. A templated solution would either link library code into every
      application that uses the library (thereby making a shared library
      impossible), or provide explicit instantiations in the shared library
      (increasing its size). The solution based on an internal encoding would
      necessarily make conversions in a number of places and would be somewhat slower.
      Since speed is generally not an issue for this library, the second
      solution looks more attractive, but we'll take a closer look at
      individual components.
    </para>

    <para>For the parsers component, we have three choices:
      <itemizedlist>
        <listitem>
          <para>Use a fully templated implementation: given a string of a
            certain type, a parser will return a &parsed_options; instance
            with strings of the same type (i.e. the &parsed_options; class
            will be templated).</para>
        </listitem>
        <listitem>
          <para>Use internal encoding: same as above, but strings will be converted to and
            from the internal encoding.</para>
        </listitem>
        <listitem>
          <para>Use and partly expose the internal encoding: same as above,
            but the strings in the &parsed_options; instance will be in the
            internal encoding. This might avoid a conversion if the
            &parsed_options; instance is passed directly to other components,
            but can also be dangerous or confusing for a user.
          </para>
        </listitem>
      </itemizedlist>
    </para>

    <para>The second solution appears to be the best -- it does not increase
      the code size much and is cleaner than the third. To avoid extra
      conversions, the Unicode version of &parsed_options; can also store
      strings in the internal encoding.
    </para>

    <para>For the options descriptions component, we don't have much
      choice. Since it's not desirable to have either all options use ascii or all
      of them use Unicode, but rather to have some ascii and some Unicode options, the
      interface of &value_semantic; must work with both. The only way is
      to pass an additional flag telling whether strings use ascii or the internal encoding.
      The instance of &value_semantic; can then convert into some
      other encoding if needed.
    </para>

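    <para>The idea of the additional flag can be sketched as follows. This is a
      hypothetical, heavily simplified stand-in and not the actual
      &value_semantic; interface: the names <code>value_semantic_sketch</code>,
      <code>narrow_string_value</code> and the single-token <code>parse</code>
      signature are assumptions made for illustration only.
    </para>
<programlisting><![CDATA[
```cpp
#include <cassert>    // for checks in calling code
#include <cstddef>    // std::size_t
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-in for the value semantic interface.
class value_semantic_sketch
{
public:
    virtual ~value_semantic_sketch() {}
    // 'utf8' is the additional flag: true if 'token' uses the internal
    // encoding, false if it is a local 8-bit ("ascii") string.
    virtual void parse(const std::string& token, bool utf8) = 0;
};

// An "ascii option": its value is a std::string in the local 8-bit encoding.
class narrow_string_value : public value_semantic_sketch
{
public:
    virtual void parse(const std::string& token, bool utf8)
    {
        if (utf8) {
            // Simplification for this sketch: accept only 7-bit characters,
            // which are identical in both encodings. A fuller version would
            // convert through a codecvt facet instead of rejecting the rest.
            for (std::size_t i = 0; i < token.size(); ++i)
                if (static_cast<unsigned char>(token[i]) >= 0x80)
                    throw std::runtime_error("cannot convert to local encoding");
        }
        // ascii input reaches an ascii value without change
        value = token;
    }

    std::string value;
};
```
]]></programlisting>
    <para>The point of the sketch is only that the same interface serves both kinds
      of input: the caller states which encoding the string uses, and each value
      class decides whether a conversion is needed.
    </para>
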
    <para>For the storage component, the only affected function is &store;.
      For Unicode input, the &store; function should convert the value to the
      internal encoding. It should also inform the &value_semantic; class
      about the encoding used.
    </para>

    <para>Finally, what internal encoding should we use? The
      alternatives are
      <code>std::wstring</code> (using UCS-4 encoding) and
      <code>std::string</code> (using UTF-8 encoding). The differences between
      the alternatives are:
      <itemizedlist>
        <listitem>
          <para>Speed: UTF-8 is a bit slower</para>
        </listitem>
        <listitem>
          <para>Space: UTF-8 takes less space when input is ascii</para>
        </listitem>
        <listitem>
          <para>Code size: UTF-8 requires additional conversion code. However,
            it allows one to use existing parsers without converting them to
            <code>std::wstring</code>, and such a conversion is likely to create a
            number of new instantiations.
          </para>
        </listitem>

      </itemizedlist>
      There's no clear leader, but the last point seems important, so UTF-8
      will be used.
    </para>

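    <para>To give a sense of the "additional conversion code" that UTF-8 requires,
      here is a minimal sketch of an encoder from <code>std::wstring</code> to
      UTF-8. It assumes, as the text does, that every <code>wchar_t</code> holds
      one complete code point (UCS-4); the function name <code>to_utf8</code> is
      hypothetical and this is not the library's own conversion routine.
    </para>
<programlisting><![CDATA[
```cpp
#include <cassert>    // for checks in calling code
#include <cstddef>    // std::size_t
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8, assuming each wchar_t
// holds one complete Unicode code point.
std::string to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(in[i]);
        // Reject values outside Unicode and the surrogate range.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("not a Unicode scalar value");

        if (cp < 0x80) {
            // 7-bit characters are stored unchanged, one byte each.
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```
]]></programlisting>
    <para>For ascii input the result has one byte per character, whereas the
      <code>std::wstring</code> form uses <code>sizeof(wchar_t)</code> bytes per
      character, which is the space difference noted in the list above.
    </para>
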
    <para>Choosing the UTF-8 encoding allows the use of existing parsers,
      because 7-bit ascii characters retain their values in UTF-8,
      so searching for 7-bit strings is simple. However, there are
      two subtle issues:
      <itemizedlist>
        <listitem>
          <para>We need to assume that character literals use ascii encoding
            and that inputs use Unicode encoding.</para>
        </listitem>
        <listitem>
          <para>A Unicode character (say '=') can be followed by a 'composing
            character', and the combination is not the same as just '=', so a
            simple search for '=' might find the wrong character.
          </para>
        </listitem>
      </itemizedlist>
      Neither of these issues appears to be critical in practice, since ascii is
      an almost universal encoding and since composing characters following '=' (and
      other characters with special meaning to the library) are not likely to appear.
    </para>

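    <para>Both observations can be seen in a small sketch. The function
      <code>split_option</code> below is a hypothetical byte-oriented splitter, not
      one of the library's parsers. Because every byte of a multi-byte UTF-8
      sequence is 0x80 or above, a search for the byte '=' cannot match inside
      such a sequence. It does, however, match an '=' that is followed by a
      composing character such as U+0338 (UTF-8 bytes CC B8), which is the
      second issue listed above.
    </para>
<programlisting><![CDATA[
```cpp
#include <cassert>    // for checks in calling code
#include <string>
#include <utility>    // std::pair, std::make_pair

// Hypothetical helper: split "name=value" at the first '=' byte.
// The token is treated as a plain byte string that holds UTF-8.
std::pair<std::string, std::string> split_option(const std::string& token)
{
    const std::string::size_type pos = token.find('=');
    if (pos == std::string::npos)
        return std::make_pair(token, std::string()); // no value part
    return std::make_pair(token.substr(0, pos), token.substr(pos + 1));
}
```
]]></programlisting>
    <para>Given the UTF-8 token <code>"name=caf\xC3\xA9"</code>, the sketch returns
      <code>"name"</code> and the value bytes unchanged. Given
      <code>"a=\xCC\xB8" "b"</code>, it still splits at the '=' and leaves the
      composing character at the start of the value.
    </para>
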
  </section>


</section>

<!--
     Local Variables:
     mode: xml
     sgml-indent-data: t
     sgml-parent-document: ("program_options.xml" "section")
     sgml-set-face: t
     End:
-->
