| 1 | <?xml version="1.0" standalone="yes"?> |
|---|
| 2 | <!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN" |
|---|
| 3 | "http://www.boost.org/tools/boostbook/dtd/boostbook.dtd" |
|---|
| 4 | [ |
|---|
| 5 | <!ENTITY % entities SYSTEM "program_options.ent" > |
|---|
| 6 | %entities; |
|---|
| 7 | ]> |
|---|
| 8 | <section id="program_options.design"> |
|---|
| 9 | <title>Design Discussion</title> |
|---|
| 10 | |
|---|
| 11 | <para>This section focuses on some of the design questions. |
|---|
| 12 | </para> |
|---|
| 13 | |
|---|
| 14 | <section id="program_options.design.unicode"> |
|---|
| 15 | |
|---|
| 16 | <title>Unicode Support</title> |
|---|
| 17 | |
|---|
| 18 | <para>Unicode support was one of the features specifically requested |
|---|
| 19 | during the formal review. Throughout this document "Unicode support" is |
|---|
| 20 | a synonym for "wchar_t" support, assuming that "wchar_t" always uses |
|---|
| 21 | Unicode encoding. Also, when talking about "ascii" (in lowercase) we'll |
|---|
| 22 | not mean strict 7-bit ASCII encoding, but rather "char" strings in local |
|---|
| 23 | 8-bit encoding. |
|---|
| 24 | </para> |
|---|
| 25 | |
|---|
| 26 | <para> |
|---|
| 27 | Generally, "Unicode support" can mean |
|---|
| 28 | many things, but for the program_options library it means that: |
|---|
| 29 | |
|---|
| 30 | <itemizedlist> |
|---|
| 31 | <listitem> |
|---|
| 32 | <para>Each parser should accept either <code>char*</code> |
|---|
| 33 | or <code>wchar_t*</code>, correctly split the input into option |
|---|
| 34 | names and option values and return the data. |
|---|
| 35 | </para> |
|---|
| 36 | </listitem> |
|---|
| 37 | <listitem> |
|---|
| 38 | <para>For each option, it should be possible to specify whether the conversion |
|---|
| 39 | from string to value uses ascii or Unicode. |
|---|
| 40 | </para> |
|---|
| 41 | </listitem> |
|---|
| 42 | <listitem> |
|---|
| 43 | <para>The library guarantees that: |
|---|
| 44 | <itemizedlist> |
|---|
| 45 | <listitem> |
|---|
| 46 | <para>ascii input is passed to an ascii value without change |
|---|
| 47 | </para> |
|---|
| 48 | </listitem> |
|---|
| 49 | <listitem> |
|---|
| 50 | <para>Unicode input is passed to a Unicode value without change</para> |
|---|
| 51 | </listitem> |
|---|
| 52 | <listitem> |
|---|
| 53 | <para>ascii input passed to a Unicode value, and Unicode input |
|---|
| 54 | passed to an ascii value will be converted using a codecvt |
|---|
| 55 | facet (which may be specified by the user) |
|---|
| 56 | |
|---|
| 57 | </para> |
|---|
| 58 | </listitem> |
|---|
| 59 | </itemizedlist> |
|---|
| 60 | </para> |
|---|
| 61 | </listitem> |
|---|
| 62 | </itemizedlist> |
|---|
| 63 | </para> |
|---|
| 64 | |
|---|
| 65 | <para>The important point is that it's possible to have some "ascii |
|---|
| 66 | options" together with "Unicode options". There are two reasons for |
|---|
| 67 | this. First, for a given type you might not have the code to extract the |
|---|
| 68 | value from a Unicode string, and it's not good to require that such code be written. |
|---|
| 69 | Second, imagine a reusable library which has some options and exposes |
|---|
| 70 | an options description in its interface. If <emphasis>all</emphasis> |
|---|
| 71 | options are either ascii or Unicode, and the library does not use any |
|---|
| 72 | Unicode strings, then the author will likely use ascii options, which |
|---|
| 73 | would make the library unusable inside Unicode |
|---|
| 74 | applications. Essentially, it would be necessary to provide two versions |
|---|
| 75 | of the library -- ascii and Unicode. |
|---|
| 76 | </para> |
|---|
| 77 | |
|---|
| 78 | <para>Another important point is that ascii strings are passed through |
|---|
| 79 | without modification. In other words, it's not possible to just convert |
|---|
| 80 | ascii to Unicode and process the Unicode further. The problem is that the |
|---|
| 81 | default conversion mechanism -- the <code>codecvt</code> facet -- might |
|---|
| 82 | not work with 8-bit input without additional setup. |
|---|
| 83 | </para> |
|---|
| 84 | |
|---|
| 85 | <para>The Unicode support outlined above is not complete. For example, we |
|---|
| 86 | don't plan to allow Unicode in option names. Unicode support is hard and |
|---|
| 87 | requires a Boost-wide solution. Even comparing two arbitrary Unicode |
|---|
| 88 | strings is non-trivial. Finally, using Unicode in option names is |
|---|
| 89 | related to internationalization, which has its own |
|---|
| 90 | complexities. E.g. if option names depend on the current locale, then all |
|---|
| 91 | parts of the program, and any other code which uses the names, must be |
|---|
| 92 | internationalized too. |
|---|
| 93 | </para> |
|---|
| 94 | |
|---|
| 95 | <para>The primary question in implementing the Unicode support is whether |
|---|
| 96 | to use templates and <code>std::basic_string</code> or to use some |
|---|
| 97 | internal encoding and convert between internal and external encodings on |
|---|
| 98 | the interface boundaries. |
|---|
| 99 | </para> |
|---|
| 100 | |
|---|
| 101 | <para>The choice, mostly, is between code size and execution |
|---|
| 102 | speed. A templated solution would either link library code into every |
|---|
| 103 | application that uses the library (thereby making a shared library |
|---|
| 104 | impossible), or provide explicit instantiations in the shared library |
|---|
| 105 | (increasing its size). The solution based on internal encoding would |
|---|
| 106 | necessarily make conversions in a number of places and would be somewhat slower. |
|---|
| 107 | Since speed is generally not an issue for this library, the second |
|---|
| 108 | solution looks more attractive, but we'll take a closer look at |
|---|
| 109 | individual components. |
|---|
| 110 | </para> |
|---|
| 111 | |
|---|
| 112 | <para>For the parsers component, we have three choices: |
|---|
| 113 | <itemizedlist> |
|---|
| 114 | <listitem> |
|---|
| 115 | <para>Use a fully templated implementation: given a string of a |
|---|
| 116 | certain type, a parser will return a &parsed_options; instance |
|---|
| 117 | with strings of the same type (i.e. the &parsed_options; class |
|---|
| 118 | will be templated).</para> |
|---|
| 119 | </listitem> |
|---|
| 120 | <listitem> |
|---|
| 121 | <para>Use an internal encoding: same as above, but strings will be converted to and |
|---|
| 122 | from the internal encoding.</para> |
|---|
| 123 | </listitem> |
|---|
| 124 | <listitem> |
|---|
| 125 | <para>Use and partly expose the internal encoding: same as above, |
|---|
| 126 | but the strings in the &parsed_options; instance will be in the |
|---|
| 127 | internal encoding. This might avoid a conversion if the |
|---|
| 128 | &parsed_options; instance is passed directly to other components, |
|---|
| 129 | but can also be dangerous or confusing for a user. |
|---|
| 130 | </para> |
|---|
| 131 | </listitem> |
|---|
| 132 | </itemizedlist> |
|---|
| 133 | </para> |
|---|
| 134 | |
|---|
| 135 | <para>The second solution appears to be the best -- it does not increase |
|---|
| 136 | the code size much and is cleaner than the third. To avoid extra |
|---|
| 137 | conversions, the Unicode version of &parsed_options; can also store |
|---|
| 138 | strings in the internal encoding. |
|---|
| 139 | </para> |
|---|
| 140 | |
|---|
| 141 | <para>For the options descriptions component, we don't have much |
|---|
| 142 | choice. Since it's not desirable to have either all options use ascii or all |
|---|
| 143 | of them use Unicode, but rather have some ascii and some Unicode options, the |
|---|
| 144 | interface of the &value_semantic; must work with both. The only way is |
|---|
| 145 | to pass an additional flag telling if strings use ascii or internal encoding. |
|---|
| 146 | The instance of &value_semantic; can then convert into some |
|---|
| 147 | other encoding if needed. |
|---|
| 148 | </para> |
|---|
| 149 | |
|---|
| 150 | <para>For the storage component, the only affected function is &store;. |
|---|
| 151 | For Unicode input, the &store; function should convert the value to the |
|---|
| 152 | internal encoding. It should also inform the &value_semantic; class |
|---|
| 153 | about the encoding used. |
|---|
| 154 | </para> |
|---|
| 155 | |
|---|
| 156 | <para>Finally, what internal encoding should we use? The |
|---|
| 157 | alternatives are: |
|---|
| 158 | <code>std::wstring</code> (using UCS-4 encoding) and |
|---|
| 159 | <code>std::string</code> (using UTF-8 encoding). The differences between |
|---|
| 160 | the alternatives are: |
|---|
| 161 | <itemizedlist> |
|---|
| 162 | <listitem> |
|---|
| 163 | <para>Speed: UTF-8 is a bit slower</para> |
|---|
| 164 | </listitem> |
|---|
| 165 | <listitem> |
|---|
| 166 | <para>Space: UTF-8 takes less space when input is ascii</para> |
|---|
| 167 | </listitem> |
|---|
| 168 | <listitem> |
|---|
| 169 | <para>Code size: UTF-8 requires additional conversion code. However, |
|---|
| 170 | it allows one to use existing parsers without converting them to |
|---|
| 171 | <code>std::wstring</code> and such conversion is likely to create a |
|---|
| 172 | number of new instantiations. |
|---|
| 173 | </para> |
|---|
| 174 | </listitem> |
|---|
| 175 | |
|---|
| 176 | </itemizedlist> |
|---|
| 177 | There's no clear leader, but the last point seems important, so UTF-8 |
|---|
| 178 | will be used. |
|---|
| 179 | </para> |
|---|
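As a rough illustration of the "additional conversion code" that the UTF-8 choice requires, a minimal sketch of encoding Unicode code points into a UTF-8 `std::string` might look like the following. The helper name `to_utf8` and the use of `std::u32string` as input (instead of `std::wstring`, whose element width varies between platforms) are assumptions made for this example; this is not the library's own conversion routine, and it performs no validation of its input.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of program_options): encode a sequence of
// Unicode code points as UTF-8. Assumes every element is a valid code point;
// surrogates and values above 0x10FFFF are not rejected in this sketch.
std::string to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t c : in) {
        if (c < 0x80) {
            // 7-bit characters are stored unchanged, one byte each.
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

For plain 7-bit input the result has one byte per character, which is the space advantage listed above; other characters take two to four bytes each.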
| 180 | |
|---|
| 181 | <para>Choosing the UTF-8 encoding allows the use of existing parsers, |
|---|
| 182 | because 7-bit ascii characters retain their values in UTF-8, |
|---|
| 183 | so searching for 7-bit strings is simple. However, there are |
|---|
| 184 | two subtle issues: |
|---|
| 185 | <itemizedlist> |
|---|
| 186 | <listitem> |
|---|
| 187 | <para>We need to assume the character literals use ascii encoding |
|---|
| 188 | and that inputs use Unicode encoding.</para> |
|---|
| 189 | </listitem> |
|---|
| 190 | <listitem> |
|---|
| 191 | <para>A Unicode character (say '=') can be followed by a 'composing |
|---|
| 192 | character' and the combination is not the same as just '=', so a |
|---|
| 193 | simple search for '=' might find the wrong character. |
|---|
| 194 | </para> |
|---|
| 195 | </listitem> |
|---|
| 196 | </itemizedlist> |
|---|
| 197 | Neither of these issues appears to be critical in practice, since ascii is |
|---|
| 198 | an almost universal encoding and since composing characters following '=' (and |
|---|
| 199 | other characters with special meaning to the library) are not likely to appear. |
|---|
| 200 | </para> |
|---|
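The point about reusing existing parsers can be sketched as follows: because every byte of a multi-byte UTF-8 sequence has its high bit set, a byte-wise search for a 7-bit character such as '=' cannot match in the middle of another character. The function name `split_option` is hypothetical and chosen only for this example; it is not the library's parser.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper (not part of program_options): split a UTF-8 encoded
// "name=value" token at the first '=' byte. Returns the whole token as the
// name and an empty value when no '=' is present.
std::pair<std::string, std::string> split_option(const std::string& utf8)
{
    // A plain byte search is enough: '=' (0x3D) never occurs inside a
    // multi-byte UTF-8 sequence, whose bytes are all >= 0x80.
    const std::string::size_type pos = utf8.find('=');
    if (pos == std::string::npos)
        return std::make_pair(utf8, std::string());
    return std::make_pair(utf8.substr(0, pos), utf8.substr(pos + 1));
}
```

The same sketch also shows the second subtle issue: for an input such as `"a=" "\xCC\xB8" "b"` (an '=' followed by the combining character U+0338), the search still splits at the '=' byte, even though a reader would see the combination as a different character.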
| 201 | |
|---|
| 202 | </section> |
|---|
| 203 | |
|---|
| 204 | |
|---|
| 205 | </section> |
|---|
| 206 | |
|---|
| 207 | <!-- |
|---|
| 208 | Local Variables: |
|---|
| 209 | mode: xml |
|---|
| 210 | sgml-indent-data: t |
|---|
| 211 | sgml-parent-document: ("program_options.xml" "section") |
|---|
| 212 | sgml-set-face: t |
|---|
| 213 | End: |
|---|
| 214 | --> |
|---|