| 1 | <html> |
|---|
| 2 | <head> |
|---|
| 3 | <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
|---|
| 4 | <title>Design Discussion</title> |
|---|
| 5 | <link rel="stylesheet" href="../boostbook.css" type="text/css"> |
|---|
| 6 | <meta name="generator" content="DocBook XSL Stylesheets V1.68.1"> |
|---|
| 7 | <link rel="start" href="../index.html" title="The Boost C++ Libraries BoostBook Documentation Subset"> |
|---|
| 8 | <link rel="up" href="../program_options.html" title="Chapter 10. Boost.Program_options"> |
|---|
| 9 | <link rel="prev" href="howto.html" title="How To"> |
|---|
| 10 | <link rel="next" href="s06.html" title="Acknowledgements"> |
|---|
| 11 | </head> |
|---|
| 12 | <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> |
|---|
| 13 | <table cellpadding="2" width="100%"> |
|---|
| 14 | <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../boost.png"></td> |
|---|
| 15 | <td align="center"><a href="../../../index.htm">Home</a></td> |
|---|
| 16 | <td align="center"><a href="../../../libs/libraries.htm">Libraries</a></td> |
|---|
| 17 | <td align="center"><a href="../../../people/people.htm">People</a></td> |
|---|
| 18 | <td align="center"><a href="../../../more/faq.htm">FAQ</a></td> |
|---|
| 19 | <td align="center"><a href="../../../more/index.htm">More</a></td> |
|---|
| 20 | </table> |
|---|
| 21 | <hr> |
|---|
| 22 | <div class="spirit-nav"> |
|---|
| 23 | <a accesskey="p" href="howto.html"><img src="../images/prev.png" alt="Prev"></a><a accesskey="u" href="../program_options.html"><img src="../images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../images/home.png" alt="Home"></a><a accesskey="n" href="s06.html"><img src="../images/next.png" alt="Next"></a> |
|---|
| 24 | </div> |
|---|
| 25 | <div class="section" lang="en"> |
|---|
| 26 | <div class="titlepage"><div><div><h2 class="title" style="clear: both"> |
|---|
| 27 | <a name="program_options.design"></a>Design Discussion</h2></div></div></div> |
|---|
| 28 | <div class="toc"><dl><dt><span class="section"><a href="design.html#program_options.design.unicode">Unicode Support</a></span></dt></dl></div> |
|---|
| 29 | <p>This section focuses on some of the design questions. |
|---|
| 30 | </p> |
|---|
| 31 | <div class="section" lang="en"> |
|---|
| 32 | <div class="titlepage"><div><div><h3 class="title"> |
|---|
| 33 | <a name="program_options.design.unicode"></a>Unicode Support</h3></div></div></div> |
|---|
| 34 | <p>Unicode support was one of the features specifically requested |
|---|
| 35 | during the formal review. Throughout this document "Unicode support" is |
|---|
| 36 | a synonym for "wchar_t" support, assuming that "wchar_t" always uses |
|---|
| 37 | Unicode encoding. Also, when talking about "ascii" (in lowercase) we'll |
|---|
| 38 | not mean strict 7-bit ASCII encoding, but rather "char" strings in local |
|---|
| 39 | 8-bit encoding. |
|---|
| 40 | </p> |
|---|
| 41 | <p> |
|---|
| 42 | Generally, "Unicode support" can mean |
|---|
| 43 | many things, but for the program_options library it means that: |
|---|
| 44 | |
|---|
| 45 | </p> |
|---|
| 46 | <div class="itemizedlist"><ul type="disc"> |
|---|
| 47 | <li><p>Each parser should accept either <code class="computeroutput">char*</code> |
|---|
| 48 | or <code class="computeroutput">wchar_t*</code>, correctly split the input into option |
|---|
| 49 | names and option values and return the data. |
|---|
| 50 | </p></li> |
|---|
| 51 | <li><p>For each option, it should be possible to specify whether the conversion |
|---|
| 52 | from string to value uses ascii or Unicode. |
|---|
| 53 | </p></li> |
|---|
| 54 | <li> |
|---|
| 55 | <p>The library guarantees that: |
|---|
| 56 | </p> |
|---|
| 57 | <div class="itemizedlist"><ul type="circle"> |
|---|
| 58 | <li><p>ascii input is passed to an ascii value without change |
|---|
| 59 | </p></li> |
|---|
| 60 | <li><p>Unicode input is passed to a Unicode value without change</p></li> |
|---|
| 61 | <li><p>ascii input passed to a Unicode value, and Unicode input |
|---|
| 62 | passed to an ascii value will be converted using a codecvt |
|---|
| 63 | facet (which may be specified by the user). |
|---|
| 64 | </p></li> |
|---|
| 65 | </ul></div> |
|---|
| 66 | <p> |
|---|
| 67 | </p> |
|---|
| 68 | </li> |
|---|
| 69 | </ul></div> |
|---|
| 70 | <p> |
|---|
| 71 | </p> |
|---|
| 72 | <p>The important point is that it's possible to have some "ascii |
|---|
| 73 | options" together with "Unicode options". There are two reasons for |
|---|
| 74 | this. First, for a given type you might not have the code to extract the |
|---|
| 75 | value from a Unicode string, and it's not good to require that such code be written. |
|---|
| 76 | Second, imagine a reusable library which has some options and exposes |
|---|
| 77 | an options description in its interface. If <span class="emphasis"><em>all</em></span> |
|---|
| 78 | options are either ascii or Unicode, and the library does not use any |
|---|
| 79 | Unicode strings, then the author will likely use ascii options, which |
|---|
| 80 | would make the library unusable inside Unicode |
|---|
| 81 | applications. Essentially, it would be necessary to provide two versions |
|---|
| 82 | of the library -- ascii and Unicode. |
|---|
| 83 | </p> |
|---|
| 84 | <p>Another important point is that ascii strings are passed through |
|---|
| 85 | without modification. In other words, it's not possible to just convert |
|---|
| 86 | ascii to Unicode and process the Unicode further. The problem is that the |
|---|
| 87 | default conversion mechanism -- the <code class="computeroutput">codecvt</code> facet -- might |
|---|
| 88 | not work with 8-bit input without additional setup. |
|---|
| 89 | </p> |
|---|
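<p>As a rough illustration of the conversion step mentioned above, the following sketch converts a narrow string to a wide one with a standard <code class="computeroutput">codecvt</code> facet. The names <code class="computeroutput">to_wide</code>, <code class="computeroutput">to_wide_classic</code> and <code class="computeroutput">facet_type</code> are hypothetical and are not part of the library. With the facet of the classic "C" locale, 7-bit input converts cleanly, while 8-bit input may be rejected unless a suitable locale is set up first.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// The standard facet that converts between char and wchar_t.
typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;

// Hypothetical helper: convert a narrow string to a wide string using
// the given facet. Throws if the facet cannot convert the whole input.
std::wstring to_wide(const std::string& in, const facet_type& cvt)
{
    if (in.empty())
        return std::wstring();

    std::wstring out(in.size(), L'\0');  // at most one wide char per byte
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("codecvt facet could not convert the input");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

// Usage helper: convert with the facet of the classic "C" locale.
std::wstring to_wide_classic(const std::string& in)
{
    return to_wide(in, std::use_facet<facet_type>(std::locale::classic()));
}
```

<p>A caller that needs 8-bit input to work would pass the facet of a locale matching the input encoding instead of the classic one.</p>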
| 90 | <p>The Unicode support outlined above is not complete. For example, we |
|---|
| 91 | don't support Unicode option names. Unicode support is hard and |
|---|
| 92 | requires a Boost-wide solution. Even comparing two arbitrary Unicode |
|---|
| 93 | strings is non-trivial. Finally, using Unicode in option names is |
|---|
| 94 | related to internationalization, which has its own |
|---|
| 95 | complexities. For example, if option names depend on the current locale, then all |
|---|
| 96 | parts of the program, and any other code that uses the names, must be |
|---|
| 97 | internationalized too. |
|---|
| 98 | </p> |
|---|
| 99 | <p>The primary question in implementing the Unicode support is whether |
|---|
| 100 | to use templates and <code class="computeroutput">std::basic_string</code> or to use some |
|---|
| 101 | internal encoding and convert between internal and external encodings on |
|---|
| 102 | the interface boundaries. |
|---|
| 103 | </p> |
|---|
| 104 | <p>The choice, mostly, is between code size and execution |
|---|
| 105 | speed. A templated solution would either link library code into every |
|---|
| 106 | application that uses the library (thereby making a shared library |
|---|
| 107 | impossible), or provide explicit instantiations in the shared library |
|---|
| 108 | (increasing its size). The solution based on internal encoding would |
|---|
| 109 | necessarily make conversions in a number of places and would be somewhat slower. |
|---|
| 110 | Since speed is generally not an issue for this library, the second |
|---|
| 111 | solution looks more attractive, but we'll take a closer look at |
|---|
| 112 | individual components. |
|---|
| 113 | </p> |
|---|
| 114 | <p>For the parsers component, we have three choices: |
|---|
| 115 | </p> |
|---|
| 116 | <div class="itemizedlist"><ul type="disc"> |
|---|
| 117 | <li><p>Use a fully templated implementation: given a string of a |
|---|
| 118 | certain type, a parser will return a <code class="computeroutput">parsed_options</code> instance |
|---|
| 119 | with strings of the same type (i.e. the <code class="computeroutput">parsed_options</code> class |
|---|
| 120 | will be templated).</p></li> |
|---|
| 121 | <li><p>Use internal encoding: same as above, but strings will be converted to and |
|---|
| 122 | from the internal encoding.</p></li> |
|---|
| 123 | <li><p>Use and partly expose the internal encoding: same as above, |
|---|
| 124 | but the strings in the <code class="computeroutput">parsed_options</code> instance will be in the |
|---|
| 125 | internal encoding. This might avoid a conversion if the |
|---|
| 126 | <code class="computeroutput">parsed_options</code> instance is passed directly to other components, |
|---|
| 127 | but can also be dangerous or confusing for a user. |
|---|
| 128 | </p></li> |
|---|
| 129 | </ul></div> |
|---|
| 130 | <p> |
|---|
| 131 | </p> |
|---|
| 132 | <p>The second solution appears to be the best -- it does not increase |
|---|
| 133 | the code size much and is cleaner than the third. To avoid extra |
|---|
| 134 | conversions, the Unicode version of <code class="computeroutput">parsed_options</code> can also store |
|---|
| 135 | strings in internal encoding. |
|---|
| 136 | </p> |
|---|
| 137 | <p>For the options descriptions component, we don't have much |
|---|
| 138 | choice. Since it's not desirable to have either all options use ascii or all |
|---|
| 139 | of them use Unicode, but rather have some ascii and some Unicode options, the |
|---|
| 140 | interface of the <code class="computeroutput"><a href="../boost/program_options/value_semantic.html" title="Class value_semantic">value_semantic</a></code> must work with both. The only way is |
|---|
| 141 | to pass an additional flag telling if strings use ascii or internal encoding. |
|---|
| 142 | The instance of <code class="computeroutput"><a href="../boost/program_options/value_semantic.html" title="Class value_semantic">value_semantic</a></code> can then convert into some |
|---|
| 143 | other encoding if needed. |
|---|
| 144 | </p> |
|---|
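<p>The idea of passing an additional encoding flag can be sketched as follows. This is a simplified illustration and not the library's actual class: <code class="computeroutput">value_handler</code>, <code class="computeroutput">wide_value</code>, <code class="computeroutput">decode_utf8</code> and <code class="computeroutput">parse_as_wide</code> are hypothetical names, the internal encoding is assumed to be UTF-8, and <code class="computeroutput">wchar_t</code> is assumed to be wide enough to hold any code point.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the value_semantic interface.
class value_handler {
public:
    virtual ~value_handler() {}
    // utf8 == true:  tokens are in the internal (UTF-8) encoding.
    // utf8 == false: tokens are in the local 8-bit ("ascii") encoding.
    virtual void parse(const std::vector<std::string>& tokens, bool utf8) = 0;
};

// Decode UTF-8 into wide characters. Overlong forms and surrogates
// are not rejected in this sketch.
std::wstring decode_utf8(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out += static_cast<wchar_t>(cp);
        i += extra + 1;
    }
    return out;
}

// A "Unicode option": stores a wide string whatever the input encoding.
class wide_value : public value_handler {
public:
    std::wstring value;

    void parse(const std::vector<std::string>& tokens, bool utf8) override
    {
        if (tokens.empty())
            throw std::invalid_argument("a value is required");
        const std::string& token = tokens.front();
        if (utf8) {
            value = decode_utf8(token);
            return;
        }
        // Simplification: only 7-bit input is widened here. A fuller
        // version would use a codecvt facet for 8-bit input.
        std::wstring widened;
        for (char ch : token) {
            if (static_cast<unsigned char>(ch) >= 0x80)
                throw std::invalid_argument("8-bit input needs a codecvt facet");
            widened += static_cast<wchar_t>(ch);
        }
        value = widened;
    }
};

// Usage helper: run one token through a wide_value.
std::wstring parse_as_wide(const std::string& token, bool utf8)
{
    wide_value v;
    v.parse(std::vector<std::string>(1, token), utf8);
    return v.value;
}
```

<p>The caller decides the flag once, based on where the tokens came from, and each value object converts only when its own encoding differs from the one it was given.</p>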
| 145 | <p>For the storage component, the only affected function is <code class="computeroutput"><a href="../id1009524-bb.html" title="Function store">store</a></code>. |
|---|
| 146 | For Unicode input, the <code class="computeroutput"><a href="../id1009524-bb.html" title="Function store">store</a></code> function should convert the value to the |
|---|
| 147 | internal encoding. It should also inform the <code class="computeroutput"><a href="../boost/program_options/value_semantic.html" title="Class value_semantic">value_semantic</a></code> class |
|---|
| 148 | about the encoding used. |
|---|
| 149 | </p> |
|---|
| 150 | <p>Finally, what internal encoding should we use? The |
|---|
| 151 | alternatives are: |
|---|
| 152 | <code class="computeroutput">std::wstring</code> (using UCS-4 encoding) and |
|---|
| 153 | <code class="computeroutput">std::string</code> (using UTF-8 encoding). The differences between the |
|---|
| 154 | alternatives are: |
|---|
| 155 | </p> |
|---|
| 156 | <div class="itemizedlist"><ul type="disc"> |
|---|
| 157 | <li><p>Speed: UTF-8 is a bit slower</p></li> |
|---|
| 158 | <li><p>Space: UTF-8 takes less space when input is ascii</p></li> |
|---|
| 159 | <li><p>Code size: UTF-8 requires additional conversion code. However, |
|---|
| 160 | it allows one to use existing parsers without converting them to |
|---|
| 161 | <code class="computeroutput">std::wstring</code> and such conversion is likely to create a |
|---|
| 162 | number of new instantiations. |
|---|
| 163 | </p></li> |
|---|
| 164 | </ul></div> |
|---|
| 165 | <p> |
|---|
| 166 | There's no clear leader, but the last point seems important, so UTF-8 |
|---|
| 167 | will be used. |
|---|
| 168 | </p> |
|---|
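<p>To make the space and code size trade-off concrete, here is a minimal sketch of the extra conversion code that UTF-8 needs, together with a size comparison. The helpers <code class="computeroutput">to_utf8</code>, <code class="computeroutput">utf8_bytes</code> and <code class="computeroutput">ucs4_bytes</code> are hypothetical. The sketch uses <code class="computeroutput">std::u32string</code> as a stand-in for a UCS-4 string, because the width of <code class="computeroutput">wchar_t</code> differs between platforms.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode UCS-4 code points as UTF-8 bytes.
// Surrogates and values above U+10FFFF are not rejected in this sketch.
std::string to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                          // 1 byte
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));            // 2 bytes
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));           // 3 bytes
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));           // 4 bytes
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Bytes occupied by the character data in each candidate encoding.
std::size_t utf8_bytes(const std::u32string& s)
{
    return to_utf8(s).size();
}

std::size_t ucs4_bytes(const std::u32string& s)
{
    return s.size() * sizeof(char32_t);
}
```

<p>For pure 7-bit input the UTF-8 form is byte-for-byte identical to the original text, which is what lets the existing <code class="computeroutput">char</code>-based parsers run on it unchanged.</p>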
| 169 | <p>Choosing the UTF-8 encoding allows the use of existing parsers, |
|---|
| 170 | because 7-bit ascii characters retain their values in UTF-8, |
|---|
| 171 | so searching for 7-bit strings is simple. However, there are |
|---|
| 172 | two subtle issues: |
|---|
| 173 | </p> |
|---|
| 174 | <div class="itemizedlist"><ul type="disc"> |
|---|
| 175 | <li><p>We need to assume the character literals use ascii encoding |
|---|
| 176 | and that inputs use Unicode encoding.</p></li> |
|---|
| 177 | <li><p>A Unicode character (say '=') can be followed by a 'composing |
|---|
| 178 | character' and the combination is not the same as just '=', so a |
|---|
| 179 | simple search for '=' might find the wrong character. |
|---|
| 180 | </p></li> |
|---|
| 181 | </ul></div> |
|---|
| 182 | <p> |
|---|
| 183 | Neither of these issues appears to be critical in practice, since ascii is |
|---|
| 184 | an almost universal encoding and since composing characters following '=' (and |
|---|
| 185 | other characters with special meaning to the library) are not likely to appear. |
|---|
| 186 | </p> |
|---|
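<p>Both points can be illustrated with a small sketch. The helper names <code class="computeroutput">split_option</code> and <code class="computeroutput">starts_with_combining_mark</code> are hypothetical. A byte-wise search for '=' is safe on UTF-8 input because every byte of a multi-byte sequence is 0x80 or above, but the search also matches an '=' that is followed by a composing character. The check shown covers only the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption made to keep the example short.</p>

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split "name=value" at the first '=' byte.
// If there is no '=', the whole token is the name and the value is empty.
std::pair<std::string, std::string> split_option(const std::string& token)
{
    const std::string::size_type pos = token.find('=');
    if (pos == std::string::npos)
        return std::make_pair(token, std::string());
    return std::make_pair(token.substr(0, pos), token.substr(pos + 1));
}

// Hypothetical guard: does the UTF-8 text start with a code point in
// U+0300..U+036F? Those code points encode as 0xCC 0x80..0xBF and
// 0xCD 0x80..0xAF.
bool starts_with_combining_mark(const std::string& utf8)
{
    if (utf8.size() < 2)
        return false;
    const unsigned char b0 = static_cast<unsigned char>(utf8[0]);
    const unsigned char b1 = static_cast<unsigned char>(utf8[1]);
    if (b0 == 0xCC)
        return b1 >= 0x80 && b1 <= 0xBF;
    if (b0 == 0xCD)
        return b1 >= 0x80 && b1 <= 0xAF;
    return false;
}
```

<p>For example, the token consisting of "a", "=", U+0338 and "b" is displayed as "a" followed by a struck-through equals sign and "b", yet <code class="computeroutput">split_option</code> still splits it at the '=' byte. A parser that cared about this case could call the guard on the value part and report an error.</p>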
| 187 | </div> |
|---|
| 188 | </div> |
|---|
| 189 | <table width="100%"><tr> |
|---|
| 190 | <td align="left"></td> |
|---|
| 191 | <td align="right"><small>Copyright © 2002-2004 Vladimir Prus</small></td> |
|---|
| 192 | </tr></table> |
|---|
| 193 | <hr> |
|---|
| 197 | </body> |
|---|
| 198 | </html> |
|---|