| 1 | <?xml version="1.0" standalone="no"?> |
---|
| 2 | <!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
---|
| 3 | "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" [ |
---|
| 4 | |
---|
| 5 | ]> |
---|
| 6 | |
---|
| 7 | <section id="vorbis-spec-intro"> |
---|
| 8 | <sectioninfo> |
---|
| 9 | <releaseinfo> |
---|
| 10 | $Id: 01-introduction.xml 7186 2004-07-20 07:19:25Z xiphmont $ |
---|
| 11 | </releaseinfo> |
---|
| 12 | </sectioninfo> |
---|
| 13 | <title>Introduction and Description</title> |
---|
| 14 | |
---|
| 15 | <section> |
---|
| 16 | <title>Overview</title> |
---|
| 17 | |
---|
| 18 | <para> |
---|
| 19 | This document provides a high-level description of the Vorbis codec's |
---|
| 20 | construction. A bit-by-bit specification appears beginning in |
---|
| 21 | <xref linkend="vorbis-spec-codec"/>. |
---|
| 22 | The later sections assume a high-level |
---|
| 23 | understanding of the Vorbis decode process, which is |
---|
| 24 | provided here.</para> |
---|
| 25 | |
---|
| 26 | <section> |
---|
| 27 | <title>Application</title> |
---|
| 28 | <para> |
---|
| 29 | Vorbis is a general-purpose perceptual audio CODEC intended to allow |
---|
| 30 | maximum encoder flexibility, thus allowing it to scale competitively |
---|
| 31 | over an exceptionally wide range of bitrates. At the high |
---|
| 32 | quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits) |
---|
| 33 | it is in the same league as MPEG-2 and MPC. Similarly, the 1.0 |
---|
| 34 | encoder can encode high-quality CD and DAT rate stereo at below 48kbps |
---|
| 35 | without resampling to a lower rate. Vorbis is also intended for |
---|
| 36 | lower and higher sample rates (from 8kHz telephony to 192kHz digital |
---|
| 37 | masters) and a range of channel representations (monaural, |
---|
| 38 | polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255 |
---|
| 39 | discrete channels). |
---|
| 40 | </para> |
---|
| 41 | </section> |
---|
| 42 | |
---|
| 43 | <section> |
---|
| 44 | <title>Classification</title> |
---|
| 45 | <para> |
---|
| 46 | Vorbis I is a forward-adaptive monolithic transform CODEC based on the |
---|
| 47 | Modified Discrete Cosine Transform. The codec is structured to allow |
---|
| 48 | addition of a hybrid wavelet filterbank in Vorbis II to offer better |
---|
| 49 | transient response and reproduction using a transform better suited to |
---|
| 50 | localized time events. |
---|
| 51 | </para> |
---|
| 52 | </section> |
---|
| 53 | |
---|
| 54 | <section> |
---|
| 55 | <title>Assumptions</title> |
---|
| 56 | |
---|
| 57 | <para> |
---|
| 58 | The Vorbis CODEC design assumes a complex, psychoacoustically-aware |
---|
| 59 | encoder and simple, low-complexity decoder. Vorbis decode is |
---|
| 60 | computationally simpler than mp3, although it does require more |
---|
| 61 | working memory as Vorbis has no static probability model; the vector |
---|
| 62 | codebooks used in the first stage of decoding from the bitstream are |
---|
| 63 | packed in their entirety into the Vorbis bitstream headers. In |
---|
| 64 | packed form, these codebooks occupy only a few kilobytes; the extent |
---|
| 65 | to which they are pre-decoded into a cache is the dominant factor in |
---|
| 66 | decoder memory usage. |
---|
| 67 | </para> |
---|
| 68 | |
---|
| 69 | <para> |
---|
| 70 | Vorbis provides none of its own framing, synchronization or protection |
---|
| 71 | against errors; it is solely a method of accepting input audio, |
---|
| 72 | dividing it into individual frames and compressing these frames into |
---|
| 73 | raw, unformatted 'packets'. The decoder then accepts these raw |
---|
| 74 | packets in sequence, decodes them, synthesizes audio frames from |
---|
| 75 | them, and reassembles the frames into a facsimile of the original |
---|
| 76 | audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no |
---|
| 77 | minimum size, maximum size, or fixed/expected size. Packets |
---|
| 78 | are designed so that they may be truncated (or padded) and remain |
---|
| 79 | decodable; this is not to be considered an error condition and is used |
---|
| 80 | extensively in bitrate management in peeling. Both the transport |
---|
| 81 | mechanism and decoder must allow that a packet may be any size, or |
---|
| 82 | end before or after packet decode expects.</para> |
---|
| 83 | |
---|
| 84 | <para> |
---|
| 85 | Vorbis packets are thus intended to be used with a transport mechanism |
---|
| 86 | that provides free-form framing, sync, positioning and error correction |
---|
| 87 | in accordance with these design assumptions, such as Ogg (for file |
---|
| 88 | transport) or RTP (for network multicast). For purposes of a few |
---|
| 89 | examples in this document, we will assume that Vorbis is to be |
---|
| 90 | embedded in an Ogg stream specifically, although this is by no means a |
---|
| 91 | requirement or fundamental assumption in the Vorbis design.</para> |
---|
| 92 | |
---|
| 93 | <para> |
---|
| 94 | The specification for embedding Vorbis into |
---|
| 95 | an Ogg transport stream is in <xref linkend="vorbis-over-ogg"/>. |
---|
| 96 | </para> |
---|
| 97 | |
---|
| 98 | </section> |
---|
| 99 | |
---|
| 100 | <section> |
---|
| 101 | <title>Codec Setup and Probability Model</title> |
---|
| 102 | |
---|
| 103 | <para> |
---|
| 104 | Vorbis' heritage is as a research CODEC and its current design |
---|
| 105 | reflects a desire to allow multiple decades of continuous encoder |
---|
| 106 | improvement before running out of room within the codec specification. |
---|
| 107 | For these reasons, configurable aspects of codec setup intentionally |
---|
| 108 | lean toward the extreme of forward adaptive.</para> |
---|
| 109 | |
---|
| 110 | <para> |
---|
| 111 | The single most controversial design decision in Vorbis (and the most |
---|
| 112 | unusual for a Vorbis developer to keep in mind) is that the entire |
---|
| 113 | probability model of the codec, the Huffman and VQ codebooks, is |
---|
| 114 | packed into the bitstream header along with extensive CODEC setup |
---|
| 115 | parameters (often several hundred fields). This makes it impossible, |
---|
| 116 | unlike with MPEG audio layers, to embed a simple frame type |
---|
| 117 | flag in each audio packet, or begin decode at any frame in the stream |
---|
| 118 | without having previously fetched the codec setup header. |
---|
| 119 | </para> |
---|
| 120 | |
---|
| 121 | <note><para> |
---|
| 122 | Vorbis <emphasis>can</emphasis> initiate decode at any arbitrary packet within a |
---|
| 123 | bitstream so long as the codec has been initialized/setup with the |
---|
| 124 | setup headers.</para></note> |
---|
| 125 | |
---|
| 126 | <para> |
---|
| 127 | Thus, Vorbis headers are both required for decode to begin and |
---|
| 128 | relatively large as bitstream headers go. The header size is |
---|
| 129 | unbounded, although for streaming a rule-of-thumb of 4kB or less is |
---|
| 130 | recommended (and Xiph.Org's Vorbis encoder follows this suggestion).</para> |
---|
| 131 | |
---|
| 132 | <para> |
---|
| 133 | Our own design work indicates the primary liability of the |
---|
| 134 | required header is in mindshare; it is an unusual design and thus |
---|
| 135 | causes some amount of complaint among engineers as this runs against |
---|
| 136 | current design trends (and also points out limitations in some |
---|
| 137 | existing software/interface designs, such as Windows' ACM codec |
---|
| 138 | framework). However, we find that it does not fundamentally limit |
---|
| 139 | Vorbis' suitable application space.</para> |
---|
| 140 | |
---|
| 141 | </section> |
---|
| 142 | |
---|
| 143 | <section><title>Format Specification</title> |
---|
| 144 | <para> |
---|
| 145 | The Vorbis format is well-defined by its decode specification; any |
---|
| 146 | encoder that produces packets that are correctly decoded by the |
---|
| 147 | reference Vorbis decoder described below may be considered a proper |
---|
| 148 | Vorbis encoder. A decoder must faithfully and completely implement |
---|
| 149 | the specification defined below (except where noted) to be considered |
---|
| 150 | a proper Vorbis decoder.</para> |
---|
| 151 | </section> |
---|
| 152 | |
---|
| 153 | <section><title>Hardware Profile</title> |
---|
| 154 | <para> |
---|
| 155 | Although Vorbis decode is computationally simple, it may still run |
---|
| 156 | into specific limitations of an embedded design. For this reason, |
---|
| 157 | embedded designs are allowed to deviate in limited ways from the |
---|
| 158 | 'full' decode specification yet still be certified compliant. These |
---|
| 159 | optional omissions are labelled in the spec where relevant.</para> |
---|
| 160 | </section> |
---|
| 161 | |
---|
| 162 | </section> |
---|
| 163 | |
---|
| 164 | <section> |
---|
| 165 | <title>Decoder Configuration</title> |
---|
| 166 | |
---|
| 167 | <para> |
---|
| 168 | Decoder setup consists of configuration of multiple, self-contained |
---|
| 169 | component abstractions that perform specific functions in the decode |
---|
| 170 | pipeline. Each different component instance of a specific type is |
---|
| 171 | semantically interchangeable; decoder configuration consists both of |
---|
| 172 | internal component configuration, as well as arrangement of specific |
---|
| 173 | instances into a decode pipeline. Componentry arrangement is roughly |
---|
| 174 | as follows:</para> |
---|
| 175 | |
---|
| 176 | <mediaobject> |
---|
| 177 | <imageobject> |
---|
| 178 | <imagedata fileref="components.png" format="PNG"/> |
---|
| 179 | </imageobject> |
---|
| 180 | <textobject> |
---|
| 181 | <phrase>decoder pipeline configuration</phrase> |
---|
| 182 | </textobject> |
---|
| 183 | </mediaobject> |
---|
| 184 | |
---|
| 185 | <section><title>Global Config</title> |
---|
| 186 | <para> |
---|
| 187 | Global codec configuration consists of a few audio related fields |
---|
| 188 | (sample rate, channels), Vorbis version (always '0' in Vorbis I), |
---|
| 189 | bitrate hints, and the lists of component instances. All other |
---|
| 190 | configuration is in the context of specific components.</para> |
---|
| 191 | </section> |
---|
| 192 | |
---|
| 193 | <section><title>Mode</title> |
---|
| 194 | |
---|
| 195 | <para> |
---|
| 196 | Each Vorbis frame is coded according to a master 'mode'. A bitstream |
---|
| 197 | may use one or many modes.</para> |
---|
| 198 | |
---|
| 199 | <para> |
---|
| 200 | The mode mechanism is used to encode a frame according to one of |
---|
| 201 | multiple possible methods with the intention of choosing a method best |
---|
| 202 | suited to that frame. Switching between modes is, for example, how the frame size |
---|
| 203 | is changed from frame to frame. The mode number of a frame serves as a |
---|
| 204 | top level configuration switch for all other specific aspects of frame |
---|
| 205 | decode.</para> |
---|
| 206 | |
---|
| 207 | <para> |
---|
| 208 | A 'mode' configuration consists of a frame size setting, window type |
---|
| 209 | (always 0, the Vorbis window, in Vorbis I), transform type (always |
---|
| 210 | type 0, the MDCT, in Vorbis I) and a mapping number. The mapping |
---|
| 211 | number specifies which mapping configuration instance to use for |
---|
| 212 | low-level packet decode and synthesis.</para> |
---|
| 213 | |
---|
| 214 | </section> |
---|
| 215 | |
---|
| 216 | <section><title>Mapping</title> |
---|
| 217 | |
---|
| 218 | <para> |
---|
| 219 | A mapping contains a channel coupling description and a list of |
---|
| 220 | 'submaps' that bundle sets of channel vectors together for grouped |
---|
| 221 | encoding and decoding. These submaps are not references to external |
---|
| 222 | components; the submap list is internal and specific to a mapping.</para> |
---|
| 223 | |
---|
| 224 | <para> |
---|
| 225 | A 'submap' is a configuration/grouping that applies to a subset of |
---|
| 226 | floor and residue vectors within a mapping. The submap functions as a |
---|
| 227 | last layer of indirection such that specific special floor or residue |
---|
| 228 | settings can be applied not only to all the vectors in a given mode, |
---|
| 229 | but also specific vectors in a specific mode. Each submap specifies |
---|
| 230 | the proper floor and residue instance number to use for decoding that |
---|
| 231 | submap's spectral floor and spectral residue vectors.</para> |
---|
| 232 | |
---|
| 233 | <para> |
---|
| 234 | As an example:</para> |
---|
| 235 | |
---|
| 236 | <para> |
---|
| 237 | Assume a Vorbis stream that contains six channels in the standard 5.1 |
---|
| 238 | format. The sixth channel, as is normal in 5.1, is bass only. |
---|
| 239 | Therefore it would be wasteful to encode a full-spectrum version of it |
---|
| 240 | as with the other channels. The submapping mechanism can be used to |
---|
| 241 | apply a full range floor and residue encoding to channels 0 through 4, |
---|
| 242 | and a bass-only representation to the bass channel, thus saving space. |
---|
| 243 | In this example, channels 0-4 belong to submap 0 (which indicates use |
---|
| 244 | of a full-range floor) and channel 5 belongs to submap 1, which uses a |
---|
| 245 | bass-only representation.</para> |
---|
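The 5.1 example above can be sketched as a small data structure. This is only an illustration: the names `Submap`, `Mapping`, `mux` and the instance numbers (floor/residue 0 as full range, 1 as bass only) are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Submap:
    floor: int    # index into the stream's list of floor instances
    residue: int  # index into the stream's list of residue instances


@dataclass
class Mapping:
    mux: list      # mux[channel] gives the submap number for that channel
    submaps: list  # submap list, internal and specific to this mapping


# Assumed instance numbering: 0 = full range, 1 = bass only.
mapping_5_1 = Mapping(
    mux=[0, 0, 0, 0, 0, 1],
    submaps=[Submap(floor=0, residue=0), Submap(floor=1, residue=1)],
)


def floor_for_channel(mapping, channel):
    """Return the floor instance number used to decode one channel."""
    return mapping.submaps[mapping.mux[channel]].floor


def channels_in_submap(mapping, submap_number):
    """Return the channels bundled together by one submap."""
    return [ch for ch, s in enumerate(mapping.mux) if s == submap_number]
```

Here `channels_in_submap(mapping_5_1, 0)` lists channels 0 through 4, and channel 5 alone resolves to the bass-only floor instance.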
| 246 | |
---|
| 247 | </section> |
---|
| 248 | |
---|
| 249 | <section><title>Floor</title> |
---|
| 250 | |
---|
| 251 | <para> |
---|
| 252 | Vorbis encodes a spectral 'floor' vector for each PCM channel. This |
---|
| 253 | vector is a low-resolution representation of the audio spectrum for |
---|
| 254 | the given channel in the current frame, generally used akin to a |
---|
| 255 | whitening filter. It is named a 'floor' because the Xiph.Org |
---|
| 256 | reference encoder has historically used it as a unit-baseline for |
---|
| 257 | spectral resolution.</para> |
---|
| 258 | |
---|
| 259 | <para> |
---|
| 260 | A floor encoding may be of two types. Floor 0 uses a packed LSP |
---|
| 261 | representation on a dB amplitude scale and Bark frequency scale. |
---|
| 262 | Floor 1 represents the curve as a piecewise linear interpolated |
---|
| 263 | representation on a dB amplitude scale and linear frequency scale. |
---|
| 264 | The two floors are semantically interchangeable in |
---|
| 265 | encoding/decoding. However, floor type 1 provides more stable |
---|
| 266 | inter-frame behavior, and so is the preferred choice in all |
---|
| 267 | coupled-stereo and high bitrate modes. Floor 1 is also considerably |
---|
| 268 | less expensive to decode than floor 0.</para> |
---|
| 269 | |
---|
| 270 | <para> |
---|
| 271 | Floor 0 is not to be considered deprecated, but it is of limited |
---|
| 272 | modern use. No known Vorbis encoder past Xiph.Org's own beta 4 makes |
---|
| 273 | use of floor 0.</para> |
---|
| 274 | |
---|
| 275 | <para> |
---|
| 276 | The values coded/decoded by a floor are both compactly formatted and |
---|
| 277 | make use of entropy coding to save space. For this reason, a floor |
---|
| 278 | configuration generally refers to multiple codebooks in the codebook |
---|
| 279 | component list. Entropy coding is thus provided as an abstraction, |
---|
| 280 | and each floor instance may choose from any and all available |
---|
| 281 | codebooks when coding/decoding.</para> |
---|
| 282 | |
---|
| 283 | </section> |
---|
| 284 | |
---|
| 285 | <section><title>Residue</title> |
---|
| 286 | <para> |
---|
| 287 | The spectral residue is the fine structure of the audio spectrum |
---|
| 288 | once the floor curve has been subtracted out. In simplest terms, it |
---|
| 289 | is coded in the bitstream using cascaded (multi-pass) vector |
---|
| 290 | quantization according to one of three specific packing/coding |
---|
| 291 | algorithms numbered 0 through 2. The packing algorithm details are |
---|
| 292 | configured by residue instance. As with the floor components, the |
---|
| 293 | final VQ/entropy encoding is provided by external codebook instances |
---|
| 294 | and each residue instance may choose from any and all available |
---|
| 295 | codebooks.</para> |
---|
| 296 | </section> |
---|
| 297 | |
---|
| 298 | <section><title>Codebooks</title> |
---|
| 299 | |
---|
| 300 | <para> |
---|
| 301 | Codebooks are a self-contained abstraction that performs entropy |
---|
| 302 | decoding and, optionally, uses the entropy-decoded integer value as an |
---|
| 303 | offset into an index of output value vectors, returning the indicated |
---|
| 304 | vector of values.</para> |
---|
| 305 | |
---|
| 306 | <para> |
---|
| 307 | The entropy coding in a Vorbis I codebook is provided by a standard |
---|
| 308 | Huffman binary tree representation. This tree is tightly packed using |
---|
| 309 | one of several methods, depending on whether codeword lengths are |
---|
| 310 | ordered or unordered, or the tree is sparse.</para> |
---|
| 311 | |
---|
| 312 | <para> |
---|
| 313 | The codebook vector index is similarly packed according to index |
---|
| 314 | characteristic. Most commonly, the vector index is encoded as a |
---|
| 315 | single list of possible values that are then permuted into |
---|
| 316 | a list of n-dimensional rows (lattice VQ).</para> |
---|
| 317 | |
---|
| 318 | </section> |
---|
| 319 | |
---|
| 320 | </section> |
---|
| 321 | |
---|
| 322 | |
---|
| 323 | <section> |
---|
| 324 | <title>High-level Decode Process</title> |
---|
| 325 | |
---|
| 326 | <section> |
---|
| 327 | <title>Decode Setup</title> |
---|
| 328 | |
---|
| 329 | <para> |
---|
| 330 | Before decoding can begin, a decoder must initialize using the |
---|
| 331 | bitstream headers matching the stream to be decoded. Vorbis uses |
---|
| 332 | three header packets; all are required, in-order, by this |
---|
| 333 | specification. Once set up, decode may begin at any audio packet |
---|
| 334 | belonging to the Vorbis stream. In Vorbis I, all packets after the |
---|
| 335 | three initial headers are audio packets. </para> |
---|
| 336 | |
---|
| 337 | <para> |
---|
| 338 | The header packets are, in order, the identification |
---|
| 339 | header, the comments header, and the setup header.</para> |
---|
| 340 | |
---|
| 341 | <section><title>Identification Header</title> |
---|
| 342 | <para> |
---|
| 343 | The identification header identifies the bitstream as Vorbis, Vorbis |
---|
| 344 | version, and the simple audio characteristics of the stream such as |
---|
| 345 | sample rate and number of channels.</para> |
---|
| 346 | </section> |
---|
| 347 | |
---|
| 348 | <section><title>Comment Header</title> |
---|
| 349 | <para> |
---|
| 350 | The comment header includes user text comments ("tags") and a vendor |
---|
| 351 | string for the application/library that produced the bitstream. The |
---|
| 352 | encoding and proper use of the comment header is described in |
---|
| 353 | <xref linkend="vorbis-spec-comment"/>.</para> |
---|
| 354 | </section> |
---|
| 355 | |
---|
| 356 | <section><title>Setup Header</title> |
---|
| 357 | <para> |
---|
| 358 | The setup header includes extensive CODEC setup information as well as |
---|
| 359 | the complete VQ and Huffman codebooks needed for decode.</para> |
---|
| 360 | </section> |
---|
| 361 | |
---|
| 362 | </section> |
---|
| 363 | |
---|
| 364 | <section><title>Decode Procedure</title> |
---|
| 365 | |
---|
| 366 | <highlights> |
---|
| 367 | <para> |
---|
| 368 | The decoding and synthesis procedure for all audio packets is |
---|
| 369 | fundamentally the same. |
---|
| 370 | <orderedlist> |
---|
| 371 | <listitem><simpara>decode packet type flag</simpara></listitem> |
---|
| 372 | <listitem><simpara>decode mode number</simpara></listitem> |
---|
| 373 | <listitem><simpara>decode window shape (long windows only)</simpara></listitem> |
---|
| 374 | <listitem><simpara>decode floor</simpara></listitem> |
---|
| 375 | <listitem><simpara>decode residue into residue vectors</simpara></listitem> |
---|
| 376 | <listitem><simpara>inverse channel coupling of residue vectors</simpara></listitem> |
---|
| 377 | <listitem><simpara>generate floor curve from decoded floor data</simpara></listitem> |
---|
| 378 | <listitem><simpara>compute dot product of floor and residue, producing audio spectrum vector</simpara></listitem> |
---|
| 379 | <listitem><simpara>inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I</simpara></listitem> |
---|
| 380 | <listitem><simpara>overlap/add left-hand output of transform with right-hand output of previous frame</simpara></listitem> |
---|
| 381 | <listitem><simpara>store right-hand data from transform of current frame for future lapping</simpara></listitem> |
---|
| 382 | <listitem><simpara>if not first frame, return results of overlap/add as audio result of current frame</simpara></listitem> |
---|
| 383 | </orderedlist> |
---|
| 384 | </para> |
---|
| 385 | </highlights> |
---|
| 386 | |
---|
| 387 | <para> |
---|
| 388 | Note that clever rearrangement of the synthesis arithmetic is |
---|
| 389 | possible; as an example, one can take advantage of symmetries in the |
---|
| 390 | MDCT to store the right-hand transform data of a partial MDCT for a |
---|
| 391 | 50% inter-frame buffer space savings, and then complete the transform |
---|
| 392 | later before overlap/add with the next frame. This optimization |
---|
| 393 | produces entirely equivalent output and is naturally perfectly legal. |
---|
| 394 | The decoder must be <emphasis>entirely mathematically equivalent</emphasis> to the |
---|
| 395 | specification; it need not be a literal semantic implementation.</para> |
---|
| 396 | |
---|
| 397 | <section><title>Packet type decode</title> |
---|
| 398 | |
---|
| 399 | <para> |
---|
| 400 | Vorbis I uses four packet types. The first three packet types mark each |
---|
| 401 | of the three Vorbis headers described above. The fourth packet type |
---|
| 402 | marks an audio packet. All other packet types are reserved; packets |
---|
| 403 | marked with a reserved type should be ignored.</para> |
---|
| 404 | |
---|
| 405 | <para> |
---|
| 406 | Following the three header packets, all packets in a Vorbis I stream |
---|
| 407 | are audio. The first step of audio packet decode is to read and |
---|
| 408 | verify the packet type; <emphasis>a non-audio packet when audio is expected |
---|
| 409 | indicates stream corruption or a non-compliant stream. The decoder |
---|
| 410 | must ignore the packet and not attempt decoding it to |
---|
| 411 | audio</emphasis>.</para> |
---|
| 412 | |
---|
| 413 | </section> |
---|
| 414 | |
---|
| 415 | |
---|
| 416 | <section><title>Mode decode</title> |
---|
| 417 | <para> |
---|
| 418 | Vorbis allows an encoder to set up multiple, numbered packet 'modes', |
---|
| 419 | as described earlier, all of which may be used in a given Vorbis |
---|
| 420 | stream. The mode is encoded as an integer used as a direct offset into |
---|
| 421 | the mode instance index. </para> |
---|
| 422 | </section> |
---|
| 423 | |
---|
| 424 | <section id="vorbis-spec-window"> |
---|
| 425 | <title>Window shape decode (long windows only)</title> |
---|
| 426 | |
---|
| 427 | <para> |
---|
| 428 | Vorbis frames may be one of two PCM sample sizes specified during |
---|
| 429 | codec setup. In Vorbis I, legal frame sizes are powers of two from 64 |
---|
| 430 | to 8192 samples. Aside from coupling, Vorbis handles channels as |
---|
| 431 | independent vectors and these frame sizes are in samples per channel.</para> |
---|
| 432 | |
---|
| 433 | <para> |
---|
| 434 | Vorbis uses an overlapping transform, namely the MDCT, to blend one |
---|
| 435 | frame into the next, avoiding most inter-frame block boundary |
---|
| 436 | artifacts. The MDCT output of one frame is windowed according to MDCT |
---|
| 437 | requirements, overlapped 50% with the output of the previous frame and |
---|
| 438 | added. The window shape assures seamless reconstruction. </para> |
---|
| 439 | |
---|
| 440 | <para> |
---|
| 441 | This is easy to visualize in the case of equal-sized windows:</para> |
---|
| 442 | |
---|
| 443 | <mediaobject> |
---|
| 444 | <imageobject> |
---|
| 445 | <imagedata fileref="window1.png" format="PNG"/> |
---|
| 446 | </imageobject> |
---|
| 447 | <textobject> |
---|
| 448 | <phrase>overlap of two equal-sized windows</phrase> |
---|
| 449 | </textobject> |
---|
| 450 | </mediaobject> |
---|
| 451 | |
---|
| 452 | <para> |
---|
| 453 | And slightly more complex in the case of overlapping unequal-sized |
---|
| 454 | windows:</para> |
---|
| 455 | |
---|
| 456 | <mediaobject> |
---|
| 457 | <imageobject> |
---|
| 458 | <imagedata fileref="window2.png" format="PNG"/> |
---|
| 459 | </imageobject> |
---|
| 460 | <textobject> |
---|
| 461 | <phrase>overlap of a long and a short window</phrase> |
---|
| 462 | </textobject> |
---|
| 463 | </mediaobject> |
---|
| 464 | |
---|
| 465 | <para> |
---|
| 466 | In the unequal-sized window case, the window shape of the long window |
---|
| 467 | must be modified for seamless lapping as above. It is possible to |
---|
| 468 | correctly infer window shape to be applied to the current window from |
---|
| 469 | knowing the sizes of the current, previous and next window. It is |
---|
| 470 | legal for a decoder to use this method. However, in the case of a long |
---|
| 471 | window (short windows require no modification), Vorbis also codes two |
---|
| 472 | flag bits to specify pre- and post- window shape. Although not |
---|
| 473 | strictly necessary for function, this minor redundancy allows a packet |
---|
| 474 | to be fully decoded to the point of lapping entirely independently of |
---|
| 475 | any other packet, allowing easier abstraction of decode layers as well |
---|
| 476 | as allowing a greater level of easy parallelism in encode and |
---|
| 477 | decode.</para> |
---|
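The first three decode steps (packet type flag, mode number, window flags) can be sketched as below. The field widths and bit order are assumptions based on a reading of the detailed codec section, not statements made in this introduction: a one-bit type flag with 0 meaning audio, a mode number of ilog(mode count - 1) bits, and least-significant-bit-first packing. `BitReader`, `read_audio_preamble` and the `mode_is_long` list are hypothetical names.

```python
class BitReader:
    """Reads unsigned integers from bytes, least significant bit first (assumed)."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # absolute bit position within the packet

    def read(self, nbits):
        value = 0
        for i in range(nbits):
            byte_index, bit_index = divmod(self.pos, 8)
            if byte_index >= len(self.data):
                raise EOFError("end of packet")
            value |= ((self.data[byte_index] >> bit_index) & 1) << i
            self.pos += 1
        return value


def ilog(x):
    """Position of the highest set bit; 0 for non-positive input."""
    return x.bit_length() if x > 0 else 0


def read_audio_preamble(packet, mode_is_long):
    """Return (mode, prev_long, next_long), or None for a non-audio packet.

    mode_is_long holds one boolean per configured mode (a stand-in for
    the real mode configuration).
    """
    reader = BitReader(packet)
    if reader.read(1) != 0:
        return None  # not an audio packet: ignore it
    mode = reader.read(ilog(len(mode_is_long) - 1))
    if mode >= len(mode_is_long):
        return None  # mode number outside the configured list
    prev_long = next_long = None
    if mode_is_long[mode]:
        # Only long windows carry the two window shape flags.
        prev_long = bool(reader.read(1))
        next_long = bool(reader.read(1))
    return mode, prev_long, next_long
```

With two modes configured as `[False, True]`, the single byte `0x06` decodes as an audio packet in mode 1 whose previous window is long and whose next window is short.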
| 478 | |
---|
| 479 | <para> |
---|
| 480 | A description of valid window functions for use with an inverse MDCT |
---|
| 481 | can be found in the paper |
---|
| 482 | <citetitle pubwork="article"> |
---|
| 483 | <ulink url="http://www.iocon.com/resource/docs/ps/eusipco_corrected.ps"> |
---|
| 484 | The use of multirate filter banks for coding of high quality digital |
---|
| 485 | audio</ulink></citetitle>, by T. Sporer, K. Brandenburg and B. Edler. Vorbis windows |
---|
| 486 | all use the slope function |
---|
| 487 | <inlineequation> |
---|
| 488 | |
---|
| 489 | <alt>y=sin(.5*pi*sin^2((x+.5)/n*pi))</alt> |
---|
| 490 | <inlinemediaobject> |
---|
| 491 | <textobject> |
---|
| 492 | <phrase>$y = \sin\left(\frac{\pi}{2} \, \sin^2\left(\frac{x+0.5}{n} \, \pi\right)\right)$</phrase> |
---|
| 493 | </textobject> |
---|
| 494 | </inlinemediaobject> |
---|
| 495 | </inlineequation>. |
---|
| 496 | </para> |
---|
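As a quick check of the slope function, the sketch below evaluates it over a full window of n samples (the equal-sized window case; `vorbis_window` is a name chosen here for illustration). The property that matters for lapping is that the squares of two samples half a window apart sum to one.

```python
import math


def vorbis_window(n):
    """Evaluate y = sin(pi/2 * sin^2((x + 0.5) / n * pi)) for x = 0 .. n-1."""
    return [
        math.sin(0.5 * math.pi * math.sin((x + 0.5) / n * math.pi) ** 2)
        for x in range(n)
    ]


def is_power_complementary(window, tolerance=1e-12):
    """Check w[i]^2 + w[i + n/2]^2 == 1 across the overlapping halves."""
    half = len(window) // 2
    return all(
        abs(window[i] ** 2 + window[i + half] ** 2 - 1.0) < tolerance
        for i in range(half)
    )
```

The identity follows because shifting x by n/2 turns the inner sin^2 into cos^2, so the two outer terms become the sine and cosine of the same angle.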
| 497 | |
---|
| 498 | </section> |
---|
| 499 | |
---|
| 500 | <section><title>floor decode</title> |
---|
| 501 | <para> |
---|
| 502 | Each floor is encoded/decoded in channel order; however, each floor |
---|
| 503 | belongs to a 'submap' that specifies which floor configuration to |
---|
| 504 | use. All floors are decoded before residue decode begins.</para> |
---|
| 505 | </section> |
---|
| 506 | |
---|
| 507 | <section><title>residue decode</title> |
---|
| 508 | |
---|
| 509 | <para> |
---|
| 510 | Although the number of residue vectors equals the number of channels, |
---|
| 511 | channel coupling may mean that the raw residue vectors extracted |
---|
| 512 | during decode do not map directly to specific channels. When channel |
---|
| 513 | coupling is in use, some vectors will correspond to coupled magnitude |
---|
| 514 | or angle. The coupling relationships are described in the codec setup |
---|
| 515 | and may differ from frame to frame, due to different mode numbers.</para> |
---|
| 516 | |
---|
| 517 | <para> |
---|
| 518 | Vorbis codes residue vectors in groups by submap; the coding is done |
---|
| 519 | in submap order from submap 0 through n-1. This differs from floors |
---|
| 520 | which are coded using a configuration provided by submap number, but |
---|
| 521 | are coded individually in channel order.</para> |
---|
| 522 | |
---|
| 523 | </section> |
---|
| 524 | |
---|
| 525 | <section><title>inverse channel coupling</title> |
---|
| 526 | |
---|
| 527 | <para> |
---|
| 528 | A detailed discussion of stereo in the Vorbis codec can be found in |
---|
| 529 | the document <ulink url="stereo.html"><citetitle>Stereo Channel Coupling in the |
---|
| 530 | Vorbis CODEC</citetitle></ulink>. Vorbis is not limited to only stereo coupling, but |
---|
| 531 | the stereo document also gives a good overview of the generic coupling |
---|
| 532 | mechanism.</para> |
---|
| 533 | |
---|
| 534 | <para> |
---|
| 535 | Vorbis coupling applies to pairs of residue vectors at a time; |
---|
| 536 | decoupling is done in-place a pair at a time in the order and using |
---|
| 537 | the vectors specified in the current mapping configuration. The |
---|
| 538 | decoupling operation is the same for all pairs, converting square |
---|
| 539 | polar representation (where one vector is magnitude and the second |
---|
| 540 | angle) back to Cartesian representation.</para> |
---|
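A minimal sketch of decoupling one magnitude/angle pair in place follows. The branch rules are taken from a reading of the detailed codec section and should be treated as an assumption here; `decouple_pair` is a hypothetical name.

```python
def decouple_pair(magnitude, angle):
    """Convert one square polar vector pair back to Cartesian, in place.

    Both lists must have the same length; on return each holds the
    residue values of one of the two coupled channels.
    """
    for i in range(len(magnitude)):
        m, a = magnitude[i], angle[i]
        if m > 0:
            if a > 0:
                new_m, new_a = m, m - a
            else:
                new_a, new_m = m, m + a
        else:
            if a > 0:
                new_m, new_a = m, m + a
            else:
                new_a, new_m = m, m - a
        magnitude[i] = new_m
        angle[i] = new_a
```

For example, the pair (4, 1) becomes (4, 3) and the pair (4, -1) becomes (3, 4). The same routine would be applied to each pair on the coupling list in the order the mapping specifies.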
| 541 | |
---|
| 542 | <para> |
---|
| 543 | After decoupling, in order, each pair of vectors on the coupling list, |
---|
| 544 | the resulting residue vectors represent the fine spectral detail |
---|
| 545 | of each output channel.</para> |
---|
| 546 | |
---|
| 547 | </section> |
---|
| 548 | |
---|
| 549 | <section><title>generate floor curve</title> |
---|
| 550 | |
---|
| 551 | <para> |
---|
| 552 | The decoder may choose to generate the floor curve at any appropriate |
---|
| 553 | time. It is reasonable to generate the output curve when the floor |
---|
| 554 | data is decoded from the raw packet, or it can be generated after |
---|
| 555 | inverse coupling and applied to the spectral residue directly, |
---|
| 556 | combining generation and the dot product into one step and eliminating |
---|
| 557 | some working space.</para> |
---|
| 558 | |
---|
| 559 | <para> |
---|
| 560 | Both floor 0 and floor 1 generate a linear-range, linear-domain output |
---|
| 561 | vector to be multiplied (dot product) by the linear-range, |
---|
| 562 | linear-domain spectral residue.</para> |
---|
| 563 | |
---|
| 564 | </section> |
---|
| 565 | |
---|
| 566 | <section><title>compute floor/residue dot product</title> |
---|
| 567 | |
---|
| 568 | <para> |
---|
| 569 | This step is straightforward; for each output channel, the decoder |
---|
| 570 | multiplies the floor curve and residue vectors element by element, |
---|
| 571 | producing the finished audio spectrum of each channel.</para> |
---|
| 572 | |
---|
| 573 | <para> |
---|
| 574 | One point is worth mentioning about this dot product; a common mistake |
---|
| 575 | in a fixed point implementation might be to assume that a 32 bit |
---|
| 576 | fixed-point representation for floor and residue and direct |
---|
| 577 | multiplication of the vectors is sufficient for acceptable spectral |
---|
| 578 | depth in all cases because it happens to mostly work with the current |
---|
| 579 | Xiph.Org reference encoder.</para> |
---|
| 580 | |
---|
| 581 | <para> |
---|
| 582 | However, floor vector values can span ~140dB (~24 bits unsigned), and |
---|
| 583 | the audio spectrum vector should represent a minimum of 120dB (~21 |
---|
| 584 | bits with sign), even when output is to a 16 bit PCM device. For the |
---|
| 585 | residue vector to represent full scale if the floor is nailed to |
---|
| 586 | -140dB, it must be able to span 0 to +140dB. For the residue vector |
---|
| 587 | to reach full scale if the floor is nailed at 0dB, it must be able to |
---|
| 588 | represent -140dB to +0dB. Thus, in order to handle full range |
---|
| 589 | dynamics, a residue vector may span -140dB to +140dB entirely within |
---|
| 590 | spec. A 280dB range is approximately 48 bits with sign; thus the |
---|
| 591 | residue vector must be able to represent a 48 bit range and the dot |
---|
| 592 | product must be able to handle an effective 48 bit times 24 bit |
---|
| 593 | multiplication. This range may be achieved using large (64 bit or |
---|
| 594 | larger) integers, or by implementing a movable binary point |
---|
| 595 | representation.</para> |
---|
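<para>
The bit counts quoted above can be checked with a short calculation.
The sketch below assumes the usual amplitude relation of 20*log10(2)
(about 6.02) dB per bit and adds one bit when a sign is needed; the
helper name is hypothetical and the code is only a worked example of
the arithmetic, not part of any decoder:</para>

```python
import math

# One bit of amplitude resolution corresponds to 20*log10(2) dB, about 6.02 dB.
DB_PER_BIT = 20 * math.log10(2)


def bits_for_range(range_db, signed=False):
    """Return the (fractional) number of bits needed to span range_db."""
    bits = range_db / DB_PER_BIT
    return bits + 1 if signed else bits


floor_bits = bits_for_range(140)            # about 23.3, i.e. ~24 bits unsigned
spectrum_bits = bits_for_range(120, True)   # about 20.9, i.e. ~21 bits with sign
residue_bits = bits_for_range(280, True)    # about 47.5, i.e. ~48 bits with sign
```

<para>
Rounding each result up to a whole bit gives the 24, 21 and 48 bit
figures used in the text.</para>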
| 596 | |
---|
| 597 | </section> |
---|
| 598 | |
---|
| 599 | <section><title>inverse monolithic transform (MDCT)</title> |
---|
| 600 | |
---|
| 601 | <para> |
---|
| 602 | The audio spectrum is converted back into time domain PCM audio via an |
---|
| 603 | inverse Modified Discrete Cosine Transform (MDCT). A detailed |
---|
| 604 | description of the MDCT is available in the paper <ulink |
---|
| 605 | url="http://www.iocon.com/resource/docs/ps/eusipco_corrected.ps"><citetitle pubwork="article">The use of multirate filter banks for coding of high quality digital |
---|
| 606 | audio</citetitle></ulink>, by T. Sporer, K. Brandenburg and B. Edler.</para> |
---|
| 607 | |
---|
| 608 | <para> |
---|
| 609 | Note that the PCM produced directly from the MDCT is not yet finished |
---|
| 610 | audio; it must be lapped with surrounding frames using an appropriate |
---|
| 611 | window (such as the Vorbis window) before the MDCT can be considered |
---|
| 612 | orthogonal.</para> |
---|
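<para>
To illustrate what makes a window 'appropriate' for lapping, the sketch
below evaluates the Vorbis window shape given elsewhere in this
specification, sin(pi/2 * sin^2((i + 0.5)/n * pi)), and computes the sum
of squared window values taken half a block apart. The function name
and the block size of 16 are assumptions chosen only for the example.
For a window applied at both encode and decode, these sums should equal
one, which is the condition commonly required for the overlapped output
to reconstruct the signal:</para>

```python
import math


def vorbis_window(n):
    """Vorbis window of length n: sin(pi/2 * sin^2((i + 0.5) / n * pi))."""
    return [
        math.sin(0.5 * math.pi * math.sin((i + 0.5) / n * math.pi) ** 2)
        for i in range(n)
    ]


n = 16  # illustrative block size only
w = vorbis_window(n)
# Squared window values half a block apart should sum to one.
power_sums = [w[i] ** 2 + w[i + n // 2] ** 2 for i in range(n // 2)]
```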
| 613 | |
---|
| 614 | </section> |
---|
| 615 | |
---|
| 616 | <section><title>overlap/add data</title> |
---|
| 617 | <para> |
---|
| 618 | Windowed MDCT output is overlapped and added with the right hand data |
---|
| 619 | of the previous window such that the 3/4 point of the previous window |
---|
| 620 | is aligned with the 1/4 point of the current window (as illustrated in |
---|
| 621 | the window overlap diagram). The audio data between the |
---|
| 622 | center of the previous frame and the center of the current frame is |
---|
| 623 | now finished and ready to be returned.</para> |
---|
| 624 | </section> |
---|
| 625 | |
---|
| 626 | <section><title>cache right hand data</title> |
---|
| 627 | <para> |
---|
| 628 | The decoder must cache the right hand portion of the current frame to |
---|
| 629 | be lapped with the left hand portion of the next frame. |
---|
| 630 | </para> |
---|
| 631 | </section> |
---|
| 632 | |
---|
| 633 | <section><title>return finished audio data</title> |
---|
| 634 | |
---|
| 635 | <para> |
---|
| 636 | The overlapped portion produced from overlapping the previous and |
---|
| 637 | current frame data is finished data to be returned by the decoder. |
---|
| 638 | This data spans from the center of the previous window to the center |
---|
| 639 | of the current window. In the case of same-sized windows, the amount |
---|
| 640 | of data to return is one-half block, consisting only of the |
---|
| 641 | overlapped portions. When overlapping a short and a long window, much of |
---|
| 642 | the returned range is not actually overlap. This does not damage |
---|
| 643 | transform orthogonality. Take care, however, to return the |
---|
| 644 | correct data range; the amount of data to be returned is: |
---|
| 645 | |
---|
| 646 | <programlisting> |
---|
| 647 | window_blocksize(previous_window)/4+window_blocksize(current_window)/4 |
---|
| 648 | </programlisting> |
---|
| 649 | |
---|
| 650 | from the center of the previous window to the center of the current |
---|
| 651 | window.</para> |
---|
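<para>
The expression above can be evaluated directly. The following sketch
uses a hypothetical helper name and example block sizes of 256 and 2048
(any legal pair of block sizes could be substituted) to show the
same-sized and mixed-size cases:</para>

```python
def finished_sample_count(previous_blocksize, current_blocksize):
    """Samples between the centers of the previous and current windows."""
    return previous_blocksize // 4 + current_blocksize // 4


same_size = finished_sample_count(2048, 2048)     # 1024, one half block
short_to_long = finished_sample_count(256, 2048)  # 576
```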
| 652 | |
---|
| 653 | <para> |
---|
| 654 | Data is not returned from the first frame; it must be used to 'prime' |
---|
| 655 | the decode engine. The encoder accounts for this priming when |
---|
| 656 | calculating PCM offsets; after the first frame, the proper PCM output |
---|
| 657 | offset is '0' (as no data has been returned yet).</para> |
---|
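<para>
The overlap/add, caching and priming steps described above can be
sketched together as follows. This is a simplified illustration, not
the reference decoder: it assumes every frame has the same block size
and has already been windowed, and the class and method names are
hypothetical. Mixed long and short windows would need the additional
range handling described above:</para>

```python
class OverlapAdder:
    """Lap equal-sized, already windowed frames (a simplifying assumption)."""

    def __init__(self):
        self.cached_right = None  # right half of the previous frame, if any

    def push(self, windowed_frame):
        """Accept one frame and return the finished samples it completes."""
        half = len(windowed_frame) // 2
        left = windowed_frame[:half]
        right = windowed_frame[half:]
        if self.cached_right is None:
            finished = []  # the first frame only primes the decoder
        else:
            # Add the cached right half to the new left half, sample by sample.
            finished = [a + b for a, b in zip(self.cached_right, left)]
        self.cached_right = right  # keep for lapping with the next frame
        return finished
```

<para>
With this sketch, the first call to push() returns no samples, and each
later call returns one half block, matching the same-sized window case
described above.</para>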
| 658 | </section> |
---|
| 659 | </section> |
---|
| 660 | |
---|
| 661 | </section> |
---|
| 662 | |
---|
| 663 | </section> |
---|
| 664 | <!-- end Vorbis I specification introduction and description --> |
---|
| 665 | |
---|