1 | <?xml version="1.0" standalone="no"?> |
---|
2 | <!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
---|
3 | "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" [ |
---|
4 | |
---|
5 | ]> |
---|
6 | |
---|
7 | <section id="vorbis-spec-intro"> |
---|
8 | <sectioninfo> |
---|
9 | <releaseinfo> |
---|
10 | $Id: 01-introduction.xml 7186 2004-07-20 07:19:25Z xiphmont $ |
---|
11 | </releaseinfo> |
---|
12 | </sectioninfo> |
---|
13 | <title>Introduction and Description</title> |
---|
14 | |
---|
15 | <section> |
---|
16 | <title>Overview</title> |
---|
17 | |
---|
18 | <para> |
---|
19 | This document provides a high-level description of the Vorbis codec's |
---|
20 | construction. A bit-by-bit specification appears beginning in |
---|
21 | <xref linkend="vorbis-spec-codec"/>. |
---|
22 | The later sections assume a high-level |
---|
23 | understanding of the Vorbis decode process, which is |
---|
24 | provided here.</para> |
---|
25 | |
---|
26 | <section> |
---|
27 | <title>Application</title> |
---|
28 | <para> |
---|
29 | Vorbis is a general purpose perceptual audio CODEC intended to allow |
---|
30 | maximum encoder flexibility, thus allowing it to scale competitively |
---|
31 | over an exceptionally wide range of bitrates. At the high |
---|
32 | quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits) |
---|
33 | it is in the same league as MPEG-2 and MPC. Similarly, the 1.0 |
---|
34 | encoder can encode high-quality CD and DAT rate stereo at below 48kbps |
---|
35 | without resampling to a lower rate. Vorbis is also intended for |
---|
36 | lower and higher sample rates (from 8kHz telephony to 192kHz digital |
---|
37 | masters) and a range of channel representations (monaural, |
---|
38 | polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255 |
---|
39 | discrete channels). |
---|
40 | </para> |
---|
41 | </section> |
---|
42 | |
---|
43 | <section> |
---|
44 | <title>Classification</title> |
---|
45 | <para> |
---|
46 | Vorbis I is a forward-adaptive monolithic transform CODEC based on the |
---|
47 | Modified Discrete Cosine Transform. The codec is structured to allow |
---|
48 | addition of a hybrid wavelet filterbank in Vorbis II to offer better |
---|
49 | transient response and reproduction using a transform better suited to |
---|
50 | localized time events. |
---|
51 | </para> |
---|
52 | </section> |
---|
53 | |
---|
54 | <section> |
---|
55 | <title>Assumptions</title> |
---|
56 | |
---|
57 | <para> |
---|
58 | The Vorbis CODEC design assumes a complex, psychoacoustically-aware |
---|
59 | encoder and simple, low-complexity decoder. Vorbis decode is |
---|
60 | computationally simpler than mp3, although it does require more |
---|
61 | working memory as Vorbis has no static probability model; the vector |
---|
62 | codebooks used in the first stage of decoding from the bitstream are |
---|
63 | packed in their entirety into the Vorbis bitstream headers. In |
---|
64 | packed form, these codebooks occupy only a few kilobytes; the extent |
---|
65 | to which they are pre-decoded into a cache is the dominant factor in |
---|
66 | decoder memory usage. |
---|
67 | </para> |
---|
68 | |
---|
69 | <para> |
---|
70 | Vorbis provides none of its own framing, synchronization or protection |
---|
71 | against errors; it is solely a method of accepting input audio, |
---|
72 | dividing it into individual frames and compressing these frames into |
---|
73 | raw, unformatted 'packets'. The decoder then accepts these raw |
---|
74 | packets in sequence, decodes them, synthesizes audio frames from |
---|
75 | them, and reassembles the frames into a facsimile of the original |
---|
76 | audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no |
---|
77 | minimum size, maximum size, or fixed/expected size. Packets |
---|
78 | are designed so that they may be truncated (or padded) and remain |
---|
79 | decodable; this is not to be considered an error condition and is used |
---|
80 | extensively in bitrate management in peeling. Both the transport |
---|
81 | mechanism and decoder must allow that a packet may be any size, or |
---|
82 | end before or after packet decode expects.</para> |
---|
83 | |
---|
84 | <para> |
---|
85 | Vorbis packets are thus intended to be used with a transport mechanism |
---|
86 | that provides free-form framing, sync, positioning and error correction |
---|
87 | in accordance with these design assumptions, such as Ogg (for file |
---|
88 | transport) or RTP (for network multicast). For purposes of a few |
---|
89 | examples in this document, we will assume that Vorbis is to be |
---|
90 | embedded in an Ogg stream specifically, although this is by no means a |
---|
91 | requirement or fundamental assumption in the Vorbis design.</para> |
---|
92 | |
---|
93 | <para> |
---|
94 | The specification for embedding Vorbis into |
---|
95 | an Ogg transport stream is in <xref linkend="vorbis-over-ogg"/>. |
---|
96 | </para> |
---|
97 | |
---|
98 | </section> |
---|
99 | |
---|
100 | <section> |
---|
101 | <title>Codec Setup and Probability Model</title> |
---|
102 | |
---|
103 | <para> |
---|
104 | Vorbis' heritage is as a research CODEC and its current design |
---|
105 | reflects a desire to allow multiple decades of continuous encoder |
---|
106 | improvement before running out of room within the codec specification. |
---|
107 | For these reasons, configurable aspects of codec setup intentionally |
---|
108 | lean toward the extreme of forward adaptive.</para> |
---|
109 | |
---|
110 | <para> |
---|
111 | The single most controversial design decision in Vorbis (and the most |
---|
112 | unusual for a Vorbis developer to keep in mind) is that the entire |
---|
113 | probability model of the codec, the Huffman and VQ codebooks, is |
---|
114 | packed into the bitstream header along with extensive CODEC setup |
---|
115 | parameters (often several hundred fields). Unlike with MPEG audio |
---|
116 | layers, this makes it impossible to embed a simple frame type |
---|
117 | flag in each audio packet, or to begin decode at any frame in the stream |
---|
118 | without having previously fetched the codec setup header. |
---|
119 | </para> |
---|
120 | |
---|
121 | <note><para> |
---|
122 | Vorbis <emphasis>can</emphasis> initiate decode at any arbitrary packet within a |
---|
123 | bitstream so long as the codec has been initialized/setup with the |
---|
124 | setup headers.</para></note> |
---|
125 | |
---|
126 | <para> |
---|
127 | Thus, Vorbis headers are both required for decode to begin and |
---|
128 | relatively large as bitstream headers go. The header size is |
---|
129 | unbounded, although for streaming a rule-of-thumb of 4kB or less is |
---|
130 | recommended (and Xiph.Org's Vorbis encoder follows this suggestion).</para> |
---|
131 | |
---|
132 | <para> |
---|
133 | Our own design work indicates the primary liability of the |
---|
134 | required header is in mindshare; it is an unusual design and thus |
---|
135 | causes some amount of complaint among engineers as this runs against |
---|
136 | current design trends (and also points out limitations in some |
---|
137 | existing software/interface designs, such as Windows' ACM codec |
---|
138 | framework). However, we find that it does not fundamentally limit |
---|
139 | Vorbis' suitable application space.</para> |
---|
140 | |
---|
141 | </section> |
---|
142 | |
---|
143 | <section><title>Format Specification</title> |
---|
144 | <para> |
---|
145 | The Vorbis format is well-defined by its decode specification; any |
---|
146 | encoder that produces packets that are correctly decoded by the |
---|
147 | reference Vorbis decoder described below may be considered a proper |
---|
148 | Vorbis encoder. A decoder must faithfully and completely implement |
---|
149 | the specification defined below (except where noted) to be considered |
---|
150 | a proper Vorbis decoder.</para> |
---|
151 | </section> |
---|
152 | |
---|
153 | <section><title>Hardware Profile</title> |
---|
154 | <para> |
---|
155 | Although Vorbis decode is computationally simple, it may still run |
---|
156 | into specific limitations of an embedded design. For this reason, |
---|
157 | embedded designs are allowed to deviate in limited ways from the |
---|
158 | 'full' decode specification yet still be certified compliant. These |
---|
159 | optional omissions are labelled in the spec where relevant.</para> |
---|
160 | </section> |
---|
161 | |
---|
162 | </section> |
---|
163 | |
---|
164 | <section> |
---|
165 | <title>Decoder Configuration</title> |
---|
166 | |
---|
167 | <para> |
---|
168 | Decoder setup consists of configuration of multiple, self-contained |
---|
169 | component abstractions that perform specific functions in the decode |
---|
170 | pipeline. Each different component instance of a specific type is |
---|
171 | semantically interchangeable; decoder configuration consists both of |
---|
172 | internal component configuration, as well as arrangement of specific |
---|
173 | instances into a decode pipeline. Componentry arrangement is roughly |
---|
174 | as follows:</para> |
---|
175 | |
---|
176 | <mediaobject> |
---|
177 | <imageobject> |
---|
178 | <imagedata fileref="components.png" format="PNG"/> |
---|
179 | </imageobject> |
---|
180 | <textobject> |
---|
181 | <phrase>decoder pipeline configuration</phrase> |
---|
182 | </textobject> |
---|
183 | </mediaobject> |
---|
184 | |
---|
185 | <section><title>Global Config</title> |
---|
186 | <para> |
---|
187 | Global codec configuration consists of a few audio related fields |
---|
188 | (sample rate, channels), Vorbis version (always '0' in Vorbis I), |
---|
189 | bitrate hints, and the lists of component instances. All other |
---|
190 | configuration is in the context of specific components.</para> |
---|
191 | </section> |
---|
192 | |
---|
193 | <section><title>Mode</title> |
---|
194 | |
---|
195 | <para> |
---|
196 | Each Vorbis frame is coded according to a master 'mode'. A bitstream |
---|
197 | may use one or many modes.</para> |
---|
198 | |
---|
199 | <para> |
---|
200 | The mode mechanism is used to encode a frame according to one of |
---|
201 | multiple possible methods with the intention of choosing a method best |
---|
202 | suited to that frame. Switching modes is, for example, how the frame size |
---|
203 | is changed from frame to frame. The mode number of a frame serves as a |
---|
204 | top-level configuration switch for all other specific aspects of frame |
---|
205 | decode.</para> |
---|
206 | |
---|
207 | <para> |
---|
208 | A 'mode' configuration consists of a frame size setting, window type |
---|
209 | (always 0, the Vorbis window, in Vorbis I), transform type (always |
---|
210 | type 0, the MDCT, in Vorbis I) and a mapping number. The mapping |
---|
211 | number specifies which mapping configuration instance to use for |
---|
212 | low-level packet decode and synthesis.</para> |
---|
213 | |
---|
214 | </section> |
---|
215 | |
---|
216 | <section><title>Mapping</title> |
---|
217 | |
---|
218 | <para> |
---|
219 | A mapping contains a channel coupling description and a list of |
---|
220 | 'submaps' that bundle sets of channel vectors together for grouped |
---|
221 | encoding and decoding. These submaps are not references to external |
---|
222 | components; the submap list is internal and specific to a mapping.</para> |
---|
223 | |
---|
224 | <para> |
---|
225 | A 'submap' is a configuration/grouping that applies to a subset of |
---|
226 | floor and residue vectors within a mapping. The submap functions as a |
---|
227 | last layer of indirection such that specific special floor or residue |
---|
228 | settings can be applied not only to all the vectors in a given mode, |
---|
229 | but also specific vectors in a specific mode. Each submap specifies |
---|
230 | the proper floor and residue instance number to use for decoding that |
---|
231 | submap's spectral floor and spectral residue vectors.</para> |
---|
232 | |
---|
233 | <para> |
---|
234 | As an example:</para> |
---|
235 | |
---|
236 | <para> |
---|
237 | Assume a Vorbis stream that contains six channels in the standard 5.1 |
---|
238 | format. The sixth channel, as is normal in 5.1, is bass only. |
---|
239 | Therefore it would be wasteful to encode a full-spectrum version of it |
---|
240 | as with the other channels. The submapping mechanism can be used to |
---|
241 | apply a full range floor and residue encoding to channels 0 through 4, |
---|
242 | and a bass-only representation to the bass channel, thus saving space. |
---|
243 | In this example, channels 0-4 belong to submap 0 (which indicates use |
---|
244 | of a full-range floor) and channel 5 belongs to submap 1, which uses a |
---|
245 | bass-only representation.</para> |
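<para>
The 5.1 example above can be sketched as a small data structure. This is only an illustration: the field names (channel_to_submap, submaps, floor, residue) and the instance numbers are hypothetical and are not taken from the specification.</para>

```python
# Hypothetical in-memory form of one mapping for a six-channel 5.1 stream.
# Field names and instance numbers are illustrative assumptions only.
mapping = {
    # channel index -> submap number
    # (channels 0-4 are full range, channel 5 is the bass-only channel)
    "channel_to_submap": [0, 0, 0, 0, 0, 1],
    # submap number -> floor and residue instance numbers
    "submaps": [
        {"floor": 0, "residue": 0},  # full-range configuration
        {"floor": 1, "residue": 1},  # bass-only configuration
    ],
}


def floor_instance_for_channel(mapping, channel):
    """Return the floor instance number used to decode the given channel."""
    submap = mapping["channel_to_submap"][channel]
    return mapping["submaps"][submap]["floor"]
```

<para>
Looking up channel 5 follows the indirection channel, then submap, then floor instance, which is the "last layer of indirection" described above.</para>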
---|
246 | |
---|
247 | </section> |
---|
248 | |
---|
249 | <section><title>Floor</title> |
---|
250 | |
---|
251 | <para> |
---|
252 | Vorbis encodes a spectral 'floor' vector for each PCM channel. This |
---|
253 | vector is a low-resolution representation of the audio spectrum for |
---|
254 | the given channel in the current frame, generally used akin to a |
---|
255 | whitening filter. It is named a 'floor' because the Xiph.Org |
---|
256 | reference encoder has historically used it as a unit-baseline for |
---|
257 | spectral resolution.</para> |
---|
258 | |
---|
259 | <para> |
---|
260 | A floor encoding may be of two types. Floor 0 uses a packed LSP |
---|
261 | representation on a dB amplitude scale and Bark frequency scale. |
---|
262 | Floor 1 represents the curve as a piecewise linear interpolated |
---|
263 | representation on a dB amplitude scale and linear frequency scale. |
---|
264 | The two floors are semantically interchangeable in |
---|
265 | encoding/decoding. However, floor type 1 provides more stable |
---|
266 | inter-frame behavior, and so is the preferred choice in all |
---|
267 | coupled-stereo and high bitrate modes. Floor 1 is also considerably |
---|
268 | less expensive to decode than floor 0.</para> |
---|
269 | |
---|
270 | <para> |
---|
271 | Floor 0 is not to be considered deprecated, but it is of limited |
---|
272 | modern use. No known Vorbis encoder past Xiph.Org's own beta 4 makes |
---|
273 | use of floor 0.</para> |
---|
274 | |
---|
275 | <para> |
---|
276 | The values coded/decoded by a floor are both compactly formatted and |
---|
277 | make use of entropy coding to save space. For this reason, a floor |
---|
278 | configuration generally refers to multiple codebooks in the codebook |
---|
279 | component list. Entropy coding is thus provided as an abstraction, |
---|
280 | and each floor instance may choose from any and all available |
---|
281 | codebooks when coding/decoding.</para> |
---|
282 | |
---|
283 | </section> |
---|
284 | |
---|
285 | <section><title>Residue</title> |
---|
286 | <para> |
---|
287 | The spectral residue is the fine structure of the audio spectrum |
---|
288 | once the floor curve has been subtracted out. In simplest terms, it |
---|
289 | is coded in the bitstream using cascaded (multi-pass) vector |
---|
290 | quantization according to one of three specific packing/coding |
---|
291 | algorithms numbered 0 through 2. The packing algorithm details are |
---|
292 | configured by residue instance. As with the floor components, the |
---|
293 | final VQ/entropy encoding is provided by external codebook instances |
---|
294 | and each residue instance may choose from any and all available |
---|
295 | codebooks.</para> |
---|
296 | </section> |
---|
297 | |
---|
298 | <section><title>Codebooks</title> |
---|
299 | |
---|
300 | <para> |
---|
301 | A codebook is a self-contained abstraction that performs entropy |
---|
302 | decoding and, optionally, uses the entropy-decoded integer value as an |
---|
303 | offset into an index of output value vectors, returning the indicated |
---|
304 | vector of values.</para> |
---|
305 | |
---|
306 | <para> |
---|
307 | The entropy coding in a Vorbis I codebook is provided by a standard |
---|
308 | Huffman binary tree representation. This tree is tightly packed using |
---|
309 | one of several methods, depending on whether codeword lengths are |
---|
310 | ordered or unordered, or the tree is sparse.</para> |
---|
311 | |
---|
312 | <para> |
---|
313 | The codebook vector index is similarly packed according to index |
---|
314 | characteristic. Most commonly, the vector index is encoded as a |
---|
315 | single list of possible values that are then permuted into |
---|
316 | a list of n-dimensional rows (lattice VQ).</para> |
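<para>
A minimal sketch of the lattice idea, under simplifying assumptions: the shared list of possible values is used as-is (no scaling, offset, or running-sum options), and the entry number is treated as a numeral whose digits select one list element per dimension. The function name and argument names are hypothetical; the normative unpacking is given in the codebook section.</para>

```python
def lattice_vector(entry, values, dimensions):
    """Build the output vector for one codebook entry from a shared value list."""
    vector = []
    divisor = 1
    for _ in range(dimensions):
        # Each dimension selects one element of the shared list, treating the
        # entry number as a base-len(values) numeral, least significant first.
        vector.append(values[(entry // divisor) % len(values)])
        divisor *= len(values)
    return vector
```

<para>
With three possible values and two dimensions, nine distinct vectors are available while only three values need to be stored.</para>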
---|
317 | |
---|
318 | </section> |
---|
319 | |
---|
320 | </section> |
---|
321 | |
---|
322 | |
---|
323 | <section> |
---|
324 | <title>High-level Decode Process</title> |
---|
325 | |
---|
326 | <section> |
---|
327 | <title>Decode Setup</title> |
---|
328 | |
---|
329 | <para> |
---|
330 | Before decoding can begin, a decoder must initialize using the |
---|
331 | bitstream headers matching the stream to be decoded. Vorbis uses |
---|
332 | three header packets; all are required, in-order, by this |
---|
333 | specification. Once set up, decode may begin at any audio packet |
---|
334 | belonging to the Vorbis stream. In Vorbis I, all packets after the |
---|
335 | three initial headers are audio packets. </para> |
---|
336 | |
---|
337 | <para> |
---|
338 | The header packets are, in order, the identification |
---|
339 | header, the comment header, and the setup header.</para> |
---|
340 | |
---|
341 | <section><title>Identification Header</title> |
---|
342 | <para> |
---|
343 | The identification header identifies the bitstream as Vorbis and gives the |
---|
344 | Vorbis version and the simple audio characteristics of the stream, such as |
---|
345 | sample rate and number of channels.</para> |
---|
346 | </section> |
---|
347 | |
---|
348 | <section><title>Comment Header</title> |
---|
349 | <para> |
---|
350 | The comment header includes user text comments ("tags") and a vendor |
---|
351 | string for the application/library that produced the bitstream. The |
---|
352 | encoding and proper use of the comment header is described in |
---|
353 | <xref linkend="vorbis-spec-comment"/>.</para> |
---|
354 | </section> |
---|
355 | |
---|
356 | <section><title>Setup Header</title> |
---|
357 | <para> |
---|
358 | The setup header includes extensive CODEC setup information as well as |
---|
359 | the complete VQ and Huffman codebooks needed for decode.</para> |
---|
360 | </section> |
---|
361 | |
---|
362 | </section> |
---|
363 | |
---|
364 | <section><title>Decode Procedure</title> |
---|
365 | |
---|
366 | <highlights> |
---|
367 | <para> |
---|
368 | The decoding and synthesis procedure for all audio packets is |
---|
369 | fundamentally the same. |
---|
370 | <orderedlist> |
---|
371 | <listitem><simpara>decode packet type flag</simpara></listitem> |
---|
372 | <listitem><simpara>decode mode number</simpara></listitem> |
---|
373 | <listitem><simpara>decode window shape (long windows only)</simpara></listitem> |
---|
374 | <listitem><simpara>decode floor</simpara></listitem> |
---|
375 | <listitem><simpara>decode residue into residue vectors</simpara></listitem> |
---|
376 | <listitem><simpara>inverse channel coupling of residue vectors</simpara></listitem> |
---|
377 | <listitem><simpara>generate floor curve from decoded floor data</simpara></listitem> |
---|
378 | <listitem><simpara>compute dot product of floor and residue, producing audio spectrum vector</simpara></listitem> |
---|
379 | <listitem><simpara>inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I</simpara></listitem> |
---|
380 | <listitem><simpara>overlap/add left-hand output of transform with right-hand output of previous frame</simpara></listitem> |
---|
381 | <listitem><simpara>store right-hand data from transform of current frame for future lapping</simpara></listitem> |
---|
382 | <listitem><simpara>if not first frame, return results of overlap/add as audio result of current frame</simpara></listitem> |
---|
383 | </orderedlist> |
---|
384 | </para> |
---|
385 | </highlights> |
---|
386 | |
---|
387 | <para> |
---|
388 | Note that clever rearrangement of the synthesis arithmetic is |
---|
389 | possible; as an example, one can take advantage of symmetries in the |
---|
390 | MDCT to store the right-hand transform data of a partial MDCT for a |
---|
391 | 50% inter-frame buffer space savings, and then complete the transform |
---|
392 | later before overlap/add with the next frame. This optimization |
---|
393 | produces entirely equivalent output and is naturally perfectly legal. |
---|
394 | The decoder must be <emphasis>entirely mathematically equivalent</emphasis> to the |
---|
395 | specification; it need not be a literal semantic implementation.</para> |
---|
396 | |
---|
397 | <section><title>Packet type decode</title> |
---|
398 | |
---|
399 | <para> |
---|
400 | Vorbis I uses four packet types. The first three packet types mark each |
---|
401 | of the three Vorbis headers described above. The fourth packet type |
---|
402 | marks an audio packet. All other packet types are reserved; packets |
---|
403 | marked with a reserved type should be ignored.</para> |
---|
404 | |
---|
405 | <para> |
---|
406 | Following the three header packets, all packets in a Vorbis I stream |
---|
407 | are audio. The first step of audio packet decode is to read and |
---|
408 | verify the packet type; <emphasis>a non-audio packet when audio is expected |
---|
409 | indicates stream corruption or a non-compliant stream. The decoder |
---|
410 | must ignore the packet and not attempt decoding it to |
---|
411 | audio</emphasis>.</para> |
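<para>
The packet type check can be sketched as follows. The sketch assumes the type values given in the later header sections (1, 3 and 5 for the identification, comment and setup headers, and a zero type flag bit for audio) and least-significant-bit-first packing; the function name and the returned labels are hypothetical.</para>

```python
# Header type values as assumed from the later header sections.
HEADER_TYPES = {1: "identification", 3: "comment", 5: "setup"}


def classify_packet(packet):
    """Classify a raw packet (a bytes object) by its leading type field."""
    if len(packet) == 0:
        return "empty"
    first = packet[0]
    # Bits are packed least significant bit first, so the type flag of an
    # audio packet is the low bit of the first byte, and it must be zero.
    if first % 2 == 0:
        return "audio"
    # Odd values other than 1, 3 and 5 are treated as reserved types.
    return HEADER_TYPES.get(first, "reserved")
```

<para>
A decoder that has finished header setup would decode only packets classified as audio and skip everything else.</para>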
---|
412 | |
---|
413 | </section> |
---|
414 | |
---|
415 | |
---|
416 | <section><title>Mode decode</title> |
---|
417 | <para> |
---|
418 | Vorbis allows an encoder to set up multiple, numbered packet 'modes', |
---|
419 | as described earlier, all of which may be used in a given Vorbis |
---|
420 | stream. The mode is encoded as an integer used as a direct offset into |
---|
421 | the mode instance index. </para> |
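<para>
As a hedged illustration of how wide the mode number field is, the sketch below assumes the rule used in the later audio packet section: the field is ilog(mode count - 1) bits wide, where ilog returns the number of bits needed to represent its argument. The function names are chosen for this example.</para>

```python
def ilog(x):
    """Number of bits needed to represent x; 0 when x is zero or negative."""
    return x.bit_length() if x > 0 else 0


def mode_field_bits(mode_count):
    """Width in bits of the mode number field for a stream with mode_count modes."""
    return ilog(mode_count - 1)
```

<para>
A stream with a single mode therefore spends no bits on the mode number, and a stream with two modes (for example one short-window and one long-window mode) spends one bit per packet.</para>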
---|
422 | </section> |
---|
423 | |
---|
424 | <section id="vorbis-spec-window"> |
---|
425 | <title>Window shape decode (long windows only)</title> |
---|
426 | |
---|
427 | <para> |
---|
428 | Vorbis frames may be one of two PCM sample sizes specified during |
---|
429 | codec setup. In Vorbis I, legal frame sizes are powers of two from 64 |
---|
430 | to 8192 samples. Aside from coupling, Vorbis handles channels as |
---|
431 | independent vectors and these frame sizes are in samples per channel.</para> |
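<para>
The frame size constraint can be checked with a few lines of code. The sketch assumes, in addition to the power-of-two range stated above, that the short size may not exceed the long size (a constraint taken from the identification header description); the names are hypothetical.</para>

```python
# Powers of two from 64 (2**6) through 8192 (2**13), in samples per channel.
LEGAL_BLOCKSIZES = tuple(2 ** e for e in range(6, 14))


def blocksizes_are_legal(short_size, long_size):
    """Check a (short, long) frame size pair against the Vorbis I limits."""
    return (short_size in LEGAL_BLOCKSIZES
            and long_size in LEGAL_BLOCKSIZES
            and long_size >= short_size)
```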
---|
432 | |
---|
433 | <para> |
---|
434 | Vorbis uses an overlapping transform, namely the MDCT, to blend one |
---|
435 | frame into the next, avoiding most inter-frame block boundary |
---|
436 | artifacts. The MDCT output of one frame is windowed according to MDCT |
---|
437 | requirements, overlapped 50% with the output of the previous frame and |
---|
438 | added. The window shape assures seamless reconstruction. </para> |
---|
439 | |
---|
440 | <para> |
---|
441 | This is easy to visualize in the case of equal-sized windows:</para> |
---|
442 | |
---|
443 | <mediaobject> |
---|
444 | <imageobject> |
---|
445 | <imagedata fileref="window1.png" format="PNG"/> |
---|
446 | </imageobject> |
---|
447 | <textobject> |
---|
448 | <phrase>overlap of two equal-sized windows</phrase> |
---|
449 | </textobject> |
---|
450 | </mediaobject> |
---|
451 | |
---|
452 | <para> |
---|
453 | And slightly more complex in the case of overlapping unequal-sized |
---|
454 | windows:</para> |
---|
455 | |
---|
456 | <mediaobject> |
---|
457 | <imageobject> |
---|
458 | <imagedata fileref="window2.png" format="PNG"/> |
---|
459 | </imageobject> |
---|
460 | <textobject> |
---|
461 | <phrase>overlap of a long and a short window</phrase> |
---|
462 | </textobject> |
---|
463 | </mediaobject> |
---|
464 | |
---|
465 | <para> |
---|
466 | In the unequal-sized window case, the window shape of the long window |
---|
467 | must be modified for seamless lapping as above. It is possible to |
---|
468 | correctly infer window shape to be applied to the current window from |
---|
469 | knowing the sizes of the current, previous and next window. It is |
---|
470 | legal for a decoder to use this method. However, in the case of a long |
---|
471 | window (short windows require no modification), Vorbis also codes two |
---|
472 | flag bits to specify pre- and post- window shape. Although not |
---|
473 | strictly necessary for function, this minor redundancy allows a packet |
---|
474 | to be fully decoded to the point of lapping entirely independently of |
---|
475 | any other packet, allowing easier abstraction of decode layers as well |
---|
476 | as allowing a greater level of easy parallelism in encode and |
---|
477 | decode.</para> |
---|
478 | |
---|
479 | <para> |
---|
480 | A description of valid window functions for use with an inverse MDCT |
---|
481 | can be found in the paper |
---|
482 | <citetitle pubwork="article"> |
---|
483 | <ulink url="http://www.iocon.com/resource/docs/ps/eusipco_corrected.ps"> |
---|
484 | The use of multirate filter banks for coding of high quality digital |
---|
485 | audio</ulink></citetitle>, by T. Sporer, K. Brandenburg and B. Edler. Vorbis windows |
---|
486 | all use the slope function |
---|
487 | <inlineequation> |
---|
488 | |
---|
489 | <alt>y=sin(.5*pi*sin^2((x+.5)/n*pi))</alt> |
---|
490 | <inlinemediaobject> |
---|
491 | <textobject> |
---|
492 | <phrase>$y = \sin(.5*\pi \, \sin^2((x+.5)/n*\pi))$</phrase> |
---|
493 | </textobject> |
---|
494 | </inlinemediaobject> |
---|
495 | </inlineequation>. |
---|
496 | </para> |
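<para>
The slope function can be evaluated directly. The sketch below assumes that n is the full window length and that x runs from 0 to n-1, which covers the equal-sized window case only; the function names are hypothetical. It also checks numerically that the squares of two samples half a window apart sum to one, which is the property that lets overlapped frames reconstruct seamlessly.</para>

```python
import math


def vorbis_window(n):
    """Return the n-point Vorbis window, taking n as the full window length."""
    return [math.sin(0.5 * math.pi * math.sin((x + 0.5) / n * math.pi) ** 2)
            for x in range(n)]


def lapping_error(n):
    """Largest deviation of w[x]^2 + w[x + n/2]^2 from 1 over the overlap."""
    w = vorbis_window(n)
    half = n // 2
    return max(abs(w[x] ** 2 + w[x + half] ** 2 - 1.0) for x in range(half))
```

<para>
For any even n the reported error is at the level of floating-point rounding, since sin^2(a) + sin^2(pi/2 - a) = 1 with a = .5*pi*sin^2((x+.5)/n*pi).</para>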
---|
497 | |
---|
498 | </section> |
---|
499 | |
---|
500 | <section><title>floor decode</title> |
---|
501 | <para> |
---|
502 | Each floor is encoded/decoded in channel order; however, each floor |
---|
503 | belongs to a 'submap' that specifies which floor configuration to |
---|
504 | use. All floors are decoded before residue decode begins.</para> |
---|
505 | </section> |
---|
506 | |
---|
507 | <section><title>residue decode</title> |
---|
508 | |
---|
509 | <para> |
---|
510 | Although the number of residue vectors equals the number of channels, |
---|
511 | channel coupling may mean that the raw residue vectors extracted |
---|
512 | during decode do not map directly to specific channels. When channel |
---|
513 | coupling is in use, some vectors will correspond to coupled magnitude |
---|
514 | or angle. The coupling relationships are described in the codec setup |
---|
515 | and may differ from frame to frame, due to different mode numbers.</para> |
---|
516 | |
---|
517 | <para> |
---|
518 | Vorbis codes residue vectors in groups by submap; the coding is done |
---|
519 | in submap order from submap 0 through n-1. This differs from floors |
---|
520 | which are coded using a configuration provided by submap number, but |
---|
521 | are coded individually in channel order.</para> |
---|
522 | |
---|
523 | </section> |
---|
524 | |
---|
525 | <section><title>inverse channel coupling</title> |
---|
526 | |
---|
527 | <para> |
---|
528 | A detailed discussion of stereo in the Vorbis codec can be found in |
---|
529 | the document <ulink url="stereo.html"><citetitle>Stereo Channel Coupling in the |
---|
530 | Vorbis CODEC</citetitle></ulink>. Vorbis is not limited to only stereo coupling, but |
---|
531 | the stereo document also gives a good overview of the generic coupling |
---|
532 | mechanism.</para> |
---|
533 | |
---|
534 | <para> |
---|
535 | Vorbis coupling applies to pairs of residue vectors at a time; |
---|
536 | decoupling is done in-place a pair at a time in the order and using |
---|
537 | the vectors specified in the current mapping configuration. The |
---|
538 | decoupling operation is the same for all pairs, converting square |
---|
539 | polar representation (where one vector is magnitude and the second |
---|
540 | angle) back to Cartesian representation.</para> |
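<para>
A sketch of the decoupling step for one magnitude/angle pair, written from the rule given in the later decode section; it should be checked against that section before being relied on. The function names are hypothetical.</para>

```python
def decouple(m, a):
    """Convert one square-polar (magnitude, angle) pair to (new_m, new_a)."""
    if m > 0:
        if a > 0:
            return m, m - a
        return m + a, m
    if a > 0:
        return m, m + a
    return m - a, m


def decouple_vectors(magnitude, angle):
    """Decouple two equal-length residue vectors in place, element by element."""
    for i in range(len(magnitude)):
        magnitude[i], angle[i] = decouple(magnitude[i], angle[i])
```

<para>
Only additions, subtractions and sign tests are needed, which keeps this step cheap for the decoder.</para>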
---|
541 | |
---|
542 | <para> |
---|
543 | After each pair of vectors on the coupling list has been decoupled in order, |
---|
544 | the resulting residue vectors represent the fine spectral detail |
---|
545 | of each output channel.</para> |
---|
546 | |
---|
547 | </section> |
---|
548 | |
---|
549 | <section><title>generate floor curve</title> |
---|
550 | |
---|
551 | <para> |
---|
552 | The decoder may choose to generate the floor curve at any appropriate |
---|
553 | time. It is reasonable to generate the output curve when the floor |
---|
554 | data is decoded from the raw packet, or it can be generated after |
---|
555 | inverse coupling and applied to the spectral residue directly, |
---|
556 | combining generation and the dot product into one step and eliminating |
---|
557 | some working space.</para> |
---|
558 | |
---|
559 | <para> |
---|
560 | Both floor 0 and floor 1 generate a linear-range, linear-domain output |
---|
561 | vector to be multiplied (dot product) by the linear-range, |
---|
562 | linear-domain spectral residue.</para> |
---|
563 | |
---|
564 | </section> |
---|
565 | |
---|
566 | <section><title>compute floor/residue dot product</title> |
---|
567 | |
---|
568 | <para> |
---|
569 | This step is straightforward; for each output channel, the decoder |
---|
570 | multiplies the floor curve and residue vectors element by element, |
---|
571 | producing the finished audio spectrum of each channel.</para> |
---|
572 | |
---|
573 | <para> |
---|
574 | One point is worth mentioning about this dot product; a common mistake |
---|
575 | in a fixed point implementation might be to assume that a 32 bit |
---|
576 | fixed-point representation for floor and residue and direct |
---|
577 | multiplication of the vectors is sufficient for acceptable spectral |
---|
578 | depth in all cases because it happens to mostly work with the current |
---|
579 | Xiph.Org reference encoder.</para> |
---|
580 | |
---|
581 | <para> |
---|
582 | However, floor vector values can span ~140dB (~24 bits unsigned), and |
---|
583 | the audio spectrum vector should represent a minimum of 120dB (~21 |
---|
584 | bits with sign), even when output is to a 16 bit PCM device. For the |
---|
585 | residue vector to represent full scale if the floor is nailed to |
---|
586 | -140dB, it must be able to span 0 to +140dB. For the residue vector |
---|
587 | to reach full scale if the floor is nailed at 0dB, it must be able to |
---|
588 | represent -140dB to +0dB. Thus, in order to handle full range |
---|
589 | dynamics, a residue vector may span -140dB to +140dB entirely within |
---|
590 | spec. A 280dB range is approximately 48 bits with sign; thus the |
---|
591 | residue vector must be able to represent a 48 bit range and the dot |
---|
592 | product must be able to handle an effective 48 bit times 24 bit |
---|
593 | multiplication. This range may be achieved using large (64 bit or |
---|
594 | larger) integers, or implementing a movable binary point |
---|
595 | representation.</para> |
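<para>
The bit widths quoted above follow from roughly 6.02dB of range per bit (20*log10(2)). The short calculation below reproduces them; the helper name is hypothetical, and rounding up to whole bits is an assumption about how the approximate figures were derived.</para>

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of range per bit


def bits_for_range(db, signed):
    """Whole bits needed to span a range of db decibels, plus an optional sign bit."""
    bits = math.ceil(db / DB_PER_BIT)
    return bits + 1 if signed else bits
```

<para>
Under these assumptions, 140dB unsigned needs 24 bits, 120dB with sign needs 21 bits, and 280dB with sign needs 48 bits, matching the figures in the text.</para>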
---|
596 | |
---|
597 | </section> |
---|
598 | |
---|
599 | <section><title>inverse monolithic transform (MDCT)</title> |
---|
600 | |
---|
601 | <para> |
---|
602 | The audio spectrum is converted back into time domain PCM audio via an |
---|
603 | inverse Modified Discrete Cosine Transform (MDCT). A detailed |
---|
604 | description of the MDCT is available in the paper <ulink |
---|
605 | url="http://www.iocon.com/resource/docs/ps/eusipco_corrected.ps"><citetitle pubwork="article">The use of multirate filter banks for coding of high quality digital |
---|
606 | audio</citetitle></ulink>, by T. Sporer, K. Brandenburg and B. Edler.</para> |
---|
607 | |
---|
608 | <para> |
---|
609 | Note that the PCM produced directly from the MDCT is not yet finished |
---|
610 | audio; it must be lapped with surrounding frames using an appropriate |
---|
611 | window (such as the Vorbis window) before the MDCT can be considered |
---|
612 | orthogonal.</para> |
---|
613 | |
---|
614 | </section> |
---|
615 | |
---|
616 | <section><title>overlap/add data</title> |
---|
617 | <para> |
---|
618 | Windowed MDCT output is overlapped and added with the right hand data |
---|
619 | of the previous window such that the 3/4 point of the previous window |
---|
620 | is aligned with the 1/4 point of the current window (as illustrated in |
---|
621 | the window overlap diagram). At this point, the audio data between the |
---|
622 | center of the previous frame and the center of the current frame is |
---|
623 | now finished and ready to be returned. </para> |
---|
624 | </section> |
---|
625 | |
---|
626 | <section><title>cache right hand data</title> |
---|
627 | <para> |
---|
628 | The decoder must cache the right hand portion of the current frame to |
---|
629 | be lapped with the left hand portion of the next frame. |
---|
630 | </para> |
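<para>
The overlap/add and caching steps can be sketched together for the simplest case. The class below assumes every frame has the same block size and that frames arrive already windowed; it does not handle mixed long and short windows. The class and attribute names are hypothetical.</para>

```python
class Lapper:
    """Overlap/add state for windowed frames that all share one block size."""

    def __init__(self):
        self.saved_right = None  # right-hand half of the previous frame

    def push(self, windowed_frame):
        """Accept one windowed frame and return the finished samples, if any."""
        half = len(windowed_frame) // 2
        left, right = windowed_frame[:half], windowed_frame[half:]
        previous = self.saved_right
        # Cache the right-hand half for lapping with the next frame.
        self.saved_right = right
        if previous is None:
            return []  # the first frame only primes the lapper
        # Add the cached right-hand half to the new left-hand half.
        return [p + l for p, l in zip(previous, left)]
```

<para>
The first call returns no samples, and each later call returns half a block, consistent with the first frame only priming the decoder.</para>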
---|
631 | </section> |
---|
632 | |
---|
633 | <section><title>return finished audio data</title> |
---|
634 | |
---|
635 | <para> |
---|
636 | The overlapped portion produced from overlapping the previous and |
---|
637 | current frame data is finished data to be returned by the decoder. |
---|
638 | This data spans from the center of the previous window to the center |
---|
639 | of the current window. In the case of same-sized windows, the amount |
---|
640 | of data to return is one-half block consisting solely of the |
---|
641 | overlapped portions. When overlapping a short and long window, much of |
---|
642 | the returned range is not actually overlap. This does not damage |
---|
643 | transform orthogonality. Pay attention, however, to returning the |
---|
644 | correct data range; the amount of data to be returned is: |
---|
645 | |
---|
646 | <programlisting> |
---|
647 | window_blocksize(previous_window)/4+window_blocksize(current_window)/4 |
---|
648 | </programlisting> |
---|
649 | |
---|
650 | from the center of the previous window to the center of the current |
---|
651 | window.</para> |
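<para>
As a worked instance of the expression above, the helper below computes the number of samples per channel returned for a pair of consecutive windows. The function name is hypothetical, and the example block sizes are arbitrary legal values.</para>

```python
def returned_samples(previous_blocksize, current_blocksize):
    """Samples per channel returned after lapping the previous and current frame."""
    return previous_blocksize // 4 + current_blocksize // 4
```

<para>
Two 2048-sample windows yield 1024 samples (one-half block), while a 256-sample window followed by a 2048-sample window yields 64 + 512 = 576 samples.</para>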
---|
652 | |
---|
653 | <para> |
---|
654 | Data is not returned from the first frame; it must be used to 'prime' |
---|
655 | the decode engine. The encoder accounts for this priming when |
---|
656 | calculating PCM offsets; after the first frame, the proper PCM output |
---|
657 | offset is '0' (as no data has been returned yet).</para> |
---|
658 | </section> |
---|
659 | </section> |
---|
660 | |
---|
661 | </section> |
---|
662 | |
---|
663 | </section> |
---|
664 | <!-- end Vorbis I specification introduction and description --> |
---|
665 | |
---|