Symphonia Play Case Study

Abstract

Symphonia Play, part of the Rust Symphonia project, is a simple audio player used to test audio sample processing. Studying it helps us understand the patterns and data structures needed to interface with audio formats that Rodio does not support directly, and in turn how to design a library that can work with Rodio to output audio.

Requirements

Currently, we aim to process audio samples encoded with a variety of codecs, defined through an m3u8 playlist file. I have chosen the Rodio Rust crate to provide cross-platform audio output.

We will likely need to use Symphonia to process some of the audio samples provided by the HLS server. This is necessary because Rodio supports only MP3 and a handful of other formats, which is insufficient for an m3u8 playlist, where a wider range of codecs must be handled. An audio processing library like Symphonia is already equipped for this decoding; we don't need to reinvent that wheel.

Symphonia

Demuxing vs Decoding

The process of reading a media file and gradually extracting the packets of its individual tracks is known as "demultiplexing" or "demuxing". The opposite process, converting a track's "packets" back into codec sample data, is referred to as "decoding".

Symphonia distinguishes between these two processes, but on closer analysis you'll see that they operate in conjunction (refer to the decode loop). One process extracts metadata, such as the codec type or artist information; the other decodes packet data back into a byte sequence that output devices can interpret.

MediaSource

Symphonia uses the concept of a MediaSource, a seekable, readable stream that can be used to create a MediaSourceStream instance. From this MediaSourceStream, we can use the default probe provided by Symphonia to detect the underlying format of the MediaSource, yielding a reader, or more specifically an implementation of the FormatReader trait, over this source.

FormatReader

The FormatReader trait in Symphonia is a key component in handling media formats. It is responsible for managing the multiple tracks that may make up a MediaSource. Each track has a codec associated with it, which is a means of encoding and decoding a digital data stream or signal.

Each track in the FormatReader is also rich in metadata, data that describes other data. The metadata fields can include a wide array of information about the media, such as its duration, bit rate, sample rate, and more, depending on the specifics of the codec and format in use.

The FormatReader trait in Symphonia provides a way to access and process the individual tracks of a media file, each with its own codec and rich metadata, and to derive decoders for handling the digital data streams provided by the media file. This iterative process is termed the "decode loop".

Decode Loop

The decode loop is where all the magic happens and where we are able to process the information described by the FormatReader.

The FormatReader uses the codec associated with each track to derive a decoder, a component that converts the encoded data back into a format that can be understood and processed by an underlying output device. This process involves parsing the encoded data, often involving complex computational algorithms to accurately reproduce the original data from the encoded form.

Pseudo-code for the audio processing decode loop may look something like this:

// Pseudocode for the audio processing decode loop
while there are packets in the media format:
    acquire the next packet from the format reader
    consume any new metadata
    if the packet does not belong to the selected track:
        continue to the next packet
    decode the packet into audio samples using its associated decoder
    write the decoded samples to the audio output device
end while

Processing AudioBuffer

In Symphonia, an AudioBuffer is a data structure that pairs the underlying sample data with a SignalSpec, which tells an audio device how to interpret that data.

When you reach the point of processing the buffer in the decode loop, you have two options. You can manually process samples within the AudioBufferRef returned by the decoder, like so:

use symphonia_core::audio::{AudioBufferRef, Signal};

let decoded = decoder.decode(&packet).unwrap();

match decoded {
    AudioBufferRef::F32(buf) => {
        for &sample in buf.chan(0) {
            // Do something with `sample`.
        }
    }
    _ => {
        // Repeat for the different sample formats.
        unimplemented!()
    }
}

Otherwise, there is a structure called SampleBuffer, whose responsibility is to allocate a block of memory into which AudioBuffer(s) can be copied. Methods provided by the structure's interface allow you to manipulate the allocated memory in a structured manner.

In the example below, the copy_interleaved_ref() method is used to write the newly decoded information (the AudioBuffer) into the SampleBuffer.

use symphonia_core::audio::SampleBuffer;

// Create a sample buffer that matches the parameters of the decoded audio buffer.
let mut sample_buf = SampleBuffer::<f32>::new(decoded.capacity() as u64, *decoded.spec());

// Copy the contents of the decoded audio buffer into the sample buffer whilst performing
// any required conversions.
sample_buf.copy_interleaved_ref(decoded);

// The interleaved f32 samples can be accessed as follows.
let samples = sample_buf.samples();

Symphonia Play

Now, let's consider the symphonia-play binary. There are many trivial aspects, such as the definition of command-line arguments, that we aren't particularly concerned with. The key element we care about is the definition of an AudioOutput trait, which declares a write() method and a flush() method. The write() method is responsible for taking an AudioBuffer and writing it to the output device.

Looking at the cpal module specifically, we notice that an AudioOutputSample trait is defined and implemented for the f32, i16, and u16 data types, extending the cross-platform support cpal already provides for these core data types.

Next, let's inspect the CpalAudioOutputImpl struct, a concrete structure that implements the AudioOutput trait.

struct CpalAudioOutputImpl<T: AudioOutputSample> {
    ring_buf_producer: rb::Producer<T>,
    sample_buf: SampleBuffer<T>,
    stream: cpal::Stream,
    resampler: Option<Resampler<T>>,
}

The two fields within the structure we will focus on are the sample buffer and the ring buffer. The cpal::Stream represents the audio output to the device, which is out of scope for this case study.

The Ring Buffer

The ring buffer's usage is noteworthy: it offers a FIFO interface that maintains the order of audio samples while letting them be produced at one end and consumed at the other efficiently. Within the CpalAudioOutputImpl constructor, a closure is defined in which the rb::Consumer end of the ring buffer performs a blocking read for audio sample data and writes it to the device output, the cpal::Stream.

Let's turn our attention to the write() method, where we feed the other end of the ring buffer. First, we overwrite the sample buffer with the newly decoded samples using the provided copy_interleaved_ref() method, as illustrated in this example:

// Overwrite the sample buffer with the decoded AudioBufferRef.
self.sample_buf.copy_interleaved_ref(decoded);
let mut samples = self.sample_buf.samples();

Once the underlying data within the sample buffer has been written, we write it to the ring buffer in a blocking fashion.

// Write all samples to the ring buffer.
while let Some(written) = self.ring_buf_producer.write_blocking(samples) {
    samples = &samples[written..];
}

It seems like the allocated size of the ring buffer is the main mechanism defining the audio buffering size: back pressure is applied to the blocking write calls based on the speed at which data can be consumed from the ring buffer.

Runtime

In the main.rs Rust file, two functions are defined: play() and play_track(). While these two functions have similar signatures, their purposes differ slightly.

The play_track() function reads from the FormatReader, selects a track with a suitable codec, and decodes the track's packets within a decode loop.

// Abbreviated decode loop
// Decode all packets, ignoring all decode errors.
let result = loop {
    let packet = match reader.next_packet() {
        Ok(packet) => packet,
        Err(err) => break Err(err),
    };

    // If the packet does not belong to the selected track, skip over it.
    if packet.track_id() != track_id {
        continue;
    }

    // Decode the packet into audio samples.
    match decoder.decode(&packet) {
        Ok(_decoded) => continue,
        Err(Error::DecodeError(err)) => warn!("decode error: {}", err),
        Err(err) => break Err(err),
    }
};

The decode loop is expected to terminate with an error. On the "happy path", this is an io::Error of kind std::io::ErrorKind::UnexpectedEof, which signals the end of file (EOF) in the MediaSource. This special case is handled by disregarding the error.

Conclusion

Examining the symphonia-play binary has given us a deeper understanding of how to build an audio player for our specific needs. Key insights include the functionality of the Symphonia FormatReader, which is responsible for demuxing media formats, and the structure and function of the decode loop, which converts the packets provided by the FormatReader into audio samples.