Professional audio over digital networks

Technological issues

End-to-end delay

With analogue audio, the only delays experienced are the time taken for the electrical signal to travel along a cable and for acoustic signals to travel through the air. Where a delay was required in pre-digital times, untidy arrangements involving magnetic tape trailing across the floor had to be used.

Binary digits, however, are easy to store; most PCs have enough internal memory to store several hours' worth of AES3 format audio. Often, particularly in data communications networks, a "store and forward" architecture is easier to arrange than one in which the digits are passed on as soon as they arrive.

In many networks, audio samples have to be collected together into "packets" rather than being sent individually; this introduces an additional delay as described below.

Delays through a network

The delay between an audio sample being received at the transmitting network interface and that same audio sample being output by the receiving network interface has three components: packetisation delay, transit delay, and buffering delay.

Packetisation delay is the time between the first sample being received and the packet being ready for transmission; in the cartoon below, it's the time between the first person getting on the bus and the bus being ready to leave.

If samples arrive at regular intervals, each sample arriving instantaneously, and n samples are packed in a packet, then the packetisation delay is n-1 sample times; in practice a sample usually takes 1 sample time to arrive so the packetisation delay is n sample times.
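The arithmetic above can be sketched as follows. This is only an illustration; the function name and the `instantaneous_arrival` flag are hypothetical, and the 48 kHz, six-sample figures are taken from the examples used later in this article.

```python
def packetisation_delay_s(samples_per_packet, sample_rate_hz,
                          instantaneous_arrival=False):
    """Time from the first sample starting to arrive until the packet is full.

    With instantaneous arrival the delay is n-1 sample times; when each
    sample takes one sample time to arrive it is n sample times.
    """
    if instantaneous_arrival:
        sample_times = samples_per_packet - 1
    else:
        sample_times = samples_per_packet
    return sample_times / sample_rate_hz


# Six samples per packet at 48 kHz: six sample times, about 125 microseconds.
delay_us = packetisation_delay_s(6, 48_000) * 1e6
print(f"packetisation delay: {delay_us:.1f} us")
```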

Transit delay is the time from the packet being ready for transmission until it is ready to be unpacked at the receiving end (or from the bus being ready to leave until the bus arrives at its destination). It is made up of the time between the packet starting to arrive at each switching or routing element in the network and the point at which that element can start to transmit it, the time spent waiting to be transmitted on each link, and the propagation delay along each link (time queuing at road junctions etc, waiting for traffic lights, and actually moving).

The propagation delay is governed by the length of the link and the speed of the signal along the link, which is about 1 metre per 5 nanoseconds or 200 km per millisecond; note that the length of the cable may be considerably more than the straight-line distance between its ends.
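As a rough worked example, the propagation delay can be estimated from the cable length using the figure of about 5 nanoseconds per metre quoted above. The function name is hypothetical, and the input is the length of the cable, not the straight-line distance.

```python
NS_PER_METRE = 5.0  # figure quoted in the text: about 1 metre per 5 ns


def propagation_delay_ms(cable_length_km):
    """Approximate one-way propagation delay along a cable, in milliseconds."""
    delay_ns = cable_length_km * 1000 * NS_PER_METRE
    return delay_ns / 1e6


# 200 km of cable gives roughly one millisecond of delay.
print(propagation_delay_ms(200))
```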

Buffering delay is the time between the packet arriving at the destination and the first sample being output.

[Figure: unpacking samples from cell]

The receiving device needs to output samples at regular intervals, and needs to be sure that each sample will have arrived by the time it is needed. It therefore needs to keep some data in hand in case a packet arrives late (there needs to be either a queue of buses waiting to unload, or a queue of people who have got off the buses waiting to leave the bus station); the Packet Delay Variation is a measure of how late a packet is liable to arrive, and defines the minimum buffering delay (i.e. the minimum FIFO size) that will ensure reliable transfer of the audio samples.
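The relationship between Packet Delay Variation and FIFO size can be sketched as below. This assumes the variation is expressed as the worst-case lateness of a packet in microseconds, which is an assumption made for the example and not a definition taken from any particular network standard; the function name is also hypothetical.

```python
def min_fifo_samples(pdv_us, sample_rate_hz):
    """Smallest whole number of samples that covers the worst-case lateness.

    pdv_us is the assumed worst-case lateness of a packet, in microseconds.
    Integer ceiling division avoids floating-point rounding surprises.
    """
    return -(-pdv_us * sample_rate_hz // 1_000_000)


# A packet that may be up to 1 ms late at 48 kHz needs 48 samples in hand.
print(min_fifo_samples(1000, 48_000))
```

A real receiver would normally add some margin on top of this minimum.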

Network features that minimise delays

To keep packetisation delay small, the packet size needs to be small, particularly if only one or two audio channels are being carried. In a typical case when conveying an AES3 stream, a 48-byte packet would carry six audio samples per channel, resulting in a packetisation delay of six sample times. Ideally, each packet should carry one sample per channel.
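The six-samples figure can be reproduced with a short calculation. The sketch assumes that each AES3 sample occupies 4 bytes (one 32-bit subframe) and that the stream has two channels; these values are consistent with the figure in the text but are stated here as assumptions, and the function name is hypothetical.

```python
def samples_per_channel(payload_bytes, channels, bytes_per_sample):
    """Number of complete samples per channel that fit in one packet payload."""
    return payload_bytes // (channels * bytes_per_sample)


# Assumed: 48-byte payload, 2 channels, 4 bytes per sample.
n = samples_per_channel(48, channels=2, bytes_per_sample=4)
print(n, "samples per channel, so", n, "sample times of packetisation delay")
```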

The time spent waiting to be output on each link can be kept small by reserving periodic "slots" on the output, and by giving the audio priority over other traffic. In a conventional packet network that allows large data packets to be conveyed in one piece, an audio packet that arrives at an output just after a large data packet has begun to be output will have to wait until the end of that packet.
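To give a feel for the size of this wait, the sketch below works out how long a link is occupied by one large packet. The 1500-byte packet and the 100 Mb/s link rate are example values chosen for illustration, not figures from the text, and framing overheads are ignored.

```python
def serialisation_time_us(packet_bytes, link_rate_bps):
    """Time for a packet of the given size to be output on a link, in microseconds."""
    return packet_bytes * 8 * 1e6 / link_rate_bps


# Assumed example: a 1500-byte data packet on a 100 Mb/s link.
wait_us = serialisation_time_us(1500, 100_000_000)
print(f"worst-case wait behind one large packet: about {wait_us:.0f} us")
```

In this example the worst-case wait on a single link is of the same order as the packetisation delay of six samples at 48 kHz, and it is incurred again at each congested output along the route.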

Packet routing within the switch should be a simple, fast process that can begin as soon as the first few bytes of the packet have been received.

Buffering delay is minimised by a service which delivers packets at regular intervals, rather than allowing them to become bunched together (as happens with buses when the roads are congested).

Synchronisation

Synchronisation is a problem that simply does not arise with analogue audio, but requires careful engineering in digital systems.

At the point where a source is digitised, a clock signal is required; every time the clock ticks, the source is sampled and a binary representation of its instantaneous value (relative to a reference voltage) is output. The accuracy of the resulting stream of numbers depends on the quality of the analogue-to-digital converter circuit and the stability of the clock signal and the reference voltage. If a rising signal is sampled too early, for example, or the reference voltage is too high, the number will be smaller than the correct value.

A clock signal is also required where the analogue signal is reconstructed from the stream of values at its destination; every time the clock ticks, another value is consumed. Again, the quality of the output depends on the stability of the clock signal and the reference voltage.

The effect of the absolute value of the clock frequency on the audio signal is negligible; the error is unlikely to be more than 100ppm, which will shift the audio frequencies by less than 1/500 of a semitone. However, jitter in the clock frequency (whereby the time from one clock tick to the next is not always the same) can introduce noticeable distortion.
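The semitone figure can be checked with a short calculation, using the fact that a semitone in equal temperament is a frequency ratio of 2 to the power 1/12. The function name is hypothetical.

```python
import math


def shift_in_semitones(ppm_error):
    """Pitch shift, in equal-tempered semitones, caused by a clock frequency error."""
    return 12 * math.log2(1 + ppm_error * 1e-6)


shift = shift_in_semitones(100)
print(f"100 ppm shifts the pitch by about 1/{1 / shift:.0f} of a semitone")
```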

Similarly, if the reference voltage is stable but slightly too high or too low the output signal will not be at quite the right level but will not be distorted, whereas AC components in the reference can find their way into the output signal.

In between these two points, all the system has to do is to convey the numbers correctly.

The supply voltage to the digital circuitry is much less critical, and its clock does not need to be stable: sometimes "spread spectrum" clocks are used, into which jitter is deliberately introduced to improve EMC performance. On most networks the samples will be batched together into groups for transmission, and the rate of transmission will not be related to any of the clocks that drive the digital circuitry. For instance, if the transmission medium is 100Mb/s Ethernet there is no relationship between the audio clock and the 125MHz clock that drives the network interface.

End-to-end synchronisation

Although the exact frequency of the reference clock at the receiving end is not critical to the quality of the reconstructed audio signal, it is important for another reason.

At the destination, samples will arrive at a rate which in the long term matches the rate at which they are produced by the source, though in most networks there will be short term fluctuations because the samples are grouped into packets (so arrive in bunches) and some packets take longer than others to traverse the network (so the bunches arrive at irregular intervals). The destination interface thus needs a FIFO buffer in which the incoming samples are stored, and from which one sample is taken at each tick of the destination sample clock. At the start of transmission, the first sample should be taken when the FIFO is half full.

If the destination clock has exactly the same frequency as the source clock (for example, if they are both locked to GPS), then the FIFO will continue indefinitely to be approximately half full. However, if it has a higher frequency than the source clock then samples will be taken from the FIFO more quickly than they arrive and eventually the FIFO will be empty when a sample is required. Similarly, if it has a lower frequency the FIFO will gradually fill and eventually some samples will be lost.

For example, if the destination frequency is 10ppm less than the source frequency, the amount of audio data in the FIFO increases by 3 msec every 5 minutes; if 10ppm more, it decreases at the same rate.
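The drift figure follows directly from the frequency offset, as the sketch below shows. The function names are hypothetical, and the sign convention (positive offset meaning the destination clock is slower than the source) is chosen for the example.

```python
def fifo_drift_ms(ppm_offset, elapsed_s):
    """Change in the amount of audio held in the FIFO, in milliseconds.

    A positive ppm_offset means the destination clock is slower than the
    source, so the FIFO fills; a negative one means it drains.
    """
    return ppm_offset * 1e-6 * elapsed_s * 1e3


def seconds_until_limit(headroom_ms, ppm_offset):
    """Time until the FIFO overflows or empties, given the headroom available."""
    return (headroom_ms / 1e3) / (abs(ppm_offset) * 1e-6)


print(fifo_drift_ms(10, 5 * 60))    # about 3 ms gained in 5 minutes
print(seconds_until_limit(3, 10))   # about 300 s to use up 3 ms of headroom
```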

Synchronisation strategies

One solution to this problem is to control the output sample frequency such that the amount of data in the FIFO remains within defined limits; the size of the FIFO and the limit values are chosen according to the characteristics of the network. The control algorithm needs to be able to lock fairly quickly to the incoming data rate when a new connection is set up, while also minimising both high- and low-frequency jitter in the output sample clock.

A software algorithm that can adjust the output bit clock in multiples of approximately 1ppm, with at least two seconds between adjustments, has been found to be highly successful in the case of AES3 digital outputs, keeping the jitter well within the AES3 specification and allowing a good sample clock to be recovered from the AES3 signal for D-to-A conversion. The rate at which the adjustments take place limits the low-frequency jitter to less than 0.3 Hz.
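A control loop of this general kind might look like the sketch below. It is not the algorithm referred to above, whose details are not given here; it simply moves the output clock offset by at most 1 ppm per adjustment towards a target proportional to the FIFO's distance from its centre level. The gain, the centre level of 48 samples, the 2 second interval, and all function names are assumptions made for the illustration.

```python
def next_offset_ppm(fifo_fill, centre, current_ppm,
                    gain_ppm_per_sample=1.0, max_step_ppm=1):
    """Move the clock offset at most max_step_ppm towards a target that is
    proportional to how far the FIFO fill is from its centre level."""
    target = round(gain_ppm_per_sample * (fifo_fill - centre))
    step = max(-max_step_ppm, min(max_step_ppm, target - current_ppm))
    return current_ppm + step


def simulate(source_ppm, adjustments, sample_rate_hz=48_000,
             interval_s=2.0, centre=48.0):
    """Simulate the FIFO fill (in samples) with one adjustment per interval."""
    fill, out_ppm = centre, 0
    for _ in range(adjustments):
        # Net samples gained while the output runs at the current offset.
        fill += (source_ppm - out_ppm) * 1e-6 * sample_rate_hz * interval_s
        out_ppm = next_offset_ppm(fill, centre, out_ppm)
    return fill, out_ppm


fill, out_ppm = simulate(source_ppm=10, adjustments=100)
print(f"output offset {out_ppm} ppm, FIFO holding {fill:.1f} samples")
```

In this simulation the output settles at the source's offset, with the FIFO resting a little above its centre level. A practical design would also need to handle lock-up on a new connection and packet arrival jitter, which are ignored here.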

There are, however, many circumstances in which the output must be synchronised to local equipment such as digital mixing desks. Also, the two channels of an AES3 output may contain data from different sources, which must therefore be synchronised to each other.

The best strategy in this case is to frequency-lock all sample clocks to a global reference such as the Global Positioning System.

Another possibility is to continuously transmit an audio signal from one studio to all the others, and frequency-lock their "house sync" signals to the incoming packet stream; this still requires sample rate conversion when receiving from a "foreign" source.

It must be noted that none of these strategies will cope with "vari-speed" operation, in which the sample clock is deliberately changed by up to 12.5%, for instance to bring a backing track into tune with a singer. It would be very difficult to make the output clock follow the input clock closely enough without introducing unacceptable amounts of jitter, so any vari-speed signals must be sample-rate-converted to a stable clock before transmission.

Error detection and correction

One of the major advantages of digital compared with analogue is that the processing, storage, and transmission stages do not add unwanted content (such as tape hiss) that cannot afterwards be separated from the signal.

When signals are transmitted over cables, errors in the binary values will only occur as a result of faulty equipment, and when transmitting over public networks the error rate depends on how diligent the telecommunications provider is in taking faulty links out of service and repairing them. In practice, errors are extremely rare in most systems, with no errors at all occurring over periods of several months.

When errors do occur, they tend to occur in bursts, so any system for correcting them must cope with bursts of errors as well as single-bit errors.

There are two common methods of error correction:

forward error correction, in which redundant information, from which the true values can be reconstructed, is included in the data; and
error detection, with faulty packets being discarded and lost packets (including those that were discarded) being retransmitted.

Forward error correction introduces encoding delays; the redundant information must be calculated over a block of data significantly longer than the longest error burst, and the recipient may need to receive the whole block before it is able to output the first sample (and must therefore have an extra buffering delay at least as long as the time it takes a whole block to arrive).

Error detection and retransmission increases the buffering delay even more, because a receiver must allow enough time to request and receive a retransmission before it needs to output the audio data.

PCM audio has a number of features that differentiate it from most applications in which error detection and correction are used.

On a call carrying, say, 6 samples per packet of 48kHz audio, one packet is sent every 125 µsec. On a 155 Mbit/s SDH link, an error burst that affected more than one packet would need to be nearly 20,000 bits long, so it is reasonable to concentrate on error bursts that affect only one packet.
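Both figures can be checked with a few lines of arithmetic. The sketch uses the nominal 155 Mbit/s link rate quoted above; the function names are hypothetical.

```python
def packet_interval_us(samples_per_packet, sample_rate_hz):
    """Time between successive packets of one call, in microseconds."""
    return samples_per_packet * 1e6 / sample_rate_hz


def bits_between_packets(samples_per_packet, sample_rate_hz, link_rate_bps):
    """Number of link bit times between the starts of successive packets."""
    return link_rate_bps * samples_per_packet / sample_rate_hz


print(packet_interval_us(6, 48_000))                   # 125 microseconds
print(bits_between_packets(6, 48_000, 155_000_000))    # nearly 20,000 bits
```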

The audio data itself includes a fair amount of redundancy, so if a few samples are lost it is possible to construct plausible values for them from the adjacent samples.

Bit errors in audio data are only really important if they affect the upper bits of the sample; if a sample is discarded and recreated because of an error in one of the lower bits, the replacement sample is likely to be a worse approximation to the true value than the corrupted sample.
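The difference between upper and lower bits can be quantified roughly as follows. The sketch assumes 24-bit two's complement samples, which is an assumption made for the example, and treats the error caused by inverting bit k as a step of 2 to the power k; the function name is hypothetical.

```python
import math


def error_level_dbfs(bit_index, sample_bits=24):
    """Size of the error caused by inverting one bit, relative to full scale.

    Bit 0 is the least significant bit; full scale is taken as
    2 ** (sample_bits - 1).
    """
    return 20 * math.log10(2 ** bit_index / 2 ** (sample_bits - 1))


for bit in (0, 8, 16, 22):
    print(f"bit {bit:2d}: {error_level_dbfs(bit):7.1f} dB relative to full scale")
```

An error in bit 22 is only about 6 dB below full scale, whereas an error in the least significant bit is well over 130 dB down and is unlikely to be audible.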

It follows that the best strategy is probably

to include redundant information (such as a CRC) that will protect the top bits of each sample,
to label the packets so that lost packets can be detected, and
to use interpolation to recreate lost samples.

This is the approach taken in the AES47 standard, and in Flexilink.
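The three elements of this strategy can be illustrated with the sketch below. It is not the AES47 or Flexilink packet format: the use of CRC-32 from Python's `zlib` module, the choice of protecting the top 8 bits of a 24-bit sample, the use of linear interpolation, and all function names are assumptions made for the example.

```python
import zlib


def top_bits_checksum(samples, sample_bits=24, protected_bits=8):
    """CRC-32 over only the top protected_bits of each sample."""
    shift = sample_bits - protected_bits
    mask = (1 << sample_bits) - 1
    top = bytes((s & mask) >> shift for s in samples)
    return zlib.crc32(top)


def interpolate_gap(before, after, count):
    """Linearly interpolate count samples between two known neighbours."""
    step = (after - before) / (count + 1)
    return [round(before + step * (i + 1)) for i in range(count)]


def conceal_lost_packets(received, first_seq, last_seq, samples_per_packet):
    """received maps sequence number -> list of samples; gaps are lost packets.

    Assumes the first and last packets in the range were received.
    """
    out = []
    seq = first_seq
    while seq <= last_seq:
        if seq in received:
            out.extend(received[seq])
            seq += 1
            continue
        gap_end = seq
        while gap_end not in received:   # find the next packet that arrived
            gap_end += 1
        missing = (gap_end - seq) * samples_per_packet
        out.extend(interpolate_gap(out[-1], received[gap_end][0], missing))
        seq = gap_end
    return out


# Packet 1 is missing; its two samples are recreated from the neighbours.
print(conceal_lost_packets({0: [0, 10], 2: [50, 60]}, 0, 2, 2))
```

Because the checksum covers only the upper bits, an error confined to the lower bits leaves it unchanged, so the slightly corrupted sample is kept and not replaced by a poorer interpolated one.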

The receiving equipment needs to log the occurrence of errors, so that remedial measures can be taken before the effects become audible.

--oOo--