Latency
Latency, or delay, causes two problems: echo and talker overlap. Echo is caused by reflections of the speaker's voice from the far-end telephone equipment back to the speaker. It becomes a significant problem when delay is greater than 50 milliseconds, so equipment providers must implement echo cancellation. Talker overlap becomes significant if one-way delay is greater than 250 milliseconds, so every effort must be made to minimize delay. The sources of delay in a VoP implementation include the following:
Accumulation Delay
This delay is caused by the need to collect a frame of voice samples to be processed by the voice coder. It varies from a single sample time (125 microseconds at the standard 8 kHz telephony sampling rate) to many milliseconds. Standard voice coders (and their frame times) include the following:
- G.728 low-delay (LD)–code excited linear prediction (CELP)—2.5 milliseconds
- G.729 a, b, e conjugate structure (CS)–ACELP—10 milliseconds
Algorithmic Delay (Sometimes Called Look-Ahead Delay)
This delay is caused by the characteristics of the specific voice encoding algorithm. An example of algorithmic delay is the following:
- G.729 a, b, e CS–ACELP—5 milliseconds
Processing Delay
This delay is caused by the actual process of encoding the samples and collecting the encoded frames into a packet for transmission. The encoding delay is a function of both the processor execution time and the type of algorithm used. Often, multiple voice-coder frames will be collected in a single packet to reduce overhead. For example, three frames of G.729 code words, equaling 30 milliseconds of speech, may be collected and packed into a single packet. This process of packing several small voice-coder frames into a single larger packet is called concatenation.
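As a rough worked example of how the sender-side delays described above add up, the sketch below sums the accumulation delay for all frames in one packet with the look-ahead delay. The function name and the simple additive model are assumptions made for illustration. Processing time is left as a parameter because it depends on the processor and is not given here.

```python
def coder_side_delay_ms(frame_ms, frames_per_packet, lookahead_ms, processing_ms=0.0):
    """Estimate the delay added before a packet leaves the sender.

    Hypothetical helper: assumes the delays simply add, which ignores
    any overlap between accumulation and processing.
    """
    accumulation_ms = frame_ms * frames_per_packet  # time to collect every frame in the packet
    return accumulation_ms + lookahead_ms + processing_ms


# G.729 figures from the text: 10 ms frames, 5 ms look-ahead, 3 frames per packet.
g729_delay = coder_side_delay_ms(frame_ms=10.0, frames_per_packet=3, lookahead_ms=5.0)
print(g729_delay)  # 35.0
```

Under these assumptions, concatenating three G.729 frames already uses 35 milliseconds of the delay budget before the packet reaches the network.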
Network Delay
Network delay is a function of the processing that occurs as packets are sent across a network. This delay is caused by the physical medium and the protocols used to transmit the voice data and by the buffers used to remove packet jitter on the receive side. The jitter buffers add delay in order to smooth out the variation in packet arrival times. This delay can be a significant part of the overall delay since it can be as high as 70–100 milliseconds.
Polling Delay
Cable-based IP telephony creates an additional latency that other packet networks do not because of the way head-end systems collect packets from callers. The head end polls the NI at each customer location. Because the head end doesn't maintain a continuous connection with each NI, there is a transmission delay while voice packets wait for the next poll. Therefore, it is important that cable-based IP telephony equipment minimize this delay by anticipating when the next poll will arrive—a process called grant synchronization—so that the packets are queued up and ready to go.
Echo
Echo is present even in a conventional POTS network. However, it is acceptable because delay is less than 50 milliseconds and the echo is masked by the normal side tone that every telephone generates. Echo becomes a problem in VoP networks because the delay is almost always greater than 50 milliseconds. Thus, echo-cancellation techniques must be used. The International Telecommunication Union (ITU) standards G.165 and G.168 define performance requirements for echo cancellers.
Echo is generated toward the packet network from the telephone network. The echo canceller compares the voice data received from the packet network with voice data being transmitted to the packet network. The echo from the telephone network is removed by a digital filter on the transmit path into the packet network.
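The comparison described above can be sketched with an adaptive filter. The example below uses a normalized least-mean-squares (NLMS) filter, which is one common choice and an assumption here, since the text does not name a specific algorithm. The function name, tap count, step size, and the synthetic echo path are all hypothetical, and the sketch omits the double-talk detection that a practical canceller needs.

```python
import random


def nlms_echo_cancel(received, transmit, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the echo of `received` from `transmit`.

    received: samples arriving from the packet network (the echo source)
    transmit: samples headed into the packet network (contain the echo)
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent received samples, newest first
    cleaned = []
    for x, d in zip(received, transmit):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        err = d - echo_estimate  # residual that is sent into the packet network
        power = sum(h * h for h in history) + eps
        weights = [w + mu * err * h / power for w, h in zip(weights, history)]
        cleaned.append(err)
    return cleaned


# Synthetic test signal: noise from the packet side, echoed by a made-up reflection path.
rng = random.Random(0)
received = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.0, 0.5, 0.3, -0.1]  # hypothetical impulse response of the reflection
transmit = [
    sum(echo_path[k] * received[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(received))
]

residual = nlms_echo_cancel(received, transmit)
early_energy = sum(e * e for e in residual[:200])
late_energy = sum(e * e for e in residual[-200:])
print(late_energy < early_energy)
```

In this sketch the residual echo energy falls as the filter adapts, because the transmit signal contains only echo. With real speech on the transmit path, adaptation would have to pause while the local talker is active.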
Jitter
The delay problem is compounded by the need to remove jitter—a variable interpacket timing caused by the fact that packets do not all cross the network at the same speed. Removing jitter requires collecting packets and holding them long enough to allow the slowest packets to arrive and be played in the correct sequence. This causes significant delay. The conflicting goals of minimizing delay and removing jitter have led to various schemes aimed at optimizing the size of the jitter buffer to minimize its impact on latency.
A common approach in cable-based IP telephony is to count the number of packets that arrive late and create a ratio of these packets to the number of packets that are successfully processed. This ratio is then used to adjust the jitter buffer to target a specific late-packet ratio.
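The late-packet-ratio approach can be sketched as follows. The class name, the window of packets per adjustment, the step size, the depth limits, and the rule of shrinking only when the ratio falls below half the target are all illustrative assumptions, not values taken from any particular product.

```python
class LateRatioJitterBuffer:
    """Adjusts jitter-buffer depth to steer the late-packet ratio toward a target."""

    def __init__(self, target_ratio=0.05, depth_ms=40, min_ms=10, max_ms=100,
                 step_ms=5, window=100):
        self.target_ratio = target_ratio
        self.depth_ms = depth_ms
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.step_ms = step_ms
        self.window = window  # packets counted between adjustments
        self.late = 0
        self.on_time = 0

    def record(self, arrived_late):
        """Count one packet and, once per window, adjust the buffer depth."""
        if arrived_late:
            self.late += 1
        else:
            self.on_time += 1
        if self.late + self.on_time >= self.window:
            # Ratio of late packets to packets that were processed on time.
            ratio = self.late / max(self.on_time, 1)
            if ratio > self.target_ratio:
                # Too many late packets: hold packets longer.
                self.depth_ms = min(self.max_ms, self.depth_ms + self.step_ms)
            elif ratio < self.target_ratio / 2:
                # Comfortably under target: trade buffer depth for lower latency.
                self.depth_ms = max(self.min_ms, self.depth_ms - self.step_ms)
            self.late = 0
            self.on_time = 0
        return self.depth_ms


buf = LateRatioJitterBuffer(target_ratio=0.05, depth_ms=40, step_ms=5, window=10)
for late in [True] * 3 + [False] * 7:
    depth = buf.record(late)
print(depth)  # 45
```

The gap between the grow and shrink thresholds is a design choice meant to keep the depth from oscillating when the measured ratio sits near the target.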
Lost Packets
In today's IP networks, voice frames are treated exactly like data. Under peak loads and congestion, voice frames will be dropped at the same rate as data frames. The data frames, however, are not time sensitive, and dropped packets can be corrected through retransmission. Lost voice packets cannot be handled in the same manner. Some techniques used by VoP software to address this problem include the following:
- Interpolation: Interpolate for lost speech packets by replaying the last packet received during the interval when the lost packet was supposed to be played out. This scheme is a simple method that fills the time between noncontiguous speech frames. It works well when the incidence of lost frames is infrequent. It does not work very well if there are a number of consecutive lost packets or a burst of lost packets.
- Redundancy: Send redundant information at the expense of bandwidth utilization. The basic approach replicates and sends the nth packet along with the (n+1)th packet. This method has the advantage of being able to correct for the lost packet exactly. However, this approach uses more bandwidth and also creates greater delay.
- Voice Coder: A hybrid approach uses a lower-bandwidth voice coder to provide redundant information carried along in the (n+1)th packet. This reduces the problem of bandwidth consumption but does not solve the problem of delay.
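The first two techniques in the list above can be sketched in a few lines. The function names and the packet representation (a `None` entry for a lost frame, and a dictionary mapping sequence number to a payload plus a copy of the previous payload) are assumptions chosen for illustration.

```python
def conceal_by_repetition(frames):
    """Replace each missing frame (None) with the last frame that was played."""
    played = []
    last = None
    for frame in frames:
        if frame is None:
            frame = last  # stays None if nothing has arrived yet
        played.append(frame)
        last = frame
    return played


def recover_with_redundancy(packets, count):
    """Rebuild `count` frames from packets that each carry the previous payload.

    packets: dict of sequence number -> (payload, copy_of_previous_payload)
    """
    frames = []
    for seq in range(count):
        if seq in packets:
            frames.append(packets[seq][0])
        elif seq + 1 in packets:
            frames.append(packets[seq + 1][1])  # exact copy carried by the next packet
        else:
            frames.append(None)  # both copies lost; another scheme must fill the gap
    return frames


print(conceal_by_repetition(["a", None, "c", None, None]))
# ['a', 'a', 'c', 'c', 'c']

arrived = {0: ("a", None), 2: ("c", "b"), 3: ("d", "c")}  # packet 1 was lost
print(recover_with_redundancy(arrived, 4))
# ['a', 'b', 'c', 'd']
```

The redundancy sketch also shows where the extra delay comes from: frame n cannot be recovered until packet n+1 has arrived.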


