Title of Invention

A METHOD AND A DEVICE FOR ENCODING SPEECH SIGNALS

Abstract

The present invention relates to a method for encoding a speech signal, the method comprising:
- dividing a speech signal into a sequence of frames;
- deriving a signal from the speech signal, such that pitch pulses can be identified from the derived signal;
- identifying a position of a last pitch pulse of a current frame and a position of a last pitch pulse of a previous frame with reference to the derived signal; and
- determining an optimal delay parameter (620) value such that a pitch delay contour representative of pitch delay variation within the current frame characterised by said optimal delay parameter (620) value provides a minimised prediction error when the pitch delay contour is used to predict the position of the last pitch pulse in the preceding frame.
The present invention also relates to a device for encoding a speech signal (203).
FIELD OF THE INVENTION
The present invention relates generally to the encoding and decoding
of sound signals in communication systems. More specifically, the present
invention is concerned with a signal modification technique applicable to, in
particular but not exclusively, code-excited linear prediction (CELP) coding.
BACKGROUND OF THE INVENTION
Demand for efficient digital narrow- and wideband speech coding
techniques with a good trade-off between the subjective quality and bit rate is
increasing in various application areas such as teleconferencing, multimedia,
and wireless communications. Until recently, the telephone bandwidth,
limited to the range of 200-3400 Hz, has mainly been used in speech
coding applications. However, wideband speech applications provide
increased intelligibility and naturalness in communication compared to the
conventional telephone bandwidth. A bandwidth in the range 50-7000 Hz has
been found sufficient for delivering a good quality giving an impression of
face-to-face communication. For general audio signals, this bandwidth gives
an acceptable subjective quality, but is still lower than the quality of FM radio
or CD that operate in ranges of 20-16000 Hz and 20-20000 Hz, respectively.

A speech encoder converts a speech signal into a digital bit stream
which is transmitted over a communication channel or stored in a storage
medium. The speech signal is digitized, that is, sampled and quantized,
usually with 16 bits per sample. The speech encoder has the role of representing
these digital samples with a smaller number of bits while maintaining a good
subjective speech quality. The speech decoder or synthesizer operates on the
transmitted or stored bit stream and converts it back to a sound signal.
Code-Excited Linear Prediction (CELP) coding is one of the best
techniques for achieving a good compromise between the subjective quality
and bit rate. This coding technique is a basis of several speech coding
standards both in wireless and wire line applications. In CELP coding, the
sampled speech signal is processed in successive blocks of N samples
usually called frames, where N is a predetermined number corresponding
typically to 10-30 ms. A linear prediction (LP) filter is computed and
transmitted every frame. The computation of the LP filter typically needs a
look ahead, i.e. a 5-10 ms speech segment from the subsequent frame. The
N-sample frame is divided into smaller blocks called subframes. Usually the
number of subframes is three or four resulting in 4-10 ms subframes. In each
subframe, an excitation signal is usually obtained from two components: a
past excitation and an innovative, fixed-codebook excitation. The component
formed from the past excitation is often referred to as the adaptive codebook
or pitch excitation. The parameters characterizing the excitation signal are
coded and transmitted to the decoder, where the reconstructed excitation
signal is used as the input of the LP filter.
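The framing described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames` and its parameters are hypothetical, and trailing samples that do not fill a whole frame are simply dropped, which is an assumption made for brevity.

```python
def split_into_frames(samples, frame_len, subframes_per_frame):
    """Split a sample sequence into frames, each split into equal subframes.

    Returns a list of frames; each frame is a list of subframes.
    Assumption: frame_len is a multiple of subframes_per_frame, and an
    incomplete trailing frame is discarded.
    """
    if frame_len % subframes_per_frame != 0:
        raise ValueError("frame_len must be a multiple of subframes_per_frame")
    sub_len = frame_len // subframes_per_frame
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Cut the N-sample frame into consecutive subframes.
        frames.append([frame[i:i + sub_len] for i in range(0, frame_len, sub_len)])
    return frames
```

For example, at a sampling rate of 12.8 kHz a 20 ms frame holds 256 samples, and four subframes per frame give 64-sample (5 ms) subframes.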
In conventional CELP coding, long term prediction for mapping the
past excitation to the present is usually performed on a subframe basis. Long
term prediction is characterized by a delay parameter and a pitch gain that are

usually computed, coded and transmitted to the decoder for every subframe.
At low bit rates, these parameters consume a substantial proportion of the
available bit budget. Signal modification techniques [1-7]
[1] W.B. Kleijn, P. Kroon, and D. Nahumi, "The RCELP
speech-coding algorithm," European Transactions on
Telecommunications, Vol. 4, No. 5, pp. 573-582, 1994.
[2] W.B. Kleijn, R.P. Ramachandran, and P. Kroon,
"Interpolation of the pitch-predictor parameters in analysis-by-
synthesis speech coders," IEEE Transactions on Speech and Audio
Processing, Vol. 2, No. 1, pp. 42-54, 1994.
[3] Y. Gao, A. Benyassine, J. Thyssen, H. Su, and E. Shlomot,
"EX-CELP: A speech coding paradigm," IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), Salt Lake City, Utah, U.S.A., pp. 689-692, 7-11 May
2001.
[4] US Patent 5,704,003, "RCELP coder," Lucent Technologies
Inc., (W.B. Kleijn and D. Nahumi), Filing Date: 19 September 1995.
[5] European Patent Application 0 602 826 A2, "Time shifting
for analysis-by-synthesis coding," AT&T Corp., (B. Kleijn), Filing
Date: 1 December 1993.
[6] Patent Application WO 00/11653, "Speech encoder with
continuous warping combined with long term prediction," Conexant
Systems Inc., (Y. Gao), Filing Date: 24 August 1999.
[7] Patent Application WO 00/11654, "Speech encoder
adaptively applying pitch preprocessing with continuous warping,"
Conexant Systems Inc., (H. Su and Y. Gao), Filing Date: 24 August
1999.
improve the performance of long term prediction at low bit rates by adjusting
the signal to be coded. This is done by adapting the evolution of the pitch
cycles in the speech signal to fit the long term prediction delay, so that
only one delay parameter needs to be transmitted per frame. Signal modification is based on
the premise that it is possible to render the difference between the modified
speech signal and the original speech signal inaudible. The CELP coders
utilizing signal modification are often referred to as generalized analysis-by-
synthesis or relaxed CELP (RCELP) coders.
Signal modification techniques adjust the pitch of the signal to a
predetermined delay contour. Long term prediction then maps the past
excitation signal to the present subframe using this delay contour and scaling
by a gain parameter. The delay contour is obtained straightforwardly by
interpolating between two open-loop pitch estimates, the first obtained in the
previous frame and the second in the current frame. Interpolation gives a
delay value for every time instant of the frame. After the delay contour is
available, the pitch in the subframe to be coded currently is adjusted to follow
this artificial contour by warping, i.e. changing the time scale of the signal.
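The interpolation step described above can be sketched as follows. This is a minimal illustration assuming plain linear interpolation between the two open-loop pitch estimates; the function name `linear_delay_contour` and its arguments are hypothetical and not taken from any of the cited coders.

```python
def linear_delay_contour(prev_delay, curr_delay, frame_len):
    """Return one delay value per sample of the current frame.

    The contour moves linearly from prev_delay (the estimate of the
    previous frame) and reaches curr_delay at the last sample of the
    current frame.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    step = (curr_delay - prev_delay) / frame_len
    return [prev_delay + step * (i + 1) for i in range(frame_len)]
```

With `prev_delay = 50`, `curr_delay = 54` and a 4-sample frame, the contour is 51, 52, 53, 54, so the delay at the frame end equals the current estimate.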
In discontinuous warping [1, 4, 5],
a signal segment is shifted in time without altering the segment length.
Discontinuous warping requires a procedure for handling the resulting
overlapping or missing signal portions. Continuous warping [2, 3, 6, 7]
either contracts or expands a signal segment. This is done using a time
continuous approximation for the signal segment and re-sampling it to a
desired length with unequal sampling intervals determined based on the delay
contour. For reducing artifacts in these operations, the tolerated change in the
time scale is kept small. Moreover, warping is typically done using the LP
residual signal or the weighted speech signal to reduce the resulting
distortions. The use of these signals instead of the speech signal also
facilitates detection of pitch pulses and low-power regions in between them,
and thus the determination of the signal segments for warping. The actual
modified speech signal is generated by inverse filtering.
After the signal modification is done for the current subframe, the
coding can proceed in any conventional manner except the adaptive
codebook excitation is generated using the predetermined delay contour.
Essentially the same signal modification techniques can be used both in
narrow- and wideband CELP coding.
Signal modification techniques can also be applied in other types of
speech coding methods such as waveform interpolation coding and sinusoidal
coding for instance in accordance with [8].
[8] US Patent 6,223,151, "Method and apparatus for pre-processing
speech signals prior to coding by transform-based speech coders,"
Telefonaktiebolaget LM Ericsson, (W.B. Kleijn and T. Eriksson),
Filing Date: 10 February 1999.
As an example, Navarro (US 5,974,377) may be seen to include examples of
prior art such as conventional methods for CELP coding or long term prediction as
have been described above.

SUMMARY OF THE INVENTION
The present invention relates to a method for determining a long-term-
prediction delay parameter characterizing a long term prediction in a
technique using signal modification for digitally encoding a sound signal,
comprising dividing the sound signal into a series of successive frames,
locating a feature of the sound signal in a previous frame, locating a
corresponding feature of the sound signal in a current frame, and determining
the long-term-prediction delay parameter for the current frame such that the
long term prediction maps the signal feature of the previous frame to the
corresponding signal feature of the current frame.
The subject invention is concerned with a device for determining a
long-term-prediction delay parameter characterizing a long term prediction in
a technique using signal modification for digitally encoding a sound signal,
comprising a divider of the sound signal into a series of successive frames, a
detector of a feature of the sound signal in a previous frame, a detector of a
corresponding feature of the sound signal in a current frame, and a calculator
of the long-term-prediction delay parameter for the current frame, the
calculation of the long-term-prediction delay parameter being made such that
the long term prediction maps the signal feature of the previous frame to the
corresponding signal feature of the current frame.
According to the invention, there is provided a signal modification
method for implementation into a technique for digitally encoding a sound
signal, comprising dividing the sound signal into a series of successive
frames, partitioning each frame of the sound signal into a plurality of signal

segments, and warping at least a part of the signal segments of the frame,
this warping comprising constraining the warped signal segments inside the
frame.
In accordance with the present invention, there is provided a signal
modification device for implementation into a technique for digitally encoding a
sound signal, comprising a first divider of the sound signal into a series of
successive frames, a second divider of each frame of the sound signal into a
plurality of signal segments, and a signal segment warping member supplied
with at least a part of the signal segments of the frame, this warping member
comprising a constrainer of the warped signal segments inside the frame.
The present invention also relates to a method for searching pitch
pulses in a sound signal, comprising dividing the sound signal into a series of
successive frames, dividing each frame into a number of sub-frames,
producing a residual signal by filtering the sound signal through a linear
prediction analysis filter, locating a last pitch pulse of the sound signal of the
previous frame from the residual signal, extracting a pitch pulse prototype of
given length around the position of the last pitch pulse of the previous frame
using the residual signal, and locating pitch pulses in a current frame using
the pitch pulse prototype.
The present invention is also concerned with a device for searching
pitch pulses in a sound signal, comprising a divider of the sound signal into a
series of successive frames, a divider of each frame into a number of
subframes, a linear prediction analysis filter for filtering the sound signal and
thereby producing a residual signal, a detector of a last pitch pulse of the
sound signal of the previous frame in response to the residual signal, an
extractor of a pitch pulse prototype of given length around the position of the

last pitch pulse of the previous frame in response to the residual signal, and a
detector of pitch pulses in a current frame using the pitch pulse prototype.
According to the invention, there is also provided a method for
searching pitch pulses in a sound signal, comprising dividing the sound signal
into a series of successive frames, dividing each frame into a number of
sub-frames, producing a weighted sound signal by processing the sound
signal through a weighting filter wherein the weighted sound signal is
indicative of signal periodicity, locating a last pitch pulse of the sound signal of
the previous frame from the weighted sound signal, extracting a pitch pulse
prototype of given length around the position of the last pitch pulse of the
previous frame using the weighted sound signal, and locating pitch pulses in a
current frame using the pitch pulse prototype.
Also in accordance with the present invention, there is provided a
device for searching pitch pulses in a sound signal, comprising a divider of the
sound signal into a series of successive frames, a divider of each frame into a
number of sub-frames, a weighting filter for processing the sound signal to
produce a weighted sound signal wherein the weighted sound signal is
indicative of signal periodicity, a detector of a last pitch pulse of the sound
signal of the previous frame in response to the weighted sound signal, an
extractor of a pitch pulse prototype of given length around the position of the
last pitch pulse of the previous frame in response to the weighted sound
signal, and a detector of pitch pulses in a current frame using the pitch pulse
prototype.
The present invention further relates to a method for searching pitch
pulses in a sound signal, comprising dividing the sound signal into a series of
successive frames, dividing each frame into a number of subframes,

producing a synthesized weighted sound signal by filtering a synthesized
speech signal produced during a last subframe of a previous frame of the
sound signal through a weighting filter, locating a last pitch pulse of the sound
signal of the previous frame from the synthesized weighted sound signal,
extracting a pitch pulse prototype of given length around the position of the
last pitch pulse of the previous frame using the synthesized weighted sound
signal, and locating pitch pulses in a current frame using the pitch pulse
prototype.
The present invention is further concerned with a device for searching
pitch pulses in a sound signal, comprising a divider of the sound signal into a
series of successive frames, a divider of each frame into a number of
subframes, a weighting filter for filtering a synthesized speech signal
produced during a last subframe of a previous frame of the sound signal and
thereby producing a synthesized weighted sound signal, a detector of a last
pitch pulse of the sound signal of the previous frame in response to the
synthesized weighted sound signal, an extractor of a pitch pulse prototype of
given length around the position of the last pitch pulse of the previous frame in
response to the synthesized weighted sound signal, and a detector of pitch
pulses in a current frame using the pitch pulse prototype.
According to the invention, there is further provided a method for
forming an adaptive codebook excitation during decoding of a sound signal
divided into successive frames and previously encoded by means of a
technique using signal modification for digitally encoding the sound signal,
comprising:
receiving, for each frame, a long-term-prediction delay parameter
characterizing a long term prediction in the digital sound signal encoding
technique;

recovering a delay contour using the long-term-prediction delay
parameter received during a current frame and the long-term-prediction delay
parameter received during a previous frame, wherein the delay contour, with
long term prediction, maps a signal feature of the previous frame to a
corresponding signal feature of the current frame;
forming the adaptive codebook excitation in an adaptive codebook in
response to the delay contour.
Further in accordance with the present invention, there is provided a
device for forming an adaptive codebook excitation during decoding of a
sound signal divided into successive frames and previously encoded by
means of a technique using signal modification for digitally encoding the
sound signal, comprising:
a receiver of a long-term-prediction delay parameter of each frame,
wherein the long-term-prediction delay parameter characterizes a long term
prediction in the digital sound signal encoding technique;
a calculator of a delay contour in response to the long-term-prediction
delay parameter received during a current frame and the long-term-prediction
delay parameter received during a previous frame, wherein the delay contour,
with long term prediction, maps a signal feature of the previous frame to a
corresponding signal feature of the current frame; and
an adaptive codebook for forming the adaptive codebook excitation in
response to the delay contour.
The foregoing and other objects, advantages and features of the
present invention will become more apparent upon reading of the following
non-restrictive description of illustrative embodiments thereof, given by way of
example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is an illustrative example of original and modified residual
signals for one frame;
Figure 2 is a functional block diagram of an illustrative embodiment of
a signal modification method according to the invention;
Figure 3 is a schematic block diagram of an illustrative example of
speech communication system showing the use of speech encoder and
decoder;
Figure 4 is a schematic block diagram of an illustrative embodiment of
speech encoder that utilizes a signal modification method;
Figure 5 is a functional block diagram of an illustrative embodiment of
pitch pulse search;
Figure 6 is an illustrative example of located pitch pulse positions and
a corresponding pitch cycle segmentation for one frame;
Figure 7 is an illustrative example on determining a delay parameter
when the number of pitch pulses is three (c = 3);
Figure 8 is an illustrative example of delay interpolation (thick line)
over a speech frame compared to linear interpolation (thin line);

Figure 9 is an illustrative example of a delay contour over ten frames
selected in accordance with the delay interpolation (thick line) of Figure 8 and
linear interpolation (thin line) when the correct pitch value is 52 samples;
Figure 10 is a functional block diagram of the signal modification
method that adjusts the speech frame to the selected delay contour in
accordance with an illustrative embodiment of the present invention;
Figure 11 is an illustrative example on updating the target signal w(t)
using a determined optimal shift δ, and on replacing the signal segment ws(k)
with interpolated values shown as gray dots;
Figure 12 is a functional block diagram of a rate determination logic in
accordance with an illustrative embodiment of the present invention; and
Figure 13 is a schematic block diagram of an illustrative embodiment
of speech decoder that utilizes the delay contour formed in accordance with
an illustrative embodiment of the present invention.
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS
Although the illustrative embodiments of the present invention will be
described in relation to speech signals and the 3GPP AMR Wideband Speech
Codec AMR-WB Standard (ITU-T G.722.2), it should be kept in mind that the
concepts of the present invention may be applied to other types of sound
signals as well as other speech and audio coders.

Figure 1 illustrates an example of modified residual signal 12 within
one frame. As shown in Figure 1, the time shift in the modified residual signal
12 is constrained such that this modified residual signal is time synchronous
with the original, unmodified residual signal 11 at frame boundaries occurring
at time instants t_{n-1} and t_n. Here n refers to the index of the present frame.
More specifically, the time shift is controlled implicitly with a delay
contour employed for interpolating the delay parameter over the current
frame. The delay parameter and contour are determined considering the time
alignment constraints at the above-mentioned frame boundaries. When linear
interpolation is used to force the time alignment, the resulting delay
parameters tend to oscillate over several frames. This often causes annoying
artifacts to the modified signal whose pitch follows the artificial oscillating
delay contour. Use of a properly chosen nonlinear interpolation technique for
the delay parameter will substantially reduce these oscillations.
A functional block diagram of the illustrative embodiment of the signal
modification method according to the invention is presented in Figure 2.
The method starts, in "pitch cycle search" block 101, by locating
individual pitch pulses and pitch cycles. The search of block 101 utilizes an
open-loop pitch estimate interpolated over the frame. Based on the located
pitch pulses, the frame is divided into pitch cycle segments, each containing
one pitch pulse and restricted inside the frame boundaries t_{n-1} and t_n.
The function of the "delay curve selection" block 103 is to determine a
delay parameter for the long term predictor and form a delay contour for
interpolating this delay parameter over the frame. The delay parameter and
contour are determined considering the time synchrony constraints at frame
boundaries t_{n-1} and t_n. The delay parameter determined in block 103 is coded
and transmitted to the decoder when signal modification is enabled for the
current frame.
The actual signal modification procedure is conducted in the "pitch
synchronous signal modification" block 105. Block 105 first forms a target
signal based on the delay contour determined in block 103 for subsequently
matching the individual pitch cycle segments into this target signal. The pitch
cycle segments are then shifted one by one to maximize their correlation with
this target signal. To keep the complexity at a low level, no continuous time
warping is applied while searching the optimal shift and shifting the segments.
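The shift search of block 105 can be pictured with the following sketch. It is only an illustration under simplifying assumptions: integer shifts, a plain (unnormalized) correlation, and hypothetical names `best_shift`, `segment`, `target`, `start` and `max_shift`. The actual search is described later in the specification.

```python
def best_shift(segment, target, start, max_shift):
    """Return the integer shift in [-max_shift, max_shift] that maximizes
    the correlation between a pitch cycle segment and the target signal.

    start is the nominal position of the segment inside the target.
    Candidate positions that fall outside the target are skipped.
    """
    best, best_corr = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        hi = lo + len(segment)
        if lo < 0 or hi > len(target):
            continue  # candidate would leave the target signal
        corr = sum(s * w for s, w in zip(segment, target[lo:hi]))
        if corr > best_corr:
            best, best_corr = shift, corr
    return best
```

Keeping `max_shift` small reflects the cautious adjustments mentioned in the text: only slight shifts of each pitch cycle segment are considered.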
The illustrative embodiment of signal modification method as
disclosed in the present specification is typically enabled only on purely
voiced speech frames. For instance, transition frames such as voiced onsets
are not modified because of a high risk of causing artifacts. In purely voiced
frames, pitch cycles usually change relatively slowly and therefore small shifts
suffice to adapt the signal to the long term prediction model. Because only
small, cautious signal adjustments are made, the probability of causing
artifacts is minimized.
The signal modification method constitutes an efficient classifier for
purely voiced segments, and hence a rate determination mechanism to be
used in a source-controlled coding of speech signals. Each of the blocks 101, 103
and 105 of Figure 2 provides several indicators on signal periodicity and the
suitability of signal modification in the current frame. These indicators are
analyzed in logic blocks 102, 104 and 106 in order to determine a proper
coding mode and bit rate for the current frame. More specifically, these logic

blocks 102, 104 and 106 monitor the success of the operations conducted in
blocks 101, 103 and 105.
If block 102 detects that the operation performed in block 101 is
successful, the signal modification method is continued in block 103. When
this block 102 detects a failure in the operation performed in block 101, the
signal modification procedure is terminated and the original speech frame is
preserved intact for coding (see block 108 corresponding to normal mode (no
signal modification)).
If block 104 detects that the operation performed in block 103 is
successful, the signal modification method is continued in block 105. When,
on the contrary, this block 104 detects a failure in the operation performed in
block 103, the signal modification procedure is terminated and the original
speech frame is preserved intact for coding (see block 108 corresponding to
normal mode (no signal modification)).
If block 106 detects that the operation performed in block 105 is
successful, a low bit rate mode with signal modification is used (see block
107). On the contrary, when this block 106 detects a failure in the operation
performed in block 105, the signal modification procedure is terminated, and
the original speech frame is preserved intact for coding (see block 108
corresponding to normal mode (no signal modification)). The operation of the
blocks 101-108 will be described in detail later in the present specification.
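The decision flow of logic blocks 102, 104 and 106 can be summarized by the following sketch. The function name `select_coding_mode`, the callables passed to it, and the returned mode strings are hypothetical placeholders; each callable stands for running one of blocks 101, 103 and 105 and reporting whether it succeeded.

```python
def select_coding_mode(pitch_search_ok, delay_selection_ok, modification_ok):
    """Run the three stages in order and stop at the first failure.

    Each argument is a zero-argument callable returning True on success.
    A failure at any stage selects the normal mode (block 108); success
    of all three selects the low bit rate mode with signal modification
    (block 107).
    """
    for stage_ok in (pitch_search_ok, delay_selection_ok, modification_ok):
        if not stage_ok():
            return "normal"  # original speech frame is kept intact
    return "low_rate_with_modification"
```

Because the loop returns at the first failure, later stages are not executed once the procedure has been terminated, which matches the description above.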
Figure 3 is a schematic block diagram of an illustrative example of
speech communication system depicting the use of speech encoder and
decoder. The speech communication system of Figure 3 supports
transmission and reproduction of a speech signal across a communication

channel 205. Although it may comprise for example a wire, an optical link or a
fiber link, the communication channel 205 typically comprises at least in part a
radio frequency link. The radio frequency link often supports multiple,
simultaneous speech communications requiring shared bandwidth resources
such as may be found with cellular telephony. Although not shown, the
communication channel 205 may be replaced by a storage device that
records and stores the encoded speech signal for later playback.
On the transmitter side, a microphone 201 produces an analog
speech signal 210 that is supplied to an analog-to-digital (A/D) converter 202.
The function of the A/D converter 202 is to convert the analog speech signal
210 into a digital speech signal 211. A speech encoder 203 encodes the
digital speech signal 211 to produce a set of coding parameters 212 that are
coded into binary form and delivered to a channel encoder 204. The channel
encoder 204 adds redundancy to the binary representation of the coding
parameters before transmitting them into a bitstream 213 over the
communication channel 205.
On the receiver side, a channel decoder 206 is supplied with the
above mentioned redundant binary representation of the coding parameters
from the received bitstream 214 to detect and correct channel errors that
occurred in the transmission. A speech decoder 207 converts the channel-
error-corrected bitstream 215 from the channel decoder 206 back to a set of
coding parameters for creating a synthesized digital speech signal 216. The
synthesized speech signal 216 reconstructed by the speech decoder 207 is
converted to an analog speech signal 217 through a digital-to-analog (D/A)
converter 208 and played back through a loudspeaker unit 209.

Figure 4 is a schematic block diagram showing the operations
performed by the illustrative embodiment of speech encoder 203 (Figure 3)
incorporating the signal modification functionality. The present specification
presents a novel implementation of this signal modification functionality of
block 603 in Figure 4. The other operations performed by the speech encoder
203 are well known to those of ordinary skill in the art and have been
described, for example, in the publication [10]
[10] 3GPP TS 26.190, "AMR Wideband Speech Codec:
Transcoding Functions," 3GPP Technical Specification.
which is incorporated herein by reference. When not stated otherwise, the
implementation of the speech encoding and decoding operations in the
illustrative embodiments and examples of the present invention will comply
with the AMR Wideband Speech Codec (AMR-WB) Standard.
The speech encoder 203 as shown in Figure 4 encodes the digitized
speech signal using one or a plurality of coding modes. When a plurality of
coding modes are used and the signal modification functionality is disabled in
one of these modes, this particular mode will operate in accordance with well
established standards known to those of ordinary skill in the art.
Although not shown in Figure 4, the speech signal is sampled at a
rate of 16 kHz and each speech signal sample is digitized. The digital speech
signal is then divided into successive frames of given length, and each of
these frames is divided into a given number of successive subframes. The
digital speech signal is further subjected to preprocessing as taught by the
AMR-WB standard. This preprocessing includes high-pass filtering, pre-
emphasis filtering using a filter P(z) = 1 - 0.68 z^{-1}, and down-sampling from the

sampling rate of 16 kHz to 12.8 kHz. The subsequent operations of Figure 4
assume that the input speech signal s(t) has been preprocessed and down-
sampled to the sampling rate of 12.8 kHz.
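As an illustration of the pre-emphasis step only, a first-order filter of the form 1 - mu z^{-1} corresponds to the difference equation y(t) = s(t) - mu s(t-1). The sketch below assumes a floating-point implementation; the function name `pre_emphasis` and the `prev_sample` argument (the filter memory carried over from the previous call) are hypothetical, and the high-pass filtering and down-sampling steps are not shown.

```python
def pre_emphasis(samples, mu=0.68, prev_sample=0.0):
    """Apply y(t) = s(t) - mu * s(t-1) to a block of samples.

    prev_sample is the last input sample of the preceding block, so that
    consecutive blocks can be filtered without a discontinuity.
    """
    out = []
    for x in samples:
        out.append(x - mu * prev_sample)
        prev_sample = x  # update the one-sample filter memory
    return out
```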
The speech encoder 203 comprises an LP (Linear Prediction)
analysis and quantization module 601 responsive to the input, preprocessed
digital speech signal s(t) 617 to compute and quantize the parameters a_0, a_1,
a_2, ..., a_{nA} of the LP filter 1/A(z), wherein nA is the order of the filter and
A(z) = a_0 + a_1 z^{-1} + a_2 z^{-2} + ... + a_{nA} z^{-nA}. The binary representation 616 of these
quantized LP filter parameters is supplied to the multiplexer 614 and
subsequently multiplexed into the bitstream 615. The non-quantized and
quantized LP filter parameters can be interpolated for obtaining the
corresponding LP filter parameters for every subframe.
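For orientation, filtering the speech signal through the analysis filter A(z) gives the LP residual, i.e. the sum over i = 0 ... nA of a_i s(t - i). The following sketch assumes the coefficients are already known and treats samples before the start of the input as zero; the function name `lp_residual` is hypothetical, and the computation and quantization of the coefficients themselves are not shown.

```python
def lp_residual(speech, coeffs):
    """Filter speech through A(z) = coeffs[0] + coeffs[1] z^-1 + ...

    Returns the residual r(t) = sum_i coeffs[i] * speech[t - i].
    Assumption: samples before t = 0 are zero (no filter memory).
    """
    residual = []
    for t in range(len(speech)):
        acc = 0.0
        for i, a in enumerate(coeffs):
            if t - i >= 0:
                acc += a * speech[t - i]
        residual.append(acc)
    return residual
```

For a signal that decays by exactly one half per sample, the first-order filter with coefficients `[1.0, -0.5]` predicts every sample after the first, so the residual is zero from the second sample on.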
The speech encoder 203 further comprises a pitch estimator 602 to
compute open-loop pitch estimates 619 for the current frame in response to
the LP filter parameters 618 from the LP analysis and quantization module
601. These open-loop pitch estimates 619 are interpolated over the frame to
be used in a signal modification module 603.
The operations performed in the LP analysis and quantization module
601 and the pitch estimator 602 can be implemented in compliance with the
above-mentioned AMR-WB Standard.
The signal modification module 603 of Figure 4 performs a signal
modification operation prior to the closed-loop pitch search of the adaptive
codebook excitation signal for adjusting the speech signal to the determined
delay contour d(t). In the illustrative embodiment, the delay contour d(t)
defines a long term prediction delay for every sample of the frame. By
construction the delay contour is fully characterized over the frame t ∈ (t_{n-1},
t_n] by a delay parameter 620 d_n = d(t_n) and its previous value d_{n-1} = d(t_{n-1})
that are equal to the value of the delay contour at frame boundaries. The
delay parameter 620 is determined as a part of the signal modification
operation, and coded and then supplied to the multiplexer 614 where it is
multiplexed into the bitstream 615.
The delay contour d(t) defining a long term prediction delay parameter
for every sample of the frame is supplied to an adaptive codebook 607. The
adaptive codebook 607 is responsive to the delay contour d(t) to form the
adaptive codebook excitation ub(t) of the current subframe from the excitation
u(t) using the delay contour d(t) as ub(t) = u(t - d(t)). Thus the delay
contour maps the past sample of the excitation signal u(t - d(t)) to the present
sample in the adaptive codebook excitation ub(t).
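The mapping ub(t) = u(t - d(t)) can be sketched as follows. This is a minimal illustration only: the function name and arguments are hypothetical, the delay is rounded to an integer (a practical coder would interpolate for fractional delays), and delays shorter than the subframe are served from samples already produced in the same subframe, which is an assumption not stated in the text.

```python
def adaptive_codebook_excitation(past_u, length, delay):
    """Form ub(t) = u(t - d(t)) for the `length` samples following past_u.

    past_u : past excitation samples u(0) .. u(T-1)
    delay  : function t -> d(t), the delay contour (rounded to an integer
             here, which is a simplification of this sketch)
    """
    buf = list(past_u)
    start = len(buf)
    for t in range(start, start + length):
        src = t - int(round(delay(t)))
        if not 0 <= src < t:
            raise ValueError("delay points outside the available excitation")
        # Samples already written in this subframe may be read again when the
        # delay is shorter than the subframe (assumption of this sketch).
        buf.append(buf[src])
    return buf[start:]
```

With a constant delay of four samples and the past excitation 0, 1, ..., 7, the six new samples repeat the last four past samples periodically.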
The signal modification procedure produces also a modified residual
signal r(t) to be used for composing a modified target signal 621 for the
closed-loop search of the fixed-codebook excitation uc(t). The modified
residual signal r(t) is obtained in the signal modification module 603 by
warping the pitch cycle segments of the LP residual signal, and is supplied to
the computation of the modified target signal in module 604. The LP synthesis
filtering of the modified residual signal with the filter 1/A(z) yields then in
module 604 the modified speech signal. The modified target signal 621 of the
fixed-codebook excitation search is formed in module 604 in accordance with
the operation of the AMR-WB Standard, but with the original speech signal
replaced by its modified version.

After the adaptive codebook excitation ub(t) and the modified target
signal 621 have been obtained for the current subframe, the encoding can
further proceed using conventional means.
The function of the closed-loop fixed-codebook excitation search is to
determine the fixed-codebook excitation signal uc(t) for the current subframe.
To schematically illustrate the operation of the closed-loop fixed-codebook
search, the fixed-codebook excitation uc(t) is gain scaled through an amplifier
610. In the same manner, the adaptive-codebook excitation ub(t) is gain
scaled through an amplifier 609. The gain scaled adaptive and fixed-
codebook excitations ub(t) and uc(t) are summed together through an adder
611 to form a total excitation signal u(t). This total excitation signal u(t) is
processed through an LP synthesis filter 1/A(z) 612 to produce a synthesis
speech signal 625 which is subtracted from the modified target signal 621
through an adder 605 to produce an error signal 626. An error weighting and
minimization module 606 is responsive to the error signal 626 to calculate,
according to conventional methods, the gain parameters for the amplifiers 609
and 610 every subframe. The error weighting and minimization module 606
further calculates, in accordance with conventional methods and in response
to the error signal 626, the input 627 to the fixed codebook 608. The
quantized gain parameters 622 and 623 and the parameters 624
characterizing the fixed-codebook excitation signal uc(t) are supplied to the
multiplexer 614 and multiplexed into the bitstream 615. The above procedure
is done in the same manner both when signal modification is enabled or
disabled.
It should be noted that, when the signal modification functionality is
disabled, the adaptive excitation codebook 607 operates according to
conventional methods. In this case, a separate delay parameter is searched
for every subframe in the adaptive codebook 607 to refine the open-loop pitch
estimates 619. These delay parameters are coded, supplied to the multiplexer
614 and multiplexed into the bitstream 615. Furthermore, the target signal 621
for the fixed-codebook search is formed in accordance with conventional
methods.
The speech decoder as shown in Figure 13 operates according to
conventional methods except when signal modification is enabled. Signal
modification disabled and enabled operation differs essentially only in the way
the adaptive codebook excitation signal ub(t) is formed. In both operational
modes, the decoder decodes the received parameters from their binary
representation. Typically the received parameters include excitation, gain,
delay and LP parameters. The decoded excitation parameters are used in
module 701 to form the fixed-codebook excitation signal uc(t) for every
subframe. This signal is supplied through an amplifier 702 to an adder 703.
Similarly, the adaptive codebook excitation signal ub(t) of the current subframe
is supplied to the adder 703 through an amplifier 704. In the adder 703, the
gain-scaled adaptive and fixed-codebook excitation signals ub(t) and uc(t) are
summed together to form a total excitation signal u(t) for the current subframe.
This excitation signal u(t) is processed through the LP synthesis filter 1/A(z)
708, which uses the LP parameters interpolated in module 707 for the current
subframe, to produce the synthesized speech signal ŝ(t).
When signal modification is enabled, the speech decoder recovers the
delay contour d(t) in module 705 using the received delay parameter dn and
its previous received value dn-1 as in the encoder. This delay contour d(t)
defines a long term prediction delay parameter for every time instant of the
current frame. The adaptive codebook excitation ub(t) = u(t-d(t)) is formed
from the past excitation for the current subframe as in the encoder using the
delay contour d(t).
The remaining description discloses the detailed operation of the
signal modification procedure 603 as well as its use as a part of the mode
determination mechanism.
Search of Pitch Pulses and Pitch Cycle Segments
The signal modification method operates pitch and frame
synchronously, shifting each detected pitch cycle segment individually but
constraining the shift at frame boundaries. This requires means for locating
pitch pulses and corresponding pitch cycle segments for the current frame. In
the illustrative embodiment of the signal modification method, pitch cycle
segments are determined based on detected pitch pulses that are searched
according to Figure 5.
Pitch pulse search can operate on the residual signal r(t), the
weighted speech signal w(t) and/or the weighted synthesized speech signal
ŵ(t). The residual signal r(t) is obtained by filtering the speech signal s(t) with
the LP filter A(z), which has been interpolated for the subframes. In the
illustrative embodiment, the order of the LP filter A(z) is 16. The weighted
speech signal w(t) is obtained by processing the speech signal s(t) through
the weighting filter

W(z) = A(z/γ1) / (1 - γ2 z^-1)     (1)

where the coefficients γ1 = 0.92 and γ2 = 0.68. The weighted speech signal
w(t) is often utilized in open-loop pitch estimation (module 602) since the
weighting filter defined by Equation (1) attenuates the formant structure in the
speech signal s(t), and preserves the periodicity also on sinusoidal signal
segments. That facilitates pitch pulse search because possible signal
periodicity becomes clearly apparent in weighted signals. It should be noted
that the weighted speech signal w(t) is needed also for the look ahead in
order to search the last pitch pulse in the current frame. This can be done by
using the weighting filter of Equation (1) formed in the last subframe of the
current frame over the look ahead portion.
The pitch pulse search procedure of Figure 5 starts in block 301 by
locating the last pitch pulse of the previous frame from the residual signal r(t).
A pitch pulse typically stands out clearly as the maximum absolute value of
the low-pass filtered residual signal in a pitch cycle having a length of
approximately p(tn-1). A normalized Hamming window Hs(z) = (0.08z^-2 +
0.54z^-1 + 1 + 0.54z + 0.08z^2)/2.24 having a length of five (5) samples is used
for the low-pass filtering in order to facilitate the locating of the last pitch pulse
of the previous frame. This pitch pulse position is denoted by T0. The
illustrative embodiment of the signal modification method according to the
invention does not require an accurate position for this pitch pulse, but rather
a rough location estimate of the high-energy segment in the pitch cycle.
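The locating step of block 301 can be illustrated with the following sketch. The five window taps and the normalization by 2.24 come from the text; the function names, the zero padding at the buffer edges, and the choice of searching exactly one pitch period before the frame end are assumptions made for the illustration.

```python
HAMMING5 = [0.08, 0.54, 1.0, 0.54, 0.08]  # taps of Hs(z); they sum to 2.24


def smooth(r):
    """Low-pass filter r with the normalized 5-tap window (zero padded edges)."""
    n = len(r)
    out = []
    for t in range(n):
        acc = 0.0
        for k, h in enumerate(HAMMING5):
            i = t + k - 2  # window centred on sample t
            if 0 <= i < n:
                acc += h * r[i]
        out.append(acc / 2.24)
    return out


def locate_last_pulse(r, frame_end, pitch):
    """Rough position T0 of the last pitch pulse before frame_end (exclusive).

    The search covers roughly one pitch cycle before the frame end and picks
    the maximum absolute value of the smoothed residual.
    """
    lp = smooth(r)
    lo = max(0, frame_end - int(round(pitch)))
    return max(range(lo, frame_end), key=lambda t: abs(lp[t]))
```

Because only a rough estimate of the high-energy region is needed, no fractional refinement is attempted in this sketch.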
After locating the last pitch pulse at T0 in the previous frame, a pitch
pulse prototype of length 2l + 1 samples is extracted in block 302 of Figure 5
around this rough position estimate as, for example:


This pitch pulse prototype is subsequently used in locating pitch pulses in the
current frame.
The synthesized weighted speech signal ŵ(t) (or the weighted speech
signal w(t)) can be used for the pulse prototype instead of the residual signal
r(t). This facilitates pitch pulse search, because the periodic structure of the
signal is better preserved in the weighted speech signal. The synthesized
weighted speech signal ŵ(t) is obtained by filtering the synthesized speech
signal ŝ(t) of the last subframe of the previous frame by the weighting filter
W(z) of Equation (1). If the pitch pulse prototype extends over the end of the
previously synthesized frame, the weighted speech signal w(t) of the current
frame is used for this exceeding portion. The pitch pulse prototype has a high
correlation with the pitch pulses of the weighted speech signal w(t) if the
previous synthesized speech frame contains already a well-developed pitch
cycle. Thus the use of the synthesized speech in extracting the prototype
provides additional information for monitoring the performance of coding and
selecting an appropriate coding mode in the current frame as will be explained
in more detail in the following description.
Selecting l = 10 samples provides a good compromise between the
complexity and performance in the pitch pulse search. The value of l can also
be determined proportionally to the open-loop pitch estimate.
Given the position T0 of the last pulse in the previous frame, the first
pitch pulse of the current frame can be predicted to occur approximately at
instant T0 + p(T0). Here p(t) denotes the interpolated open-loop pitch estimate
at instant (position) t. This prediction is performed in block 303.
In block 305, the predicted pitch pulse position T0 + p(T0) is refined as


where the weighted speech signal w(t) in the neighborhood of the predicted
position is correlated with the pulse prototype:

Thus the refinement is the argument j, limited into [-jmax, jmax], that maximizes
the weighted correlation C(j) between the pulse prototype and one of the
above-mentioned residual signal, weighted speech signal or weighted
synthesized speech signal. According to an illustrative example, the limit jmax
is proportional to the open-loop pitch estimate as min{20, ⟨p(0)/4⟩}, where the
operator ⟨·⟩ denotes rounding to the nearest integer. The weighting function

in Equation (4) favors the pulse position predicted using the open-loop pitch
estimate, since γ(j) attains its maximum value 1 at j = 0. The denominator
p(T0 + p(T0)) in Equation (5) is the open-loop pitch estimate for the predicted
pitch pulse position.
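The refinement of block 305 can be sketched as below. Equations (3) to (5) are not reproduced in this text, so the sketch assumes a correlation of the prototype with the signal around the predicted position, multiplied by a weight 1 - |j|/p that equals 1 at j = 0 and has the open-loop pitch estimate as its denominator, as the description states. The function and argument names are hypothetical.

```python
def search_limit(p0):
    """jmax = min{20, p(0)/4 rounded to the nearest integer}."""
    return min(20, int(p0 / 4.0 + 0.5))


def refine_pulse(w, prototype, predicted, pitch, j_max):
    """Return the refined pulse position predicted + j.

    w         : signal searched (for example the weighted speech signal)
    prototype : pulse prototype of 2l + 1 samples
    predicted : integer predicted position, for example T0 + p(T0)
    pitch     : open-loop pitch estimate at the predicted position
    """
    l = (len(prototype) - 1) // 2
    best_j, best_c = 0, float("-inf")
    for j in range(-j_max, j_max + 1):
        start = predicted + j - l
        if start < 0 or start + len(prototype) > len(w):
            continue  # candidate window falls outside the buffer
        corr = sum(m * w[start + k] for k, m in enumerate(prototype))
        weighted = (1.0 - abs(j) / pitch) * corr  # assumed form of the weight
        if weighted > best_c:
            best_j, best_c = j, weighted
    return predicted + best_j
```

The weight penalizes candidates far from the predicted position, so that a slightly weaker match near the prediction can win over a distant one.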
After the first pitch pulse position T1 has been found using Equation
(3), the next pitch pulse can be predicted to be at instant T2 = T1 + p(T1) and
refined as described above. This pitch pulse search comprising the prediction
303 and refinement 305 is repeated until either the prediction or refinement
procedure yields a pitch pulse position outside the current frame. These
conditions are checked in logic block 304 for the prediction of the position of
the next pitch pulse (block 303) and in logic block 306 for the refinement of
this position of the pitch pulse (block 305). It should be noted that the logic
block 304 terminates the search only if a predicted pulse position is so far in
the subsequent frame that the refinement step cannot bring it back to the
current frame. This procedure yields c pitch pulse positions inside the current
frame, denoted by T1, T2,..., Tc.
According to an illustrative example, pitch pulses are located in the
integer resolution except the last pitch pulse of the frame denoted by Tc. Since
the exact distance between the last pulses of two successive frames is
needed to determine the delay parameter to be transmitted, the last pulse is
located using a fractional resolution of 1/4 sample in Equation (4) for j. The
fractional resolution is obtained by upsampling w(t) in the neighborhood of the
last predicted pitch pulse before evaluating the correlation of Equation (4).
According to an illustrative example, Hamming-windowed sinc interpolation of
length 33 is used for upsampling. The fractional resolution of the last pitch
pulse position helps to maintain the good performance of long term prediction
despite the time synchrony constraint set at the frame end. This is obtained
at the cost of the additional bit rate needed for transmitting the delay
parameter with a higher accuracy.
After completing pitch cycle segmentation in the current frame, an
optimal shift for each segment is determined. This operation is done using the
weighted speech signal w(t) as will be explained in the following description.
For reducing the distortion caused by warping, the shifts of individual pitch
cycle segments are implemented using the LP residual signal r(t). Since
shifting distorts the signal particularly around segment boundaries, it is
essential to place the boundaries in low power sections of the residual signal
r(t). In an illustrative example, the segment boundaries are placed
approximately in the middle of two consecutive pitch pulses, but constrained
inside the current frame. Segment boundaries are always selected inside the
current frame such that each segment contains exactly one pitch pulse.
Segments with more than one pitch pulse or "empty" segments without any
pitch pulses hamper subsequent correlation-based matching with the target
signal and should be prevented in pitch cycle segmentation. The sth extracted
segment of ls samples is denoted as ws(k) for k = 0, 1, ..., ls - 1. The starting
instant of this segment is ts, selected such that ws(0) = w(ts). The number of
segments in the present frame is denoted by c.
While selecting the segment boundary between two successive pitch
pulses Ts and Ts+1 inside the current frame, the following procedure is used.
First the central instant between two pulses is computed as A = ((Ts +
Ts+1)/2). The candidate positions for the segment boundary are located in the
region [A-εmax, A+εmax], where εmax corresponds to five samples. The energy
of each candidate boundary position is computed as

The position giving the smallest energy is selected because this choice
typically results in the smallest distortion in the modified speech signal. The
instant that minimizes Equation (6) is denoted as ε. The starting instant of the
new segment is selected as ts = A + ε. This defines also the length of the
previous segment, since the previous segment ends at instant A + ε - 1.
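The boundary selection can be sketched as follows. Equation (6) is not reproduced in this text, so the energy measure used here, the sum of the squared residual samples on both sides of the candidate boundary, is an assumption, as are the function name and the use of integer division for the central instant.

```python
def segment_boundary(r, t_s, t_next, eps_max=5):
    """Return the starting instant A + eps of the new segment.

    r       : LP residual signal
    t_s     : position of pitch pulse Ts
    t_next  : position of the following pitch pulse Ts+1
    eps_max : half-width of the candidate region (five samples in the text)
    """
    centre = (t_s + t_next) // 2  # central instant A

    def energy(pos):
        # Assumed measure: energy of the two samples adjacent to the boundary.
        return r[pos - 1] ** 2 + r[pos] ** 2

    candidates = range(centre - eps_max, centre + eps_max + 1)
    return min(candidates, key=energy)
```

The previous segment then ends one sample before the returned instant.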
Figure 6 shows an illustrative example of pitch cycle segmentation.
Note particularly the first and the last segment w1(k) and w4(k), respectively,
extracted such that no empty segments result and the frame boundaries are
not exceeded.

Determination of the Delay Parameter
Generally the main advantage of signal modification is that only one
delay parameter per frame has to be coded and transmitted to the decoder
(not shown). However, special attention has to be paid to the determination of
this single parameter. The delay parameter not only defines together with its
previous value the evolution of the pitch cycle length over the frame, but also
affects time asynchrony in the resulting modified signal.
In the methods described in [1,4-7]
[1] W.B. Kleijn, P. Kroon, and D. Nahumi, "The RCELP speech-
coding algorithm," European Transactions on
Telecommunications, Vol. 4, No. 5, pp. 573-582, 1994.
[4] US Patent 5,704,003, "RCELP coder," Lucent Technologies
Inc., (W.B. Kleijn and D. Nahumi), Filing Date 19 Sep. 1995.
[5] European Patent Application 0 602 826 A2, "Time shifting for
analysis-by-synthesis coding," AT&T Corp., (B. Kleijn), Filing
Date 1 Dec. 1993.
[6] Patent Application WO 00/11653, "Speech encoder with
continuous warping combined with long term prediction,"
Conexant Systems Inc., (Y. Gao), Filing Date 24 Aug. 1999.
[7] Patent Application WO 00/11654, "Speech encoder
adaptively applying pitch preprocessing with continuous
warping," Conexant Systems Inc., (H. Su and Y. Gao), Filing
Date 24 Aug. 1999.
no time synchrony is required at frame boundaries, and thus the delay
parameter to be transmitted can be determined straightforwardly using an
open-loop pitch estimate. This selection usually results in a time asynchrony
at the frame boundary, and translates to an accumulating time shift in the
subsequent frame because the signal continuity has to be preserved.
Although human hearing is insensitive to changes in the time scale of the
synthesized speech signal, increasing time asynchrony complicates the
encoder implementation. Indeed, long signal buffers are required to
accommodate the signals whose time scale may have been expanded, and a
control logic has to be implemented for limiting the accumulated shift during
encoding. Also, time asynchrony of several samples typical in RCELP coding
may cause mismatch between the LP parameters and the modified residual
signal. This mismatch may result in perceptual artifacts to the modified
speech signal that is synthesized by LP filtering the modified residual signal.
On the contrary, the illustrative embodiment of the signal modification
method according to the present invention preserves the time synchrony at
frame boundaries. Thus, a strictly constrained shift occurs at the frame ends
and every new frame starts in perfect time match with the original speech
frame.
To ensure time synchrony at the frame end, the delay contour d(t)
maps, with the long term prediction, the last pitch pulse at the end of the
previous synthesized speech frame to the pitch pulses of the current frame.
The delay contour defines an interpolated long-term prediction delay
parameter over the current nth frame for every sample from instant tn-1 + 1
through tn. Only the delay parameter dn = d(tn) at the frame end is transmitted
to the decoder implying that d(t) must have a form fully specified by the
transmitted values. The long-term prediction delay parameter has to be
selected such that the resulting delay contour fulfils the pulse mapping. In
mathematical form this mapping can be presented as follows: Let kc be a
temporary time variable and T0 and Tc the last pitch pulse positions in the
previous and current frames, respectively. Now, the delay parameter dn has to
be selected such that, after executing the pseudo-code presented in Table 1,
the variable kc has a value very close to T0, minimizing the error |kc - T0|. The
pseudo-code starts from the value k0 = Tc and iterates backwards c times by
updating ki := ki-1 - d(ki-1). If kc then equals T0, long term prediction can be
utilized with maximum efficiency without time asynchrony at the frame end.

An example of the operation of the delay selection loop in the case c
= 3 is illustrated in Figure 7. The loop starts from the value k0 = Tc and takes
the first iteration backwards as k1 = k0 - d(k0). Iterations are continued twice
more resulting in k2 = k1 - d(k1) and k3 = k2 - d(k2). The final value k3 is then
compared against T0 in terms of the error en = |k3 - T0|. The resulting error is
a function of the delay contour that is adjusted in the delay selection algorithm
as will be taught later in this specification.
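The backward iteration described above can be written out as a short sketch. The function name is hypothetical, and the delay contour is passed in as an arbitrary function of time so that the sketch does not depend on any particular contour shape.

```python
def mapping_error(delay, t_c, t_0, c):
    """Iterate k := k - d(k) c times from k = Tc and return |kc - T0|.

    delay : function t -> d(t), the delay contour over the current frame
    t_c   : position Tc of the last pitch pulse in the current frame
    t_0   : position T0 of the last pitch pulse in the previous frame
    c     : number of pitch pulses in the current frame
    """
    k = t_c
    for _ in range(c):
        k = k - delay(k)
    return abs(k - t_0)
```

For a constant delay of 50 samples, Tc = 200 and c = 3, the loop visits 150, 100 and 50, so the error vanishes when T0 = 50.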
Signal modification methods such as those described in documents [1, 4, 6, 7]
cited above interpolate the delay parameters linearly over the frame between
dn-1 and dn.
However, when time synchrony is required at the frame end, linear
interpolation tends to result in an oscillating delay contour. Thus pitch cycles
in the modified speech signal contract and expand periodically causing easily
annoying artifacts. The evolution and amplitude of the oscillations are related
to the last pitch position. The further the last pitch pulse is from the frame end
in relation to the pitch period, the more likely the oscillations are amplified.
Since the time synchrony at the frame end is an essential requirement of the
illustrative embodiment of the signal modification method according to the
present invention, linear interpolation familiar from the prior methods cannot
be used without degrading the speech quality. Instead, the illustrative
embodiment of the signal modification method according to the present
invention discloses a piecewise linear delay contour

Oscillations are significantly reduced by using this delay contour. Here tn and
tn-1 are the end instants of the current and previous frames, respectively, and
dn and dn-1 are the corresponding delay parameter values. Note that tn-1 + σn is
the instant after which the delay contour remains constant.
In an illustrative example, the parameter σn varies as a function of dn-1
as

and the frame length N is 256 samples. To avoid oscillations, it is beneficial to
decrease the value of σn as the length of the pitch cycle increases. On the
other hand, to avoid rapid changes in the delay contour d(t) in the beginning
of the frame as tn-1 < t < tn-1 + σn, the parameter σn has to be always at least
half of the frame length. Rapid changes in d(t) easily degrade the quality of
the modified speech signal.
Note that depending on the coding mode of the previous frame, dn-1
can be either the delay value at the frame end (signal modification enabled) or
the delay value of the last subframe (signal modification disabled). Since the
past value dn-1 of the delay parameter is known at the decoder, the delay
contour is unambiguously defined by dn, and the decoder is able to form the
delay contour using Equation (7).
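How the decoder could rebuild the contour from dn-1 and dn is sketched below. Equation (7) is not reproduced in this text, so the sketch assumes the simplest form consistent with the description: a linear ramp from dn-1 at tn-1 to dn at tn-1 + σn, followed by the constant value dn until the frame end. The function and argument names are hypothetical.

```python
def delay_contour(t, t_prev, d_prev, d_n, sigma):
    """Piecewise linear delay contour d(t) for t_prev <= t <= t_prev + N.

    t_prev : end instant tn-1 of the previous frame
    d_prev : previous delay parameter dn-1
    d_n    : delay parameter dn of the current frame
    sigma  : length of the linear part (sigma_n)
    """
    if t < t_prev + sigma:
        alpha = (t - t_prev) / sigma
        return d_prev + alpha * (d_n - d_prev)  # linear ramp (assumed form)
    return d_n  # constant after t_prev + sigma
```

With the values of Figure 8 (dn-1 = 50, dn = 53, σn = 172, N = 256) the contour passes through 51.5 at t = 86 and stays at 53 from t = 172 to the frame end.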
The only parameter which can be varied while searching the optimal
delay contour is dn, the delay parameter value at the end of the frame
constrained into [34, 231]. There is no simple explicit method for solving the
optimal dn in a general case. Instead, several values have to be tested to find
the best solution. However, the search is straightforward. The value of dn can
be first predicted as

In the illustrative embodiment, the search is done in three
phases by increasing the resolution and focusing the search range to be
examined inside [34, 231] in every phase. The delay parameters giving the
smallest error en = |kc - T0| in the procedure of Table 1 in these three phases
are denoted by dn(1), dn(2), and dn = dn(3), respectively. In the first phase, the
search is done around the value dn(0) predicted using Equation (10) with a
resolution of four samples in the range [dn(0) -11, dn(0) +12] when dn(0) and in the range [dn(0)-15, dn(0) +16] otherwise. The second phase

constrains the range into [dn(1) - 3, dn(1) + 3] and uses the integer resolution.
The last, third phase examines the range [dn(2) - 3/4, dn(2) + 3/4] with a
resolution of 1/4 sample for dn(2) < 92 1/2. Above this limit, the range [dn(2) - 1/2,
dn(2) + 1/2] and a resolution of 1/2 sample is used. This third phase yields the
optimal delay parameter dn to be transmitted to the decoder. This procedure is
a compromise between the search accuracy and complexity. Of course, those
of ordinary skill in the art can readily implement the search of the delay
parameter under the time synchrony constraints using alternative means
without departing from the nature and spirit of the present invention.
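A coarse-to-fine search of this kind can be sketched as follows. Several details are assumptions of the sketch rather than statements of the text: the contour is taken as a linear ramp over σn followed by a constant, the first phase always uses the narrower range with a step of four samples, the third phase always uses the 1/4 sample resolution, and the initial prediction is passed in as an argument. All names are hypothetical.

```python
def contour(t, t_prev, d_prev, d_n, sigma):
    """Assumed piecewise linear contour: ramp over sigma, then constant."""
    if t < t_prev + sigma:
        return d_prev + (t - t_prev) / sigma * (d_n - d_prev)
    return d_n


def mapping_error(d_n, d_prev, t_prev, sigma, t_c, t_0, c):
    """Error |kc - T0| after c backward iterations k := k - d(k) from Tc."""
    k = t_c
    for _ in range(c):
        k -= contour(k, t_prev, d_prev, d_n, sigma)
    return abs(k - t_0)


def search_delay(d_init, d_prev, t_prev, sigma, t_c, t_0, c,
                 d_min=34.0, d_max=231.0):
    """Three-phase search for dn inside [d_min, d_max]."""
    def best(candidates):
        valid = [d for d in candidates if d_min <= d <= d_max]
        return min(valid, key=lambda d: mapping_error(
            d, d_prev, t_prev, sigma, t_c, t_0, c))

    d1 = best([d_init - 11 + 4 * i for i in range(6)])   # step of 4 samples
    d2 = best([d1 - 3 + i for i in range(7)])             # integer resolution
    return best([d2 - 0.75 + 0.25 * i for i in range(7)])  # 1/4 sample
```

For a signal with a constant pitch period of 52 samples, five pulses in the frame and a deliberately wrong initial prediction of 54, the three phases select 51, 52 and finally 52 samples.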
The delay parameter dn ∈ [34, 231] can be coded using nine bits per
frame using a resolution of 1/4 sample for dn < 92 1/2 and a resolution of
1/2 sample otherwise.
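The nine-bit budget can be checked by enumerating the quantization levels. The sketch below assumes that the resolution switches from 1/4 sample to 1/2 sample at 92 1/2, which is the reading under which the range [34, 231] yields exactly 2^9 = 512 levels. The function name is hypothetical.

```python
from fractions import Fraction


def delay_levels(d_min=34, d_split=Fraction(185, 2), d_max=231):
    """List all codable delay values: 1/4 steps below d_split, 1/2 steps above."""
    levels = []
    d = Fraction(d_min)
    while d < d_split:
        levels.append(d)
        d += Fraction(1, 4)
    d = Fraction(d_split)
    while d <= d_max:
        levels.append(d)
        d += Fraction(1, 2)
    return levels
```

The lower region contributes 234 levels and the upper region 278 levels, so the index of a level in this list fits in nine bits.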
Figure 8 illustrates delay interpolation when dn-1 = 50, dn= 53, σn = 172,
and the frame length N = 256. The interpolation method used in the illustrative
embodiment of the signal modification method is shown in thick line whereas
the linear interpolation corresponding to prior methods is shown in thin line.
Both interpolated contours perform approximately in a similar manner in the
delay selection loop of Table 1, but the disclosed piecewise linear
interpolation results in a smaller absolute change |dn-1 - dn|. This feature
reduces potential oscillations in the delay contour d(t) and annoying artifacts
in the modified speech signal whose pitch will follow this delay contour.
To further clarify the performance of the piecewise linear interpolation
method, Figure 9 shows an example of the resulting delay contour d(t) over
ten frames with thick line. The corresponding delay contour d(t) obtained with
conventional linear interpolation is indicated with thin line. The example has

been composed using an artificial speech signal having a constant delay
parameter of 52 samples as an input of the speech modification procedure. A
delay parameter d0 = 54 samples was intentionally used as an initial value for
the first frame to illustrate the effect of pitch estimation errors typical in speech
coding. Then, the delay parameters dn both for the linear interpolation and the
herein disclosed piecewise linear interpolation method were searched using
the procedure of Table 1. All the parameters needed were selected in
accordance with the illustrative embodiment of the signal modification method
according to the present invention. The resulting delay contours d(t) show that
piecewise linear interpolation yields a rapidly converging delay contour d(t)
whereas the conventional linear interpolation cannot reach the correct value
within the ten frame period. These prolonged oscillations in the delay contour
d(t) often cause annoying artifacts to the modified speech signal degrading
the overall perceptual quality.
Modification of the Signal
After the delay parameter dn and the pitch cycle segmentation have
been determined, the signal modification procedure itself can be initiated. In
the illustrative embodiment of the signal modification method, the speech
signal is modified by shifting individual pitch cycle segments one by one
adjusting them to the delay contour d(t). A segment shift is determined by
correlating the segment in the weighted speech domain with the target signal.
The target signal is composed using the synthesized weighted speech signal
ŵ(t) of the previous frame and the preceding, already shifted segments in the
current frame. The actual shift is done on the residual signal r(t).
Signal modification has to be done carefully to both maximize the
performance of long term prediction and simultaneously to preserve the
perceptual quality of the modified speech signal. The required time synchrony
at frame boundaries has to be taken into account also during modification.
A block diagram of the illustrative embodiment of the signal
modification method is shown in Figure 10. Modification starts by extracting a
new segment ws(k) of ls samples from the weighted speech signal w(t) in
block 401. This segment is defined by the segment length ls and starting
instant ts giving ws(k) = w(ts + k) for k = 0, 1, ..., ls - 1. The segmentation
procedure is carried out in accordance with the teachings of the foregoing
description.
If no more segments can be selected or extracted (block 402), the
signal modification operation is completed (block 403). Otherwise, the signal
modification operation continues with block 404.
For finding the optimal shift of the current segment ws(k), a target
signal w(t) is created in block 405. For the first segment w1(k) in the current
frame, this target signal is obtained by the recursion

Here ŵ(t) is the weighted synthesized speech signal available in the previous
frame for t ≤ tn-1. The parameter δ1 is the maximum shift allowed for the first
segment of length l1. Equation (11) can be interpreted as simulation of long
term prediction using the delay contour over the signal portion in which the
current shifted segment may potentially be situated. The computation of the
target signal for the subsequent segments follows the same principle and will
be presented later in this section.

The search procedure for finding the optimal shift of the current
segment can be initiated after forming the target signal. This procedure is
based on the correlation cs(δ') computed in block 404 between the segment
ws(k) that starts at instant ts and the target signal w(t) as

where δs determines the maximum shift allowed for the current segment ws(k)
and ⌈·⌉ denotes rounding towards plus infinity. Normalized correlation can
well be used instead of Equation (12), although with increased complexity. In the
illustrative embodiment, the following values are used for δs:

As will be described later in this section, the value of δs is more limited for the
first and the last segment in the frame.
Correlation (12) is evaluated with an integer resolution, but higher
accuracy improves the performance of long term prediction. For keeping the
complexity low, it is not reasonable to upsample directly the signal ws(k) or
w(t) in Equation (12). Instead, a fractional resolution is obtained in a
computationally efficient manner by determining the optimal shift using the
upsampled correlation cs(δ).

The shift δ maximizing the correlation cs(δ') is searched first in the
integer resolution in block 404. Now, in a fractional resolution the maximum
value must be located in the open interval (δ- 1, δ+ 1), and bounded into [-δs,
δs]. In block 406, the correlation cs(δ') is upsampled in this interval to a
resolution of 1/8 sample using Hamming-windowed sine interpolation of a
length equal to 65 samples. The shift δ corresponding to the maximum value
of the upsampled correlation is then the optima! shift in a fractional resolution.
After finding this optimal shift, the weighted speech segment ws(k) is
recalculated in the solved fractional resolution in block 407. That is, the
precise new starting instant of the segment is updated as ts := ts - δ + δI,
where δI = ⌈δ⌉. Further, the residual segment rs(k) corresponding to the
weighted speech segment ws(k) in fractional resolution is computed from the
residual signal r(t) at this point, again using the sinc interpolation described
before (block 407). Since the fractional part of the optimal shift is incorporated
into the residual and weighted speech segments, all subsequent
computations can be implemented with the upward-rounded shift δI = ⌈δ⌉.
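The two-stage search described above (integer maximum first, then a 1/8-sample refinement of the correlation itself) can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the correlation is taken as a plain sum of products, the kernel is a generic Hamming-windowed sinc with 65 taps at the upsampled rate, and the names `windowed_sinc`, `integer_correlation`, `fractional_shift` and the Gaussian test pulse are hypothetical.

```python
import math

UP = 8      # upsampling factor: 1/8-sample resolution
HALF = 32   # a 65-tap kernel spans +/-32 taps at the upsampled rate

def windowed_sinc(m):
    """Tap m (in 1/8-sample units) of a Hamming-windowed sinc kernel."""
    if abs(m) > HALF:
        return 0.0
    x = m / UP
    sinc = 1.0 if m == 0 else math.sin(math.pi * x) / (math.pi * x)
    return sinc * (0.54 + 0.46 * math.cos(math.pi * m / HALF))

def integer_correlation(segment, target, start, max_shift):
    """Correlation of the segment with the target for each integer shift."""
    return {
        d: sum(s * target[start + d + k] for k, s in enumerate(segment))
        for d in range(-max_shift, max_shift + 1)
    }

def fractional_shift(corr, max_shift):
    """Integer maximum first, then a 1/8-sample search in (d0 - 1, d0 + 1)."""
    d0 = max(corr, key=corr.get)
    best_m, best_val = d0 * UP, corr[d0]
    for m in range(d0 * UP - UP + 1, d0 * UP + UP):
        if abs(m) > max_shift * UP:
            continue  # stay inside [-max_shift, max_shift]
        # Interpolate the correlation (not the signals) at position m / 8.
        val = sum(c * windowed_sinc(m - n * UP) for n, c in corr.items())
        if val > best_val:
            best_m, best_val = m, val
    return best_m / UP

def pulse(t):
    """Smooth synthetic pitch pulse used only for this demonstration."""
    return math.exp(-(t / 1.5) ** 2)

# Target pulse centred at t = 50; the segment starts at instant 40 and its
# pulse sits 1.375 samples too early relative to the target.
target = [pulse(t - 50.0) for t in range(100)]
segment = [pulse(k - 8.625) for k in range(20)]
corr = integer_correlation(segment, target, start=40, max_shift=4)
shift = fractional_shift(corr, max_shift=4)
```

Only the short correlation sequence is interpolated here, which is what keeps the cost low compared with upsampling the segment or the target signal.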
Figure 11 illustrates recalculation of the segment ws(k) in accordance
with block 407 of Figure 10. In this illustrative example, the optimal shift is
searched with a resolution of 1/8 sample by maximizing the correlation, giving
the value δ = -1 3/8 (that is, -11/8). Thus the integer part δI becomes
⌈-1 3/8⌉ = -1 and the fractional part 3/8. Consequently, the starting instant
of the segment is updated as ts := ts + 3/8. In Figure 11, the new samples of
ws(k) are indicated with gray dots.
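The arithmetic of this example can be checked with exact fractions. The sketch below assumes the shift value of the example is read as minus one and three eighths (-11/8), which is the reading consistent with the stated integer part -1 and update of +3/8; the function name `update_segment_start` is hypothetical.

```python
from fractions import Fraction
import math

def update_segment_start(ts, delta):
    """Apply ts := ts - delta + ceil(delta) and return (new_ts, ceil(delta))."""
    delta_int = math.ceil(delta)       # upward-rounded shift
    new_ts = ts - delta + delta_int    # only the fractional part moves ts
    return new_ts, delta_int

new_ts, delta_int = update_segment_start(Fraction(0), Fraction(-11, 8))
print(new_ts, delta_int)  # 3/8 -1
```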
If the logic block 106, which will be disclosed later, permits signal
modification to continue, the final task is to update the modified residual signal r(t)
by copying the current residual signal segment rs(k) into it (block 411):


Since shifts in successive segments are independent of each other, the
segments positioned into r(t) either overlap or have a gap between them.
Straightforward weighted averaging can be used for overlapping segments.
Gaps are filled by copying neighboring samples from the adjacent segments.
Since the number of overlapping or missing samples is usually small and the
segment boundaries occur at low-energy regions of the residual signal,
usually no perceptual artifacts are caused. It should be noted that no
continuous signal warping as described in [2], [6], [7] is employed; instead,
modification is done discontinuously by shifting pitch cycle segments in
order to reduce the complexity.
[2] W.B. Kleijn, R.P. Ramachandran, and P. Kroon,
"Interpolation of the pitch-predictor parameters in analysis-by-
synthesis speech coders," IEEE Transactions on Speech and
Audio Processing, Vol. 2, No. 1, pp. 42-54, 1994.
[6] Patent Application WO 00/11653, "Speech encoder with
continuous warping combined with long term prediction,"
Conexant Systems Inc., (Y. Gao), Filing Date 24 Aug. 1999.
[7] Patent Application WO 00/11654, "Speech encoder
adaptively applying pitch preprocessing with continuous
warping," Conexant Systems Inc., (H. Su and Y. Gao), Filing
Date 24 Aug. 1999.
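The handling of overlaps and gaps between independently shifted segments can be sketched as follows. This is an illustration only: equal weights are assumed for the averaging, gaps are filled from the preceding sample (or from the following one at the very start), segment starts are taken as integers because the fractional part is already absorbed into the segment samples, and the name `assemble_segments` is hypothetical.

```python
def assemble_segments(length, segments):
    """Place (start, samples) segments into one signal of the given length.

    Overlapping samples are averaged with equal weights (an assumption);
    gaps are filled by copying the nearest neighboring sample.
    """
    total = [0.0] * length
    count = [0] * length
    for start, samples in segments:
        for k, value in enumerate(samples):
            i = start + k
            if 0 <= i < length:
                total[i] += value
                count[i] += 1
    out = [total[i] / count[i] if count[i] else None for i in range(length)]
    # Fill gaps from the preceding sample.
    for i in range(1, length):
        if out[i] is None:
            out[i] = out[i - 1]
    # A gap at the very start has no predecessor: copy from the right.
    for i in range(length - 2, -1, -1):
        if out[i] is None:
            out[i] = out[i + 1]
    return out

# Segments 1 and 2 overlap at index 2; index 4 is a one-sample gap.
modified = assemble_segments(
    7, [(0, [1.0, 2.0, 3.0]), (2, [5.0, 6.0]), (5, [9.0, 9.0])]
)
print(modified)  # [1.0, 2.0, 4.0, 6.0, 6.0, 9.0, 9.0]
```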

Processing of the subsequent pitch cycle segments follows the
above-disclosed procedure, except that the target signal w(t) in block 405 is formed
differently than for the first segment. The samples of w(t) are first replaced
with the modified weighted speech samples as

This procedure is illustrated in Figure 11. Then the samples following the
updated segment are also updated,


The update of target signal w(t) ensures higher correlation between
successive pitch cycle segments in the modified speech signal considering
the delay contour d(t) and thus more accurate long term prediction. While
processing the last segment of the frame, the target signal w(t) does not
need to be updated.
The shifts of the first and the last segments in the frame are special
cases which have to be performed particularly carefully. Before shifting the
first segment, it should be ensured that no high power regions exist in the
residual signal r(t) close to the frame boundary tn-1, because shifting such a
segment may cause artifacts. The high power region is searched by squaring
the residual signal r(t) as


where g0 = (p(tn-1)/2). If the maximum of E0(k) is detected close to the frame
boundary in the range [tn-1 - 2, tn-1 + 2], the allowed shift is limited to 1/4
sample. If the proposed shift |δ| for the first segment is smaller than this limit,
the signal modification procedure is enabled in the current frame, but the first
segment is kept intact.
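The check for the first segment can be sketched as below. Several points are assumptions made only for illustration: the search window is taken as symmetric around the frame boundary with a caller-supplied half width, the outcome labelled "reject" (a larger proposed shift next to a high power region) is inferred rather than stated in the text, and the function name and return labels are hypothetical.

```python
def first_segment_decision(residual, boundary, half_window, proposed_shift,
                           guard=2, small_shift=0.25):
    """Decide how to treat the first segment of a frame.

    Returns "shift" when no high power region lies near the boundary,
    "keep_intact" when one does but the proposed shift is below 1/4 sample,
    and "reject" otherwise (assumed outcome, not stated in the text).
    """
    lo = max(0, boundary - half_window)
    hi = min(len(residual) - 1, boundary + half_window)
    # Position of the maximum of the squared residual in the search window.
    peak = max(range(lo, hi + 1), key=lambda k: residual[k] ** 2)
    if abs(peak - boundary) > guard:
        return "shift"
    if abs(proposed_shift) < small_shift:
        return "keep_intact"
    return "reject"

quiet = [0.1] * 40
quiet[12] = 3.0          # pulse well away from the boundary at index 20
risky = [0.1] * 40
risky[21] = -3.0         # pulse one sample after the boundary

print(first_segment_decision(quiet, 20, 10, 1.0))    # shift
print(first_segment_decision(risky, 20, 10, 0.125))  # keep_intact
print(first_segment_decision(risky, 20, 10, 1.0))    # reject
```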
The last segment in the frame is processed in a similar manner. As
was described in the foregoing description, the delay contour d(t) is selected
such that in principle no shifts are required for the last segment. However,
because the target signal is repeatedly updated during signal modification
considering correlations between successive segments in Equations (16) and
(17), it is possible the last segment has to be shifted slightly. In the illustrative
embodiment, this shift is always constrained to be smaller than 3/2 samples. If
there is a high power region at the frame end, no shift is allowed. This
condition is verified by using the squared residual signal

where g1 = p(tn). If the maximum of E1(k) is attained for k larger than or equal
to tn - 4, no shift is allowed for the last segment. Similarly as for the first
segment, when the proposed shift |δ| < 1/4, the current frame is still accepted
for modification, but the last segment is kept intact.
It should be noted that, contrary to the known signal modification
methods, the shift does not translate to the next frame, and every new frame
starts perfectly synchronized with the original input signal. As another
fundamental difference particularly to RCELP coding, the illustrative
embodiment of signal modification method processes a complete speech
frame before the subframes are coded. Admittedly, subframe-wise
modification makes it possible to compose the target signal for every subframe using
the previously coded subframe, potentially improving the performance. This
approach cannot be used in the context of the illustrative embodiment of the
signal modification method since the allowed time asynchrony at the frame
end is strictly constrained. Nevertheless, the update of the target signal with
Equations (15) and (16) gives practically speaking equal performance with the
subframe-wise processing, because modification is enabled only on smoothly
evolving voiced frames.
Mode Determination Logic Incorporated into the Signal
Modification Procedure
The illustrative embodiment of signal modification method according
to the present invention incorporates an efficient classification and mode
determination mechanism as depicted in Figure 2. Every operation performed
in blocks 101, 103 and 105 yields several indicators quantifying the attainable
performance of long term prediction in the current frame. If any of these
indicators is outside its allowed limits, the signal modification procedure is
terminated by one of the logic blocks 102, 104, or 106. In this case, the
original signal is preserved intact.
The pitch pulse search procedure 101 produces several indicators on
the periodicity of the present frame. Hence the logic block 102 analyzing
these indicators is the most important component of the classification logic.
The logic block 102 compares the difference between the detected pitch pulse
positions and the interpolated open-loop pitch estimate using the condition


and terminates the signal modification procedure if this condition is not met.
The selection of the delay contour d(t) in block 103 also gives
additional information on the evolution of the pitch cycles and the periodicity of
the current speech frame. This information is examined in the logic block 104.
The signal modification procedure is continued from this block 104 only if the
delay change |dn - dn-1| stays within its allowed limit; that is, only a small
delay change is tolerated for classifying the current frame as a purely voiced
frame. The logic block 104 also evaluates the success of the delay selection
loop of Table 1 by examining the difference |KC - T0| for the selected delay
parameter value dn. If this difference is greater than one sample, the signal
modification procedure is terminated.
For guaranteeing a good quality for the modified speech signal, it is
advantageous to constrain shifts done for successive pitch cycle segments in
block 105. This is achieved in the logic block 106 by imposing the criteria

to all segments of the frame. Here δ(s) and δ(s-1) are the shifts done for the sth
and (s-1)th pitch cycle segments, respectively. If the thresholds are
exceeded, the signal modification procedure is interrupted and the original
signal is maintained.
When the frames subjected to signal modification are coded at a low
bit rate, it is essential that the shape of pitch cycle segments remains similar
over the frame. This allows faithful signal modeling by long term prediction
and thus coding at a low bit rate without degrading the subjective quality. The
similarity of successive segments can be quantified simply by the normalized
correlation

between the current segment and the target signal at the optimal shift after
the update of ws(k) in block 407 of Figure 10. The normalized correlation gs is
also referred to as pitch gain.
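A normalized correlation of this kind can be sketched as follows. The formula used is the usual textbook definition (inner product divided by the geometric mean of the two energies) and is an assumption, since the equation itself is not reproduced in this text; the integer start index and the name `pitch_gain` are likewise hypothetical.

```python
import math

def pitch_gain(segment, target, start):
    """Normalized correlation between a segment and the target at `start`."""
    ref = target[start:start + len(segment)]
    num = sum(a * b for a, b in zip(segment, ref))
    den = math.sqrt(sum(a * a for a in segment) * sum(b * b for b in ref))
    return num / den if den > 0.0 else 0.0

target = [0.0, 1.0, 2.0, 3.0, 0.0]
same_shape = pitch_gain([2.0, 4.0, 6.0], target, 1)   # scaled copy of target
unrelated = pitch_gain([1.0, -1.0], [1.0, 1.0], 0)    # orthogonal signals
```

Because of the normalization, a segment that matches the target in shape yields a gain close to one regardless of its amplitude, which is what makes the measure usable as a similarity threshold in logic block 106.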
Shifting of the pitch cycle segments in block 105 maximizing their
correlation with the target signal enhances the periodicity and yields a high
pitch prediction gain if the signal modification is useful in the current frame.
The success of the procedure is examined in the logic block 106 using the
criteria

If this condition is not fulfilled for all segments, the signal modification
procedure is terminated (block 409) and the original signal is kept intact.
When this condition is met (block 106), the signal modification continues in
block 411. The pitch gain gs is computed in block 408 between the
recalculated segment ws(k) from block 407 and the target signal w(t) from
block 405. In general, a slightly lower gain threshold can be allowed on male
voices with equal coding performance. The gain thresholds can be changed in
different operation modes of the encoder for adjusting the usage percentage
of the signal modification mode and thus the resulting average bit rate.

Mode Determination Logic for a Source-controlled Variable Bit
Rate Speech Codec
This section discloses the use of the signal modification procedure as
a part of the general rate determination mechanism in a source-controlled
variable bit rate speech codec. This functionality is immersed into the
illustrative embodiment of the signal modification method, since it provides
several indicators on signal periodicity and the expected coding performance
of long term prediction in the present frame. These indicators include the
evolution of pitch period, the fitness of the selected delay contour for
describing this evolution, and the pitch prediction gain attainable with signal
modification. If the logic blocks 102, 104 and 106 shown in Figure 2 enable
signal modification, long term prediction is able to model the modified speech
frame efficiently facilitating its coding at a low bit rate without degrading
subjective quality. In this case, the adaptive codebook excitation has a
dominant contribution in describing the excitation signal, and thus the bit rate
allocated for the fixed-codebook excitation can be reduced. When a logic
block 102, 104 or 106 disables signal modification, the frame is likely to
contain a non-stationary speech segment such as a voiced onset or rapidly
evolving voiced speech signal. These frames typically require a high bit rate
for sustaining good subjective quality.
Figure 12 depicts the signal modification procedure 603 as a part of
the rate determination logic that controls four coding modes. In this illustrative
embodiment, the mode set comprises a dedicated mode for non-active
speech frames (block 508), unvoiced speech frames (block 507), stable
voiced frames (block 506), and other types of frames (block 505). It should be
noted that all these modes except the mode for stable voiced frames 506 are
implemented in accordance with techniques well known to those of ordinary
skill in the art.
The rate determination logic is based on signal classification done in
three steps in logic blocks 501, 502, and 504, of which the operation of
blocks 501 and 502 is well known to those of ordinary skill in the art.
First, a voice activity detector (VAD) 501 discriminates between active
and inactive speech frames. If an inactive speech frame is detected, the
speech signal is processed according to mode 508.
If an active speech frame is detected in block 501, the frame is
subjected to a second classifier 502 dedicated to making a voicing decision. If
the classifier 502 rates the current frame as unvoiced speech signal, the
classification chain ends and the speech signal is processed in accordance
with mode 507. Otherwise, the speech frame is passed through to the signal
modification module 603.
The signal modification module itself then provides a decision on
enabling or disabling the signal modification of the current frame in a logic
block 504. This decision is in practice made as an integral part of the signal
modification procedure in the logic blocks 102, 104 and 106 as explained
earlier with reference to Figure 2. When signal modification is enabled, the
frame is deemed a stable voiced, or purely voiced, speech segment.
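The three-step classification chain of Figure 12 can be summarized in a few lines. The sketch below only mirrors the order of the decisions and the block numbers named in the text; the boolean inputs stand in for the VAD 501, the voicing classifier 502 and the signal modification decision 504, and the function name is hypothetical.

```python
def select_coding_mode(is_active, is_voiced, modification_enabled):
    """Return the block number of the coding mode chosen for one frame."""
    if not is_active:
        return 508   # non-active speech frame
    if not is_voiced:
        return 507   # unvoiced speech frame
    if modification_enabled:
        return 506   # stable voiced frame, signal modification enabled
    return 505       # any other frame type

print(select_coding_mode(False, False, False))  # 508
print(select_coding_mode(True, False, False))   # 507
print(select_coding_mode(True, True, True))     # 506
print(select_coding_mode(True, True, False))    # 505
```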
When the rate determination mechanism selects mode 506, the signal
modification mode is enabled and the speech frame is encoded in accordance
with the teachings of the previous sections. Table 2 discloses the bit allocation
used in the illustrative embodiment for the mode 506. Since the frames to be
coded in this mode are characteristically very periodic, a substantially lower
bit rate suffices for sustaining good subjective quality compared, for instance,
to transition frames. Signal modification also allows efficient coding of the
delay information using only nine bits per 20-ms frame, saving a considerable
proportion of the bit budget for other parameters. Good performance of long
term prediction allows the use of only 13 bits per 5-ms subframe for the
fixed-codebook excitation without sacrificing the subjective speech quality. The
fixed codebook comprises one track with two pulses, both having 64 possible
positions.
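The figures quoted above can be cross-checked with a little arithmetic. In the sketch below, the split of the 13 bits into 12 position bits plus one remaining bit follows from the stated 64 positions per pulse; treating that remaining bit as sign information is an assumption, and the totals cover only the two parameters mentioned in this paragraph, not the complete bit allocation of mode 506.

```python
import math

POSITIONS_PER_PULSE = 64
PULSES_PER_TRACK = 2
BITS_PER_SUBFRAME = 13          # fixed-codebook bits stated in the text
DELAY_BITS_PER_FRAME = 9        # delay information stated in the text
SUBFRAMES_PER_FRAME = 20 // 5   # 20-ms frame, 5-ms subframes

bits_per_position = int(math.log2(POSITIONS_PER_PULSE))        # 6
position_bits = PULSES_PER_TRACK * bits_per_position           # 12
remaining_bits = BITS_PER_SUBFRAME - position_bits             # 1 (assumed sign)
fixed_codebook_bits = BITS_PER_SUBFRAME * SUBFRAMES_PER_FRAME  # 52 per frame
partial_rate = (fixed_codebook_bits + DELAY_BITS_PER_FRAME) * 50  # bit/s

print(position_bits, remaining_bits, fixed_codebook_bits, partial_rate)
# 12 1 52 3050
```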


The other coding modes 505, 507 and 508 are implemented following
known techniques. Signal modification is disabled in all these modes. Table 3
shows the bit allocation of the mode 505 adopted from the AMR-WB standard.
The technical specifications [11] and [12] related to the AMR-WB
standard are enclosed here as references on the comfort noise and VAD
functionalities in 508 and 501, respectively.

[11] 3GPP TS 26.192, "AMR Wideband Speech Codec:
Comfort Noise Aspects," 3GPP Technical Specification.
[12] 3GPP TS 26.193, "AMR Wideband Speech Codec: Voice
Activity Detector (VAD)," 3GPP Technical Specification.
In summary, the present specification has described a frame
synchronous signal modification method for purely voiced speech frames, a
classification mechanism for detecting frames to be modified, and the use of
these methods in a source-controlled CELP speech codec in order to enable
high-quality coding at a low bit rate.
The signal modification method incorporates a classification
mechanism for determining the frames to be modified. This differs from prior
signal modification and preprocessing means in operation and in the
properties of the modified signal. The classification functionality embedded
into the signal modification procedure is used as a part of the rate
determination mechanism in a source-controlled CELP speech codec.
Signal modification is done pitch and frame synchronously, that is,
adapting one pitch cycle segment at a time in the current frame such that a
subsequent speech frame starts in perfect time alignment with the original
signal. The pitch cycle segments are limited by frame boundaries. This feature
prevents time shift translation over frame boundaries, simplifying encoder
implementation and reducing the risk of artifacts in the modified speech signal.
Since time shift does not accumulate over successive frames, the signal
modification method disclosed does not need long buffers for accommodating
expanded signals nor a complicated logic for controlling the accumulated time
shift. In source-controlled speech coding, it simplifies multi-mode operation
between signal modification enabled and disabled modes, since every new
frame starts in time alignment with the original signal.
Of course, many other modifications and variations are possible. In
view of the above detailed illustrative description of the present invention and
associated drawings, such other modifications and variations will now become
apparent to those of ordinary skill in the art. It should also be apparent that
such other variations may be effected without departing from the spirit and
scope of the present invention.

We claim:
1. A method of forming a delay contour characterising a long term prediction in a technique using signal modification
for digitally encoding a speech signal, the method comprising:
dividing the speech signal into a series of successive frames;
locating a pitch pulse of the speech signal in a previous frame; and
locating a corresponding pitch pulse of the speech signal in a current frame;
characterised by forming a delay contour by selecting a long term prediction delay parameter for the current frame
by iterating backwards a function of a temporary time variable, from the location of the pitch pulse of the speech
signal in the current frame towards the location of the corresponding pitch pulse of the speech signal in the previous
frame.
2. A method as claimed in claim 1, comprising:
forming the delay contour as a function of distances of successive pitch pulses between a last pitch pulse of
the previous frame and a last pitch pulse of the current frame.
3. A method as claimed in claim 1 or claim 2, further comprising:
fully characterising the delay contour with a long-term-prediction delay parameter of the previous frame and
the long-term-prediction delay parameter of the current frame.
4. A method as claimed in claim 3, wherein forming the delay contour comprises:
nonlinearly interpolating the delay contour between the long-term-prediction delay parameter of the previous
frame and the long-term-prediction delay parameter of the current frame.
5. A method as claimed in claim 3, wherein forming the delay contour comprises:
determining a piecewise linear delay contour between the long-term-prediction delay parameter of the previous
frame and the long-term-prediction delay parameter of the current frame.
6. A method as claimed in any preceding claim, wherein locating a pitch pulse comprises deriving a linear prediction
residual signal from the speech signal.
7. A method as claimed in any of claims 1 to 5, wherein locating a pitch pulse comprises deriving a weighted speech
signal from the speech signal.
8. A method as claimed in any of claims 1 to 5, wherein locating a pitch pulse comprises deriving a synthesised
weighted speech signal from the speech signal.
9. A method as claimed in any preceding claim, wherein the backwards iteration comprises searching for a long term
prediction delay parameter value in plural phases and beginning with a long term prediction delay parameter value
predicted for the end of the current frame, each successive phase having increased resolution and a more focused
search range.
10. A method as claimed in claim 9, comprising predicting the long term prediction delay parameter value as being
equal to the difference between the long term prediction delay parameter value at the end of the previous frame
and twice the difference between the locations of the pitch pulses of the speech signal in the previous and current
frames divided by the number of iterations of the function.

11. A method as claimed in any preceding claim, comprising modifying the speech signal by shifting pitch cycle segments
one by one to adjust them to the delay contour.
12. A method as claimed in claim 11, comprising determining a segment shift by correlating a segment in the weighted
speech domain with a target signal.
13. A method as claimed in claim 12, comprising composing the target signal using the synthesised weighted speech
signal of the previous frame and any preceding, shifted segments in the current frame.

14. A device (603) for forming a delay contour characterising a long term prediction in a technique using signal modification
for digitally encoding a speech signal, the device comprising:
a divider of the speech signal into a series of successive frames;
a detector of a location of a pitch pulse of the speech signal in a previous frame; and
a detector of a location of a corresponding pitch pulse of the speech signal in a current frame,
characterised by a former of a delay contour for selecting a long term prediction delay parameter for the current
frame by backwards iteration of a function of a temporary time variable, from the location of the pitch pulse of the
speech signal in the current frame towards the location of the corresponding pitch pulse of the speech signal in the
previous frame.
15. A device as claimed in claim 14, wherein the former is:
a calculator of the long-term-prediction delay parameter as a function of distances of successive pitch pulses
between the last pitch pulse of the previous frame and the last pitch pulse of the current frame.
16. A device as claimed in claim 14 or claim 15, further incorporating:
a function fully characterising the delay contour with a long-term- prediction delay parameter of the previous
frame and the long-term-prediction delay parameter of the current frame.
17. A device as claimed in claim 16, wherein the former is:
a selector of a nonlinearly interpolated delay contour between the long-term-prediction delay parameter of the
previous frame and the long-term-prediction delay parameter of the current frame.
18. A device as claimed in claim 16, wherein the former is:
a selector of a piecewise linear delay contour determined from the long-term-prediction delay parameter of the
previous frame and the long-term-prediction delay parameter of the current frame.
19. A device as claimed in any of claims 14 to 18, wherein the former is a searcher of a long term prediction delay
parameter value by backwards iteration in plural phases and beginning with a long term prediction delay parameter
value predicted for the end of the current frame, each successive phase having increased resolution and a more
focused search range.
20. A device as claimed in claim 19, comprising a predictor of the long term prediction delay parameter value as being
equal to the difference between the long term prediction delay parameter value at the end of the previous frame
and twice the difference between the locations of the pitch pulses of the speech signal in the previous and current
frames divided by the number of iterations of the function.
21. A device as claimed in any of claims 14 to 20, comprising a modifier of the speech signal by shifting pitch cycle
segments one by one to adjust them to the delay contour.
22. A device as claimed in claim 21, comprising a determiner of a segment shift by correlating a segment in the weighted
speech domain with a target signal.
23. A device as claimed in claim 22, comprising a composer of the target signal using a synthesised weighted speech
signal of the previous frame and any preceding, shifted segments in the current frame.


Patent Number 246541
Indian Patent Application Number 830/KOLNP/2004
PG Journal Number 09/2011
Publication Date 04-Mar-2011
Grant Date 03-Mar-2011
Date of Filing 16-Jun-2004
Name of Patentee NOKIA CORPORATION
Applicant Address KEILALAHDENTIE 4, FIN-02150 ESPOO
Inventors:
# Inventor's Name Inventor's Address
1 TAMMI MIKKO KEMIANKATU 9 E 51 33720 TAMPERE
2 LAFLAMME CLAUDE 294 CHEMIN DEPOT, ORFORD, QUEBEC,J1X, 6W1
3 RUOPPILA VESA 3913 MENTANA STREET, MONTREAL, QUEBEC, H2L 3R7
4 JELINEK MILAN 245 MERRILL PARK, NORTH HATLEY, QUEBEC JOB 2C0
PCT International Classification Number G01L 19/08
PCT International Application Number PCT/CA2002/001948
PCT International Filing date 2002-12-13
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 2,365,203 2001-12-14 Canada