Title of Invention

A METHOD AND SYSTEM OF SPEAKER VERIFICATION OR IDENTIFICATION

Abstract A method of speaker verification or identification that uses a cohort including an enrolment model for each of a predetermined number of speakers, each enrolment model representing at least one speech sample for each speaker, the method comprising: capturing a test speech sample from a speaker claiming to be one of the enrolled speakers; modelling the test sample to provide a test model, and classifying the test model by always matching it with one of the enrolment models of the cohort associated with the speaker so that a false acceptance rate for the test sample is determined by the cohort size, wherein the steps of modelling and/or classifying are such that a false rejection rate for the test sample is substantially zero.
Full Text

The present invention relates to systems, methods and apparatus for performing speaker recognition.

Speaker recognition encompasses the related fields of speaker verification and speaker identification. The main objective is to confirm the claimed identity of a speaker from his/her utterances, known as verification, or to recognise the speaker from his/her utterances, known as identification. Both use a person's voice as a biometric measure and assume a unique relationship between the utterance and the person producing the utterance. This unique relationship makes both verification and identification possible. Speaker recognition technology analyses a test utterance and compares it to a known template or model for the person being recognised or verified. The effectiveness of the system is dependent on the quality of the algorithms used in the process.

Speaker recognition systems have many possible applications. In accordance with a further aspect of the present invention, speaker recognition technology may be used to permanently mark an electronic document with a biometric print for every person who views or edits the content. This produces an audit trail identifying all of the users and the times of access and modification. As the user mark is biometric, it is very difficult for the user to dispute the authenticity of the mark.
11
Other biometric measures may provide the basis for possible recognition systems, such as iris scanning, fingerprinting and facial features. These measures all require additional hardware for recording, whereas speaker recognition can be used with any voice input, such as over a telephone line or using a standard multi-media personal computer with no modification. The techniques can be used in conjunction with other security measures and other biometrics for increased security. From the point of view of a user, the operation of the system is very simple.

For example, when an on-line document is requested, the person requiring access will be asked to give a sample of their speech. This will be a simple prompt from the client software ('please say this phrase...') or something similar. The phrase uttered will then be sent to a database server or to a speech recognition server, via any data network such as an intranet, to be associated with the document and stored as the key used to activate the document at that particular time. A permanent record for a document can therefore be produced, over time, providing an audit trail for the document. The speaker authentication server may maintain a set of templates (models) for all currently enrolled persons and a historical record of previously enrolled persons.

Speaker recognition systems rely on extracting some unique features from a person's speech. This in turn depends on the manner in which human speech is produced using the vocal tract and the nasal tract. For practical purposes, the vocal tract and nasal tract can be regarded as two connected pipes, which can resonate in a manner similar to a musical instrument. The resonances produced depend on the diameter and length of the pipes. In the human speech production mechanism, these diameters and to some extent the length of the pipe sections can be modified by the articulators, typically the positions of the tongue, the jaw, the lips and the soft palate (velum). These resonances in the spectrum are called the formant frequencies. There are normally around four formant frequencies in a typical voice spectrum.

As with musical instruments, sound will only be produced when a constriction of the airflow occurs, causing either vibration or turbulence. In human speech, the major vibrations occur when the constriction occurs at the glottis (vocal cords). When this happens, voiced speech is produced, typically vowel-like sounds. When the constriction is in the mouth, caused by the tongue or teeth, a turbulence (a hissing type of sound) is produced and the speech produced is called a fricative, typified by "s", "sh", "th" etc. From an engineering point of view, this is similar to a source signal (the result of the constriction) being applied to a filter which has the general characteristics (i.e. the same resonances) of the vocal tract, the resulting output signal being the speech sound. True speech is produced by dynamically varying the positions of the articulators.

All existing speaker recognition systems perform similar computational steps. They operate by creating a template or model for an enrolled speaker. The model is created by two main steps applied to a speech sample, namely spectral analysis and statistical analysis. Subsequent recognition of an input speech sample is performed by modelling the input sample (test utterance) in the same way as during speaker enrolment, and pattern/classification matching of the input model against a database of enrolled speakers. Existing systems vary in the approach taken when performing some or all of these steps. In conventional (industry standard) systems, the spectral analysis is either Linear Predictive Coding (LPC)/Cepstral analysis ("LPCC") or FFT/sub-banding. This is followed by a statistical analysis technique, usually a technique called Hidden Markov Modelling (HMM), and the classification step is a combination of a match against the claimed speaker model and against an "impostor cohort" or "world model" (i.e. a set of other speaker models).

To allow efficient processing of speech samples, all speaker recognition systems use time slices called frames, where the utterance is split into frames and each frame is processed in turn. Frames may or may not be of equal size and may or may not overlap. An example of a typical time signal representation of a speech utterance divided into frames is illustrated in Fig. 1 of the accompanying drawings. A generic speaker recognition system is shown in block diagram form in Fig. 2, illustrating a test utterance being processed, through an input filter 10, a spectral analysis (LPCC) stage 12 and a statistical analysis (HMM) stage 14, followed by score normalisation and speaker classification 16, by thresholding, employing a database 18 of speaker models (enrolled speaker data-set), before generating a decision as to the identity of the speaker (identification) or the veracity of the speaker's claimed identity (verification).

Such systems have a number of disadvantages or limitations. Firstly, conventional spectral analysis techniques produce a limited and incomplete feature set and therefore poor modelling. Secondly, HMM techniques are "black-box" methods, which combine good performance with relative ease of use, but at the expense of transparency. The relative importance of features extracted by the technique is not visible to the designer. Thirdly, the nature of the HMM models does not allow model-against-model comparisons to be made effectively. Accordingly, important structural detail contained within the enrolled speaker data-set cannot be analysed and used effectively to improve system performance. Fourthly, HMM technology uses temporal information to construct the model and is therefore vulnerable to mimics, who impersonate others' voices by temporal variations in pitch etc. Fifthly, the world model/impostor cohort employed by the system cannot easily be optimised for the purpose of testing an utterance by a claimed speaker.

The performance of a speaker recognition system relies on the fact that when a true speaker utterance is tested against a model for that speaker it will produce a score which is lower than the score produced when an impostor utterance is tested against the same model. This allows an accept/reject threshold to be set. Consecutive tests by the true speaker will not produce identical scores. Rather, the scores will form a statistical distribution. However, the mean of the true speaker distribution will be considerably lower than the means of impostor distributions tested against the same model. This is illustrated in Fig. 3, where 25 scores are plotted for each of eight speakers, speaker 1 being the true speaker. It can be seen from Fig. 3 that the scores of some speakers are closer to the true speaker than others and can be problematic.

The present invention relates to improved speaker recognition methods and systems which provide improved performance in comparison with conventional systems. In various aspects, the invention provides improvements including but not limited to: improved spectral analysis, transparency in its statistical analysis, improved modelling, models that can be compared allowing the data-set structure to be analysed and used to improve system performance, improved classification methods, and the use of statistically independent/partially independent parallel processes to improve system performance.

The invention further embraces computer programs for implementing the methods and systems of the invention, data carriers and storage media encoded with such programs, data processing devices and systems adapted to implement the methods and systems, and data processing systems and devices incorporating the methods and systems.

The various aspects and preferred features of the invention are defined in the Claims appended hereto.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Fig. 1 is a time signal representation of an example of a speech utterance divided into frames;
Fig. 2 is a block diagram of a generic, prior art speaker recognition system;
Fig. 3 is a plot of speaker recognition score distributions for a number of speakers tested against one of the speakers, obtained using a conventional speaker recognition system;
Fig. 4 is a block diagram illustrating a first embodiment of the present invention;
Fig. 5 is a block diagram illustrating a second embodiment of the present invention;
Fig. 6 is a block diagram illustrating a third embodiment of the present invention;
Fig. 7 is a block diagram illustrating a further embodiment of a speaker recognition system in accordance with the present invention;
Fig. 8(a) is a time signal representation of an example of a speech utterance divided into frames and Fig. 8(b) shows the corresponding frequency spectrum and smoothed frequency spectrum of one frame thereof;
Fig. 9 illustrates the differences between the frequency spectra of two mis-aligned frames;
Fig. 10 shows the distribution of accumulated frame scores plotted against their frequency of occurrence;
Fig. 11(a) shows the same accumulated score distributions as Fig. 3 for comparison with Fig. 11(b), which shows corresponding accumulated score distributions obtained using a speaker recognition system in accordance with the present invention;
Fig. 12 illustrates the results of model against model comparisons as compared with actual test scores, obtained using a system in accordance with the present invention;
Fig. 13 illustrates the distribution of speaker models used by a system in accordance with the present invention in a two-dimensional representation of a multi-dimensional dataspace;
Fig. 14 illustrates the use of an impostor cohort as used in a system in accordance with the present invention;
Fig. 15 is a block diagram illustrating a normalisation process in accordance with one aspect of the present invention;
Fig. 16 is a block diagram illustrating an example of a wide area user authentication system in accordance with the present invention;
Fig. 17 is a block diagram illustrating the corruption of a speech signal by various noise sources and channel characteristics in the input channel of a speaker recognition system;
Figs. 18 and 19 illustrate the effects of noise and channel characteristics on test utterances and enrolment models in a speaker recognition system; and
Fig. 20 illustrates a channel normalisation method in accordance with one aspect of the present invention.

The present invention includes a number of aspects and features which may be combined in a variety of ways in order to provide improved speaker recognition (verification and/or identification) systems. Certain aspects of the invention are concerned with the manner in which speech samples are modelled during speaker enrolment and during subsequent recognition of input speech samples. Other aspects are concerned with the manner in which input speech models are classified in order to reach a decision regarding the identity of the speaker. A further aspect is concerned with normalising speech signals input to speaker recognition systems (channel normalisation). Still further aspects concern applications of speaker recognition systems.

Referring now to the drawings, Figs. 4 to 6 illustrate the basic architectures used in systems embodying various aspects of the invention. It will be understood that the inputs to all of the embodiments of the invention described herein are digital signals comprising speech samples which have previously been digitised by any suitable means (not shown), and all of the filters and other modules referred to are digital.

In Fig. 4, a speech sample is input to the system via a channel normalisation module 200 and a filter 24. Instead of or in addition to this "front-end" normalisation, channel normalisation may be performed at a later stage of processing the speech sample, as shall be discussed further below. The sample would be divided into a series of frames prior to being input to the filter 24 or at some other point prior to feature extraction. In some embodiments, as discussed further below, a noise signal 206 may be added to the filtered signal (or could be added prior to the filter 24). The sample data are input to a modelling (feature extraction) module 202, which includes a spectral analysis module 26 and (at least in the case of speech sample data being processed for enrolment purposes) a statistical analysis module 28. The model (feature set) output from the modelling module 202 comprises a set of coefficients representing the smoothed frequency spectrum of the input speech sample. During enrolment of a speaker, the model is added to a database of enrolled speakers (not shown). During recognition of an input speech sample, the model (feature set) is input to a classification module 110, which compares the model (feature set) with models selected from the database of enrolled speakers. On the basis of this comparison, a decision is reached at 204 so as to identify the speaker or to verify the claimed identity of the speaker. The channel normalisation of the input sample and the addition of the noise signal 206 comprise aspects of the invention, as shall be described in more detail below, and are preferred features of all implementations of the invention. In some embodiments, channel normalisation may be applied following spectral analysis 26 or during the classification process, rather than being applied to the input speech sample prior to processing as shown in Figs. 4 to 6. Novel aspects of the modelling and classification processes in accordance with other aspects of the invention will also be described in more detail below.

Other aspects of the invention involve various types of parallelism in the processing of speech samples for enrolment and/or recognition.

In Fig. 5, the basic operation of the system is the same as in Fig. 4, except that the output from the modelling module 202 is input to multiple, parallel classification processes 110a, 110b ... 110n, and the outputs from the multiple classification processes are combined in order to reach a final decision, as shall be described in more detail below. In Fig. 6, the basic operation of the system is also the same as in Fig. 4, except that the input sample is processed by multiple, parallel modelling processes 202a, 202b ... 202n (typically providing slightly different feature extraction/modelling as described further below), possibly via multiple filters 24a, 24b ... 24n (in this case the noise signal 206 is shown being added to the input signal upstream of the filters 24a, 24b ... 24n), and the outputs from the multiple modelling processes are input to the classification module 110, as shall also be described in more detail below. These types of multiple parallel modelling processes are preferably applied to both enrolment sample data and test sample data.

Multiple parallel modelling processes may also be combined with multiple parallel classification processes; e.g. the input to each of the parallel classification processes 110a-n in Fig. 5 could be the output from multiple parallel modelling processes as shown in Fig. 6.

Various aspects of the invention will now be described in more detail by reference to the modelling, classification and normalisation processes indicated in Figs. 4 to 6.

MODELLING
The spectral analysis modules 26, 26a-n may apply similar spectral analysis methods to those used in conventional speaker recognition systems. Preferably, the spectral analysis applied by the modules 26a-n is of a type that, for each frame of the sample data, extracts a set of feature vectors (coefficients) representing the smoothed frequency spectrum of the frame. This preferably comprises LPC/Cepstral (LPCC) modelling, producing an increased feature set which models the finer detail of the spectra, but may include variants such as delta cepstral or emphasis/de-emphasis of selected coefficients based on a weighting scheme. Similar coefficients may alternatively be obtained by other means such as the Fast Fourier Transform (FFT) or by use of a filter bank.
The complete sample is represented by a matrix consisting of one row of coefficients for each frame of the sample. For the purposes of the preferred embodiments of the present invention, these matrices will each have a size of the order of 1000 (frames) × 24 (coefficients). In conventional systems, a single first matrix of this type, representing the complete original signal, would be subject to statistical analysis such as HMM.
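By way of illustration only, the following is a minimal sketch of one conventional way such a per-frame LPC/Cepstral matrix might be computed: the Levinson-Durbin recursion on each frame's autocorrelation, followed by the standard LPC-to-cepstrum recursion. The function names, the Hamming window and the fixed order of 24 are illustrative assumptions, not details taken from this specification.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients a (A(z) = 1 + sum a_k z^-k) from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                  # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)            # updated prediction error
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Standard recursion from LPC coefficients to cepstral coefficients."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def lpcc_matrix(frames, order=24):
    """One row of cepstral coefficients per frame, e.g. approx. 1000 x 24."""
    windowed = frames * np.hamming(frames.shape[1])
    rows = []
    for frame in windowed:
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        r[0] += 1e-8                    # guard against silent (all-zero) frames
        rows.append(lpc_to_cepstrum(levinson_durbin(r, order), order))
    return np.array(rows)
```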

As will be understood by those skilled in the art, the LP transform effectively produces a set of filter coefficients representing the smoothed frequency spectrum for each frame of the test utterance. The LP filter coefficients are related to Z-plane poles. The Cepstral transform has the effect of compressing the dynamic range of the smoothed spectrum, de-emphasising the LP poles by moving them closer to the Z-plane origin (away from the real frequency axis at $z = e^{j\omega}$). The Cepstral transform uses a log function for this purpose. It will be understood that other similar or equivalent techniques could be used in the spectral analysis of the speech sample in order to obtain a smoothed frequency spectrum and to de-emphasise the poles thereof. This de-emphasis produces a set of coefficients which, when transformed back into the time domain, are less dynamic and more well balanced (the cepstral coefficients are akin to a time signal or impulse response of the LP filter with de-emphasised poles). The log function also transforms multiplicative processes into additive processes.

The model derived from the speech sample may be regarded as a set of feature vectors based on the frequency content of the sample signal. When a feature vector based on frequency content is extracted from a signal, the order of the vector is important. If the order is too low then some important information may not be modelled. To avoid this, the order of the feature extractor (e.g. the number of poles of an LP filter) may be selected to be greater than the expected order. However, this in itself causes problems. Poles which match resonances in the signal give good results, whilst the other resulting coefficients of the feature vector will model spurious aspects of the signal. Accordingly, when this vector is compared with another model or reference, the distance measure computed may be unduly influenced by the values of those coefficients which are modelling spurious aspects of the signal. The distance measure (score) which is returned will thus be inaccurate, possibly giving a poor score for a frame which in reality is a good match.

In accordance with one aspect of the invention, this problem can be obviated or mitigated by adding a noise signal n(t) (206 in Figs. 4-6) having known characteristics to the speech signal s(t) before the signal is input to the modelling process (i.e. the input signal = s(t) + n(t)). The same noise signal would be used during enrolment of speakers and in subsequent use of the system. The addition of the known noise signal has the effect of forcing the "extra" coefficients (above the number actually required) to model a known function and hence to give consistent results which are less problematic during model/test vector comparison. This is particularly relevant for suppressing the effect of noise (channel noise and other noise) during "silences" in the speech sample data. This problem may also be addressed as a consequence of the use of massively overlapping sample frames discussed below.
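As a non-authoritative sketch of this idea, the helper below adds white Gaussian noise generated from a fixed seed, so that exactly the same noise characteristics are applied during enrolment and during every subsequent test. The seed, the SNR value and the choice of white noise are illustrative assumptions; the description above requires only that n(t) has known characteristics.

```python
import numpy as np

def add_reference_noise(speech, snr_db=30.0, seed=1234):
    """Return s(t) + n(t), where n(t) is a fixed, known pseudo-random noise signal.

    The fixed seed makes n(t) reproducible, forcing the 'extra' model
    coefficients to fit a known function rather than incidental noise.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    # Scale the noise to the chosen signal-to-noise ratio (assumed value).
    p_sig = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + noise
```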
As previously mentioned, in order to allow efficient processing of speech samples all speaker recognition systems use time slices called frames, so that the utterance is split into a sequence of frames and each frame is processed in turn. The frames may or may not be of equal size and they may overlap. Models generated by speaker recognition systems thus comprise a plurality of feature sets (vectors corresponding to sets of coefficients) representing a plurality of frames. When models are compared in conventional speaker recognition systems it is necessary to align corresponding frames of the respective models. Different utterances of a given phrase will never be exactly the same length, even when spoken by the same person. Accordingly, a difficulty exists in correctly aligning frames for comparison.

Conventional systems convert the frames into a spectral or smoothed spectral equivalent as shown in Figs. 8(a) (showing a time signal divided into frames) and 8(b) (showing the corresponding frequency spectrum and smoothed frequency spectrum of one of the frames of Fig. 8(a)). The systems then perform further transformations and analysis (such as Cepstral transformation, Vector Quantisation, Hidden Markov Modelling (HMM) and Dynamic Time Warping (DTW)) to obtain the desired result. Frame boundaries can be allocated in many ways, but are usually measured from an arbitrary starting point estimated to be the starting point of the useful speech signal. To compensate for this arbitrary starting point, and also to compensate for the natural variation in the length of similar sounds, techniques such as HMM and DTW are used when comparing two or more utterances, such as when building models or when comparing models with test utterances. The HMM/DTW compensation is generally done at a point in the system following spectral analysis, using whatever coefficient set is used to represent the content of a frame, and does not refer to the original time signal. The alignment precision is thus limited to the size of a frame. In addition, these techniques assume that the alignment of a particular frame will be within a fixed region of an utterance which is within a few frames of where it is expected to lie. This introduces a temporal element to the system, as the estimated alignment of the current frame depends on the alignment of previous frames, and the alignment of subsequent frames depends on the alignment of the present frame. In practice, this means that a particular frame, such as a frame which exists 200 ms into an utterance, will in general only be compared with other frames in the 200 ms region of the model or of other utterances being used to construct a model. This approach derives from speech recognition methods (e.g. speech-to-text conversion), where it is used to estimate a phonetic sequence from a series of frames. The present applicants believe that this approach is inappropriate for speaker recognition, for the following reasons.
A. Most seriously, the conventional approach provides only crude alignment of frames. The arbitrary allocation of starting points means that it will generally not be possible to obtain accurate alignment of the starting points of two respective frames, so that even two frames giving a "best match" may have significantly different spectral characteristics, as illustrated in Fig. 9.

B. Secondly, the conventional approach relies on the temporal sequence of the frames and bases speaker verification on spectral characteristics derived from temporally adjacent frames.

In accordance with a further aspect of the invention, the present enrolment modelling process involves the use of very large frame overlaps, akin to convolution, to avoid problems arising from frame alignment between models (discussed at A. above) and to improve the quality of the model obtained. This technique is applied during speaker enrolment in order to obtain a model, preferably based on repeated utterances of the enrolment phrase. By massively overlapping the frames, the resulting model effectively approaches a model of all possible alignments, with relatively small differences between adjacent frames, thereby providing good modelling of patterns. Preferably, the frame overlap is selected to be at least 80%, more preferably it is in the range 80% to 90%, and may be as high as 95%.
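A minimal sketch of such massive overlapping follows, assuming a frame length of 480 samples (30 ms at a 16 kHz sampling rate, an illustrative figure). With a 90% overlap the hop between successive frames is a tenth of a frame, so the frame set approaches the set of all possible alignments described above.

```python
import numpy as np

def overlapping_frames(signal, frame_len=480, overlap=0.90):
    """Split a signal into massively overlapping frames (rows of the result)."""
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    hop = max(1, int(round(frame_len * (1.0 - overlap))))  # e.g. 48 samples
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])
```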
The frames are transformed into representative coefficients using the LPCC transformation as described above, so that each utterance employed in the reference model generated by the enrolment process is represented by a matrix (typically having a size of the order of 1000 frames by 24 coefficients as previously described). There might typically be ten such matrices representing ten utterances. A clustering or averaging technique such as Vector Quantisation (described further below) is then used to reduce the data to produce the reference model for the speaker. This model does not depend on the temporal order of the frames, addressing the problems described at B. above.

Preferred embodiments of the present invention combine the massive overlapping of frames described above with Vector Quantisation or the like as described below. This provides a mode of operation which is quite different from conventional HMM/DTW systems. In such conventional systems, all frames are considered equally valid and are used to derive a final "score" for thresholding into a yes/no decision, generally by accumulating scores derived by comparing and matching individual frames. The validity of the scores obtained is limited by the accuracy of the frame alignments.

In accordance with this aspect of the present invention, the reference (enrolment) models represent a large number of possible frame alignments. Rather than matching individual frames of a test utterance with individual frames of the reference models and deriving scores for each matched pair of frames, this allows all frames of the test utterance to be compared and scored against every frame of the reference model, giving a statistical distribution of the frequency of occurrence of frame score values. "Good" frame matches will yield low scores and "poor" frame matches will yield high scores (or the converse, depending on the scoring scheme). A test utterance frame tested against a large number of reference models will result in a normal distribution as illustrated in Fig. 10. Most frame scores will lie close to the mean and within a few standard deviations therefrom. Because of the massive overlapping of frames in the reference models, the score distributions will include "best matches" between accurately aligned corresponding frames of the test utterance and reference models. When a test utterance from a particular speaker is tested against the reference model for that speaker, the distribution will thus include a higher incidence of very low scores. This ultimately results in "true speaker" scores being consistently low, due to some parts of the utterance being easily identified as originating from the true speaker while other parts, less obviously from the true speaker, are classified as being from the general population. Impostor frame scores will not produce low scores and will be classified as being from the general population.

That is, in accordance with this aspect of the invention, the reference models comprise sets of coefficients derived for a plurality of massively overlapping frames, and a test utterance is tested by comparing all of the frames of the test utterance with all of the frames of the relevant reference models and analysing the distribution of frame scores obtained therefrom.
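The sketch below illustrates this all-frames-against-all-frames scoring, assuming a simple Euclidean frame distance (the description above does not mandate any particular distance measure). The resulting matrix of frame scores supplies the distribution of Fig. 10.

```python
import numpy as np

def frame_score_matrix(test_feats, model_feats):
    """Score every test frame against every model frame.

    test_feats:  (T, d) matrix of test-utterance frame coefficients.
    model_feats: (M, d) matrix of reference-model frame coefficients.
    Returns a T x M matrix of Euclidean distances; low values are good matches.
    """
    diff = test_feats[:, None, :] - model_feats[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# The true speaker's distribution shows a higher incidence of very low scores:
# scores = frame_score_matrix(test, model)
# best_per_test_frame = scores.min(axis=1)
```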

The massive overlapping of frames applied to speech samples for enrolment purposes may also be applied to input utterances during subsequent speaker recognition, but this is not necessary.

The use of massive overlaps in the enrolment sample data is also beneficial in dealing with problems arising from noise present in periods of silence in the sample data. Such problems are particularly significant for text-independent speaker recognition systems. The existence of silences may or may not cause problems for an individual model or verification attempt, but it will cause deterioration in the overall system performance. The question is therefore how to remove this effect completely or to minimise it. The use of massive frame overlaps in the present invention contains an inherent solution. Consider the equations which describe averaging the frame spectra (discussed in more detail below):

$$s(\omega) = \frac{1}{N}\sum_{n} s_n(\omega) = \frac{1}{N}\sum_{n}\big(ss_n(\omega)\times sd_n(\omega)\big)$$

$$= \frac{1}{N}\Big[\big(ss_1(\omega)\times sd_1(\omega)\big)+\big(ss_2(\omega)\times sd_2(\omega)\big)+\cdots+\big(ss_N(\omega)\times sd_N(\omega)\big)\Big]$$

$$= ss(\omega)\times\frac{1}{N}\big(sd_1(\omega)+sd_2(\omega)+\cdots+sd_N(\omega)\big)$$

It can be seen that the static parts $ss_n(\omega)$ average to $ss(\omega)$ and that individual frames have the spectra $ss_n(\omega)\times sd_n(\omega)$. Consider however the spectrum of two added frames:

$$\big(ss_1(\omega)\times sd_1(\omega)\big)+\big(ss_2(\omega)\times sd_2(\omega)\big) = ss(\omega)\times\big(sd_1(\omega)+sd_2(\omega)\big)$$

We have the steady part multiplied by a new spectrum $sd_1(\omega)+sd_2(\omega)$. But since it is to be reduced by averaging, and it is also dynamic or variable in nature, the new spectrum should behave in exactly the same way as a randomly extracted frame. The implication of this is that frames could be randomly added together with minimal effect on performance. This observation is not entirely true, since we can have the case of valid speech frames added to silence frames, in which the net result is a valid speech frame. This in fact results in an improvement in performance, as we are no longer including unwanted silences in the modelling.

If a typical signal with some minor silence problems has time frames randomly added, the silences would be eliminated but the signal would appear to have undergone major corruption. However, the present invention using massively overlapped frames still functions. Interestingly, the implication of this is that channel echoes have no effect and can be ignored. It also underlines the fact that the preferred operating modes of the present invention extract the static parts of the spectra to a larger extent than conventional verifiers (as discussed further below). The addition of frames in this way has substantially the same effect as adding coloured noise to prevent unwanted modelling as discussed above.

In accordance with another aspect, the invention uses clustering or averaging techniques such as Vector Quantisation, applied by the modules 28, 28a-n in a manner that differs from statistical analysis techniques used in conventional speaker recognition systems.

Preferably, the system of the present invention uses a Vector Quantisation (VQ) technique in processing the enrolment sample data output from the spectral analysis modules 26, 26a-n. This is a simplified technique, compared with statistical analysis techniques such as HMM employed in many prior art systems, resulting in transparent modelling providing models in a form which allows model-against-model comparisons in the subsequent classification stage. Also, VQ as deployed in the present invention does not use temporal information, making the system resistant to impostors.

The VQ process effectively compresses the LPCC output data by identifying clusters of data points, determining average values for each cluster, and discarding data which do not clearly belong to any cluster. This results in a set of second matrices of second coefficients, representing the LPCC data of the set of first matrices, but of reduced size (typically, for example, 64 × 24 as compared with 1000 × 24).
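As an illustrative sketch of this compression step, SciPy's k-means routine can stand in for the VQ codebook training; the choice of library routine, initialisation and seed are assumptions, and any equivalent clustering technique would serve.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def build_codebook(lpcc_frames, codebook_size=64, seed=0):
    """Reduce an approx. 1000 x 24 LPCC frame matrix to a 64 x 24 codebook.

    The returned centroids are the cluster averages; frames that lie far
    from every centroid contribute little to the final reference model.
    """
    centroids, _labels = kmeans2(np.asarray(lpcc_frames, dtype=float),
                                 codebook_size, minit="++", seed=seed)
    return centroids  # the speaker's reference model (codebook)
```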

The effects of the use of LPCC spectral analysis and clustering/averaging in the present invention will now be discussed.

The basic model assumes that spectral magnitude is useful and that the phase may be disregarded. This is known to apply to human hearing, and if it were not applied to a verifier the system would exhibit undesirable phase-related problems, such as sensitivity to the distance of the microphone from the speaker. Further assume that the spectral information of a speech sample can be regarded as consisting of two parts, a static part $ss(\omega)$ and a dynamic part $sd(\omega)$, and that the processes are multiplicative. It is also assumed that the dynamic part is significantly larger than the static part:

$$s(\omega) = ss(\omega)\times sd(\omega)$$

As, by definition, the static part is fixed, it is the more useful as a biometric, as it will be related to the static characteristics of the vocal tract. This will relate the measure to some fixed physical characteristic, as opposed to $sd(\omega)$, which is related to the dynamics of the speech.

The complete extraction of $ss(\omega)$ would give a biometric which exhibits the properties of a physical biometric, i.e. it cannot be changed at will and does not deteriorate over time. Alternatively, the exclusive use of $sd(\omega)$ will give a biometric which exhibits the properties of a behavioural biometric, i.e. it can be changed at will and will deteriorate over time. A mixture of the two should exhibit intermediate properties, but as $sd(\omega)$ is much larger than $ss(\omega)$ it is more likely that a combination will exhibit the properties of $sd(\omega)$, i.e. behavioural.

As with all frequency representations of a signal, the assumption is that the time signal exists from $-\infty$ to $+\infty$, which clearly is not physically possible. In practice all spectral estimates of a signal will be made using a window, which exists for a finite period of time. The window can either be rectangular or shaped by a function (such as a Hamming window).

The use of a rectangular window amounts to simply taking a section of a signal in the area of interest and assuming that it is zero elsewhere. This technique is common in speech processing, in which the sections of signal are called frames; Fig. 1 shows a time signal with the frames indicated.

The frames can be shaped using an alternate window. Interestingly, the major effect of windowing is a spreading of the characteristic of a particular frequency to its neighbours, a kind of spectral averaging. This effect is caused by the main lobe; in addition to this, the side lobes produce spectral oscillations, which are periodic in the spectrum. The present system later extracts the all-pole Linear Prediction coefficients, which have the intended effect of spectral smoothing, and the extra smoothing caused by the windowing is not seen as a major issue. However, the periodic side lobe effects might be troublesome if the window size was inadvertently changed. This however can be avoided by good housekeeping.

Given that we can divide the time signal into frames, the spectral characteristics for frames 1 to N can be represented as

$$s_1(\omega) = ss_1(\omega)\times sd_1(\omega);\quad s_2(\omega) = ss_2(\omega)\times sd_2(\omega);\ \ldots\ ;\quad s_N(\omega) = ss_N(\omega)\times sd_N(\omega)$$

But by definition

$$ss(\omega) = ss_1(\omega) = ss_2(\omega) = ss_3(\omega) = \cdots = ss_N(\omega)$$

On first impressions, to extract $ss(\omega)$ would seem to be possible using an averaging process:

$$s(\omega) = \frac{1}{N}\sum_{n} s_n(\omega) = \frac{1}{N}\sum_{n}\big(ss_n(\omega)\times sd_n(\omega)\big)$$

$$= \frac{1}{N}\Big[\big(ss_1(\omega)\times sd_1(\omega)\big)+\big(ss_2(\omega)\times sd_2(\omega)\big)+\cdots+\big(ss_N(\omega)\times sd_N(\omega)\big)\Big]$$

$$= ss(\omega)\times\frac{1}{N}\big(sd_1(\omega)+sd_2(\omega)+\cdots+sd_N(\omega)\big) = ss(\omega)\times U(\omega)$$

where

$$U(\omega) = \frac{1}{N}\big(sd_1(\omega)+sd_2(\omega)+\cdots+sd_N(\omega)\big)$$

If the frames had independent spectral characteristics (each resulting from a random process) then $U(\omega)$ would tend to white noise, i.e. would have a flat spectrum, so that $ss(\omega)$ could be extracted by smoothing the spectrum. This would most likely be the case if N were very large ($N \to \infty$). Given the linear nature of the time domain - frequency domain - time domain transformations, a similar analysis could have been described in the time domain.

For real world conditions it cannot be assumed that N would be large in the sense that the frames have independent spectral characteristics. It is important to remember that this would require N to be large under two conditions:

1. During model creation
2. During a verification event

Failure to comply during either would potentially cause a system failure (error); however, a failure in 1 is the more serious, as it would remain a potential source of error until updated, whereas a problem in 2 is a single instance event.

If $U(\omega)$ cannot be guaranteed to converge to white noise, what can be done to cope with the situation? First consider that:

1. $U(\omega)$ will be a variable quantity.
2. When smoothed across the frequency spectrum it would ideally be flat; i.e. the smoothed version $U_{sm}(\omega) = 1$.
3. $U(\omega)$ is the truncated sum of the speech frames, the number of which would ideally tend to infinity.
Considering the equation

$$s(\omega) = ss(\omega)\times\frac{1}{N}\sum_{n} sd_n(\omega)$$

the summation part tending to a flat spectrum is not an ideal performance measure. If we return to the frame-based equivalent:

$$s(\omega) = \frac{1}{N}\sum_{n}\big(ss_n(\omega)\times sd_n(\omega)\big)$$

If we take the logarithms of the frames:

$$\frac{1}{N}\sum_{n}\log\big(ss_n(\omega)\times sd_n(\omega)\big) = \frac{1}{N}\sum_{n}\big[\log(ss_n(\omega)) + \log(sd_n(\omega))\big] = \log(ss(\omega)) + \frac{1}{N}\sum_{n}\log(sd_n(\omega))$$

$$= lss(\omega) + lsd(\omega)$$

it can be seen that the relationship between the static and dynamic parts is now additive. Because the relationship between the time domain and the frequency domain is linear, a transformation from frequency to time gives:

$$lss(\omega) + lsd(\omega) \;\rightarrow\; cs(\tau) + cd(\tau) = c(\tau)$$

In signal processing, $c(\tau)$ is known as the Cepstral transformation of s(t), as discussed previously.

In general, cepstral analysis consists of

$$\text{time domain} \rightarrow \text{frequency domain} \rightarrow \log(\text{spectrum}) \rightarrow \text{time domain}$$

The Cepstral transformation has been used in speech analysis in many forms.

As discussed above, in our current usage we create the Cepstral coefficients for the frames and extract the static part (a minimal sketch of this step is given after the list below). Ideally the length of the speech signal would be long enough so that the dynamic part was completely random and the mean would tend to zero. This would leave the static part $cs(\tau)$ as our biometric measure. However, we have a number of problems to overcome:

1. How do we handle the imperfect nature of the sum-to-zero?
2. Channel variation.
3. Endpointing.
4. Additive noise.
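A minimal sketch of this static-part estimate, assuming the per-frame cepstra are already available as a matrix (names are illustrative):

```python
import numpy as np

def static_cepstrum(cepstral_frames):
    """Estimate the static part cs(tau) as the mean over all frame cepstra.

    cepstral_frames: (N, d) matrix of per-frame cepstral coefficients.
    For finite N the dynamic part cd(tau) is only driven towards zero,
    not exactly to it -- the imperfect 'sum-to-zero' of problem 1 above.
    """
    return np.asarray(cepstral_frames).mean(axis=0)
```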

Referring to the imperfect nature of the sum-to-zero, the nature of the Cepstral coefficients is such that they decay with increasing time and have the appearance of an impulse response for stable systems. This means that the dynamic range of each coefficient is different and they are in general in descending order.

It can be shown that the differences between the average coefficients of a test sample and the frame coefficient values for the true speaker model and the frame coefficient values of an impostor model are not large, and a simple summation over all of the utterance frames to produce a distance score will be difficult to threshold in the conventional manner.

If we consider the two difficult problems associated with this methodology together, rather than separately, the answer to the problem is revealed. To re-emphasise, the two points of difficulty are:

1. the utterances will never be long enough for the mean of the dynamic part to converge to zero;
2. the differences between the true speaker and the impostors will be small and difficult to threshold.
Consider two speakers with models based upon averaged cepstral frames, so that the models are $m1(\tau)$ and $m2(\tau)$, where

$$m1(\tau) = c1(\tau) = \frac{1}{N}\sum_{n} c1_n(\tau) = \frac{1}{N}\sum_{n}\big(cs1_n(\tau) + cd1_n(\tau)\big) = cs1(\tau) + \frac{1}{N}\sum_{n} cd1_n(\tau) = cs1(\tau) + e1(\tau)$$

where $e1(\tau)$ is the error. In vector form the models are

$$m1 = \begin{bmatrix} cs1_1 + e1_1 \\ cs1_2 + e1_2 \\ \vdots \\ cs1_P + e1_P \end{bmatrix} \quad\text{and}\quad m2 = \begin{bmatrix} cs2_1 + e2_1 \\ cs2_2 + e2_2 \\ \vdots \\ cs2_P + e2_P \end{bmatrix}$$

A test utterance from speaker 1 expressed in the same form will be

$$T1 = \begin{bmatrix} cs1_1 + Te1_1 \\ cs1_2 + Te1_2 \\ \vdots \\ cs1_P + Te1_P \end{bmatrix}$$

Using a simple distance measure, the true speaker distance is

$$d1 = |m1 - T1| = \left|\begin{bmatrix} cs1_1 + e1_1 \\ cs1_2 + e1_2 \\ \vdots \\ cs1_P + e1_P \end{bmatrix} - \begin{bmatrix} cs1_1 + Te1_1 \\ cs1_2 + Te1_2 \\ \vdots \\ cs1_P + Te1_P \end{bmatrix}\right| = |e1 - Te1|$$

and the impostor distance is

$$d2 = |m2 - T1| = \left|\begin{bmatrix} cs2_1 + e2_1 \\ cs2_2 + e2_2 \\ \vdots \\ cs2_P + e2_P \end{bmatrix} - \begin{bmatrix} cs1_1 + Te1_1 \\ cs1_2 + Te1_2 \\ \vdots \\ cs1_P + Te1_P \end{bmatrix}\right| = |cs2 - cs1 + e2 - Te1|$$

Assuming that the convergence of the dynamic parts of the models is good (i.e. that the error vectors are small compared to the static vectors) then in general $d1 < d2$, provided that the models built represent the enrolled speaker (a condition that can easily be checked during enrolment using the data available at that time). Interestingly, if $e1$ and $e2$ are small compared to the test signal error $Te1$, the distances become independent of $e1$ and $e2$. The condition under which the test error will be large when compared to the model error is during text-independent test conditions. This shows that if the dynamic components of the enrolment speech samples are minimised in the enrolment models, then such models can provide a good basis for text-independent speaker recognition.

The errors $e1$ and $e2$ above are average model construction errors; the actual errors arise on a frame-by-frame basis and will have a distribution about the mean. This distribution could be modelled in a number of ways, the simplest being by use of a standard clustering technique such as k-means to model the distribution. The use of k-means clustering is also known in other forms as Vector Quantisation (VQ) and is a major part of the Self Organising Map (SOM), also known as the Kohonen Artificial Neural Network.

The system just described, where a test utterance is applied to two models and the closest chosen, is a variant of identification. In the above case, if either speaker 1 or speaker 2, the enrolled speakers, claim to be themselves and are tested, they will always test as true, and so the False Rejection Rate FRR = 0. If an unknown speaker claims to be either speaker 1 or speaker 2, he will be classified as one or the other, so there is a 1/2 chance of success and hence a False Acceptance Rate FAR = 50%. If an equal number of true speaker tests and random impostor tests were carried out, we can calculate an overall error rate as (FRR + FAR)/2 = (0 + 0.5)/2 = 25%.

It is obvious that the number of models (the cohort) against which the test utterance is tested will have an effect on the FAR, and it will reduce as the cohort increases. It can be shown that the accuracy of recognition under these conditions is asymptotic to 100% with increasing cohort size, since FRR = 0, but as the accuracy is

$$\text{accuracy} = 100 - \frac{FRR + FAR}{2} = 100 - \left(FRR + \frac{100}{\text{cohort\_size}}\right)\frac{1}{2}$$

it is in more general terms asymptotic to $100 - FRR$.

It is worth observing at this point that the FRR and FAR are largely decoupled: the FRR is fixed by the quality of the model produced and the FAR is fixed by the cohort size. It is also worth observing that to halve the error rate we need to double the cohort size, e.g. for 99% accuracy the cohort is 50, for 99.5% accuracy the cohort is 100, for 99.75% accuracy the cohort is 200. As the cohort increases the computational load increases, and in fact doubles for each halving of the error rate. As the cohort increases to very large numbers the decoupling of the FRR and FAR will break down and the FRR will begin to increase.
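The relationship between cohort size and accuracy can be checked numerically; this short sketch simply evaluates the accuracy expression given above for the cohort sizes quoted:

```python
def identification_accuracy(cohort_size, frr=0.0):
    """Accuracy (%) of the cohort identifier, with FAR = 100 / cohort_size."""
    far = 100.0 / cohort_size
    return 100.0 - (frr + far) / 2.0

for n in (2, 50, 100, 200):
    print(n, identification_accuracy(n))   # 75.0, 99.0, 99.5, 99.75
```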

Rather than continually increasing the cohort size in an attempt to reduce the FAR to a minimum, another approach is needed. The approach, in accordance with one aspect of the invention, is to use parallel processes (also discussed elsewhere in the present description), which exhibit slightly different impostor characteristics and are thus partially statistically independent with respect to the identifier strategy. The idea is to take a core identifier which exhibits the zero or approximately zero FRR and which has a FAR that is set by the cohort size. The front end processing of this core identifier is then modified slightly to reorder the distances of the cohort member models from the true speaker model. This is done while maintaining the FRR ≈ 0 and can be achieved by altering the spectral shaping filters 24a-24n (see Fig. 7), or by altering the transformed coefficients, such as by using delta-ceps etc.

When an enrolled speaker uses the system, the test signal is applied to all of the processes in parallel, but each process has FRR ≈ 0 and the speaker will pass. When an unknown impostor uses the system, he will pass each individual process with a probability of 1/cohort_size. However, with the parallel processes we have introduced conditional probabilities. That is, if an impostor passes process 1, what is the likelihood of him passing the modified process 2 as well, etc. Although the probability of an impostor passing all of the processes is not that of the statistically independent case of

$$\text{statistically\_independent\_result} = \text{process\_prob}^{\text{number\_of\_processes}}$$

it does nevertheless reduce with the addition of processes. It can be shown that for a given process FAR value, the overall accuracy of the system increases with the number of processes.

Where multiple parallel processes are used in this way, the scheme for matching a test sample against a claimed identity may require a successful match for each process or may require a predetermined proportion of successful matches, as sketched below.
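A sketch of such a combination rule follows; the function and parameter names are illustrative. The commented expression is the statistically independent bound quoted above, which the description notes the real system only approaches.

```python
def combined_decision(process_results, require_all=True, min_passes=None):
    """Combine the accept/reject outputs of the parallel processes.

    process_results: list of booleans, one per parallel process.
    Either every process must accept the claimed identity, or at least
    min_passes of them must (the 'predetermined proportion' above).
    """
    if require_all:
        return all(process_results)
    return sum(process_results) >= min_passes

# Idealised impostor pass rate if the processes were fully independent:
# far_independent = (1.0 / cohort_size) ** number_of_processes
```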

The combined use of massive sample frame overlaps and Vector Quantisation (or equivalent) in building enrolment models in accordance with the present invention provides particular advantages. The massive overlapping is applied at the time of constructing the models, although it could also be applied at the time of testing an utterance. The technique involves using a massive frame overlap, typically 80-90%, to generate a large number of possible alignments; the frames generated by the alignments are then transformed into representative coefficients using the LPCC transformation to produce a matrix of coefficients representing all of the alignments. This avoids conventional problems of frame alignment. The matrix is typically of the size no_of_frames by LPCC_order, for example 1000×24. This is repeated for all of the utterances used in constructing the model, typically 10, giving 10 matrices of 1000×24. Vector Quantisation is then used to reduce the data to produce a model for the speaker. This has the effect of averaging the frames so as to reduce the significance of the dynamic components of the sampled speech data as discussed above. The resulting model does not take cognisance of the frame position in the test utterance and is hence not temporal in nature. This addresses the problem of temporal dependency.

The combined use of VQ and massive frame overlapping produces an operation mode which is different from conventional systems based upon HMM/DTW. In HMM/DTW all frames are considered to be equally valid and are used to form a final score for thresholding into a yes/no decision. In the present invention every row (frame) of the test sample data is tested against every row of the enrolment model data for the claimed speaker and the associated impostor cohort. For each row of the test sample data, a best match can be found with one row of the enrolment model, yielding a test score for the test sample against each of the relevant enrolment models. The test sample is matched to the enrolment model that gives the best score. If the match is with the claimed identity, the test speaker is accepted. If the match is with an impostor, the test speaker is rejected.
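By way of illustration, the sketch below implements this decision rule, assuming Euclidean frame distances and a model score accumulated by summing each test frame's best match; other accumulation schemes are equally compatible with the description above.

```python
import numpy as np

def classify_against_cohort(test_feats, claimed_model, cohort_models):
    """Accept the claim only if the claimed speaker's model scores best.

    Every row (frame) of the test data is scored against every row of each
    enrolment model; a model's score is the accumulated best frame match.
    """
    def model_score(model):
        d = np.linalg.norm(test_feats[:, None, :] - model[None, :, :], axis=-1)
        return d.min(axis=1).sum()   # best match per test frame, accumulated

    scores = [model_score(claimed_model)] + [model_score(m) for m in cohort_models]
    return int(np.argmin(scores)) == 0   # True -> accept the claimed identity
```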
The present system, then, uses LPCC and VQ modelling (or similar/equivalent spectral analysis and clustering techniques), together with massive overlapping of the sample frames, to produce the reference models for each enrolled speaker, which are stored in the database. In use of the system, an input test utterance is subjected to similar spectral analysis to obtain an input test model which can be tested against the enrolled speaker data-set. Advantageously, this approach can be applied so as to obtain a very low False Rejection Rate (FRR), substantially equal to zero. The significance of this is discussed further below.
Parallel Modelling
As previously discussed, the performance of speaker recognition systems in accordance with the invention can be improved by using multiple parallel processes to generate the model.

Referring now to Fig. 7 of the drawings, one preferred embodiment of a speaker recognition system employing parallel modelling processes in accordance with one aspect of the invention comprises an input channel 100 for inputting a signal representing a speech sample to the system, a channel normalisation process 200 as described elsewhere, a plurality of parallel signal processing channels 102a, 102b ... 102n, a classification module 110 and an output channel 112. The system further includes an enrolled speaker data-set 114; i.e. a database of speech models obtained from speakers enrolled to use the system. The speech sample data is processed in parallel by each of the processing channels 102a-n, the outputs from each of the processing channels are input to the classification module 110, which communicates with the database 114 of enrolled speaker data, and a decision as to the identity of the source of the test utterance is output via the output channel 112.

Each of the processing channels 102a-n comprises, in series, a spectral shaping filter 24a-n, an (optional) added noise input 206a-n, as described elsewhere, a spectral analysis module 26a-n and a statistical analysis module 28a-n. The outputs from each of the statistical analysis modules 28a-n are input to the classification module 110.

The spectral shaping filters 24a-n comprise a bank of filters which together divide the utterance signal into a plurality of overlapping frequency bands, each of which is then processed in parallel by the subsequent modules 26a-n and 28a-n. The number of processing channels, and hence the number of frequency bands, may vary, with more channels providing more detail in the subsequent analysis of the input data. Preferably, at least two channels are employed, more preferably at least four channels. The filters 24a-n preferably constitute a low-pass or band-pass or high-pass filter bank. The bandwidth of the base filter 24a is selected such that the False Rejection Rate (FRR) resulting from subsequent analysis of the output from the first channel 102a is zero or as close as possible to zero. The subsequent filters 24b-n have incrementally increasing bandwidths that incrementally pass more of the signal from the input channel 100. The FRR for the output from each channel 102a-n is thus maintained close to zero whilst the different channel outputs have slightly different False Acceptance (FA) characteristics. Analysis of the combined outputs from the channels 102a-n yields a reduced overall FA rate (a claimed identity is only accepted if the outputs from all of the channels are accepted) with a FRR close to zero. The significance of this is discussed further below.
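A sketch of one possible such filter bank, assuming Butterworth low-pass filters, a 16 kHz sampling rate and four illustrative cutoff frequencies (none of these values is prescribed by the description above):

```python
import numpy as np
from scipy.signal import butter, lfilter

def filter_bank_outputs(signal, fs=16000, cutoffs=(1000, 2000, 3000, 4000)):
    """Apply a bank of low-pass filters with incrementally increasing bandwidth.

    The narrowest filter plays the role of the base filter 24a; each later
    filter passes incrementally more of the input signal, giving each
    parallel channel slightly different False Acceptance characteristics.
    """
    outputs = []
    for fc in cutoffs:
        b, a = butter(4, fc / (fs / 2.0), btype="low")
        outputs.append(lfilter(b, a, signal))
    return outputs  # one filtered signal per parallel processing channel
```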

The use of multiple frequency bands improves upon conventional single-channel spectral analysis, increasing the size of the feature vectors of interest in the subsequent statistical analysis.

It will be understood that different types of parallel processing may be employed in the modelling process in order to provide multiple feature sets modelling different (related or unrelated) aspects of the input speech sample and/or alternative models of similar aspects. Banks of filters of other types, in addition to or instead of low-pass filters, might be employed. Different types or variants of spectral and/or statistical analysis techniques might be used in parallel processing channels. Parallel statistical analyses may involve applying different weighting values to sets of feature coefficients so as to obtain a set of slightly deviated models.

It will be understood that the architecture illustrated in Fig. 7 may be used both for obtaining enrolment models for storing in the database 114 and for processing test speech samples for testing against the enrolment models. Each enrolment model may include data-sets for each of a plurality of enrolment utterances. For each enrolment utterance, there will be a matrix of data representing the output of each of the parallel modelling processes. Each of these matrices represents the clustered/averaged spectral feature vectors. Test sample data is subject to the same parallel spectral analysis processes, but without clustering/averaging, so that the test model data comprises a matrix representing the spectral analysis data for each of the parallel modelling processes. When a test model is tested against an enrolment model, the test matrix representing a particular modelling process is tested against enrolment matrices generated by the same modelling process.
CLASSIFICATION

The nature of the reference models obtained by the modelling
techniques described above is such that they lend themselves to
direct model against model comparisons. This enables the system to
employ an identifier strategy in which each enrolment model is
associated with an impostor cohort. That is, for the reference
model of each enrolled speaker ("subject"), there is an impostor
cohort comprising a predetermined number of reference models of
other enrolled speakers, specific to that subject and which has a
known and predictable relationship to the subject's reference
model. These predictable relationships enable the performance of
the system to be improved. Fig. 11(a) shows the results obtained
by a conventional speaker recognition system, similar to Fig. 3,
comparing scores for an input utterance tested against reference
data for eight speakers. Speaker 1 is the true speaker, but the
scores for some of the other speakers are sufficiently close to
reduce significantly the degree of confidence that the system has
identified the correct speaker. Fig. 11(b) shows equivalent
results obtained using a system in accordance with the present
invention. It can be seen that the results for speaker 1 are much
more clearly distinguished from the results of all of the other
speakers 2 to 8.

The speaker modelling method employed in the preferred embodiments
of the present invention is inherently simpler (and, in strict
mathematical terms, cruder) than conventional techniques such as
HMM and possible alternatives such as Gaussian mixture models.
However, the present applicants believe that the conventional use
of "tight" statistical methods is inherently flawed and results in
poor "real world" performance, and that, surprisingly, the
relatively simpler statistical methods of the present invention are
much more effective in practice. As previously noted, the temporal
nature of HMM makes it susceptible to mimics, a problem which is
avoided by the present invention. Further, the models of the
present invention are ideally suited to enable analysis of the
structure of the enrolled speaker data-set by model against model
testing.

The ability to perform model against model comparisons by using the
present speaker models provides two particular advantages.
Firstly, this provides the ability to identify the most relevant
impostors in the enrolled speaker data-set (i.e. those which are
close to and uniformly distributed around a particular model) and
to produce an effective and predictable speaker normalisation
mechanism. VQ modelling involves choosing the size of the model;
i.e. choosing the number of coefficients ("centres"). Once this
has been done, the positions of the centres can be moved around
until they give the best fit to all of the enrolment data vectors.
This effectively means allocating a centre to a cluster of
enrolment vectors, so each centre in the model represents a cluster
of information important to the speaker identity.
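
As an illustration of the VQ enrolment step just described, the
sketch below fixes the number of centres and then moves them to fit
the enrolment feature vectors, so that each centre comes to
represent a cluster of speaker-specific information. The use of
scikit-learn's k-means and the particular sizes are assumptions;
the specification only calls for some such clustering procedure.

```python
# Sketch of VQ enrolment: fit a fixed number of centres to the
# enrolment feature vectors. k-means is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

def train_vq_model(enrolment_vectors, n_centres=32):
    """enrolment_vectors: (n_frames, n_coeffs) feature vectors.
    Returns the model: an (n_centres, n_coeffs) array of centres."""
    km = KMeans(n_clusters=n_centres, n_init=10, random_state=0)
    km.fit(enrolment_vectors)
    return km.cluster_centers_

frames = np.random.randn(500, 12)  # e.g. 500 frames of 12 coefficients
model = train_vq_model(frames)
```
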
The model against model tests make it possible to predict how an
enrolled speaker, or claimed identity, will perform against the
database both in the broad sense and in an area local (in the
system dataspace) to the claimed identity. Fig. 12 illustrates the
results of testing reference models for speakers 2 to 8 against the
reference model for speaker 1. The ellipses show the model against
model results whilst the stars show actual scores for speaker
utterances tested against model 1. It can be seen that the model
against model tests can be used to predict the actual performance
of a particular speaker against a particular reference model. The
model against model results tend to lie at the bottom of the actual
score distributions and therefore indicate how well a particular
impostor will perform against model 1. This basic approach of
using model against model tests to predict actual performance is
known per se. As described further below, this approach may be
extended in accordance with one aspect of the present invention to
guard particular models against impostors using individually
selected, statistically variable groupings.

The second advantage derived from model against model testing is
the ability to predict the performance of a test utterance against
some or, if need be, all of the enrolled speaker models. This
enables a virtually unlimited number of test patterns to be used to
confirm an identity, which is not possible with conventional
systems.

In addition, the model against model test results may be used to
assemble a specific impostor cohort for use with each reference
model. This allows accurate score normalisation and also allows
each model to be effectively "guarded" against impostors by using a
statistically variable grouping which is selected for each enrolled
speaker. This is illustrated by Fig. 13. Each reference model can
be regarded as a point in a multi-dimensional dataspace, so that
"distances" between models can be calculated. Fig. 13 illustrates
this idea in two dimensions for clarity, where each star represents
a model and the two-dimensional distance represents the distance
between models.
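
The sketch below illustrates this idea of models as points in a
dataspace: a model-against-model distance is computed and used to
pick the N enrolled models nearest a subject as its candidate
impostor cohort. The symmetric average nearest-centre distance is
an illustrative metric only; the specification does not fix a
particular distance measure.

```python
# Sketch of model-against-model comparison and cohort selection.
# The distance metric here is an assumption for illustration.
import numpy as np

def model_distance(model_a, model_b):
    """Average distance from each centre of one model to the nearest
    centre of the other, symmetrised over the two models."""
    d = np.linalg.norm(model_a[:, None, :] - model_b[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def select_cohort(subject_model, other_models, n=20):
    """Pick the n enrolled models closest to the subject model as
    candidate 'most relevant impostors'."""
    return sorted(other_models,
                  key=lambda m: model_distance(subject_model, m))[:n]
```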

It can be seen that the distribution of speaker models is not
uniform, so that a world-model based normalisation technique will
not operate equally well for all speaker models. It can also be
seen that some speaker models can be relatively close to one
another, which implies that there is potential for impostors to
successfully impersonate enrolled speakers. For each speaker
model, these issues can be resolved by creating a specific cohort
of impostors around the subject model. This simplifies
normalisation and creates a guard against impostors. This is
illustrated in Fig. 14, which shows, in a similar manner to
Fig. 13, a subject model represented by a circle, members of an
impostor cohort represented by stars, and a score for an impostor
claiming to be the subject, represented by an "X". The impostor
score is sufficiently close to the subject model to cause
recognition problems. However, because the speaker data-set
enables prediction of how the true subject speaker will perform
against the models of the impostor cohort, this information can be
used to distinguish the impostor X from the true subject, by
testing the impostor against the models of the cohort members as
well as against the true subject model. That is, it can be seen
that the impostor utterance X is closer to some of the cohort
members than would be expected for the true subject, and further
away from others than expected. This would indicate an impostor
event and result in the impostor utterance being rejected as a
match for the true subject.

This provides the basis for a two stage recognition process: a
first stage rejects impostors who are clearly not the claimed
speaker and is followed, where necessary, by a more detailed
process applied to utterances which are close enough to possibly be
the claimed speaker.
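
The cohort-profile check underlying this two stage idea can be
sketched as follows: the test utterance's scores against each
cohort member are compared with the scores that model-against-model
testing predicts for the true subject, and a large deviation is
treated as an impostor event. The deviation measure and threshold
are illustrative assumptions, not values taken from the
specification.

```python
# Sketch of the cohort-profile check: flag a claim as an impostor
# event when the test utterance's scores against the cohort members
# deviate too far from the profile predicted for the true subject.
import numpy as np

def impostor_event(test_scores, predicted_scores, threshold=2.0):
    """test_scores, predicted_scores: one score per cohort member.
    Returns True if the test utterance is closer to some cohort
    members, and further from others, than the true subject
    would be expected to be."""
    deviation = np.abs(np.asarray(test_scores, dtype=float)
                       - np.asarray(predicted_scores, dtype=float))
    return deviation.mean() > threshold
```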

In certain applications of speaker verification systems, it is
important to minimise the possibility of "false rejections"; i.e.
instances in which the identity claimed by a user is incorrectly
rejected as being false. In accordance with one aspect of the
invention, an "identifier strategy" is employed which provides very
low false rejections, whilst also providing predictable system
performance and minimising problems associated with the use of
thresholds in accepting or rejecting a claimed identity.

In accordance with this strategy, the database of enrolled speakers
(the "speaker space") is partitioned; e.g. so that each speaker
enrolled in the system is assigned to a cohort comprising a fixed
number N of enrolled speakers, as described above. The speaker
classification module of the system (e.g. the module 110 in the
system of Fig. 4) operates such that the input test utterance is
compared with all of the members of the cohort associated with the
identity claimed by the speaker, and the test utterance is
classified as corresponding to that member of the cohort which
provides the best match. That is, the test utterance is always
matched to one member of the cohort, and will never be deemed not
to match any member of the cohort. If the cohort member to which
the utterance is matched corresponds to the claimed identity, then
the claimed identity is accepted as true. If the utterance is
matched to any other member of the cohort then the claimed identity
is rejected as false.
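
A minimal sketch of this identifier strategy follows: the test
model is always matched to exactly one member of the claimed
speaker's cohort, and the claim is accepted only if that best match
is the claimed identity itself. The `model_distance` score and the
cohort representation are assumptions carried over from the earlier
sketches.

```python
# Sketch of the identifier strategy: always match the test model to
# one cohort member; accept only if that member is the claimed
# identity.

def verify(test_model, claimed_id, cohort, model_distance):
    """cohort: dict mapping speaker id -> enrolment model, including
    the claimed identity. Returns True if the claim is accepted."""
    best = min(cohort,
               key=lambda sid: model_distance(test_model, cohort[sid]))
    return best == claimed_id
```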

The modelling and classification processes can be tuned such that
the proportion of false rejections is effectively zero (FR = 0 %)
(as discussed above); i.e. there is substantially zero probability
that a speaker will be wrongly identified as a member of the cohort
other than the claimed identity. This is facilitated by the use of
model against model comparisons such that a match is not based
simply upon the test utterance being matched against the single
closest model, but also on the basis of its relationship to other
members of the cohort. Where the cohort is of a fixed size N, the
maximum possible proportion of false acceptances is FA = 100/N %
and the total average error rate = (FA + FR)/2 = 50/N %. If the
cohort size N is 20, the error rate is thus 2.5 %; i.e. an accuracy
of 97.5 %. If the cohort size is fixed, the system is scalable to
any size of population while maintaining a fixed and predictable
error rate. That is, the accuracy of the system is based on the
size of the cohort and is independent of the size of the general
population, making the system scalable to very large populations.
Accuracy can be improved by increasing the cohort size, as long as
the false rejection rate does not increase significantly.

This strategy does not rely on the use of thresholds to determine a
result, but thresholds could still be used to reduce false
acceptances; i.e. once a test utterance has been matched to the
claimed identity using the foregoing strategy, thresholds could be
applied to determine whether the match is close enough to be
finally accepted.

As indicated above, the selection of an impostor cohort associated
with a particular enrolment model may involve the use of algorithms
so that the members of the impostor cohort have a particular
relationship with the enrolment model in question. In principle,
this may provide a degree of optimisation in the classification
process. However, it has been found that a randomly selected
impostor cohort performs equally well for most practical purposes.
The most important point is that the cohort size should be
predetermined in order to give predictable performance. The
impostor cohort for a particular enrolment model may be selected at
the time of enrolment or at the time of testing a test utterance.

Parallel Classification

The performance of a speaker recognition system in accordance with
the invention may be improved by the use of multiple parallel
classification processes. Generally speaking, such processes will
be statistically independent or partially independent. This
approach will provide multiple classification results which can be
combined to derive a final result, as illustrated in Fig. 5.

In one example, using the identifier strategy described above, the
same test utterance may be tested against a number of different
cohorts, or against different enrolment phrases, or combinations
thereof. Where multiple cohorts are employed, each cohort will
give a result with a false rejection rate of essentially zero
(FR = 0 %) and a false acceptance rate FA = 100/N % as before. The
overall false acceptance rate for n cohorts of equal size will be
FA = 100*M/N^n % and the average error rate = 50*M/N^n %, where M
is a coefficient having a value greater than 1 and representing the
effect of the processes not being entirely statistically
independent. That is, with 2 cohorts and a cohort size of 20, the
average error rate will be 0.125*M % as compared with 2.5 % for a
single cohort as described above. Thresholds may also be applied
to further improve accuracy as previously described.

Other types of partially statistically independent processes may be
employed in the modelling process, the classification process or
both as previously discussed. Besides the examples previously
given, a single utterance may be divided into parts and processed
separately.
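
The combination of parallel classification results can be sketched
as below: the single-cohort decision is repeated over several
cohorts (or enrolment phrases) and the claim is accepted only if
every one accepts, which is what drives the false acceptance rate
down towards 100*M/N^n %. The inputs and the cohort representation
are assumptions carried over from the earlier sketches.

```python
# Sketch of parallel classification: accept a claim only if every
# independent cohort accepts it (per-cohort FR stays near zero).

def verify(test_model, claimed_id, cohort, model_distance):
    best = min(cohort,
               key=lambda sid: model_distance(test_model, cohort[sid]))
    return best == claimed_id

def verify_parallel(test_model, claimed_id, cohorts, model_distance):
    """cohorts: list of dicts, each mapping speaker id -> model."""
    return all(verify(test_model, claimed_id, c, model_distance)
               for c in cohorts)
```
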
NORMALISATION

A further problem encountered with conventional speaker recognition
systems is that system performance may be affected by differences
between speech sampling systems used for initial enrolment and
subsequent recognition. Such differences arise from different
transducers (microphones), soundcards etc. In accordance with a
further aspect of the present invention, these difficulties can be
obviated or mitigated by normalising speech samples on the basis of
a normalisation characteristic which is obtained and stored for
each sampling system (or, possibly, each type of sampling system)
used to input speech samples to the recognition system.
Alternatively (preferably), the normalisation characteristic can be
estimated "on the fly" when a speech sample is being input to the
system. The normalisation characteristic(s) can then be applied to
all input speech samples, so that reference models and test scores
are independent of the characteristics of particular sampling
systems. Alternatively or additionally, in accordance with a
further aspect of the invention a normalisation process can be
applied at the time of testing test sample data against enrolment
sample data.

A normalisation characteristic is effectively a transfer function
of the sampling system and can be derived, for example, by
inputting a known reference signal to the sampling system, and
processing the sampled reference signal through the speech
recognition system. The resulting output from the recognition
system can then be stored and used to normalise speech samples
subsequently input through the same sampling system or the same
type of sampling system.

Alternatively, as illustrated in Fig. 15, a speech signal S(f)
which has been modified by the transfer function C(f) of an input
channel 300 can be normalised on the fly by inputting the modified
speech signal S(f)*C(f) to an estimating module 302, which
estimates the transfer function C(f) of the channel 300, and to a
normalisation module 304, and applying the inverse of the estimated
transfer function 1/C(f) to the normalisation module, so that the
output from the normalisation module closely approximates the input
signal S(f). The estimator module 302 creates a digital filter
with the spectral characteristics of the channel 300 and the
inverse of this filter is used to normalise the signal. For
example, the inverse filter can be calculated by determining the
all-pole filter which represents the spectral quality of a sample
frame. The filter coefficients are then smoothed over the frames
to remove as much of the signal as possible, leaving the spectrum
of the channel (C(f)). The estimate of the channel spectrum is
then used to produce the inverse filter 1/C(f). This basic
approach can be enhanced to smooth the positions of the poles of
the filters obtained for the frames, with intelligent cancellation
of the poles to remove those which are known not to be concerned
with the channel characteristics.
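
A rough sketch of this on-the-fly estimation is given below: an
all-pole (LPC) filter is fitted to each frame, the coefficients are
averaged across frames so that the rapidly varying speech detail
largely cancels and the slowly varying channel shape C(f) remains,
and the inverse filter is then applied. The frame length, LPC
order and plain autocorrelation method are assumptions made for
illustration; the enhancements involving pole smoothing and
cancellation are not shown.

```python
# Sketch of channel normalisation by per-frame LPC and inverse
# filtering. Frame length and LPC order are assumed values.
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(frame, order=10):
    """All-pole coefficients via the autocorrelation method.
    Returns A(z) = [1, -a1, ..., -ap]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def normalise_channel(signal, frame_len=256, order=10):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, frame_len)]
    # Averaging the per-frame coefficients smooths away speech
    # detail, leaving an estimate of the channel spectrum C(f).
    a_channel = np.mean([lpc_coeffs(f, order) for f in frames], axis=0)
    # The inverse of an all-pole channel 1/A(z) is the FIR filter A(z).
    return lfilter(a_channel, [1.0], signal)
```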

Depending on the nature of the transfer function/normalisation
characteristic, the normalisation process can be applied to the
speech sample prior to processing by the speaker recognition system
or to the spectral data or to the model generated by the system.

A preferred method of channel normalisation, in accordance with one
aspect of the invention, is applied to the test model data and the
relevant enrolment models at the time of testing the test sample
against the enrolment models.

The overall effect of the channel characteristics on a speech
signal could be described as

s(ω) = ss(ω) × sd(ω) × cc(ω)

where s(ω) is the estimate of the speaker's characteristics, cc(ω)
is the channel characteristic or changed channel characteristic as
appropriate, and the speech signal is treated as comprising a
static part and a dynamic part as before. Ideally the unwanted
channel characteristic can be estimated and removed. In practice
the removal can be achieved in the time domain, the frequency
domain or a combination; these approaches achieve the same effect,
that is to estimate cc(ω) and remove it using some form of inverse
filter or spectral division. If ĉc(ω) is the estimate of the
spectrum of the unwanted channel then we would calculate

s(ω)/ĉc(ω) = ss(ω) × sd(ω) × cc(ω)/ĉc(ω)

If the estimation of the channel characteristic is good,
cc(ω)/ĉc(ω) ≈ 1, and our estimate of the speech is good, with the
unwanted spectral shaping removed. This would normally be
implemented using an algorithm based on the FFT.

An alternative implementation is to model the channel
characteristic as a filter, most likely in the all-pole form,

h(z) = 1 / (1 − Σₖ aₖ z⁻ᵏ)

This is the most basic form of the ARMA model and would normally be
extracted from the time signal directly, possibly using Linear
Prediction.

A similar normalisation could be carried out on the Cepstral
representation.

In the Cepstral domain the speech signal is represented as

c(τ) = cs(τ) + cd(τ)

and the speech signal modified by the unwanted channel
characteristics is

c(τ) = cs(τ) + cd(τ) + cc(τ)

It can be seen that in this case we have an additive process rather
than a product. But it should also be remembered that both cs and
cc are static, and we may need to remove one (cc) without removing
the other.

It is important to consider the contexts in which we would wish to
remove the signal cc, and their different conditions (enrolled
model, database derived cohort, test speaker etc.).

Figure 16 illustrates various sources of corruption of a speech
sample in a speaker recognition system. The input speech signal
s(t) is altered by environmental background noise, b(t), the
recording device bandwidth, r(t), electrical noise and channel
crosstalk, t(t), and transmission channel bandwidth, c(t), so that
the signal input to the recognition system is an altered signal
v(t). The system is easier to analyse in the frequency domain and
the signal at the verifier is:

v(ω) = ((s(ω) + b(ω)) · r(ω) + t(ω)) · c(ω)    (eq. 1)
At the verifier we can define two conditions, when the person is
speaking and when he is not, resulting in two equations:

v(ω) = ((s(ω) + b(ω)) · r(ω) + t(ω)) · c(ω)

and

v(ω) = ((0 + b(ω)) · r(ω) + t(ω)) · c(ω)
First consider the simplified problem as it applies to the systems
in accordance with the present invention; assume that
b(t) = t(t) = 0:

v(ω) = s(ω) · r(ω) · c(ω) = s(ω) · h(ω)

where h(ω) is the combined channel spectral characteristic,

h(ω) = r(ω) · c(ω)

so that

v(ω) = s(ω) · h(ω) = ss(ω) · sd(ω) · h(ω)
The cohort models are selected from the database of speakers
recorded using the same channel (b) and the true speaker model is
recorded using a different channel (a). The test speaker can
either be the true speaker or an impostor and will be recorded
using a third channel (c). Figure 17 shows this diagrammatically.
Fig. 18 shows the same thing expressed in the alternate form using
the Cepstral coefficients. It should be remembered that the values
of the signal components as represented in Figs 17 and 18 are
averages corresponding to the summations of sample frame data.
Consider the claimed identity model, which was built from

v_a(τ) = cs_a(τ) + cd_a(τ) + h_a(τ)    (eq. 2)

and the cohort models, which were built from

v_b(τ) = cs_b(τ) + cd_b(τ) + h_b(τ)    (eq. 3)
The problem for the verifier is that there are two different
channels used in the identifier. If we assume the difference
between them is

h_d(τ) = h_a(τ) − h_b(τ)

or h_a(τ) = h_b(τ) + h_d(τ)

then the claimed identity model referred to the cohort channel (b)
will be

v_a(τ) = cs_a(τ) + cd_a(τ) + h_a(τ) = cs_a(τ) + cd_a(τ) + h_b(τ) + h_d(τ)

and

v_a(τ) = (cs_a(τ) + h_d(τ)) + cd_a(τ) + h_b(τ)

It can be seen that the mean of the static part of the claimed
identity model has been shifted by the difference between the
channels, and, if the situation is not corrected, will cause an
error when the true speaker is tested using channel-b. Similar
problems involving false acceptances using channel-a will also
occur.

One method of addressing this problem is to remove the mean from
the claimed identity model, but a simple removal of the mean would
at first glance produce

v_a(τ) = cd_a(τ)

where the static part of the speaker model has also been removed.
However, examining equation 1 (the system model including additive
noise)

v(ω) = ((s(ω) + b(ω)) · r(ω) + t(ω)) · c(ω)

if we consider the case during which the speaker pauses,
s(ω) = 0, then

v(ω) = (b(ω) · r(ω) + t(ω)) · c(ω)

and

v(ω) = n(ω) · c(ω)

where n(ω) is a noise signal.
In cepstral form this would be

v(τ) = n(τ) + c(τ) = sn(τ) + dn(τ) + c(τ)

where, as before, sn is the static part of the noise and dn is the
result of the summation of the dynamic part.

The average of a model constructed from this would be

sn(τ) + c(τ)

where sn is any steady state noise such as an interference tone and
c is the channel.

Considering again equation 2 (the claimed identity model build
conditions)

v_a(τ) = cs_a(τ) + cd_a(τ) + h_a(τ)

this was the noise free case; adding a steady state noise gives

v_a(τ) = cs_a(τ) + cd_a(τ) + h_a(τ) + sn(τ)

If we constructed the speaker pause model for this case we would
get

sn(τ) + h_a(τ)

and using this to remove the mean results in

v_a(τ) = cs_a(τ) + cd_a(τ)

This gives us a model unbiased by the channel. A similar process
could be applied to each model, whereby it has the channel bias
removed by its own silence model. The test speaker could be
similarly treated, i.e. its silence model is used to remove the
channel effects.

The removal (reduction) of the channel characteristics using the
silence model as described above requires suitable channel noise
and perfect detection of the silence parts of the utterance. As
these cannot be guaranteed they need to be mitigated (for instance,
if the silence includes some speech we will include some of the
claimed identity speaker's static speech and inadvertently remove
it). Fortunately they can be dealt with in one simple modification
to the process: the cohort models should all be referred to the
same silence model.

That is, we re-add the silence average of the claimed identity
model to all of the models in the cohort (including the claimed
identity model itself). This refers all of the models to the same
mean, sn(τ) + h_a(τ). This normalisation is also applied to the
test model, thereby referring all of the models and the test
utterance to the same reference point. In effect we choose a
reference channel and noise condition and refer all others to it.

This is illustrated diagrammatically in Fig. 19, which shows the
Cepstral coefficients of the test utterance together with the
claimed identity model and the cohort models 1 to m being input to
the classifier 110. A "silence model" or "normalisation model" 400
derived from the claimed identity enrolment data is used to
normalise each of these before input to the classifier, so that the
actual inputs to the classifier are a normalised test utterance,
normalised claimed identity model and normalised cohort models.
Ideally, the normalisation model 400 is based on data from periods
of silence in the claimed identity enrolment sample as discussed
above, but it could be derived from the complete claimed identity
enrolment sample. In practical terms, the normalisation model
comprises a single row of Cepstral coefficients, each of which is
the mean value of one column (or selected members of one column) of
Cepstral coefficients from the claimed identity model. These mean
values are used to replace the mean values of each of the sets of
input data. That is, taking the test utterance as an example, the
mean value of each column of the test utterance Cepstral
coefficients is subtracted from each individual member of that
column and the corresponding mean value from the normalisation
model is added to each individual member of the column. A similar
operation is applied to the claimed identity model and each of the
cohort models.
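
This mean re-referencing operation can be sketched in a few lines:
each matrix of Cepstral coefficients has its own column means
subtracted and the corresponding means from the normalisation model
added, so that every input to the classifier is referred to the
same reference condition. The array shapes are assumptions about
how the matrices are laid out.

```python
# Sketch of the mean re-referencing applied to every classifier
# input (test utterance, claimed identity model, each cohort model).
import numpy as np

def rereference(cepstra, norm_model):
    """cepstra: (n_frames, n_coeffs) Cepstral coefficients;
    norm_model: (n_coeffs,) row of mean coefficients derived from
    the claimed identity enrolment data."""
    return cepstra - cepstra.mean(axis=0) + norm_model

# Applied uniformly, e.g.:
#   test_n   = rereference(test_cepstra, norm_model)
#   cohort_n = [rereference(m, norm_model) for m in cohort_models]
```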

It will be understood that the normalisation model could be derived
from the claimed identity model or from the test utterance or from
any of the cohort models. It is preferable for the model to be
derived from either the claimed identity model or the test
utterance, and it is most preferable for it to be derived from the
claimed identity model. The normalisation model could be derived
from the "raw" enrolment sample Cepstral coefficients or from the
final model after Vector Quantisation. That is, it could be
derived at the time of enrolment and stored along with the
enrolment model, or it could be calculated when needed as part of
the verification process. Generally, it is preferred that a
normalisation model is calculated for each enrolled speaker at the
time of enrolment and stored as part of the enrolled speaker
database.

These normalisation techniques can be employed with various types
of speaker recognition systems but are advantageously combined with
the speaker recognition systems of the present invention.

Speaker recognition systems in accordance with the invention
provide improved real world performance for a number of reasons.
Firstly, the modelling techniques employed significantly improve
separation between true speakers and impostors. This improved
modelling makes the system less sensitive to real world problems
such as changes of sound system (voice sampling system) and changes
of speaker characteristics (due to, for example, colds etc.).
Secondly, the modelling technique is non-temporal in nature so that
it is less susceptible to temporal voice changes, thereby providing
longer persistence of speaker models. Thirdly, the use of filter
pre-processing allows the models to be used for variable bandwidth
conditions; e.g. models created using high fidelity sampling
systems such as multimedia PCs will work with input received via
reduced bandwidth input channels such as telephony systems.

It will be understood that the preferred methods in accordance with
the present invention are inherently suited for use in
text-independent speaker recognition systems as well as
text-dependent systems.

SYSTEMS

The invention thus provides the basis for flexible, reliable and
simple voice recognition systems operating on a local or wide area
basis and employing a variety of communications/input channels.
Fig. 16 illustrates one example of a wide area system operating
over local networks and via the Internet, to authenticate users of
a database system server 400, connected to a local network 402,
such as an Ethernet network, and, via a router 404, to the Internet
406. A speaker authentication system server 408, implementing a
speaker recognition system in accordance with the present
invention, is connected to the local network for the purpose of
authenticating users of the database 400. Users of the system may
obviously be connected directly to the local network 402. More
generally, users at sites such as 410 and 412 may access the system
via desktop or laptop computers 414, 416 equipped with microphones
and connected to other local networks which are in turn connected
to the Internet 406. Other users such as 418, 420, 422 may access
the system by dial-up modem connections via the public switched
telephone network 424 and Internet Service Providers 426.

IMPLEMENTATION

The algorithms employed by speaker recognition systems in
accordance with the invention may be implemented as computer
programs using any suitable programming language such as C or C++,
and executable programs may be in any required form including stand
alone applications on any hardware/operating system platform,
embedded code in DSP chips etc. (hardware/firmware
implementations), or be incorporated into operating systems (e.g.
as MS Windows DLLs). User interfaces (for purposes of both system
enrolment and subsequent system access) may similarly be
implemented in a variety of forms, including Web based client
server systems and Web browser-based interfaces, in which case
speech sampling may be implemented using, for example,
ActiveX/Java components or the like.

Apart from desktop and laptop computers, the system is applicable
to other terminal devices including palmtop devices, WAP enabled
mobile phones etc. via cabled and/or wireless
data/telecommunications networks.

APPLICATIONS

Speaker recognition systems having the degree of flexibility and
reliability provided by the present invention have numerous
applications. One particular example, in accordance with a further
aspect of the present invention, is in providing an audit trail of
users accessing and/or modifying digital information such as
documents or database records. Such transactions can be recorded,
providing information regarding the date/time and identity of the
user, as is well known in the art. However, conventional systems
do not normally verify or authenticate the identity of the user.

Speaker recognition, preferably using a speaker recognition system
in accordance with the present invention, may be used to verify the
identity of a user whenever required; e.g. when opening and/or
editing and/or saving a digital document, database record or the
like. The document or record itself may be marked with data
relating to the speaker verification procedure, or such data may be
recorded in a separate audit trail, providing a verified record of
access to and modification of the protected document, record etc.
Unauthorised users identified by the system will be denied access
or prevented from performing actions which are monitored by the
system.

Improvements and modifications may be incorporated without
departing from the scope of the invention as defined in the
appended claims.


WE CLAIM:
1. A method of speaker verification or identification that uses a cohort
including an enrolment model for each of a predetermined number of speakers,
each enrolment model representing at least one speech sample for each speaker,
the method comprising:
capturing a test speech sample from a speaker claiming to be one of the enrolled speakers;
modelling the test sample to provide a test model, and
classifying the test model by always matching it with one of the enrolment models of the cohort associated with the speaker so that a false acceptance rate for the test sample is determined by the cohort size,
wherein the steps of modelling and/or classifying are such that a false rejection rate for the test sample is substantially zero.
2. A method as claimed in claim 1 comprising:
providing for each speaker of the cohort multiple enrolment models determined using multiple parallel modelling processes;
applying the multiple parallel modelling processes to the test sample to provide corresponding multiple test models, and
classifying each test model by matching it with one of the correspondingly modelled enrolment models of the cohort.
3. A method as claimed in claim 2 wherein the parallel modelling processes comprise applying to the speech samples at least one of: different frequency banding; different spectral modelling; and different clustering.


Patent Number: 239443
Indian Patent Application Number: 89/CHENP/2004
PG Journal Number: 13/2010
Publication Date: 26-Mar-2010
Grant Date: 19-Mar-2010
Date of Filing: 16-Jan-2004
Name of Patentee: SPEECH SENTINEL LIMITED
Applicant Address: 39 CASTLE STREET, EDINBURGH, EH2 3BH, DX ED40 EDINBURGH
Inventors:
1. SAPELUK, ANDREW, THOMAS; 4 BURN STREET, DUNDEE DD3 0LA
PCT International Classification Number: G10L17/00
PCT International Application Number: PCT/GB02/02726
PCT International Filing Date: 2002-06-13
PCT Conventions:
1. Application Number 60/302,501; Date of Convention 2001-07-02; Priority Country U.K.
2. Application Number 0114866.7; Date of Convention 2001-06-19; Priority Country U.K.