Title of Invention

METHOD AND SYSTEM FOR VOICE MESSAGE EDITING

Abstract
This specification relates to a method and apparatus for voice message editing. In particular it relates to a method and apparatus for splicing voicemail sections together. There is described a system, method and computer program product of forming a voice message on an interactive voice response system (IVR) spoken by a user comprising: recording a first voice message wherein the user speaks the first voice message but makes an error in a last portion of the first voice message; recording a second voice message wherein the user speaks the last portion again without the error; determining splice points in the first and second voice messages; and splicing the first and second voice messages at the splice points whereby the spliced first and second voice message is a continuous voice message including the last portion but not including the error.
Full Text

METHOD AND APPARATUS FOR VOICE MESSAGE EDITING
This invention relates to a method and apparatus for voice message editing. In particular it relates to a method and apparatus for splicing voicemail sections together.
BACKGROUND
An interactive voice response system (IVR) is a computer connected to a telephony network and providing integration between the two. The telephony network can be a plain old telephony system such as a line switched telephony network or a packet switched telephony network like a voice-over-internet-protocol (VoIP) network. An IVR typically runs a telephony application that controls the interaction of a user and the IVR. One such telephony application is a voicemail application, which controls the interaction between a user and the computer, either on its own or as part of another voice interaction.
A voicemail telephony application requires the recording of audio messages directly into an IVR system over the telephone. This method of recording can be cumbersome when working with lengthy passages of speech, especially if a mistake is made whilst reading a passage. Often mispronouncing a word or coughing during a lengthy passage creates undesired audio data in the segment. Such undesired audio data (called artefacts below) result in having to re-record the entire message which costs both time and money.
It would be useful to have a voice message system which did not require a complete re-record of the voice message.
SUMMARY OF INVENTION
According to a first aspect of the present invention there is provided a method of forming a voice message on an interactive voice response system (IVR) spoken by a user comprising: recording a first voice message wherein the user speaks the first voice message but makes an error in a last portion of the first voice message; recording a second voice message wherein the user speaks the last portion again without the error; determining splice points in the first and second voice messages; and splicing the first and second voice messages at the splice points whereby the spliced first and second voice message is a continuous voice message including the last portion but not including the error.

Such a system splices a second voice message and a first voice message and eliminates the need for re-recording the whole of the first message.
Advantageously determination of the splice points comprises determining regions of common voice data in the first and second messages wherein the splice points are corresponding points within the regions of common voice data.
In the preferred embodiment the determination of the common regions and the splice points comprises the following steps: performing phoneme recognition on the first and second messages to acquire corresponding first and second recognition phonemes; determining regions of common phonemes in the first and second recognition phonemes; and determining the corresponding regions in the first and second voice messages.
In another embodiment the determination of the common region and splice point is performed by comparing the signal energy over time of the first and second voice messages and determining the most similar energy patterns; the splice point is a low point in the signal, which is assumed to be a space between two words.
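By way of illustration only, the energy-based embodiment might be sketched as below. The sketch assumes that each message is available as a list of numeric samples, that energy is measured as the sum of squared samples over fixed-length frames, and that similarity is measured by a sum of squared differences; the function names, frame length and window size are hypothetical and are not taken from the specification.

```python
def frame_energies(samples, frame_len):
    """Sum of squared samples for each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def best_energy_match(first, second, window):
    """Frame offset in `first` whose energy pattern is closest (least
    squared difference) to the first `window` frames of `second`."""
    target = second[:window]
    best_offset, best_cost = None, float("inf")
    for off in range(len(first) - len(target) + 1):
        cost = sum((first[off + k] - target[k]) ** 2 for k in range(len(target)))
        if cost < best_cost:
            best_offset, best_cost = off, cost
    return best_offset


def lowest_energy_frame(energies, start, end):
    """Index of the quietest frame in energies[start:end]."""
    return min(range(start, end), key=lambda i: energies[i])


# Hypothetical per-frame energies for the two messages.
first_e = [9.0, 8.0, 0.1, 5.0, 6.0, 0.2, 7.0]
second_e = [5.1, 5.9, 0.2, 3.0]

offset = best_energy_match(first_e, second_e, window=3)
splice_frame = lowest_energy_frame(first_e, offset, offset + 3)
```

Here the matched region of the first message begins at frame 3, and the quietest frame inside that region (frame 5) is taken as the candidate splice point, on the assumption that it falls between two words.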
Using phoneme based speech recognition to determine the splice point means that the splice point will lie between two silences in a phrase, without using the full processing resources needed to identify complete words.
Using the phoneme string allows for an excellent comparison of the messages and also allows for an accurate determination of space between two words for the splice point. Using phoneme recognition means that the recognition is unconstrained by a word grammar.
However, in another embodiment word based speech recognition could improve the accuracy of the determination of the splice point whilst using more processing power. Furthermore, using word based speech recognition allows splice points to be precisely located between words, whereas phoneme based recognition relies on the silences between the phonemes.
For example, a caller presses a key when an error is made whilst recording an audio message. The key press causes the ceasing of the first recording; prompts the user to start reading the message from before the recorded mistake; and starts a second recording. By applying a speech recognition engine to the first and second recorded messages it is possible to estimate the location of the error in the first recording and splice the two audio streams together to remove the error. This will greatly speed up and increase the ease of use of recording lengthy audio segments directly into an IVR.
In this implementation the IVR application will record the first audio segment and the user will signal an error by pressing, say, the * key to generate a digital tone. On receipt of the key's digital tone, the IVR will prompt the caller to start speaking from before the error was made. When the supplemental recording is finished, both recordings will be submitted to speech recognition. The returned text from the recognition engine is compared to see where the overlap is, and the timestamps for the matching phrases in each recording will be collected via the speech recognition engine. The two recordings will then be joined together based on these timestamps.
DESCRIPTION OF DRAWINGS
Embodiments of the invention will now be described, by means of example only, with reference to the accompanying drawings in which:
Figure 1 shows an IVR system according to the preferred embodiment; and
Figure 2 shows example speech signals being processed by the preferred embodiment.
DESCRIPTION OF THE EMBODIMENTS
Referring to Figure 1 there is shown an interactive voice response system (IVR) 10 connected through a telephony switch (PSTN) 12 to a caller 14. Caller 14 is one of many callers that could be connected to the IVR 10. IVR 10 comprises: audio recording application 16 and voice recognition unit 18. Audio recording application 16 is a control program for performing method steps 102, 104, 106, 108, 110, 112 and 114. Voice recognition unit 18 comprises a control program for performing speech recognition steps 111 and 113. In this description, the user is referred to as a caller since the IVR calls are normally considered incoming from the user; however, the call may be outgoing with the user being the called party.

Audio recording application 16 is activated when a caller connects to the IVR and makes a request to leave a voice message.
Step 102 of the audio recording application triggers the IVR to play a prompt to the caller to request the recording of the first voice message. In this embodiment the prompt states "Please record audio passage X after the tone, press * if you make a mistake and # when you're finished". The caller understands that when a mistake has been made in recording then he should press the * key. If no mistake is made then he understands to press the # key at the end of the voice message.
In step 104, the audio recording application records the caller's voice as a first voice message and monitors for a key press. If the # key is pressed then the end of the voice recording has been received without an error and the process moves to step 106. If the * is pressed then an error has been received and the process moves to step 108.
In step 106, the audio recording application stores the recording made in step 104 as a first voice message and prepares for another voice message to be recorded by returning control to step 102. At this stage the caller may also quit the audio recording application if satisfied with the recorded audio. Step 106 also takes a recording input from step 114; in this case the recording is a spliced recording made in step 114.
In step 108, the audio recording application directs the IVR to play a prompt to the caller to request the recording of the second voice message. In the preferred embodiment the prompt states "Please begin speaking from the sentence before you made the mistake, press # to terminate". The caller understands that they must start speaking from a point in the sentence before the error was made and then to finish the recording by pressing # at the end. The caller's voice is recorded as the second voice message.
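The key-press handling of steps 104 to 108 can be pictured with the following minimal sketch. It assumes, purely for illustration, that the telephony layer delivers a stream of ("audio", chunk) and ("key", digit) events; that event model and the function names are hypothetical and do not come from the specification.

```python
def record_until_key(events):
    """Collect audio chunks until '#' or '*' is pressed.

    Returns (audio_chunks, key); key is None if the event stream ends
    without a terminating key press.
    """
    audio = []
    for kind, value in events:
        if kind == "key" and value in ("#", "*"):
            return audio, value
        if kind == "audio":
            audio.append(value)
    return audio, None


def record_passage(events):
    """Return (first_message, second_message); second_message is None
    when the caller finished the first recording without an error."""
    events = iter(events)  # share one iterator across both recordings
    first, key = record_until_key(events)   # step 104
    if key != "*":
        return first, None                  # step 106: store as recorded
    second, _ = record_until_key(events)    # step 108: re-record the tail
    return first, second


# Hypothetical event stream: the caller makes a mistake, presses *,
# re-records and finishes with #.
call = [("audio", "a1"), ("audio", "a2"), ("key", "*"),
        ("audio", "b1"), ("key", "#")]
first_msg, second_msg = record_passage(call)
```

In this sketch a second * press during step 108 simply ends the second recording; how repeated errors should be handled is not addressed here.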
In step 110, both first and second voice messages are submitted to the voice recognition unit 18 and first and second recognition texts are returned. Each recognition text comprises a string of text labels corresponding to the voice message.
In step 111, the voice recognition unit processes the first and second voice messages. In the preferred embodiment, only partial speech recognition is performed and the text labels returned are the phonemes corresponding to the voice data. In another embodiment human readable text is returned.
In step 112, the audio recording application locates the matching text labels in the first and second text messages. The positions of the matching labels in the first and second text messages are inserted into a query for the voice recognition unit. Using the position of the matching labels the voice recognition unit locates a first timestamp in the first voice message and a second timestamp in the second voice message. These timestamps represent the splice points in the first and second voice messages.
In step 113, the voice recognition unit receives the timestamp query and returns the first and second timestamps. In another embodiment the timestamps for the text labels may be supplied by the voice recognition unit at the same time as the text labels.
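As a non-authoritative sketch of steps 112 and 113, the example below assumes the embodiment in which the recognition unit supplies a start timestamp together with each text label. The (label, start_seconds) pair layout, the function name and the sample timestamps are hypothetical. The longest common run of labels is found with Python's standard difflib module, which stands in for whatever matching the audio recording application actually uses.

```python
from difflib import SequenceMatcher


def overlap_timestamps(first_labels, second_labels):
    """Each argument is a list of (label, start_seconds) pairs.

    Returns (first_timestamp, second_timestamp) marking where the longest
    common run of labels starts in each message, or None if the messages
    share no label.
    """
    a = [label for label, _ in first_labels]
    b = [label for label, _ in second_labels]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return None
    return first_labels[m.a][1], second_labels[m.b][1]


# Hypothetical recognition output with made-up timestamps.
first = [("fire", 0.0), ("of", 0.4), ("expectations", 0.6), ("of", 1.4),
         ("further", 1.6), ("rises", 2.0), ("<noise>", 2.5)]
second = [("expectations", 0.3), ("of", 1.0), ("further", 1.2),
          ("rises", 1.6), ("to", 2.0), ("come", 2.2)]

splice_points = overlap_timestamps(first, second)
```

With this input the common run is "expectations of further rises", so the splice points are 0.6 seconds into the first message and 0.3 seconds into the second.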
In step 114 the located first and second timestamps are used to splice the first and second voice segments together. The process then returns to step 106 where the spliced voice segment is recorded and the audio application exits or prepares for the next passage.
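The splice of step 114 might look like the following minimal sketch, assuming that each recording is held as a list of samples at a known sample rate and that the timestamps are expressed in seconds. The function name and the sample rate used in the example are illustrative assumptions only.

```python
def splice(first, second, first_ts, second_ts, sample_rate):
    """Keep `first` up to first_ts and append `second` from second_ts on.

    Timestamps are in seconds; both recordings are sequences of samples
    captured at `sample_rate` samples per second.
    """
    cut_first = round(first_ts * sample_rate)
    cut_second = round(second_ts * sample_rate)
    return first[:cut_first] + second[cut_second:]


# Tiny made-up recordings at 10 samples per second.
first_rec = list(range(10))          # samples 0..9
second_rec = list(range(100, 110))   # samples 100..109
joined = splice(first_rec, second_rec, first_ts=0.6, second_ts=0.2,
                sample_rate=10)
```

The result keeps the first six samples of the first recording and everything from the third sample of the second recording onward. A practical implementation would probably also smooth the join, which is outside the scope of this sketch.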
The preferred embodiment relates to a voice mail IVR but such a solution could be used in other voice applications such as personal voice dictation.
An example of the words, speech signal and phonetic transcription involved in splicing two voice messages is shown in Figure 2. The user speaks the first voice message Speech A "The current service is unavailable, please try later" which is shown at the top of Figure 2. Directly underneath Speech A in Figure 2 is Recording A - an example of the signal amplitude against time corresponding to the words in Speech A. Directly underneath Recording A is Phonetic transcription A representing the phoneme string derived from Recording A by the Voice Recognition Unit. The time scale in milliseconds is shown along the x axis under Phonetic transcription A. For any given text, or phoneme string or signal, the voice recognition unit can return the start timestamp and the end timestamp.
The user speaks the second voice message Speech B "please try again later" shown under the time scale axis in Figure 2. Directly underneath Speech B is Recording B - an example of the signal against time corresponding to the words in Speech B. Directly underneath Recording B is Phonetic transcription B representing the phoneme string derived from Recording B by the Voice Recognition Unit. The time scale in milliseconds along the x axis only represents the relative time scale for Phonetic transcription B.
The resultant strings of phonemes are shown below:
Phonetic transcription A "thequrreentservisizunavaylablpleeztriylayter"
Phonetic transcription B "pleeztriyagaynlayter"
The goal of alignment is to find where the second message begins in the first message. The preferred embodiment finds the longest common substring of the two - in this case the 'pleeztriy' substring. A user of this embodiment will tend to say a similar or identical phrase as a reference to align against, then say something different (the intended correction). The preferred method is to retain the audio of the first voice message up to the splice point (just before the first occurrence of the phonemes similar to the second voice message) and add all audio from the second voice message.
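The alignment just described can be reproduced on the two phonetic transcriptions with a short sketch. It uses the longest-match search in Python's standard difflib module as one possible way of finding the longest common substring; the specification does not say how the preferred embodiment computes it, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def find_splice(first_labels, second_labels):
    """Return (first_start, second_start, size) of the longest run of
    labels common to both sequences."""
    matcher = SequenceMatcher(None, first_labels, second_labels,
                              autojunk=False)
    m = matcher.find_longest_match(0, len(first_labels),
                                   0, len(second_labels))
    return m.a, m.b, m.size


transcription_a = "thequrreentservisizunavaylablpleeztriylayter"
transcription_b = "pleeztriyagaynlayter"

start_a, start_b, size = find_splice(transcription_a, transcription_b)
common = transcription_a[start_a:start_a + size]

# Keep the first message up to the splice point, then add the second
# message from its matching point (here its very beginning).
spliced = transcription_a[:start_a] + transcription_b[start_b:]
```

For these transcriptions the common region is 'pleeztriy', and the spliced phoneme string corresponds to "The current service is unavailable, please try again later". The same indices would then be converted to timestamps in order to cut the audio itself.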
The preferred embodiment works well with good voice recognition; imperfect recognition accuracy will introduce errors in the form of insertions, repetitions, substitutions and deletions of phonemes. In this case a more complicated matching algorithm can be used which takes into account the possibility of recognition errors.
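The specification does not name the more complicated matching algorithm. One well-known candidate, offered here only as an assumption, is Smith-Waterman local alignment, which tolerates substituted, inserted and deleted labels. The scoring values and the function name below are hypothetical choices.

```python
def local_align(first, second, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment of two label sequences.

    Returns (score, first_start, second_start): the best local alignment
    score and the index in each sequence where the aligned region begins.
    """
    rows, cols = len(first) + 1, len(second) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if first[i - 1] == second[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if first[i - 1] == second[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, j


transcription_a = "thequrreentservisizunavaylablpleeztriylayter"
# Second transcription with a made-up recognition error ('z' heard as 's').
noisy_b = "pleestriyagaynlayter"

_, a_start, b_start = local_align(transcription_a, noisy_b)
```

Despite the substituted phoneme, the aligned region still begins at 'pleeztriy' in the first transcription and at the start of the second, so the same splice point is recovered as in the error-free case.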
A further example is now described when full speech recognition is used and this example is not illustrated.
A user reads: "... Two rises in three months would have poured fuel on the fire of expectations of further rises to come, potentially ****". In this example **** represents a cough, another noise made in error, or any other error. The IVR records this as a first voice message.
The user keys: * and continues to read: "expectations of further rises to come, potentially pushing the pound to damaging levels against the dollar...". The IVR records this reading as a second voice message.
The user keys: # to terminate the recording of the second voice message.

The two voice messages are returned as text messages from a speech recognition engine:
First text message: "... Two rises in three months would have poured fuel on the fire of expectations of further rises to come, potentially #####" where the ##### represents the letters returned by the recogniser representing the error.
Second text message: "expectations of further rises to come, potentially pushing the pound to damaging levels against the dollar..."
The texts of the voice messages are further processed to determine the overlapping parts:
First text message: "... Two rises in three months would have poured fuel on the fire of expectations of further rises to come, potentially #####"
Second text message: "expectations of further rises to come, potentially pushing the pound to damaging levels against the dollar..." where the overlapping part, present in both messages, is "expectations of further rises to come, potentially".
Timestamps (in seconds) in the voice messages corresponding to the beginning and end of the matched phrase in the first text message and the second text message are acquired.
First voice message: 05:06:43 - 05:09:90 seconds
Second voice message: 00:02:81 - 00:05:27 seconds
The first and second voice messages are joined based on the acquired timestamps: Final voice message = first voice message (00:00:00 - 05:06:43) + second voice message (00:02:81 - end of audio)
The audio recording application of the preferred embodiment is part of a messaging system on an IVR server and telephony system. However, in an alternative embodiment the audio recording application may be part of the telephone or client device and interacts to leave a completed message with a messaging system on an IVR server. In this alternative embodiment the client may need to download the audio recording program from a server before execution.
In summary, this specification relates to a method and apparatus for voice message editing. In particular it relates to a method and apparatus for splicing voicemail sections together. There is described a system, method and computer program product of forming a voice message on an interactive voice response system (IVR) spoken by a user comprising: recording a first voice message wherein the user speaks the first voice message but makes an error in a last portion of the first voice message; recording a second voice message wherein the user speaks the last portion again without the error; determining splice points in the first and second voice messages; and splicing the first and second voice messages at the splice points whereby the spliced first and second voice message is a continuous voice message including the last portion but not including the error.





CLAIMS
We Claim:
1. A method of forming a voice message spoken by a user on an interactive voice response system (IVR) comprising:
recording a first voice message wherein the user speaks the first voice message but makes an error in a last portion of the first voice message;
recording a second voice message wherein the user speaks the last portion again without the error;
determining splice points in the first and second voice messages;
and
splicing the first and second voice messages at the splice points whereby the spliced first and second voice message is a continuous voice message including the last portion but not including the error.
2. A method according to claim 1 wherein determination of the splice points comprises determining regions of common voice data in the first and second messages wherein the splice points are corresponding points within the regions of common voice data.
3. A method according to claim 2 wherein determination of the common regions and the splice points comprises the following steps:
performing speech recognition on the first and second messages to acquire corresponding first and second recognition text;
determining regions of common text in the first and second recognition text; and
determining the corresponding regions in the first and second voice messages.

a phoneme string and not a language word so to reduce the amount of processing used to determine the splice point.
5. A method according to claim 2 wherein determination of the common region and splice point is performed by comparing the signal energy over time of the first and second voice messages and determining the most similar energy patterns, the splice point is a low point in the signal which is assumed to be a space between two words.
6. A system of forming a voice message on an interactive voice response
system (IVR) spoken by a user, the IVR system coupled to a voice recording
unit, wherein the system is capable of performing the method as claimed in
any of the claims 1 to 5.
7. The system of claim 6 wherein the IVR system comprises an audio recording application configured to record voice messages.
8. The system of claim 6, wherein the voice messages are input via a calling device.
9. A voice signal from a calling device transmitted over a communication network when received by IVR system is configured to perform the method as claimed in any of the claims 1 to 5.





Patent Number 253438
Indian Patent Application Number 3977/CHENP/2007
PG Journal Number 30/2012
Publication Date 27-Jul-2012
Grant Date 20-Jul-2012
Date of Filing 12-Sep-2007
Name of Patentee INTERNATIONAL BUSINESS MACHINES CORPORATION
Applicant Address ARMONK, NEW YORK 10504
Inventors:
# Inventor's Name Inventor's Address
1 GROEGER, STEVEN 15 CHAFFINCH CLOSE CREEKMOOR CREEKMOOR POOLE HAMPSHIRE BH17 7UR
2 HAYTON, STUART, JOHN 12 KHANDALA GARDENS WATERLOOVILLE HAMPSHIRE PO7 5UA
3 LISTER, JAMIE WALTON COTTAGE OTTERBOURNE ROAD SHAWFORD WINCHESTER, HAMPSHIRE SO21 2DG
4 POULTNEY, TIMOTHY, DAVID 49 AVENUE ROAD SOUTHAMPTON HAMPSHIRE SO14 6TW
PCT International Classification Number G10L 21/00
PCT International Application Number PCT/EP06/50403
PCT International Filing date 2006-01-24
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 0503162.0 2005-02-16 U.K.