Title of Invention	A METHOD FOR INFORMATION RETRIEVAL IN A SEARCH SYSTEM
Abstract	A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and / or symbols, or sequences thereof, in a text T, a metric M including combined edit distance metrics for an approximate degree of matching respectively between words and / or symbols, or between sequences thereof, in the text T and a query Q, the latter distance metric including weighting cost functions, for edit operations which transform a sequence S of the text into a sequence P of the query Q, and search algorithms for determining the degree of matching respectively between words and / or symbols, or between sequences thereof, in respectively the text T and the query Q, such that information R is retrieved with a specified degree of matching with the query Q. Optionally the search system also includes algorithms for determining exact matching such that information R may be retrieved with an exact degree of matching with the query Q.

Title of Invention

A METHOD FOR INFORMATION RETRIEVAL IN A SEARCH SYSTEM

Abstract

A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and / or symbols, or sequences thereof, in a text T, a metric M including combined edit distance metrics for an approximate degree of matching respectively between words and / or symbols, or between sequences thereof, in the text T and a query Q, the latter distance metric including weighting cost functions, for edit operations which transform a sequence S of the text into a sequence P of the query Q, and search algorithms for determining the degree of matching respectively between words and / or symbols, or between sequences thereof, in respectively the text T and the query Q, such that information R is retrieved with a specified degree of matching with the query Q. Optionally the search system also includes algorithms for determining exact matching such that information R may be retrieved with an exact degree of matching with the query Q.

Full Text	A search system and method for retrieval of data, and the use thereof in a search engine The present invention concerns a search svstem for information retrieval, particularly information stored in form of text, wherein a text 7" comprises words and/or symbols and sequences thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query O, wherein the query 0 comprises words and/or symbols and sequences thereof, arid retrieved information R, comprising words and./or symbols and sequences thereof from the text T, wherein the search system comprises a data structure for storing at least a part of the text T, and a metric M which measures the degree of matching between the query O and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw: a method in a search system for information retrieval, particularly information stored in form of text, wherein a text T comprises words -and symbols and sequences thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query 0, wherein the query O comprises words and/or symbols and sequences thereof, and retrieved information R comprising words and/or symbols and sequences thereof from the text 7", wherein the search system comprises a data structure for storing at least a part of the text T, and a metric M which measures the degree of matching between the query O and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw, wherein the information in the text is divided into words and word sequences, the words being substrings of the entire text separated by-word boundary terms and forming a sequence of symbols, and wherein each word is structured as a sequence of symbols. The invention also concern me use of the search system. A tremendous amouni of information in various fields of human knowledge is collected and stored in computer memory systems. As the computer memory systems increasingly are linked in public available data communication networks, there has been an increasing effort to develop systems and methods for searching and retrieving information for public or personal use. Present search methods for data have, however, limitations that seriously reduce the possibility 01 etnciently retrieving ana using imormation stored in this manner. Information may be stored in the form of different data types, and in the context of information search and retrieval it will be useful to discern between dynamic data and static data. Dynamic data is data that change often and continuously, so that the set of valid data varies all the time, while static data only changes very seldom or never at all. For instance will economic data, such as stock values, or meteorological data be subject to very quick changes and hence dynamic. On the other hand archival storage of books and documents are usually permanent and static data. The concept the volatility of the data relates to how long the information is valid. The volatility of data has some bearing upon how the information should be searched and retrieved. Large volumes of data require some structure in order to facilitate searching, but the time cost of building such structures must not be higher than the time the data is valid. The cost of building a structure is dependent on the data volume and hence the building of data structures for searching the information should take both the data volume and the volatility into consideration. The information collected are stored in databases and these may be structured or unstructured. Moreover, the daiabases may contain several types of documents, including compound documents which contain images, video, sound and formatted or annotated text. Particularly structured databases are usually furnished with indexes in order to facilitate searching and retrieving the data. The growth of the World Wide Web (WWW) offers a steadily growing collection of compound and hyperlinked documents. A great many of these are not collected in structured databases and no indexes facilitating rapid searching are available. However, the need for searching documents in the World Wh"de Web is obvious and as a result a number of so-called search engines has been developed, enabling searching at least parts of the information in the World Wride Web. With a search engine it is commonly understood one or more tools for searching and retrieving information. In addition to the search system proper, a search engine also contains an index, for instance comprising text from a large number of uniform resource locators (URLs ). Examples of such search engines are Alta Vista, HotBot with Inktomy technology, Infoseek, Excite and Yahoo. All these offer facilities for performing search and retrieval of information in the World Wide Web. However, their speed and efficiency do by no means match the huge amount of information available on the World Wide Web and hence the search and retrieval efficiency of these search engines leaves much to be desired. Searching a large coilec:ion of text documents can usually be done with several query types. The most common query type is matching and variants of this. By specifying a keyword or set of keywords that has to be present in the queried information the search system retrieves all documents that fulfils this requirement. The basic search method is based on so-called single keyword matching. The keyword p is searched for and all documents containing this word shall be retrieved. It is also possible to search for a keyword prefix D, and all documents where this prefix is present in any keyword in the documents, wiil be retrieved. Instead of searching with keywords, the search is sometimes based on so-called exact phrase matching, where the search uses several single keywords in particular sequence. As well-known by persons skilled in the art, the exact matching of keyword phrases in many search systems may be done with the use of Boolean operators, for instance based on operators such as AND, OR, and NOT which allow a filtering of the information; e.g. using an AND phrase results in that all documents containing the two keywords linked by the AND operator will be returned. Also a NEAR operator has been used for returning just the documents with the keywords matching and located "near" to each other in the document text. In many structured database the documents contained in the database have been annotated, e.g. provided with fields which denote certain parts or types of information in the document. This allows the search for matches in only parts of the documents and is useful when the type of queried information is known in advance. When searching in text documents the data are structured and most likely present in some natural language, like English, Norwegian etc. When searching for documents with a certain context it is possible to apply proximity metrics for matching keywords or phrases that match the query approximately. Allowing errors in keywords and phrases are common method for proximity, using a thesaurus is another common method. A proximity search requires only that there shall be a partial match between the information retrieved and the query. International published application WO96/00945 titled "Variable length data sequence matching method and apparatus" (Doringer &. al.) which has been assigned to International Business Machines, Corp., discloses the building, maintenance and use of a database with a trie-like structure for storing entries and retrieving at least a partial match, preferably the longest partial match or all partial matches of a search argument (input key) from the entries. In order to further illuminate the general prior art mention can be made of international published patent application VJ092J15954 (Kimball &. a]., assigned to Red Brick System, U.S.A.) and US patent no. 5 627 748 (Baker & al., assigned to Lucent Technologies, Inc., U.S.A.), both disclosing data structures in the form of suffix trees for searching/matching in a square matrix. Neither of these two publications disclose anything beyond a regular suffix tree, except for the use of a linked list during matching and do not teach or suggest approaches to limit the search space when searching for approximate matches. However, such approaches would be most desirable when applying data structures based on suffix trees to searching, particularly for approximate matches in extremely large document collections, such as may be found on the World Wide Web. The main object of the present invention is thus to provide a search system and a method for fast and efficient search and retrieval of information in large volumes of data. Particularly it is an object of the present invention to provide a search system suited for implementing search engines for searching of information systems with distributed large volume data storage, for instance Internet. It is to be understood that the search system according to the invention by no means shall be limited to searching and retrieving information stored in the form of alphanumeric symbols, but equally well may be applied to searching and retrieving information stored in the form of disitalized images and graphic svmbols, as the word text used herein also may interpreted as images when these are represented wholly or partly as sets of symbols. It is also to be understood that the search system according to the invention can be implemented as software written in a suitable high-level language on commercially available computer systems, but it may also be implemented in the form of a dedicated processor device for searching and retrieving information of the aforementioned kind. The above-mentioned objects and advantages are realized according to the invention with a search system which is characterized in that the data structure comprises a tree structure in the form of a non-evenly spaced sparse suffix tree ST( J) for storing suffixes of words and/or symbols, and sequences thereof in the text T, that the metric JW comprises a combination of an edit distance metric for an approximate degree of matching between words and/or symbols in the text T and a query 0 and an edit distance metric for an approximate degree of matching between sequences S of words and/or symbols in the text T and a query sequence P of words and/or symbols in the query O, the latter edit distance metric including weighting cost functions for edit operations which transform a sequence S of words and/or symbols in the text T into the sequence P of words and/or symbols in the query 0, the weighting taking place with a value proportional to a change in the length of sequence upon a transformation or dependent on the size of the words and/or symbols in sequences to be matched, that the implemented search algorithms comprise a first algorithm for determining the degree of matching between words and/or symbols in the suffix tree representation of respectively the text T and a query O, and a second algorithm for determining the degree of matching between sequences of words and/or symbols in the suffix tree representation of respectively the text T and the query O, said first and/or second algorithms searching the data structure with queries Q in the form of either words, symbols, sequences of words or sequences of symbols or combinations thereof, such that information R is retrieved on the basis of query O with a specified degree of matching between the former and the latter, and that the search algorithms optionally also comprise a third algorithm for determining exact matching between words and/or symbols in the suffix tree representation of respectively the text T and the query O and/or a fourth algorithm for determining exact matching between sequences of words and/or symbols in the suffix tree representation of respectively the text T and the query 0, said third and/or fourth algorithms searching the data structure with queries 0 in the form of either words, symbols, sequences of words, or sequences of symbols or combinations thereof, such that information R is retrieved on the basis of the query O with an exact matching between the former and the latter. In an advantageous embodiment of the search system according to the invention the suffix tree ST(7) is a word-spaced sparse suffix tree SSTws(r), comprising only a subset of the suffixes in the text T. Preferably is then the word space sparse suffix tree SSTWS(J) a keyword-spaced sparse suffix tree SSTkws(r). In further advamaseous embodiments of the search svstem according to the invention ihe first algorithm for detecting the degree of keyword matchins in a keyword-spaced sparse suffix tree SST-KWS(7) is implemented as disclosed by dependent claim i, the second algorithm for determining the degree of sequence matching in a keyword-spaced sparse suffix tree SSTkws(J) implemented as disclosed by dependent claim 5, whereby a subroutine of the second algorithm preferably is implemented as disclosed by dependent claim 6, the third algorithm for determining an exact keyword matching in a keyword-spaced sparse suffix tree SSTkws(T) implemented as disclosed by dependent claim 7, and finally the fourth algorithm for determining an exact keyword sequence matching in a keyword-spaced sparse suffix tree SSTfcws implemented as disclosed by dependent claim 8. The above-mentioned objects and advantages are also realized according to the invention with a method which is characterized by generating the data structure as a word-spaced sparse suffix tree SSTws(r) of a text T for representing all the suffixes starting at a word separator symbol in the text T, storing sequence information of the words in the text Tin the word-spaced sparse suffix tree SSTWE(r), generating a combined edit distance metric M comprising an edit distance metric D{s,q) for words s in the text T and a query word a in a query 0 and a word-size dependent edit distance metric DWS(S,P) for sequences S of words in the text T and a sequence P of words q in the query Q, the edit distance metric DWS(S,P) being the minimum sum of costs for edit operations transforming the sequence S into the sequence P, the minimum sum of costs being the minimum sum of cost functions for each edit operation weighted by a parameter proportional to the change in the total length of the sequence S or by the ratio of the current"word length and average word length in the sequences S;P, and determining the degree of matching between words s,q by calculating the edit distance D(s.q") between the words s of the retrieved information R and the word q of a query Q, or in case the words s,q are more than k errors from each other, determining the degree of matching between the word sequences SR, PO of retrieved information R and a query 0 respectively by calculating the edit distance Dw,(^,?o) for all matches. Advantageously the method according to the invention additionally comprises weighting an edit operation which changes a word s into word q with a parameter for the proximity between the characters of the words s;q, thus taking the similarity of the words s;a in regard when determining the cost of the edit operation in question. In an advantageous embodiment of the method according to the invention the number of matches is limited by calculating the edit distance by calculating the edit distance DW!(^L!?O) for restricted number of words in the query word sequence P0. In another advantageous embodiment of the method according to the invention the edit distance D(s,qJ between word s and a word q is defined recursively and the edit distance D(s,q) calculated by means of a dynamic programming procedure, and the edit distance DV,S(S,P) between sequences S and a sequence P is correspondingly recursively defined and the edit distance DWS(S,P) calculated by means of a dynamic programming procedure. According to the invention the above-mentioned objects and advantages are also realized with the use of the search system according to the invention in an approximate search engine. The search system and the method according to the invention shall now be discussed in greater detail in the following with reference to the accompanying drawing figures, of which fig. 1 shows an example of a suffix tree, fig. 2 examples of word-spaced sparse suffix trees as used with the present invention, fig. 3 an example of a so-called PATRICIA trie as known in prior art, fig. 4 a further example of a word-spaced sparse suffix tree as used with the present invention, fig. 5 an example of explicitly stored word sequence information as used with the present invention, fig. 6 a leaf node structure as used with the present invention, and fig. 7 schematically the structure of a search engine with the search system according to the present invention. The search system according to the invention consists essentially of three parts, namely the data structure, the metrics for approximate matching and the search algorithm. When full text retrieval is the target, as essentially will be the case with the search system according to the present invention, then the entire data set which shall be retrievable, will be stored in 2 data structure which supports a high query* performance. The basic concepts underlying the present invention shall first be discussed in some detail. Stored information in the form of text T is divided into words s and word sequences S. Words are substrings of the entire text separated by word boundary terms. The set of word boundary terms is denoted BTword. A common set of word boundary terms could be the set {", ":", "\t", " ", "\0[ ".", "; \ "?"} where \t denotes a tab character, denotes a linefeed character and \0 denotes an end-of-document indicator. In connection with the following description of the present invention it will be useful with some definitions concerning strings and sequences. Definition 1: String A string is a sequence of symbols taken from an alphabet, such as the ASCII characters. Then the length of 2 string is the number of instances of symbols or characters comprising the string, and is denoted \x\. If x has the length m the string may also be written as x}x2...X;...xmi where x, represents the z"th symbol in the string. A substring of x is a string given by a contiguous group of symbols within x. Thus, a substring may be obtained from x by deleting one or more characters from the beginning or the end of the string. Definition 2: Substring, suffix and prefix A substring of x is a string xl = X,X^J...XJ for some 1 Also the notion of a word sequence will be used. Definition 3: Word sequence A word sequence is a sequence of separated, consecutive words. A. word sequence S = si,s2l ...,s„ consists of rc single words (or strings) si,S2, up to s„. Word sequences are delimited by sequence boundary terms. The set sequence boundary terms are denoted STseq. A common set of sequence boundary terms could be the set {"OV}. where \0 indicates an end-of-document marker. The concept approximate word matching can be described as follows. Given a siring s = s:s2-.-sn and a query term q = q,q2...qm. Then the task is to find ail occurrences of a in 5 thai is at most k errors away from the original query term a. A proximity metric determines how to calculate the errors between p and a potential match s;...Sj. A common metric for approximate word matching is the Levenstein distance or edit distance (V.I. Levenstein, "Binary codes capable of correcting deletions, insertions, and reversals", (Russian) Doklady Akademii nauk SSSR, Vol. 163, No. 4, pp. 845-8 (1965); also Cybernetics and Control Theory, Vol. 10, No. 8, pp. 707-10, (1966)). This metric is denned as the minimum number of edit operations needed to transform one string into another. An edit .operation is given by any rewrite rule, for instance: It is also possible to define an approximate matching on the level of words in a word sequence and this can be described as follows. Given a text T consisting of the n words w;lWj...wfl where each of the words is a string of characters. A sequence partem P consists of the m words p>, P2. ■■•• Pm- The sequence pattern P is said to have an approximate occurrence in T if the sequence;?;, p2 pm differs with at most t errors from a sequence w,-, Wj-j, ..., w,- for some ij, such that 1 A text that shall be retrieved in a search system must be indexed in a manner which facilitates searching the data. Consequently the data structure is a kernel data structure of the search system according to the present invention and is based on so-called suffix trees and particularly a sparse suffix tree. These two kinds of structures shall be defined in the following. A suffix tree S(T) is a tree representation of all possible suffixes in the text T. All unary nodes in a suffix tree S(T) are concatenated with its child to create a compact variant. Fig. 1 shows the suffix tree for the text T= "structure". Even more particularly the present invention is based on sparse suffix trees. These were introduced by J. Karkkainen & E. Ukkonen, in "Sparse Suffix Trees", Proceedings of the Second Annual International Computing and Combinatorics Conference (COCOON "96), Springer Verlag, pp.2 19-230, which again was based on ideas published by D.R. Morrison, "PATRICIA -Practical Algorithm To Retrieve Information Coded in Alphanumeric", Journal of the ACM, 15, pp. 514-534 (1968). A sparse suffix tree is defined as follows. Definition 4: Sparse suffix tree A sparse suffix tree SST(T) of the text T is a suffix tree, containing only a subset of the suffixes present in the suffix tree ST(T) of the text. "When using the search system according to the present invention searching for entire words, advantageously a non-evenly spaced sparse suffix tree may be created by storing suffixes starting at word boundaries only. The concept words-spaced sparse suffix tree is defined as follows. Definition 5: Word-spaced sparse suffix tree A word-spaced sparse suffix tree SSTWS(T) of a text T is a sparse suffix tree SST(2~) containing only the suffixes starting-at a word separator character in the text. Fig. 2 shows two examples of word-spaced sparse suffix trees. Parts of the suffixes have been omitted to enhance the readability. The word-spaced sparse suffix tree for T= "to be the best" is the left structure, and T= "to make the only major modification" is the right structure in fig. 2. In :he search system of the present invention the text is naturally divided into words which are stored independently in the word-spaced sparse suffix tree. As the atomic search term for searching is the word itself, advantageously each suffix will be terminated at the end of the word. This reduces the sparse suffix tree to a so-called PATRICIA trie (Morrison, op.cit.). A trie as defined in the literature is a rooted tree with the properties that each node, except the root contains a symbol of the alphabet and zhat no two children of the same node contain the same symbol. It should be noted that the word trie derives from the word "retrieval" and hence indicates that the trie is a tree structure suitable for retrieval of data. A PATRICIA trie is defined as a keyword-spaced sparse suffix tree (KWS tree) where the suffixes stored in the leaf nodes are limited by keyword delimiters. An example of a PATRICIA trie for the set of keywords {"avoid", "abuse", "be", "become",. "breathe", "say"} is shown in fig. 3. The structure used in the search system of the present invention differs from the PATRICIA trie because the search system explicitly stores sequence information of the words. Reducing the suffix length requires that the representation of the leaf node is changed. Pointers to the original text are replaced by the suffix string itself. A suffix length reduction of this kind is shown in fig. 4 for the same two strings as shown in fig. 2. In other words fig. 4 shows word-spaced sparse suffix trees with suffixes cut off at word boundaries. The word-spaced sparse suffix tree for T= "to be the best" is shown at left and the word-spaced sparse suffix tree for T = "to make the only major modification" is shown at right in the figure. A leaf node will contain a list of all positions where the word represented by the leaf node occurs. Instead of using the implicit sequence of information-found in the original text, the present invention explicitly stores sequence information in the word-spaced sparse suffix tree. This is done by using pointers between the leaf nodes that represent consecutive words in the original text. As at least all the occurrences of the word represented by a particular leaf node are available, a pointer must be added to the next consecutive leaf. A leaf node contains only the suffix of the word it represents, so when traversing the sequence pointers in the occurrence list only the suffixes of each of the consecutive words are revealed. This is handled by storing the entire word in the leaf node instead of just the suffix and thus also data structure of the invention differs from the PATRICIA trie in this respect. The data structure for explicitly stored word sequence information with an occurrence list with pointers to the next consecutive word and to its occurrence is shown in fig. 5. The search system according to the present invention uses a PATRICIA trie for organizing the occurrence list (Morrisou, op.cit.). The PATRICIA trie enables the search system to access the list of all consecutive words matching the stringy in a time 0([p2\), where ]p2\ of course is the length of p2. By using a PATRICIA trie to organize the list of occurrences, a completely defined tree structure is obtained for storing words from a text and maintaining the sequence information. A typical leaf node, with both a PATRICIA trie for the organized occurrence list and the extra unsorted list of occurrences, is shown in fig. 6. A.s an example the memory requirement for an occurrence list as used in the search system of the present invention, a database with about 742358 documents has a total of 333 856 744 words and a lexicon of 538 244 distinct words. The total size of the database is 2054.52 MB. The average word length is thus 6.45 bytes. A sparse suffix tree will use 8 bytes for each internal node, using 32 bit pointers. It is assumed that an average of 3 internal nodes is used for each word. The leaf node would then require 6.45 bytes for storing the entire word plus 32 bits for a pointer to an occurrence list. A total of 34.45 bytes/word gives a total size of 18.108 MB. In addition the occurrence list has the size of 4 bytes per entry and 12 bytes if the full version is to be used. Hence the total memory requirement of the occurrence list varies from 1273 MB to 3820 MB. The data structure using a sparse suffix tree will have a size between 60% to 200% of the original text. This is comparable with the requirements of an inverted file, but the sparse suffix tree as used in the search system according to the invention provides much faster searching, enables approximate matching and makes sequence matching easy to perform. In approximate searching, a metric is used to give an error measure of a possible match. The search system according to the present invention employs several metrics, and particularly a unique combination of metrics. These metrics along with the combined metric shall be discussed in the following. .An edit distance metric as defined above allows the operations deletion, insertion and change which intuitively apply to words as well as characters. Common errors in matching phrases are missing, extra or changed words. Hence the edit distance metric as previously defined shall be adapted and extended in order to apply to the approximate word sequence matching problem. Edit operations for sequences are defined below. Definition 6: Edit operations for secuences For transforming one sequence S of words into another sequence P of words, the edit operations allowed on the word in the sequences may be written according to the following rewrite rules: • (c—>-e); deletion of word a from the sequence • (s—c). insertion of word a into the sequence • (a—b), change of word a into word b • {ab—ba), transposition of adjacent words a and b. Instead of characters as atoms, the search system according to the invention applies the edit operations to words which then should be regarded as the operational atoms. By using the edit operations as defined above the edit distance for sequences can now be defined. Definition 7: Edit distance for sequences The edit distance metric for sequences defines the distance D!eq(S,P) between the sequence S = S;,si, ■ ■.,„ and the sequence P = pi.pi, ...,pm as the minimum sum of cost c(x—+y) for the sequence of edit operations transforming the sequence S into the sequence P. The search system according to the present invention enhances the edit distances metric for sequences to weight the cost of the edit operations by the size of the words operated upon. Definition 8: Word size-dependent edit distance for sequences The word size dependent edit distance for sequences is defined as the minimum sum of costs for the editing operations needed to transform one secuencs into the other. The cost functions are dependent on the word size of its operands. In the search system according to the invention a definition of cost functions is given by the equations where I denotes the average length of a word in the two sequences being compared. The cost of each edit operation is weighted by a size proportional to the change in the total length of the sequence or by the ratio of the current word length and the average word length in the sequences considered. Now the distance metric reflects the assumption of some relation between the word length and how important the word is to the semantic context of the word sequence. Furthermore the search system according to the invention employs proximity at the character level when the change edit operation fa—*-b) is used. Replacing a word a by another word 5 should be related to the similarity between these two words. The new cost function for the change edit operation hence is given as: Where D(c,b) is the normalized edit distance measuring function for words, 0 means ful] similarity", 1 means no similarity. The search system according to the invention combines the edit distance metric fo: sequences with the ccst functions as given by formulas (4), (5) and (6}, with an edit distance metric for words as given by formula (1). This means that sequence edit operations are only used when the words being matched are more than k errors away from each other. The algorithms used in the search system according to the invention perform efficient searching of the described structures. Matches are found according to the metrics as given above. Approximate word matching in a word-spaced sparse suffix tree is dons by combining the calculation of the edit distance matrix and a traversal of the suffix tree. An algorithm for this is written in pseudo-code and given in table I. This algorithm is adapted from a trie-matching algorithm as proposed by H. Shang & T.H. Merrertal, "Tries for Approximate String Matching", IEEE Transactions on Knowledge and Data Engineering, Vol.5, No. 4, pp. 540-547 (1996). The expected worst case running time of the algorithm is 0(k\l,\l:) according to Shang &. Merrertal (op.cit.). Approximate word sequence matching requires the calculation of the word sequence edit distance for all possible matches. However, the number of possible matches can be limited by starting the calculation of the edit distance only on the possible words. The cost of deleting a word from the sequences determines the number of possible start words. If the accumulated cost of deleting the i first words in a query sequence PQ rises above a given error threshold, the candidate sequence starting with.the ith word of the query cannot possibly be a match. Therefore for a query sequence PQ of/ words, at most i possible start words will be tried. Since there are no backpointers in the sequence structure of the tree, it will not be ensured that all possible matches are obtained. Adding backpointers would solve this problem. The algorithm for approximate-word sequence matching as used in the search system according to the present invention, is given in pseudo-code in table II below. This algorithm tries to match the first keyword with;?/,£;... sequentially, testing all possible start positions. In the ApproxSequsnceMatch algorithm in table II the ApproxMctchRest filiation is defined by the algorithm in table III below. This function matches the remaining sequence, using an initial error value. The algorithms in tables II and III are written in the same pseudo-code as the algorithm in table I. The FindZzaot function used to find the leaf node matching the first word in the sequence performs a simple traversal of the tree and its running time is 0\p>\ wherep, denotes the first word in a query sequence PQ. Calculating the edit distance can be done in \F\ time using straightforward dynamic programming or in 0(k) time (where k denotes the error threshold) using improved versions of the calculation algorithm, see E. Ukkonen, "Finding Approximate Patterns in Strings", Journal of Algorithms, vol. 6, pp. 332-137 (1985). If £nOCe(pi) denotes the total sum of the number of occurrences of each" word Pi in the word sequence, then the worst case running time is 0{kTnoce{p;)). Finally the implementation of a search engine based on the search system according to the invention shall briefly be discussed. Particularly a search engine based on the search system according to the invention is implemented as an approximate search engine (ASE) and is intended as a search, engine for indexing large document collections and providing algorithms for exact and approximate searching of these document collections. ASE shall provide a data structure for storing large texts or collection of documents. It is to be understood that the data structure may be generated from documents which contain additional information, such as images, video, sound, and the text may be formatted and/or annotated. The data structure is identical to the word-spaced sparse suffix tree as discussed above and it is, of course, to be understood that the words is the keywords of the search system, hence the word-spaced sparse suffix tree may instead be termed a keyword-spaced sparse suffix tree (KWS tree). The ASE shall contain algorithms for indexing documents in the KWS tree. These algorithms, of course, do not form a part of the search system according to the present invention, but they are well-known to persons skilled in the art and described in the literature, see for instance J. Karkkainen & E. Ukkonen"(op.cit.) and D.R. Morrison (op.cit). The search system according to the invention and as used in the ASE employs algorithms both for exact and approximate matching of a pattern in a KWS tree. The algorithms given above in table I and table II are used for approximate word and word sequence matching with the non-uniform edit distance as a metric. Finding an exact match of keyword;? with length m in a KWS tree is known in the an and easily implemented as a simple traversal of the tree structure. An appropriate algorithm for exact keyword matching written in pseudo-code is given in table IV. The search system according to the invention also shall be able to support algorithms for exact keyword sequence matching. Algorithms for exact keyword sequence matching are known in the art and easily implemented as e.g. shown in pseudo-code in table V below. The algorithm given here will find the exact match of the first keyword, if any. Then it will for all occurrences of the first keyword check if the second keyword matches the second keyword of the query. If so, the MatchResi procedure in table V is used to determine if the occurrence of the two first keywords are matching in the entire sequence. For approximate keyword matching in a KWS tree the search system implements the algorithm in table I above. For approximate keyword sequence matching the search-system implements the algorithm in table II above, matching a first keyword sequentially withpt.pj... and testing all possible start positions, applying the ApproxMatchRest function as given in table III to match a sequence starting at a particular position and handle the initial error value. Finally, the ASE shall need a simple front end which gives the user control of indexing and querying the document collection. The front end should also be able to furnish statistics of the document collection and provide both a network interface for remote access, e.g. via WWW, and a local server user interface. The ASE with the search system according to the invention should be general in a manner that allows for the adding new indexing and searching algorithms easily. Also, storing extra information about each document or keyword shall be possible to implement in an easy manner. Particularly the front end should be independent of the data structure and the search algorithms, such that internal changes in these has no effect on the design of the former. The use of the search system according to the invention the ASE should be designed to have as low memory overhead as possible in the data structure. Also, searching should be designed to be as fast as possible. However, there will usually be a trade-off berween these rwo factors. To sum up, an ASE with a search system according to the invention shall comprise four major modules. 1. Document indexing module DIM for indexing documents in the KWS tree structure. This module should also contain all extensions to support several document roes. 2. Data storage module DSM based on a keyword-spaced sparse suffix tree (KWS tree). 3. Search algorithm module SAM for searching the KWS tree, comprising algorithms for exact and/or approximate matching of respectively words and word sequences. 4. User interface front-end module FEM comprising both a local server user interface and a network interface for remote queries. The four modules of the ASE works together to offer a complete search engine functionality. The data flow between the different modules is shown in fig. 7. Indexing a collection of documents in done in the document indexing module DIM comprising indexing algorithms. This module is, of course, not a part of the search system according to the invention, but indexing algorithms that can be used are well-known in the art. The text found in the documents is passed on to the data storage DSM module for storage. The data storage module is, of course, a part of the search system according to the invention and is as stated based on the KWS tree structure. The search algorithm module SAM contains algorithms for searching the data located in the data storage module. This module implements the search system according to the present invention and allows"Tor a search process querying the data structure for tree and node information, while maintaining state variables. The front-end module may for instance be implemented on a work station or a personal computer and the -like, providing the functionality as stated above. As already stated in the introduction, it is to be understood that the search system according to the invention can be implemented as software written in a suitable high-level language on commercially available computer systems, including workstations. It may also as stated be implemented in the form of a dedicated processor device which advantageously may comprise a large number of parallel processors being able to process large word sequences in parallel for approximate matching with a large number of query word sequences. The fixed operational parameters of the processor may then be entered in a low-level code, while keyword sequences input from the KWS tree structure allows for an extremely fast processing of queries on a huge amount of data, a.nd the search system according to present invention shall hence in high degree be suited for performing searches on e.g. the World Wide Web, even in a KWS tree structure large enough to index all documents presently offered on the World Wide Web and moreover capable of handling the expected data volume growth on the World Wide Web in the future. WE CLAIM : 1. A method in a search system for information retrieval, particularly information stored in form of text, wherein a text T comprises words and / or symbols s and sequences S thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises words and / or symbols q and sequences P thereof, and retrieved information R comprising words and / or symbols and sequences thereof from the text T; wherein the search system comprises a data structure for storing at least a part of the text T; and a metric M which measures the degree of matching between the query Q and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw, wherein the information in the text T is divided into words s and word sequences S, the words being substrings of the entire text separated by word boundary terms and forming a sequence of symbols, and wherein each word is structured as a sequence of symbols, characterized by generating the data structure as a word-spaced sparse suffix tree SSTWS (T) of a text T for representing all the suffixes starting at a word separator symbol in the text T, storing sequence information of the words s in the text T in the word-spaced sparse suffix tree SSTW3 (T), generating a combined edit distance metric M comprising an edit distance metric D(s,q) for words s in the text T and a query word q in a query Q and a word-size dependent edit distance metric DWS(S,P) for sequences S of words s in the text T and a sequence P of words q in the query Q, the edit distance metric DWS(S, P) being the minimum sum of costs for edit operations transforming a sequence S into the sequence P, the minimum sum of costs being the minimum sum of cost functions for each edit operation weighted by a value proportional to the change in the total length of the sequence S or by the ratio of the current word length and average word length in the sequences S,P, and determining the degree of matching between words s,q by calculating the edit distance D(s,q) between the words s of the retrieved information R and the word q of a query Q, or in case the words s,q are more than k errors from each other, determining the degree of matching between the word sequences SR; PQ of retrieved information R and a query Q respectively by calculating the edit distance DWS(SR; PQ) for all matches. The method according to claim 1, comprising the step of additionally weighting an edit operation which changes a word s into word q with a parameter for the proximity between the characters of the words s;q, thus taking the similarity of the words s;q in regard when determining the cost of the edit operation in question. The method according to claim 1, comprising the step of limiting the number of matches by calculating the edit distance DWS(SR,PQ) for restricted number of words in the query word sequence PQ The method according to claim 1, comprising the step of defining the edit distance D(s,q) between words s and a word q recursively and calculating the edit distance D(s,q) by means of a dynamic programming procedure. A method according to claim 1 comprising the step of defining the edit distance Dws(S,p) between sequences S and a sequence P recursively and calculating the edit distance DWS(S,P) by means of a dynamic programming procedure.

Full Text

A search system and method for retrieval of data, and the use thereof in a search engine
The present invention concerns a search svstem for information retrieval, particularly information stored in form of text, wherein a text 7" comprises words and/or symbols and sequences thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query O, wherein the query 0 comprises words and/or symbols and sequences thereof, arid retrieved information R, comprising words and./or symbols and sequences thereof from the text T, wherein the search system comprises a data structure for storing at least a part of the text T, and a metric M which measures the degree of matching between the query O and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw: a method in a search system for information retrieval, particularly information stored in form of text, wherein a text T comprises words -and symbols and sequences thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query 0, wherein the query O comprises words and/or symbols and sequences thereof, and retrieved information R comprising words and/or symbols and sequences thereof from the text 7", wherein the search system comprises a data structure for storing at least a part of the text T, and a metric M which measures the degree of matching between the query O and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw, wherein the information in the text is divided into words and word sequences, the words being substrings of the entire text separated by-word boundary terms and forming a sequence of symbols, and wherein each word is structured as a sequence of symbols.
The invention also concern me use of the search system.
A tremendous amouni of information in various fields of human knowledge is collected and stored in computer memory systems. As the computer memory systems increasingly are linked in public available data communication networks, there has been an increasing effort to develop systems and methods for searching and retrieving information for public or personal use. Present search methods for data have, however, limitations that

seriously reduce the possibility 01 etnciently retrieving ana using imormation stored in this manner.
Information may be stored in the form of different data types, and in the context of information search and retrieval it will be useful to discern between dynamic data and static data. Dynamic data is data that change often and continuously, so that the set of valid data varies all the time, while static data only changes very seldom or never at all. For instance will economic data, such as stock values, or meteorological data be subject to very quick changes and hence dynamic. On the other hand archival storage of books and documents are usually permanent and static data. The concept the volatility of the data relates to how long the information is valid. The volatility of data has some bearing upon how the information should be searched and retrieved. Large volumes of data require some structure in order to facilitate searching, but the time cost of building such structures must not be higher than the time the data is valid. The cost of building a structure is dependent on the data volume and hence the building of data structures for searching the information should take both the data volume and the volatility into consideration. The information collected are stored in databases and these may be structured or unstructured. Moreover, the daiabases may contain several types of documents, including compound documents which contain images, video, sound and formatted or annotated text. Particularly structured databases are usually furnished with indexes in order to facilitate searching and retrieving the data. The growth of the World Wide Web (WWW) offers a steadily growing collection of compound and hyperlinked documents. A great many of these are not collected in structured databases and no indexes facilitating rapid searching are available. However, the need for searching documents in the World Wh"de Web is obvious and as a result a number of so-called search engines has been developed, enabling searching at least parts of the information in the World Wride Web.
With a search engine it is commonly understood one or more tools for searching and retrieving information. In addition to the search system proper, a search engine also contains an index, for instance comprising text from a large number of uniform resource locators (URLs ). Examples of such search engines are Alta Vista, HotBot with Inktomy technology, Infoseek, Excite and Yahoo. All these offer facilities for performing search and retrieval of information in the World Wide Web. However, their speed and efficiency do

by no means match the huge amount of information available on the World Wide Web and hence the search and retrieval efficiency of these search engines leaves much to be desired.
Searching a large coilec:ion of text documents can usually be done with several query types. The most common query type is matching and variants of this. By specifying a keyword or set of keywords that has to be present in the queried information the search system retrieves all documents that fulfils this requirement. The basic search method is based on so-called single keyword matching. The keyword p is searched for and all documents containing this word shall be retrieved. It is also possible to search for a keyword prefix D, and all documents where this prefix is present in any keyword in the documents, wiil be retrieved. Instead of searching with keywords, the search is sometimes based on so-called exact phrase matching, where the search uses several single keywords in particular sequence. As well-known by persons skilled in the art, the exact matching of keyword phrases in many search systems may be done with the use of Boolean operators, for instance based on operators such as AND, OR, and NOT which allow a filtering of the information; e.g. using an AND phrase results in that all documents containing the two keywords linked by the AND operator will be returned. Also a NEAR operator has been used for returning just the documents with the keywords matching and located "near" to each other in the document text. In many structured database the documents contained in the database have been annotated, e.g. provided with fields which denote certain parts or types of information in the document. This allows the search for matches in only parts of the documents and is useful when the type of queried information is known in advance.
When searching in text documents the data are structured and most likely present in some natural language, like English, Norwegian etc. When searching for documents with a certain context it is possible to apply proximity metrics for matching keywords or phrases that match the query approximately. Allowing errors in keywords and phrases are common method for proximity, using a thesaurus is another common method. A proximity search requires only that there shall be a partial match between the information retrieved and the query. International published application WO96/00945 titled "Variable length data sequence matching method and apparatus" (Doringer &. al.) which has been assigned to International

Business Machines, Corp., discloses the building, maintenance and use of a database with a trie-like structure for storing entries and retrieving at least a partial match, preferably the longest partial match or all partial matches of a search argument (input key) from the entries.
In order to further illuminate the general prior art mention can be made of international published patent application VJ092J15954 (Kimball &. a]., assigned to Red Brick System, U.S.A.) and US patent no. 5 627 748 (Baker & al., assigned to Lucent Technologies, Inc., U.S.A.), both disclosing data structures in the form of suffix trees for searching/matching in a square matrix. Neither of these two publications disclose anything beyond a regular suffix tree, except for the use of a linked list during matching and do not teach or suggest approaches to limit the search space when searching for approximate matches. However, such approaches would be most desirable when applying data structures based on suffix trees to searching, particularly for approximate matches in extremely large document collections, such as may be found on the World Wide Web.
The main object of the present invention is thus to provide a search system and a method for fast and efficient search and retrieval of information in large volumes of data. Particularly it is an object of the present invention to provide a search system suited for implementing search engines for searching of information systems with distributed large volume data storage, for instance Internet. It is to be understood that the search system according to the invention by no means shall be limited to searching and retrieving information stored in the form of alphanumeric symbols, but equally well may be applied to searching and retrieving information stored in the form of disitalized images and graphic svmbols, as the word text used herein also may interpreted as images when these are represented wholly or partly as sets of symbols. It is also to be understood that the search system according to the invention can be implemented as software written in a suitable high-level language on commercially available computer systems, but it may also be implemented in the form of a dedicated processor device for searching and retrieving information of the aforementioned kind.
The above-mentioned objects and advantages are realized according to the invention with a search system which is characterized in that the data structure comprises a tree structure in the form of a non-evenly spaced sparse

suffix tree ST( J) for storing suffixes of words and/or symbols, and sequences thereof in the text T, that the metric JW comprises a combination of an edit distance metric for an approximate degree of matching between words and/or symbols in the text T and a query 0 and an edit distance metric for an approximate degree of matching between sequences S of words and/or symbols in the text T and a query sequence P of words and/or symbols in the query O, the latter edit distance metric including weighting cost functions for edit operations which transform a sequence S of words and/or symbols in the text T into the sequence P of words and/or symbols in the query 0, the weighting taking place with a value proportional to a change in the length of sequence upon a transformation or dependent on the size of the words and/or symbols in sequences to be matched, that the implemented search algorithms comprise a first algorithm for determining the degree of matching between words and/or symbols in the suffix tree representation of respectively the text T and a query O, and a second algorithm for determining the degree of matching between sequences of words and/or symbols in the suffix tree representation of respectively the text T and the query O, said first and/or second algorithms searching the data structure with queries Q in the form of either words, symbols, sequences of words or sequences of symbols or combinations thereof, such that information R is retrieved on the basis of query O with a specified degree of matching between the former and the latter, and that the search algorithms optionally also comprise a third algorithm for determining exact matching between words and/or symbols in the suffix tree representation of respectively the text T and the query O and/or a fourth algorithm for determining exact matching between sequences of words and/or symbols in the suffix tree representation of respectively the text T and the query 0, said third and/or fourth algorithms searching the data structure with queries 0 in the form of either words, symbols, sequences of words, or sequences of symbols or combinations thereof, such that information R is retrieved on the basis of the query O with an exact matching between the former and the latter.
In an advantageous embodiment of the search system according to the invention the suffix tree ST(7) is a word-spaced sparse suffix tree SSTws(r), comprising only a subset of the suffixes in the text T.
Preferably is then the word space sparse suffix tree SSTWS(J) a keyword-spaced sparse suffix tree SSTkws(r).

In further advamaseous embodiments of the search svstem according to the invention ihe first algorithm for detecting the degree of keyword matchins in a keyword-spaced sparse suffix tree SST-KWS(7) is implemented as disclosed by dependent claim i, the second algorithm for determining the degree of sequence matching in a keyword-spaced sparse suffix tree SSTkws(J) implemented as disclosed by dependent claim 5, whereby a subroutine of the second algorithm preferably is implemented as disclosed by dependent claim 6, the third algorithm for determining an exact keyword matching in a keyword-spaced sparse suffix tree SSTkws(T) implemented as disclosed by dependent claim 7, and finally the fourth algorithm for determining an exact keyword sequence matching in a keyword-spaced sparse suffix tree SSTfcws implemented as disclosed by dependent claim 8.
The above-mentioned objects and advantages are also realized according to the invention with a method which is characterized by generating the data structure as a word-spaced sparse suffix tree SSTws(r) of a text T for representing all the suffixes starting at a word separator symbol in the text T, storing sequence information of the words in the text Tin the word-spaced sparse suffix tree SSTWE(r), generating a combined edit distance metric M comprising an edit distance metric D{s,q) for words s in the text T and a query word a in a query 0 and a word-size dependent edit distance metric DWS(S,P) for sequences S of words in the text T and a sequence P of words q in the query Q, the edit distance metric DWS(S,P) being the minimum sum of costs for edit operations transforming the sequence S into the sequence P, the minimum sum of costs being the minimum sum of cost functions for each edit operation weighted by a parameter proportional to the change in the total length of the sequence S or by the ratio of the current"word length and average word length in the sequences S;P, and determining the degree of matching between words s,q by calculating the edit distance D(s.q") between the words s of the retrieved information R and the word q of a query Q, or in case the words s,q are more than k errors from each other, determining the degree of matching between the word sequences SR, PO of retrieved information R and a query 0 respectively by calculating the edit distance Dw,(^,?o) for all matches.
Advantageously the method according to the invention additionally comprises weighting an edit operation which changes a word s into word q with a parameter for the proximity between the characters of the words s;q,

thus taking the similarity of the words s;a in regard when determining the cost of the edit operation in question.
In an advantageous embodiment of the method according to the invention the number of matches is limited by calculating the edit distance by calculating the edit distance DW!(^L!?O) for restricted number of words in the query word sequence P0.
In another advantageous embodiment of the method according to the invention the edit distance D(s,qJ between word s and a word q is defined recursively and the edit distance D(s,q) calculated by means of a dynamic programming procedure, and the edit distance DV,S(S,P) between sequences S and a sequence P is correspondingly recursively defined and the edit distance DWS(S,P) calculated by means of a dynamic programming procedure.
According to the invention the above-mentioned objects and advantages are also realized with the use of the search system according to the invention in an approximate search engine.
The search system and the method according to the invention shall now be discussed in greater detail in the following with reference to the accompanying drawing figures, of which
fig. 1 shows an example of a suffix tree,
fig. 2 examples of word-spaced sparse suffix trees as used with the present invention,
fig. 3 an example of a so-called PATRICIA trie as known in prior art,
fig. 4 a further example of a word-spaced sparse suffix tree as used with the present invention,
fig. 5 an example of explicitly stored word sequence information as used with the present invention,
fig. 6 a leaf node structure as used with the present invention, and
fig. 7 schematically the structure of a search engine with the search system according to the present invention.
The search system according to the invention consists essentially of three parts, namely the data structure, the metrics for approximate matching and

the search algorithm. When full text retrieval is the target, as essentially will be the case with the search system according to the present invention, then the entire data set which shall be retrievable, will be stored in 2 data structure which supports a high query* performance.
The basic concepts underlying the present invention shall first be discussed in some detail. Stored information in the form of text T is divided into words s and word sequences S. Words are substrings of the entire text separated by word boundary terms. The set of word boundary terms is denoted BTword. A common set of word boundary terms could be the set {", ":", "\t", "
", "\0[ ".", "; \ "?"} where \t denotes a tab character,
denotes a linefeed character and \0 denotes an end-of-document indicator. In connection with the following description of the present invention it will be useful with some definitions concerning strings and sequences.
Definition 1: String
A string is a sequence of symbols taken from an alphabet, such as the ASCII characters. Then the length of 2 string is the number of instances of symbols or characters comprising the string, and is denoted \x\. If x has the length m the string may also be written as x}x2...X;...xmi where x, represents the z"th symbol in the string.
A substring of x is a string given by a contiguous group of symbols within x. Thus, a substring may be obtained from x by deleting one or more characters from the beginning or the end of the string.
Definition 2: Substring, suffix and prefix
A substring of x is a string xl = X,X^J...XJ for some 1 Also the notion of a word sequence will be used.
Definition 3: Word sequence
A word sequence is a sequence of separated, consecutive words. A. word
sequence S = si,s2l ...,s„ consists of rc single words (or strings) si,S2, up to s„.
Word sequences are delimited by sequence boundary terms. The set sequence boundary terms are denoted STseq. A common set of sequence boundary terms could be the set {"OV}. where \0 indicates an end-of-document marker.

The concept approximate word matching can be described as follows.
Given a siring s = s:s2-.-sn and a query term q = q,q2...qm. Then the task is to find ail occurrences of a in 5 thai is at most k errors away from the original query term a. A proximity metric determines how to calculate the errors between p and a potential match s;...Sj.
A common metric for approximate word matching is the Levenstein distance or edit distance (V.I. Levenstein, "Binary codes capable of correcting deletions, insertions, and reversals", (Russian) Doklady Akademii nauk SSSR, Vol. 163, No. 4, pp. 845-8 (1965); also Cybernetics and Control Theory, Vol. 10, No. 8, pp. 707-10, (1966)). This metric is denned as the minimum number of edit operations needed to transform one string into another. An edit .operation is given by any rewrite rule, for instance:

It is also possible to define an approximate matching on the level of words in a word sequence and this can be described as follows.
Given a text T consisting of the n words w;lWj...wfl where each of the words is a string of characters. A sequence partem P consists of the m words p>, P2. ■■•• Pm- The sequence pattern P is said to have an approximate
occurrence in T if the sequence;?;, p2 pm differs with at most t errors
from a sequence w,-, Wj-j, ..., w,- for some ij, such that 1
A text that shall be retrieved in a search system must be indexed in a manner which facilitates searching the data. Consequently the data structure is a kernel data structure of the search system according to the present invention and is based on so-called suffix trees and particularly a sparse suffix tree. These two kinds of structures shall be defined in the following. A suffix tree S(T) is a tree representation of all possible suffixes in the text T. All unary nodes in a suffix tree S(T) are concatenated with its child to create a compact variant.
Fig. 1 shows the suffix tree for the text T= "structure".
Even more particularly the present invention is based on sparse suffix trees. These were introduced by J. Karkkainen & E. Ukkonen, in "Sparse Suffix Trees", Proceedings of the Second Annual International Computing and Combinatorics Conference (COCOON "96), Springer Verlag, pp.2 19-230, which again was based on ideas published by D.R. Morrison, "PATRICIA -Practical Algorithm To Retrieve Information Coded in Alphanumeric", Journal of the ACM, 15, pp. 514-534 (1968). A sparse suffix tree is defined as follows.
Definition 4: Sparse suffix tree
A sparse suffix tree SST(T) of the text T is a suffix tree, containing only a
subset of the suffixes present in the suffix tree ST(T) of the text.
"When using the search system according to the present invention searching for entire words, advantageously a non-evenly spaced sparse suffix tree may be created by storing suffixes starting at word boundaries only. The concept words-spaced sparse suffix tree is defined as follows.
Definition 5: Word-spaced sparse suffix tree
A word-spaced sparse suffix tree SSTWS(T) of a text T is a sparse suffix tree
SST(2~) containing only the suffixes starting-at a word separator character in
the text.
Fig. 2 shows two examples of word-spaced sparse suffix trees. Parts of the suffixes have been omitted to enhance the readability. The word-spaced sparse suffix tree for T= "to be the best" is the left structure, and T= "to make the only major modification" is the right structure in fig. 2.

In :he search system of the present invention the text is naturally divided into words which are stored independently in the word-spaced sparse suffix tree. As the atomic search term for searching is the word itself, advantageously each suffix will be terminated at the end of the word. This reduces the sparse suffix tree to a so-called PATRICIA trie (Morrison, op.cit.). A trie as defined in the literature is a rooted tree with the properties that each node, except the root contains a symbol of the alphabet and zhat no two children of the same node contain the same symbol. It should be noted that the word trie derives from the word "retrieval" and hence indicates that the trie is a tree structure suitable for retrieval of data. A PATRICIA trie is defined as a keyword-spaced sparse suffix tree (KWS tree) where the suffixes stored in the leaf nodes are limited by keyword delimiters. An example of a PATRICIA trie for the set of keywords {"avoid", "abuse", "be", "become",. "breathe", "say"} is shown in fig. 3. The structure used in the search system of the present invention differs from the PATRICIA trie because the search system explicitly stores sequence information of the words. Reducing the suffix length requires that the representation of the leaf node is changed. Pointers to the original text are replaced by the suffix string itself. A suffix length reduction of this kind is shown in fig. 4 for the same two strings as shown in fig. 2. In other words fig. 4 shows word-spaced sparse suffix trees with suffixes cut off at word boundaries. The word-spaced sparse suffix tree for T= "to be the best" is shown at left and the word-spaced sparse suffix tree for T = "to make the only major modification" is shown at right in the figure. A leaf node will contain a list of all positions where the word represented by the leaf node occurs.
Instead of using the implicit sequence of information-found in the original text, the present invention explicitly stores sequence information in the word-spaced sparse suffix tree. This is done by using pointers between the leaf nodes that represent consecutive words in the original text. As at least all the occurrences of the word represented by a particular leaf node are available, a pointer must be added to the next consecutive leaf.
A leaf node contains only the suffix of the word it represents, so when traversing the sequence pointers in the occurrence list only the suffixes of each of the consecutive words are revealed. This is handled by storing the entire word in the leaf node instead of just the suffix and thus also data structure of the invention differs from the PATRICIA trie in this respect. The

data structure for explicitly stored word sequence information with an occurrence list with pointers to the next consecutive word and to its occurrence is shown in fig. 5.
The search system according to the present invention uses a PATRICIA trie for organizing the occurrence list (Morrisou, op.cit.). The PATRICIA trie enables the search system to access the list of all consecutive words matching the stringy in a time 0([p2\), where ]p2\ of course is the length of p2. By using a PATRICIA trie to organize the list of occurrences, a completely defined tree structure is obtained for storing words from a text and maintaining the sequence information. A typical leaf node, with both a PATRICIA trie for the organized occurrence list and the extra unsorted list of occurrences, is shown in fig. 6. A.s an example the memory requirement for an occurrence list as used in the search system of the present invention, a database with about 742358 documents has a total of 333 856 744 words and a lexicon of 538 244 distinct words. The total size of the database is 2054.52 MB. The average word length is thus 6.45 bytes. A sparse suffix tree will use 8 bytes for each internal node, using 32 bit pointers. It is assumed that an average of 3 internal nodes is used for each word. The leaf node would then require 6.45 bytes for storing the entire word plus 32 bits for a pointer to an occurrence list. A total of 34.45 bytes/word gives a total size of 18.108 MB. In addition the occurrence list has the size of 4 bytes per entry and 12 bytes if the full version is to be used. Hence the total memory requirement of the occurrence list varies from 1273 MB to 3820 MB. The data structure using a sparse suffix tree will have a size between 60% to 200% of the original text. This is comparable with the requirements of an inverted file, but the sparse suffix tree as used in the search system according to the invention provides much faster searching, enables approximate matching and makes sequence matching easy to perform.
In approximate searching, a metric is used to give an error measure of a possible match. The search system according to the present invention employs several metrics, and particularly a unique combination of metrics. These metrics along with the combined metric shall be discussed in the following.
.An edit distance metric as defined above allows the operations deletion, insertion and change which intuitively apply to words as well as characters.

Common errors in matching phrases are missing, extra or changed words. Hence the edit distance metric as previously defined shall be adapted and extended in order to apply to the approximate word sequence matching problem. Edit operations for sequences are defined below.
Definition 6: Edit operations for secuences
For transforming one sequence S of words into another sequence P of words, the edit operations allowed on the word in the sequences may be written according to the following rewrite rules:
• (c—>-e); deletion of word a from the sequence
• (s—*c). insertion of word a into the sequence

• (a—*b), change of word a into word b
• {ab—*ba), transposition of adjacent words a and b.
Instead of characters as atoms, the search system according to the invention applies the edit operations to words which then should be regarded as the operational atoms.

By using the edit operations as defined above the edit distance for sequences can now be defined.
Definition 7: Edit distance for sequences
The edit distance metric for sequences defines the distance D!eq(S,P) between the sequence S = S;,si, ■ ■.,*„ and the sequence P = pi.pi, ...,pm as the minimum sum of cost c(x—+y) for the sequence of edit operations transforming the sequence S into the sequence P.
The search system according to the present invention enhances the edit distances metric for sequences to weight the cost of the edit operations by the

size of the words operated upon.
Definition 8: Word size-dependent edit distance for sequences The word size dependent edit distance for sequences is defined as the minimum sum of costs for the editing operations needed to transform one secuencs into the other. The cost functions are dependent on the word size of its operands.
In the search system according to the invention a definition of cost functions is given by the equations

where I denotes the average length of a word in the two sequences being compared. The cost of each edit operation is weighted by a size proportional to the change in the total length of the sequence or by the ratio of the current word length and the average word length in the sequences considered.
Now the distance metric reflects the assumption of some relation between the word length and how important the word is to the semantic context of the word sequence. Furthermore the search system according to the invention employs proximity at the character level when the change edit operation fa—*-b) is used. Replacing a word a by another word 5 should be related to the similarity between these two words. The new cost function for the change edit operation hence is given as:

Where D(c,b) is the normalized edit distance measuring function for words, 0 means ful] similarity", 1 means no similarity.

The search system according to the invention combines the edit distance metric fo: sequences with the ccst functions as given by formulas (4), (5) and (6}, with an edit distance metric for words as given by formula (1). This means that sequence edit operations are only used when the words being matched are more than k errors away from each other.
The algorithms used in the search system according to the invention perform efficient searching of the described structures. Matches are found according to the metrics as given above.
Approximate word matching in a word-spaced sparse suffix tree is dons by combining the calculation of the edit distance matrix and a traversal of the suffix tree. An algorithm for this is written in pseudo-code and given in table I.
This algorithm is adapted from a trie-matching algorithm as proposed by H. Shang & T.H. Merrertal, "Tries for Approximate String Matching", IEEE Transactions on Knowledge and Data Engineering, Vol.5, No. 4, pp. 540-547 (1996). The expected worst case running time of the algorithm is 0(k\l,\l:) according to Shang &. Merrertal (op.cit.).
Approximate word sequence matching requires the calculation of the word sequence edit distance for all possible matches. However, the number of possible matches can be limited by starting the calculation of the edit distance only on the possible words. The cost of deleting a word from the sequences determines the number of possible start words. If the accumulated cost of deleting the i first words in a query sequence PQ rises above a given error threshold, the candidate sequence starting with.the ith word of the query cannot possibly be a match. Therefore for a query sequence PQ of/ words, at most i possible start words will be tried. Since there are no backpointers in the sequence structure of the tree, it will not be ensured that all possible matches are obtained. Adding backpointers would solve this problem. The algorithm for approximate-word sequence matching as used in the search system according to the present invention, is given in pseudo-code in table II below. This algorithm tries to match the first keyword with;?/,£;... sequentially, testing all possible start positions.

In the ApproxSequsnceMatch algorithm in table II the ApproxMctchRest filiation is defined by the algorithm in table III below. This function matches the remaining sequence, using an initial error value.

The algorithms in tables II and III are written in the same pseudo-code as the algorithm in table I.
The FindZzaot function used to find the leaf node matching the first word in the sequence performs a simple traversal of the tree and its running time is 0\p>\ wherep, denotes the first word in a query sequence PQ. Calculating the edit distance can be done in \F\ time using straightforward dynamic programming or in 0(k) time (where k denotes the error threshold) using improved versions of the calculation algorithm, see E. Ukkonen, "Finding Approximate Patterns in Strings", Journal of Algorithms, vol. 6, pp. 332-137 (1985).
If £nOCe(pi) denotes the total sum of the number of occurrences of each" word Pi in the word sequence, then the worst case running time is 0{kTnoce{p;)).
Finally the implementation of a search engine based on the search system according to the invention shall briefly be discussed. Particularly a search engine based on the search system according to the invention is implemented as an approximate search engine (ASE) and is intended as a search, engine for indexing large document collections and providing algorithms for exact and approximate searching of these document collections. ASE shall provide a data structure for storing large texts or collection of documents. It is to be understood that the data structure may be generated from documents which contain additional information, such as images, video, sound, and the text may be formatted and/or annotated. The data structure is identical to the word-spaced sparse suffix tree as discussed above and it is, of course, to be understood that the words is the keywords of the search system, hence the word-spaced sparse suffix tree may instead be termed a keyword-spaced sparse suffix tree (KWS tree). The ASE shall contain algorithms for indexing documents in the KWS tree. These algorithms, of course, do not form a part of the search system according to the present invention, but they are well-known to persons skilled in the art and described in the literature, see for instance J. Karkkainen & E. Ukkonen"(op.cit.) and D.R. Morrison (op.cit).
The search system according to the invention and as used in the ASE employs algorithms both for exact and approximate matching of a pattern in a KWS tree. The algorithms given above in table I and table II are used for approximate word and word sequence matching with the non-uniform edit

distance as a metric. Finding an exact match of keyword;? with length m in a KWS tree is known in the an and easily implemented as a simple traversal of the tree structure. An appropriate algorithm for exact keyword matching written in pseudo-code is given in table IV. The search system according to the invention also shall be able to support algorithms for exact keyword sequence matching. Algorithms for exact keyword sequence matching are known in the art and easily implemented as e.g. shown in pseudo-code in table V below. The algorithm given here will find the exact match of the first keyword, if any. Then it will for all occurrences of the first keyword check if the second keyword matches the second keyword of the query. If so, the MatchResi procedure in table V is used to determine if the occurrence of the two first keywords are matching in the entire sequence. For approximate keyword matching in a KWS tree the search system implements the algorithm in table I above. For approximate keyword sequence matching the search-system implements the algorithm in table II above, matching a first keyword sequentially withpt.pj... and testing all possible start positions, applying the ApproxMatchRest function as given in table III to match a sequence starting at a particular position and handle the initial error value.
Finally, the ASE shall need a simple front end which gives the user control of indexing and querying the document collection. The front end should also be able to furnish statistics of the document collection and provide both a network interface for remote access, e.g. via WWW, and a local server user interface.
The ASE with the search system according to the invention should be general in a manner that allows for the adding new indexing and searching algorithms easily. Also, storing extra information about each document or keyword shall be possible to implement in an easy manner. Particularly the front end should be independent of the data structure and the search algorithms, such that internal changes in these has no effect on the design of the former.
The use of the search system according to the invention the ASE should be designed to have as low memory overhead as possible in the data structure. Also, searching should be designed to be as fast as possible. However, there will usually be a trade-off berween these rwo factors.

To sum up, an ASE with a search system according to the invention shall comprise four major modules.
1. Document indexing module DIM for indexing documents in the KWS tree structure. This module should also contain all extensions to support several document roes.
2. Data storage module DSM based on a keyword-spaced sparse suffix tree (KWS tree).
3. Search algorithm module SAM for searching the KWS tree, comprising algorithms for exact and/or approximate matching of respectively words and word sequences.
4. User interface front-end module FEM comprising both a local server user
interface and a network interface for remote queries.
The four modules of the ASE works together to offer a complete search engine functionality. The data flow between the different modules is shown in fig. 7. Indexing a collection of documents in done in the document indexing module DIM comprising indexing algorithms. This module is, of course, not a part of the search system according to the invention, but indexing algorithms that can be used are well-known in the art. The text found in the documents is passed on to the data storage DSM module for storage. The data storage module is, of course, a part of the search system according to the invention and is as stated based on the KWS tree structure. The search algorithm module SAM contains algorithms for searching the data located in the data storage module. This module implements the search system according to the present invention and allows"Tor a search process querying the data structure for tree and node information, while maintaining state variables. The front-end module may for instance be implemented on a work station or a personal computer and the -like, providing the functionality as stated above.
As already stated in the introduction, it is to be understood that the search system according to the invention can be implemented as software written in a suitable high-level language on commercially available computer systems, including workstations. It may also as stated be implemented in the form of a dedicated processor device which advantageously may comprise a large number of parallel processors being able to process large word sequences in

parallel for approximate matching with a large number of query word sequences. The fixed operational parameters of the processor may then be entered in a low-level code, while keyword sequences input from the KWS tree structure allows for an extremely fast processing of queries on a huge amount of data, a.nd the search system according to present invention shall hence in high degree be suited for performing searches on e.g. the World Wide Web, even in a KWS tree structure large enough to index all documents presently offered on the World Wide Web and moreover capable of handling the expected data volume growth on the World Wide Web in the future.

WE CLAIM :
1. A method in a search system for information retrieval, particularly information stored in form of text, wherein a text T comprises words and / or symbols s and sequences S thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises words and / or symbols q and sequences P thereof, and retrieved information R comprising words and / or symbols and sequences thereof from the text T; wherein the search system comprises a data structure for storing at least a part of the text T; and a metric M which measures the degree of matching between the query Q and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw, wherein the information in the text T is divided into words s and word sequences S, the words being substrings of the entire text separated by word boundary terms and forming a sequence of symbols, and wherein each word is structured as a sequence of symbols, characterized by generating the data structure as a word-spaced sparse suffix tree SSTWS (T) of a text T for representing all the suffixes starting at a word separator symbol in the text T, storing sequence information of the words s in the text T in the word-spaced sparse suffix tree SSTW3 (T), generating a combined edit distance metric M comprising an edit distance metric D(s,q) for words s in the text T and a query word q in a query Q and a word-size dependent edit distance metric DWS(S,P) for sequences S of words s in the text T and a sequence P of words q in the query Q, the edit distance metric DWS(S, P) being the minimum sum of costs for edit operations transforming a sequence S into the sequence P, the minimum sum of costs being the minimum sum of cost functions for each edit operation weighted by a value proportional to the change in the total length of the sequence S or by the ratio of the current word length and average word length in the sequences S,P,

and determining the degree of matching between words s,q by calculating the edit distance D(s,q) between the words s of the retrieved information R and the word q of a query Q, or in case the words s,q are more than k errors from each other, determining the degree of matching between the word sequences SR; PQ of retrieved information R and a query Q respectively by calculating the edit distance DWS(SR; PQ) for all matches.
The method according to claim 1, comprising the step of additionally weighting an edit operation which changes a word s into word q with a parameter for the proximity between the characters of the words s;q, thus taking the similarity of the words s;q in regard when determining the cost of the edit operation in question.
The method according to claim 1, comprising the step of limiting the number of matches by calculating the edit distance DWS(SR,PQ) for restricted number of words in the query word sequence PQ
The method according to claim 1, comprising the step of defining the edit distance D(s,q) between words s and a word q recursively and calculating the edit distance D(s,q) by means of a dynamic programming procedure.
A method according to claim 1 comprising the step of defining the edit distance Dws(S,p) between sequences S and a sequence P recursively and calculating the edit distance DWS(S,P) by means of a dynamic programming procedure.

Documents:

in-pct-2001-0144-che abstract-duplicate.pdf

in-pct-2001-0144-che abstract.pdf

in-pct-2001-0144-che claims-duplicate.pdf

in-pct-2001-0144-che claims.pdf

in-pct-2001-0144-che correspondence-others.pdf

in-pct-2001-0144-che correspondence-po.pdf

in-pct-2001-0144-che description (complete)-duplicate.pdf

in-pct-2001-0144-che description (complete).pdf

in-pct-2001-0144-che drawings-duplicate.pdf

in-pct-2001-0144-che drawings.pdf

in-pct-2001-0144-che form-1.pdf

in-pct-2001-0144-che form-19.pdf

in-pct-2001-0144-che form-26.pdf

in-pct-2001-0144-che form-3.pdf

in-pct-2001-0144-che form-4.pdf

in-pct-2001-0144-che form-5.pdf

in-pct-2001-0144-che others.pdf

in-pct-2001-0144-che pct.pdf

in-pct-2001-0144-che petition.pdf

« Previous Patent

Next Patent »

Patent Number

202355

Indian Patent Application Number

IN/PCT/2001/144/CHE

PG Journal Number

05/2007

Publication Date

02-Feb-2007

Grant Date

03-Nov-2006

Date of Filing

31-Jan-2001

Name of Patentee

M/S. FAST SEARCH & TRANSFER ASA

Applicant Address

POST BOX NO. 1677 VIKA, N-0120 OSLO,

Inventors:

#	Inventor's Name	Inventor's Address
1	KNUT MAGNE RISVIK	SIGRID UNDSETS VEI 27, N-7023 TRONDHEIM, NORWAY

PCT International Classification Number

G06F 17/30

PCT International Application Number

PCT/NO1999/000233

PCT International Filing date

1999-07-09

PCT Conventions:

#	PCT Application Number	Date of Convention	Priority Country
1	19983175	1998-07-10	Norway