Title of Invention	SYSTEMS, METHODS, INTERFACES AND SOFTWARE FOR AUTOMATED COLLECTION AND INTEGRATION OF ENTITY DATA INTO ONLINE DATABASES AND PROFESSIONAL DIRECTORIES.
Abstract	A system (100) and method for automated collection and integration of entity data into online databases and professional directories are disclosed. The system comprises means (910) for extracting entity reference data for at least one person from each of a plurality of documents to form entity reference records, means (920) for forming at least one entity profile record by merging at least one of the entity reference records for a person with at least one other entity reference record for the same person, means (940) for categorizing at least one of the entity profile records based on a taxonomy, and means (950) for defining links between at least one of the entity profile records and other documents or data sets.


Full Text	Copyright Notice and Permission
A portion of this patent document contains material subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyrights whatsoever. The following notice applies to this document: Copyright © 2003, Thomson Global Resources AG.
Cross-Reference to Related Application
This application claims priority to U.S. provisional application 60/533,588 filed on December 31, 2003. The provisional application is incorporated herein by reference.
Technical Field
Various embodiments of the present invention concern information-retrieval systems, such as those that provide legal documents or other related content.
Background
In recent years, the fantastic growth of the Internet and other computer networks has fueled an equally fantastic growth in the data accessible via these networks. One of the seminal modes for interacting with this data is through the use of hyperlinks within electronic documents. More recently, there has been interest in hyperlinking documents to other documents based on the names of people in the documents. For example, to facilitate legal research, West Publishing Company of St. Paul, Minnesota (doing business as Thomson West) provides thousands of electronic judicial opinions that hyperlink the names of attorneys and judges to their online biographical entries in the West Legal Directory, a proprietary directory of approximately 1,000,000 U.S. attorneys and 20,000 judges. These hyperlinks allow users accessing judicial opinions to quickly obtain contact and other specific information about lawyers and judges named in the opinions.
The hyperlinks in these judicial opinions are generated automatically, using a system that extracts first, middle, and last names; law firm name, city, and state; and court information from the text of the opinions and uses them as clues to determine whether to link the named attorneys and judges to their corresponding entries in the professional directory. See Christopher Dozier and Robert Haschart, "Automatic Extraction and Linking of Person Names in Legal Text" (Proceedings of RIAO 2000; Content Based Multimedia Information Access, Paris, France, pp. 1305-1321. April 2000), which is incorporated herein by reference. An improvement to this system is described in Christopher Dozier, System, Methods And Software For Automatic Hyperlinking Of Persons' Names In Documents To Professional Directories, WO 2003/060767A3, July 24, 2003. The present inventors have recognized still additional need for improvement in these and other systems that generate automatic links.
Brief Description of the Accompanying Drawings
Figure 1 is a diagram of an exemplary information-retrieval system 100 corresponding to one or more embodiments of the invention; Figure 2 is a flowchart corresponding to one or more exemplary methods of operating system 100 and one or more embodiments of the invention; Figures 3-8 are facsimiles of exemplary user interfaces, each corresponding to one or more embodiments of the invention; Figure 9 is a flow chart corresponding to one or more embodiments of the invention; and Figure 10 is a flow chart corresponding to one or more additional embodiments of the invention.
Detailed Description of Exemplary Embodiments
This description, which references and incorporates the above-identified Figures, describes one or more specific embodiments of an invention. These embodiments, offered not to limit but only to exemplify and teach the invention, are shown and described in sufficient detail to enable those skilled in the art to implement or practice the invention.
Thus, where appropriate to avoid obscuring the invention, the description may omit certain information known to those of skill in the art.
Exemplary Information-Retrieval System
Figure 1 shows an exemplary online information-retrieval system 100. System 100 includes one or more databases 110, one or more servers 120, and one or more access devices 130. Databases 110 include a set of one or more databases. In the exemplary embodiment, the set includes a caselaw database 111, an expert witness directory 112, professional directories or licensing databases 113, a verdict and settlement database 114, and a court-filings database 116. Caselaw database 111 generally includes electronic text and image copies of judicial opinions for decided cases for one or more local, state, federal, or international jurisdictions. Expert witness directory 112, which is defined in accord with one or more aspects of the present invention, includes one or more records or database structures, such as structure 1121. Structure 1121 includes an expert identifier portion 1121A which is logically associated with one or more directory documents or entries 1121B, one or more verdict documents or entries 1121C, and one or more articles 1121D. Some embodiments logically associate the expert identifier with court-filing documents, such as briefs and expert reports, and/or other documents. Professional directories or licensing databases 113 include professional licensing data from one or more state, federal, or international licensing authorities. In the exemplary embodiment, this includes legal, medical, engineering, and scientific licensing or credentialing authorities. Verdict and settlement database 114 includes electronic text and image copies of documents related to the determined verdict, assessed damages, or negotiated settlement of legal disputes associated with cases within caselaw database 111.
Articles database 115 includes articles from technical, medical, professional, scientific or other scholarly or authoritative journals and authoritative trade publications. Some embodiments include patent publications. Court-filings database 116 includes electronic text and image copies of court filings related to one or more subsets of the judicial opinions in caselaw database 111. Exemplary court-filing documents include briefs, motions, complaints, pleadings, and discovery matter. Other databases 115 include one or more other databases containing documents regarding news stories, business and finance, science and technology, medicine and bioinformatics, and intellectual property information. In some embodiments, the logical relationships across documents are determined manually or using automatic discovery processes that leverage information such as litigant identities, dates, jurisdictions, attorney identities, court dockets, and so forth to determine the existence or likelihood of a relationship between any pair of documents.
Databases 110, which take the exemplary form of one or more electronic, magnetic, or optical data-storage devices, include or are otherwise associated with respective indices (not shown). Each of the indices includes terms and/or phrases in association with corresponding document addresses, identifiers, and other information for facilitating the functionality described below. Databases 112, 114, and 116 are coupled or couplable via a wireless or wireline communications network, such as a local-, wide-, private-, or virtual-private network, to server 120.
Server 120 is generally representative of one or more servers for serving data in the form of webpages or other markup language forms with associated applets, ActiveX controls, remote-invocation objects, or other related software and data structures to service clients of various "thicknesses."
More particularly, server 120 includes a processor 121, a memory 122, a subscriber database 123, one or more search engines 124, and an interface module 125. Processor 121, which is generally representative of one or more local or distributed processors or virtual machines, is coupled to memory 122. Memory 122, which takes the exemplary form of one or more electronic, magnetic, or optical data-storage devices, stores subscriber database 123, search engines 124, and interface module 125. Subscriber database 123 includes subscriber-related data for controlling, administering, and managing pay-as-you-go or subscription-based access of databases 110. Search engines 124 provide Boolean or natural-language search capabilities for databases 110. Interface module 125, among other things, defines one or more portions of a graphical user interface that helps users define searches for databases 110. Software 125 includes one or more browser-compatible applets, webpage templates, user-interface elements, objects or control features or other programmatic objects or structures. More specifically, software 125 includes a search interface 1251 and a results interface 1252.
Server 120 is communicatively coupled or couplable via a wireless or wireline communications network, such as a local-, wide-, private-, or virtual-private network, to one or more access devices, such as access device 130. Access device 130 is not only communicatively coupled or couplable to server 120, but also generally representative of one or more access devices. In the exemplary embodiment, access device 130 takes the form of a personal computer, workstation, personal digital assistant, mobile telephone, or any other device capable of providing an effective user interface with a server or database.
Specifically, access device 130 includes one or more processors (or processing circuits) 131, a memory 132, a display 133, a keyboard 134, and a graphical pointer or selector 135. Memory 132 stores code (machine-readable or executable instructions) for an operating system 136, a browser 137, and a graphical user interface (GUI) 138. In the exemplary embodiment, operating system 136 takes the form of a version of the Microsoft Windows operating system, and browser 137 takes the form of a version of Microsoft Internet Explorer. Operating system 136 and browser 137 not only receive inputs from keyboard 134 and selector (or mouse) 135, but also support rendering of GUI 138 on display 133. Upon rendering, GUI 138 presents data in association with one or more interactive control features (or user-interface elements). (The exemplary embodiment defines one or more portions of interface 138 using applets or other programmatic objects or structures from server 120.)
Specifically, graphical user interface 138 defines or provides one or more display control regions, such as a query region 1381 and a results region 1382. Each region (or page in some embodiments) is respectively defined in memory to display data from databases 110 and/or server 120 in combination with one or more interactive control features (elements or widgets). In the exemplary embodiment, each of these control features takes the form of a hyperlink or other browser-compatible command input. More specifically, query region 1381 includes interactive control features, such as a query input portion 1381A for receiving user input at least partially defining a profile query and a query submission button 1381B for submitting the profile query to server 120 for data from, for example, expert database 112. Results region 1382, which displays search results for a submitted query, includes a results listing portion 1382A and a document display portion 1382B.
Listing portion 1382A includes control features 2A1 and 2A2 for accessing or retrieving one or more corresponding search result documents, such as professional profile data and related documents, from one or more of databases 110, such as expert database 112, via server 120. Each control feature includes a respective document identifier or label, such as EXP 1, EXP 2, identifying respective name and/or city, state, and subject-matter expertise data for the corresponding expert or professional. Display portion 1382B displays at least a portion of the full text of a first displayed or user-selected one of the profiles identified within listing portion 1382A, EXP 2 in the illustration. (Some embodiments present regions 1382A and 1382B as selectable tabbed regions.) Portion 1382B also includes features 2B1, 2B2, 2B3, and 2B4. User selection of feature 2B1 initiates retrieval and display of the profile text for the selected expert, EXP 2; selection of feature 2B2 initiates retrieval and display of licensing data for any licenses or other credentials held by the selected expert or professional image copy of the document displayed in region 1382B in a separate window; selection of feature 2B3 initiates display and retrieval of verdict data related to the expert or professional; and selection of feature 2B4 initiates retrieval and display of articles (from database 115) that are related to, for example, authored by, the expert or professional. Other embodiments include additional control features for accessing court-filing documents, such as briefs, and/or expert reports authored by the expert or professional, or even deposition and trial transcripts where the expert or professional was a participant. Still other embodiments provide control features for initiating an Internet search based on the selected expert and other data and for filtering results of such a search based on the profile of the expert or professional.
Exemplary Methods of Operation
Figure 2 shows a flow chart 200 of one or more exemplary methods of operating an information-management system, such as system 100. Flow chart 200 includes blocks 210-290, which are arranged and described in a serial execution sequence in the exemplary embodiment. However, other embodiments execute two or more blocks in parallel using multiple processors or processor-like devices or a single processor organized as two or more virtual machines or sub processors. Other embodiments also alter the process sequence or provide different functional partitions to achieve analogous results. For example, some embodiments may alter the client-server allocation of functions, such that functions shown and described on the server side are implemented in whole or in part on the client side, and vice versa. Moreover, still other embodiments implement the blocks as two or more interconnected hardware modules with related control and data signals communicated between and through the modules. Thus, this (and other exemplary process flows in this description) apply to software, hardware, and firmware implementations.
Block 210 entails presenting a search interface to a user. In the exemplary embodiment, this entails a user directing a browser in a client access device to an internet-protocol (IP) address for an online information-retrieval system, such as the Westlaw system, and then logging onto the system. Successful login results in a web-based search interface, such as interface 138 in Figure 1 (or one or more portions thereof) being output from server 120, stored in memory 132, and displayed by client access device 130. Execution then advances to block 220.
Block 220 entails receipt of a query. In the exemplary embodiment, the query defines one or more attributes of an entity, such as a person or professional. In some embodiments, the query string includes a set of terms and/or connectors, and in other embodiments includes a natural-language string.
Also, in some embodiments, the set of target databases is defined automatically or by default based on the form of the system or search interface. Figures 3 and 4 show alternative search interfaces 300 and 400 which one or more embodiments use in place of interface 138 in Figure 1. Execution continues at block 230.
Block 230 entails presenting search results to the user via a graphical user interface. In the exemplary embodiment, this entails the server, or components under server control or command, executing the query against one or more of databases 110, for example, expert database 112, and identifying documents, such as professional profiles, that satisfy the query criteria. A listing of results is then presented or rendered as part of a web-based interface, such as interface 138 in Figure 1 or interface 500 in Figure 5. Execution proceeds to block 240.
Block 240 entails presenting additional information regarding one or more of the listed professionals. In the exemplary embodiment, this entails receiving a request in the form of a user selection of one or more of the professional profiles listed in the search results. These additional results may be displayed as shown in interface 138 in Figure 1 or respective interfaces 600, 700, and 800 in Figures 6, 7, and 8. Interface 600 shows a listing of links 610 and 620 for additional information related to the selected professional. As shown in Figure 7, selection of link 610 initiates retrieval and display of a verdict document (or in some cases a list of associated verdict documents) in interface 700. And, as shown in Figure 8, selection of link 620 initiates retrieval and display of an article (or in some cases a list of articles) in interface 800.
Exemplary Method of Building Expert Directory
In Figure 9, flow chart 900 shows an exemplary method of building an expert directory or database such as used in system 100. Flow chart 900 includes blocks 910-960.
At block 910, the exemplary method begins with extraction of entity reference records from text documents. In the exemplary embodiment, this entails extracting entity references from approximately 300,000 jury verdict settlement (JVS) documents using finite state transducers. JVS documents have a consistent structure that includes an expert witness section or paragraph, such as that exemplified in Table 1. The exemplary embodiment uses a parsing program to locate expert-witness paragraphs and find lexical elements (that is, terms used in this particular subject area) pertaining to an individual. These lexical elements include name, degree, area of expertise, organization, city, and state. Parsing a paragraph entails separating it into sentences, and then parsing each element using a separate or specific finite state transducer. The following example displays regular expressions from the finite state transducer used for the organization element. (Variables are prefixed by S-) Typically one expert is listed in a sentence along with his or her area of expertise and other information. If more than one expert is mentioned in a sentence, area of expertise and other elements closest to the name are typically associated with that name. Each JVS document generally lists only one expert witness; however, some expert witnesses are referenced in more than one JVS document. Table 2 shows an example of an entity reference record. Once the entity reference records are defined, execution continues at block 920.
Block 920 entails defining profile records from the entity reference records. In the exemplary embodiment, defining the profile records entails merging expert-witness reference records that refer to the same person to create a unique expert-witness profile record for the expert. To this end, the exemplary embodiment sorts the reference records by last name to define a number of last-name groups.
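The element-by-element parsing described for block 910 can be illustrated with a rough sketch. Everything concrete here is an assumption made for illustration: Python's `re` module stands in for the finite state transducers, the patterns are simplified placeholders rather than the rules used by the exemplary embodiment, the sample text is invented, experts are assumed to be separated by semicolons instead of true sentence boundaries, and the organization element is omitted.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical per-element patterns (one "transducer" per lexical element).
NAME = re.compile(r"^\s*(?P<name>[A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+)")
DEGREE = re.compile(r"\b(?P<degree>M\.D\.|Ph\.D\.|P\.E\.)")
EXPERTISE = re.compile(r",\s*(?P<expertise>[a-z][a-z ]+?)\s*,")
LOCATION = re.compile(
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+)*), (?P<state>[A-Z]{2})\s*$"
)


@dataclass
class EntityReference:
    name: str
    degree: Optional[str] = None
    expertise: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None


def _group(pattern: re.Pattern, text: str, key: str) -> Optional[str]:
    """Return the named group of the first match, or None if there is no match."""
    match = pattern.search(text)
    return match.group(key) if match else None


def parse_expert_paragraph(paragraph: str) -> List[EntityReference]:
    """Split a paragraph into expert entries and parse each element separately."""
    records = []
    for sentence in paragraph.split(";"):  # assumed separator, see lead-in
        name = _group(NAME, sentence, "name")
        if name is None:
            continue  # no recognizable person name in this entry
        records.append(
            EntityReference(
                name=name,
                degree=_group(DEGREE, sentence, "degree"),
                expertise=_group(EXPERTISE, sentence, "expertise"),
                city=_group(LOCATION, sentence, "city"),
                state=_group(LOCATION, sentence, "state"),
            )
        )
    return records
```

Each pattern is applied independently, so an entry with a missing element (for example, no degree) still yields a record with that field left empty.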
Records within each "last-name" group are then processed by selecting an unmerged expert reference record and creating a new expert profile record from this selected record. The new expert reference record is then marked as unmerged and compared to each unmerged reference record in the group using Bayesian matching to compute the probability that the expert in the profile record refers to the same individual referenced in the record. If the computed match probability exceeds a match threshold, the reference is marked as "merged." If unmerged records remain in the group, the cycle is repeated. Note that it is still possible for duplicate records to reside in the profile file if two or more reference records pertain to one individual (for example, because of a misspelled last name). To address this possibility, a final pass is made over the merged profile file, and record pairs are flagged for manual review. Table 3 shows an exemplary expert profile record created from expert reference records.
Block 930 entails adding additional information to the expert reference records. In the exemplary embodiment, this entails harvesting information from other databases and sources, such as from professional licensing authorities, telephone directories, and so forth. References to experts in JVS documents, the original entity record source in this embodiment, often have little or no location information for experts, whereas professional license records typically include the expert's full name, and the full current home and/or business address, making them a promising source for additional data. One exemplary licensing authority is the Drug Enforcement Administration, which licenses health-care professionals to prescribe drugs.
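The grouping and merge cycle described for block 920 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the exemplary embodiment: the record fields (`first`, `last`), the function names, the 0.9 threshold, and the `toy_match` stand-in are all hypothetical, each candidate is compared only against the record that seeded the profile, and the final duplicate-review pass is omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]


def toy_match(a: Record, b: Record) -> float:
    """Placeholder matcher (NOT Bayesian): agree on first name or not."""
    return 0.95 if a["first"].lower() == b["first"].lower() else 0.05


def build_profiles(
    references: List[Record],
    match_probability: Callable[[Record, Record], float],
    threshold: float = 0.9,  # assumed value, for illustration only
) -> List[dict]:
    # Sort/group the reference records by last name.
    groups: Dict[str, List[Record]] = defaultdict(list)
    for ref in references:
        groups[ref["last"].lower()].append(ref)

    profiles = []
    for last_name in sorted(groups):
        unmerged = list(groups[last_name])
        # Repeat the cycle while unmerged records remain in the group.
        while unmerged:
            seed = unmerged.pop(0)  # select an unmerged reference record
            profile = {"last": last_name, "references": [seed]}
            remaining = []
            for ref in unmerged:
                if match_probability(seed, ref) > threshold:
                    profile["references"].append(ref)  # mark as merged
                else:
                    remaining.append(ref)  # stays unmerged for a later cycle
            unmerged = remaining
            profiles.append(profile)
    return profiles
```

Passing the matcher in as a parameter keeps the merge loop independent of how the match probability is actually computed.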
In determining whether a harvested license record (analogous to a reference record) and an expert refer to the same person, the exemplary embodiment computes a Bayesian match probability based on first name, middle name, last name, name suffix, city-state information, area of expertise, and name rarity. If the match probability meets or exceeds a threshold probability, one or more elements of information from the harvested license record are incorporated into the expert reference record. If the threshold criterion is not met, the harvested license record is stored in a database for merger consideration with later added or harvested records. (Some embodiments perform an extraction procedure on the supplemental data similar to that described at block 910 to define reference records, which are then sent as a set for merger processing as in block 920 with the expert reference records.)
Block 940 entails categorizing expert profiles by area of expertise. In the exemplary embodiment, each expert witness record is assigned one or more classification categories in an expertise taxonomy. Categorization of the entity records allows users to browse and search expert witness (or other professional) profiles by area of expertise. To map an expert profile record to an expertise subcategory, the exemplary embodiment uses an expertise categorizer and a taxonomy that contains top-level categories and subcategories. The exemplary taxonomy includes the following top-level categories: Accident & Injury; Accounting & Economics; Computers & Electronics; Construction & Architecture; Criminal, Fraud and Personal Identity; Employment & Vocational; Engineering & Science; Environmental; Family & Child Custody; Legal & Insurance; Medical & Surgical; Property & Real Estate; Psychiatry & Psychology; Vehicles, Transportation, Equipment & Machines. Each category includes one or more subcategories.
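As a rough illustration of the threshold test described for block 930, the sketch below combines per-field evidence into a match probability. It substitutes a naive-Bayes odds update (which treats the fields as independent) for whatever Bayesian model the exemplary embodiment actually uses; the likelihood ratios, the prior, the 0.9 threshold, and all names are hypothetical values chosen only to make the example concrete.

```python
from typing import Dict, List

# Hypothetical likelihood ratios:
#   P(observation | same person) / P(observation | different people)
LIKELIHOOD_RATIOS: Dict[str, Dict[str, float]] = {
    "first_name": {"match": 20.0, "mismatch": 0.05, "missing": 1.0},
    "middle_name": {"match": 5.0, "mismatch": 0.1, "missing": 1.0},
    "last_name": {"match": 50.0, "mismatch": 0.01, "missing": 1.0},
    "name_suffix": {"match": 2.0, "mismatch": 0.2, "missing": 1.0},
    "city_state": {"match": 10.0, "mismatch": 0.3, "missing": 1.0},
    "expertise": {"match": 8.0, "mismatch": 0.2, "missing": 1.0},
    "name_rarity": {"rare": 5.0, "common": 1.0},
}


def match_probability(evidence: Dict[str, str], prior: float = 0.001) -> float:
    """Multiply prior odds by each field's likelihood ratio, then normalize."""
    odds = prior / (1.0 - prior)
    for field, outcome in evidence.items():
        odds *= LIKELIHOOD_RATIOS[field][outcome]
    return odds / (1.0 + odds)


def consider_license_record(
    profile: Dict[str, str],
    license_record: Dict[str, str],
    evidence: Dict[str, str],
    pending: List[Dict[str, str]],
    threshold: float = 0.9,  # assumed value
) -> bool:
    """Enrich the profile on a match; otherwise hold the record for later."""
    if match_probability(evidence) >= threshold:  # "meets or exceeds"
        for key, value in license_record.items():
            profile.setdefault(key, value)  # fill only fields not yet present
        return True
    pending.append(license_record)
    return False
```

Working in odds keeps the update a simple product; a "missing" observation has a ratio of 1.0 and therefore leaves the odds unchanged.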
For example, the "Accident & Injury" category has the following subcategories: Aerobics, Animals, Apparel, Asbestos, Boating, Bombing, Burn/Thermal, Child Care, Child Safety, Construction, Coroner, Cosmetologists/Beauticians/Barbers/Tattoos, Dog Bites, Entertainment, and Exercise. Assignment of subject-matter categories to an expert profile record entails using a function that maps a professional descriptor associated with the expert to a leaf node in the expertise taxonomy. This function is represented with the following equation: T = f(S), where T denotes a set of taxonomy nodes, and S is the professional descriptor. The exemplary function f uses a lexicon of 500 four-character sets that map professional descriptors to expertise areas. For example, experts having the "onco" professional descriptor are categorized to the oncology specialist, oncologist, and pediatric oncologist subcategories. Other taxonomies are also feasible. The exemplary embodiment allows descriptors to map to more than one expertise area (that is, category or subcategory) in the taxonomy. For example, "pediatric surgeon" can be mapped to both the "pediatrics" and "surgery" nodes. Table 5 shows an example of an expert profile record in which the expertise field has been mapped to the category "Medical & Surgical" and to the subcategories "pediatrics," "blood & plasma," and "oncology."
Block 950 entails associating one or more text documents and/or additional data sets with one or more of the professional profiles. To this end, the exemplary embodiment logically associates or links one or more JVS documents and/or Medline articles to expert-witness profile records using Bayesian-based record matching. Table 6 shows a sample Medline article.
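The mapping T = f(S) described for block 940 can be pictured with a small sketch. The three lexicon entries, the node names, and the choice of case-insensitive substring matching are assumptions for illustration; the actual 500-entry lexicon and its matching rules are not reproduced here.

```python
from typing import Dict, Set

# Hypothetical lexicon: four-character fragment -> taxonomy leaf nodes.
LEXICON: Dict[str, Set[str]] = {
    "onco": {"oncology"},
    "pedi": {"pediatrics"},
    "surg": {"surgery"},
}


def categorize(descriptor: str) -> Set[str]:
    """f(S): return the set T of taxonomy nodes for professional descriptor S."""
    text = descriptor.lower()
    nodes: Set[str] = set()
    for fragment, targets in LEXICON.items():
        if fragment in text:  # assumed matching rule: substring containment
            nodes |= targets
    return nodes
```

Because the result is a set, a descriptor such as "pediatric surgeon" naturally maps to more than one node, and a descriptor with no lexicon hit yields an empty set that could be routed to manual classification.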
To link JVS documents and Medline abstracts to expert profile records, expert-reference records are extracted from the articles using one or more suitable parsers and matched to profile records using a Bayesian inference network similar to the profile-matching technology described previously. For JVS documents, the Bayesian network computes match probabilities using seven pieces of match evidence: last name, first name, middle name, name suffix, location, organization, and area of expertise. For Medline articles, the match probability is based additionally on name rarity, as described in the previously mentioned Dozier patent application.
Figure 10 shows a flow chart 1000 of an exemplary method of growing and maintaining one or more entity directories, such as the expert database used in system 100. Flow chart 1000 includes process blocks 1010-1050.
At block 1010, the exemplary method begins with receipt of a document. In the exemplary embodiment, this entails receipt of an unmarked document, such as a judicial opinion or brief. However, other embodiments receive and process other types of documents. Execution then advances to block 1020.
Block 1020 entails determining the type of document. The exemplary embodiment uses one or more methods for determining document type, for example, looking for particular document format and syntax and/or keywords to differentiate among a set of types. In some embodiments, type can be inferred from the source of the document. Incoming content types, such as case law, jury verdicts, law reviews, briefs, etc., have a variety of grammar, syntax, and structural differences. After type (or document description) is determined, execution continues at block 1030.
Block 1030 entails extracting one or more entity reference records from the received document based on the determined type of the document.
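One simple way to picture the type determination of block 1020 is a keyword score with an optional source hint, sketched below. The type labels, the keyword lists, and the function name are invented for this example, and a production system would likely also rely on document format and syntax cues.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword cues per document type.
TYPE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "jury_verdict": ("verdict", "settlement", "plaintiff's expert"),
    "case_law": ("opinion of the court", "affirmed", "reversed"),
    "brief": ("statement of the case", "argument", "respectfully submitted"),
}


def determine_type(text: str, source_hint: Optional[str] = None) -> str:
    """Infer the type from the source when known, else from keyword counts."""
    if source_hint in TYPE_KEYWORDS:
        return source_hint  # type inferred from the document's source
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

The returned label could then serve as the key for selecting a type-specific parser in the extraction step.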
In the exemplary embodiment, four types of entity records are extracted: personal names, such as attorneys, judges, expert witnesses; organizational names, such as firms and companies; product names, such as drugs and chemicals; and fact profiles ("vernacular" of subject area). Specialized or configurable parsers (finite state transducers), which are selected or configured on the basis of the determined document type and the entity record being built, identify and extract entity information for each type of entity. Parsers extract information by specifically searching for a named entity (person, address, company, etc.) or by relationships between entities. Parser text-extraction is based on the data's input criteria. For example, the more structured (tagged) data enables a "tighter" set of rules to be built within a parser. This set of rules allows more specific information to be extracted about a particular entity. A more "free" data collection, such as a web site, is not as conducive to rule-based parsers. A collection could also include a combination of structured, semi-structured, and free data. More specifically, parsers are developed through "regular-expression" methods. The regular expressions serve as "rules" for parsers to find entity types and categories of information.
Block 1040 attempts to link or logically associate each extracted entity reference record with one or more existing authority directories. In the exemplary embodiment, this entails computing a Bayesian match probability for each extracted entity reference and one or more corresponding candidate records in corresponding directories (or databases) that have been designated as authoritative in terms of accepted accuracy. If the match probability satisfies match criteria, the records are merged or associated and the input document. Execution then proceeds to block 1050.
Block 1050 entails enriching unmatched entity reference records using a matching process.
In the exemplary embodiment, this enriching process entails operating specific types of data harvesters on the web, other databases, and other directories or lists, to assemble a cache of new relevant profile information for databases, such as expert database 112 in Figure 1. The unmatched or unmarked entity records are then matched against the harvested entity records using Bayesian matching. Those that satisfy the match criteria are referred to a quality control process for verification or confirmation prior to addition to the relevant entity directory. The quality control process may be manual, semi-automatic, or fully automatic. For example, some embodiments base the type of quality control on the degree to which the match criteria are exceeded. In some embodiments, block 1050 operates in parallel with blocks 1010-1040, continually retrieving new entity related data using any number of web crawlers, relational databases, or CDs, and attempting to build new entity records.
Conclusion
The embodiments described above are intended only to illustrate and teach one or more ways of practicing or implementing the present invention, not to restrict its breadth or scope. The actual scope of the invention, which embraces all ways of practicing or implementing the teachings of the invention, is defined only by the following claims and their equivalents.
WE CLAIM:
1.
A system comprising: means (910) for extracting entity reference data for at least one person from each of a plurality of documents to form entity reference records; means (920) for forming at least one entity profile record by merging at least one of the entity reference records for a person with at least one other entity reference record for the same person by: sorting the entity reference records by last name; selecting an unmerged entity reference record and creating an entity profile record from the selected unmerged entity reference record; and analyzing the unmerged entity reference record for determining a probability that a person in an entity profile record is the same person as referenced in the selected unmerged entity reference record; means (940) for categorizing at least one of the entity profile records based on a taxonomy; and means (950) for defining links between at least one of the entity profile records and other documents or data sets.
2. The system as claimed in claim 1, comprising: graphical user interface means (138) for defining a query related to an entity, for viewing at least one document resulting from the query, for selecting at least one of the defined links within a legal, financial, healthcare, scientific, or educational document, and for causing retrieval and display of at least a portion of the one of the entity profile records.
3. The system as claimed in claim 1 or claim 2, wherein at least one of the recited means includes one or more processors, computer-readable medium, display devices, and network communications, with the machine-readable medium including coded instructions and data structures.
4.
The system of any preceding claim: wherein the at least one other entity reference records are contained in a database (100); wherein the means for forming at least one entity profile record may fail to merge at least one of the entity reference records with at least one other entity reference records in the database; and wherein the system comprises: means, responsive to a failure to merge at least one of the entity reference records with at least one of the other entity reference records, for attempting to match each of the at least one entity reference record to a set of harvested entity reference records outside the database; and means, responsive to a match of at least one of the entity reference records to at least one of the harvested entity reference records, for merging the records and adding them to the database. 5. The system as claimed in any preceding claim, wherein the documents comprise jury verdict settlement documents. 6. The system as claimed in claim 5, wherein the means for extracting entity records comprises finite state transducers. 7. The system as claimed in any preceding claim, wherein the means for extracting at least one of the entity reference records includes means for identifying name, educational degree, area of expertise, organization, city, and state. 8. The system as claimed in claim 4, wherein the means for attempting to match at least one of the entity reference records to at least one of the harvested entity reference records includes means for computing a Bayesian match probability. 9. The system as claimed in any preceding claim: wherein each of the entity reference records references a person; and wherein the means for categorizing at least one of the defined entity records based on a taxonomy is adapted to automatically categorize each entity reference record to an expertise taxonomy. 10. 
The system as claimed in any preceding claim, wherein the means for automatically extracting entity reference records is adapted to perform extraction based on document type. 11. A method comprising: extracting (910) entity reference data for at least one person from each of a plurality of documents to form entity reference records; forming (920) at least one entity reference profile by merging at least one of the entity reference records for a person with at least one other entity reference record for the same person by: sorting the entity reference records by last name; selecting an unmerged entity reference record and creating an entity profile record from the selected unmerged entity reference record; and analyzing the unmerged entity reference record for determining a probability that a person in an entity profile record is the same person as referenced in the selected unmerged entity reference record; automatically categorizing (940) at least one of the entity profile records based on an expertise taxonomy; and defining links (950) between at least one of the entity profile records and other documents or data sets. 12. The method as claimed in claim 11, comprising: receiving a query (210) related to an entity, displaying (230) one or more documents resulting from the query, receiving a selection of one or more of the defined links within a legal, financial, healthcare, scientific, or educational document; and retrieving and displaying (240) at least a portion of the at least one entity profile record. 13. 
The method as claimed in claim 11 or claim 12, wherein the at least one other entity records are contained in a database (100); wherein at least one of the entity reference records may not be merged with at least one other entity reference records in the database; and wherein the method comprises: in response to a failure to merge at least one of the entity reference records with at least one of the other entity reference records, attempting to match each of the at least one entity reference record to a set of harvested entity reference records outside the database; and in response to a match of the at least one entity reference records to at least one of the harvested entity reference records, merging the matched records and adding them to the database.

Full Text

Copyright Notice and Permission
A portion of this patent document contains material subject to copyright
protection. The copyright owner has no objection to the facsimile reproduction by
anyone of the patent document or the patent disclosure, as it appears in the Patent and
Trademark Office patent files or records, but otherwise reserves all copyrights
whatsoever. The following notice applies to this document: Copyright © 2003,
Thomson Global Resources AG.
Cross-Reference to Related Application
This application claims priority to U.S. provisional application 60/533,588
filed on December 31, 2003. The provisional application is incorporated herein by
reference.
Technical Field
Various embodiments of the present invention concern information-retrieval
systems, such as those that provide legal documents or other related content.
Background
In recent years, the fantastic growth of the Internet and other computer
networks has fueled an equally fantastic growth in the data accessible via these
networks. One of the seminal modes for interacting with this data is through the use
of hyperlinks within electronic documents.
More recently, there has been interest in hyperlinking documents to other
documents based on the names of people in the documents. For example, to facilitate
legal research, West Publishing Company of St. Paul, Minnesota (doing business as
Thomson West) provides thousands of electronic judicial opinions that hyperlink the
names of attorneys and judges to their online biographical entries in the West Legal
Directory, a proprietary directory of approximately 1,000,000 U.S. attorneys and

20,000 judges. These hyperlinks allow users accessing judicial opinions to quickly
obtain contact and other specific information about lawyers and judges named in the
opinions.
The hyperlinks in these judicial opinions are generated automatically, using a
system that extracts first, middle, and last names; law firm name, city, and state; and
court information from the text of the opinions and uses them as clues to determine
whether to link the named attorneys and judges to their corresponding entries in the
professional directory. See Christopher Dozier and Robert Haschart, "Automatic
Extraction and Linking of Person Names in Legal Text" (Proceedings of RIAO 2000;
Content Based Multimedia Information Access, Paris, France, pp. 1305-1321. April
2000), which is incorporated herein by reference. An improvement to this system is
described in Christopher Dozier, System, Methods And Software For Automatic
Hyperlinking Of Persons' Names In Documents To Professional Directories, WO
2003/060767A3, July 24, 2003.
The present inventors have recognized still additional need for improvement in
these and other systems that generate automatic links.
Brief Description of the Accompanying Drawings
Figure 1 is a diagram of an exemplary information-retrieval system 100
corresponding to one or more embodiments of the invention;
Figure 2 is a flowchart corresponding to one or more exemplary methods of
operating system 100 and one or more embodiments of the invention;
Figures 3-8 are facsimiles of exemplary user interfaces, each corresponding to one
or more embodiments of the invention;
Figure 9 is a flow chart corresponding to one or more embodiments of the
invention; and
Figure 10 is a flow chart corresponding to one or more additional embodiments of
the invention.

Detailed Description of Exemplary Embodiments
This description, which references and incorporates the above-identified
Figures, describes one or more specific embodiments of an invention. These
embodiments, offered not to limit but only to exemplify and teach the invention, are
shown and described in sufficient detail to enable those skilled in the art to implement
or practice the invention. Thus, where appropriate to avoid obscuring the invention,
the description may omit certain information known to those of skill in the art.
Exemplary Information-Retrieval System
Figure 1 shows an exemplary online information-retrieval system 100.
System 100 includes one or more databases 110, one or more servers 120, and one or
more access devices 130.
Databases 110 include a set of one or more databases. In the exemplary
embodiment, the set includes a caselaw database 111, an expert witness directory 112,
professional directories or licensing databases 113, a verdict and settlement database
114, and a court-filings database 116.
Caselaw database 111 generally includes electronic text and image copies of
judicial opinions for decided cases for one or more local, state, federal, or
international jurisdiction. Expert witness directory 112, which is defined in accord
with one or more aspects of the present invention, includes one or more records or
database structures, such as structure 1121. Structure 1121 includes an expert
identifier portion 1121A which is logically associated with one or more directory
documents or entries 1121B, one or more verdict documents or entries 1121C, and
one or more articles 1121D. Some embodiments logically associate the expert
identifier with court-filing documents, such as briefs and expert reports and/or other
documents.
Professional directories or licensing databases 113 include professional
licensing data from one or more state, federal, or international licensing authorities.
In the exemplary embodiment, this includes legal, medical, engineering, and scientific
licensing or credentialing authorities. Verdict and settlement database 114 includes
electronic text and image copies of documents related to the determined verdict,
assessed damages, or negotiated settlement of legal disputes associated with cases

within caselaw database 111. Articles database 115 includes articles from technical,
medical, professional, scientific or other scholarly or authoritative journals and
authoritative trade publications. Some embodiments include patent publications.
Court-filings database 116 includes electronic text and image copies of court filings
related to one or more subsets of the judicial opinions in caselaw database 111.
Exemplary court-filing documents include briefs, motions, complaints, pleadings, and
discovery matter. Other databases 115 include one or more other databases
containing documents regarding news stories, business and finance, science and
technology, medicine and bioinformatics, and intellectual property information. In
some embodiments, the logical relationships across documents are determined
manually or using automatic discovery processes that leverage information such as
litigant identities, dates, jurisdictions, attorney identities, court dockets, and so forth
to determine the existence or likelihood of a relationship between any pair of
documents.
Databases 110, which take the exemplary form of one or more electronic,
magnetic, or optical data-storage devices, include or are otherwise associated with
respective indices (not shown). Each of the indices includes terms and/or phrases in
association with corresponding document addresses, identifiers, and other information
for facilitating the functionality described below. Databases 112, 114, and 116 are
coupled or couplable via a wireless or wireline communications network, such as a
local-, wide-, private-, or virtual-private network, to server 120.
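As a rough illustration of an index that holds terms in association with document identifiers, the following Python sketch builds a small inverted index and answers a Boolean AND lookup. The function names, the tokenization, and the sample documents are hypothetical and are not taken from the described system.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in text.lower().split():
            term = raw.strip(".,;:()\"'")
            if term:
                index[term].add(doc_id)
    return index


def lookup(index, *terms):
    """Return identifiers of documents that contain every given term (Boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


# Hypothetical sample documents keyed by made-up identifiers.
docs = {
    "verdict-001": "Expert testified on pediatric oncology.",
    "filing-007": "Brief citing the oncology expert report.",
}
index = build_index(docs)
```

A production index would also store positions, phrases, and other metadata; this sketch keeps only the term-to-identifier association.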
Server 120 is generally representative of one or more servers for
serving data in the form of webpages or other markup language forms with associated
applets, ActiveX controls, remote-invocation objects, or other related software and
data structures to service clients of various "thicknesses." More particularly, server
120 includes a processor 121, a memory 122, a subscriber database 123, one or more
search engines 124, and a software module 125.
Processor 121, which is generally representative of one or more local or
distributed processors or virtual machines, is coupled to memory 122. Memory 122,
which takes the exemplary form of one or more electronic, magnetic, or optical data-
storage devices, stores subscription database 123, search engines 124, and interface
module 125.

Subscription database 123 includes subscriber-related data for controlling,
administering, and managing pay-as-you-go or subscription-based access of
databases 110.
Search engines 124 provide Boolean or natural-language search capabilities
for databases 110.
Interface module 125, among other things, defines one or more portions of a
graphical user interface that helps users define searches for databases 110. Software
125 includes one or more browser-compatible applets, webpage templates, user-
interface elements, objects or control features or other programmatic objects or
structures. More specifically, software 125 includes a search interface 1251 and a
results interface 1252.
Server 120 is communicatively coupled or couplable via a wireless or wireline
communications network, such as a local-, wide-, private-, or virtual-private network,
to one or more access devices, such as access device 130.
Access device 130 is not only communicatively coupled or couplable to server
120, but also generally representative of one or more access devices. In the
exemplary embodiment, access device 130 takes the form of a personal computer,
workstation, personal digital assistant, mobile telephone, or any other device capable
of providing an effective user interface with a server or database.
Specifically, access device 130 includes one or more processors (or processing
circuits) 131, a memory 132, a display 133, a keyboard 134, and a graphical pointer or
selector 135. Memory 132 stores code (machine-readable or executable instructions)
for an operating system 136, a browser 137, and a graphical user interface (GUI) 138.
In the exemplary embodiment, operating system 136 takes the form of a version of
the Microsoft Windows operating system, and browser 137 takes the form of a
version of Microsoft Internet Explorer. Operating system 136 and browser 137 not
only receive inputs from keyboard 134 and selector (or mouse) 135, but also support
rendering of GUI 138 on display 133. Upon rendering, GUI 138 presents data in
association with one or more interactive control features (or user-interface elements).

(The exemplary embodiment defines one or more portions of interface 138 using
applets or other programmatic objects or structures from server 120.)
Specifically, graphical user interface 138 defines or provides one or more
display control regions, such as a query region 1381, and a results region 1382. Each
region (or page in some embodiments) is respectively defined in memory to display
data from databases 110 and/or server 120 in combination with one or more
interactive control features (elements or widgets). In the exemplary embodiment,
each of these control features takes the form of a hyperlink or other browser-
compatible command input.
More specifically, query region 1381 includes interactive control features,
such as a query input portion 1381A for receiving user input at least partially
defining a profile query and a query submission button 1381B for submitting the
profile query to server 120 for data from, for example, experts database 112.
Results region 1382, which displays search results for a submitted query,
includes a results listing portion 1382A and a document display portion 1382B.
Listing portion 1382A includes control features 2A1 and 2A2 for accessing or
retrieving one or more corresponding search result documents, such as professional
profile data and related documents, from one or more of databases 110, such as expert
database 112, via server 120. Each control feature includes a respective document
identifier or label, such as EXP 1 or EXP 2, identifying respective name and/or city,
state, and subject-matter expertise data for the corresponding expert or professional.
Display portion 1382B displays at least a portion of the full text of a first
displayed or user-selected one of the profiles identified within listing portion 1382A,
EXP 2 in the illustration. (Some embodiments present regions 1382A and 1382B as
selectable tabbed regions.) Portion 1382B also includes features 2B1, 2B2, 2B3, and
2B4. User selection of feature 2B1 initiates retrieval and display of the profile text for
the selected expert, EXP 2; selection of feature 2B2 initiates retrieval and display of
licensing data for any licenses or other credentials held by the selected expert or
professional image copy of the document displayed in region 1382B in a separate
window; selection of feature 2B3 initiates display and retrieval of verdict data related
to the expert or professional; and selection of feature 2B4 initiates retrieval and
display of articles (from database 115) that are related to, for example authored by,

the expert or professional. Other embodiments include additional control features for
accessing court-filing documents, such as briefs, and/or expert reports authored by the
expert or professional, or even deposition and trial transcripts where the expert or
professional was a participant. Still other embodiments provide control features for
initiating an Internet search based on the selected expert and other data and for
filtering the results of such a search based on the profile of the expert or professional.
Exemplary Methods of Operation
Figure 2 shows a flow chart 200 of one or more exemplary methods of
operating an information-management system, such as system 100. Flow chart 200
includes blocks 210-290, which are arranged and described in a serial execution
sequence in the exemplary embodiment. However, other embodiments execute two
or more blocks in parallel using multiple processors or processor-like devices or a
single processor organized as two or more virtual machines or sub processors. Other
embodiments also alter the process sequence or provide different functional partitions
to achieve analogous results. For example, some embodiments may alter the client-
server allocation of functions, such that functions shown and described on the server
side are implemented in whole or in part on the client side, and vice versa. Moreover,
still other embodiments implement the blocks as two or more interconnected
hardware modules with related control and data signals communicated between and
through the modules. Thus, this process flow (and the other exemplary process flows in this
description) applies to software, hardware, and firmware implementations.
Block 210 entails presenting a search interface to a user. In the exemplary
embodiment, this entails a user directing a browser in a client access device to an
internet-protocol (IP) address for an online information-retrieval system, such as the
Westlaw system and then logging onto the system. Successful login results in a web-
based search interface, such as interface 138 in Figure 1 (or one or more portions
thereof) being output from server 120, stored in memory 132, and displayed by client
access device 130.
Execution then advances to block 220.
Block 220 entails receipt of a query. In the exemplary embodiment, the query
defines one or more attributes of an entity, such as a person or professional. In some

embodiments, the query string includes a set of terms and/or connectors, and in other
embodiments includes a natural-language string. Also, in some embodiments, the set
of target databases is defined automatically or by default based on the form of the
system or search interface. Figures 3 and 4 show alternative search interfaces 300 and
400 which one or more embodiments use in place of interface 138 in Figure 1.
Execution continues at block 230.
Block 230 entails presenting search results to the user via a graphical user
interface. In the exemplary embodiment, this entails the server or components under
server control or command, executing the query against one or more of databases 110,
for example, expert database 112, and identifying documents, such as professional
profiles, that satisfy the query criteria. A listing of results is then presented or
rendered as part of a web-based interface, such as interface 138 in Figure 1 or
interface 500 in Figure 5. Execution proceeds to block 240.
Block 240 entails presenting additional information regarding one or more
or more of the listed professionals. In the exemplary embodiment, this entails
receiving a request in the form of a user selection of one or more of the professional
profiles listed in the search results. These additional results may be displayed as
shown in interface 138 in Figure 1 or respective interfaces 600, 700, and 800 in
Figures 6, 7, and 8. Interface 600 shows a listing of links 610 and 620 for additional
information related to the selected professional. As shown in Figure 7, selection of
link 610 initiates retrieval and display of a verdict document (or in some cases a list of
associated verdict documents) in interface 700. And, as shown in Figure 8, selection
of link 620 initiates retrieval and display of an article (or in some cases a list of
articles) in interface 800.
Exemplary Method of Building Expert Directory
In Figure 9, flow chart 900 shows an exemplary method of building an expert
directory or database such as used in system 100. Flow chart 900 includes blocks
910-960.

At block 910, the exemplary method begins with extraction of entity reference
records from text documents. In the exemplary embodiment, this entails extracting
entity references from approximately 300,000 jury verdict settlement (JVS)
documents using finite state transducers. JVS documents have a consistent structure
that includes an expert witness section or paragraph, such as that exemplified in Table
1.

The exemplary embodiment uses a parsing program to locate expert-witness
paragraphs and find lexical elements (that is, terms used in this particular subject area)
pertaining to an individual. These lexical elements include name, degree, area of
expertise, organization, city, and state. Parsing a paragraph entails separating it
into sentences, and then parsing each element using a separate or specific finite state
transducer. The following example displays regular expressions from the finite state
transducer used for the organization element. (Variables are prefixed by $.)

Typically one expert is listed in a sentence along with his or her area of expertise and
other information. If more than one expert is mentioned in a sentence, area of
expertise and other elements closest to the name are typically associated with that
name. Each JVS document generally lists only one expert witness; however, some

expert witnesses are referenced in more than one JVS document. Table 2 shows an
example of an entity reference record.
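The element-by-element parsing described above can be approximated with ordinary regular expressions. The sketch below is only an illustration under stated assumptions: the patterns, the field names, and the sample sentence are hypothetical, Python's `re` module stands in for the finite state transducers, and the input is assumed to be a single sentence that has already been split out of the expert-witness paragraph.

```python
import re

# Hypothetical per-element patterns; the actual transducer rules are not reproduced here.
NAME = re.compile(
    r"^(?P<first>[A-Z][a-z]+)\s+(?:(?P<middle>[A-Z])\.\s+)?(?P<last>[A-Z][a-z]+)"
)
DEGREE = re.compile(r"\b(?:M\.D\.|Ph\.D\.|D\.O\.|P\.E\.)")
ORGANIZATION = re.compile(r"[A-Z][\w&. ]*?(?:Clinic|Hospital|University|Associates)")
CITY_STATE = re.compile(r"(?P<city>[A-Z][A-Za-z. ]+?),\s*(?P<state>[A-Z]{2})\.?\s*$")


def extract_reference(sentence):
    """Build one entity reference record from one expert-witness sentence."""
    record = {}
    name = NAME.search(sentence)
    if name:
        record.update(name.groupdict())
    degree = DEGREE.search(sentence)
    if degree:
        record["degree"] = degree.group(0)
    organization = ORGANIZATION.search(sentence)
    if organization:
        record["organization"] = organization.group(0)
    location = CITY_STATE.search(sentence)
    if location:
        record.update(location.groupdict())
    return record


# Made-up sentence in the style of an expert-witness listing.
record = extract_reference(
    "Jane Q. Doe, M.D., oncologist, Example Clinic, Springfield, IL."
)
```

Each element gets its own pattern, mirroring the use of a separate transducer per element; area of expertise is omitted here because it would need a vocabulary that the text does not list.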

Once the entity reference records are defined, execution continues at block 920.
Block 920 entails defining profile records from the entity reference records. In
the exemplary embodiment, defining the profile records entails merging expert-
witness reference records that refer to the same person to create a unique expert-
witness profile record for the expert. To this end, the exemplary embodiment sorts
the reference records by last name to define a number of last-name groups. Records
within each "last-name" group are then processed by selecting an unmerged expert
reference record and creating a new expert profile record from this selected record.
The new expert reference record is then marked as unmerged and compared to each
unmerged reference record in the group using Bayesian matching to compute the
probability that the expert in the profile record refers to the same individual
referenced in the record. If the computed match probability exceeds a match
threshold, the reference is marked as "merged." If unmerged records remain in the
group, the cycle is repeated.
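The merge cycle of block 920 can be sketched in Python as follows. This is a minimal illustration, not the described implementation: the record fields, the 0.75 threshold, and the sample records are hypothetical, and `toy_match_probability` is a simple field-agreement placeholder standing in for the Bayesian matcher.

```python
from itertools import groupby


def toy_match_probability(profile, record):
    """Placeholder scorer (NOT Bayesian): share of agreeing fields present in both."""
    shared = [f for f in ("first", "city", "state", "expertise")
              if profile.get(f) and record.get(f)]
    if not shared:
        return 0.0
    return sum(profile[f] == record[f] for f in shared) / len(shared)


def build_profiles(records, match_probability=toy_match_probability, threshold=0.75):
    """Sort by last name, then repeatedly seed a profile and absorb matching records."""
    profiles = []
    ordered = sorted(records, key=lambda r: r["last"])
    for _, group in groupby(ordered, key=lambda r: r["last"]):
        unmerged = list(group)
        while unmerged:  # repeat the cycle while unmerged records remain
            seed = unmerged.pop(0)
            profile = {k: v for k, v in seed.items() if k != "doc"}
            profile["sources"] = [seed["doc"]]
            still_unmerged = []
            for record in unmerged:
                if match_probability(profile, record) > threshold:
                    profile["sources"].append(record["doc"])  # mark as merged
                    for key, value in record.items():
                        if key != "doc" and not profile.get(key):
                            profile[key] = value  # fill gaps from the merged record
                else:
                    still_unmerged.append(record)
            unmerged = still_unmerged
            profiles.append(profile)
    return profiles


# Made-up reference records; "doc" is a hypothetical source-document identifier.
records = [
    {"first": "Jane", "last": "Doe", "city": "Springfield", "state": "IL",
     "expertise": "oncology", "doc": "jvs-1"},
    {"first": "Jane", "last": "Doe", "city": "", "state": "IL",
     "expertise": "oncology", "doc": "jvs-2"},
    {"first": "John", "last": "Doe", "city": "Austin", "state": "TX",
     "expertise": "engineering", "doc": "jvs-3"},
    {"first": "Ann", "last": "Roe", "city": "", "state": "",
     "expertise": "surgery", "doc": "jvs-4"},
]
profiles = build_profiles(records)
```

Passing the scorer as a parameter keeps the grouping and merging loop independent of how the match probability is computed.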
Note that it is still possible for duplicate records to reside in the profile file if
two or more reference records pertain to one individual (for example, because of a
misspelled last name). To address this possibility, a final pass is made over the
merged profile file, and record pairs are flagged for manual review. Table 3 shows an
exemplary expert profile record created from expert reference records.

Block 930 entails adding additional information to the expert reference
records. In the exemplary embodiment, this entails harvesting information from other
databases and sources, such as from professional licensing authorities, telephone
directories, and so forth. References to experts in JVS documents, the original entity
record source in this embodiment, often have little or no location information for
experts, whereas professional license records typically include the expert's full name,
and the full current home and/or business address, making them a promising source
for additional data.
One exemplary licensing authority is the Drug Enforcement Agency, which licenses
health-care professionals to prescribe drugs.
In determining whether a harvested license record (analogous to a reference
record) and expert person refer to the same person, the exemplary embodiment
computes a Bayesian match probability based on first name, middle name, last name,
name suffix, city-state information, area of expertise, and name rarity. If the match
probability meets or exceeds a threshold probability, one or more elements of
information from the harvested license record are incorporated into the expert
reference record. If the threshold criterion is not met, the harvested license record is
stored in a database for merger consideration with later added or harvested records.

(Some embodiments perform an extraction procedure on the supplemental data
similar to that described at block 910 to define reference records, which are then sent
as a set for merger processing as in block 920 with the expert reference records.)
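One way to picture the match-probability computation of block 930 is a naive Bayes update over field agreements. The sketch below assumes conditionally independent evidence, uses made-up likelihood values and a made-up prior, and leaves out name rarity for brevity; none of these numbers or names come from the described system.

```python
# Illustrative, made-up likelihoods per field:
# (P(fields agree | same person), P(fields agree | different person)).
LIKELIHOODS = {
    "first": (0.95, 0.05),
    "middle": (0.90, 0.10),
    "last": (0.98, 0.01),
    "suffix": (0.95, 0.50),
    "city_state": (0.80, 0.02),
    "expertise": (0.85, 0.10),
}


def match_probability(a, b, prior=0.001):
    """Posterior probability that records a and b describe the same person."""
    p_match, p_non = prior, 1.0 - prior
    for field, (agree_if_match, agree_if_non) in LIKELIHOODS.items():
        if not a.get(field) or not b.get(field):
            continue  # a missing value contributes no evidence
        if a[field] == b[field]:
            p_match *= agree_if_match
            p_non *= agree_if_non
        else:
            p_match *= 1.0 - agree_if_match
            p_non *= 1.0 - agree_if_non
    return p_match / (p_match + p_non)


# Hypothetical expert record and harvested license records.
expert = {"first": "Jane", "last": "Doe", "city_state": "Springfield IL",
          "expertise": "oncology"}
license_same = dict(expert)
license_other = {"first": "John", "last": "Doe", "city_state": "Austin TX",
                 "expertise": "oncology"}
```

A record would be enriched from the license data only when `match_probability` meets or exceeds the chosen threshold; otherwise the license record would be set aside for later comparison.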

Block 940 entails categorizing expert profiles by area of expertise. In the
exemplary embodiment, each expert witness record is assigned one or more
classification categories in an expertise taxonomy. Categorization of the entity records
allows users to browse and search expert witness (or other professional) profiles by
area of expertise. To map an expert profile record to an expertise subcategory, the
exemplary embodiment uses an expertise categorizer and a taxonomy that contains
top-level categories and subcategories.
The exemplary taxonomy includes the following top-level categories:
Accident & Injury; Accounting & Economics; Computers & Electronics;
Construction & Architecture; Criminal, Fraud and Personal Identity; Employment &
Vocational; Engineering & Science; Environmental; Family & Child Custody; Legal
& Insurance; Medical & Surgical; Property & Real Estate; Psychiatry & Psychology;
Vehicles, Transportation, Equipment & Machines. Each category includes one or

more subcategories. For example, the "Accident & Injury" category has the following
subcategories: Aerobics, Animals, Apparel, Asbestos, Boating, Bombing,
Burn/Thermal, Child Care, Child Safety, Construction, Coroner,
Cosmetologists/Beauticians/Barbers/Tattoos, Dog Bites, Entertainment, and Exercise.
Assignment of subject-matter categories to an expert profile record entails
using a function that maps a professional descriptor associated with the expert to a
leaf node in the expertise taxonomy. This function is represented with the following
equation:
T = f(S)
where T denotes a set of taxonomy nodes, and S is the professional descriptor. The
exemplary function f uses a lexicon of 500 four-character sets that map professional
descriptors to expertise areas. For example, experts having the "onco" professional
descriptor are categorized to the oncology specialist, oncologist, and pediatric
oncologist subcategories. Other taxonomies are also feasible. The exemplary
embodiment allows descriptors to map to more than one expertise area (that is,
category or subcategory) in the taxonomy. For example, "pediatric surgeon" can be
mapped to both the "pediatrics" and "surgery" nodes. Table 5 shows an
example of an expert profile record in which the expertise field has been mapped to
the category "Medical & Surgical" and to the subcategories "pediatrics," "blood &
plasma," and "oncology."
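The mapping T = f(S) can be sketched as a lookup of four-character keys against the words of a professional descriptor. The three lexicon entries below are drawn from the examples in the text, but matching on the first four characters of each word is an assumption made for illustration; how the 500-entry lexicon is actually applied is not specified.

```python
# Tiny stand-in for the lexicon of four-character sets (the text mentions 500 entries).
LEXICON = {
    "onco": {"oncology specialist", "oncologist", "pediatric oncologist"},
    "pedi": {"pediatrics"},
    "surg": {"surgery"},
}


def f(descriptor):
    """Return the set T of taxonomy nodes for professional descriptor S."""
    nodes = set()
    for word in descriptor.lower().split():
        # Assumed rule: match on the first four characters of each word.
        nodes |= LEXICON.get(word[:4], set())
    return nodes
```

Because the result is a set, a descriptor such as "pediatric surgeon" naturally maps to more than one node.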

Block 950 entails associating one or more text documents and/or additional
data sets with one or more of the professional profiles. To this end, the exemplary
embodiment logically associates or links one or more JVS documents and/or Medline
articles to expert-witness profile records using Bayesian-based record matching.
Table 6 shows a sample Medline article.

To link JVS documents and Medline abstracts to expert profile records, expert-
reference records are extracted from the articles using one or more suitable parsers
and matched to profile records using a Bayesian inference network
similar to the profile-matching technology described previously. For JVS documents,
the Bayesian network computes match probabilities using seven pieces of match
evidence: last name, first name, middle name, name suffix, location, organization, and
area of expertise. For Medline articles, the match probability is based additionally on
name rarity, as described in the previously mentioned Dozier patent application.
Figure 10 shows a flow chart 1000 of an exemplary method of growing and
maintaining one or more entity directories, such as the expert database used in system
100. Flow chart 1000 includes process blocks 1010-1050.
At block 1010, the exemplary method begins with receipt of a document. In
the exemplary embodiment, this entails receipt of an unmarked document, such as a
judicial opinion or brief. However, other embodiments receive and process other
types of documents. Execution then advances to block 1020.

Block 1020 entails determining the type of document. The exemplary
embodiment uses one or more methods for determining document type, for example,
looking for particular document format and syntax and/or keywords to differentiate
among a set of types. In some embodiments, type can be inferred from the source of
the document. Incoming content types, such as case law, jury verdicts, law reviews,
briefs, etc., have a variety of grammar, syntax, and structural differences. After type
(or document description) is determined, execution continues at block 1030.
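A keyword-based version of the type determination in block 1020 might look like the following Python sketch. The type labels, cue patterns, and source names are hypothetical; the text says only that format, syntax, keywords, or the document source can be used.

```python
import re

# Hypothetical keyword cues per document type.
TYPE_CUES = {
    "jury_verdict": [r"\bverdict\b", r"\bexpert witness(es)?\b", r"\bsettlement\b"],
    "brief": [r"\bbrief\b", r"\bappellant\b", r"\bargument\b"],
    "case_law": [r"\bopinion\b", r"\baffirmed\b", r"\breversed\b"],
}
# Hypothetical mapping from a known source to a document type.
SOURCE_TYPES = {"verdict-feed": "jury_verdict"}


def determine_type(text, source=None):
    """Infer the type from the source when known, else from keyword cues."""
    if source in SOURCE_TYPES:
        return SOURCE_TYPES[source]
    scores = {
        doc_type: sum(bool(re.search(p, text, re.IGNORECASE)) for p in patterns)
        for doc_type, patterns in TYPE_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Counting distinct cues rather than total occurrences keeps one repeated word from dominating the decision.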
Block 1030 entails extracting one or more entity reference records from the
received document based on the determined type of the document. In the exemplary
embodiment, four types of entity records are extracted: personal names, such as
attorneys, judges, expert witnesses; organizational names, such as firms and companies;
product names, such as drugs and chemicals; and fact profiles ("vernacular" of subject
area). Specialized or configurable parsers (finite state transducers), which are
selected or configured on the basis of the determined document type and the entity
record being built, identify and extract entity information for each type of entity.
Parsers extract information by specifically searching for a named entity
(person, address, company, etc.) or by relationships between entities. Parser
text-extraction is based on the data's input criteria. For example, more structured
(tagged) data enables a "tighter" set of rules to be built within a parser. This set of
rules allows more specific information to be extracted about a particular entity. A
more "free" data collection, such as a web site, is not as conducive to rule-based
parsers. A collection could also include a combination of structured, semi-structured,
and free data. More specifically, parsers are developed through "regular-expression"
methods. The regular expressions serve as "rules" for parsers to find entity types
and categories of information.
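The regular-expression approach can be illustrated with a minimal sketch for a structured (tagged) input. The "EXPERT:" line format, the pattern EXPERT_LINE, the short list of degrees, and the function extract_expert_references are assumptions made for this example; an actual parser for a given document type would use rules matched to that type's format.

```python
import re

# Hypothetical rule for a tagged line such as:
#   EXPERT: John Q. Smith, M.D., Boston, MA
EXPERT_LINE = re.compile(
    r"^EXPERT:\s*"
    r"(?P<first>[A-Z][a-z]+)\s+"
    r"(?:(?P<middle>[A-Z])\.\s+)?"                # optional middle initial
    r"(?P<last>[A-Z][A-Za-z'-]+)"
    r"(?:,\s*(?P<degree>M\.D\.|Ph\.D\.|J\.D\.))?"  # optional degree
    r",\s*(?P<city>[A-Z][A-Za-z ]+),\s*(?P<state>[A-Z]{2})\s*$",
    re.MULTILINE,
)


def extract_expert_references(text):
    """Return one entity reference record (a dict) per matching line."""
    return [
        {k: v for k, v in m.groupdict().items() if v is not None}
        for m in EXPERT_LINE.finditer(text)
    ]
```

Because the rule depends on the tagged layout, it extracts several fields per entity from structured data but would find nothing in free text, which is consistent with the observation that free collections are less conducive to rule-based parsers.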
Block 1040 attempts to link or logically associate each extracted entity
reference record with one or more existing authority directories. In the exemplary
embodiment, this entails computing a Bayesian match probability for each extracted
entity reference and one or more corresponding candidate records in corresponding
directories (or databases) that have been designated as authoritative in terms of
accepted accuracy. If the match probability satisfies match criteria, the records are
merged or associated and the input document. Execution then proceeds to block
1050.
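The linking step of block 1040 might be sketched as follows, assuming the match criteria amount to a single probability threshold. The function link_to_authority, its threshold value, and the score callable (any function returning a match probability for a reference and a candidate) are hypothetical and introduced only for illustration.

```python
def link_to_authority(reference, candidates, score, threshold=0.9):
    """Pick the best-scoring authority record, if it satisfies the criteria.

    Returns (candidate, probability) on success, or (None, best probability
    seen) when no candidate reaches the assumed threshold.
    """
    best, best_p = None, 0.0
    for candidate in candidates:
        p = score(reference, candidate)
        if p > best_p:
            best, best_p = candidate, p
    if best is not None and best_p >= threshold:
        return best, best_p
    return None, best_p
```

References for which no candidate is returned are the unmatched records that block 1050 then attempts to enrich.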
Block 1050 entails enriching unmatched entity reference records using a
matching process. In the exemplary embodiment, this enriching process entails
operating specific types of data harvesters on the web, other databases, and other
directories or lists, to assemble a cache of new relevant profile information for
databases, such as expert database 112 in Figure 1. The unmatched or unmarked
entity records are then matched against the harvested entity records using Bayesian
matching. Those that satisfy the match criteria are referred to a quality control
process for verification or confirmation prior to addition to the relevant entity
directory. The quality control process may be manual, semi-automatic, or fully
automatic. For example, some embodiments base the type of quality control on the
degree to which the match criteria are exceeded.
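Basing the type of quality control on the degree to which the match criteria are exceeded could, for example, take the form sketched below. The function quality_control_route, the threshold, and the two margins are hypothetical values assumed for illustration, not parameters stated in this disclosure.

```python
def quality_control_route(probability, threshold=0.90,
                          auto_margin=0.09, semi_margin=0.04):
    """Choose a quality control path from how far the threshold is exceeded."""
    if probability < threshold:
        return "no_match"          # match criteria not satisfied
    excess = probability - threshold
    if excess >= auto_margin:
        return "automatic"         # very strong match: no human review
    if excess >= semi_margin:
        return "semi-automatic"    # strong match: spot-check by an editor
    return "manual"                # borderline match: full manual review
```

Under these assumed margins, a match only slightly above the threshold is sent to manual verification, while a match far above it is accepted automatically.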
In some embodiments, block 1050 operates in parallel with blocks 1010-1040,
continually retrieving new entity-related data using any number of web
crawlers, relational databases, or CDs, and attempting to build new entity records.
Conclusion
The embodiments described above are intended only to illustrate and teach
one or more ways of practicing or implementing the present invention, not to restrict
its breadth or scope. The actual scope of the invention, which embraces all ways of
practicing or implementing the teachings of the invention, is defined only by the
following claims and their equivalents.

WE CLAIM:
1. A system comprising:
means (910) for extracting entity reference data for at least one person from
each of a plurality of documents to form entity reference records;
means (920) for forming at least one entity profile record by merging at least
one of the entity reference records for a person with at least one other
entity reference record for the same person by:
sorting the entity reference records by last name;
selecting an unmerged entity reference record and creating
an entity profile record from the selected unmerged entity
reference record; and
analyzing the unmerged entity reference record for
determining a probability that a person in an entity profile
record is the same person as referenced in the selected
unmerged entity reference record;
means (940) for categorizing at least one of the entity profile records based on
a taxonomy; and
means (950) for defining links between at least one of the entity profile
records and other documents or data sets.
2. The system as claimed in claim 1, comprising:
graphical user interface means (138) for defining a query related to an
entity, for viewing at least one document resulting from the query, for
selecting at least one of the defined links within a legal, financial,
healthcare, scientific, or educational document, and for causing retrieval
and display of at least a portion of the one of the entity profile records.
3. The system as claimed in claim 1 or claim 2, wherein at least one of the
recited means includes one or more processors, computer-readable medium,
display devices, and network communications, with the machine-readable
medium including coded instructions and data structures.
4. The system of any preceding claim:
wherein the at least one other entity reference records are contained in a
database (100);
wherein the means for forming at least one entity profile record may fail to
merge at least one of the entity reference records with at least one other
entity reference records in the database; and
wherein the system comprises:
means, responsive to a failure to merge at least one of the entity
reference records with at least one of the other entity reference
records, for attempting to match each of the at least one entity
reference record to a set of harvested entity reference records
outside the database; and
means, responsive to a match of at least one of the entity reference
records to at least one of the harvested entity reference records,
for merging the records and adding them to the database.
5. The system as claimed in any preceding claim, wherein the documents
comprise jury verdict settlement documents.
6. The system as claimed in claim 5, wherein the means for extracting entity
records comprises finite state transducers.
7. The system as claimed in any preceding claim, wherein the means for
extracting at least one of the entity reference records includes means for
identifying name, educational degree, area of expertise, organization, city, and
state.

8. The system as claimed in claim 4, wherein the means for attempting to match
at least one of the entity reference records to at least one of the harvested
entity reference records includes means for computing a Bayesian match
probability.
9. The system as claimed in any preceding claim:
wherein each of the entity reference records references a person; and
wherein the means for categorizing at least one of the defined entity records
based on a taxonomy is adapted to automatically categorize each entity
reference record to an expertise taxonomy.
10. The system as claimed in any preceding claim, wherein the means for automatically
extracting entity reference records is adapted to perform extraction based on
document type.
11. A method comprising:
extracting (910) entity reference data for at least one person from each of a
plurality of documents to form entity reference records;
forming (920) at least one entity reference profile by merging at least one of
the entity reference records for a person with at least one other entity
reference record for the same person by:
sorting the entity reference records by last name;
selecting an unmerged entity reference record and creating
an entity profile record from the selected unmerged entity
reference record; and
analyzing the unmerged entity reference record for
determining a probability that a person in an entity profile
record is the same person as referenced in the selected
unmerged entity reference record;
automatically categorizing (940) at least one of the entity profile records based
on an expertise taxonomy; and
defining links (950) between at least one of the entity profile records and other
documents or data sets.
12. The method as claimed in claim 11, comprising:
receiving a query (210) related to an entity, displaying (230) one or
more documents resulting from the query, receiving a selection of
one or more of the defined links within a legal, financial,
healthcare, scientific, or educational document; and retrieving and
displaying (240) of at least a portion of the at least one entity
profile record.
13. The method as claimed in claim 11 or claim 12,
wherein the at least one other entity records are contained in a database (100);
wherein at least one of the entity reference records may not be merged with at
least one other entity reference records in the database; and
wherein the method comprises:
in response to a failure to merge at least one of the entity reference
records with at least one of the other entity reference records,
attempting to match each of the at least one entity reference
record to a set of harvested entity reference records outside the
database; and
in response to a match of the at least one entity reference records to at
least one of the harvested entity reference records, merging the
matched records and adding them to the database.

