The Dekko search engine is the back end of the BiOS Patent Lens. Key features of Dekko are:
- Full inverted text index, with position lists and word counts
- Customisable XML parser for patent documents
- Minimal perfect hash technique for fast word search
- Boolean, TF-IDF and Proximity Matrix information retrieval (IR) models
- Written entirely in C; runs on any variety of Unix/Linux
- Threaded dekko server application for fast response to multiple user queries
The Dekko system has three main components:
- The dek indexer scans a document collection and creates an inverted ASCII index
- The dekbin compiler takes an ASCII index produced by *dek *and makes a compressed binary index
- The dekko server uses a compressed binary index produced by dekbin to answer queries
The dek indexer
The first stage in most Information Retrieval systems is the creation of an “Inverted Text Index”. This is just a list of all the words found in all the documents in a collection. For each word, a list is kept of all the files in which it occurs, and the positions at which it occurs in each of those files.
Key questions here include:
- What constitutes a ‘word’?
- What words should be indexed?
The dekbin compiler
An ASCII inverted index can be a very bulky item – typically of the same order of size as the original document collection. Searching such an index could be a slow process. The dekbin compiler takes a number of steps to make this problem manageable:
- For efficient storage, the ASCII index is converted into a compressed binary form
- For efficient searching, words are stored in the binary index using a minimal perfect hash26
- Loads the essential parts of the binary index into memory
- Maps the rest of the binary index into virtual memory (using mmap())
- Opens a socket connection on a user specified port
Formatted queries are then sent to the port, and if the query makes sense a list of matching documents is returned. In theory, a user could telnet directly to the port and type in a query. In practice, only the most trivial of queries can be done this way – the query format is complex and tricky to type at the keyboard! Usually a client program, for example by generating a form, will initiate the connection, send a correctly formatted query, and collect the results.27
26 Provided by Bob Jenkins at http://burtleburtle.net/bob/hash/perfect.html
27 Currently at CAMBIA we use a client written in Perl, but such a service can be provided with any other scripting language (Python, Ruby), Java, or C or C++. The fundamental requirement is the ability to connect to and communicate with a socket.
Features of Dekko
As described above, the three major components of Dekko, the dek text indexer, the dekbin compiler, and the dekko server, together provide a solution to the task of indexing a collection of text documents (in this case, patent documents) and then searching the index for documents relevant to a query.
There are a number of unique features of patent data as sourced from the various Patent Offices (WIPO, EPO, USPTO, IP Australia).28 For historical reasons, patent data comprise an interesting mixture of formats and encodings. We see data in 80-column card format, SGML/XML format, and text extracted from TIFF images using OCR, and variations in between. Much of the SGML/XML data is poorly formed and resists parsing by commonly available parsers. Embedded in the text we see what are now called “entities” in SGML and XML terminology. In older data, the formats of these entities do not conform to SGML standards, and entities are used to both encode data and to provide formatting and layout descriptions.
Similarly, we see various character encodings: 7-bit Asciii, ISO-8559-1 Latin1, 7 bit Ascii with some 8-bit characters derived from EBCDIC (mainframe) code pages.
The dek indexer is designed to be able to cope with all these variations of data format. It incorporates a purpose-built SGML/XML parser that is tolerant of badly formed documents and copes with unusual character sets. Dek is highly configurable – new formats and styles can be accommodated by simply editing a configuration file.
The presence of OCR data (WIPO PCT documents, EPO pre-2000 documents) causes other potential problems (see below). The OCR process is never perfect, due to limitations in both the quality of the original images from which the text is extracted and the imperfect nature of OCR algorithms. This can cause the text contain large numbers of essentially random words. The Dekko system incorporates several filtering and compression techniques to reduce the size of the indexes. Without these filters, the PCT index would contain more than 40 million words.
Similar filters remove DNA and protein sequences from the searchable index. These sequences are no more than random strings of letters, in the case of DNA containing combinations of the letters A,C,G and T.
Having filtered the text, dek compiles an inverted index. This is stored on disk in ASCII format, and even with rigorous word filters the size of an ASCII index can be of the same order as the text from which it was derived. The dekbin component of the dekko system processes and analyses the ASCII index – it compresses the ASCII index by a factor of 3; it counts the frequency of words in documents (and sections of documents, eg abstract, title, claims). The word counts provide the basis for relevance ranking in the dekko server.
It is also in dekbin that a technique called minimal perfect hashing is used. For each word in the index, a single number is computed. Every word in the index gets a unique number; the numbers range from 1 to the total number of words in the index.
The eponymous dekko server is the program that searches a dekbin compiled dek index. The dekko user sends the server a query comprising a number of words; for each word the minimal perfect hash number can be computed rapidly (just a few machine instructions) and the details of the word – which documents it occurs in, where it occurs in each document, how often it occurs in the whole collection – can be retrieved from a known place in the index.
The minimal perfect hashing technique results in extremely fast word retrieval; this, combined with ability to store the bulk of the compressed index in memory makes dekko into a very fast Information Retrieval (IR) system.
It should be kept in mind that the essence of providing a good search capability for patent data is not necessarily the search algorithm itself – it is the ability of the engine to work together with all the data formats and structures that are present in collections of patent documents. That knowledge has been designed and configured into Dekko.29
28Dekko has been purpose-designed at CAMBIA, specifically with the task of indexing these patent data.
29The Dekko system has been successfully used in the CAMBIA BiOS Patent Lens for several years. It is currently used to index WIPO (PCT), EPO, and US patent data. For a time, Australian patent data was also being added, and although this service has lapsed for want of funding, Dekko can do it, has done it, and is in fact ideally suited to doing it.