Information Retrieval

Information Retrieval (IR) is the art and science of searching for information in documents or for documents themselves. IR draws upon cognitive psychology, linguistics, semiotics, information science, computer science and librarianship.

While user interface design and functionality are important to the usability of systems, the underlying IR technology utilised is also important. Technologies that may be excellent for small data sets with a few concurrent users may be inflexible and unable to scale. With the rapid increase in the capacity, speed and affordability of random access memory (RAM) over recent years, techniques that allow full-text indexes to be held in memory have become practical. A form of index, known as an inverted index, in which any word in the corpus to be searched can be used as a key to find the documents in which it occurs, has proved to be extremely scalable.

Various IR techniques make use of:

  • the content of the documents being searched;
  • meta-data describing the documents and relationships between them, e.g. citations;
  • information external to the corpus being searched, such as dictionaries, thesauri and other bodies of relevant information.

Search Engines

A search engine is software that performs IR using one or more IR models. For text searching, the most prominent of these are the Boolean and Vector Space models and their variants.
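To illustrate the difference between the two models, the following sketch retrieves documents from an invented three-document corpus with a strict Boolean AND, and then ranks the same corpus by cosine similarity of simple TF-IDF vectors. The corpus, query, weighting scheme and tokenisation are simplified placeholders for illustration only, not a description of any particular product.

    # Toy corpus and query (invented for illustration).
    import math
    from collections import Counter

    docs = {
        1: "patent search for wind turbine blade design",
        2: "trade mark search for a shape mark",
        3: "wind turbine control system patent",
    }
    query = "wind turbine patent"

    def tokens(text):
        return text.lower().split()

    # Boolean model: a document matches only if it contains every query term.
    boolean_hits = [d for d, text in docs.items()
                    if all(t in tokens(text) for t in tokens(query))]

    # Vector Space model: rank documents by cosine similarity of TF-IDF vectors.
    N = len(docs)
    df = Counter(t for text in docs.values() for t in set(tokens(text)))

    def tfidf(text):
        tf = Counter(tokens(text))
        return {t: tf[t] * math.log(N / df[t]) for t in tf if df.get(t)}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    query_vector = tfidf(query)
    ranking = sorted(docs, key=lambda d: cosine(query_vector, tfidf(docs[d])),
                     reverse=True)

    print("Boolean matches:", boolean_hits)   # unranked set of exact matches
    print("Vector space ranking:", ranking)   # every document, best match first

The Boolean model returns an unranked set of documents that satisfy the query exactly, while the vector space model orders every document by its similarity to the query.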

It should be noted that for this study, visual image searching was also investigated, for use in design patent searching and in searching normal trade marks and shape marks. The software available has significant technical limitations, and the greatest technical progress has tended to focus on security-related images such as faces and fingerprints rather than on IP industry uses.

It would nonetheless be useful to check this area of software development periodically. Although IP Australia trade mark and design patent examiners indicated that such tools would be of limited use for searching IP data (because most searching is done using classifications), they could eventually facilitate searching of the non-IP literature.

The two main statistics used to describe the quality of retrieval are:

  • Precision – the proportion of retrieved documents that are relevant:
    P = (number of relevant documents retrieved) / (number of documents retrieved)
  • Recall – the proportion of relevant documents that are retrieved:
    R = (number of relevant documents retrieved) / (number of relevant documents)

Thus, a search engine that retrieves all documents relevant to a query is said to have high recall, whereas a search that returns only relevant documents is said to have high precision.
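To make these measures concrete, the following sketch computes precision and recall for an invented set of retrieved documents and an invented set of relevance judgements:

    # Retrieved set and relevance judgements are invented for illustration.
    retrieved = {"D1", "D2", "D3", "D4", "D5"}        # documents returned by a search
    relevant  = {"D2", "D4", "D6", "D7"}              # documents judged relevant by an expert

    true_positives = retrieved & relevant             # relevant documents that were retrieved

    precision = len(true_positives) / len(retrieved)  # 2 / 5 = 0.4
    recall    = len(true_positives) / len(relevant)   # 2 / 4 = 0.5

    print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")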

Although the ideal technology would have both high recall and precision, in practice high recall can sometimes come at the cost of irrelevant hits (low precision) and high precision can come at the cost of the omission of some relevant hits (low recall).

These parameters are important to understand and control in the choice of an algorithm for intellectual property searching: an examiner doing a prior art search may require high recall and tolerate some loss of precision, whereas precision may be more important to a researcher doing a technology search.

Ideally, the expert searcher should be able to modulate precision and recall directly, as well as through the choice of search terms and of ranges (such as date ranges) within search fields. An interface that does not allow such control, e.g. the Google default interface, may not be ideally suited to intellectual property searching in general, although particular versions of such an interface may be suitable for particular types of searches.

The quantitative evaluation of recall and precision statistics requires relevance to be determined by some means other than the system being evaluated, e.g. human experts may be used to rate relevance with respect to an information need expressed in natural language. This is the approach taken by the Text REtrieval Conference (TREC), which since 1992 has provided a controlled environment for the evaluation of different IR techniques for specific tasks. The evaluation of the IR system includes the process of the human user translating the information need from natural language into the search engine’s query language. Typically, much information is lost in this translation, limiting the maximum achievable precision and recall.

Recent advances in IR attempt to address this problem in various ways:

  • Providing rich query languages that allow expert users to more closely represent their information requirement (improves both precision and recall);
  • Broadening the search using search terms extracted from documents that match the initial query, either by automatic query expansion or by query expansion using some form of relevance feedback (improves recall at a cost to precision; a sketch of this approach follows this list);
  • Broadening the search using Natural Language Processing (NLP) techniques including stemming and using synonyms from thesauri (improves recall);
  • Multiple iterations of human/computer interaction using NLP techniques to refine the search (improves both precision and recall);
  • Application of NLP techniques based on Artificial Intelligence (AI) in the determination of document relevance, e.g. the use of Bayesian inference networks (improves both precision and recall) and Word Sense Disambiguation (improves precision).
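As an illustration of the second technique above, the following sketch performs a simple form of automatic query expansion by pseudo-relevance feedback: the most frequent terms of the top-ranked documents from an initial search are appended to the query. The corpus, stop word list and parameters (top_k, n_terms) are invented for illustration and do not reflect any particular system.

    from collections import Counter

    STOP_WORDS = {"a", "of", "the", "and", "for", "to", "in"}

    def expand_query(query_terms, ranked_docs, top_k=3, n_terms=2):
        """Return the query terms plus n_terms frequent terms drawn from the
        top_k documents of the initial ranking (pseudo-relevance feedback)."""
        counts = Counter()
        for doc_text in ranked_docs[:top_k]:
            for term in doc_text.lower().split():
                if term not in STOP_WORDS and term not in query_terms:
                    counts[term] += 1
        expansion = [term for term, _ in counts.most_common(n_terms)]
        return list(query_terms) + expansion

    initial_query = ["wind", "turbine"]
    ranked_docs = [
        "wind turbine blade aerodynamics and rotor efficiency",
        "rotor blade design for wind turbines",
        "turbine rotor control",
    ]
    print(expand_query(initial_query, ranked_docs))
    # ['wind', 'turbine', 'rotor', 'blade'] – a broader query, improving recall

The expanded query retrieves documents that use related vocabulary the searcher did not supply, which is why recall improves but precision may fall.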

Inverted Index

An Inverted Index data structure is used to support most IR techniques. Each word in the corpus has an index entry, which specifies the documents that include that word.14  A search for documents that contain a word then becomes a simple lookup in the index.15  The list of words may be reduced by using a ‘stop word list’16 or stemming. Index entries may be augmented with additional information such as the number of times the word appears in the document or the position of each occurrence of the word.
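The following sketch builds an inverted index over a toy in-memory corpus and answers single-word and Boolean AND queries by simple set lookups and intersections. The corpus and stop word list are invented for illustration; as noted above, real indexes would typically also record term frequencies and positions.

    from collections import defaultdict

    docs = {
        1: "the wind turbine patent",
        2: "a patent for a shape mark",
        3: "wind turbine blade design",
    }

    STOP_WORDS = {"a", "the", "for"}          # a tiny 'stop word list'

    index = defaultdict(set)                  # term -> set of document ids (postings)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term].add(doc_id)

    # A search for documents containing a word is now a simple lookup:
    print(sorted(index["turbine"]))                  # [1, 3]

    # Multi-word (Boolean AND) queries intersect the posting sets:
    print(sorted(index["wind"] & index["patent"]))   # [1]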

The inverted index is largely responsible for the impressive speed and scalability of current search engines. However, for large collections of documents the compilation of an inverted index can be a time-consuming task, and the fast searching of an inverted index may require large amounts of computer memory. A large amount of research has been conducted on the problem of compressing inverted indexes to reduce these memory requirements.
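As background on how such compression can work (this particular scheme is offered purely as an illustration and is not drawn from the text above), one common approach stores each posting list as the gaps between ascending document numbers and encodes the gaps with variable-byte codes:

    def variable_byte_encode(number):
        """Encode a positive integer as a variable-byte code: 7 bits per byte,
        with the high bit set on the final byte."""
        bytes_out = []
        while True:
            bytes_out.insert(0, number % 128)
            if number < 128:
                break
            number //= 128
        bytes_out[-1] += 128          # mark the last byte of the code
        return bytes(bytes_out)

    def compress_postings(doc_ids):
        """Compress a sorted posting list by encoding gaps between document ids."""
        encoded = bytearray()
        previous = 0
        for doc_id in doc_ids:
            encoded += variable_byte_encode(doc_id - previous)
            previous = doc_id
        return bytes(encoded)

    postings = [3, 7, 154, 159, 202]              # sorted document numbers
    compressed = compress_postings(postings)      # gaps 3, 4, 147, 5, 43
    print(len(compressed), "bytes for", len(postings), "postings")

Because the document numbers in a posting list are ascending, the gaps between them are usually small and most fit in a single byte, which can substantially reduce the memory needed to hold the index.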

___________________

14 http://en.wikipedia.org/wiki/Full_inverted_index

15 Paul E. Black, “inverted index”, from Dictionary of Algorithms and Data Structures, Paul E. Black, ed., NIST.  http://www.nist.gov/dads/HTML/invertedIndex.html
16 Stop words are those words which are so common that they are useless to index or use in searches. In English some obvious stop words would be "a", "of", "the", "I", "it", "you" and "and". Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept in his design and implementation of KWIC indexing programs (http://en.wikipedia.org/wiki/Stop_words).

Stemming

In most cases, morphological variants of words have similar semantic interpretations and can be considered as equivalent for the purpose of IR applications.17

The key terms of a query or document may be represented by stems rather than by the original words. For IR purposes, it doesn’t usually matter whether the stems generated are genuine words or not – thus, “computation” might be stemmed to “comput” – provided that different words with the same ‘base meaning’ are reduced to the same form and words with distinct meanings are kept separate. Stemming provides several advantages:

  • a query can find a document with different morphological variants of the search term (improved recall); and
  • a reduction in the number of distinct terms needed to represent the corpus lowers computer processing requirements.
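The following sketch illustrates suffix-stripping stemming with a deliberately simplified rule list; it is not the Porter algorithm or any specific stemmer, but it reproduces the "computation" to "comput" behaviour described above:

    # Simplified, invented suffix rules, ordered longest first.
    SUFFIXES = ["ational", "ation", "ing", "ers", "er", "ed", "es", "s"]

    def stem(word):
        """Strip the first matching suffix (rules ordered longest first),
        keeping a stem of at least three letters."""
        word = word.lower()
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    # Morphological variants reduce to the same form, so a query for any one
    # of them can match documents containing the others (improved recall):
    print([stem(w) for w in ["computation", "computing", "computers", "computed"]])
    # ['comput', 'comput', 'comput', 'comput']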

___________________

17 http://en.wikipedia.org/wiki/Stemming