Information retrieval models

Boolean Search

Most IR systems allow the use of Boolean expressions, explicitly or implicitly, to construct a query. The query may include Boolean operators such as AND, OR, and NOT, and possibly a proximity operator such as NEAR. The IR system returns a list of documents that match the Boolean query. If no Boolean operators occur in the query, the usual default is to insert an AND between terms: a query such as “intellectual property” will be interpreted as “intellectual AND property”.
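
As a toy sketch (not the implementation of any particular system), Boolean retrieval can be modelled as set operations over an inverted index mapping each term to the set of documents that contain it; the implicit-AND default is then simply an intersection of the terms’ posting sets:

# Toy inverted index: term -> set of document identifiers (invented data).
index = {
    "intellectual": {1, 3, 5},
    "property": {1, 2, 5},
    "patent": {2, 4},
}
all_docs = {1, 2, 3, 4, 5}

def postings(term):
    return index.get(term, set())

# "intellectual property" defaults to "intellectual AND property":
print(postings("intellectual") & postings("property"))   # {1, 5}
# OR is set union, NOT is set difference against the whole collection:
print(postings("intellectual") | postings("patent"))     # {1, 2, 3, 4, 5}
print(all_docs - postings("property"))                    # NOT property: {3, 4}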

A skilled user of a Boolean search IR system will be able to refine searches by specifying Boolean expressions of arbitrary complexity.

However, most users prefer their searches “pre-refined” through the application of one or more of the models described below.

Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF is the classical IR technique for automatic determination of document relevance. The term frequency gives a measure of the importance of the term within the given document.18  The inverse document frequency is a measure of the general importance of the term across the collection. A high TF-IDF weight is reached by a high term frequency within the given document and a low frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.

There are many variations in the calculation of TF-IDF. Most of the formulae used are ad hoc (determined by trial and error), but some recent variations are grounded in sound statistical methods drawn from natural language modelling. TF-IDF has proven efficient to implement and useful in a wide range of IR applications. It is the benchmark against which other techniques are compared.

The formulae used tend to be proprietary, but those used in open source software are public. For example, Apache Lucene uses the following formulae:19
TF = sqrt(freq)
IDF = log(numDocs/(docFreq+1)) + 1
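
The two formulae translate directly into Python. This is only a sketch of the TF and IDF factors; Lucene’s full document score combines them with further normalisation and boost factors not shown here:

import math

def lucene_tf(freq):
    return math.sqrt(freq)                            # TF = sqrt(freq)

def lucene_idf(num_docs, doc_freq):
    return math.log(num_docs / (doc_freq + 1)) + 1    # natural log

# A term occurring 4 times in a document, in 9 of 1,000 documents:
print(lucene_tf(4) * lucene_idf(1000, 9))   # 2 * (ln(100) + 1), about 11.2
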
CAMBIA’s Dekko uses the following formulae to compute a normalised (i.e. in the range 0..1) TF-IDF score:
TF = freq_ij / max_occur_i
IDF = log(N / n_i) / log(N)
w_ij = TF * IDF
where
w_ij = TF-IDF score of term i in document j
freq_ij = frequency of term i in document j
max_occur_i = the maximum number of occurrences of term i in any document
N = number of documents in the collection
n_i = number of documents in which term i occurs at least once
The purpose of normalising the TF * IDF product is to allow relevance scores to be meaningfully compared across collections, allowing relevance results to be gathered from a “farm” of Dekko servers.
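
The formulae above transcribe directly into code; the following sketch simply restates them in Python (variable names follow the definitions given, and the example values are invented):

import math

def dekko_tfidf(freq_ij, max_occur_i, N, n_i):
    tf = freq_ij / max_occur_i             # TF = freq_ij / max_occur_i
    idf = math.log(N / n_i) / math.log(N)  # IDF = log(N/n_i) / log(N)
    return tf * idf                        # w_ij, in the range 0..1

# Term i occurs 3 times in document j, at most 6 times in any document,
# and in 50 of the 10,000 documents in the collection:
print(dekko_tfidf(3, 6, 10000, 50))   # 0.5 * 0.575..., about 0.29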

18 http://en.wikipedia.org/wiki/Tf-idf
19 http://lucene.apache.org/java/docs/

Proximity

When a query contains two or more terms, a natural measure of document relevance presents itself: within the document, what is the distance between each pair of query terms? Intuitively, if a pair of words is adjacent (e.g. part of a phrase), the relationship should be more significant than if the words are separated by many others. Elaborations of this scheme abound, some of them touching on aspects of so-called “natural language processing” – for example, applying different weightings depending on whether pairs of words appear in the same sentence or paragraph.

CAMBIA’s Dekko uses a triangular matrix of minimum distances between word pairs, summed and (as for TF-IDF) normalised to give a relevance score in the range 0..1.
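
The exact weighting and normalisation used by Dekko are not reproduced here, but the core computation (the sum of minimum distances over all pairs of query terms, i.e. the upper triangle of the distance matrix) might be sketched as follows, assuming each term’s word positions in the document are known:

from itertools import combinations

def proximity_sum(positions):
    # positions: dict mapping each query term to the list of word offsets
    # at which it occurs in the document.
    total = 0
    for a, b in combinations(positions, 2):   # each unordered term pair once
        total += min(abs(i - j) for i in positions[a] for j in positions[b])
    return total

# "intellectual" at offsets 4 and 40, "property" at offset 5:
print(proximity_sum({"intellectual": [4, 40], "property": [5]}))   # 1

A smaller sum means the query terms sit closer together; a normalisation to a 0..1 relevance score (as in Dekko) would then map smaller sums to higher scores.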

Vector Space Models

Vector Space Models use a vector in a high-dimensional space to represent each document.20 There are many variations, including Latent Semantic Analysis and Cluster Analysis (below), but generally the number of words in the corpus is reduced by using a stop word list, stemming, and perhaps by replacing synonyms with a single word or even phrases with concepts.21  Each remaining word in the corpus becomes a dimension in the vector space. A document is represented by a coefficient for each dimension, i.e. a vector in this space. TF-IDF is commonly used for the coefficients. Queries are mapped into the vector space in the same way as documents. The cosine of the angle between two vectors is most commonly used as a measure of similarity between documents, or of the relevance of a document to a specific query.
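
A sketch of the cosine measure over sparse term-weight vectors (the weights here are invented TF-IDF coefficients):

import math

def cosine(u, v):
    # u, v: sparse vectors as dicts mapping term -> weight (e.g. TF-IDF).
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc = {"gene": 0.8, "patent": 0.5, "sequence": 0.3}
query = {"gene": 1.0, "patent": 1.0}
print(cosine(query, doc))   # about 0.93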

20 http://en.wikipedia.org/wiki/Vector_Space_Model

21 http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node5.html

Latent Semantic Analysis (LSA)

LSA is meant to solve two fundamental problems in natural language processing: synonymy and polysemy.  In synonymy, different writers use different words to describe the same idea; thus, a person issuing a query in a search engine may use a different word from the one that appears in a relevant document, and may fail to retrieve it.  In polysemy, the same word can have multiple meanings, so a searcher may retrieve unwanted documents carrying the alternate meanings.22

LSA uses a Vector Space Model in which the number of dimensions is reduced by Principal Component Analysis, a well-known technique of linear algebra. LSA relies on vectors that are nearby in the low-rank space representing semantically similar concepts.23  This assumed relation between nearness and significance is often, but not always, valid, so such criteria can lead to results that are justifiable at the mathematical level but have no interpretable meaning in natural language. The right number of dimensions appears to be important; the best values yield simulations of human judgements up to four times as accurate as ordinary co-occurrence measures.
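
A minimal sketch of the dimensionality reduction, using a truncated singular value decomposition (the standard route to the low-rank space, closely related to Principal Component Analysis); the term-document matrix here is a toy example:

import numpy as np

# Toy term-document matrix (rows: terms, columns: documents); in practice
# the entries would be TF-IDF weights over a large vocabulary.
A = np.array([[1.0, 0.9, 0.0],
              [0.8, 1.0, 0.1],
              [0.0, 0.1, 1.0],
              [0.1, 0.0, 0.9]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent dimensions retained
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # each row: a document in the k-dim space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Documents 0 and 1 share vocabulary and land close together; document 2 does not.
print(cos(docs_k[0], docs_k[1]), cos(docs_k[0], docs_k[2]))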

22 http://en.wikipedia.org/wiki/Latent_semantic_analysis
23 http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node10.html

Cluster Analysis

Cluster analysis is a multivariate statistical technique that allows the identification of groups, or clusters, of similar objects in space.24 It is a technique with a long history and wide application in many fields of science, resulting in a huge body of literature on the subject. The application of cluster analysis to IR has hence been studied since the early days of computing, with the aim of improving the effectiveness of retrieval:

“The Cluster Hypothesis is fundamental to the issue of improved effectiveness; it states that relevant documents tend to be more similar to each other than to non-relevant documents, and therefore tend to appear in the same clusters”.25

As with LSA, the results of clustering methods can rely too heavily on purely mathematical techniques and empirical parameters, but clustering may be effective in conjunction with the other IR models mentioned here. Recent studies of so-called “post-retrieval clustering” hint at a way forward for Cluster Analysis techniques.
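
As a sketch of post-retrieval clustering (grouping only the documents returned for a query), plain k-means over the result vectors is one simple approach, though not necessarily the method used in the studies alluded to above; the vectors below are invented two-dimensional stand-ins for TF-IDF vectors:

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain k-means: assign each vector to the nearest centroid, then
    # recompute centroids, repeating for a fixed number of iterations.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Toy TF-IDF vectors for the top results of a query; two obvious groups.
results = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(kmeans(results, 2))   # e.g. [0 0 1 1] (cluster labels may be swapped)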

24 http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node11.html
25 Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240.