
Table 2 Term weighting and normalization in the vector space model.

From: A tutorial on information retrieval: basic terms and concepts

A typical term weighting strategy combines term frequency (TF) and inverse document frequency (IDF). They are defined as:

TF(term, document) = frequency of term in document

IDF(term) = log(N / n_term)

where N is the total number of documents in the collection and n_term is the number of documents containing the term. The two are combined as:

WEIGHT(term, document) = TF(term, document) * IDF(term)

The idea behind IDF is that the fewer documents contain a term, the more useful that term is in discriminating the documents that contain it from those that do not. Conversely, if a term occurs many times in a document, it is likely significant in representing that document's contents. With this weighting strategy, the highest weight is accorded to terms that occur frequently in a document but infrequently elsewhere in the collection.
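To make this concrete, the following is a minimal sketch of the weighting in Python; the toy corpus and the names idf and weight are illustrative assumptions, not part of the tutorial.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
documents = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs and the cats play".split(),
]

N = len(documents)  # total number of documents in the collection

def idf(term):
    # The fewer documents contain the term, the larger its IDF.
    n_term = sum(1 for doc in documents if term in doc)
    return math.log(N / n_term) if n_term else 0.0

def weight(term, doc):
    # WEIGHT(term, document) = TF(term, document) * IDF(term)
    tf = Counter(doc)[term]
    return tf * idf(term)

print(weight("cat", documents[0]))  # in few documents -> nonzero weight
print(weight("the", documents[0]))  # in every document -> IDF is 0, weight is 0
```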

With very large collections, not all terms in a document are used for indexing; some terms have to be removed. This is usually accomplished through the elimination of stopwords (such as articles and connectives) or the use of stemming (which reduces distinct word forms to their common grammatical root), as in the sketch below. Porter stemming [27] is probably the most widely used stemming algorithm in the IR community.
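A typical indexing pipeline along these lines might look as follows; this sketch assumes the NLTK package is available for its PorterStemmer, and the stopword list is a tiny illustrative sample rather than a standard one.

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

# Tiny illustrative stopword sample; real systems use much longer lists.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the"}

stemmer = PorterStemmer()

def index_terms(text):
    # Lowercase, drop stopwords, then reduce each word to its stem.
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(index_terms("The connections of retrieval systems are connected"))
# ['connect', 'retriev', 'system', 'connect'] -- distinct word forms
# reduce to a common root
```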

The most common approach to relevance ranking in the VSM is to give each document a score based on the sum of the weights of the terms it shares with the query, where terms in documents typically derive their weight from TF*IDF. The similarity between each document and the query is then computed with the formula:

SIM(document, query) = Σ WEIGHT(term, document) * WEIGHT(term, query)

where the sum runs over the terms common to the document and the query.
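A sketch of this scoring scheme, reusing the documents, idf, and weight names from the earlier sketch; the query is treated as a short document so it can be weighted the same way.

```python
def score(query_terms, doc):
    # Sum, over terms shared by document and query, the product of
    # their TF*IDF weights.
    shared = set(query_terms) & set(doc)
    return sum(weight(t, doc) * weight(t, query_terms) for t in shared)

query = "cat on a mat".split()
for doc in sorted(documents, key=lambda d: score(query, d), reverse=True):
    print(round(score(query, doc), 3), " ".join(doc))
```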

One problem with TF*IDF weighting is that longer documents accumulate more weight simply because they contain more words. For this reason, some approaches "normalize" document weights. The most common approach is cosine normalization [28], which divides each term weight by the Euclidean length of the document's weight vector:

WEIGHT_cos(term, document) = WEIGHT(term, document) / sqrt(Σ WEIGHT(t, document)^2)

where the sum in the denominator runs over all terms t in the document.
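Continuing the earlier sketch, cosine normalization can be expressed as below; dividing both document and query weights by their vector lengths makes the similarity from the previous sketch equal to the cosine of the angle between the two weight vectors.

```python
import math

def normalized_weight(term, doc):
    # Divide by the Euclidean length of the document's weight vector,
    # so long documents no longer score higher merely by being long.
    norm = math.sqrt(sum(weight(t, doc) ** 2 for t in set(doc)))
    return weight(term, doc) / norm if norm else 0.0

def cosine_similarity(query_terms, doc):
    shared = set(query_terms) & set(doc)
    return sum(normalized_weight(t, doc) * normalized_weight(t, query_terms)
               for t in shared)
```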

A number of variations on the basic VSM have also been developed. For example, Okapi weighting is based on the Poisson distribution [29]. Pivoted normalization, another variation of TF*IDF document weighting, is also often used [30].
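For illustration, here is a sketch of the best-known Okapi weighting, BM25; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults, and the IDF variant shown is one common formulation, not necessarily the exact form used in [29].

```python
import math

def bm25_score(query_terms, doc, documents, k1=1.2, b=0.75):
    # Average document length is used to normalize for length (the role
    # played by the pivot in pivoted normalization).
    avgdl = sum(len(d) for d in documents) / len(documents)
    N = len(documents)
    total = 0.0
    for term in set(query_terms):
        n_term = sum(1 for d in documents if term in d)
        if n_term == 0:
            continue
        idf = math.log((N - n_term + 0.5) / (n_term + 0.5) + 1)
        tf = doc.count(term)
        # Term frequency saturates as tf grows; b controls how strongly
        # document length discounts the score.
        total += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return total
```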