I’m trying to define the popular concept of “tags” in terms of Information Retrieval and to find an appropriate algorithmic strategy for automating the process of choosing tags for text documents.
Tags are a handy, intuitive concept for labeling and describing pieces of information. They are usually assigned informally, and as with any matter of “common sense”, it is difficult to find a formal definition for them. It is even more difficult to reproduce and algorithmize the mental process of choosing a tag.
To avoid that fruitless work, I will consider some IR approaches that may be used to evaluate a document’s terms as possible tag candidates. To make that evaluation possible, two important purposes of tags can be identified:
- To name a thing means to separate it from others. So a good tag should identify a document by showing its unique feature: how the document differs from the other documents in a corpus.
- A tag should also identify a semantic group to which the document belongs.
These two key properties seem completely different, even mutually opposed and discordant. Nevertheless, both are important, and the best tags perhaps integrate both properties in a dialectical way. A brief analysis of human-assigned tags on popular social web services confirms that assumption.
Traditional keyword extraction: TFIDF
The first autotagging strategy is based on a popular term weighting technique proposed by the classical IR discipline. The technique is known as TFIDF, an abbreviation that reflects the basic formula for calculating the measure of relevance (or “weight”) of a specific term in a given document: Term Frequency × Inverted Document Frequency.
Though there are a number of variations of this formula (differing in their normalization techniques), the basic idea behind them is the same: the weight of a term in a document is directly proportional to the number of times the term appears in the document (Term Frequency, TF) and inversely proportional to the total number of documents containing the term (Document Frequency, DF).
This weighting scheme effectively filters out so-called “common words” (with high DF) as well as rare terms and misspellings (with both low TF and low DF). The terms with high TF and relatively low DF get the highest “importance” ranks and can be considered as possible tag candidates.
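As a quick illustration of the weighting itself (the toy corpus below is my own, and real implementations usually add normalization on top of this raw formula):

```python
import math

# Toy corpus; in practice these would be tokenized real documents.
corpus = [
    "cats purr and cats sleep".split(),
    "dogs bark and dogs run".split(),
    "cats and dogs are pets".split(),
]

def tfidf(term, doc, corpus):
    """Raw term count in the document times log(corpus size / DF)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

doc = corpus[0]
weights = {term: tfidf(term, doc, corpus) for term in set(doc)}
# "and" occurs in every document (DF = 3), so its weight is log(1) = 0,
# while "purr" and "sleep" are unique to this document and score highest.
```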
Thus, a TFIDF-based document tagging strategy may be described as follows:
- Build a list of all unique terms of a given document (term vector)
- Calculate TFIDF weight for each term in the list
- Sort the terms by their TFIDF weights
- Select the N topmost terms, where N is the number of tags we want to assign to each document. We could also choose another method of limiting the number of tags, e.g. selecting the terms above a specified TFIDF threshold.
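The steps above can be sketched in Python (the toy corpus and function names are my own, for illustration; ties are broken alphabetically so the result is deterministic):

```python
import math

def tfidf_tags(doc_tokens, corpus, n=3):
    """Rank a document's unique terms by TFIDF and return the top N as tags."""
    weights = {}
    for term in set(doc_tokens):                         # 1. unique term vector
        tf = doc_tokens.count(term)                      # term frequency
        df = sum(1 for d in corpus if term in d)         # document frequency
        weights[term] = tf * math.log(len(corpus) / df)  # 2. TFIDF weight
    ranked = sorted(weights, key=lambda t: (-weights[t], t))  # 3. sort by weight
    return ranked[:n]                                    # 4. top N terms as tags

corpus = [
    "solar panels convert sunlight into electricity".split(),
    "wind turbines convert wind into electricity".split(),
    "a battery stores electricity for later use".split(),
]
print(tfidf_tags(corpus[0], corpus, n=2))  # → ['panels', 'solar']
```

Note how the corpus-wide term “electricity” (DF equal to the corpus size) gets zero weight and can never become a tag here — which is exactly the over-differentiation problem discussed next.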
Improved autotagging strategies
The problem with the simple TFIDF-based analysis reviewed above is that it conflicts with the second part of our definition of the tags’ purpose: unification. The approach works well when we need to know which terms make a document different from others, but it doesn’t help when we need to select the terms which connect the document with its counterparts.
With this technique, the “tag cloud” tends to grow proportionally to the number of indexed documents, while the number of documents per single tag shrinks at the same time. In the extreme case, each tag corresponds to a unique document. Such a level of granularity is not what we want if we are going to control our taxonomy size and care about the grouping function of our tags. The tags which would contribute to document unification and grouping are overlooked by this algorithm because their DFs are higher than those of the terms contributing to a document’s uniqueness. We could smooth that effect by reducing the influence of the DF parameter, but that increases the risk of common words appearing as document tags. Obviously, tweaking the basic TFIDF formula is not a solution to this problem.
To implement a better autotagging strategy (one which would consider a document grouping factor), we need to analyze a document together with its semantic context — a cluster of documents including the initial document plus the documents similar to it. The TFIDF ranking algorithm applied to that cluster gives us the terms relevant to the whole group, while common words are still effectively suppressed. Since the documents inside the cluster are likely to be mutually similar, this strategy tends to re-use the same tag set for semantically related documents as much as possible.
An easy way to find similar documents is to build the document’s term vector (as we did before) and to submit it (or its highest-ranked part) as a query to a search system; some search engines implement a “more-like-this” feature that works exactly this way. Better results may be achieved if the search system allows setting a “boost factor” for individual query terms, proportional to their weights in the term vector. Then we simply select the search results (above a specific threshold) as the documents similar to the given one.
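Without a full search engine at hand, this step can be approximated with cosine similarity over TFIDF term vectors — one common way “more-like-this” features are implemented. The corpus, helper names, and threshold below are illustrative, not from the article:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """TFIDF term vector of a document, as a term -> weight dict."""
    n = len(corpus)
    return {term: count * math.log(n / sum(1 for d in corpus if term in d))
            for term, count in Counter(doc_tokens).items()}

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def similar_docs(query_idx, corpus, threshold=0.05):
    """Indices of documents whose similarity to the query exceeds the threshold."""
    vecs = [tfidf_vector(d, corpus) for d in corpus]
    return [i for i, v in enumerate(vecs)
            if i != query_idx and cosine(vecs[query_idx], v) > threshold]

corpus = [
    "solar panels convert sunlight into electricity".split(),
    "wind turbines convert wind into electricity".split(),
    "a battery stores electricity for later use".split(),
]
print(similar_docs(0, corpus))  # → [1] (the wind document shares weighted terms with the solar one)
```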
The whole strategy may look like this:
- Find a set of documents similar to a given document.
- Build a composite term vector as the union of the term vectors of the given document and its similar documents.
- Calculate TFIDF term weights against the composite term vector.
- Sort the terms by their TFIDF weights
- Select the topmost terms as the document tags
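A sketch of this cluster-based strategy, reusing the basic TFIDF ranking from before (the corpus and helper names are again my own; in a real system the similar-document indices would come from the search step):

```python
import math

def tfidf_tags(doc_tokens, corpus, n=3):
    """Basic TFIDF ranking: top N terms of a document, ties broken alphabetically."""
    weights = {}
    for term in set(doc_tokens):
        df = sum(1 for d in corpus if term in d)
        weights[term] = doc_tokens.count(term) * math.log(len(corpus) / df)
    return sorted(weights, key=lambda t: (-weights[t], t))[:n]

def cluster_tags(doc_idx, corpus, similar_indices, n=3):
    """Tag a document via the composite term vector of its cluster:
    the document itself plus the documents found similar to it."""
    composite = list(corpus[doc_idx])
    for i in similar_indices:
        composite.extend(corpus[i])  # merge the term vectors, keeping counts
    return tfidf_tags(composite, corpus, n)

corpus = [
    "python tutorial covering python functions".split(),
    "python tutorial covering python classes".split(),
    "gardening tips for growing tomatoes".split(),
]
# Document-level TFIDF picks the unique term; the cluster picks the group term.
print(tfidf_tags(corpus[0], corpus, n=1))  # → ['functions']
print(cluster_tags(0, corpus, [1], n=1))   # → ['python']
```

The contrast in the two printed results shows the differentiation/unification balance in action: the same document gets a document-unique tag from the basic strategy and a group-level tag from the cluster strategy.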
By tweaking the similarity thresholds and the TFIDF normalization (e.g. considering similarity coefficients in the TFIDF calculation), we can adjust the results to get a “more specific” or “more general” set of tags (the differentiation/unification balance), depending on our goals.
Another improved autotagging strategy may be implemented as an adaptive technique, where the TFIDF evaluation takes into account whether a given term is already a tag or not. However, this approach strictly depends on an initial tag set and will likely require some manual tagging before it can be used in automatic mode.
P.S. If you found this article useful, please consider participating in this survey. It will help the author in his further work on this subject. Thanks!