
# tf-idf

This is a small and reasonably performant implementation of TF-IDF written in Clojure.

## Usage

There is only a single namespace, `dk.cst.tf-idf`. This namespace contains the core TF-IDF functions:

```clojure
(tf documents)      ; => seq of normalized term frequency maps
(idf documents)     ; => inverse document frequency map
(tf-idf documents)  ; => seq of term->tf-idf maps
(vocab documents)   ; => set containing the vocabulary
```

These core functions all take a sequence of documents (usually strings, though this depends on what `*tokenizer-xf*` is bound to) and return regular Clojure collections. To avoid redundant recalculation, the result of an intermediate step can usually be fed directly into the next step of the algorithm.

The `dk.cst.tf-idf` namespace also contains a few extra utility functions, e.g. functions for picking terms from TF-IDF results:

```clojure
;; Top 3 terms for every document.
(top-n-terms 3 (tf-idf documents))

;; Top 50 terms based on the highest recorded TF-IDF score.
(take 50 (order-terms max (tf-idf documents)))

;; Top 50 terms based on TF-IDF score sums.
(take 50 (order-terms + (tf-idf documents)))
```

## Alternative tokenizers

The `*tokenizer-xf*` dynamic var holds the default transducer used to tokenize input documents.

To perform other kinds of text normalization, this dynamic var can be rebound to an alternative implementation. The simplest way to create a new tokenizer transducer is the included `->tokenizer-xf` function:

```clojure
;; assumes (require '[clojure.string :as str])
(binding [*tokenizer-xf* (->tokenizer-xf :tokenize #(str/split % #"\s"))]
  (tf-idf documents))
```

## Explanation of terms

This is a very brief explanation of the different terms used in TF-IDF.

### Vocab

- The set of all words considered in the corpus.

### Term frequency

```
tf(d,t) = count(t in d) / count(x in d)
```

- How many times does the word/lemma appear in the document?
- Each frequency score is normalised by dividing by the total number of words in the text (`count(x in d)`).
- Only terms in the vocab are considered!
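As a quick illustration of the formula above (independent of this library's Clojure API), a term-frequency computation might look like the following sketch, which assumes a naive whitespace tokenizer:

```python
from collections import Counter

def term_frequency(document):
    """tf(d,t) = count(t in d) / count(x in d), using a naive
    whitespace tokenizer (an assumption for illustration only)."""
    tokens = document.lower().split()
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

term_frequency("the cat sat on the mat")
# "the" occurs 2 times out of 6 tokens -> 2/6
```

Since every count is divided by the same total, the scores for one document always sum to 1.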

### Document frequency (absolute)

```
df(d,t) = count(d containing t)
```

- How many documents does the word/lemma appear in?
- Not normalised by default in this implementation, although you can always run `(normalize-frequencies df-result)` to achieve this.
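A minimal sketch of absolute document frequency, again assuming whitespace tokenization rather than this library's actual tokenizer:

```python
from collections import Counter

def document_frequency(documents):
    """df(t) = number of documents in which the term occurs at least once."""
    return Counter(term
                   for doc in documents
                   for term in set(doc.lower().split()))

document_frequency(["the cat", "the dog", "a bird"])
# "the" appears in 2 of the 3 documents -> {"the": 2, ...}
```

Note the `set(...)`: a term occurring several times in one document still counts only once towards its document frequency.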

### Inverse document frequency

```
idf(d,t) = log(count(d) / (count(d containing t) + 1))
```

- The total number of documents divided by the document frequency, i.e. the reciprocal of the (normalised) document frequency.
- This has the opposite effect of `df(d,t)`: rarer words get a higher inverse document frequency than common words.
- To avoid dividing by zero, 1 is added to `count(d containing t)`.
- To keep very rare terms from having outsized scores, the logarithm is applied to the whole expression.
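Putting those pieces together, a hedged Python sketch of the smoothed, log-scaled formula (whitespace tokenization assumed for illustration):

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """idf(t) = log(count(d) / (count(d containing t) + 1)).
    The +1 smoothing avoids division by zero for unseen terms."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc.lower().split()))
    return {term: math.log(n / (count + 1)) for term, count in df.items()}

idf = inverse_document_frequency(["the cat", "the dog", "a bird"])
# "the" is in 2 of 3 docs -> log(3 / 3) = 0.0; "cat" -> log(3 / 2) ≈ 0.405
```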

### TF-IDF

```
tfidf(d,t) = tf(d,t) * idf(d,t)
```

- The product of the term frequency and the inverse document frequency.
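Combining the previous steps, a self-contained sketch of the full computation (naive whitespace tokenizer assumed; the library's defaults may differ):

```python
import math
from collections import Counter

def tf_idf(documents):
    """tfidf(d,t) = tf(d,t) * idf(t) for every term in every document."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / (count + 1)) for term, count in df.items()}
    return [{term: (count / len(tokens)) * idf[term]
             for term, count in Counter(tokens).items()}
            for tokens in tokenized]

scores = tf_idf(["the cat sat", "the dog ran", "a bird flew"])
# "the" occurs in 2 of 3 docs -> idf = log(3/3) = 0, so its tf-idf is 0;
# "cat" occurs in only one document, so it scores higher.
```

This shows the intended effect: terms spread across many documents are pushed towards zero, while terms concentrated in few documents stand out.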

## Links