Since a computer cannot analyze text in its raw form, text must first be converted into a numerical format, a step known as vectorization. One common way to vectorize text data is TFIDF.
When dealing with textual data, it’s useful to know which words matter most in a given document. For instance, if you’re trying to retrieve documents on a particular topic, distinctive words may be far more informative than generic words that occur very frequently.
While a straightforward count vectorizer will tell you how frequently a term occurs in a given document, a TFIDF (Term Frequency Inverse Document Frequency) approach tells you how much weight to give a word in that document.
TFIDF = term frequency (TF) × inverse document frequency (IDF).
In simple terms:
Term frequency (TF) is how often a term occurs in a given document, divided by the total number of words in that document: TF(t, d) = (count of t in d) / (total words in d).
Inverse document frequency (IDF) tells us which terms occur frequently across all documents and which occur rarely: IDF(t) = log(N / number of documents containing t), where N is the total number of documents. Terms that are very common get a lower IDF, and rare terms get a higher one.
Our TFIDF score assigns each word in a document a weight rather than a raw frequency, which gives us insight into which words in the text are most and least informative.
The most informative words are those with a higher score in a given document; those with a lower score (commonly used words) are less informative.
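To make those two formulas concrete, here’s a minimal sketch in plain Python. It assumes the common textbook variant IDF(t) = log(N / df(t)) with no smoothing, and the tiny corpus is made up purely for illustration:

```python
import math

# Toy corpus, made up for illustration: three pre-tokenized "documents".
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    # Term frequency: occurrences of the term / total words in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term).
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", corpus[0], corpus))  # ~0.135: common word, lower weight
print(tfidf("mat", corpus[0], corpus))  # ~0.183: rarer word, higher weight
```

For example, take these three short documents: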
Doc 1: “I think that the purple sweater is the best choice for the event”
Doc 2: “She thought that the pink jeans were the best for the event.”
Doc 3: “I think that the best choice for the event is the red dress”
The word “the” occurs a lot and has a high frequency count.
But words like purple, sweater, and jeans provide much more information about the person’s clothing choices, so TFIDF weights them higher while pushing “the” down. That’s the magic of TFIDF.
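If you want to see this in code, here’s a minimal sketch using scikit-learn’s TfidfVectorizer on the three documents above. Note that scikit-learn smooths the IDF and normalizes each document vector, so the exact numbers differ from the plain formula, but the ranking tells the same story:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The three documents from the example above.
docs = [
    "I think that the purple sweater is the best choice for the event",
    "She thought that the pink jeans were the best for the event.",
    "I think that the best choice for the event is the red dress",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # one weighted row per document

# Inspect the learned IDF values: words that appear in every document
# ("the", "best", "for", "event") get the minimum IDF, while words unique
# to a single document ("purple", "jeans", "dress") get the maximum.
for term, idf in sorted(zip(vectorizer.get_feature_names_out(), vectorizer.idf_),
                        key=lambda pair: pair[1]):
    print(f"{term:10s} idf = {idf:.3f}")
```

Running this shows “the”, “best”, “for”, and “event” sharing the lowest IDF because they appear in every document, while one-off words like “purple” and “jeans” get the highest.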
Do you have any favorite resources on this topic?




