The TF*IDF Algorithm Explained

Although the name suggests a complicated algorithm, TF*IDF is much simpler than the mathematical term used to describe it. Memorizing the formula matters less than understanding what it measures and how it works. Below we explain it in a simple, straightforward way.

Google has long used TF*IDF as a ranking factor for your content, and it seems to focus more on term frequency than on simply counting keywords. Search ranking as a whole is a series of complicated algorithms and complex calculations, but as we said before, what matters most is how this algorithm works, and it is simple once correctly understood!

The aim of this article is to guide content writers and SEO experts through the often unfamiliar topic of TF*IDF. It is a powerful yet undervalued algorithm that search engines use to understand content better.

For example, if you search for the term ‘Coke’ on Google, this is how Google can figure out whether a page titled ‘COKE’ is about:

a) Coca-Cola.
b) Cocaine.
c) A solid, carbon-rich residue derived from distillation of crude oil.
d) Coke County in Texas

Content writers can use tools that reverse engineer this algorithm and help them optimize the content of their website to be better for users & search engines.

The TF*IDF algorithm is used to weigh a keyword in your content and assign importance to that keyword based on the number of times it appears in the document. It also checks how common the keyword is across the whole collection of documents (here, the web), which is referred to as the “corpus”.

In short, the TF*IDF algorithm is a great SEO aid for hunting keywords that, once checked against search data, combine higher search volume with comparatively lower competition.

For a term t in a document d, the weight $W_{t,d}$ of term t in document d is given by:

$$W_{t,d} = TF_{t,d} \cdot \log\left(\frac{N}{DF_t}\right)$$

Where:

  • $TF_{t,d}$ is the number of occurrences of t in document d.
  • $DF_t$ is the number of documents containing the term t.
  • $N$ is the total number of documents in the corpus.
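The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tf_idf_weight` and the three-document corpus are made up for this example, and the logarithm is taken in base 10 because the formula itself does not name a base.

```python
import math

def tf_idf_weight(term, document, corpus):
    """Weight of `term` in `document`: raw count times log10(N / DF)."""
    tf = document.count(term)                      # TF: occurrences of the term in the document
    df = sum(1 for doc in corpus if term in doc)   # DF: documents containing the term
    if df == 0:
        return 0.0                                 # term never appears in the corpus
    return tf * math.log10(len(corpus) / df)       # N is the number of documents

# Hypothetical corpus, already split into lowercase words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]

w_cat = tf_idf_weight("cat", corpus[0], corpus)    # 'cat' is in 2 of 3 documents
w_mat = tf_idf_weight("mat", corpus[0], corpus)    # 'mat' is in 1 of 3 documents
print(w_cat, w_mat)
```

Both words appear once in the first document, but ‘mat’ gets the higher weight because fewer documents in the corpus contain it.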

What is TF*IDF?

To make an article stand out, it is important to use keywords that help it appear as a top search result. One way of measuring how significant or unique the terms in an article are is the TF*IDF algorithm, also called TF-IDF or TF.IDF.

TF*IDF is an information retrieval technique that weighs a term’s frequency (TF) against its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF*IDF weight of that term. In simple language, the higher the TF*IDF score (weight), the rarer the term is across the corpus, and vice versa.

TF*IDF definition

TF (term frequency) of a word is how often the word appears in a document. The raw count is often divided by the total number of words in the document so that long and short documents can be compared, as in the example below. When you know it, you’re able to see if you’re using a term too much or too little.

For example, if the term ‘cat’ appears 12 times in a document containing 100 words, the TF for the word ‘cat’ is

$TF_{cat} = 12/100 = 0.12$

IDF (Inverse document frequency) of a word is the measure of how significant that term is throughout the web, also referred to as the “corpus”.

For example, say the corpus (i.e. the web) contains 10 million documents, and the term ‘cat’ appears in 0.3 million of them. The IDF is then the logarithm (base 10 here) of the total number of documents divided by the number of documents containing the term ‘cat’.

$IDF_{cat} = \log_{10}(10{,}000{,}000 / 300{,}000) \approx 1.52$

$\therefore W_{cat} = (TF \cdot IDF)_{cat} = 0.12 \times 1.52 \approx 0.182$
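The same arithmetic can be checked with a short Python snippet. It assumes a base-10 logarithm, which is the base that reproduces the 1.52 figure; the variable names are only for illustration.

```python
import math

tf_cat = 12 / 100                             # 'cat' appears 12 times in 100 words
idf_cat = math.log10(10_000_000 / 300_000)    # total documents / documents with 'cat'
weight_cat = tf_cat * idf_cat

print(f"TF  = {tf_cat:.2f}")
print(f"IDF = {idf_cat:.2f}")
print(f"W   = {weight_cat:.4f}")
```

Without rounding the IDF first, the weight comes out at roughly 0.1827. The 0.182 figure in the text results from rounding the IDF to 1.52 before multiplying.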

How you can benefit from using TF*IDF

Gather words. Write your content. Run a TF*IDF report for your words and get their weights. The higher the numerical weight value, the rarer the term. The smaller the weight, the more common the term. Compare all the terms with high TF*IDF weights with respect to their search volumes on the web. Select those with higher search volumes and lower competition. Work smart.
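To make the “run a report and get the weights” step concrete, here is a minimal sketch of what such a report could look like. It is not how any particular SEO tool works: `tf_idf_report` is a hypothetical helper, the corpus is three made-up sentences, and TF is normalized by document length as in the ‘cat’ example.

```python
import math
from collections import Counter

def tf_idf_report(document, corpus):
    """Return (term, weight) pairs for `document`, highest weight first."""
    counts = Counter(document)
    n_docs = len(corpus)
    report = []
    for term, count in counts.items():
        tf = count / len(document)                     # share of the document's words
        df = sum(1 for doc in corpus if term in doc)   # documents containing the term
        idf = math.log10(n_docs / df) if df else 0.0
        report.append((term, tf * idf))
    return sorted(report, key=lambda pair: pair[1], reverse=True)

# Hypothetical corpus, already split into lowercase words.
corpus = [
    "tf idf weighs each term in a document".split(),
    "a search engine ranks each document".split(),
    "a cat sat on a mat".split(),
]

report = tf_idf_report(corpus[0], corpus)
for term, weight in report:
    print(f"{term:10s} {weight:.4f}")
```

Terms found only in the first document land at the top of the report, while ‘a’, which occurs in every document, ends up at the bottom with a weight of zero.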

A good rule of thumb is that the more your content “makes sense” to the user, the more weight the search engine assigns to it. With words that have a high TF*IDF weight in your content, your content is more likely to be among the top search results, so you can:

  • stop worrying about using the stop-words,
  • successfully hunt words with higher search volumes and lower competition,
  • be sure to have words that make your content unique and relevant to the user, etc.
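Why stop words take care of themselves can be seen from the IDF part alone. The sketch below uses a hypothetical `idf` helper and a made-up corpus, again with a base-10 logarithm: a word that occurs in every document gets an IDF of log(1) = 0, so its TF*IDF weight is zero no matter how often it is used.

```python
import math

def idf(term, corpus):
    """log10(N / DF); 0.0 when the term is absent from the corpus."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

# Hypothetical corpus: each document is reduced to its set of distinct words.
corpus = [
    set("the cat sat on the mat".split()),
    set("the dog ate the bone".split()),
    set("the market fell on monday".split()),
]

idf_the = idf("the", corpus)     # appears in all 3 documents
idf_on = idf("on", corpus)       # appears in 2 of 3 documents
idf_bone = idf("bone", corpus)   # appears in 1 of 3 documents
print(idf_the, idf_on, idf_bone)
```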

Let’s get you up and rolling with TF*IDF optimization

For the first step, you need to sign up for OnPage.org (don’t worry, it is 100% free).

OnPage.org lets you optimize your content using TF*IDF for free! With a very simple user interface, it is one of the best options for you or your content writer. Your content writer can create their own account and start work within minutes.

1. Register your free account at OnPage.org by following this link.

You will be asked to confirm your email address and enter your name, etc.

After that you are good to go.

2. Go to TF*IDF.

3. Type in your keyword and pick a country where you want to rank and click “Create TF*IDF report”.

4. After a few seconds you will be presented with the results view.

As you can see, this is a lot of raw data that would be hard to use. Fortunately, we can do cool stuff to make it easier to understand and use.

1. This is a “Single Words” view (as seen above) – it is useful sometimes, but for some queries, the data needs a lot of filtering.

2. “Two-Word combinations” view. My personal favorite, as it is much easier to catch the general meaning this way. As you can see on the screenshot above, the two-word combinations are spot on and reflect the most important phrases about our article topic.
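For readers curious what a two-word view is built from, adjacent word pairs can be extracted as in the sketch below. This is an assumption about the general technique, not a description of how OnPage.org computes its view; the helper name and sample sentence are invented.

```python
from collections import Counter

def two_word_combinations(words):
    """Return every pair of adjacent words as a single phrase."""
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

# Hypothetical sample text.
text = ("the vector space model uses term weighting "
        "and the vector space model ranks documents")

phrases = two_word_combinations(text.split())
counts = Counter(phrases)

for phrase, count in counts.most_common(3):
    print(count, phrase)
```

The resulting phrases can then be weighted with TF*IDF in the same way as single words.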

3. Compare with URL option – this is where all the magic and cool stuff happens. Let’s check it out.

Before writing this part of the article, I had already published a few headlines on my blog. So let’s optimize this article by running a TF*IDF comparison report.

TF*IDF content optimization

Comparing our URL with TF*IDF results

And here are the results:

Again, by looking at the screenshot above, we don’t really see any actionable data. Let’s change that. We have a few options (as presented below with number tags).

Using sliders (1 and 2) we can filter the results to see only the most important and relevant terms. These two sliders can be confusing at first, but they become quite easy to use once you understand what each one does.

  1. Proof Keyword Filter – with this slider you can filter out all the less relevant keywords and depending on your content detail level and length, keep only the most important ones. In most cases (when you are writing a blog post, or short article), you probably will be OK with 10 – 20 of the most important terms as the other ones can be too detailed or even off-topic.
  2. Zoom Tool – this slider doesn’t actually change our result. It is only used to zoom/stretch the results a little bit so they are easier to analyze.

OK, now that you know how to filter the data, let’s finally start optimizing our content.

Comparing and optimizing our content

To start with content optimization, click “Detailed Results”. This view will show you the current level of your content optimization according to TF*IDF.

Now you are presented with a detailed comparison of search results vs. your landing page or content.

On the screenshot above, I marked all the keywords not present within my content (idf weighting, vector space, space model, idf term, term weighting, etc.) that are closely related to the topic. Adding those terms to my post should improve the topic relevancy of my article and help my page rank better.
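The idea behind this comparison, finding related terms that your text does not use yet, can be sketched as a simple check. The `missing_terms` helper and the sample sentence are hypothetical, and ‘term frequency’ was added to the list only to show a term that is already covered. The match is a naive substring search, so treat it as an illustration rather than a real analysis.

```python
def missing_terms(reference_terms, content):
    """Reference terms (words or phrases) that never occur in `content`."""
    text = " ".join(content.lower().split())   # lowercase and collapse whitespace
    return [term for term in reference_terms if term.lower() not in text]

# Terms taken from the comparison above, plus one already-covered term.
reference_terms = [
    "idf weighting",
    "vector space",
    "space model",
    "term weighting",
    "term frequency",
]

# Hypothetical draft content.
my_content = "TF*IDF multiplies term frequency by inverse document frequency."

gaps = missing_terms(reference_terms, my_content)
print(gaps)
```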

How can you edit your content now to improve your TF*IDF measured relevancy? The easiest way to do that is to use Text Assistant.

Once you are in the “Text Assistant” tool, all you have to do is paste your content and click “Save & analyse text”.

Now it all gets extremely easy. All you have to do is edit your content until you are happy with it.

Good luck and happy optimizing with TF*IDF!

 

Published: 04 December 2015
Author: Bartosz Góralewicz
