The TF*IDF Algorithm Explained

Google has been using TF*IDF (or TF-IDF, TFIDF, TF.IDF, Artist formerly known as Prince) as a ranking factor for your content for a long time, as the search engine seems to focus more on term frequency in context than on raw keyword counts. While the visual complexity of the formula might turn a lot of people off, it is important to recognize that understanding the math behind TF*IDF is not as significant as knowing how it works in practice.

TF*IDF is used by search engines to better understand content which is undervalued. For example, if you search for the term “Coke” on Google, this is how Google can figure out whether a page titled “COKE” is about:

a) Coca-Cola.
b) Cocaine.
c) A solid, carbon-rich residue derived from the distillation of crude oil.
d) A county in Texas.

The aim of this article is to guide content writers and SEO experts through the unfamiliar topic of TF*IDF. By better understanding how Google utilizes this algorithm, content writers can reverse engineer TF*IDF and thus optimize the content of a website to be better for users and search engines. SEOs, in turn, can use it as a tool for hunting keywords with a higher search volume and comparatively lower competition.

What is TF*IDF?

TF*IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF*IDF weight of that term.

Put simply, the higher the TF*IDF score (weight), the rarer the term is across the corpus, and vice versa.

The TF*IDF algorithm is used to weigh a keyword in any content and assign importance to that keyword based on the number of times it appears in the document. More importantly, it checks how relevant the keyword is throughout the web, which is referred to as the corpus.

For a term t in a document d, the weight $W_{t,d}$ of term t in document d is given by:

$W_{t,d} = TF_{t,d} \cdot \log(N / DF_t)$

Where:

  • $TF_{t,d}$ is the number of occurrences of t in document d.
  • $DF_t$ is the number of documents containing the term t.
  • N is the total number of documents in the corpus.
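To make the formula less abstract, here is a minimal Python sketch of it. The toy corpus and the function name `tf_idf_weight` are made up for illustration, and the base-10 logarithm is an assumption (the formula itself does not fix a base).

```python
import math

def tf_idf_weight(term, document, corpus):
    """Weight of `term` in `document`: raw count times log10(N / DF)."""
    tf = document.count(term)                     # occurrences of the term in this document
    df = sum(1 for doc in corpus if term in doc)  # number of documents containing the term
    if df == 0:
        return 0.0                                # term appears nowhere, so it carries no weight
    return tf * math.log10(len(corpus) / df)

# Hypothetical toy corpus: each document is a list of lowercase words.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "stock", "market", "fell"],
]

w_the = tf_idf_weight("the", corpus[0], corpus)  # "the" is in every document
w_mat = tf_idf_weight("mat", corpus[0], corpus)  # "mat" is in only one document
print(w_the, w_mat)
```

Because “the” appears in all three documents, log(N/DF) is log(1) = 0 and its weight vanishes, no matter how often it is used. “mat” appears in a single document, so it gets a positive weight.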

All right. Don’t panic if you feel a headache coming on.

Let’s define this more concretely.

TF*IDF Defined

The TF (term frequency) of a word is how often it appears in a document, expressed here as the number of occurrences divided by the total number of words in the document. When you know it, you’re able to see if you’re using a term too much or too little.

For example, when a 100-word document contains the term “cat” 12 times, the TF for the word “cat” is

$TF_{cat} = 12/100 = 0.12$

The IDF (inverse document frequency) of a word is the measure of how significant that term is in the whole corpus.

For example, say the corpus (i.e. the web) contains 10,000,000 documents. Let’s assume 0.3 million of them contain the term “cat”. The IDF, i.e. log (N/DF), is then the base-10 logarithm of the total number of documents (10,000,000) divided by the number of documents containing the term “cat” (300,000).

$IDF_{cat} = \log_{10}(10{,}000{,}000 / 300{,}000) \approx 1.52$

$\therefore W_{cat} = (TF \cdot IDF)_{cat} = 0.12 \times 1.52 \approx 0.182$
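If you want to check the arithmetic yourself, the worked example can be reproduced in a few lines of Python. This is only a sketch of the calculation above; the variable names are made up, and the base-10 logarithm is assumed because it matches the 1.52 figure.

```python
import math

total_words = 100         # words in the document
cat_count = 12            # occurrences of "cat" in the document
n_documents = 10_000_000  # documents in the corpus
df_cat = 300_000          # documents containing "cat"

tf_cat = cat_count / total_words             # normalized term frequency
idf_cat = math.log10(n_documents / df_cat)   # inverse document frequency, base 10

weight_rounded = tf_cat * round(idf_cat, 2)  # IDF rounded to 1.52 first, as in the text
weight_exact = tf_cat * idf_cat              # no intermediate rounding

print(f"TF = {tf_cat:.2f}, IDF = {idf_cat:.2f}")
print(f"weight (rounded IDF) = {weight_rounded:.3f}")
print(f"weight (exact IDF)   = {weight_exact:.3f}")
```

The 0.182 in the text comes from rounding the IDF to 1.52 before multiplying. Without that intermediate rounding the weight comes out at about 0.183, a difference that does not matter for comparing terms.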

Now that you have this figured out (right?), let’s look at how this can benefit you.

How you can benefit from using TF*IDF

Gather words. Write your content. Run a TF*IDF report for your words and get their weights. The higher the numerical weight value, the rarer the term. The smaller the weight, the more common the term. Compare all the terms with high TF*IDF weights with respect to their search volumes on the web. Select those with higher search volumes and lower competition. Work smart.
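The “run a TF*IDF report” step can be sketched in Python as follows. This is a simplified illustration and not how any particular tool computes its report: the function name `tf_idf_report` and the three mini documents are hypothetical, words are split on whitespace only, and a base-10 logarithm is assumed.

```python
import math
from collections import Counter

def tf_idf_report(corpus, index, top_k=5):
    """Return up to top_k (term, weight) pairs for corpus[index], highest weight first."""
    document = corpus[index]
    n = len(corpus)
    report = []
    for term, count in Counter(document).items():
        tf = count / len(document)                    # normalized term frequency
        df = sum(1 for doc in corpus if term in doc)  # at least 1, since the document is in the corpus
        report.append((term, tf * math.log10(n / df)))
    # Sort by weight (descending), then alphabetically so ties are stable.
    report.sort(key=lambda pair: (-pair[1], pair[0]))
    return report[:top_k]

# Hypothetical mini corpus; a real report would use far more documents.
corpus = [
    "the espresso machine heats the water and the espresso pours".split(),
    "the kettle heats the water".split(),
    "the toaster browns the bread".split(),
]

for term, weight in tf_idf_report(corpus, 0, top_k=10):
    print(f"{term:10s} {weight:.4f}")
```

In this toy corpus “espresso” ends up at the top of the report, while “the”, which appears in every document, lands at the bottom with a weight of zero.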

A good rule of thumb is, the more your content “makes sense” to the user, the more weight it is assigned by the search engine. With words that have a high TF*IDF weight in your content, your content stands a better chance of appearing among the top search results, so you can:

  • stop worrying about using the stop-words,
  • successfully hunt words with higher search volumes and lower competition,
  • be sure to have words that make your content unique and relevant to the user, etc.

Let’s get you up and rolling with TF*IDF optimization

1. Register your free account at Ryte by following this link.

First you will need to sign up for Ryte.com (again: 100% free). Ryte lets you optimize your content using TF*IDF for free! With a very simple user interface, it is one of the best options for you or your content writer. Your content writer can create their own account and start work within minutes.

Click Register for free in the upper right corner to open the registration page.

After a couple of pages where they ask for your website and company information, you will be asked to confirm your email address and then you’re good to go.

2. Click on Content Success on the left side.

Once the Content Success page opens, you simply need to choose your keyword and the language and country you are interested in, and click Start Content Analysis.

After a few seconds you will be presented with the results. In this case, we put in the keyword “tf*idf” for English in the United States:

 

As you can see, this is a lot of raw data that would be hard to use. Fortunately, we can do cool stuff to make it easier to understand and utilize.

Here are your options:

1. Single Word Report (as seen above) is sometimes useful, but for some queries, the data needs a lot of filtering.

2. Two-Word combinations (as seen below) is my personal favorite as it is much easier to catch the general meaning this way. As you can see on the screenshot, the two-word combinations are spot on and reflect the most important phrases about our article topic.

 

3. Add URL for comparison is where all the magic happens. Let’s check it out in more detail.


Comparing our URL with TF*IDF results

We are going to compare the results with this article (apologies for getting all meta with this thing!).

And here are the results:

Again, by looking at the screenshot above, we don’t really see any actionable data. Let’s change that. We have a few options (as presented below with number tags).

Using sliders on the upper part of the bar chart, we can filter out the results to only see the most important and relevant terms. Those two sliders can be confusing at first, but once you understand them, they become quite easy to use.

  1. Proof Keyword Filter is the slider button on the left. With this slider you can filter out all the less relevant keywords and, depending on your content detail level and length, keep only the most important ones. In most cases (when you are writing a blog post or short article), you probably will be OK with 10-20 of the most important terms as the other ones can be too detailed or even off-topic.
  2. Zoom Tool is the slider button on the right. This slider doesn’t actually change our results. It is only used to zoom in and out of the results so they are easier to analyze.

Now that you know how to filter the data, let’s finally start optimizing our content.

Comparing and optimizing our content

To start with content optimization, click Detailed Content Report located under Current content report. This view shows the current level of your content optimization according to TF*IDF.

Now you are presented with a detailed comparison of the search results vs. your landing page or content.

Using the information above, you can determine which keywords closely related to the topic are not present in your content. Adding those keywords to your content can improve its topical relevance and help your page rank better.
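The idea behind this comparison can be approximated outside the tool with a rough Python sketch. It is not Ryte’s actual method: the function name `missing_terms`, the sample pages, and the choice to average TF*IDF weights over the competing pages (with a base-10 logarithm) are all assumptions made for illustration.

```python
import math
from collections import Counter

def missing_terms(my_page, competitor_pages, background_pages, top_k=5):
    """Terms absent from my_page, ranked by mean TF*IDF across competitor_pages."""
    corpus = competitor_pages + background_pages
    n = len(corpus)
    # Document frequency: in how many corpus pages does each term occur?
    df = Counter(term for page in corpus for term in set(page))
    mine = set(my_page)
    totals = Counter()
    for page in competitor_pages:
        for term, count in Counter(page).items():
            if term not in mine:
                totals[term] += (count / len(page)) * math.log10(n / df[term])
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, total / len(competitor_pages)) for term, total in ranked[:top_k]]

# Hypothetical data: your page, two competing pages, two unrelated pages.
my_page = "tf idf weighs a term in a document".split()
competitor_pages = [
    "tf idf weighs term frequency against inverse document frequency".split(),
    "inverse document frequency lowers the weight of common terms".split(),
]
background_pages = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]

for term, weight in missing_terms(my_page, competitor_pages, background_pages, top_k=3):
    print(f"{term:10s} {weight:.4f}")
```

With this sample data, “frequency” comes out as the most important term missing from your page, which is the kind of gap the detailed report is meant to reveal.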

How can you edit your content now to improve your TF*IDF measured relevancy? The easiest way to do that is to use Content Optimizer, located under Current Content Report.

Once you click Content Optimizer, all you have to do is paste your content and click Analyse.

Now it all gets extremely easy. All you have to do is edit your content until you are happy with it.

And that should be more than enough to get you started. See, that wasn’t so bad, was it?

Good luck and happy optimizing with TF*IDF!

Published 06 March 2018 by Bartosz Góralewicz
