
Keyword Clustering by Search Intent using Python

by Andreas Voniatis, Founder and Fractional SEO Consultant.

Our survey revealed that data-driven SEO consultants see keyword clustering as a high priority. As featured on the Duda Livestream, this is the code for Keyword Clustering by Search Intent using Python. If you’d like more scripts like these, read our book or book some Python for SEO training for your company.

The main steps to cluster your keywords are to:

  1. Read in your SERPs Data
  2. Filter out noisy SERP Results like Google
  3. Compress the SERPs into a single value
  4. Align the SERPs ready for comparison
  5. Compare SERPs for similarity
  6. Cluster By Similarity for Search Intent


Read in your SERPs Data

The keyword clustering by search intent technique works on the premise that keywords with highly similar search results share the same search intent.

Therefore we’ll need the SERPs of the keywords you want to cluster.

import pandas as pd
import py_stringmatching as sm

# The export should contain at least 'keyword', 'rank' and 'url' columns
serps_input = pd.read_csv('data/[your SERPs extract from your Rank Tracking Software].csv')

# The date column isn't needed for clustering
serps_input.drop('date', inplace=True, axis=1)

# Remove Google's own URLs (maps, image packs and other universal results)
serps_input = serps_input.loc[~serps_input['url'].str.contains('google.com', regex=False)]
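If you don’t have an export to hand yet, the shape of data the code above expects can be sketched with a toy dataframe (the column names and URLs here are assumptions; match them to whatever your rank tracker exports):

```python
import pandas as pd

# Hypothetical two-keyword SERP extract: one row per ranking URL
toy_serps = pd.DataFrame({
    'keyword': ['seo tools'] * 3 + ['best seo tools'] * 3,
    'rank':    [1, 2, 3, 1, 2, 3],
    'url': [
        'https://sitea.com/tools', 'https://siteb.com/seo', 'https://google.com/maps',
        'https://sitea.com/tools', 'https://siteb.com/seo', 'https://sitec.com/list',
    ],
})

# Same cleaning step as above: remove Google's own URLs
toy_serps = toy_serps.loc[~toy_serps['url'].str.contains('google.com', regex=False)]
```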


Filter out noisy SERP Results like Google

To make the comparison process more effective, we require less noisy data; Google’s own URLs were removed when reading in the data, and now we keep only the top-ranked organic results for each keyword. The process is:

  1. Group the SERP dataframe by keyword
  2. Set k_urls to 12 for the top 12 results
  3. Define the function to take the top 12
  4. Apply the function to filter
  5. Combine the filtered keyword SERP groups

serps_grpby_keyword = serps_input.groupby("keyword")

k_urls = 12

def filter_k_urls(group_df):
    # Keep rows that have a URL and a rank within the top k
    filtered_df = group_df.loc[group_df['url'].notnull()]
    filtered_df = filtered_df.loc[filtered_df['rank'] <= k_urls]
    # Drop the keyword column; it comes back as the group index
    filtered_df = filtered_df.drop('keyword', axis=1)
    return filtered_df

filtered_serps = serps_grpby_keyword.apply(filter_k_urls)

# Restore 'keyword' as a column and drop the surplus inner index
filtered_serps_df = filtered_serps.reset_index()
filtered_serps_df = filtered_serps_df.drop('level_1', axis=1)
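On a small hypothetical dataframe, the grouping-and-filtering steps above behave as follows (k is set to 2 here purely to keep the example short; the article uses 12):

```python
import pandas as pd

k_urls = 2  # top-2 for illustration only

def filter_k_urls(group_df):
    # Keep rows that have a URL and a rank inside the top k
    filtered_df = group_df.loc[group_df['url'].notnull()]
    filtered_df = filtered_df.loc[filtered_df['rank'] <= k_urls]
    return filtered_df.drop('keyword', axis=1)

toy = pd.DataFrame({
    'keyword': ['a', 'a', 'a', 'b', 'b'],
    'rank':    [1, 2, 3, 1, 2],
    'url':     ['u1', 'u2', 'u3', 'u4', None],
})

filtered = toy.groupby('keyword').apply(filter_k_urls).reset_index()
filtered = filtered.drop('level_1', axis=1)
# 'a' keeps its top 2 URLs; 'b' keeps only the row that has a URL
```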


Compress the SERPs into a single value

Once we have the SERP data cleaned of Google’s universal results, we want a dataframe with one row per keyword and that keyword’s SERP compressed into a single string column. This will make it easier to compare SERPs. The process is:

  1. Group the dataframe by keyword again
  2. Define a function ‘string_serps’ to join the ranking URLs with a space in between, creating a single string of tokens (URLs)
  3. Apply the string_serps function
  4. Concatenate with the initial dataframe and clean
  5. Drop duplicate rows, so that we don’t compare a SERP against itself and the later comparisons run faster

filtserps_grpby_keyword = filtered_serps_df.groupby("keyword")

def string_serps(df):
    df['serp_string'] = ' '.join(df['url'])
    return df

strung_serps = filtserps_grpby_keyword.apply(string_serps)

strung_serps = strung_serps[['keyword', 'serp_string']]

strung_serps = strung_serps.drop_duplicates()
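As a design note, the same compression can be done in one pass with groupby().agg, which produces one row per keyword directly and needs no duplicate-dropping; a minimal sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    'keyword': ['a', 'a', 'b'],
    'url':     ['u1', 'u2', 'u3'],
})

# One row per keyword, its SERP joined into a single space-separated string
strung = (toy.groupby('keyword')['url']
             .agg(' '.join)
             .reset_index()
             .rename(columns={'url': 'serp_string'}))
```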


Align the SERPs ready for comparison

With each keyword’s SERPs reduced to a single row, we can start lining each keyword SERP up against every other, ready for similarity comparison. The process is to define and run a function taking two inputs, k (the SERP keyword) and the dataframe, which will:

    1. Filter the dataframe for the keyword
    2. Rename the columns
    3. Create a new dataframe containing all other keywords
    4. Put all the combinations of the keyword and other keywords together side by side
    5. Combine into a single dataframe

We then iterate through the keywords, using this function to build the comparison dataframe ‘matched_serps’.

def serps_align(k, df):
    prime_df = df.loc[df.keyword == k]
    prime_df = prime_df.rename(columns = {"serp_string" : "serp_string_a", 'keyword': 'keyword_a'})

    comp_df = df.loc[df.keyword != k].reset_index(drop=True)
    prime_df = prime_df.loc[prime_df.index.repeat(len(comp_df.index))].reset_index(drop=True)

    prime_df = pd.concat([prime_df, comp_df], axis=1)
    prime_df = prime_df.rename(columns = {"serp_string" : "serp_string_b", 'keyword': 'keyword_b', "serp_string_a" : "serp_string", 'keyword_a': 'keyword'})
    return prime_df

# serps_input lists each keyword once per ranking URL, so take the
# deduplicated keyword list from strung_serps instead
queries = strung_serps['keyword'].to_list()

matched_serps = pd.concat([serps_align(q, strung_serps) for q in queries])

matched_serps['serp_string'] = matched_serps['serp_string'].astype(str)
matched_serps['serp_string_b'] = matched_serps['serp_string_b'].astype(str)
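For n keywords, the loop above yields n × (n − 1) comparison rows. The same alignment can also be expressed as a self cross-join, shown here on a hypothetical strung_serps (requires pandas ≥ 1.2 for how='cross'):

```python
import pandas as pd

strung = pd.DataFrame({
    'keyword': ['a', 'b', 'c'],
    'serp_string': ['u1 u2', 'u1 u3', 'u4 u5'],
})

# Cross-join every keyword with every other, then drop the self-pairs
pairs = strung.merge(
    strung.rename(columns={'keyword': 'keyword_b', 'serp_string': 'serp_string_b'}),
    how='cross')
pairs = pairs.loc[pairs['keyword'] != pairs['keyword_b']].reset_index(drop=True)
# 3 keywords -> 3 * 2 = 6 comparison rows
```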


Compare SERPs for similarity

With the dataframe set, we can now score each pair of SERPs for similarity, where we will:

  1. Define the function to compare only the top k_urls results
  2. Use a whitespace tokenizer to split each SERP string back into its URLs
  3. Keep only the first k URLs
  4. Get the positions of matching URLs
  5. Intersect the URLs and their positions
  6. Derive the similarity from the intersection
  7. Run the function to create a new column ‘serp_simi’, as in search-intent similarity


# Define the function to only compare the top k_urls results 
def serps_similarity(serps_str1, serps_str2, k=k_urls):
    denom = k + 1
    # Normalising constant: the maximum possible distance over k positions
    norm = sum([2 * (1/i - 1.0/denom) for i in range(1, denom)])
    ws_tok = sm.WhitespaceTokenizer()

    # Keep only the first k URLs of each SERP
    serps_1 = ws_tok.tokenize(serps_str1)[:k]
    serps_2 = ws_tok.tokenize(serps_str2)[:k]

    # For each URL in a, its (1-based) position in b, or None if absent
    match = lambda a, b: [b.index(x) + 1 if x in b else None for x in a]

    pos_intersections = [(i+1, j) for i, j in enumerate(match(serps_1, serps_2)) if j is not None]
    pos_in1_not_in2 = [i+1 for i, j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i+1 for i, j in enumerate(match(serps_2, serps_1)) if j is None]

    # Rank-weighted distance: shared URLs contribute by how far apart they rank,
    # unshared URLs contribute by how prominently they rank
    a_sum = sum([abs(1/i - 1/j) for i, j in pos_intersections])
    b_sum = sum([abs(1/i - 1/denom) for i in pos_in1_not_in2])
    c_sum = sum([abs(1/i - 1/denom) for i in pos_in2_not_in1])

    intent_prime = a_sum + b_sum + c_sum
    # Convert distance to similarity: 1 = identical SERPs, 0 = no overlap
    intent_dist = 1 - (intent_prime / norm)
    return intent_dist

matched_serps['serp_simi'] = matched_serps.apply(lambda x: serps_similarity(x['serp_string'], x['serp_string_b']), axis=1)
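To sanity-check the metric, here is the same function with str.split standing in for py_stringmatching’s WhitespaceTokenizer (equivalent for space-separated URL strings). Identical SERPs score 1.0 and SERPs with no shared URLs score roughly 0.0:

```python
def serps_similarity(serps_str1, serps_str2, k=2):
    denom = k + 1
    norm = sum(2 * (1/i - 1.0/denom) for i in range(1, denom))

    # str.split mirrors the whitespace tokenizer for space-separated URLs
    serps_1 = serps_str1.split()[:k]
    serps_2 = serps_str2.split()[:k]

    match = lambda a, b: [b.index(x) + 1 if x in b else None for x in a]

    pos_intersections = [(i+1, j) for i, j in enumerate(match(serps_1, serps_2)) if j is not None]
    pos_in1_not_in2 = [i+1 for i, j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i+1 for i, j in enumerate(match(serps_2, serps_1)) if j is None]

    a_sum = sum(abs(1/i - 1/j) for i, j in pos_intersections)
    b_sum = sum(abs(1/i - 1/denom) for i in pos_in1_not_in2)
    c_sum = sum(abs(1/i - 1/denom) for i in pos_in2_not_in1)

    return 1 - (a_sum + b_sum + c_sum) / norm

print(serps_similarity('u1 u2', 'u1 u2'))  # identical -> 1.0
print(serps_similarity('u1 u2', 'u3 u4'))  # disjoint -> ~0.0
```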


Cluster By Similarity for Search Intent

With the SERP (dis)similarity determined between each keyword, we can cluster keywords by search intent.

This works by setting a similarity threshold and grouping together keywords whose SERPs meet it, i.e. share the same search intent.

We have gone for 70% similarity here, but this can vary according to the search intent, the industry and other factors.

# serp_simi is a similarity score (1 = identical SERPs), so 0.7 keeps pairs with at least 70% similarity
simi_lim = 0.7

queries_in_df = list(set(matched_serps['keyword'].to_list()))
topic_groups = {}

def dict_key(dicto, keyo):
    # Is keyo a group head?
    return keyo in dicto

def dict_values(dicto, vala):
    # Is vala already a member of any group?
    return any(vala in val for val in dicto.values())

def what_key(dicto, vala):
    # Return the head of the group that contains vala
    for k, v in dicto.items():
        if vala in v:
            return k

def find_topics(si, query_a, query_b, simi_lim = 0.7):
    # Pairs below the similarity threshold don't share an intent group
    if si < simi_lim:
        return
    a_seen = dict_key(topic_groups, query_a) or dict_values(topic_groups, query_a)
    b_seen = dict_key(topic_groups, query_b) or dict_values(topic_groups, query_b)
    if not a_seen and not b_seen:
        # Neither keyword grouped yet: start a new group headed by query_a
        topic_groups[query_a] = [query_a, query_b]
    elif a_seen and not b_seen:
        # query_a is already grouped: add query_b to its group
        topic_groups[what_key(topic_groups, query_a)].append(query_b)
    elif b_seen and not a_seen:
        # query_b is already grouped: add query_a to its group
        topic_groups[what_key(topic_groups, query_b)].append(query_a)

[find_topics(x, y, z, simi_lim) for x, y, z in zip(matched_serps['serp_simi'], matched_serps['keyword'], matched_serps['keyword_b'])]

# Any keyword that never crossed the threshold becomes its own group
for q in queries_in_df:
    if not dict_key(topic_groups, q) and not dict_values(topic_groups, q):
        topic_groups[q] = [q]

all_topic_groups = topic_groups

topic_groups_lst = []
for k, l in all_topic_groups.items():
    for v in l:
        topic_groups_lst.append([k, v])

topic_groups_dictdf = pd.DataFrame(topic_groups_lst, columns=['keyword', 'keyword_b'])
topic_groups_dictdf.drop_duplicates(inplace = True)
topic_groups_dictdf.sort_values('keyword', inplace = True)
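To see the final flattening step in isolation, here it is run on a hypothetical topic_groups dict (the dict contents are made up for illustration; each row maps a group head to one member):

```python
import pandas as pd

# Hypothetical clustering output: group head -> member keywords
topic_groups = {
    'seo tools': ['seo tools', 'best seo tools'],
    'seo agency': ['seo agency'],
}

topic_groups_lst = []
for k, l in topic_groups.items():
    for v in l:
        topic_groups_lst.append([k, v])

topic_groups_dictdf = pd.DataFrame(topic_groups_lst, columns=['keyword', 'keyword_b'])
topic_groups_dictdf.drop_duplicates(inplace=True)
topic_groups_dictdf.sort_values('keyword', inplace=True)
```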

