< back to knowledge

SEO Hypothesis Split Test Code

by Andreas Voniatis Founder. Fractional SEO Consultant.

Testing hypotheses in SEO is critical to ensuring your website doesn’t lose business in the process of making changes that aren’t certain to drive performance.

We’re assuming that the test subjects are your URLs because that is usually where SEO changes are made for testing or implementation.

We provide code to help you determine the:

  1. Zero Inflate Your Data
  2. Determine the distribution
  3. Assign to test and control
  4. Minimum sample size
  5. Evaluate the Test


Zero Inflate Your Data

Zero inflating your data serves the purpose of filling in the gaps for when your pages are missing data on certain dates, here’s how to rectify it:

    1. Expand the dataframe to include missing dates


date_lst = pd.date_range(start=gsc_df['date'].min(), end=gsc_df['date'].max(), freq='1d')
nd = len(date_lst)
nu = len(gsc_df['url'].unique())
gsc_inflated = pd.DataFrame({'url': gsc_df['url'].unique().tolist() * nd, 'date':np.repeat(datelist, nu)})
gsc_inflated = gsc_inflated.merge(gsc_aa_df, how='left')
    1. Impute zero or missing data. 


gsc_inflated[['date', 'impressions', ]] = gsc_inflated.groupby('url')[['date', 'impressions']].transform(lambda x: x.fillna(0))

Note for positions we’ve assumed the median between dates because the page may still rank but not receive impressions. 

gsc_inflated['position'] = gsc_inflated.groupby('url')['position'].transform(lambda x: x.fillna(x.median()))

However, a more prudent treatment could be to assume URLs on missing dates didn’t rank and impute a 100 rank position instead.

gsc_inflated['position'] = gsc_inflated.groupby('url')['position'].transform(lambda x: x.fillna(100))
gsc_inflated['impressions'] = gsc_inflated['impressions'].astype('int64')


Determine the distribution

The distribution may take different shapes and forms. If it’s traffic data, traffic for any given page especially on a large site is going to be poisson distributed while for rank position, positively skewed (i.e. the average is less than the median). That’s because GSC tends to report on URLs receiving impressions. The code below shows how to distribution can be determined:

group_dist = gsc_inflated.copy()

Calculate the mode (most frequent position)

mode_position = group_dist['position'].mode().iloc[0]

Calculate the count of unique positions

unique_positions_count = group_dist['position'].nunique() 
mean_position = group_dist['position'].mean() 
median_position = group_dist['position'].median() 
std_position = group_dist['position'].std() 
range_position = group_dist['position'].max() - group_dist['position'].min()

Calculate skewness based on mean and median

if mean_position > median_position:
    skewness = (3 * (mean_position - median_position)) / std_position
elif median_position > mean_position:
    skewness = (3 * (median_position - mean_position)) / std_position
    skewness = 0.0  # No skew if mean equals median

Print the results

print(f"Most Frequent Position (Mode): {mode_position}")
print(f"Number of Unique Positions: {unique_positions_count}")
print(f"Mean Position: {mean_position}")
print(f"Median Position: {median_position}")
print(f"Standard Deviation of Positions: {std_position}")
print(f"Range of Positions: {range_position}")
print(f"Skewness of the distribution: {skewness}")

Minimum sample size

The minimum sample size calculations will be highly dependent on the distribution and impact the model choice. The code to determine this is shown:

For Traffic (assuming Poisson):

def get_sigma(outcome):
    return np.count_nonzero(outcome.values == 0) / outcome.count() * 0.995

For Rank Position (assuming positive skew):

def get_sigma(outcome):
    median_value = np.median(outcome)
    count_below_median = np.count_nonzero(outcome.values < median_value)  
    adjustment_factor = count_below_median / outcome.count()
    return adjustment_factor * 0.995

sigma = get_sigma(gsc_inflated["position"])

Now we need to simulate sample distributions (think parallel universes we’re testing your SEO changes in!):

For Traffic (assuming Poisson):

def python_rzip(n, mu, sigma):
    rng = np.random.default_rng()
    P = rng.poisson(mu, n)
    return [p if random.random() > sigma else 0 for p in P] 

def simulate_sample(n, difference=0.1):
    control = python_rzip(n, mu, sigma)
    test = python_rzip(n, mu + difference, sigma)
    test = stats.ttest_ind(control, test)
    return test[1] 

def run_simulations(n, difference=0.1, n_simulations=300):
    p_values = [simulate_sample(n, difference) for i in range(n_simulations)]
    significant = sum(map(lambda x: x <= 0.15, p_values))
    return (significant / n_simulations).round(3) * 100


For Rank Position (assuming positive skew):

def python_rnormal(n, mu, sigma, skew=0):
    rng = np.random.default_rng()
    Z = rng.normal(size=n)
    alpha = skew / np.sqrt(1 + skew**2)
    xi = mu - sigma * alpha * np.sqrt(2 / np.pi)
    omega = np.sqrt(sigma**2 * (1 - 2 * alpha**2 / np.pi))
    X = xi + omega * Z
    return X 

def simulate_sample(n, difference=0.1, mu=20, sigma=10):
    control = python_rnormal(n, mu, sigma)
    test = python_rnormal(n, mu + difference, sigma)
    t_stat, p_value = stats.ttest_ind(control, test)
    return p_value


def run_simulations(n, difference=0.1, n_simulations=300, significance_level=0.05):
    p_values = np.array([simulate_sample(n, difference) for _ in range(n_simulations)])
    significant_count = np.sum(p_values <= significance_level)
    percentage_significant = (significant_count / n_simulations) * 100
    power = 1 - stats.norm.cdf(stats.norm.ppf(1 - significance_level) - np.sqrt(n) * difference / sigma)    
    return percentage_significant.round(3), power.round(3)

In either case, with the functions now defined, you can now simulate tests to determine sample size. here, we’re simulating difference sample sizes in increments of 10,000 up to 100,000:

[print(run_simulations(n=x), str(x)) for x in range(0, 110000, 10000)]

You’ll get a print out showing you the consistency and statistical power of different sample sizes:

0.0 0
27.700000000000003 10000
45.300000000000004 20000
60.3 30000
68.30000000000001 40000
75.0 50000
78.3 60000
86.3 70000
90.3 80000
92.30000000000001 90000
93.30000000000001 100000

At 80,000 data points we can be sure that we’ll hit statistical significance 90% of the time.


Assign the data between test and control

Once you’ve determined the minimum sample size you’re ready to start allocating URLs to test groups. This can be done a number of ways as shown below:


Content Type:

content_list = ['PDP', 'PLP']
df_assign['group'] = np.where(df_assign['content'].isin(content_list), 'test', 'control')


URL String Pattern:

df_assign['group'] = np.where(df_assign['url'].str.contains('[url-string-pattern]'), 'test', 'control')


Title Keywords:

keyword_list = ['keyword 1', 'keyword 2', …, 'keyword [n-1]', 'keyword [n]']
df_assign['group'] = np.where(df_assign['title'].str.contains('|'.join(keyword_list)), 'test', 'control')

Evaluate the Test

Having run the test either retrospectively or as fresh test, you’re ready to evaluate:

For reproducibility set the random factor



For Traffic:

X = np.where(test_analytics_expanded['ab_group'] == 'control', 0.0, 1.0)
X = add_constant(X)
X = np.asarray(X)
exp_model = NegativeBinomial(test_analytics_expanded['sessions'], X).fit()

For Position:

Splitting the data into two groups based on the ‘group’ column

group_test = gsc_assign[gsc_assign['group'] == 'test']['position']
group_control = gsc_assign[gsc_assign['group'] == 'control']['position']


Perform the Mann Whitney U Test

u_statistic, p_value_u = stats.mannwhitneyu(group_test, group_control)

Print the results
print(f"Mann-Whitney U Test Test Results")
print(f"MWU Statistic: {u_statistic}")
print(f"P-Value: {p_value_u}")


Additional context

n_test = len(group_test)
n_control = len(group_control)
mean_test = group_test.mean()
mean_control = group_control.mean()
std_test = group_test.std()
std_control = group_control.std()
print(f"\nAdditional Summary Statistics:")
print(f"Test Group: n={n_test}, mean={mean_test:.2f}, std={std_test:.2f}")
print(f"Control Group: n={n_control}, mean={mean_control:.2f}, std={std_control:.2f}")

If you’d like to know more about the science and more code, please consider my O’Reilly on-demand video course “Data Science SEO”.

Download Our SEO Guide