Testing hypotheses in SEO is critical to ensuring your website doesn’t lose business in the process of making changes that aren’t certain to drive performance.
We’re assuming that the test subjects are your URLs because that is usually where SEO changes are made for testing or implementation.
We provide code to help you:
- Zero-inflate your data
- Determine the distribution
- Determine the minimum sample size
- Assign URLs to test and control
- Evaluate the test
Zero Inflate Your Data
Zero-inflating your data fills in the gaps where your pages are missing data on certain dates. Here's how to do it:
- Expand the dataframe to include missing dates:

```python
import pandas as pd
import numpy as np

# Build the full date range covered by the GSC export
date_lst = pd.date_range(start=gsc_df['date'].min(), end=gsc_df['date'].max(), freq='1d')
nd = len(date_lst)
nu = len(gsc_df['url'].unique())

# Cross every URL with every date, then join the original GSC data back on
gsc_inflated = pd.DataFrame({'url': gsc_df['url'].unique().tolist() * nd,
                             'date': np.repeat(date_lst, nu)})
gsc_inflated = gsc_inflated.merge(gsc_df, how='left')
```
- Impute zero or missing data:

```python
# Missing impressions on the newly added dates are genuine zeros
gsc_inflated['impressions'] = gsc_inflated['impressions'].fillna(0)
```
Note that for positions we impute the URL's median, because the page may still rank on the missing dates but simply not receive impressions.
```python
gsc_inflated['position'] = gsc_inflated.groupby('url')['position'].transform(lambda x: x.fillna(x.median()))
```
However, a more prudent treatment could be to assume URLs on missing dates didn't rank and impute a rank position of 100 instead:
```python
gsc_inflated['position'] = gsc_inflated.groupby('url')['position'].transform(lambda x: x.fillna(100))
gsc_inflated['impressions'] = gsc_inflated['impressions'].astype('int64')
```
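As a quick sanity check on the zero inflation (a minimal sketch using the variables defined above), the inflated frame should now contain one row per URL per date with no missing impressions:

```python
# Every URL should now have a row for every date in the range
assert len(gsc_inflated) == nd * nu, "Unexpected row count after zero inflation"

# And no missing impression values should remain
assert gsc_inflated['impressions'].isna().sum() == 0
```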
Determine the distribution
The distribution may take different shapes and forms. Traffic for any given page, especially on a large site, is going to be Poisson distributed, while rank positions will be positively skewed (i.e. the mean is greater than the median). That's because GSC tends to report only on URLs receiving impressions. The code below shows how the distribution can be determined:
```python
group_dist = gsc_inflated.copy()

# Calculate the mode (most frequent position)
mode_position = group_dist['position'].mode().iloc[0]

# Calculate the count of unique positions and the summary statistics
unique_positions_count = group_dist['position'].nunique()
mean_position = group_dist['position'].mean()
median_position = group_dist['position'].median()
std_position = group_dist['position'].std()
range_position = group_dist['position'].max() - group_dist['position'].min()

# Pearson's second skewness coefficient: positive when the mean exceeds the median
skewness = (3 * (mean_position - median_position)) / std_position

# Print the results
print(f"Most Frequent Position (Mode): {mode_position}")
print(f"Number of Unique Positions: {unique_positions_count}")
print(f"Mean Position: {mean_position}")
print(f"Median Position: {median_position}")
print(f"Standard Deviation of Positions: {std_position}")
print(f"Range of Positions: {range_position}")
print(f"Skewness of the distribution: {skewness}")
```
Minimum sample size
The minimum sample size calculation depends heavily on the distribution, which in turn drives the model choice. The code to determine it is shown below:
For Traffic (assuming Poisson):
```python
def get_sigma(outcome):
    # Zero-inflation factor: the proportion of zero observations
    return np.count_nonzero(outcome.values == 0) / outcome.count() * 0.995
```
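The simulation functions below expect mu and sigma to be defined as globals. For traffic, one reasonable choice (an assumption on our part, so adjust to your own data) is to take sigma from get_sigma and mu as the mean of the non-zero impressions:

```python
# Assumption: sigma is the zero-inflation factor, mu the mean of non-zero impressions
sigma = get_sigma(gsc_inflated['impressions'])
mu = gsc_inflated.loc[gsc_inflated['impressions'] > 0, 'impressions'].mean()
```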
For Rank Position (assuming positive skew):
```python
def get_sigma(outcome):
    median_value = np.median(outcome)
    count_below_median = np.count_nonzero(outcome.values < median_value)
    adjustment_factor = count_below_median / outcome.count()
    return adjustment_factor * 0.995

sigma = get_sigma(gsc_inflated["position"])
```
Now we need to simulate sample distributions (think parallel universes we’re testing your SEO changes in!):
For Traffic (assuming Poisson):
```python
import random
from scipy import stats

def python_rzip(n, mu, sigma):
    # Zero-inflated Poisson draws: Poisson counts with a share zeroed out
    rng = np.random.default_rng()
    P = rng.poisson(mu, n)
    return [p if random.random() > sigma else 0 for p in P]

def simulate_sample(n, difference=0.1):
    control = python_rzip(n, mu, sigma)
    test = python_rzip(n, mu + difference, sigma)
    test = stats.ttest_ind(control, test)
    return test[1]

def run_simulations(n, difference=0.1, n_simulations=300):
    p_values = [simulate_sample(n, difference) for _ in range(n_simulations)]
    significant = sum(map(lambda x: x <= 0.15, p_values))
    return round(significant / n_simulations, 3) * 100
```
For Rank Position (assuming positive skew):
```python
def python_rnormal(n, mu, sigma, skew=0):
    rng = np.random.default_rng()
    Z = rng.normal(size=n)
    alpha = skew / np.sqrt(1 + skew**2)
    xi = mu - sigma * alpha * np.sqrt(2 / np.pi)
    omega = np.sqrt(sigma**2 * (1 - 2 * alpha**2 / np.pi))
    X = xi + omega * Z
    return X

def simulate_sample(n, difference=0.1, mu=20, sigma=10):
    control = python_rnormal(n, mu, sigma)
    test = python_rnormal(n, mu + difference, sigma)
    t_stat, p_value = stats.ttest_ind(control, test)
    return p_value

def run_simulations(n, difference=0.1, n_simulations=300, significance_level=0.05):
    p_values = np.array([simulate_sample(n, difference) for _ in range(n_simulations)])
    significant_count = np.sum(p_values <= significance_level)
    percentage_significant = (significant_count / n_simulations) * 100
    power = 1 - stats.norm.cdf(stats.norm.ppf(1 - significance_level) - np.sqrt(n) * difference / sigma)
    return percentage_significant.round(3), power.round(3)
```
In either case, with the functions now defined, you can simulate tests to determine the sample size. Here, we're simulating different sample sizes in increments of 10,000 up to 100,000:
```python
for x in range(0, 110000, 10000):
    print(run_simulations(n=x), x)
```
You'll get a printout showing the consistency and statistical power of different sample sizes:
```
0.0 0
27.700000000000003 10000
45.300000000000004 20000
60.3 30000
68.30000000000001 40000
75.0 50000
78.3 60000
86.3 70000
90.3 80000
92.30000000000001 90000
93.30000000000001 100000
```
At 80,000 data points we can expect to reach statistical significance roughly 90% of the time.
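Because each observation here is a URL-day (one row per URL per date after zero inflation), you can translate the required number of data points into a test duration. A minimal sketch, where the 2,000-URL test group is purely hypothetical:

```python
import math

required_observations = 80_000   # from the simulation output above
urls_in_test_group = 2_000       # hypothetical size of your test group

# Each URL contributes one observation per day
days_needed = math.ceil(required_observations / urls_in_test_group)
print(f"Run the test for at least {days_needed} days")  # 40 days in this example
```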
Assign the data to test and control
Once you've determined the minimum sample size, you're ready to start allocating URLs to test groups. This can be done in a number of ways, as shown below:
Content Type:
```python
content_list = ['PDP', 'PLP']
df_assign['group'] = np.where(df_assign['content'].isin(content_list), 'test', 'control')
```
URL String Pattern:
```python
df_assign['group'] = np.where(df_assign['url'].str.contains('[url-string-pattern]'), 'test', 'control')
```
Title Keywords:
```python
keyword_list = ['keyword 1', 'keyword 2', …, 'keyword [n-1]', 'keyword [n]']
df_assign['group'] = np.where(df_assign['title'].str.contains('|'.join(keyword_list)), 'test', 'control')
```
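However you assign the groups, it's worth checking that each one will clear the minimum sample size before the test starts. A minimal sketch, assuming df_assign holds one row per URL and a hypothetical 30-day test window:

```python
# Hypothetical check: URL-days available per group over the planned test window
test_days = 30  # assumption: planned test duration in days
group_urls = df_assign['group'].value_counts()
group_observations = group_urls * test_days

print(group_observations)
# Both 'test' and 'control' should exceed the minimum sample size found earlier
```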
Evaluate the Test
Having run the test either retrospectively or as a fresh test, you're ready to evaluate:
```python
# For reproducibility, set the random seed
np.random.seed(1321)
```
For Traffic:
```python
from statsmodels.api import NegativeBinomial, add_constant

# Encode the A/B group as a dummy variable and add an intercept
X = np.where(test_analytics_expanded['ab_group'] == 'control', 0.0, 1.0)
X = add_constant(X)
X = np.asarray(X)

# Fit a Negative Binomial model of sessions on the test/control dummy
exp_model = NegativeBinomial(test_analytics_expanded['sessions'], X).fit()
exp_model.summary()
```
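The summary reports coefficients on the log scale, so to read the test effect as a traffic multiplier you can exponentiate the coefficient on the group dummy. A sketch assuming the dummy is the second column of X as constructed above:

```python
# NB coefficients are on the log scale; exponentiating gives a multiplicative uplift
coef = np.asarray(exp_model.params)[1]  # assumption: index 1 is the ab_group dummy
uplift = np.exp(coef)
print(f"Estimated traffic multiplier for the test group: {uplift:.3f}")
print(f"Approximate % change: {(uplift - 1) * 100:.1f}%")
```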
For Position:
```python
# Split the data into two groups based on the 'group' column
group_test = gsc_assign[gsc_assign['group'] == 'test']['position']
group_control = gsc_assign[gsc_assign['group'] == 'control']['position']

# Perform the Mann-Whitney U test
u_statistic, p_value_u = stats.mannwhitneyu(group_test, group_control)

# Print the results
print(f"Mann-Whitney U Test Results")
print(f"MWU Statistic: {u_statistic}")
print(f"P-Value: {p_value_u}")

# Additional context
n_test = len(group_test)
n_control = len(group_control)
mean_test = group_test.mean()
mean_control = group_control.mean()
std_test = group_test.std()
std_control = group_control.std()

print(f"\nAdditional Summary Statistics:")
print(f"Test Group: n={n_test}, mean={mean_test:.2f}, std={std_test:.2f}")
print(f"Control Group: n={n_control}, mean={mean_control:.2f}, std={std_control:.2f}")
```
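If you'd like an effect size to go with the p-value, the rank-biserial correlation can be derived directly from the U statistic using the variables defined above (a sketch; note the sign depends on which group scipy treats as the first sample):

```python
# Rank-biserial correlation: ranges from -1 to 1, with 0 meaning no difference between groups
rank_biserial = 1 - (2 * u_statistic) / (n_test * n_control)
print(f"Rank-biserial correlation: {rank_biserial:.3f}")
```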
If you’d like to know more about the science and more code, please consider my O’Reilly on-demand video course “Data Science SEO”.