A/B Testing Process

Author

Jim

Published

September 20, 2024

Video Resource: Data Application Lab Videos

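1 Chinese Version (translated):

1.1 1. Understand what influences the metric:

Analyze which factors may affect your key metric (target), and understand what users do and which steps they take after landing on the page (the user flow). For example, a Netflix user who lands on the homepage clicks a button to sign up.

1.2 2. Define the right metric:

Choose suitable metrics to measure the outcome of the experiment, such as Click Through Rate, Conversion Rate, or Bounce Rate.

1.3 3. Hypothesis design:

Design the experiment's hypothesis from data or experience. It can be a statistical null hypothesis (the metric does not change) or an experience-based hypothesis (a given change is assumed to affect the metric).

1.4 4. Design the test plan:

Define the details of the test, such as changing the color of a button on the assumption that this will raise the click-through rate by 5%, and choose which platforms or regions to run the experiment on.

1.5 5. Collaboration and execution:

Work with other departments, such as UI engineers, to make sure the behavior in every experiment group is recorded correctly and users are assigned to the different groups.

1.6 6. Run the experiment:

Apply statistical concepts to set up the experiment, including fixing the significance level, the statistical power, and the sample size.

1.7 7. Analyze the experiment data and results:

Check that the data is correct, confirm that it meets the predefined statistical criteria, check whether the metric improved, and watch for negative effects on other metrics.

1.8 8. Draw conclusions:

Summarize the results of the experiment and use the data to decide whether to take a next step, such as further optimization or expanding the experiment.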

2 English Version:

2.1 1. Understand factors affecting the metric:

Analyze the factors that may influence your key metric (target) and understand the steps users take on the page (the customer funnel). For example, on Netflix, users land on the homepage, click a button, and proceed to the registration page.

2.2 2. Define the right metric:

Choose appropriate metrics to measure the experiment, such as Click Through Rate (CTR), Conversion Rate, or Bounce Rate.
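To make the three metrics concrete, here is a minimal sketch of how each could be computed from raw counts. The counts and variable names are hypothetical, and the exact denominators (impressions versus visitors, what counts as a "bounce") are assumptions that depend on how your product logs events.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there is no traffic."""
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for one variant (illustrative only).
impressions = 10_000       # times the page or button was shown
clicks = 420               # clicks on the button
visitors = 8_000           # visitors who landed on the site
conversions = 160          # visitors who signed up or purchased
single_page_exits = 3_600  # visitors who left right after landing

ctr = rate(clicks, impressions)                # Click Through Rate
conversion_rate = rate(conversions, visitors)  # Conversion Rate
bounce_rate = rate(single_page_exits, visitors)  # Bounce Rate

print(f"CTR={ctr:.2%}  conversion={conversion_rate:.2%}  bounce={bounce_rate:.2%}")
```

Whichever definitions you pick, they should be fixed before the experiment starts so that both groups are measured the same way.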

2.3 3. Hypothesis design:

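Formulate hypotheses based on data or experience: a null hypothesis (the metric does not change) and an alternative hypothesis (a specific change will affect the metric). The experiment then checks whether the data gives enough evidence to reject the null hypothesis in favor of the alternative.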

2.4 4. Design a test plan:

Define the details of the test: for example, change a button color to green, hypothesize that this may increase the click-through rate by 5%, and select the platforms and user regions for the experiment.

2.5 5. Collaboration and execution:

Collaborate with other departments, such as UI engineers, to ensure the actions in different experiment groups are recorded correctly and users are assigned to appropriate groups.
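One common way to assign users to groups is to hash a stable user identifier, so the same user always lands in the same group without storing any state. The sketch below is only an illustration of that idea, not necessarily how your engineering team does it; the function name `assign_group`, the experiment name, and the user ID format are all hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str,
                 groups=("control", "treatment")) -> str:
    """Deterministically map a user to a group by hashing experiment + user_id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(groups)  # roughly uniform over the groups
    return groups[bucket]

# Assign 10,000 hypothetical users and count the resulting split.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_group(f"user-{i}", "green-button")] += 1

print(counts)
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who is in the treatment group of one test is not automatically in the treatment group of the next.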

2.6 6. Run the experiment:

Use statistical concepts to set up the experiment: fix the significance level (α) and the statistical power (1 - β) up front, then determine the sample size needed to detect the expected effect.
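As an illustration of how these three quantities fit together, the sketch below estimates the sample size per group for comparing two proportions, using the standard normal-approximation formula for a two-sided test. The baseline rate of 10% and the reading of "5%" as a relative lift (10% to 10.5%) are assumptions made for the example, and `sample_size_per_group` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base: float, p_variant: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for power = 1 - beta
    p_bar = (p_base + p_variant) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    ) ** 2
    return math.ceil(numerator / (p_variant - p_base) ** 2)

# Assumed baseline CTR of 10% and a 5% relative lift (10% -> 10.5%).
n_small_lift = sample_size_per_group(0.10, 0.105)
# A larger assumed lift (10% -> 12%) needs far fewer users.
n_large_lift = sample_size_per_group(0.10, 0.12)

print(n_small_lift, n_large_lift)
```

The smaller the effect you want to detect, the more users you need, which is why the expected lift from the test plan has to be decided before the experiment runs.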

2.7 7. Analyze test data and results:

Verify that the collected data is accurate, confirm that it meets the statistical criteria set in step 6, check whether the metric improved, and check whether other metrics were negatively affected.
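One way to check that the collected data is trustworthy is to compare the observed group sizes with the planned split. The personal notes in this post mention the case where more than 90% of the data comes from one variant, which signals a broken setup. The sketch below tests for such a sample ratio mismatch with a normal approximation; the 50/50 planned split, the counts, and the function name `srm_p_value` are assumptions for illustration.

```python
import math
from statistics import NormalDist

def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Two-sided p-value that the observed split matches the planned split."""
    n = n_control + n_treatment
    observed_share = n_treatment / n
    se = math.sqrt(expected_treatment_share * (1 - expected_treatment_share) / n)
    z = (observed_share - expected_treatment_share) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: a healthy split and a badly skewed one.
healthy = srm_p_value(5_030, 4_970)
skewed = srm_p_value(1_000, 9_000)  # 90% of users in one group

print(f"healthy split p={healthy:.3f}, skewed split p={skewed:.3g}")
```

A very small p-value here means the assignment or logging is likely broken, so the metric comparison should not be trusted until the data is collected again.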

2.8 8. Draw conclusions:

Summarize the results of the experiment and decide on the next steps based on the data, such as further optimization or scaling the experiment.

3 Statistical Concepts:

3.1 1. Alpha (α) - Significance Level:

Definition: Alpha is the probability of making a Type I Error: rejecting the Null Hypothesis (H₀) when H₀ is actually true, i.e. there is no real effect.
Typical Value: Often set to 0.05, meaning a 5% chance of rejecting H₀ when it is true.
Relationship with p-value: If the p-value is less than α, you reject H₀. If it is greater than or equal to α, you fail to reject H₀, which is not the same as proving H₀ true.
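The decision rule above can be sketched with a two-proportion z-test, a common choice for comparing click or conversion rates between two groups. The counts below are made up for illustration, and `two_proportion_p_value` is a hypothetical helper, not a library function.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # rate under H0 (no difference)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
# Hypothetical results: control 500/10,000 clicks, treatment 580/10,000 clicks.
p_value = two_proportion_p_value(500, 10_000, 580, 10_000)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"p-value={p_value:.4f} -> {decision}")
```

Note that α is chosen before looking at the data; picking it after seeing the p-value defeats the purpose of the threshold.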

3.2 2. Beta (β) - Type II Error:

Definition: Beta is the probability of making a Type II Error: failing to reject H₀ when there is a real effect.
Statistical Power: Power is defined as 1 - β. A common target is 0.8 (80%), meaning an 80% chance of detecting an effect of the assumed size if it exists.
Relationship with Sample Size: For a fixed α and effect size, a larger sample size reduces β, making the effect easier to detect.
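To see how sample size drives β, the sketch below approximates the power of a two-sided two-proportion test for several group sizes. It uses a normal approximation that ignores the negligible chance of rejecting in the wrong direction; the rates (10% versus 12%) and the helper name `power_two_proportions` are assumptions for illustration.

```python
import math
from statistics import NormalDist

def power_two_proportions(p_base: float, p_variant: float,
                          n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power (1 - beta) of a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(
        p_base * (1 - p_base) / n_per_group
        + p_variant * (1 - p_variant) / n_per_group
    )
    z_effect = abs(p_variant - p_base) / se  # true difference in standard errors
    return NormalDist().cdf(z_effect - z_alpha)

# Power grows, and beta shrinks, as each group gets larger.
powers = {n: power_two_proportions(0.10, 0.12, n) for n in (1_000, 3_841, 10_000)}
for n, power in powers.items():
    print(f"n={n:>6}  power={power:.2f}  beta={1 - power:.2f}")
```

With the assumed rates, roughly 3,800 users per group are needed to reach the usual 80% power target; smaller groups leave a large chance of missing the effect.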

3.3 3. Comparing Type I and Type II Errors:

Type I Error (α): Rejecting H₀ when it is actually true (a false positive).
Type II Error (β): Failing to reject H₀ when it is actually false (a false negative).
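The two error types can also be seen empirically by simulating many experiments. In the sketch below, an A/A setup (both groups share the same true rate) estimates the Type I error rate, and an A/B setup with a real difference estimates the Type II error rate. All rates, sample sizes, and function names are hypothetical, and the simulation uses a fixed seed only so that it is repeatable.

```python
import math
import random
from statistics import NormalDist

def p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence against H0
    z = (success_b / n_b - success_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(p_a: float, p_b: float, n: int = 500,
                   trials: int = 1_000, alpha: float = 0.05, seed: int = 0) -> float:
    """Share of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = sum(rng.random() < p_a for _ in range(n))  # successes in group A
        b = sum(rng.random() < p_b for _ in range(n))  # successes in group B
        if p_value(a, n, b, n) < alpha:
            rejections += 1
    return rejections / trials

type_1_rate = rejection_rate(0.10, 0.10)      # H0 true: any rejection is a Type I error
type_2_rate = 1 - rejection_rate(0.10, 0.15)  # H0 false: any non-rejection is a Type II error

print(f"Type I rate ~ {type_1_rate:.3f}, Type II rate ~ {type_2_rate:.3f}")
```

The Type I rate should land near the chosen α, while the Type II rate depends on the true effect and the sample size, which is why it has to be controlled through the power calculation rather than through α.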

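4 Original Personal Notes:

1. Understand what influences a metric (target), and understand the actions and steps users take once they reach the page (the customer funnel, i.e. each action step). For example, on Netflix a user sees the landing page, clicks the button, and is taken to the registration form.
2. How to define the right metric, for example click through rate (the probability of going from seeing the page to clicking through), conversion rate (the rate at which a visitor to the site becomes a user or buyer), and bounce rate (the probability that someone leaves right after clicking into the site).
3. Hypothesis, two kinds: the statistical null hypothesis (the easier one: the metric does not change), or a hypothesis assumed from experience and then tested by experiment.
4. Test plan: how to run the test. For example, assume that changing a button to green might give a 5% lift, and choose which platforms and which regions' users get the experiment.
5. Work with other departments, such as UI engineers, to make sure the different experiment groups and their actions are recorded and sorted into separate baskets.
6. Run experiment: statistical concepts; decide the significance level, statistical power, sample size, etc.
7. Analyze test data and results: check whether the collected data is right. For example, if there are two variants, red and green, but more than 90% of the collected data is green, the experiment was set up wrong and you have to go back to step 5 and collect the data again. Also check whether the results meet the statistical criteria that were set (step 6), whether the metric improved (step 2), and whether other metrics were negatively affected.
8. Conclusion.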

5 Summary:

Alpha is the threshold (commonly 0.05) for rejecting H₀, while Beta is the probability of missing a real effect. Statistical power (1 - β) is the probability of detecting an effect that actually exists.