diff --git a/math/probability-and-statistics-for-ml/AB Testing.ipynb b/math/probability-and-statistics-for-ml/AB Testing.ipynb new file mode 100644 index 0000000..cd086dc --- /dev/null +++ b/math/probability-and-statistics-for-ml/AB Testing.ipynb @@ -0,0 +1,1041 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "751a404c", + "metadata": {}, + "source": [ + "# AB Testing - Average Session Duration\n", + "\n", + "Welcome! In this assignment you will be presented with two cases that require an AB test to choose an action to improve an existing product. You will perform AB test for a continuous and a proportion metric. For this you will define functions that estimate the relevant information out of the samples, compute the relevant statistic given each case and take a decision on whether to (or not) reject the null hypothesis.\n", + "\n", + "Let's get started!\n", + "\n", + "# Outline\n", + "- [ 1 - Introduction](#1)\n", + "- [ 2 - Exploring and handling the data](#2)\n", + "- [ 3 - Revisiting the theory](#3)\n", + "- [ 4 - Step by step computation](#4)\n", + " - [ Exercise 1](#ex01)\n", + " - [ Exercise 2](#ex02)\n", + " - [ Exercise 3](#ex03)\n", + " - [ Exercise 4](#ex04)\n", + " - [ Exercise 5](#ex05)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "a5759717", + "metadata": { + "tags": [ + "graded" + ] + }, + "outputs": [], + "source": [ + "import math\n", + "import numpy as np\n", + "import pandas as pd\n", + "from scipy import stats" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "7205aaa1", + "metadata": {}, + "outputs": [], + "source": [ + "import w4_unittest" + ] + }, + { + "cell_type": "markdown", + "id": "ab8f62a7", + "metadata": {}, + "source": [ + "\n", + "## 1 - Introduction\n", + "\n", + "Suppose you have a website that provides machine learning content in a blog-like format. Recently you saw an article claiming that similar websites could improve their engagement by simply using a specific color palette for the background. Since this change seems pretty easy to implement you decide to run an AB test to see if this change does in fact drive your users to stay more time in your website.\n", + "\n", + "The metric you decide to evaluate is the `average session duration`, which measures how much time on average your users are spending on your website. This metric currently has a value of 30.87 minutes.\n", + "\n", + "Without further considerations you decide to run the test for 20 days by randomly splitting your users into two segments:\n", + "- `control`: These users will keep seeing your original website.\n", + "\n", + "\n", + "- `variation`: These users will see your website with the new background colors.\n", + "\n", + "\n", + "## 2 - Exploring and handling the data\n", + "\n", + "Run the next cell to load the data from the test:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "9ebfb03e", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_iduser_typesession_duration
0BM3C0BJ7CSvariation15.528769
1MJWN6XNH6Lvariation32.287590
246ZPHHABLSvariation43.718217
3OHA298DHUGvariation49.519702
4AKJ77X6F4Acontrol61.709028
5BFNWMGU6DXvariation71.779283
6UFO2V8ZKFBvariation23.291835
74CEIM3VRS9control25.219461
890AGF68FF8control26.240482
9R3DQFO6068variation20.780244
\n", + "
" + ], + "text/plain": [ + " user_id user_type session_duration\n", + "0 BM3C0BJ7CS variation 15.528769\n", + "1 MJWN6XNH6L variation 32.287590\n", + "2 46ZPHHABLS variation 43.718217\n", + "3 OHA298DHUG variation 49.519702\n", + "4 AKJ77X6F4A control 61.709028\n", + "5 BFNWMGU6DX variation 71.779283\n", + "6 UFO2V8ZKFB variation 23.291835\n", + "7 4CEIM3VRS9 control 25.219461\n", + "8 90AGF68FF8 control 26.240482\n", + "9 R3DQFO6068 variation 20.780244" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Load the data from the test using pd.read_csv\n", + "data = pd.read_csv(\"background_color_experiment.csv\")\n", + "\n", + "# Print the first 10 rows\n", + "data.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e8f146e1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The dataset size is: 4186\n" + ] + } + ], + "source": [ + "print(f\"The dataset size is: {len(data)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "2d257a0f", + "metadata": {}, + "source": [ + "The data shows for every user the average session duration and the version of the website they interacted with. To separate both segments for easier computations you can slice the Pandas dataframe by running the following cell. You may want to revisit our ungraded labs on Pandas if you are still not familiar with it. However, no need to worry because you don't need to be a Pandas expert to complete this assignment." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "9552f1a8", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2069 users saw the original website with an average duration of 32.92 minutes\n", + "\n", + "2117 users saw the new website with an average duration of 33.83 minutes\n" + ] + } + ], + "source": [ + "# Separate the data from the two groups (sd stands for session duration)\n", + "control_sd_data = data[data[\"user_type\"]==\"control\"][\"session_duration\"]\n", + "variation_sd_data = data[data[\"user_type\"]==\"variation\"][\"session_duration\"]\n", + "\n", + "print(f\"{len(control_sd_data)} users saw the original website with an average duration of {control_sd_data.mean():.2f} minutes\\n\")\n", + "print(f\"{len(variation_sd_data)} users saw the new website with an average duration of {variation_sd_data.mean():.2f} minutes\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "071d9e8c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4 61.709028\n", + "7 25.219461\n", + "8 26.240482\n", + "11 58.748026\n", + "13 137.036803\n", + " ... \n", + "4172 28.490594\n", + "4174 24.228853\n", + "4177 20.661284\n", + "4178 17.101761\n", + "4183 17.839862\n", + "Name: session_duration, Length: 2069, dtype: float64" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "control_sd_data" + ] + }, + { + "cell_type": "markdown", + "id": "0946faf2", + "metadata": {}, + "source": [ + "Notice that the split is not perfectly balanced. This is common in AB testing as there is randomness associated with the way the users are assigned to each group. \n", + "\n", + "At first glance it looks like the change to the background did in fact drive users to stay longer on your website. However you know better than driving conclusions at face value out of this data so you decide to perform a hypothesis test to know if there is a significant difference between the **means** of these two segments. \n", + "\n", + "\n", + "## 3 - Revisiting the theory\n", + "\n", + "Let's revisit the theory you saw in the lectures and apply it to this problem. If you are confident with the theory and you feel that you don't need a revision, you may skip this section direct to 1.4!\n", + "\n", + "Remember that your job is to measure if changing the website's background color leads to an increase of the time visitors spend on it. Rewriting this as hypothesis test, the **null hypothesis** is that the change did not affect the time a visitor spend. Let's name the variables:\n", + "\n", + "- $\\mu_c$ is the average time a user **in the control group** spend in the website. Recall that the **control group** is the group accessing the website without the change in the background color.\n", + "- $\\mu_v$ is the average time a user **in the variation groups** spend in the website. Recall that the **variation group** is the groups accessing the website **with the updated background color**.\n", + "\n", + "Also, recall that your intention is to measure if the background color leads to an **increase** in the time a visitor spend in the website. So writing this experiment as a hypothesis test, the **null hypothesis** is then $H_0: \\mu_c = \\mu_v$ and the **alternative hypothesis** is $H_1: \\mu_v > \\mu_c$, or equivalently, $H_1: \\mu_v - \\mu_c > 0$. \n", + "\n", + "Therefore, the hypothesis you will test is:\n", + "\n", + "$$H_0: \\mu_v = \\mu_c \\quad \\text{vs.} \\quad H_1: \\mu_v - \\mu_c > 0$$\n", + "\n", + "This is a **right-tailed** test, as you are looking for an increase in the average time. As you saw above, you have more than 2000 users per group, this is a great amount of data so it is reasonable to rely in the Central Limit Theorem that the **average time** for each group follows a normal distribution. Remember that this result is for the group **average time** altogether and not that the time each user spend follows a normal distribution. You don't know the exact distribution for the amount of time a user spend in your website, however, the CLT assures that if we gather enough data, their average time will be very close to a normal distribution whose mean is the average time a user spend in the website. Let's then define two new quantities:\n", + "\n", + "- $\\overline{X}_c$ - the control group **sample mean**.\n", + "- $\\overline{X}_v$ - the variation group **sample mean**.\n", + "- $n_c$ - the control group **size**.\n", + "- $n_v$ - the variation group **size**.\n", + "\n", + "So, by the Central Limit Theorem, you may suppose that\n", + "\n", + "- $$\\overline{X}_c \\sim N\\left(\\mu_c, \\left(\\frac{\\sigma_c}{\\sqrt{n_c}}\\right)^2\\right)$$\n", + "- $$\\overline{X}_v \\sim N\\left(\\mu_v, \\left(\\frac{\\sigma_v}{\\sqrt{n_v}}\\right)^2\\right)$$\n", + "\n", + "Note that with our assumptions of normality, $\\overline{X}_v - \\overline{X}_c$ also follows a normal distribution. So, if $H_0$ is true, then $\\mu_c = \\mu_v$ and $\\mu_v - \\mu_c = 0$, therefore:\n", + "\n", + "$$\\overline{X}_c - \\overline{X}_v \\sim N\\left(\\mu_v - \\mu_c, \\left(\\dfrac{\\sigma_v}{\\sqrt{n_v}}\\right)^2 + \\left(\\dfrac{\\sigma_c}{\\sqrt{n_c}}\\right)^2\\right) = N\\left(0, \\left(\\dfrac{\\sigma_v}{\\sqrt{n_v}}\\right)^2 + \\left(\\dfrac{\\sigma_c}{\\sqrt{n_c}}\\right)^2\\right)$$\n", + "\n", + "Or, equivalently:\n", + "\n", + "$$\\frac{\\left( \\overline{X}_v - \\overline{X}_c \\right)}{\\sqrt{\\left(\\frac{\\sigma_v}{\\sqrt{n_v}}\\right)^2 + \\left(\\frac{\\sigma_c}{\\sqrt{n_c}}\\right)^2}} \\sim N(0, 1)$$\n", + "\n", + "However, remember that **you don't know the exact values for** $\\sigma_v$ and $\\sigma_c$, as they are the **population standard deviation** and you are working with a sample, so the best you can do is compute the **sample standard deviation**. So you must replace $\\sigma_c$ and $\\sigma_v$ by the sample standard deviation, respectively, $s_c$ and $s_v$. You also saw in the lectures that replacing the population standard deviation by the sample standard deviation changes the random variable from a Normal to a t-student:\n", + "\n", + "$$t = \\frac{\\left( \\overline{X}_v - \\overline{X}_c \\right)}{\\sqrt{\\left(\\frac{s_v}{\\sqrt{n_v}}\\right)^2 + \\left(\\frac{s_c}{\\sqrt{n_c}}\\right)^2}} \\sim t_d$$\n", + "\n", + "Where $d$ is the **degrees of freedom** for this scenario. If we suppose that both groups have the same standard deviation, then $d = n_c + n_v - 2$, however there is no argument supporting this supposition, so the formula for the degrees of freedom gets a bit messier:\n", + "\n", + "$$d = \\frac{\\left[\\frac{s_{v}^2}{n_v} + \\frac{s_{c}^2}{n_c} \\right]^2}{\\frac{(s_{v}^2/n_v)^2}{n_v-1} + \\frac{(s_{c}^2/n_c)^2}{n_c-1}}$$\n", + "\n", + "Once you get the actual value for $t_d$ the, with a given significance level $\\alpha$, you can decide if this value falls within the range of values that are likely to occur in the $t$-student distribution (where 'likely' is related with your significance level). To perform this step you must find the value $p$ such that \n", + "\n", + "$$p = P(t_d > t | H_0)$$\n", + "\n", + "If this value is less than your significance level $\\alpha$, then you **reject the null hypothesis**, because it means that you observed a value that is very unlikely to occur (unlikely here means that is less than the significance level you have set) if $H_0$ is true.\n", + "\n", + "Also, remember that $P(t_d \\leq t)$ is the $\\text{CDF}$ (cumulative distribution function) for the $t$-student distribution with $d$ degrees of freedom in the point $x = t$, so to compute $P(t_d > t)$ you may compute:\n", + "\n", + "$$P(t_d > t) = 1 - \\text{CDF}_{t_d}(t)$$\n", + "\n", + "Since $P(t_d \\leq t) + P(t_d > t) = 1$" + ] + }, + { + "cell_type": "markdown", + "id": "9b280e63", + "metadata": {}, + "source": [ + "\n", + "## 4 - Step by step computation\n", + "\n", + "\n", + "Wrapping up everything discussed above:\n", + "\n", + "The hypothesis test is given by:\n", + "\n", + "$$H_0: \\mu_v = \\mu_c \\quad \\text{vs.} \\quad H_1: \\mu_v - \\mu_c > 0$$\n", + "\n", + "You will start computing:\n", + "\n", + "- $n_c$ and $n_v$, the control and variation group sizes, respectively.\n", + "- $\\overline{X}_c$ and $\\overline{X}_v$, the average time spent by the users in the control and variation group, respectively. \n", + "- $s_c$ and $s_v$, the **sample** standard deviation for the time spend by the users in the control and variation group, respectively.\n", + "\n", + "With these quantities in hand, the next steps are to compute:\n", + "\n", + "- $d$, the degrees of freedom of the $t$-student distribution, $t_d$.\n", + "- The $t$-value, which it will be called $t$.\n", + "- The $p$ value for the distribution $t_d$ for the $t$-value, i.e., the value $p = P(t_d > t | H_0)$.\n", + "\n", + "Finally, for a given significance level $\\alpha$, you will be able to decide if you reject or not $H_0$, depending on wether $p \\leq \\alpha$ or not.\n", + "\n", + "Let's get your hands into work now! Run the cell below to retrieve the session times for the control and variation groups." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "20e6f0fe", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# X_c stores the session tome for the control group and X_v, for the variation group. \n", + "X_c = control_sd_data.to_numpy()\n", + "X_v = variation_sd_data.to_numpy()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e58a42be", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The first 20 entries for X_c are:\n", + "[ 61.70902753 25.21946052 26.2404824 58.7480264 137.03680289\n", + " 19.92148102 18.8252202 75.25179496 38.27213776 29.17104128\n", + " 15.69643672 37.83860271 30.06843075 21.00318148 86.19711927\n", + " 46.96997965 46.47776713 14.83464105 17.70441365 26.44693676]\n", + "\n", + "The first 20 entries for X_v are:\n", + "[15.52876878 32.28759003 43.7182168 49.51970242 71.77928343 23.29183517\n", + " 20.78024375 36.44129464 48.75034676 16.5952978 44.49566616 26.67006134\n", + " 34.43667579 20.72109411 19.60185277 41.74218978 19.74485294 32.62018094\n", + " 44.99513901 70.8916231 ]\n", + "\n" + ] + } + ], + "source": [ + "print(f\"The first 20 entries for X_c are:\\n{X_c[:20]}\\n\")\n", + "print(f\"The first 20 entries for X_v are:\\n{X_v[:20]}\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "c2b2ba39", + "metadata": {}, + "source": [ + "\n", + "### Exercise 1\n", + "\n", + "In this exercise, you will write a function to retrieve the basic statistics for `X_c` and `X_d`. In other words, this function will compute, for a given numpy array:\n", + "\n", + "- Its size (in your case, $n_c$ and $n_v$).\n", + "- Its mean (in your case, $\\overline{X}_c$ and $\\overline{X}_v$)\n", + "- Its sample standard deviation(in your case, $s_c$ and $s_v$)\n", + "\n", + "This function inputs a numpy array and outputs a tuple in the form `(n, x, s)` where `n` is the numpy array size, `x` is its mean and `s`is its **sample** standard deviation.\n", + "\n", + "Hint: \n", + "- Recall that the sample standard deviation is computed by replacing $N$ by $N-1$ in the variance formula. \n", + "- You may compute an array size using the `len`function.\n", + "- Any array in numpy has a method called `.mean()` to compute its mean.\n", + "- Any array in numpy has a method called `.std()` to compute the standard deviation and a parameter called `ddof` where if you pass `ddof = 1`, it will use $N-1$ instead of $N$. " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "1e72eca9", + "metadata": { + "tags": [ + "graded" + ] + }, + "outputs": [], + "source": [ + "def get_stats(X):\n", + " \"\"\"\n", + " Calculate basic statistics of a given data set.\n", + "\n", + " Parameters:\n", + " X (numpy.array): Input data.\n", + "\n", + " Returns:\n", + " tuple: A tuple containing:\n", + " - n (int): Number of elements in the data set.\n", + " - x (float): Mean of the data set.\n", + " - s (float): Sample standard deviation of the data set.\n", + " \"\"\"\n", + "\n", + " ### START CODE HERE ###\n", + " \n", + " # Get the group size\n", + " n = len(X)\n", + " # Get the group mean\n", + " x = X.mean()\n", + " # Get the group sample standard deviation (do not forget to pass the parameter ddof if using the method .std)\n", + " s = X.std(ddof=1)\n", + "\n", + " ### END CODE HERE ###\n", + "\n", + " return (n,x,s)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "22977ba5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[92m All tests passed\n" + ] + } + ], + "source": [ + "w4_unittest.test_get_stats(get_stats)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "a38aa921", + "metadata": {}, + "outputs": [], + "source": [ + "n_c, x_c, s_c = get_stats(X_c)\n", + "n_v, x_v, s_v = get_stats(X_v)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5954831a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "For X_c:\n", + "\tn_c = 2069, x_c = 32.92, s_c = 17.54 \n", + "For X_v:\n", + "\tn_v = 2117, x_v = 33.83, s_v = 18.24 \n" + ] + } + ], + "source": [ + "print(f\"For X_c:\\n\\tn_c = {n_c}, x_c = {x_c:.2f}, s_c = {s_c:.2f} \")\n", + "print(f\"For X_v:\\n\\tn_v = {n_v}, x_v = {x_v:.2f}, s_v = {s_v:.2f} \")" + ] + }, + { + "cell_type": "markdown", + "id": "d501af8f", + "metadata": {}, + "source": [ + "##### __Expected Output__ \n", + "\n", + "```Python\n", + "For X_c:\n", + "\tn_c = 2069, x_c = 32.92, s_c = 17.54 \n", + "For X_v:\n", + "\tn_v = 2117, x_v = 33.83, s_v = 18.24 \n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "6f1c455e", + "metadata": {}, + "source": [ + "\n", + "### Exercise 2\n", + "\n", + "In this exercise you will implement a function to compute $d$, the degrees of freedom for the $t$-student distribution. It is given by the following formula:\n", + "\n", + "$$d = \\frac{\\left[\\frac{s_{c}^2}{n_c} + \\frac{s_{v}^2}{n_v} \\right]^2}{\\frac{(s_{c}^2/n_c)^2}{n_c-1} + \\frac{(s_{v}^2/n_v)^2}{n_v-1}}$$\n", + "\n", + "Hint: You may use the syntax `x**2`to square a number in python, or you may use the function `np.square`. The latter may help to keep your code cleaner. Pay attention in the parenthesis as they will indicate the order that Python will perform the computation!" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "0689d522", + "metadata": { + "tags": [ + "graded" + ] + }, + "outputs": [], + "source": [ + "def degrees_of_freedom(n_v, s_v, n_c, s_c):\n", + " \"\"\"Computes the degrees of freedom for two samples.\n", + "\n", + " Args:\n", + " control_metrics (estimation_metrics_cont): The metrics for the control sample.\n", + " variation_metrics (estimation_metrics_cont): The metrics for the variation sample.\n", + "\n", + " Returns:\n", + " numpy.float: The degrees of freedom.\n", + " \"\"\"\n", + " \n", + " ### START CODE HERE ###\n", + " \n", + " # To make the code clean, let's divide the numerator and the denominator. \n", + " # Also, note that the value s_c^2/n_c and s_v^2/n_v appears both in the numerator and denominator, so let's also compute them separately\n", + "\n", + " # Compute s_v^2/n_v (remember to use Python syntax or np.square)\n", + " s_v_n_v = s_v**2 / n_v\n", + "\n", + " # Compute s_c^2/n_c (remember to use Python syntax or np.square)\n", + " s_c_n_c = s_c**2 / n_c\n", + "\n", + "\n", + " # Compute the numerator in the formula given above\n", + " numerator = (s_v_n_v + s_c_n_c)**2\n", + "\n", + " # Compute the denominator in the formula given above. Attention that s_c_n_c and s_v_n_v appears squared here!\n", + " # Also, remember to use parenthesis to indicate the operation order. Note that a/b+1 is different from a/(b+1).\n", + " denominator = s_c_n_c**2 / (n_c - 1) + s_v_n_v**2 / (n_v - 1)\n", + " \n", + " ### END CODE HERE ###\n", + "\n", + " dof = numerator/denominator\n", + " \n", + " return dof" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "6ee0e5e1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[92m All tests passed\n" + ] + } + ], + "source": [ + "w4_unittest.test_degrees_of_freedom(degrees_of_freedom)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "edb611dd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The degrees of freedom for the t-student in this scenario is: 4182.97\n" + ] + } + ], + "source": [ + "d = degrees_of_freedom(n_v, s_v, n_c, s_c)\n", + "print(f\"The degrees of freedom for the t-student in this scenario is: {d:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "d5bd2db5", + "metadata": {}, + "source": [ + "### __Expected output__\n", + "\n", + "`The degrees of freedom for the t-student in this scenario is: 4182.97\n", + "`" + ] + }, + { + "cell_type": "markdown", + "id": "3df20f77", + "metadata": {}, + "source": [ + "\n", + "### Exercise 3\n", + "\n", + "In this exercise, you will compute the $t$-value, given by\n", + "\n", + "$$t = \\frac{\\left( \\overline{X}_v - \\overline{X}_c \\right)}{\\sqrt{\\left(\\frac{s_v}{\\sqrt{n_v}}\\right)^2 + \\left(\\frac{s_c}{\\sqrt{n_c}}\\right)^2}} = \\frac{\\left( \\overline{X}_v - \\overline{X}_c \\right)}{\\sqrt{\\frac{s_v^2}{n_v} + \\frac{s_c^2}{n_c}}}$$\n", + "\n", + "Remember that you are storing $\\overline{X}_c$ and $\\overline{X}_v$ in the variables $x_c$ and $x_d$, respectively. " + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "49345865", + "metadata": { + "tags": [ + "graded" + ] + }, + "outputs": [], + "source": [ + "def t_value(n_v, x_v, s_v, n_c, x_c, s_c):\n", + "\n", + " ### START CODE HERE ###\n", + "\n", + " # As you did before, let's split the numerator and denominator to make the code cleaner.\n", + " # Also, let's compute again separately s_c^2/n_c and s_v^2/n_v.\n", + "\n", + " # Compute s_v^2/n_v (remember to use Python syntax or np.square)\n", + " s_v_n_v = s_v**2 / n_v\n", + "\n", + " # Compute s_c^2/n_c (remember to use Python syntax or np.square)\n", + " s_c_n_c = s_c**2 / n_c\n", + "\n", + " # Compute the numerator for the t-value as given in the formula above\n", + " numerator = x_v - x_c\n", + "\n", + " # Compute the denominator for the t-value as given in the formula above. You may use np.sqrt to compute the square root.\n", + " denominator = math.sqrt(s_v_n_v + s_c_n_c)\n", + " \n", + " ### END CODE HERE ###\n", + "\n", + " t = numerator/denominator\n", + "\n", + " return t" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "787f496b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[92m All tests passed\n" + ] + } + ], + "source": [ + "w4_unittest.test_t_value(t_value)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "c264a82a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The t-value for this experiment is: 1.64\n" + ] + } + ], + "source": [ + "t = t_value(n_v, x_v, s_v, n_c, x_c, s_c)\n", + "print(f\"The t-value for this experiment is: {t:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e978292c", + "metadata": {}, + "source": [ + "##### __Expected Output__\n", + "\n", + "```\n", + "The t-value for this experiment is: 1.64\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "0eba57b1", + "metadata": {}, + "source": [ + "\n", + "### Exercise 4\n", + "\n", + "In this exercise, you will compute the $p$ value for $t_d$, for a given significance level $\\alpha$. Recall that this experiment is a right-tailed t-test, because you are investigating wether the background color change increases the time spent by users in your website or not. \n", + "\n", + "In this experiment the $p$-value for a significance level of $\\alpha$ is given by\n", + "\n", + "$$p = P(t_d > t) = 1 - \\text{CDF}_{t_d}(t)$$\n", + "\n", + "\n", + "Hint: \n", + "- You may use the scipy function `stats.t(df = d)` to get the $t$-student distribution with `d`degrees of freedom. \n", + "- To compute its CDF, you may use its method `.cdf`. \n", + "\n", + "Example:\n", + "\n", + "Suppose you want to compute the CDF for a $t$-student distribution with $d = 10$ degrees of freedom for a t-value of $1.21$." + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "e2199728", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The CDF for the t-student distribution with 10 degrees of freedom and t-value = 1.21, or equivalently P(t_10 < 1.21) is equal to: 0.87\n" + ] + } + ], + "source": [ + "t_10 = stats.t(df = 10)\n", + "cdf = t_10.cdf(1.21)\n", + "print(f\"The CDF for the t-student distribution with 10 degrees of freedom and t-value = 1.21, or equivalently P(t_10 < 1.21) is equal to: {cdf:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf9b9a0", + "metadata": {}, + "source": [ + "This means that there is a probability of 87% that you will observe a value less than 1.21 when sampling from a $t$-student distribution with 10 degrees of freedom.\n", + "\n", + "Ok, now you are ready to write a function to compute the $p$-value for the $t$-student distribution, with $d$ degrees of freedom and a given $t$-value." + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "170e5d25", + "metadata": { + "tags": [ + "graded" + ] + }, + "outputs": [], + "source": [ + "def p_value(d, t_value):\n", + "\n", + " ### START CODE HERE ###\n", + "\n", + " # Load the t-student distribution with $d$ degrees of freedom. Remember that the parameter in the stats.t is given by df.\n", + " t_d = stats.t(df = d)\n", + "\n", + " # Compute the p-value, P(t_d > t). Remember to use the t_d.cdf with the proper adjustments as discussed above.\n", + " p = 1 - t_d.cdf(t_value)\n", + "\n", + "\n", + " ### END CODE HERE ###\n", + " return p" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "98dd95fb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[92m All tests passed\n" + ] + } + ], + "source": [ + "w4_unittest.test_p_value(p_value)" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "fc5ee62f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The p-value for t_15 with t-value = 1.10 is: 0.1443\n", + "The p-value for t_30 with t-value = 1.10 is: 0.1400\n" + ] + } + ], + "source": [ + "print(f\"The p-value for t_15 with t-value = 1.10 is: {p_value(15, 1.10):.4f}\")\n", + "print(f\"The p-value for t_30 with t-value = 1.10 is: {p_value(30, 1.10):.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "df30fe94", + "metadata": {}, + "source": [ + "### __Expected output__\n", + "\n", + "```\n", + "The p-value for t_15 with t-value = 1.10 is: 0.1443\n", + "The p-value for t_30 with t-value = 1.10 is: 0.1400\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "57f3a4fd", + "metadata": {}, + "source": [ + "\n", + "### Exercise 5\n", + "\n", + "In this exercise you will wrap up all the functions you have built so far to decide if you accept $H_0$ or not, given a significance level of $\\alpha$.\n", + "\n", + "It will input both control and validation groups and it will output `Reject H_0$` or `Do not reject H_0` accordingly.\n", + "\n", + "Remember that you **reject** $H_0$ if the p-value is **less than** $\\alpha$. " + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "5e8409a6", + "metadata": { + "tags": [ + "graded" + ] + }, + "outputs": [], + "source": [ + "def make_decision(X_v, X_c, alpha = 0.05):\n", + "\n", + " ### START CODE HERE ###\n", + "\n", + " # Compute n_v, x_v and s_v\n", + " n_v, x_v, s_v = get_stats(X_v)\n", + "\n", + " # Compute n_c, x_c and s_c\n", + " n_c, x_c, s_c = get_stats(X_c)\n", + "\n", + " # Compute the degrees of freedom for the t-student distribution for this experiment.\n", + " # Pay attention to the arguments order. You may look the function definition above to make sure you don't swap values.\n", + " # Also, remember that x_c and x_v are not used in this computation\n", + " d = degrees_of_freedom(n_v, s_v, n_c, s_c)\n", + " \n", + " # Compute the t-value\n", + " t = t_value(n_v, x_v, s_v, n_c, x_c, s_c)\n", + "\n", + " # Compute the p-value for the t-student distribution with d degrees of freedom\n", + " p = p_value(d, t)\n", + "\n", + " # This is the decision step. Compare p with alpha to decide about rejecting H_0 or not. \n", + " # Pay attention to the return value for each block to properly write the condition.\n", + "\n", + " if p < alpha:\n", + " return 'Reject H_0'\n", + " else:\n", + " return 'Do not reject H_0'\n", + "\n", + " ### END CODE HERE ###" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "f20b5cc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[92m All tests passed\n" + ] + } + ], + "source": [ + "w4_unittest.test_make_decision(make_decision)" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "01702b6a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "For an alpha of 0.06 the decision is to: Reject H_0\n", + "For an alpha of 0.05 the decision is to: Do not reject H_0\n", + "For an alpha of 0.04 the decision is to: Do not reject H_0\n", + "For an alpha of 0.01 the decision is to: Do not reject H_0\n" + ] + } + ], + "source": [ + "alphas = [0.06, 0.05, 0.04, 0.01]\n", + "for alpha in alphas:\n", + " print(f\"For an alpha of {alpha} the decision is to: {make_decision(X_v, X_c, alpha = alpha)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "13632472", + "metadata": {}, + "source": [ + "**Congratulations on finishing this assignment!**\n", + "\n", + "Now you have created all the required steps to perform an AB test for a simple scenario!\n", + "\n", + "**This is the last assignment of the course and the specialization so give yourself a pat on the back for such a great accomplishment! Nice job!!!!**" + ] + }, + { + "cell_type": "markdown", + "id": "36a0e982", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "grader_version": "2", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}