UofT-DSI · MRKGITCODE · Nov 11, 2024 · Nov 19, 2024
diff --git a/02_activities/assignments/a1_sampling_and_reproducibility.md b/02_activities/assignments/a1_sampling_and_reproducibility.md
@@ -12,10 +12,66 @@ Alter the code so that it is reproducible. Describe the changes you made to the
 
 # Author: YOUR NAME
 
-```
-Please write your explanation here...
 
-```
+MD RAZAUL KARIM 
+
+
+The linked blog post examines the potential impact of sampling bias in contact tracing on data about the
+sources of COVID-19 infection in Toronto, Ontario, Canada. Institutional outbreaks have consistently accounted for a large proportion of all COVID-19 cases. The blog post highlights how the sampling procedure for contact tracing introduces a systematic bias, resulting in an incomplete and non-representative sample of COVID-19 cases. Here’s a breakdown of the sampling procedure in the model code provided, and how it aligns with the explanation in the blog:
+
+### Event Attendance Sampling
+
+- **Sample size:** The entire population of 1,000 people.
+- **Sampling frame:** All individuals in the population, each of whom attends exactly one of two event types:
+  - Weddings: 2 events with 100 attendees each.
+  - Brunches: 80 events with 10 attendees each.
+- **Relation to the blog:** This setup matches the blog’s assumption that some gatherings (like brunches) involve many small groups, while others (like weddings) involve a few large groups.
+- **Underlying distribution:** None. Each individual attends exactly one event, which creates distinct groups by event type and size; this is a fixed assignment of people to events, not a probabilistic selection.
+### Primary Infection Sampling
+
+- **Sample size:** About 10% of all attendees (roughly 100 of the 1,000 in expectation) are marked as infected by independent random sampling.
+- **Sampling frame:** All attendees across all events, each with the same probability of infection.
+- **Underlying distribution:** Each attendee has a 10% chance of being infected, independently of the others (i.i.d.), so the number of infections at each event follows a binomial distribution.
+- **Relation to the blog:** This sampling mimics the real-world randomness of transmission within an event: infection is independent between people and equally likely at both event types, so it reflects the true infection rate without bias.
+### Primary Contact Tracing Sampling
+
+- **Sample size:** Only a subset of infected individuals is successfully contact-traced (20% tracing success).
+- **Sampling frame:** The infected individuals across all events.
+- **Underlying distribution:** Each infected individual has a 20% chance of being traced, independently of the others, so the number of traced infections follows a binomial distribution.
+- **Relation to the blog:** This aligns with the blog’s mention of resource constraints and imperfect recall in contact tracing, which make only a fraction of infections traceable. Because this step is random and uses the same probability at every event, it leaves most infections untraced but does not by itself favor either event type.
+### Secondary Contact Tracing
+
+- **Procedure:** If at least two infected individuals at an event are successfully traced, the model assumes that all infections at that event are then traced.
+- **Sampling frame:** Only events with two or more successfully traced cases qualify for secondary tracing.
+- **Underlying process:** Secondary tracing amplifies certain events: a large event (like a wedding) is far more likely than a small one to reach the two-case threshold, so more of its infections are identified. The effect is deterministic once the threshold is met, which biases the sample toward settings with more traceable cases (for example, weddings).
+- **Relation to the blog:** This closely follows the blog’s discussion, in which events that are easier to trace (e.g., weddings or jails) are systematically overrepresented in tracing results.
+### Comparison to the Original Blog Post
+Running `whitby_covid_tracing.py` with the originally specified 50,000 repetitions should produce output close to the graphs in Whitby’s blog post, provided the model assumptions match. With that many repetitions, the distributions of infection sources and tracing outcomes are estimated precisely, so the histograms are smooth.
+
+### Modifying the Number of Repetitions
+Reducing the number of repetitions to 1,000 and running the script several times is likely to give different results from run to run. Fewer repetitions mean more sampling noise in each histogram, so the histograms look less smooth, vary between runs, and are less reliable for inference about the underlying trends.
+
+### Making the Code Reproducible
+To ensure reproducibility, set a fixed random seed before any sampling takes place:
+
+Python code:
+
+    import os
+    import pandas as pd
+    import numpy as np
+    import matplotlib.pyplot as plt
+    import seaborn as sns
+
+    # Set a random seed once, before any sampling, for reproducibility
+    np.random.seed(42)
+
+    # The rest of the code remains unchanged:
+    # define constants, functions, and perform the simulation
+Adding `np.random.seed(42)` once, before any sampling, makes the script produce the same results on every run. The seed fixes the sequence of pseudo-random numbers that NumPy generates, so the same code run with the same NumPy version draws the same samples each time. The output does not need to match Whitby’s exact graphs, but the histograms are now stable and repeatable from one execution of the script to the next.
+
+**True infection rates** reflect the actual proportion of infections across the different events, unaffected by tracing ability.
+**Observed infection rates** are heavily skewed by secondary contact tracing, which makes events like weddings appear more prominent in the observed infection data than they are in reality.
+This skewed sampling mirrors the real-world limitations and biases highlighted in the blog post, where limited tracing resources result in overrepresentation of certain high-visibility events, creating a misleading view of transmission sources.
 
 
 ## Criteria