Updated Assignment1 #31

Open
wants to merge 1 commit into
base: main
70 changes: 67 additions & 3 deletions 02_activities/assignments/a1_sampling_and_reproducibility.md
@@ -10,10 +10,74 @@ Modify the number of repetitions in the simulation to 1000 (from the original 50

Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script file. The output does not need to match Whitby’s original blogpost/graphs, it just needs to produce the same output when run multiple times

# Author: YOUR NAME
# Author: ANSHU DWIVEDI

```
Please write your explanation here...
1. Examining the Code in whitby_covid_tracing.py

Sampling Stages in the Model

The sampling stages in the `whitby_covid_tracing.py` script occur as follows:

- Infection Sampling: Within the `simulate_event` function, a subset of attendees at events is randomly infected based on an `ATTACK_RATE` (10% of the population). This is done through `np.random.choice`, which randomly selects individuals to infect.

- Primary Tracing Sampling: Using `np.random.rand`, the code determines whether each infected individual is traced, with a 20% probability of success (`TRACE_SUCCESS`).

- Secondary Tracing Sampling: Further tracing occurs when events, like weddings, exceed a threshold of traced cases (defined by `SECONDARY_TRACE_THRESHOLD = 2`), introducing a bias by prioritizing certain events for tracing.

Sampling Procedure Description

The sampling procedure simulates infection spread and contact tracing within two groups, wedding attendees and brunch attendees, to illustrate how contact tracing is biased toward specific types of gatherings.

Referenced Functions:

- `simulate_event`: Manages infection and tracing logic.
- `np.random.choice`: Randomly selects infected individuals using a uniform distribution.
- `np.random.rand`: Draws uniform values on [0, 1); comparing each value with `TRACE_SUCCESS` gives a Bernoulli outcome (20% probability) that decides if an infected person is traced.

Sample Size: Each simulation involves 1,000 individuals, with 200 attending weddings and 800 attending brunches.

Sampling Frame: Consists of all event attendees, categorized by event type.

Underlying Distributions:

- Uniform Distribution: For selecting infected individuals.
- Bernoulli Distribution: For deciding tracing outcomes.

Relation to the Blog Post: The code models Whitby’s observations on contact tracing bias, favoring large, easily traceable events like weddings. The uniform and Bernoulli draws supply the randomness; the bias itself comes from the secondary tracing step, which traces additional cases at events that already have enough traced cases to meet `SECONDARY_TRACE_THRESHOLD`.

2. Running the Script and Comparing Results

After running the Python script, a histogram was generated to compare actual infections at weddings to cases traced to weddings. These results support Andrew Whitby’s insights on contact tracing biases.

Observations from the Graph:

- Actual Infections vs. Traced Cases: One bar set reflects actual infections from weddings, while another shows cases traced to weddings.
- Illustrated Bias: The proportion of traced cases attributed to weddings is higher than the proportion of actual infections that occurred at weddings, aligning with Whitby’s point that tracing can skew data to make certain events, like weddings, appear to be more significant sources of infection.

Does the Code Reproduce the Graphs from the Original Blog Post?

Yes, the code replicates Whitby’s blog findings. The generated histogram demonstrates the tracing bias by showing two main aspects:
- The actual infections originating from weddings.
- The traced cases associated with weddings.

The resulting graph confirms Whitby’s argument about the inherent bias in contact tracing, which makes certain events seem more infection-prone than they are.

3. Modifying Repetitions for Reproducibility Observations

Reducing the simulation’s repetitions from 50,000 to 1,000 revealed notable changes in result consistency across runs:

Observations:

- Higher Variability: With only 1,000 repetitions, each run exhibited more variability, highlighting how random fluctuations impact results more at a smaller scale. This caused differences in distribution shapes and frequencies between runs.
- Less Smooth Histograms: The histograms appeared more irregular, with gaps and spikes, unlike the smoother distribution from 50,000 repetitions.
- Lower Reproducibility: The tracing bias trend remained visible, but case proportions fluctuated between runs, making exact results harder to replicate than with a larger number of repetitions.

Conclusion on Reproducibility: With fewer repetitions, results were less reproducible due to greater random variation. While the general tracing bias trend was present, specific values varied between runs. Larger repetition counts, like 50,000, give more stable results because random variation between runs has less influence on the overall distribution.

4. Altering Code for Reproducibility

To ensure consistent results, I added `np.random.seed(42)` near the top of the script, before any random numbers are drawn. Setting this seed makes `np.random` functions like `np.random.choice` and `np.random.rand` generate the same "random" values each time the script runs, so repeated runs with the same code and repetition count produce identical output.

```
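
The two random sampling stages and the effect of the seed can be illustrated with a minimal sketch. This is not the code from `whitby_covid_tracing.py`: the helper `sample_once` is hypothetical, and drawing the infected individuals without replacement is an assumption. Only `ATTACK_RATE`, `TRACE_SUCCESS`, `np.random.choice`, `np.random.rand`, and `np.random.seed(42)` come from the description above.

```python
import numpy as np

ATTACK_RATE = 0.10    # share of attendees infected (from the write-up)
TRACE_SUCCESS = 0.20  # chance an infected person is traced (from the write-up)


def sample_once(n_people=1000):
    """Hypothetical helper: one infection draw followed by one primary-tracing draw."""
    # Stage 1: choose 10% of attendees uniformly at random.
    # replace=False (no one is infected twice) is an assumption of this sketch.
    infected = np.random.choice(n_people, size=int(n_people * ATTACK_RATE), replace=False)
    # Stage 2: uniform draws on [0, 1) compared with TRACE_SUCCESS,
    # which yields a Bernoulli(0.2) outcome per infected person.
    traced = np.random.rand(infected.size) < TRACE_SUCCESS
    return infected, traced


np.random.seed(42)
first = sample_once()

np.random.seed(42)  # resetting the seed restarts the same random stream
second = sample_once()

same_after_reseed = (
    np.array_equal(first[0], second[0]) and np.array_equal(first[1], second[1])
)
print("identical after reseeding:", same_after_reseed)
```

Because the seed is global state, it has to be set before the first random call; any draw made earlier would shift every value that follows.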

@@ -27,7 +91,7 @@ Please write your explanation here...

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
🚨 Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md) 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
10 changes: 7 additions & 3 deletions 02_activities/assignments/whitby_covid_tracing.py
@@ -4,6 +4,10 @@
import matplotlib.pyplot as plt
import seaborn as sns

# Set a seed for reproducibility
np.random.seed(42)


# Note: Suppressing FutureWarnings to maintain a clean output. This is specifically to ignore warnings about
# deprecated features in the libraries we're using (e.g., 'use_inf_as_na' option in Pandas, used by Seaborn),
# which we currently have no direct control over. This action is taken to ensure that our output remains
@@ -67,8 +71,8 @@ def simulate_event(m):

return p_wedding_infections, p_wedding_traces

# Run the simulation 50000 times
results = [simulate_event(m) for m in range(50000)]
# Run the simulation 1000 times (reduced from the original 50000 repetitions)
results = [simulate_event(m) for m in range(1000)]
props_df = pd.DataFrame(results, columns=["Infections", "Traces"])

# Plotting the results
@@ -80,4 +84,4 @@ def simulate_event(m):
plt.title("Impact of Contact Tracing on Perceived Infection Sources")
plt.legend()
plt.tight_layout()
plt.show()
plt.show()
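
The drop from 50000 to 1000 repetitions can be related to run-to-run variability with a simplified stand-in for the simulation. The sketch below is an assumption-laden illustration, not the script's logic: it models only the infection stage, drawing the number of wedding infections from a hypergeometric distribution (100 infected among 200 wedding and 800 brunch attendees), and it uses NumPy's `default_rng` generator instead of the global `np.random` state. The helper `batch_mean` and the choice of 20 batches are hypothetical.

```python
import numpy as np


def batch_mean(n_reps, rng):
    """Mean wedding share of infections over n_reps simplified repetitions."""
    # 100 infected people drawn without replacement from
    # 200 wedding attendees ("good") and 800 brunch attendees ("bad").
    wedding_infected = rng.hypergeometric(ngood=200, nbad=800, nsample=100, size=n_reps)
    return (wedding_infected / 100).mean()


rng = np.random.default_rng(42)  # local generator, seeded for repeatability
spread = {}
for n_reps in (1000, 50000):
    # Each batch plays the role of one unseeded run of the script.
    means = [batch_mean(n_reps, rng) for _ in range(20)]
    spread[n_reps] = float(np.std(means))
    print(f"{n_reps} repetitions: std of run means = {spread[n_reps]:.5f}")
```

Under these assumptions the summary statistic varies noticeably more between runs at 1000 repetitions than at 50000, which is consistent with the less stable histograms described in the write-up.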