
Add to the survey section a component on selection of clusters with probability proportional to size #88

Open
pbkeating opened this issue Jan 19, 2022 · 3 comments


pbkeating commented Jan 19, 2022

At MSF, we have an Excel tool that supports identification of clusters with probability proportional to size, but this can also be done in R.

Below is a first attempt at doing this, with a sample dataset included for testing purposes.
I've validated it using two datasets: one previously used in MSF activities and one from this WHO document: https://www.who.int/tb/advisory_bodies/impact_measurement_taskforce/meetings/prevalence_survey/psws_probability_prop_size_bierrenbach.pdf

<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
gen_data
--------------------------------------------------------------------------------
This section is for generating a fake dataset to test out the code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->

```{r gen_data}
## set seed
set.seed(50)

## Number of locations to select from
n <- 20

## Prefix
prefix <- "location "

## Suffix
suffix <- seq_len(n)

## Combine to create basic cluster selection dataset
clusters <- data.frame(location_name = paste0(prefix, suffix),
                       location_population = sample(1000:25000, n, replace = TRUE))
```

<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
read_data
--------------------------------------------------------------------------------
This section is for importing your actual location and population data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->

```{r read_data, warning = FALSE, message = FALSE}

### Read in location and population data ---------------------------------------

## Excel file ------------------------------------------------------------------
## read in location data sheet
# clusters <- rio::import(here::here("03 Sampling files", "cluster_data.xlsx"),
#                         na = ".")
```
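Because the later steps take a cumulative sum of `location_population`, a missing value in the imported data would propagate `NA` into every subsequent row of the selection. A minimal sketch of a pre-check is shown below. It assumes the imported data frame uses the same `location_name` and `location_population` columns as the generated test data; the function name `check_clusters` is hypothetical and not part of the original proposal.

```r
## Hypothetical helper: stop early if the sampling frame is unusable
check_clusters <- function(clusters) {
  stopifnot(
    is.data.frame(clusters),
    ## assumed column names, matching the generated test dataset
    all(c("location_name", "location_population") %in% names(clusters)),
    is.numeric(clusters$location_population),
    ## cumsum() propagates NA, so missing populations must be fixed first
    !anyNA(clusters$location_population),
    ## negative populations would make the cumulative sum decrease
    all(clusters$location_population >= 0),
    ## each location should appear only once in the sampling frame
    !anyDuplicated(clusters$location_name)
  )
  ## return the input invisibly so the check can sit inside a pipeline
  invisible(clusters)
}
```

Calling `check_clusters(clusters)` straight after the import would raise an error before any cluster selection is attempted.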


<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
identify_clusters
--------------------------------------------------------------------------------
This section specifies or calculates the following:
- the total population in the survey area
- the number of clusters for the survey
- the sampling interval, which is the total population divided by the number of clusters in the survey
- the random starting point

These figures are then combined in a for loop to work out how many clusters fall in each location, which gives the list of locations to survey
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
```{r identify_clusters}
## Set seed to ensure the random start remains the same each time
set.seed(50)

## Calculate total population
total_pop <- sum(clusters$location_population, na.rm = TRUE)

## Calculate cumulative sum of the population
clusters$cum_sum <- cumsum(clusters$location_population)

## Specify the number of clusters
cluster_number <- 10

## Calculate sampling interval and round it to the nearest whole number
sampling_interval <- round(total_pop / cluster_number, digits = 0)


## Select a random starting point between 1 and the sampling interval
random_start <- sample(1:sampling_interval,1)


## This for loop identifies the locations to survey:
## number_clusters is the number of clusters to draw from each location,
## cum_clusters is the running total of clusters selected so far
clusters$number_clusters <- NA_integer_
clusters$cum_clusters <- NA_integer_

for (i in seq_along(clusters$cum_sum)) {
  ## number of sampling points at or below this location's cumulative population
  points_so_far <- as.integer((clusters$cum_sum[i] - random_start) / sampling_interval + 1)
  if (i == 1) {
    clusters$number_clusters[i] <- points_so_far
  } else {
    clusters$number_clusters[i] <- points_so_far - clusters$cum_clusters[i - 1]
  }
  clusters$cum_clusters[i] <- points_so_far
}
```
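The loop above can be cross-checked with a vectorised version. The sketch below is an assumed equivalent formulation, not part of the original proposal: it counts how many sampling points (`random_start`, `random_start + sampling_interval`, and so on) fall at or below each location's cumulative population, then takes differences to get the number of clusters per location. The function name `pps_select` and its arguments are hypothetical.

```r
## Hypothetical helper: systematic PPS selection without a loop
pps_select <- function(population, cluster_number, random_start = NULL) {
  total_pop <- sum(population)
  sampling_interval <- round(total_pop / cluster_number, digits = 0)

  ## draw the random start only if the caller did not supply one
  if (is.null(random_start)) {
    random_start <- sample(seq_len(sampling_interval), 1)
  }

  cum_sum <- cumsum(population)

  ## running count of sampling points at or below each cumulative population
  cum_clusters <- pmax(floor((cum_sum - random_start) / sampling_interval + 1), 0)

  ## clusters allocated to each location
  number_clusters <- diff(c(0, cum_clusters))

  data.frame(cum_sum = cum_sum,
             number_clusters = number_clusters,
             cum_clusters = cum_clusters)
}

## Small worked example: interval = 1000 / 2 = 500, sampling points 250 and 750
pps_select(c(100, 200, 300, 400), cluster_number = 2, random_start = 250)
```

In the worked example, the sampling points 250 and 750 fall in the second and fourth locations, so those two rows get one cluster each. When the total population is not an exact multiple of `cluster_number`, rounding the sampling interval can occasionally make the number of selected clusters differ from `cluster_number` by one, so it is worth checking `sum(number_clusters)` after selection.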


@pbkeating pbkeating changed the title Add to the survey section with a component on selection of clusters with probability proportional to size Add to the survey section a component on selection of clusters with probability proportional to size Jan 19, 2022
Contributor

aspina7 commented Jan 19, 2022

@AlexandreBlake just for info, in case you get round to sampling

@AlexandreBlake
Contributor

Thanks @pbkeating ! I was planning to generate data for a sampling frame at some point. 1 less thing to do.

Contributor

aspina7 commented Sep 2, 2024

for #102
