Defining parameters
We will use a small sample dataset to show how the parameters can be defined to calculate average precision (AP) values. The following dataset has features from two multi-well plates with associated metadata.
Metadata_perturbation | Metadata_plate | Metadata_Well | Metadata_Sample_type | Feature 1 | Feature 2 |
---|---|---|---|---|---|
Treatment1 | P1 | A1 | Treated | 1000 | 300 |
Treatment2 | P1 | A2 | Treated | 300 | 100 |
NA | P1 | A3 | Control | 10 | 500 |
NA | P1 | B1 | Control | 15 | 438 |
Treatment1 | P1 | B2 | Treated | 700 | 400 |
Treatment2 | P1 | B3 | Treated | 250 | 75 |
Treatment1 | P2 | A1 | Treated | 750 | 250 |
Treatment2 | P2 | A2 | Treated | 250 | 150 |
NA | P2 | A3 | Control | 20 | 450 |
NA | P2 | B1 | Control | 17 | 525 |
Treatment1 | P2 | B2 | Treated | 800 | 325 |
Treatment2 | P2 | B3 | Treated | 250 | 87 |
Suppose a user is interested in computing the AP of the treated samples against the controls. Profiles of the same perturbation (its replicates) will be considered a positive pair, and a pair of control and perturbed profiles will be considered a negative pair. Let's see how we can define the parameters for this particular case:
The following two parameters define the positive pairs:
- pos_sameby - takes a list as input. A positive pair is defined using this parameter: profiles that share the same value in the listed columns form a positive pair. In the example above, replicates of a perturbation are the positive pairs, so any metadata that identifies the perturbation applied to a sample can be provided here. In this case, it will be the column 'Metadata_perturbation', e.g. pos_sameby = ['Metadata_perturbation']
- pos_diffby - takes a list as input. This parameter defines the profiles that should not be considered a positive pair while computing the metrics. For example, if we would like to exclude pairs of treated/perturbed replicates that come from the same plate or well position, then the metadata columns that identify the plate or well can be provided here. In this case, we will use 'Metadata_plate' as the input, e.g. pos_diffby = ['Metadata_plate']
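To make the two positive-pair parameters concrete, the sketch below enumerates the positive pairs of the example table in plain Python. It is only an illustration of the matching rule, not the copairs implementation: the helper `find_pairs` is hypothetical, and it assumes that a pair must agree on every `sameby` column, differ on every `diffby` column, and that a missing perturbation (the NA of control wells, written as `None`) never matches anything.

```python
from itertools import combinations

COLUMNS = ("Metadata_perturbation", "Metadata_plate",
           "Metadata_Well", "Metadata_Sample_type")
# Metadata rows of the example table; None stands in for NA.
ROWS = [
    ("Treatment1", "P1", "A1", "Treated"),
    ("Treatment2", "P1", "A2", "Treated"),
    (None, "P1", "A3", "Control"),
    (None, "P1", "B1", "Control"),
    ("Treatment1", "P1", "B2", "Treated"),
    ("Treatment2", "P1", "B3", "Treated"),
    ("Treatment1", "P2", "A1", "Treated"),
    ("Treatment2", "P2", "A2", "Treated"),
    (None, "P2", "A3", "Control"),
    (None, "P2", "B1", "Control"),
    ("Treatment1", "P2", "B2", "Treated"),
    ("Treatment2", "P2", "B3", "Treated"),
]
meta = [dict(zip(COLUMNS, row)) for row in ROWS]


def find_pairs(meta, sameby, diffby):
    """Hypothetical helper: index pairs equal on `sameby`, different on `diffby`."""
    pairs = []
    for i, j in combinations(range(len(meta)), 2):
        a, b = meta[i], meta[j]
        # Assumption of this sketch: a missing value never counts as a match.
        same = all(a[c] is not None and a[c] == b[c] for c in sameby)
        diff = all(a[c] != b[c] for c in diffby)
        if same and diff:
            pairs.append((i, j))
    return pairs


pos_sameby = ["Metadata_perturbation"]
pos_diffby = ["Metadata_plate"]
pos_pairs = find_pairs(meta, pos_sameby, pos_diffby)

for i, j in pos_pairs:
    print(ROWS[i][:3], "<->", ROWS[j][:3])
```

Under these assumptions, each treated profile on plate P1 is paired only with replicates of the same treatment on plate P2; same-plate replicates such as wells A1 and B2 of P1 are left out because of pos_diffby.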
The following two parameters define the negative pairs:
- neg_sameby - takes a list as input. This restricts which of the pairs selected by 'neg_diffby' are considered for calculating the metrics. If one is interested in taking the negative samples only from the same plate as the perturbed samples, then 'Metadata_plate' can be given here. This ensures that control profiles from different plates are excluded from the calculation, e.g. neg_sameby = ['Metadata_plate']
- neg_diffby - takes a list as input. neg_diffby defines what the perturbed samples are compared against (i.e. whether they are compared against the controls or against other perturbed samples). In this specific example, since we intend to differentiate the perturbed samples from the controls, 'Metadata_Sample_type' serves as the input, e.g. neg_diffby = ['Metadata_Sample_type']
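The negative-pair rule can be illustrated in the same way. The plain-Python sketch below is not the copairs implementation; the helper `is_negative_pair` is hypothetical and assumes that a negative pair must agree on every `neg_sameby` column and differ on every `neg_diffby` column.

```python
COLUMNS = ("Metadata_perturbation", "Metadata_plate",
           "Metadata_Well", "Metadata_Sample_type")
# Metadata rows of the example table; None stands in for NA.
ROWS = [
    ("Treatment1", "P1", "A1", "Treated"),
    ("Treatment2", "P1", "A2", "Treated"),
    (None, "P1", "A3", "Control"),
    (None, "P1", "B1", "Control"),
    ("Treatment1", "P1", "B2", "Treated"),
    ("Treatment2", "P1", "B3", "Treated"),
    ("Treatment1", "P2", "A1", "Treated"),
    ("Treatment2", "P2", "A2", "Treated"),
    (None, "P2", "A3", "Control"),
    (None, "P2", "B1", "Control"),
    ("Treatment1", "P2", "B2", "Treated"),
    ("Treatment2", "P2", "B3", "Treated"),
]
meta = [dict(zip(COLUMNS, row)) for row in ROWS]

neg_sameby = ["Metadata_plate"]
neg_diffby = ["Metadata_Sample_type"]


def is_negative_pair(a, b):
    """Hypothetical helper: same plate, different sample type."""
    same = all(a[c] == b[c] for c in neg_sameby)
    diff = all(a[c] != b[c] for c in neg_diffby)
    return same and diff


# Negatives for one query profile: Treatment1 on plate P1, well A1 (row 0).
query = 0
negatives = [j for j, row in enumerate(meta)
             if j != query and is_negative_pair(meta[query], row)]

# All negative pairs in the dataset, each counted once.
neg_pairs = [(i, j)
             for i in range(len(meta))
             for j in range(i + 1, len(meta))
             if is_negative_pair(meta[i], meta[j])]

print([ROWS[j][1:3] for j in negatives])
print(len(neg_pairs))
```

For the query in well A1 of plate P1, only the two control wells on P1 (A3 and B1) qualify as negatives; controls on P2 are excluded by neg_sameby.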
Other parameters:
- meta : takes a dataframe as input. A dataframe with only the metadata associated with the profiles should be provided.
- features : takes a NumPy array as input. The array should contain all feature values, without any NaNs.
- batch_size : takes an integer as input. This is the number of pairs that are processed together in one batch while computing the AP values.
Note: Pass the parameters in the order in which they appear in the function call.
Once the above parameters are defined, the following command can be used to calculate the average precision:
result = copairs.map.average_precision(meta, features, pos_sameby, pos_diffby, neg_sameby, neg_diffby, batch_size)
The output of the above step is a CSV file containing AP values for all the samples along with the details of the number of positive and negative pairs that were used for the calculation.
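As a worked illustration of what an AP value means for a single profile, the sketch below ranks the positives and negatives of one query profile from the example table by similarity and computes the AP of that ranking. It is a hand calculation in plain Python, not the copairs code; the use of cosine similarity on the raw feature values and the helper names `cosine` and `average_precision` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two 2-feature profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def average_precision(ranked_labels):
    """AP of a ranked list of labels (1 = positive, 0 = negative)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at each rank where a positive is found.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Query: Treatment1, plate P1, well A1 (Feature 1, Feature 2).
query = (1000, 300)
# Candidates as (features, label): Treatment1 replicates on P2 are positives,
# control wells on P1 are negatives.
candidates = [
    ((750, 250), 1),   # Treatment1, P2, A1
    ((800, 325), 1),   # Treatment1, P2, B2
    ((10, 500), 0),    # Control, P1, A3
    ((15, 438), 0),    # Control, P1, B1
]

# Rank candidates from most to least similar to the query.
ranked = sorted(candidates, key=lambda c: cosine(query, c[0]), reverse=True)
ranked_labels = [label for _, label in ranked]
ap = average_precision(ranked_labels)
print(ranked_labels, ap)
```

Here both positives rank above both controls, so the AP for this profile is 1.0; a ranking in which a control slipped between the two positives would lower it.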
The AP values can then be summarised as a mean average precision (mAP). This step groups the profiles based on the 'sameby' value provided by the user and calculates the mean of the AP values for each unique group. A single false-discovery-rate-corrected p-value is also calculated for each unique group.
- result - takes the output CSV that was obtained in the previous step.
- sameby - takes a list as input. In the example that we are discussing, since replicates of a perturbation are considered positive pairs, the metadata that we used to define pos_sameby can be used here as well (i.e. 'Metadata_perturbation').
- threshold - defines the p-value threshold below which a calculated mAP value will be considered significant.
- null_size - takes an integer as input. It defines the number of points in the null distribution.
- seed - takes an integer as input. It is the random seed, which makes the results reproducible.
mAP = copairs.map.mean_average_precision(result, sameby, null_size, threshold, seed)
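The grouping part of this step can be sketched in plain Python as follows. This only illustrates how per-profile AP values are averaged within each 'sameby' group; it does not compute the null distribution or the corrected p-values, and it is not the copairs implementation. The AP values and the column name `average_precision` are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-profile AP values, made up for illustration only.
ap_rows = [
    {"Metadata_perturbation": "Treatment1", "average_precision": 1.0},
    {"Metadata_perturbation": "Treatment1", "average_precision": 0.5},
    {"Metadata_perturbation": "Treatment2", "average_precision": 0.25},
    {"Metadata_perturbation": "Treatment2", "average_precision": 0.75},
]
sameby = ["Metadata_perturbation"]

# Collect the AP values of all profiles that share the same 'sameby' values.
groups = defaultdict(list)
for row in ap_rows:
    key = tuple(row[c] for c in sameby)
    groups[key].append(row["average_precision"])

# The mAP of a group is the mean of its AP values.
mean_ap = {key: mean(values) for key, values in groups.items()}

for key, value in mean_ap.items():
    print(key, value)
```

Each unique perturbation ends up with one mAP value, which is the quantity the threshold is then applied to via its p-value.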