Defining parameters

PaulaLlanos edited this page Aug 1, 2024 · 3 revisions

Defining groups and calculating the average precision (AP) values:

Parameters:

We will consider a small sample dataset to understand how parameters can be defined to calculate the AP values. The following dataset has features from two multi-well plates with associated metadata.

Metadata_perturbation  Metadata_plate  Metadata_Well  Metadata_Sample_type  Feature 1  Feature 2
Treatment1             P1              A1             Treated               1000       300
Treatment2             P1              A2             Treated               300        100
NA                     P1              A3             Control               10         500
NA                     P1              B1             Control               15         438
Treatment1             P1              B2             Treated               700        400
Treatment2             P1              B3             Treated               250        75
Treatment1             P2              A1             Treated               750        250
Treatment2             P2              A2             Treated               250        150
NA                     P2              A3             Control               20         450
NA                     P2              B1             Control               17         525
Treatment1             P2              B2             Treated               800        325
Treatment2             P2              B3             Treated               250        87
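
For readers who want to follow along, the sample table can be loaded into a pandas DataFrame. This is a minimal sketch: the variable names `rows`, `columns`, and `df` are illustrative assumptions, and the NA entries of the control wells are represented as `None`.

```python
import pandas as pd

# Rows copied from the sample table; None stands for the NA perturbation of control wells.
rows = [
    ("Treatment1", "P1", "A1", "Treated", 1000, 300),
    ("Treatment2", "P1", "A2", "Treated", 300, 100),
    (None, "P1", "A3", "Control", 10, 500),
    (None, "P1", "B1", "Control", 15, 438),
    ("Treatment1", "P1", "B2", "Treated", 700, 400),
    ("Treatment2", "P1", "B3", "Treated", 250, 75),
    ("Treatment1", "P2", "A1", "Treated", 750, 250),
    ("Treatment2", "P2", "A2", "Treated", 250, 150),
    (None, "P2", "A3", "Control", 20, 450),
    (None, "P2", "B1", "Control", 17, 525),
    ("Treatment1", "P2", "B2", "Treated", 800, 325),
    ("Treatment2", "P2", "B3", "Treated", 250, 87),
]
columns = [
    "Metadata_perturbation",
    "Metadata_plate",
    "Metadata_Well",
    "Metadata_Sample_type",
    "Feature 1",
    "Feature 2",
]
df = pd.DataFrame(rows, columns=columns)

# Quick sanity check: count the profiles of each sample type.
print(df.groupby("Metadata_Sample_type").size())
```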

Suppose a user is interested in computing the AP of the treated samples against the controls. Two replicate profiles of the same perturbation will be considered a positive pair, and a pair made of a control profile and a perturbed profile will be considered a negative pair. Let's see how we can define the parameters for this particular case:

The following two parameters define the positive pairs:

  • pos_sameby - takes a list as input. It lists the metadata columns whose values must be the same for two profiles to form a positive pair. In the example above, replicates of the same perturbation are positive pairs, so the column that identifies the perturbation is provided here: 'Metadata_perturbation'. e.g., pos_sameby = ['Metadata_perturbation']

  • pos_diffby - takes a list as input. It lists the metadata columns whose values must differ between the two profiles of a positive pair. For example, to avoid pairing replicates of a treated/perturbed sample that come from the same plate or well position, provide the corresponding metadata columns here. In this case, we will use 'Metadata_plate' as the input. e.g., pos_diffby = ['Metadata_plate']

The following two parameters define the negative pairs:

  • neg_sameby - takes a list as input. It lists the metadata columns whose values must be the same for the two profiles of a negative pair, which restricts the pairs selected by neg_diffby. If negative samples should come only from the same plate as the perturbed samples, then 'Metadata_plate' can be given here. This ensures that control profiles from other plates are excluded from the calculation. e.g., neg_sameby = ['Metadata_plate']

  • neg_diffby - takes a list as input. It lists the metadata columns whose values must differ between the two profiles of a negative pair, and so defines what the perturbed samples are compared against (i.e., controls or other perturbed samples). In this example, since we intend to distinguish the perturbed samples from the controls, 'Metadata_Sample_type' is the input. e.g., neg_diffby = ['Metadata_Sample_type']
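
To make the sameby/diffby rules concrete, the sketch below enumerates the pairs they select on the sample dataset using plain Python. It does not call copairs; `find_pairs` is a hypothetical helper, and it assumes that a pair must agree on every sameby column, differ on every diffby column, and that missing values (NA) never count as equal.

```python
from itertools import combinations

fields = ("Metadata_perturbation", "Metadata_plate", "Metadata_Well", "Metadata_Sample_type")
# Both plates in the sample table share the same well layout.
layout = [
    ("Treatment1", "A1", "Treated"),
    ("Treatment2", "A2", "Treated"),
    (None, "A3", "Control"),
    (None, "B1", "Control"),
    ("Treatment1", "B2", "Treated"),
    ("Treatment2", "B3", "Treated"),
]
profiles = [
    dict(zip(fields, (pert, plate, well, kind)))
    for plate in ("P1", "P2")
    for pert, well, kind in layout
]


def find_pairs(profiles, sameby, diffby):
    """Hypothetical helper: index pairs equal on all `sameby` and different on all `diffby`."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(profiles), 2):
        # Assumption: a missing value (None) is never equal to anything.
        same = all(a[c] is not None and a[c] == b[c] for c in sameby)
        diff = all(a[c] != b[c] for c in diffby)
        if same and diff:
            pairs.append((i, j))
    return pairs


pos_pairs = find_pairs(profiles, ["Metadata_perturbation"], ["Metadata_plate"])
neg_pairs = find_pairs(profiles, ["Metadata_plate"], ["Metadata_Sample_type"])
print(len(pos_pairs), "positive pairs;", len(neg_pairs), "negative pairs")
```

With these settings, each treated profile is paired with its two replicates on the other plate (positive pairs) and with the two controls on its own plate (negative pairs).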

Other parameters:

  • meta - takes a dataframe as input. A dataframe with only the metadata associated with the profiles should be provided.
  • features - takes a NumPy array as input. A NumPy array of all feature values, without any NaNs.
  • batch_size - takes an integer as input. This is the number of profile pairs processed in each batch when computing the AP values.

Note: Pass the arguments in the order in which they appear in the function call below.
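
One way to prepare `meta` and `features` is to split a profile table by column name. The sketch below assumes, as in the sample dataset, that metadata columns start with "Metadata_" and that every other column is a feature; the three-row table is only a small illustrative excerpt.

```python
import numpy as np
import pandas as pd

# Small excerpt of the sample dataset (first three wells of plate P1).
df = pd.DataFrame(
    {
        "Metadata_perturbation": ["Treatment1", "Treatment2", None],
        "Metadata_plate": ["P1", "P1", "P1"],
        "Metadata_Sample_type": ["Treated", "Treated", "Control"],
        "Feature 1": [1000, 300, 10],
        "Feature 2": [300, 100, 500],
    }
)

# Assumption: metadata columns are identified by the "Metadata_" prefix.
is_meta = df.columns.str.startswith("Metadata_")
meta = df.loc[:, is_meta]
features = df.loc[:, ~is_meta].to_numpy(dtype=float)

# The feature array must not contain NaNs.
if np.isnan(features).any():
    raise ValueError("features contains NaN values; impute or drop them first")
```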

Once the above parameters are defined, the following command can be used to calculate the average precision:

result = copairs.map.average_precision(meta, features, pos_sameby, pos_diffby, neg_sameby, neg_diffby, batch_size)

The output of the above step is a table (a pandas DataFrame in current copairs releases) containing the AP value of every sample, along with the number of positive and negative pairs that were used for the calculation.
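
As a reminder of what an AP value measures, the sketch below computes average precision for a single profile from a ranked list of its pairs. It is an illustration of the standard definition, not the copairs implementation; `ap_from_ranking` and the similarity values are hypothetical.

```python
def ap_from_ranking(similarities, is_positive):
    """Average precision of one ranked list: mean of precision@k over the positive ranks."""
    # Rank the pairs from most to least similar.
    ranked = sorted(zip(similarities, is_positive), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else float("nan")


# Hypothetical similarities of one treated profile to two replicates (positive pairs)
# and two controls (negative pairs).
similarities = [0.9, 0.8, 0.4, 0.2]
is_positive = [True, False, True, False]
ap = ap_from_ranking(similarities, is_positive)
print(round(ap, 3))
```

Here the positives sit at ranks 1 and 3, so the AP is (1/1 + 2/3) / 2. An AP of 1 would mean that both replicates rank above every control.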

Calculating the mAP values:

This step groups the profiles based on the ‘sameby’ value provided by the user and calculates the mean of the AP values for each unique group. A single false-discovery rate corrected p-value is also calculated for each of the unique groups.

Parameters:

  • result - takes the output of the previous step (the table of AP values).
  • sameby - takes a list as input. In the example that we are discussing, since replicates of a perturbation form the positive pairs, the metadata that we used to define pos_sameby can be used here as well (i.e., 'Metadata_perturbation').
  • threshold - defines the p-value threshold: a group's mAP is considered significant when its corrected p-value is below this value.
  • null_size - takes an integer as input. It defines the number of points in the null distribution.
  • seed - takes an integer as input. It seeds the random number generator so that the null distribution is reproducible.

mAP = copairs.map.mean_average_precision(result, sameby, null_size, threshold, seed)
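
The grouping part of this step can be sketched in plain Python: AP values are collected per unique 'sameby' value and averaged. The AP values below are hypothetical, and the sketch leaves out the null distribution and the p-value correction.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-profile AP values, keyed by Metadata_perturbation.
ap_rows = [
    ("Treatment1", 1.0),
    ("Treatment1", 0.5),
    ("Treatment1", 1.0),
    ("Treatment1", 0.5),
    ("Treatment2", 0.5),
    ("Treatment2", 0.25),
    ("Treatment2", 0.75),
    ("Treatment2", 0.5),
]

# Group the AP values by perturbation, then average each group.
grouped = defaultdict(list)
for perturbation, ap_value in ap_rows:
    grouped[perturbation].append(ap_value)

mean_ap = {perturbation: mean(values) for perturbation, values in grouped.items()}
print(mean_ap)
```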