Merge pull request #68 from alxndrkalinin/v0.4.2

v0.4.2
cytomining · Oct 22, 2024 · 44378ed
2 parents fc829c0 + 7d47818

Showing 22 changed files with 3,600 additions and 157 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -16,7 +16,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
 
     steps:
     - uses: actions/checkout@v3
@@ -26,17 +26,8 @@ jobs:
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip build
-        python -m pip install flake8 pytest
-        python -m build
-        pip install -e .
-    - name: Lint with flake8
-      run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+        python -m pip install --upgrade pip
+        pip install -e .[test]
     - name: Test with pytest
       run: |
-        python -m pip install scikit-learn
         pytest
diff --git a/.github/workflows/ruff.yml b/.github/workflows/ruff.yml
@@ -0,0 +1,11 @@
+name: Ruff
+on: [push, pull_request]
+jobs:
+  ruff:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/ruff-action@v1
+      - uses: astral-sh/ruff-action@v1
+        with:
+          args: "format --check"
diff --git a/README.md b/README.md
@@ -1,119 +1,60 @@
  # copairs
 
-Find pairs and compute metrics between them.
+`copairs` is a Python package for finding groups of profiles based on metadata and calculating mean Average Precision to assess intra- vs. inter-group similarities.
 
-## Installation
+## Getting started
 
-```bash
-pip install git+https://github.com/cytomining/[email protected]
-```
-
-## Usage
+### System requirements
+copairs supports Python 3.8+ and should work with all modern operating systems (tested with MacOS 13.5, Ubuntu 18.04, Windows 10).
 
-### Data
+### Dependencies
+copairs depends on widely used Python packages:
+* numpy
+* pandas
+* tqdm
+* statsmodels
+* [optional] plotly
 
-Say you have a dataset with 20 samples taken in 3 plates `p1, p2, p3`,
-each plate is composed of 5 wells `w1, w2, w3, w4, w5`, and each well 
-has one or more labels (`t1, t2, t3, t4`) assigned.
+### Installation
 
-```python
-import pandas as pd
-import random
-
-random.seed(0)
-n_samples = 20
-dframe = pd.DataFrame({
-    'plate': [random.choice(['p1', 'p2', 'p3']) for _ in range(n_samples)],
-    'well': [random.choice(['w1', 'w2', 'w3', 'w4', 'w5']) for _ in range(n_samples)],
-    'label': [random.choice(['t1', 't2', 't3', 't4']) for _ in range(n_samples)]
-})
-dframe = dframe.drop_duplicates()
-dframe = dframe.sort_values(by=['plate', 'well', 'label'])
-dframe = dframe.reset_index(drop=True)
+To install copairs and dependencies, run:
+```bash
+pip install copairs
 ```
 
-|    | plate   | well   | label   |
-|---:|:--------|:-------|:--------|
-|  0 | p1      | w2     | t4      |
-|  1 | p1      | w3     | t2      |
-|  2 | p1      | w3     | t4      |
-|  3 | p1      | w4     | t1      |
-|  4 | p1      | w4     | t3      |
-|  5 | p2      | w1     | t1      |
-|  6 | p2      | w2     | t1      |
-|  7 | p2      | w3     | t1      |
-|  8 | p2      | w3     | t2      |
-|  9 | p2      | w3     | t3      |
-| 10 | p2      | w4     | t2      |
-| 11 | p2      | w5     | t1      |
-| 12 | p2      | w5     | t3      |
-| 13 | p3      | w1     | t3      |
-| 14 | p3      | w1     | t4      |
-| 15 | p3      | w4     | t2      |
-| 16 | p3      | w5     | t2      |
-| 17 | p3      | w5     | t4      |
-
-### Getting valid pairs
-
-To get pairs of samples that share the same `label` but come from different
-`plate`s at different `well` positions:
-
-```python
-from copairs import Matcher
-matcher = Matcher(dframe, ['plate', 'well', 'label'], seed=0)
-pairs_dict = matcher.get_all_pairs(sameby=['label'], diffby=['plate', 'well'])
+To also install dependencies for running examples, run:
+```bash
+pip install copairs[demo]
 ```
 
-`pairs_dict` is a `label_id: pairs` dictionary containing the list of valid
-pairs for every unique value of `labels`
+### Testing
 
-```
-{'t4': [(0, 17), (0, 14), (17, 2), (2, 14)],
- 't2': [(1, 16), (1, 10), (1, 15), (8, 16), (8, 15), (10, 16)],
- 't1': [(3, 11), (3, 5), (3, 6), (3, 7)],
- 't3': [(9, 4), (9, 13), (13, 4), (13, 12), (4, 12)]}
+To run tests, run:
+```bash
+pip install -e .[test]
+pytest
 ```
 
-### Getting valid pairs from a multilabel column
-
-For efficiency reasons, you may not want to have duplicated rows. You can
-group all the labels in a single row and use `MatcherMultilabel` to find the
-corresponding pairs:
+## Usage
 
-```python
-dframe_multi = dframe.groupby(['plate', 'well'])['label'].unique().reset_index()
-```
+We provide examples demonstrating how to use copairs for:
+- [grouping profiles based on their metadata](./examples/finding_pairs.ipynb)
+- [calculating mAP to assess phenotypic activity and consistency of perturbation using real data](./examples/mAP_demo.ipynb)
 
-|    | plate   | well   | label              |
-|---:|:--------|:-------|:-------------------|
-|  0 | p1      | w2     | ['t4']             |
-|  1 | p1      | w3     | ['t2', 't4']       |
-|  2 | p1      | w4     | ['t1', 't3']       |
-|  3 | p2      | w1     | ['t1']             |
-|  4 | p2      | w2     | ['t1']             |
-|  5 | p2      | w3     | ['t1', 't2', 't3'] |
-|  6 | p2      | w4     | ['t2']             |
-|  7 | p2      | w5     | ['t1', 't3']       |
-|  8 | p3      | w1     | ['t3', 't4']       |
-|  9 | p3      | w4     | ['t2']             |
-| 10 | p3      | w5     | ['t2', 't4']       |
 
-```python
-from copairs import MatcherMultilabel
-matcher_multi = MatcherMultilabel(dframe_multi,
-                                  columns=['plate', 'well', 'label'],
-                                  multilabel_col='label',
-                                  seed=0)
-pairs_multi = matcher_multi.get_all_pairs(sameby=['label'],
-                                          diffby=['plate', 'well'])
-```
+## Citation
+If you find this work useful for your research, please cite our [pre-print](https://doi.org/10.1101/2024.04.01.587631):
 
-`pairs_multi` is also a `label_id: pairs` dictionary with the same
-structure discussed before:
+Kalinin, A.A., Arevalo, J., Vulliard, L., Serrano, E., Tsang, H., Bornholdt, M., Rajwa, B., Carpenter, A.E., Way, G.P. and Singh, S., 2024. A versatile information retrieval framework for evaluating profile strength and similarity. bioRxiv, pp.2024-04. doi:10.1101/2024.04.01.587631
 
+BibTeX:
 ```
-{'t4': [(0, 10), (0, 8), (10, 1), (1, 8)],
- 't2': [(1, 10), (1, 6), (1, 9), (5, 10), (5, 9), (6, 10)],
- 't1': [(2, 7), (2, 3), (2, 4), (2, 5)],
- 't3': [(5, 2), (5, 8), (8, 2), (8, 7), (2, 7)]}
+@article{kalinin2024versatile,
+  title={A versatile information retrieval framework for evaluating profile strength and similarity},
+  author={Kalinin, Alexandr A and Arevalo, John and Vulliard, Loan and Serrano, Erik and Tsang, Hillary and Bornholdt, Michael and Rajwa, Bartek and Carpenter, Anne E and Way, Gregory P and Singh, Shantanu},
+  journal={bioRxiv},
+  pages={2024--04},
+  year={2024},
+  doi={10.1101/2024.04.01.587631}
+}
 ```
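
Note on the removed "Getting valid pairs" section: the pairing rule it described (same `label`, different `plate` and different `well`) can be illustrated without copairs. The snippet below is a minimal standard-library sketch, not the package's implementation. The helper name `find_pairs` is hypothetical, and the rows are copied from the table in the removed README text. The printed pair order will likely differ from the removed example output, which was produced by a seeded `Matcher`.

```python
from itertools import combinations

# (plate, well, label) rows copied from the table in the removed README
# section; the list position is the row index shown in that table.
rows = [
    ("p1", "w2", "t4"), ("p1", "w3", "t2"), ("p1", "w3", "t4"),
    ("p1", "w4", "t1"), ("p1", "w4", "t3"), ("p2", "w1", "t1"),
    ("p2", "w2", "t1"), ("p2", "w3", "t1"), ("p2", "w3", "t2"),
    ("p2", "w3", "t3"), ("p2", "w4", "t2"), ("p2", "w5", "t1"),
    ("p2", "w5", "t3"), ("p3", "w1", "t3"), ("p3", "w1", "t4"),
    ("p3", "w4", "t2"), ("p3", "w5", "t2"), ("p3", "w5", "t4"),
]


def find_pairs(rows):
    """Hypothetical helper: group index pairs by shared label.

    A pair is kept only if both rows have the same label and differ in
    BOTH plate and well, as stated in the removed README text.
    """
    pairs = {}
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        same_label = a[2] == b[2]
        diff_plate_and_well = a[0] != b[0] and a[1] != b[1]
        if same_label and diff_plate_and_well:
            pairs.setdefault(a[2], []).append((i, j))
    return pairs


pairs = find_pairs(rows)
for label, label_pairs in sorted(pairs.items()):
    print(label, label_pairs)
```

Treating each pair as unordered, this yields the same pairs per label as the `pairs_dict` output shown in the removed section. The brute-force comparison of all row pairs is quadratic and is only meant to make the rule explicit.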