Commit

feat: Custom column names and suffixes for overlap and nearest operations (#43)

* doc: Installation instructions

* chore: Readme refactor

* feat: Add support for custom column names and suffixes

* Fixing needless borrow

* Removing assertion and adding test case for non-default suffixes

* Creating release 0.3.0
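
As a rough illustration of what "custom column names and suffixes" mean for an overlap operation, here is a minimal pure-Python sketch. It does not use the polars-bio API: the `overlap` helper, its `cols_left`, `cols_right` and `suffixes` parameters, the default suffix values, the closed-interval comparison and the sample data are all assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def overlap(
    left: List[Row],
    right: List[Row],
    cols_left: Tuple[str, str, str] = ("chrom", "start", "end"),
    cols_right: Tuple[str, str, str] = ("chrom", "start", "end"),
    suffixes: Tuple[str, str] = ("_1", "_2"),  # illustrative defaults, not taken from the library
) -> List[Row]:
    """Naive O(n*m) overlap join (hypothetical helper, not the library's implementation)."""
    c1, s1, e1 = cols_left
    c2, s2, e2 = cols_right
    out: List[Row] = []
    for a in left:
        for b in right:
            # Same contig, and the two closed intervals intersect.
            if a[c1] == b[c2] and a[s1] <= b[e2] and b[s2] <= a[e1]:
                # Suffix every column so names from both inputs stay distinct.
                row = {f"{k}{suffixes[0]}": v for k, v in a.items()}
                row.update({f"{k}{suffixes[1]}": v for k, v in b.items()})
                out.append(row)
    return out


# The left input uses non-default column names; the right input uses the defaults.
reads = [{"contig": "chr1", "pos_start": 100, "pos_end": 200}]
targets = [
    {"chrom": "chr1", "start": 150, "end": 250},
    {"chrom": "chr1", "start": 300, "end": 400},
]

result = overlap(
    reads,
    targets,
    cols_left=("contig", "pos_start", "pos_end"),
    suffixes=("_reads", "_targets"),
)
print(result)
```

Only the first target intersects the read, so the result holds a single row whose keys carry the `_reads` and `_targets` suffixes. The nested loop is chosen for readability, not speed.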
mwiewior authored Dec 21, 2024
1 parent 4bf723c commit 0f25a4d
Showing 16 changed files with 421 additions and 181 deletions.
1 change: 1 addition & 0 deletions .github/workflows/publish_to_pypi.yml
@@ -11,6 +11,7 @@ on:
- 'docs/**'
- 'benchmark/**'
- 'mkdocs.yml'
- 'README.md'
pull_request:
workflow_dispatch:

11 changes: 11 additions & 0 deletions .pre-commit-config.yaml
@@ -29,3 +29,14 @@ repos:
hooks:
- id: fmt
- id: cargo-check

### FIXME
# - repo: https://github.com/ddkasa/check-mkdocs.git
# rev: 65e819a4c62ee22c38f244b51b63f2f9b89a66d0
# hooks:
# - id: check-mkdocs
# name: check-mkdocs
# args: ["--config", "mkdocs.yml"] # Optional, mkdocs.yml is the default
# # If you have additional plugins or libraries that are not included in
# # check-mkdocs, add them here
# additional_dependencies: ['mkdocs-material', 'mkdocs-jupyter', 'mkdocstrings-python']
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "polars_bio"
-version = "0.2.11"
+version = "0.3.0"
edition = "2021"

[lib]
63 changes: 7 additions & 56 deletions README.md
@@ -1,59 +1,10 @@
# polars_bio
# polars-bio - Next-gen Python DataFrame operations for genomics!
![CI](https://github.com/biodatageeks/polars-bio/actions/workflows/publish_to_pypi.yml/badge.svg?branch=master)
![Docs](https://github.com/biodatageeks/polars-bio/actions/workflows/publish_documentation.yml/badge.svg?branch=master)
![logo](docs/assets/logo-large.png)

## Features

[polars-bio](https://pypi.org/project/polars-bio/) is a Python library for genomics built on top of [polars](https://pola.rs/), [Apache Arrow](https://arrow.apache.org/) and [Apache DataFusion](https://datafusion.apache.org/).
It provides a DataFrame API for genomics data and is designed to be blazing fast, memory efficient and easy to use.

## Genomic ranges operations

| Features | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|--------------|--------------------|---------------------|--------------------|--------------------|--------------------|--------------------|
| overlap | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| nearest | :white_check_mark: | :white_check_mark: | :white_check_mark: | | | |
| cluster | :white_check_mark: | | | | | |
| merge | :white_check_mark: | | | | | |
| complement | :white_check_mark: | | | | | |
| select/slice | :white_check_mark: | | | | | |
| | | | | | | |
| coverage | :white_check_mark: | | | | | |
| expand | :white_check_mark: | | | | | |
| sort | :white_check_mark: | | | | | |


## Input/Output
| I/O | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|------------------|--------------------|------------------------|--------------------|------------|------------|---------------|
| Pandas DataFrame | :white_check_mark: | :white_check_mark: | :white_check_mark: | | | |
| Polars DataFrame | | :white_check_mark: | | | | |
| Polars LazyFrame | | :white_check_mark: | | | | |
| Native readers | | :white_check_mark: | | | | |


## Genomic file format
| I/O | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|----------------|--------------------|------------|--------------------|------------|------------|---------------|
| BED | :white_check_mark: | | :white_check_mark: | | | |
| BAM | | | | | | |
| VCF | | | | | | |


## Performance
![img.png](benchmark/results-overlap-0.1.1.png)

![img.png](benchmark/results-overlap-df-0.1.1.png)

![img.png](benchmark/results-nearest-0.1.1.png)

## Remarks

Pyranges is multithreaded, but :

* Requires Ray backend plus
```bash
nb_cpu: int, default 1

How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
Will only lead to speedups on large datasets.
```

* for nearest returns no empty rows if there is no overlap (we follow Bioframe where nulls are returned)
#
Read the [documentation](https://biodatageeks.github.io/polars-bio/)