add RNAstarlign & ArchiveII datasets
Signed-off-by: Zhiyuan Chen <[email protected]>
ZhiyuanChen committed Oct 25, 2024
1 parent f09d83a commit 9432c39
Showing 9 changed files with 434 additions and 0 deletions.
9 changes: 9 additions & 0 deletions docs/docs/datasets/archiveii.md
@@ -0,0 +1,9 @@
---
authors:
- Zhiyuan Chen
date: 2024-05-04
---

# ArchiveII

--8<-- "multimolecule/datasets/archiveii/README.md:24:"
9 changes: 9 additions & 0 deletions docs/docs/datasets/rnastralign.md
@@ -0,0 +1,9 @@
---
authors:
- Zhiyuan Chen
date: 2024-05-04
---

# RNAStrAlign

--8<-- "multimolecule/datasets/rnastralign/README.md:24:"
2 changes: 2 additions & 0 deletions docs/mkdocs.yml
@@ -22,6 +22,8 @@ nav:
- bpRNA-1m: datasets/bprna.md
- bpRNA-spot: datasets/bprna-spot.md
- bpRNA-new: datasets/bprna-new.md
- RNAStrAlign: datasets/rnastralign.md
- ArchiveII: datasets/archiveii.md
- RYOS: datasets/ryos.md
- EternaBench-CM: datasets/eternabench-cm.md
- EternaBench-Switch: datasets/eternabench-switch.md
2 changes: 2 additions & 0 deletions multimolecule/datasets/README.md
@@ -25,6 +25,8 @@ date: 2024-05-04
- [EternaBench-CM](eternabench-cm)
- [EternaBench-Switch](eternabench-switch)
- [EternaBench-External](eternabench-external)
- [RNAStrAlign](rnastralign)
- [ArchiveII](archiveii)

## Usage

2 changes: 2 additions & 0 deletions multimolecule/datasets/README.zh.md
@@ -25,6 +25,8 @@ date: 2024-05-04
- [EternaBench-CM](eternabench-cm)
- [EternaBench-Switch](eternabench-switch)
- [EternaBench-External](eternabench-external)
- [RNAStrAlign](rnastralign)
- [ArchiveII](archiveii)

## 使用

100 changes: 100 additions & 0 deletions multimolecule/datasets/archiveii/README.md
@@ -0,0 +1,100 @@
---
language: rna
tags:
- Biology
- RNA
license:
- agpl-3.0
size_categories:
- 1K<n<10K
source_datasets:
- multimolecule/bprna
- multimolecule/pdb
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: ArchiveII
library_name: multimolecule
---

# ArchiveII

ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.

ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides.
This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.

It is considered complementary to the [RNAStrAlign](./rnastralign) dataset.

## Disclaimer

This is an UNOFFICIAL release of ArchiveII by Mehdi Saman Booy et al.

**The team releasing ArchiveII did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.**

## Dataset Description

- **Homepage**: https://multimolecule.danling.org/datasets/archiveii
- **datasets**: https://huggingface.co/datasets/multimolecule/archiveii
- **Point of Contact**: [Mehdi Saman Booy](mailto:[email protected])

## Example Entry

| id | sequence | secondary_structure | family |
| ------------------- | ------------------ | ------------------- | ---------- |
| 16S_rRNA-A.fulgidus | AUUCUGGUUGAUCCU... | ...(((((...(((.... | 16S_rRNA |

## Column Description

- **id**:
A unique identifier for each RNA entry. This ID is derived from the family and the original `.ct` file name, and serves as a reference to the specific RNA structure within the dataset.

- **sequence**:
The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:

- **A**: Adenine
- **C**: Cytosine
- **G**: Guanine
- **U**: Uracil

- **secondary_structure**:
The secondary structure of the RNA in dot-bracket notation, following bpRNA's standard. Unpaired nucleotides and base pairs are encoded with the following symbols:

- **Dots (`.`)**: Represent unpaired nucleotides.
- **Parentheses (`(` and `)`)**: Represent base pairs in standard stems (page 1).

- **family**:
The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
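
As a minimal sketch (not part of the MultiMolecule API), a dot-bracket string such as the one in `secondary_structure` can be decoded into base-pair indices with a stack. The function name `dot_bracket_to_pairs` is hypothetical, and the sketch assumes a properly nested structure that uses only the symbols `.`, `(` and `)`; crossing (pseudoknotted) pairs cannot be recovered this way.

```python
from __future__ import annotations


def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted 0-based (i, j) index pairs for a nested dot-bracket string."""
    stack: list[int] = []  # positions of "(" that are still waiting for a partner
    pairs: list[tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"Unmatched ')' at position {index}")
            # The most recent unmatched "(" pairs with this ")".
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"Unexpected symbol {symbol!r} at position {index}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("((..))"))  # [(0, 5), (1, 4)]
```

The indices are 0-based positions into `sequence`, so each pair can be used directly to look up the two paired nucleotides.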

## Related Datasets

- [RNAStrAlign](https://huggingface.co/datasets/multimolecule/rnastralign): A database of RNA secondary structures with the same families as ArchiveII, usually used for training.
- [bpRNA-spot](https://huggingface.co/datasets/multimolecule/bprna-spot): Another database commonly used in RNA secondary structure prediction.

## License

This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```

## Citation

```bibtex
@article{samanbooy2022rna,
author = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},
journal = {BMC Bioinformatics},
keywords = {Deep learning; Pseudoknotted structures; RNA structure prediction},
month = feb,
number = 1,
pages = {58},
publisher = {Springer Science and Business Media LLC},
title = {{RNA} secondary structure prediction with convolutional neural networks},
volume = 23,
year = 2022
}
```
95 changes: 95 additions & 0 deletions multimolecule/datasets/archiveii/archiveii.py
@@ -0,0 +1,95 @@
# MultiMolecule
# Copyright (C) 2024-Present MultiMolecule

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from __future__ import annotations

import os
from pathlib import Path

import torch
from tqdm import tqdm

from multimolecule.datasets.conversion_utils import ConvertConfig as ConvertConfig_
from multimolecule.datasets.conversion_utils import save_dataset

torch.manual_seed(1016)


def convert_ct(file):
    """Convert a connectivity table (.ct) file into a record with a dot-bracket structure."""
    if not isinstance(file, Path):
        file = Path(file)
    with open(file) as f:
        lines = f.readlines()

    # The first field of the header line is the number of bases.
    first_line = lines[0].strip().split()
    num_bases = int(first_line[0])

    sequence = []
    dot_bracket = ["."] * num_bases

    for i in range(1, num_bases + 1):
        line = lines[i].strip().split()
        sequence.append(line[1])
        # The fifth column is the 1-based index of the pairing partner (0 if unpaired).
        pair_index = int(line[4])

        if pair_index > 0:
            # The partner's own entry must point back to this base.
            if int(lines[pair_index].strip().split()[4]) != i:
                raise ValueError(
                    f"Invalid pairing at position {i}: pair_index {pair_index} does not point back correctly."
                )
            # Write each pair once, when visiting its lower-indexed base.
            if pair_index > i:
                dot_bracket[i - 1] = "("
                dot_bracket[pair_index - 1] = ")"

    # File stems have the form "<family>_<name>"; normalize the family label.
    family, name = file.stem.split("_", 1)
    if family in ("5s", "16s", "23s"):
        family = family.upper() + "_rRNA"
    elif family == "srp":
        family = family.upper()
    elif family == "grp1":
        family = "group_I_intron"
    elif family == "grp2":
        family = "group_II_intron"
    id = family + "-" + name

    return {
        "id": id,
        "sequence": "".join(sequence),
        "secondary_structure": "".join(dot_bracket),
        "family": family,
    }


def convert_dataset(convert_config):
    files = [
        os.path.join(convert_config.dataset_path, f)
        for f in os.listdir(convert_config.dataset_path)
        if f.endswith(".ct")
    ]
    files.sort()
    data = [convert_ct(file) for file in tqdm(files, total=len(files))]
    save_dataset(convert_config, data, filename="test.parquet")


class ConvertConfig(ConvertConfig_):
    root: str = os.path.dirname(__file__)
    output_path: str = os.path.basename(os.path.dirname(__file__))


if __name__ == "__main__":
    config = ConvertConfig()
    config.parse()  # type: ignore[attr-defined]
    convert_dataset(config)
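
To illustrate the CT parsing that `convert_ct` performs, here is a minimal, self-contained sketch that runs the same pairing logic on an in-memory record. The seven-nucleotide hairpin in `CT_TEXT` and the helper name `ct_to_dot_bracket` are hypothetical and not part of the commit; the sketch also leaves out the point-back validation and the family handling of the real function.

```python
from __future__ import annotations

# Hypothetical CT record. The header starts with the number of bases; each
# following row is: index, base, index - 1, index + 1, pairing partner
# (0 if unpaired), original numbering.
CT_TEXT = """\
7 example_hairpin
1 G 0 2 7 1
2 C 1 3 6 2
3 A 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 G 5 7 2 6
7 C 6 0 1 7
"""


def ct_to_dot_bracket(ct_text: str) -> tuple[str, str]:
    """Return (sequence, dot-bracket structure) for a single CT record."""
    lines = ct_text.splitlines()
    num_bases = int(lines[0].split()[0])
    sequence = []
    dot_bracket = ["."] * num_bases
    for i in range(1, num_bases + 1):
        fields = lines[i].split()
        sequence.append(fields[1])
        pair_index = int(fields[4])
        # Write each pair once, when visiting its lower-indexed base.
        if pair_index > i:
            dot_bracket[i - 1] = "("
            dot_bracket[pair_index - 1] = ")"
    return "".join(sequence), "".join(dot_bracket)


sequence, structure = ct_to_dot_bracket(CT_TEXT)
print(sequence)   # GCAAAGC
print(structure)  # ((...))
```

Because every pair is written with `(` and `)` regardless of whether it crosses another pair, the output is only a faithful nested structure when the CT record contains no crossing pairs.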
102 changes: 102 additions & 0 deletions multimolecule/datasets/rnastralign/README.md
@@ -0,0 +1,102 @@
---
language: rna
tags:
- Biology
- RNA
license:
- agpl-3.0
size_categories:
- 10K<n<100K
source_datasets:
- multimolecule/bprna
- multimolecule/pdb
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: RNAStrAlign
library_name: multimolecule
---

# RNAStrAlign

RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures.

RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns.

It is considered complementary to the [ArchiveII](./archiveii) dataset.

## Disclaimer

This is an UNOFFICIAL release of RNAStrAlign by Zhen Tan et al.

**The team releasing RNAStrAlign did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.**

## Dataset Description

- **Homepage**: https://multimolecule.danling.org/datasets/rnastralign
- **datasets**: https://huggingface.co/datasets/multimolecule/rnastralign
- **Point of Contact**: [David H. Mathews](mailto:[email protected]) and [Gaurav Sharma](mailto:[email protected])

## Example Entry

| id | sequence | secondary_structure | family | subfamily |
| -------------------------------- | ------------------ | ------------------- | ---------- | -------------- |
| 16S_rRNA-Actinobacteria-AB002635 | ACACAUGCAAGCGAA... | .(((.(((..((..(... | 16S_rRNA | Actinobacteria |

## Column Description

- **id**:
A unique identifier for each RNA entry. This ID is derived from the family and the original `.sta` file name, and serves as a reference to the specific RNA structure within the dataset.

- **sequence**:
The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:

- **A**: Adenine
- **C**: Cytosine
- **G**: Guanine
- **U**: Uracil

- **secondary_structure**:
The secondary structure of the RNA in dot-bracket notation, following bpRNA's standard. Unpaired nucleotides and base pairs are encoded with the following symbols:

- **Dots (`.`)**: Represent unpaired nucleotides.
- **Parentheses (`(` and `)`)**: Represent base pairs in standard stems (page 1).

- **family**:
The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.

- **subfamily**:
A more specific subfamily within the family, such as Actinobacteria for 16S rRNA.

For families without subfamilies, this field is `None`.
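
Because `subfamily` may be `None`, code that groups or labels entries should handle the missing value explicitly. The sketch below is one way to do so under stated assumptions: the first record mirrors the example entry above, while the second record and the helper name `family_label` are hypothetical and only serve the illustration.

```python
from __future__ import annotations

from collections import Counter

records = [
    # Mirrors the example entry in this dataset card.
    {"id": "16S_rRNA-Actinobacteria-AB002635", "family": "16S_rRNA", "subfamily": "Actinobacteria"},
    # Hypothetical entry for a family without subfamilies.
    {"id": "tRNA-hypothetical_entry", "family": "tRNA", "subfamily": None},
]


def family_label(record: dict) -> str:
    """Build a grouping label, omitting the subfamily when it is None."""
    subfamily = record["subfamily"]
    if subfamily is None:
        return record["family"]
    return f"{record['family']}/{subfamily}"


counts = Counter(family_label(record) for record in records)
for label, count in sorted(counts.items()):
    print(label, count)
```

Checking `is None` rather than truthiness keeps the intent explicit and avoids treating an empty string the same as a missing subfamily.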

## Related Datasets

- [ArchiveII](https://huggingface.co/datasets/multimolecule/archiveii): A database of RNA secondary structures with the same families as RNAStrAlign, usually used for testing.
- [bpRNA-spot](https://huggingface.co/datasets/multimolecule/bprna-spot): Another database commonly used in RNA secondary structure prediction.

## License

This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```

## Citation

```bibtex
@article{ran2017turbofold,
author = {Tan, Zhen and Fu, Yinghan and Sharma, Gaurav and Mathews, David H},
journal = {Nucleic Acids Research},
month = nov,
number = 20,
pages = {11570--11581},
title = {{TurboFold} {II}: {RNA} structural alignment and secondary structure prediction informed by multiple homologs},
volume = 45,
year = 2017
}
```