Signed-off-by: Zhiyuan Chen <[email protected]>
1 parent 3c5c457, commit 704d5f2
Showing 3 changed files with 179 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
authors: | ||
- Zhiyuan Chen | ||
date: 2024-05-04 | ||
--- | ||
|
||
# RYOS | ||
|
||
--8<-- "multimolecule/datasets/ryos/README.md:21:" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
--- | ||
language: rna | ||
tags: | ||
- Biology | ||
- RNA | ||
license: | ||
- agpl-3.0 | ||
size_categories: | ||
- 1K<n<10K | ||
task_categories: | ||
- text-generation | ||
- fill-mask | ||
task_ids: | ||
- language-modeling | ||
- masked-language-modeling | ||
pretty_name: RYOS | ||
library_name: multimolecule | ||
--- | ||
|
||
# RYOS | ||
|
||
![RYOS](https://eternagame.org/sites/default/files/hero-covid.jpg) | ||
|
||
RYOS is a database of RNA backbone stability in aqueous solution. | ||
|
||
## Statement | ||
|
||
_Deep learning models for predicting RNA degradation via dual crowdsourcing_ is published in [Nature Machine Intelligence](https://doi.org/10.1038/s42256-022-00571-8), which is a Closed Access / Author-Fee journal. | ||
|
||
> Machine learning has been at the forefront of the movement for free and open access to research. | ||
> | ||
> We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step. | ||
|
||
The MultiMolecule team is committed to the principles of open access and open science. | ||
|
||
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences. | ||
|
||
Please consider signing the [Statement on Nature Machine Intelligence](https://openaccess.engineering.oregonstate.edu). | ||
|
||
## Disclaimer | ||
|
||
This is an UNOFFICIAL release of the [RYOS](https://www.kaggle.com/competitions/stanford-covid-vaccine) dataset by Hannah K. Wayment-Steele et al. | ||
|
||
**The team releasing RYOS did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.** | ||
|
||
## Dataset Description | ||
|
||
- **Homepage**: https://multimolecule.danling.org/datasets/ryos | ||
- **Point of Contact**: [Rhiju Das](https://biochemistry.stanford.edu/people/rhiju-das/) | ||
- **Kaggle Challenge**: https://www.kaggle.com/competitions/stanford-covid-vaccine | ||
- **Eterna Round 1**: https://eternagame.org/labs/9830365 | ||
- **Eterna Round 2**: https://eternagame.org/labs/10207059 | ||
|
||
## Example Entry | ||
|
||
TODO | ||
|
||
## Column Description | ||
|
||
TODO | ||
|
||
## Variations | ||
|
||
This dataset is available in two subsets: | ||
|
||
- [RYOS-1](https://huggingface.co/datasets/multimolecule/ryos-1): The RYOS-1 dataset. | ||
- [RYOS-2](https://huggingface.co/datasets/multimolecule/ryos-2): The RYOS-2 dataset. | ||
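
A minimal sketch of loading either subset, assuming the Hugging Face `datasets` library is installed and the two repositories listed above are reachable. The helper names `ryos_repo_id` and `load_ryos` are hypothetical and are not part of MultiMolecule.

```python
from __future__ import annotations


def ryos_repo_id(round_number: int) -> str:
    """Return the Hub repository ID for an RYOS round (hypothetical helper)."""
    # Only the two subsets listed in this dataset card are accepted.
    if round_number not in (1, 2):
        raise ValueError(f"RYOS has rounds 1 and 2, got {round_number!r}")
    return f"multimolecule/ryos-{round_number}"


def load_ryos(round_number: int):
    """Load one RYOS subset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so ryos_repo_id works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(ryos_repo_id(round_number))
```

Calling `load_ryos(1)` would download the RYOS-1 repository; the available splits depend on what has been uploaded there.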
|
||
## License | ||
|
||
This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html). | ||
|
||
```spdx | ||
SPDX-License-Identifier: AGPL-3.0-or-later | ||
``` | ||
|
||
## Citation | ||
|
||
```bibtex | ||
@article{waymentsteele2021, | ||
author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Watkins, Andrew M and Kim, Do Soon and Tunguz, Bojan and Reade, Walter and Demkin, Maggie and Romano, Jonathan and Wellington-Oguri, Roger and Nicol, John J and Gao, Jiayang and Onodera, Kazuki and Fujikawa, Kazuki and Mao, Hanfei and Vandewiele, Gilles and Tinti, Michele and Steenwinckel, Bram and Ito, Takuya and Noumi, Taiga and He, Shujun and Ishi, Keiichiro and Lee, Youhan and {\"O}zt{\"u}rk, Fatih and Chiu, Anthony and {\"O}zt{\"u}rk, Emin and Amer, Karim and Fares, Mohamed and Participants, Eterna and Das, Rhiju}, | ||
journal = {ArXiv}, | ||
month = oct, | ||
title = {Deep learning models for predicting {RNA} degradation via dual crowdsourcing}, | ||
year = 2021 | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# MultiMolecule | ||
# Copyright (C) 2024-Present MultiMolecule | ||
|
||
# This program is free software: you can redistribute it and/or modify | ||
# it under the terms of the GNU Affero General Public License as published by | ||
# the Free Software Foundation, either version 3 of the License, or | ||
# any later version. | ||
|
||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU Affero General Public License for more details. | ||
|
||
# You should have received a copy of the GNU Affero General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
from __future__ import annotations | ||
|
||
import os | ||
|
||
import danling as dl | ||
import torch | ||
|
||
from multimolecule.datasets.conversion_utils import ConvertConfig as ConvertConfig_ | ||
from multimolecule.datasets.conversion_utils import save_dataset | ||
|
||
torch.manual_seed(1016) | ||
|
||
cols = [ | ||
    "ID", | ||
    "design_name", | ||
    "sequence", | ||
    "structure", | ||
    "errors", | ||
    "signal_to_noise", | ||
    "reactivity", | ||
    "signal_to_noise_reactivity", | ||
    "errors_reactivity", | ||
    "deg_pH10", | ||
    "signal_to_noise_deg_pH10", | ||
    "errors_deg_pH10", | ||
    "deg_50C", | ||
    "signal_to_noise_deg_50C", | ||
    "errors_deg_50C", | ||
    "deg_Mg_pH10", | ||
    "signal_to_noise_deg_Mg_pH10", | ||
    "errors_deg_Mg_pH10", | ||
    "deg_Mg_50C", | ||
    "signal_to_noise_deg_Mg_50C", | ||
    "errors_deg_Mg_50C", | ||
    "SN_filter", | ||
] | ||
|
||
|
||
def convert_dataset(convert_config): | ||
    df = dl.load_pandas(convert_config.dataset_path) | ||
    ryos1 = df[df["RYOS"] == 1] | ||
    ryos2 = df[df["RYOS"] == 2] | ||
    data1 = { | ||
        "train": ryos1[ryos1["split"] == "public_train"][cols], | ||
        "validation": ryos1[ryos1["split"] == "public_test"][cols], | ||
        "test": ryos1[ryos1["split"] == "private_test"][cols], | ||
    } | ||
    data2 = { | ||
        "train": ryos2[ryos2["split"] != "private_test"][cols], | ||
        "test": ryos2[ryos2["split"] == "private_test"][cols], | ||
    } | ||
    repo_id, output_path = convert_config.repo_id, convert_config.output_path | ||
    convert_config.repo_id, convert_config.output_path = repo_id + "-1", output_path + "-1" | ||
    save_dataset(convert_config, data1) | ||
    convert_config.repo_id, convert_config.output_path = repo_id + "-2", output_path + "-2" | ||
    save_dataset(convert_config, data2) | ||
|
||
|
||
class ConvertConfig(ConvertConfig_): | ||
    root: str = os.path.dirname(__file__) | ||
    output_path: str = os.path.basename(os.path.dirname(__file__)) | ||
|
||
|
||
if __name__ == "__main__": | ||
    config = ConvertConfig() | ||
    config.parse()  # type: ignore[attr-defined] | ||
    convert_dataset(config) |
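
The split logic in `convert_dataset` can be tried on a toy table without the real source file. The sketch below assumes pandas is installed; the five-row frame and the `split_rounds` helper are made up for illustration, and only the `RYOS` and `split` column names and the split labels come from the script above.

```python
import pandas as pd

# Toy stand-in for the source table; the real file carries many more columns.
df = pd.DataFrame(
    {
        "ID": ["a", "b", "c", "d", "e"],
        "RYOS": [1, 1, 1, 2, 2],
        "split": [
            "public_train",
            "public_test",
            "private_test",
            "public_train",
            "private_test",
        ],
    }
)
cols = ["ID"]  # reduced column selection for the toy example


def split_rounds(df: pd.DataFrame, cols: list[str]):
    """Mirror the boolean-mask selection used in convert_dataset (hypothetical helper)."""
    ryos1 = df[df["RYOS"] == 1]
    ryos2 = df[df["RYOS"] == 2]
    # Round 1 keeps the three Kaggle splits as train / validation / test.
    data1 = {
        "train": ryos1[ryos1["split"] == "public_train"][cols],
        "validation": ryos1[ryos1["split"] == "public_test"][cols],
        "test": ryos1[ryos1["split"] == "private_test"][cols],
    }
    # Round 2 folds every row that is not private_test into train.
    data2 = {
        "train": ryos2[ryos2["split"] != "private_test"][cols],
        "test": ryos2[ryos2["split"] == "private_test"][cols],
    }
    return data1, data2


data1, data2 = split_rounds(df, cols)
print({name: part["ID"].tolist() for name, part in data1.items()})
print({name: part["ID"].tolist() for name, part in data2.items()})
```

Note that round 2 has no validation split in this scheme, so downstream code that expects one would need to carve it out of `train`.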