feat: drugchat dataset #341

Open · wants to merge 2 commits into base: main
34 changes: 34 additions & 0 deletions data/drugchat_liang_zhang_et_al/meta.yaml
@@ -0,0 +1,34 @@
---
name: drugchat_liang_zhang_et_al
description: |-
    Instruction tuning dataset used for the LLM component of DrugChat.
    10,834 compounds (3,892 from ChEMBL and 6,942 from PubChem) containing
    descriptive drug information were collected. 143,517 questions were generated
    using the molecules' classification, properties, and descriptions from ChEBI, LOTUS & YMDB.
targets:
    - id: Answer
      description: answer to the question about the SMILES
      type: string
identifiers:
    - id: SMILES
      type: SMILES
      description: SMILES
    - id: Question
      type: string
      description: question about the SMILES
license: CC BY 4.0
Collaborator: are we sure about this license? The repo seems to be under a BSD license?

links:
    - url: https://www.techrxiv.org/articles/preprint/DrugChat_Towards_Enabling_ChatGPT-Like_Capabilities_on_Drug_Molecule_Graphs/22945922
      description: corresponding publication
    - url: https://github.com/UCSD-AI4H/drugchat
      description: repo & data source
num_points: 143517
bibtex:
    - |-
      @article{Liang2023,
      author = "Youwei Liang and Ruiyi Zhang and Li Zhang and Pengtao Xie",
      title = "{DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs}",
      year = "2023",
      month = "5",
      url = "https://www.techrxiv.org/articles/preprint/DrugChat_Towards_Enabling_ChatGPT-Like_Capabilities_on_Drug_Molecule_Graphs/22945922",
      doi = "10.36227/techrxiv.22945922.v1"}
51 changes: 51 additions & 0 deletions data/drugchat_liang_zhang_et_al/transform.py
@@ -0,0 +1,51 @@
from datasets import concatenate_datasets, load_dataset

PUBCHEM_DATASET = "alxfgh/PubChem_Drug_Instruction_Tuning"
CHEMBL_DATASET = "alxfgh/ChEMBL_Drug_Instruction_Tuning"
Collaborator (on lines +3 to +4): can we put the requests.get code here or make those paths customizable?
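
One possible shape for the customizable-path option, sketched with argparse; the flag names and this wiring are illustrative suggestions, not part of the PR:

# Hypothetical sketch: make the dataset paths overridable from the CLI.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--pubchem", default="alxfgh/PubChem_Drug_Instruction_Tuning")
parser.add_argument("--chembl", default="alxfgh/ChEMBL_Drug_Instruction_Tuning")
args = parser.parse_args()

PUBCHEM_DATASET = args.pubchem
CHEMBL_DATASET = args.chembl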



if __name__ == "__main__":
    # Load the two source datasets
    dataset1 = load_dataset(PUBCHEM_DATASET)
    dataset2 = load_dataset(CHEMBL_DATASET)

    # Verify that the datasets have the same schema (i.e., the same fields)
    assert (
        dataset1["train"].features == dataset2["train"].features
    ), "Datasets do not have the same schema"

    # Concatenate the 'train' split of dataset2 to the 'train' split of dataset1
    combined_dataset = concatenate_datasets([dataset1["train"], dataset2["train"]])

    # Define the fractions for the train/test/valid split
    train_fraction = 0.8
    test_fraction = 0.1
    # The remaining 0.1 will be the validation fraction

    # Generate the train/test/valid splits
    train_test_valid_datasets = combined_dataset.train_test_split(
        test_size=test_fraction, shuffle=True
    )
    # The second split operates on the remaining (1 - test_fraction) of the
    # data, so the validation share must be rescaled relative to the remainder
    train_valid_datasets = train_test_valid_datasets["train"].train_test_split(
        test_size=(1 - train_fraction - test_fraction) / (1 - test_fraction),
        shuffle=True,
    )

    final_datasets = {
        "train": train_valid_datasets["train"],
        "test": train_test_valid_datasets["test"],
        "valid": train_valid_datasets["test"],
    }

    # Add the 'split' column to each dataset
    for split in final_datasets:
        final_datasets[split] = final_datasets[split].add_column(
            "split", [split] * len(final_datasets[split])
        )

    # Concatenate all splits again
    all_datasets = concatenate_datasets(
        [final_datasets[split] for split in final_datasets]
    )

    # Save the combined dataset as a CSV file
    all_datasets.to_csv("data_clean.csv")
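
A quick sanity check of the resulting proportions after running the script; the pandas snippet is only an illustration, not part of the PR:

import pandas as pd

df = pd.read_csv("data_clean.csv")
# With the rescaled second split, this should print roughly 0.8 / 0.1 / 0.1
print(df["split"].value_counts(normalize=True))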