diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 4d74fad..1f73464 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -58,16 +58,13 @@ jobs: python atomgpt/forward_models/forward_models.py --config_name atomgpt/examples/forward_model/config.json echo 'inverse model' - python atomgpt/examples/inverse_model/run.py + #python atomgpt/examples/inverse_model/run.py coverage run -m pytest coverage report -m -i codecov #codecov --token="85bd9c5d-9e55-4f6d-bd69-350ee5e3bb41" - #train_alignn.py -h - #echo 'Pre-trained models' - #pretrained.py -h - #find . -type f > after_test_files.txt + find . -type f > after_test_files.txt diff --git a/README.md b/README.md index 2717b96..c95dd5d 100644 --- a/README.md +++ b/README.md @@ -1,29 +1,61 @@ # AtomGPT: atomistic generative pre-trained transformer for forward and inverse materials design -Large language models (LLMs) such as generative pretrained transformers (GPTs) have shown potential for various commercial applications, but their applicability for materials design remains underexplored. In this work, AtomGPT is introduced as a model specifically developed for materials design based on transformer architectures, demonstrating capabilities for both atomistic property prediction and structure generation. This study shows that a combination of chemical and structural text descriptions can efficiently predict material properties with accuracy comparable to graph neural network models, including formation energies, electronic bandgaps from two different methods, and superconducting transition temperatures. Furthermore, AtomGPT can generate atomic structures for tasks such as designing new superconductors, with the predictions validated through density functional theory calculations. This work paves the way for leveraging LLMs in forward and inverse materials design, offering an efficient approach to the discovery and optimization of materials. 
+Large language models (LLMs) such as [ChatGPT](https://openai.com/chatgpt/) have shown immense potential for various commercial applications, but their applicability for materials design remains underexplored. In this work, AtomGPT is introduced as a model specifically developed for materials design based on transformer architectures, demonstrating capabilities for both atomistic property prediction and structure generation tasks. This study shows that a combination of chemical and structural text descriptions can efficiently predict material properties with accuracy comparable to graph neural network models, including formation energies, electronic bandgaps from two different methods, and superconducting transition temperatures. Furthermore, AtomGPT can generate atomic structures for tasks such as designing new superconductors, with the predictions validated through density functional theory calculations. This work paves the way for leveraging LLMs in forward and inverse materials design, offering an efficient approach to the discovery and optimization of materials. +Both forward and inverse models take a `config.json` file as input. The config file provides basic training parameters and the path to an `id_prop.csv` file, similar to the ALIGNN (https://github.com/usnistgov/alignn) model. See an example here: [id_prop.csv](https://github.com/usnistgov/atomgpt/blob/develop/atomgpt/examples/forward_model/id_prop.csv). ## Forward model example (structure to property) +Forward models are used for developing surrogate models for atomic structure to property predictions. They require text input, which can be either raw POSCAR-type files or a text description of the material. We can then use models such as Google T5 or OpenAI GPT-2 with a customized language head to accomplish this task. The description of a material is generated with the [ChemNLP/describer](https://github.com/usnistgov/jarvis/blob/master/jarvis/core/atoms.py#L1567) function. 
If you set [`convert`](https://github.com/usnistgov/atomgpt/blob/develop/atomgpt/forward_models/forward_models.py#L277) to `False`, you can also train on bare POSCAR files. + ``` python atomgpt/forward_models/forward_models.py --config_name atomgpt/examples/forward_model/config.json ``` ## Inverse model example (property to structure) +Inverse models are used for generating materials given a property and a description such as a chemical formula. Currently, we use the Mistral model, but other models such as Gemma, Llama etc. can also be easily used. After structure generation, we can optimize the structure with the ALIGNN-FF model (example [here](https://colab.research.google.com/github/knc6/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/ALIGNN_Structure_Relaxation_Phonons_Interface.ipynb)) and then subject a few selected candidates to density functional theory calculations using JARVIS-DFT or a similar workflow (tutorial, for example, [here](https://pages.nist.gov/jarvis/tutorials/)). Note that currently, inverse model training as well as inference requires GPUs. 
+ ``` python atomgpt/inverse_models/inverse_models.py --config_name atomgpt/examples/inverse_model/config.json ``` # Google colab/Jupyter notebook - -[![Open in Google Colab]](https://github.com/knc6/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/atomgpt_example.ipynb) +Examples for running AtomGPT are given in the [notebook](https://colab.research.google.com/github/knc6/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/atomgpt_example.ipynb) +[![Open in Google Colab]](https://colab.research.google.com/github/knc6/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/atomgpt_example.ipynb) [Open in Google Colab]: https://colab.research.google.com/assets/colab-badge.svg +For other notebook examples, see [here](https://github.com/JARVIS-Materials-Design/jarvis-tools-notebooks) + +![AtomGPT layer schematic](https://github.com/usnistgov/atomgpt/blob/develop/atomgpt/data/schematic.jpeg) + + +# References: + +1. [AtomGPT: Atomistic Generative Pretrained Transformer for Forward and Inverse Materials Design](https://pubs.acs.org/doi/full/10.1021/acs.jpclett.4c01126) +2. [ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data](https://github.com/usnistgov/chemnlp) + + +How to contribute +----------------- + +For detailed instructions, please see [Contribution instructions](https://github.com/usnistgov/jarvis/blob/master/Contribution.rst) + + +Correspondence +-------------------- + +Please report bugs as GitHub issues (https://github.com/usnistgov/atomgpt/issues) or email kamal.choudhary@nist.gov. -(Documentation development is in progress...) 
+ +Funding support +-------------------- +NIST-MGI (https://www.nist.gov/mgi) and CHIPS (https://www.nist.gov/chips) +Code of conduct +-------------------- +Please see [Code of conduct](https://github.com/usnistgov/jarvis/blob/master/CODE_OF_CONDUCT.md) diff --git a/atomgpt/config.py b/atomgpt/config.py deleted file mode 100644 index f0bb917..0000000 --- a/atomgpt/config.py +++ /dev/null @@ -1,30 +0,0 @@ -from typing import Optional -from pydantic_settings import BaseSettings -class TrainingPropConfig(BaseSettings): - """Training config defaults and validation.""" - - benchmark_file: Optional[str] = None - # "AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae" - id_prop_path: Optional[str] = None - prefix: str = "xyz" - model_name: str = "gpt2" - leaderboard_dir: str = ( - "/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - ) - batch_size: int = 8 - max_length: int = 512 - num_epochs: int = 500 - latent_dim: int = 1024 - learning_rate: float = 1e-3 - test_each_run: bool = True - include_struct: bool = False - pretrained_path: str = "" - seed_val: int = 42 - n_train: Optional[int] = None - n_val: Optional[int] = None - n_test: Optional[int] = None - train_ratio: Optional[float] = None - val_ratio: float = 0.1 - test_ratio: float = 0.1 - keep_data_order: bool = False - output_dir: str = "temp" diff --git a/atomgpt/data/schematic.jpeg b/atomgpt/data/schematic.jpeg new file mode 100644 index 0000000..cadf99c Binary files /dev/null and b/atomgpt/data/schematic.jpeg differ diff --git a/atomgpt/forward_models/train_id_prop.py b/atomgpt/forward_models/train_id_prop.py deleted file mode 100644 index c87a66d..0000000 --- a/atomgpt/forward_models/train_id_prop.py +++ /dev/null @@ -1,735 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import loadjson, dumpjson -from torch.utils.data import DataLoader, Dataset -import 
numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict -from tqdm import tqdm -import time -import json -import zipfile -from typing import Optional -from pydantic_settings import BaseSettings -import csv -import pprint -class TrainingPropConfig(BaseSettings): - """Training config defaults and validation.""" - - id_prop_path: Optional[str] = "robo_desc.json.zip" - prefix: str = "atomgpt_run" - model_name: str = "gpt2" - batch_size: int = 16 - max_length: int = 512 - num_epochs: int = 500 - latent_dim: int = 1024 - learning_rate: float = 1e-3 - test_each_run: bool = True - include_struct: bool = False - pretrained_path: str = "" - seed_val: int = 42 - n_train: Optional[int] = None - n_val: Optional[int] = None - n_test: Optional[int] = None - output_dir: str = "out_temp" - train_ratio: Optional[float] = None - val_ratio: float = 0.1 - test_ratio: float = 0.1 - keep_data_order: bool = True - - -def get_id_train_val_test( - total_size=1000, - split_seed=123, - train_ratio=None, - val_ratio=0.1, - test_ratio=0.1, - n_train=None, - n_test=None, - n_val=None, - keep_data_order=True, -): - """Get train, val, test IDs.""" - if ( - train_ratio is None - and val_ratio is not None - and test_ratio is not None - ): - if train_ratio is None: - assert val_ratio + test_ratio < 1 - train_ratio = 1 - val_ratio - test_ratio - print("Using rest of the dataset except the test and val sets.") - else: - assert train_ratio + val_ratio + test_ratio <= 1 - # indices = list(range(total_size)) - if n_train is None: - n_train = int(train_ratio * total_size) - if n_test is None: - n_test = int(test_ratio * total_size) - if 
n_val is None: - n_val = int(val_ratio * total_size) - ids = list(np.arange(total_size)) - if not keep_data_order: - random.seed(split_seed) - random.shuffle(ids) - # np.random.shuffle(ids) - if n_train + n_val + n_test > total_size: - raise ValueError( - "Check total number of samples.", - n_train + n_val + n_test, - ">", - total_size, - ) - - # shuffle consistently with https://github.com/txie-93/cgcnn/data.py - # i.e. shuffle the index in place with standard library random.shuffle - # first obtain only valid indices - - # test_size = round(N * 0.2) - - # full train/val test split - # ids = ids[::-1] - id_train = ids[:n_train] - id_val = ( - ids[-(n_val + n_test) : -n_test] - if n_test > 0 - else ids[-(n_val + n_test) :] - ) # noqa:E203 - id_test = ids[-n_test:] if n_test > 0 else [] - return id_train, id_val, id_test - - -def make_id_prop( - benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip", - desc_file="robo_desc.json.zip", - leaderboard_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - # leaderboard_dir="/work/03943/kamalch/ls6/Software/atomgpt/jarvis_leaderboard/jarvis_leaderboard/", - output_dir="test_id_prop", -): - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop_name = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop_name + ".json.zip" - temp2 = dataset + "_" + prop_name + ".json" - fname = os.path.join(leaderboard_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - output_dir = prop_name + "_" + dataset - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - if not os.path.exists(output_dir): - os.makedirs(output_dir) - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - 
val_ids = test_ids - print("Saving files in", output_dir) - if ".zip" in desc_file: - zp = zipfile.ZipFile(desc_file) - dat = json.loads(zp.read(desc_file.split(".zip")[0].split("/")[-1])) - - else: - dat = loadjson(desc_file) - - dat2 = {} - for i in dat: - dat2[i["id"]] = i["desc"] - dft_3d2 = {} - for i in dft_3d: - dft_3d2[i[id_tag]] = i - mem = [] - for i in train_ids: - desc = dat2[i] - prop = dft_3d2[i][prop_name] - info = {} - info["id"] = i - info["desc"] = desc - info["prop"] = prop - mem.append(info) - for i in val_ids: - desc = dat2[i] - - prop = dft_3d2[i][prop_name] - info = {} - info["id"] = i - info["desc"] = desc - info["prop"] = prop - mem.append(info) - for i in test_ids: - desc = dat2[i] - prop = dft_3d2[i][prop_name] - info = {} - info["id"] = i - info["desc"] = desc - info["prop"] = prop - mem.append(info) - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - filename = os.path.join(output_dir, "id_prop_llm.json") - filename_config = os.path.join(output_dir, "config.json") - minfo = {} - minfo["n_train"] = len(train_ids) - minfo["n_val"] = len(val_ids) - minfo["n_test"] = len(test_ids) - minfo["id_prop_path"] = os.path.abspath(filename) - minfo["output_dir"] = os.path.abspath(output_dir) - - dumpjson(data=minfo, filename=filename_config) - dumpjson(data=mem, filename=filename) - return output_dir - - -## -os.environ["WANDB_ANONYMOUS"] = "must" -random_seed = 42 -random.seed(random_seed) -torch.manual_seed(random_seed) -np.random.seed(random_seed) -torch.cuda.manual_seed_all(random_seed) -try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) -except ImportError: - pass -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(random_seed) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -device = "cpu" -if 
torch.cuda.is_available(): - device = torch.device("cuda") -# device = "cpu" - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - # torch.tensor(inputs*10,dtype=inputs.dtype) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt(config_file="config.json"): - print("Running AtomGPT prop predictor.") - run_path = os.path.abspath(config_file).split("config.json")[0] - print('PATH', run_path) - config = loadjson(config_file) - config = TrainingPropConfig(**config) - pprint.pprint(config) - id_prop_path = config.id_prop_path - if ".zip" in id_prop_path: - zp = zipfile.ZipFile(id_prop_path) - dat = json.loads(zp.read(id_prop_path.split(".zip")[0])) - elif ".csv" in id_prop_path: - with open(id_prop_path, "r") as f: - reader = csv.reader(f) - dt = [row for row in reader] - - dat=[] - for i in dt: - info={} - info['id']=i[0] - info['prop']=[float(j) for j in i[1:]] # float(i[1]) - with open(os.path.join(run_path,info['id']),"r") as f: - lines=f.read() - info['desc']=lines - dat.append(info) - - else: - dat = loadjson(id_prop_path) - print("len", len(dat)) - prefix = config.prefix - model_name = config.model_name - batch_size = config.batch_size - max_length = config.max_length - num_epochs = config.num_epochs - latent_dim = config.latent_dim - learning_rate = config.learning_rate - test_each_run = config.test_each_run - pretrained_path = config.pretrained_path 
- seed_val = config.seed_val - include_struct = config.include_struct - n_train = config.n_train - n_val = config.n_val - n_test = config.n_test - train_ratio = config.train_ratio - val_ratio = config.val_ratio - test_ratio = config.test_ratio - output_dir = config.output_dir - keep_data_order = config.keep_data_order - - f = open(os.path.join(config.output_dir, "config.json"), "w") - f.write(json.dumps(config.dict(), indent=4)) - f.close() - - id_train, id_val, id_test = get_id_train_val_test( - total_size=len(dat), - split_seed=seed_val, - train_ratio=train_ratio, - val_ratio=val_ratio, - test_ratio=test_ratio, - n_train=n_train, - n_test=n_test, - n_val=n_val, - keep_data_order=keep_data_order, - ) - - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - train_info = [] - val_info = [] - test_info = [] - for ii, i in enumerate(dat): - if ii in id_train: - train_texts.append(i["desc"]) - train_targets.append(i["prop"]) - train_ids_temp.append(i["id"]) - train_info.append(i) - if ii in id_test: - test_texts.append(i["desc"]) - test_targets.append(i["prop"]) - test_ids_temp.append(i["id"]) - val_info.append(i) - if ii in id_val: - val_texts.append(i["desc"]) - val_targets.append(i["prop"]) - val_ids_temp.append(i["id"]) - test_info.append(i) - print("test_texts:", len(test_texts)) - print("val_texts example:", val_texts[0]) - print("test_texts example:", test_texts[0]) - - print("Train\n", pd.DataFrame(train_info)) - print("Val\n", pd.DataFrame(val_info)) - print("test\n", pd.DataFrame(test_info)) - - print("total", len(dat)) - print("test_ids", len(id_test)) - print("val_ids", len(id_val)) - print("train_ids", len(id_train)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = 
transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - # torch.nn.Linear(model.config.hidden_size, 1), - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear( latent_dim,256), - # torch.nn.Transformer(d_model=latent_dim, nhead=1, 
num_encoder_layers=1, num_decoder_layers=1), - # torch.nn.Linear(latent_dim, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - # torch.nn.ReLU(), - # torch.nn.LeakyReLU(), - # torch.nn.Dropout(p=0.2), - # torch.nn.TransformerEncoder(torch.nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4), num_layers=2), - # torch.nn.Linear(256, 1), - torch.nn.Linear(latent_dim, 1), - ) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - # TODO: knc6 change later - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # output_dir = prefix + "_out" # + model_name + "_" + dataset + "_" + prop - if not 
os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - # optimizer.zero_grad() - train_loss += loss.item() - scheduler.step() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - model.eval() - val_loss = 0 - t1 = time.time() - fname = os.path.join(output_dir, "val_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - with torch.no_grad(): - for batch in val_dataloader: - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - ids = batch[1] - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # 
print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - f.write(line) - f.close() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - model.eval() - with torch.no_grad(): - if test_each_run: - t1_test = time.time() - # model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - test_loss = 0 - for batch in test_dataloader: - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - test_loss += loss.item() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - f.write(line) - 
test_loss = test_loss / len(test_dataloader) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - # mae, - test_loss, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results_final.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - optimizer.zero_grad() - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - #output_dir = make_id_prop() - output_dir="." 
- run_atomgpt(config_file=output_dir + "/config.json") - # config_file="config.json" - # ) diff --git a/atomgpt/inverse_models/__init__.py b/atomgpt/inverse_models/__init__.py index ff7129e..9c90797 100644 --- a/atomgpt/inverse_models/__init__.py +++ b/atomgpt/inverse_models/__init__.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - from .loader import FastLanguageModel from .llama import FastLlamaModel from .mistral import FastMistralModel diff --git a/atomgpt/inverse_models/_utils.py b/atomgpt/inverse_models/_utils.py index a53de42..6a1d3cf 100644 --- a/atomgpt/inverse_models/_utils.py +++ b/atomgpt/inverse_models/_utils.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- import torch from typing import Union, Optional, List, Any, Callable import warnings diff --git a/atomgpt/inverse_models/dpo.py b/atomgpt/inverse_models/dpo.py index b7c7305..b004f40 100644 --- a/atomgpt/inverse_models/dpo.py +++ b/atomgpt/inverse_models/dpo.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - try: from transformers.utils.notebook import ( IntervalStrategy, diff --git a/atomgpt/inverse_models/gemma.py b/atomgpt/inverse_models/gemma.py index 5dd2a5a..eaed401 100644 --- a/atomgpt/inverse_models/gemma.py +++ b/atomgpt/inverse_models/gemma.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- from .llama import * from ._utils import __version__ diff --git a/atomgpt/inverse_models/inf.py b/atomgpt/inverse_models/inference.py similarity index 99% rename from atomgpt/inverse_models/inf.py rename to atomgpt/inverse_models/inference.py index dd54a06..e8135cd 100644 --- a/atomgpt/inverse_models/inf.py +++ b/atomgpt/inverse_models/inference.py @@ -1,4 +1,4 @@ - +"""Module for inference.""" from jarvis.db.jsonutils import loadjson from unsloth import FastLanguageModel import torch diff --git a/atomgpt/inverse_models/inverse_models.py b/atomgpt/inverse_models/inverse_models.py index 4e8f2f1..f27f38f 100644 --- a/atomgpt/inverse_models/inverse_models.py +++ b/atomgpt/inverse_models/inverse_models.py @@ -28,7 +28,7 @@ help="Name of the config file", ) - +# Adapted from https://github.com/unslothai/unsloth class TrainingPropConfig(BaseSettings): """Training config defaults and validation.""" @@ -157,7 +157,7 @@ def formatting_prompts_func(examples): def text2atoms(response): tmp_atoms_array = response.strip("").split("\n") # tmp_atoms_array= [element for element in tmp_atoms_array if element != ''] - print("tmp_atoms_array", tmp_atoms_array) + # print("tmp_atoms_array", tmp_atoms_array) lat_lengths = np.array(tmp_atoms_array[1].split(), dtype="float") lat_angles = np.array(tmp_atoms_array[2].split(), dtype="float") diff --git a/atomgpt/inverse_models/kernels/__init__.py b/atomgpt/inverse_models/kernels/__init__.py index fb49219..94d5a8e 100644 --- a/atomgpt/inverse_models/kernels/__init__.py +++ b/atomgpt/inverse_models/kernels/__init__.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - from atomgpt.inverse_models.kernels.cross_entropy_loss import fast_cross_entropy_loss from .rms_layernorm import fast_rms_layernorm from .rope_embedding import fast_rope_embedding, inplace_rope_embedding diff --git a/atomgpt/inverse_models/kernels/cross_entropy_loss.py b/atomgpt/inverse_models/kernels/cross_entropy_loss.py index 0acff4c..f6d5a26 100644 --- a/atomgpt/inverse_models/kernels/cross_entropy_loss.py +++ b/atomgpt/inverse_models/kernels/cross_entropy_loss.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import triton import triton.language as tl import torch diff --git a/atomgpt/inverse_models/kernels/fast_lora.py b/atomgpt/inverse_models/kernels/fast_lora.py index edce605..8f88434 100644 --- a/atomgpt/inverse_models/kernels/fast_lora.py +++ b/atomgpt/inverse_models/kernels/fast_lora.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import torch from atomgpt.inverse_models.kernels.utils import fast_dequantize, QUANT_STATE, get_lora_parameters, matmul_lora diff --git a/atomgpt/inverse_models/kernels/geglu.py b/atomgpt/inverse_models/kernels/geglu.py index 97a25fa..8074357 100644 --- a/atomgpt/inverse_models/kernels/geglu.py +++ b/atomgpt/inverse_models/kernels/geglu.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import triton import triton.language as tl import torch diff --git a/atomgpt/inverse_models/kernels/rms_layernorm.py b/atomgpt/inverse_models/kernels/rms_layernorm.py index 6d06dbc..000410b 100644 --- a/atomgpt/inverse_models/kernels/rms_layernorm.py +++ b/atomgpt/inverse_models/kernels/rms_layernorm.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import triton import triton.language as tl import torch diff --git a/atomgpt/inverse_models/kernels/rope_embedding.py b/atomgpt/inverse_models/kernels/rope_embedding.py index 87e0178..64c0cb9 100644 --- a/atomgpt/inverse_models/kernels/rope_embedding.py +++ b/atomgpt/inverse_models/kernels/rope_embedding.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import triton import triton.language as tl import torch diff --git a/atomgpt/inverse_models/kernels/swiglu.py b/atomgpt/inverse_models/kernels/swiglu.py index 6614e5d..2856651 100644 --- a/atomgpt/inverse_models/kernels/swiglu.py +++ b/atomgpt/inverse_models/kernels/swiglu.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import triton import triton.language as tl import torch diff --git a/atomgpt/inverse_models/kernels/utils.py b/atomgpt/inverse_models/kernels/utils.py index 1f2085d..9f56d20 100644 --- a/atomgpt/inverse_models/kernels/utils.py +++ b/atomgpt/inverse_models/kernels/utils.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import triton MAX_FUSED_SIZE = 65536 next_power_of_2 = triton.next_power_of_2 diff --git a/atomgpt/inverse_models/llama.py b/atomgpt/inverse_models/llama.py index 0b6b6d7..3d60ca6 100644 --- a/atomgpt/inverse_models/llama.py +++ b/atomgpt/inverse_models/llama.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import torch from typing import Optional, Tuple, List, Union from torch.nn.functional import scaled_dot_product_attention diff --git a/atomgpt/inverse_models/loader.py b/atomgpt/inverse_models/loader.py index 93ff812..33c5b2f 100644 --- a/atomgpt/inverse_models/loader.py +++ b/atomgpt/inverse_models/loader.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - from atomgpt.inverse_models.llama import FastLlamaModel, logger from atomgpt.inverse_models.mistral import FastMistralModel from atomgpt.inverse_models.qwen2 import FastQwen2Model diff --git a/atomgpt/inverse_models/mapper.py b/atomgpt/inverse_models/mapper.py index b4fbe57..3e6c777 100644 --- a/atomgpt/inverse_models/mapper.py +++ b/atomgpt/inverse_models/mapper.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - __all__ = [ "INT_TO_FLOAT_MAPPER", "FLOAT_TO_INT_MAPPER", diff --git a/atomgpt/inverse_models/mistral.py b/atomgpt/inverse_models/mistral.py index 762ecbc..cdb6175 100644 --- a/atomgpt/inverse_models/mistral.py +++ b/atomgpt/inverse_models/mistral.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - from atomgpt.inverse_models.llama import * import os from atomgpt.inverse_models._utils import __version__ diff --git a/atomgpt/inverse_models/qwen2.py b/atomgpt/inverse_models/qwen2.py index 76fe31a..fb0b41a 100644 --- a/atomgpt/inverse_models/qwen2.py +++ b/atomgpt/inverse_models/qwen2.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - from .llama import * from .mistral import FastMistralModel import os diff --git a/atomgpt/inverse_models/tokenizer_utils.py b/atomgpt/inverse_models/tokenizer_utils.py index 1cbe49b..7f29621 100644 --- a/atomgpt/inverse_models/tokenizer_utils.py +++ b/atomgpt/inverse_models/tokenizer_utils.py @@ -1,17 +1,3 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- from transformers import AutoTokenizer from transformers.convert_slow_tokenizer import convert_slow_tokenizer from transformers import PreTrainedTokenizerFast diff --git a/atomgpt/scripts/bnbb.py b/atomgpt/scripts/bnbb.py deleted file mode 100644 index ce21e34..0000000 --- a/atomgpt/scripts/bnbb.py +++ /dev/null @@ -1,11 +0,0 @@ -import bitsandbytes -import bitsandbytes as bnb -get_ptr = bnb.functional.get_ptr -import ctypes -import torch -cdequantize_blockwise_fp32 = bnb.functional.lib.cdequantize_blockwise_fp32 -cdequantize_blockwise_fp16_nf4 = bnb.functional.lib.cdequantize_blockwise_fp16_nf4 -cdequantize_blockwise_bf16_nf4 = bnb.functional.lib.cdequantize_blockwise_bf16_nf4 -cgemm_4bit_inference_naive_fp16 = bnb.functional.lib.cgemm_4bit_inference_naive_fp16 -cgemm_4bit_inference_naive_bf16 = bnb.functional.lib.cgemm_4bit_inference_naive_bf16 - diff --git a/atomgpt/scripts/dpo.py b/atomgpt/scripts/dpo.py deleted file mode 100644 index b7c7305..0000000 --- a/atomgpt/scripts/dpo.py +++ /dev/null @@ -1,120 +0,0 @@ -# Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -try: - from transformers.utils.notebook import ( - IntervalStrategy, - NotebookTrainingTracker, - NotebookProgressCallback, - ) - HAS_NOTEBOOK = True -except: - HAS_NOTEBOOK = False -pass - -DPOTrainer_metrics = [ - "rewards/chosen", - "rewards/rejected", - "rewards/accuracies", - "rewards/margins", - "logps/rejected", - "logps/chosen", - "logits/rejected", - "logits/chosen", -] -set_DPOTrainer_metrics = frozenset(DPOTrainer_metrics) - - -def NotebookProgressCallback_on_train_begin(self, args, state, control, **kwargs): - self.first_column = "Epoch" if args.evaluation_strategy == IntervalStrategy.EPOCH else "Step" - self.training_loss = 0 - self.last_log = 0 - column_names = [self.first_column] + ["Training Loss"] - if args.evaluation_strategy != IntervalStrategy.NO: - column_names.append("Validation Loss") - column_names += [x.replace("/", " / ") for x in DPOTrainer_metrics] - self.training_tracker = NotebookTrainingTracker(state.max_steps, column_names) -pass - - -def NotebookProgressCallback_on_log(self, args, state, control, logs=None, **kwargs): - # Only for when there is no evaluation - if args.evaluation_strategy == IntervalStrategy.NO and "loss" in logs: - values = {"Training Loss": logs["loss"]} - for metric in DPOTrainer_metrics: - values[metric.replace("/", " / ")] = logs[metric] - pass - # First column is necessarily Step since we're not in epoch eval strategy - values["Step"] = state.global_step - self.training_tracker.write_line(values) - pass -pass - - -def NotebookTrainingTracker_write_line(self, values): - """ - Write the values in the inner table. - - Args: - values (`Dict[str, float]`): The values to display. 
- """ - if self.inner_table is None: - self.inner_table = [list(values.keys()), list(values.values())] - else: - columns = self.inner_table[0] - new_values = {} - for key, value in values.items(): - lowered = key.lower() - if lowered in set_DPOTrainer_metrics: - new_values[lowered.replace("/", " / ")] = value - else: - new_values[key] = value - pass - values = new_values - - self.inner_table[0] = columns - if len(self.inner_table) > 1: - last_values = self.inner_table[-1] - first_column = self.inner_table[0][0] - if last_values[0] != values[first_column]: - # write new line - self.inner_table.append([values[c] if c in values else "No Log" for c in columns]) - else: - # update last line - new_values = values - for c in columns: - if c not in new_values.keys(): - new_values[c] = last_values[columns.index(c)] - self.inner_table[-1] = [new_values[c] for c in columns] - else: - # Edit for evaluation purposes - self.inner_table.append([values[c] if c in values else 0 for c in columns]) - pass - pass -pass - - -def PatchDPOTrainer(): - if HAS_NOTEBOOK: - from transformers.trainer import is_in_notebook - if is_in_notebook(): - # Patch DPO notebook printing - NotebookTrainingTracker.write_line = NotebookTrainingTracker_write_line - from transformers.trainer import DEFAULT_PROGRESS_CALLBACK - DEFAULT_PROGRESS_CALLBACK.on_train_begin = NotebookProgressCallback_on_train_begin - DEFAULT_PROGRESS_CALLBACK.on_log = NotebookProgressCallback_on_log - pass - pass -pass - diff --git a/atomgpt/scripts/finetune.py b/atomgpt/scripts/finetune.py deleted file mode 100644 index ae997ec..0000000 --- a/atomgpt/scripts/finetune.py +++ /dev/null @@ -1,359 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms - -# from 
tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) - -IGNORE_INDEX = -100 -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="sample", - 
model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=16, - max_length=128, - num_epochs=500, - learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - model = transformers.AutoModelForCausalLM.from_pretrained(model_name) - - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = 
get_crystal_string(atoms) - if i["jid"] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i["jid"]) - elif i["jid"] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i["jid"]) - elif i["jid"] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i["jid"]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Linear( - model.config.hidden_size, 1 - ) # Single output for regression - model.to(device) - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - 
targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - test_dataloader = val_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("benchmark_file", benchmark_file) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = 
criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - run_atomgpt(benchmark_file=benchmark_file) diff --git 
a/atomgpt/scripts/finetune.py.bak b/atomgpt/scripts/finetune.py.bak deleted file mode 100644 index ae997ec..0000000 --- a/atomgpt/scripts/finetune.py.bak +++ /dev/null @@ -1,359 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) - -IGNORE_INDEX = -100 -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = 
tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="sample", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=16, - max_length=128, - num_epochs=500, - learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - model = transformers.AutoModelForCausalLM.from_pretrained(model_name) - - tokenizer = 
transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i["jid"] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i["jid"]) - elif i["jid"] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i["jid"]) - elif i["jid"] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i["jid"]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Linear( - model.config.hidden_size, 1 - ) # Single output for regression - model.to(device) - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - test_dataloader = val_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("benchmark_file", benchmark_file) - 
print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - train_history.append(train_loss) - 
val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - run_atomgpt(benchmark_file=benchmark_file) diff --git a/atomgpt/scripts/finetune1.py b/atomgpt/scripts/finetune1.py deleted file mode 100644 index 85eb1a6..0000000 --- a/atomgpt/scripts/finetune1.py +++ /dev/null @@ -1,548 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from tqdm import tqdm -import time -import json -import zipfile - 
-os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -#torch.set_default_dtype(torch.float16) -IGNORE_INDEX = -100 -torch.cuda.empty_cache() - - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string_old(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.1f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - 
self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = 
list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - #load_in_8bit=False, - #torch_dtype=torch.float16, - #load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - torch.nn.Linear(latent_dim, latent_dim), - #torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, 1), - ) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = 
DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - #print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # 
predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # 
print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = 
time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name="meta-llama/Llama-2-7b-hf", - model_name="google/flan-t5-small" - model_name="google/flan-t5-base" - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name="gpt2" - model_name="gpt2-medium" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - num_epochs=500, - batch_size=2 - ) diff --git a/atomgpt/scripts/finetune1.py.bak_0.12843 b/atomgpt/scripts/finetune1.py.bak_0.12843 deleted file mode 100644 index 968ba64..0000000 --- a/atomgpt/scripts/finetune1.py.bak_0.12843 +++ /dev/null @@ -1,547 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -#torch.set_default_dtype(torch.float16) -IGNORE_INDEX = -100 -torch.cuda.empty_cache() - - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # 
default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string_old(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.1f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - 
root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - #load_in_8bit=False, - #torch_dtype=torch.float16, - #load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = 
transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, 1), - ) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = 
DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - #print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # 
predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # 
print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = 
time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name="meta-llama/Llama-2-7b-hf", - model_name="google/flan-t5-small" - model_name="google/flan-t5-base" - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name="gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - num_epochs=300, - batch_size=16 - ) diff --git a/atomgpt/scripts/finetune1.py.bak_0.139 b/atomgpt/scripts/finetune1.py.bak_0.139 deleted file mode 100644 index 8306582..0000000 --- a/atomgpt/scripts/finetune1.py.bak_0.139 +++ /dev/null @@ -1,532 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) - -IGNORE_INDEX = -100 -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in 
jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string_old(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.1f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=16, - max_length=512, - num_epochs=500, - 
latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - # load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and 
corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, 1), - ) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 
256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - test_dataloader = val_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # 
predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - 
t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - 
.logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - run_atomgpt( - # model_name="gpt2", - model_name="google/flan-t5-small", - # model_name="meta-llama/Llama-2-7b-hf", - benchmark_file=benchmark_file, - num_epochs=300, - ) diff --git a/atomgpt/scripts/finetune1.py.bak_0.146 b/atomgpt/scripts/finetune1.py.bak_0.146 deleted file mode 100644 index 380e837..0000000 --- a/atomgpt/scripts/finetune1.py.bak_0.146 +++ /dev/null @@ -1,384 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = 
str(":4096:8") -torch.use_deterministic_algorithms(True) - -IGNORE_INDEX = -100 -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - #default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) -def get_crystal_string_old(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.1f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - 
padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=16, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - #dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag="jid" - if "jid" in dft_3d[0]: - id_tag="jid" - else: - id_tag="id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - model = transformers.AutoModelForCausalLM.from_pretrained(model_name) - - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example 
regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, latent_dim),torch.nn.Linear( latent_dim, latent_dim), torch.nn.Linear( latent_dim, 1) ) - #model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 
256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - test_dataloader = val_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("benchmark_file", benchmark_file) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - 
optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - 
+ str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - run_atomgpt(benchmark_file=benchmark_file,num_epochs=300) diff --git a/atomgpt/scripts/finetune1a.py b/atomgpt/scripts/finetune1a.py deleted file mode 100644 index 64ccf1b..0000000 --- a/atomgpt/scripts/finetune1a.py +++ /dev/null @@ -1,384 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) - -IGNORE_INDEX = -100 -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - #default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) -def get_crystal_string_old(atoms): - lengths = atoms.lattice.abc # 
structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.1f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=16, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = 
GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - #dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag="jid" - if "jid" in dft_3d[0]: - id_tag="jid" - else: - id_tag="id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - model = transformers.AutoModelForCausalLM.from_pretrained(model_name) - - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - 
val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, latent_dim),torch.nn.Linear( latent_dim, 1) ) - #model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - 
ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - test_dataloader = val_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("benchmark_file", benchmark_file) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), 
targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - run_atomgpt(benchmark_file=benchmark_file,num_epochs=300) diff --git 
a/atomgpt/scripts/finetune2.py b/atomgpt/scripts/finetune2.py deleted file mode 100644 index 88433b7..0000000 --- a/atomgpt/scripts/finetune2.py +++ /dev/null @@ -1,358 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) - -IGNORE_INDEX = -100 -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - - return crystal_str -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - 
self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=16, - max_length=128, - num_epochs=500, - learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - model = transformers.AutoModelForCausalLM.from_pretrained(model_name) - - tokenizer = 
transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i["jid"] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i["jid"]) - elif i["jid"] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i["jid"]) - elif i["jid"] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i["jid"]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.Linear( 256, 1) ) - #model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - test_dataloader = val_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - 
epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("benchmark_file", benchmark_file) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - print( - "Epoch, train loss, val loss, 
train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze(0) # .squeeze(0) - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - run_atomgpt(benchmark_file=benchmark_file,num_epochs=300) diff --git a/atomgpt/scripts/finetune3.py b/atomgpt/scripts/finetune3.py deleted file mode 100644 index 968ba64..0000000 --- a/atomgpt/scripts/finetune3.py +++ /dev/null @@ -1,547 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms 
import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -#torch.set_default_dtype(torch.float16) -IGNORE_INDEX = -100 -torch.cuda.empty_cache() - - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string_old(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + "\n" + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - return crystal_str - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.1f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - return crystal_str - - -# Define a custom dataset class for 
regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = 
"id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - #load_in_8bit=False, - #torch_dtype=torch.float16, - #load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, 1), - ) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = 
DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - #print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # 
predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # 
print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = 
time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name="meta-llama/Llama-2-7b-hf", - model_name="google/flan-t5-small" - model_name="google/flan-t5-base" - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name="gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - num_epochs=300, - batch_size=16 - ) diff --git a/atomgpt/scripts/finetune4.py b/atomgpt/scripts/finetune4.py deleted file mode 100644 index ab31042..0000000 --- a/atomgpt/scripts/finetune4.py +++ /dev/null @@ -1,564 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from tqdm import tqdm -import time -import json -import zipfile - -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -#torch.set_default_dtype(torch.float16) -IGNORE_INDEX = -100 -torch.cuda.empty_cache() - - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - 
help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - - - -def get_crystal_string_1225(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "\n" - +" ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) +" "+" ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - #extra=atoms.composition.reduced_formula - #crystal_str+=" "+extra - return crystal_str - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "\n" - +" ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) +" "+" ".join(["{0:.3f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - extra=atoms.composition.reduced_formula - crystal_str+="\n"+extra - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - 
benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=1024, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - #learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - val_ids = list(bench["val"].keys()) - test_ids = list(bench["test"].keys()) - - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - #load_in_8bit=False, - #torch_dtype=torch.float16, - #load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - 
tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - print("test_texts:",test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": ""}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, latent_dim), - torch.nn.Linear(latent_dim, 1), - ) - if pretrained_path!="": - model.load_state_dict(torch.load(pretrained_path,map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - 
max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - #scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - #print('train',predictions,targets) - loss.backward() - 
optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - 
.mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + 
str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name="meta-llama/Llama-2-7b-hf", - model_name="google/flan-t5-base" - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name="google/flan-t5-small" - model_name="google-t5/t5-small" - model_name="gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - #num_epochs=300, - #pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - #pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyz1", - batch_size=16 - ) diff --git a/atomgpt/scripts/finetune5.py b/atomgpt/scripts/finetune5.py deleted file mode 100644 index bfa7d28..0000000 --- a/atomgpt/scripts/finetune5.py +++ /dev/null @@ -1,644 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error -from describe import atoms_describer -# from tqdm import tqdm -import time -import json -import zipfile - - -### -#from robocrys import StructureCondenser, StructureDescriber - - -def get_robo(structure=None): -#structure = Structure.from_file("POSCAR") # other file formats also supported - -# alternatively, uncomment the lines below to use the MPRester object -# to fetch structures from the 
Materials Project database -# from pymatgen import MPRester -# structure = MPRester(API_KEY=None).get_structure_by_material_id("mp-856") - - condenser = StructureCondenser() - describer = StructureDescriber() - - #condensed_structure = condenser.condense_structure(structure) - #description = describer.describe(condensed_structure) - description = describer.describe(structure) - print(description) - return description -## -os.environ["WANDB_ANONYMOUS"] = "must" -np.random.seed(42) -torch.manual_seed(42) -torch.cuda.manual_seed_all(42) -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(42) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -#torch.set_default_dtype(torch.float16) -IGNORE_INDEX = -100 -torch.cuda.empty_cache() - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -scale=torch.tensor(100) #.to(device) -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - #default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - - - -def get_crystal_string_1225(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - +" ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) +" "+" ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - extra=str(atoms.num_atoms)+"\n"+atoms.composition.reduced_formula - #crystal_str+=" "+extra - extra+="\n"+crystal_str - return extra - #return crystal_str 
- -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "\n" - +" ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) +" "+" ".join(["{0:.3f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - #extra=str(atoms.num_atoms)+"\n"+atoms.composition.reduced_formula - #crystal_str+=" "+extra - #extra+="\n"+crystal_str - #return extra - #extra=atoms.composition.reduced_formula - #crystal_str+="\n"+extra+"\n"+atoms.spacegroup()+"\n" - #return crystal_str -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - +" ".join([str(int(x)) for x in angles]) - + "@\n" - + "\n".join( - [ - str(t) +" "+" ".join(["{0:.3f}".format(x) for x in c])+"&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - #extra=str(atoms.num_atoms)+"\n"+atoms.composition.reduced_formula - #crystal_str+=" "+extra - #extra+="\n"+crystal_str - #return extra - #extra=atoms.composition.reduced_formula - #crystal_str+="\n"+extra+"\n"+atoms.spacegroup()+"\n" - crystal_str = atoms_describer(atoms)+"\n*\n"+crystal_str - return crystal_str - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", 
- max_length=self.max_length, - padding="max_length", - truncation=True, - ) - #torch.tensor(inputs*10,dtype=inputs.dtype) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - #learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if 'val' in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total",len(dft_3d)) - print("test_ids",len(test_ids)) - print("val_ids",len(val_ids)) - 
print("train_ids",len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - #load_in_8bit=False, - #torch_dtype=torch.float16, - #load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - print("test_texts:",len(test_texts)) - print("test_texts:",test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - #torch.nn.Linear(model.config.hidden_size, 1), - torch.nn.Linear(model.config.hidden_size, latent_dim), - #torch.nn.Transformer(d_model=latent_dim, nhead=1, num_encoder_layers=1, num_decoder_layers=1), - # torch.nn.Linear(latent_dim, latent_dim), - #torch.nn.Linear(latent_dim, latent_dim), - #torch.nn.ReLU(), - #torch.nn.LeakyReLU(), - #torch.nn.Dropout(p=0.2), - #torch.nn.TransformerEncoder(torch.nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4), num_layers=2), - torch.nn.Linear(latent_dim, 1), - ) - if pretrained_path!="": - model.load_state_dict(torch.load(pretrained_path,map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator 
- train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - #scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001) - #scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions 
= ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - #print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = 
time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - 
.logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name="google/flan-t5-small" - model_name="google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name="google-t5/t5-small" - model_name="xlnet/xlnet-base-cased" - model_name="afmck/testing-llama-tiny" - model_name="EleutherAI/gpt-neo-125m" - model_name="openai-community/gpt2-medium" - model_name="meta-llama/Llama-2-7b-hf" - model_name="stas/tiny-random-llama-2" - model_name="ahxt/llama2_xs_460M_experimental" - model_name="gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - #num_epochs=300, - #pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - #pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt", - batch_size=16, - latent_dim=1024, - num_epochs=5000, - #batch_size=16 - ) diff --git a/atomgpt/scripts/finetune6.py b/atomgpt/scripts/finetune6.py deleted file mode 100644 index 6323ac1..0000000 --- a/atomgpt/scripts/finetune6.py +++ /dev/null @@ -1,659 +0,0 @@ 
-"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error -from describe import atoms_describer - -# from tqdm import tqdm -import time -import json -import zipfile - - -### -# from robocrys import StructureCondenser, StructureDescriber - - -def get_robo(structure=None): - # structure = Structure.from_file("POSCAR") # other file formats also supported - - # alternatively, uncomment the lines below to use the MPRester object - # to fetch structures from the Materials Project database - # from pymatgen import MPRester - # structure = MPRester(API_KEY=None).get_structure_by_material_id("mp-856") - - condenser = StructureCondenser() - describer = StructureDescriber() - - # condensed_structure = condenser.condense_structure(structure) - # description = describer.describe(condensed_structure) - description = describer.describe(structure) - print(description) - return description - - -## -os.environ["WANDB_ANONYMOUS"] = "must" -random_seed = 42 -random.seed(random_seed) -torch.manual_seed(random_seed) -np.random.seed(random_seed) -torch.cuda.manual_seed_all(random_seed) -try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) -except ImportError: - pass -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(random_seed) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -IGNORE_INDEX = -100 -# torch.cuda.empty_cache() -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - 
"--benchmark_file", - default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def get_crystal_string_1225(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - extra = str(atoms.num_atoms) + "\n" + atoms.composition.reduced_formula - # crystal_str+=" "+extra - extra += "\n" + crystal_str - return extra - # return crystal_str - - -def get_crystal_string(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - # extra=str(atoms.num_atoms)+"\n"+atoms.composition.reduced_formula - # crystal_str+=" "+extra - # extra+="\n"+crystal_str - # return extra - # extra=atoms.composition.reduced_formula - # crystal_str+="\n"+extra+"\n"+atoms.spacegroup()+"\n" - # return crystal_str - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - + " ".join([str(int(x)) for x 
in angles]) - + "@\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) + "&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - # extra=str(atoms.num_atoms)+"\n"+atoms.composition.reduced_formula - # crystal_str+=" "+extra - # extra+="\n"+crystal_str - # return extra - # extra=atoms.composition.reduced_formula - # crystal_str+="\n"+extra+"\n"+atoms.spacegroup()+"\n" - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - # torch.tensor(inputs*10,dtype=inputs.dtype) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file 
= "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp 
= [] - - for i in dft_3d: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - print("test_texts:", len(test_texts)) - print("test_texts:", test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - # torch.nn.Linear(model.config.hidden_size, 1), - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear( latent_dim,256), - # torch.nn.Transformer(d_model=latent_dim, nhead=1, num_encoder_layers=1, 
num_decoder_layers=1), - # torch.nn.Linear(latent_dim, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - # torch.nn.ReLU(), - # torch.nn.LeakyReLU(), - # torch.nn.Dropout(p=0.2), - # torch.nn.TransformerEncoder(torch.nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4), num_layers=2), - # torch.nn.Linear(256, 1), - torch.nn.Linear(latent_dim, 1), - ) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, 
eta_min=0.001) - # scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - 
decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, 
ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - 
model_name = "google/flan-t5-small" - model_name = "google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name = "google-t5/t5-small" - model_name = "xlnet/xlnet-base-cased" - model_name = "afmck/testing-llama-tiny" - model_name = "EleutherAI/gpt-neo-125m" - model_name = "openai-community/gpt2-medium" - model_name = "meta-llama/Llama-2-7b-hf" - model_name = "stas/tiny-random-llama-2" - model_name = "ahxt/llama2_xs_460M_experimental" - model_name = "gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - # num_epochs=300, - # pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - # pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt6", - batch_size=16, - latent_dim=1024, - num_epochs=5000, - # batch_size=16 - ) diff --git a/atomgpt/scripts/finetune7.py b/atomgpt/scripts/finetune7.py deleted file mode 100644 index 7d151bc..0000000 --- a/atomgpt/scripts/finetune7.py +++ /dev/null @@ -1,784 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from describe import atoms_describer -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict - -from tqdm import tqdm -import time -import json -import zipfile -from transformers import GPT2Config, GPT2Model, GPT2Tokenizer - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = 
argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def atoms_describer( - atoms=[], - xrd_peaks=5, - xrd_round=1, - cutoff=4, - take_n_bomds=2, - include_spg=True, -): - """Describe an atomic structure.""" - if include_spg: - spg = Spacegroup3D(atoms) - theta, d_hkls, intens = XRD().simulate(atoms=(atoms)) - # x = atoms.atomwise_angle_and_radial_distribution() - # bond_distances = {} - # for i, j in x[-1]["different_bond"].items(): - # bond_distances[i.replace("_", "-")] = ", ".join( - # map(str, (sorted(list(set([round(jj, 2) for jj in j]))))) - # ) - dists = defaultdict(list) - elements = atoms.elements - for i in atoms.get_all_neighbors(r=cutoff): - for j in i: - key = "-".join(sorted([elements[j[0]], elements[j[1]]])) - dists[key].append(j[2]) - bond_distances = {} - for i, j in dists.items(): - dist = sorted(set([round(k, 2) for k in j])) - if len(dist) >= take_n_bomds: - dist = dist[0:take_n_bomds] - bond_distances[i] = ", ".join(map(str, dist)) - fracs = {} - for i, j in (atoms.composition.atomic_fraction).items(): - fracs[i] = round(j, 3) - info = {} - chem_info = { - "atomic_formula": atoms.composition.reduced_formula, - "prototype": atoms.composition.prototype, - "molecular_weight": round(atoms.composition.weight / 2, 2), - "atomic_fraction": (fracs), - "atomic_X": ", ".join( - map(str, [Specie(s).X for s in atoms.uniq_species]) - ), - "atomic_Z": ", ".join( - map(str, [Specie(s).Z for s in atoms.uniq_species]) - ), - } - struct_info = { - "lattice_parameters": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.abc]) - ), - "lattice_angles": ", ".join( - map(str, [round(j, 2) for j in 
atoms.lattice.angles]) - ), - # "spg_number": spg.space_group_number, - # "spg_symbol": spg.space_group_symbol, - "top_k_xrd_peaks": ", ".join( - map( - str, - sorted(list(set([round(i, xrd_round) for i in theta])))[ - 0:xrd_peaks - ], - ) - ), - "density": round(atoms.density, 3), - # "crystal_system": spg.crystal_system, - # "point_group": spg.point_group_symbol, - # "wyckoff": ", ".join(list(set(spg._dataset["wyckoffs"]))), - "bond_distances": bond_distances, - # "natoms_primitive": spg.primitive_atoms.num_atoms, - # "natoms_conventional": spg.conventional_standard_structure.num_atoms, - } - if include_spg: - struct_info["spg_number"] = spg.space_group_number - struct_info["spg_symbol"] = spg.space_group_symbol - struct_info["crystal_system"] = spg.crystal_system - struct_info["point_group"] = spg.point_group_symbol - struct_info["wyckoff"] = ", ".join(list(set(spg._dataset["wyckoffs"]))) - struct_info["natoms_primitive"] = spg.primitive_atoms.num_atoms - struct_info[ - "natoms_conventional" - ] = spg.conventional_standard_structure.num_atoms - info["chemical_info"] = chem_info - info["structure_info"] = struct_info - line = "The number of atoms are: " + str( - atoms.num_atoms - ) # +"., The elements are: "+",".join(atoms.elements)+". " - for i, j in info.items(): - if not isinstance(j, dict): - line += "The " + i + " is " + j + ". " - else: - # print("i",i) - # print("j",j) - for ii, jj in j.items(): - tmp = "" - if isinstance(jj, dict): - for iii, jjj in jj.items(): - tmp += iii + ": " + str(jjj) + " " - else: - tmp = jj - line += "The " + ii + " is " + str(tmp) + ". 
" - return line - - -os.environ["WANDB_ANONYMOUS"] = "must" -random_seed = 42 -random.seed(random_seed) -torch.manual_seed(random_seed) -np.random.seed(random_seed) -torch.cuda.manual_seed_all(random_seed) -try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) -except ImportError: - pass -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(random_seed) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -IGNORE_INDEX = -100 -# torch.cuda.empty_cache() - - -def get_crystal_string_1225(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - extra = str(atoms.num_atoms) + "\n" + atoms.composition.reduced_formula - # crystal_str+=" "+extra - extra += "\n" + crystal_str - return extra - # return crystal_str - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - + " ".join([str(int(x)) for x in angles]) - + "@\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) + "&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = 
max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - -config = GPT2Config.from_pretrained("gpt2") -class ForcePredictor(torch.nn.Module): - def __init__(self, gpt2_model): - super(ForcePredictor, self).__init__() - self.gpt2 = gpt2_model - self.linear = torch.nn.Linear(config.n_embd, 1) # Assuming force is a 3D vector - - def forward(self, input_ids): - outputs = self.gpt2(input_ids) - last_hidden_states = outputs.last_hidden_state - force_pred = self.linear(last_hidden_states[:, -1, :]) - #print("force_pred",outputs.keys()) - return force_pred - - -class AtomGPTPredictorLMhead(torch.nn.Module): - def __init__( - self, model_name=None, n_out=1, latent_dim=1024, tokenizer="" - ): - super(AtomGPTPredictorLMhead, self).__init__() - self.model_name = model_name - self.n_out = n_out - self.latent_dim = latent_dim - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - #config = GPT2Config.from_pretrained("gpt2") - #model = GPT2Model(config) - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.config = model.config - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - torch.nn.Linear(latent_dim, n_out), - ) - self.model = model - - def forward(self, input_ids): - #outputs = self.model(input_ids) - if "t5" in model_name: - outputs = self.model(input_ids, decoder_input_ids=input_ids) - else: - outputs = 
self.model(input_ids) - return outputs - - -class AtomGPTPredictorHiddenFeats(torch.nn.Module): - def __init__(self, model_name=None, n_out=1, tokenizer=""): - super(AtomGPTPredictorHiddenFeats, self).__init__() - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.model = model - self.config = self.model.config - self.global_out = torch.nn.Linear(self.config.n_embd, n_out) - - def forward(self, input_ids): - outputs = self.model(input_ids) - print('outputs',outputs.keys()) - last_hidden_states = outputs.last_hidden_state - pred = self.linear(last_hidden_states[:, -1, :]) - return pred - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = 
dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - #print('Non tokenizer format') - #tokenizer = GPT2Tokenizer.from_pretrained(model_name) - config = GPT2Config.from_pretrained("gpt2") - model = GPT2Model(config) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - model.resize_token_embeddings(len(tokenizer)) - model=ForcePredictor(model) - #model=AtomGPTPredictorHiddenFeats(model_name=model_name, tokenizer=tokenizer) - #model = AtomGPTPredictorLMhead(model_name=model_name, tokenizer=tokenizer) - #tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - 
test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in tqdm(dft_3d): - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - print("test_texts:", len(test_texts)) - print("test_texts:", test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = 
transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001) - # scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = 
input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - 
dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" 
in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - # box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]] - # coords = [[0, 0, 0], [0.25, 0.2, 0.25]] - # elements = ["Si", "Si"] - # Si = Atoms(lattice_mat=box, coords=coords, elements=elements) - # tmp=atoms_describer(Si) - # print(tmp) - # import sys - # sys.exit() - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "google/flan-t5-small" - model_name = "google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name = "xlnet/xlnet-base-cased" - model_name = "afmck/testing-llama-tiny" - model_name = "EleutherAI/gpt-neo-125m" - model_name = "meta-llama/Llama-2-7b-hf" - model_name = "stas/tiny-random-llama-2" - model_name = "ahxt/llama2_xs_460M_experimental" - model_name = "google-t5/t5-small" - model_name = "openai-community/gpt2-medium" - model_name = "gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - # num_epochs=300, - # 
pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - # pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt6", - #batch_size=5, - batch_size=16, - latent_dim=1024, - num_epochs=5000, - # batch_size=16 - ) diff --git a/atomgpt/scripts/finetune7a.py.alignn b/atomgpt/scripts/finetune7a.py.alignn deleted file mode 100644 index 8fcfed0..0000000 --- a/atomgpt/scripts/finetune7a.py.alignn +++ /dev/null @@ -1,877 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from describe import atoms_describer -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict - -from tqdm import tqdm -import time -import json -import zipfile -from transformers import GPT2Config, GPT2Model, GPT2Tokenizer -from jarvis.core.atoms import Atoms - -#from alignn.graphs import Graph - -#from alignn.pretrained import get_figshare_model - -import torch - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") - - -#model_alignn = get_figshare_model() -#model_alignn.to(device) - - - -def get_val(model, g, lg): - activation = {} - def getActivation(name): - # the hook signature - def hook(model, input, output): - activation[name] = output.detach() - return hook - h = model.readout.register_forward_hook(getActivation("readout")) - out = model([g, lg]) - h.remove() - return activation["readout"][0] - -def 
get_alignn_feats(model_alignn='',atoms=''): - g, lg = Graph.atom_dgl_multigraph(atoms) - x = get_val(model_alignn, g.to(device), lg.to(device)) - return x - -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def atoms_describer( - atoms=[], - xrd_peaks=5, - xrd_round=1, - cutoff=4, - take_n_bomds=2, - include_spg=True, -): - """Describe an atomic structure.""" - if include_spg: - spg = Spacegroup3D(atoms) - theta, d_hkls, intens = XRD().simulate(atoms=(atoms)) - # x = atoms.atomwise_angle_and_radial_distribution() - # bond_distances = {} - # for i, j in x[-1]["different_bond"].items(): - # bond_distances[i.replace("_", "-")] = ", ".join( - # map(str, (sorted(list(set([round(jj, 2) for jj in j]))))) - # ) - dists = defaultdict(list) - elements = atoms.elements - for i in atoms.get_all_neighbors(r=cutoff): - for j in i: - key = "-".join(sorted([elements[j[0]], elements[j[1]]])) - dists[key].append(j[2]) - bond_distances = {} - for i, j in dists.items(): - dist = sorted(set([round(k, 2) for k in j])) - if len(dist) >= take_n_bomds: - dist = dist[0:take_n_bomds] - bond_distances[i] = ", ".join(map(str, dist)) - fracs = {} - for i, j in (atoms.composition.atomic_fraction).items(): - fracs[i] = round(j, 3) - info = {} - chem_info = { - "atomic_formula": atoms.composition.reduced_formula, - "prototype": atoms.composition.prototype, - "molecular_weight": round(atoms.composition.weight / 2, 2), - "atomic_fraction": (fracs), - "atomic_X": ", ".join( - map(str, [Specie(s).X for s in atoms.uniq_species]) - ), - "atomic_Z": ", ".join( - map(str, [Specie(s).Z for s in atoms.uniq_species]) - ), - } - struct_info = { - 
"lattice_parameters": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.abc]) - ), - "lattice_angles": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.angles]) - ), - # "spg_number": spg.space_group_number, - # "spg_symbol": spg.space_group_symbol, - "top_k_xrd_peaks": ", ".join( - map( - str, - sorted(list(set([round(i, xrd_round) for i in theta])))[ - 0:xrd_peaks - ], - ) - ), - "density": round(atoms.density, 3), - # "crystal_system": spg.crystal_system, - # "point_group": spg.point_group_symbol, - # "wyckoff": ", ".join(list(set(spg._dataset["wyckoffs"]))), - "bond_distances": bond_distances, - # "natoms_primitive": spg.primitive_atoms.num_atoms, - # "natoms_conventional": spg.conventional_standard_structure.num_atoms, - } - if include_spg: - struct_info["spg_number"] = spg.space_group_number - struct_info["spg_symbol"] = spg.space_group_symbol - struct_info["crystal_system"] = spg.crystal_system - struct_info["point_group"] = spg.point_group_symbol - struct_info["wyckoff"] = ", ".join(list(set(spg._dataset["wyckoffs"]))) - struct_info["natoms_primitive"] = spg.primitive_atoms.num_atoms - struct_info[ - "natoms_conventional" - ] = spg.conventional_standard_structure.num_atoms - info["chemical_info"] = chem_info - info["structure_info"] = struct_info - line = "The number of atoms are: " + str( - atoms.num_atoms - ) # +"., The elements are: "+",".join(atoms.elements)+". " - for i, j in info.items(): - if not isinstance(j, dict): - line += "The " + i + " is " + j + ". " - else: - # print("i",i) - # print("j",j) - for ii, jj in j.items(): - tmp = "" - if isinstance(jj, dict): - for iii, jjj in jj.items(): - tmp += iii + ": " + str(jjj) + " " - else: - tmp = jj - line += "The " + ii + " is " + str(tmp) + ". 
" - return line - - -def set_seed(): - os.environ["WANDB_ANONYMOUS"] = "must" - random_seed = 42 - random.seed(random_seed) - torch.manual_seed(random_seed) - np.random.seed(random_seed) - torch.cuda.manual_seed_all(random_seed) - try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) - except ImportError: - pass - torch.backends.cudnn.deterministic = True - torch.backends.cudnn.benchmark = False - os.environ["PYTHONHASHSEED"] = str(random_seed) - os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - torch.use_deterministic_algorithms(True) - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - + " ".join([str(int(x)) for x in angles]) - + "@\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) + "&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -class AtomGPTDataset(Dataset): - def __init__( - self, - texts=[], - targets=[], - ids=[], - extra_feats=[], - tokenizer="", - max_length=128, - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - self.extra_feats = extra_feats - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - if self.extra_feats: - feats = self.extra_feats[idx] - inputs = torch.cat(inputs, feats) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -class ForcePredictor(torch.nn.Module): - config = 
GPT2Config.from_pretrained("gpt2") - - def __init__(self, gpt2_model): - super(ForcePredictor, self).__init__() - self.gpt2 = gpt2_model - self.linear = torch.nn.Linear( - config.n_embd, 1 - ) # Assuming force is a 3D vector - - def forward(self, input_ids): - outputs = self.gpt2(input_ids) - last_hidden_states = outputs.last_hidden_state - force_pred = self.linear(last_hidden_states[:, -1, :]) - # print("force_pred",outputs.keys()) - return force_pred - - -class AtomGPTPredictorLMhead(torch.nn.Module): - def __init__( - self, model_name=None, n_out=1, latent_dim=1024, tokenizer="" - ): - - super(AtomGPTPredictorLMhead, self).__init__() - # random_seed = 42 - # random.seed(random_seed) - # torch.manual_seed(random_seed) - # np.random.seed(random_seed) - # torch.cuda.manual_seed_all(random_seed) - # try: - # import torch_xla.core.xla_model as xm - # xm.set_rng_state(random_seed) - # except ImportError: - # pass - # torch.backends.cudnn.deterministic = True - # torch.backends.cudnn.benchmark = False - # os.environ["PYTHONHASHSEED"] = str(random_seed) - # os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - # torch.use_deterministic_algorithms(True) - - self.model_name = model_name - self.n_out = n_out - self.latent_dim = latent_dim - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.config = model.config - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - torch.nn.Linear(latent_dim, n_out), - ) - self.model = model - - def forward(self, input_ids): - # outputs = self.model(input_ids) - if "t5" in model_name: - outputs = 
self.model(input_ids, decoder_input_ids=input_ids) - else: - outputs = self.model(input_ids) - return outputs - - -class AtomGPTPredictorHiddenFeats(torch.nn.Module): - def __init__(self, model_name=None, n_out=1, tokenizer=""): - super(AtomGPTPredictorHiddenFeats, self).__init__() - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.model = model - self.config = self.model.config - self.global_out = torch.nn.Linear(self.config.n_embd, n_out) - - def forward(self, input_ids): - outputs = self.model(input_ids) - print("outputs", outputs.keys()) - last_hidden_states = outputs.last_hidden_state - pred = self.linear(last_hidden_states[:, -1, :]) - return pred - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = 
benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - # print('Non tokenizer format') - # tokenizer = GPT2Tokenizer.from_pretrained(model_name) - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - # model.resize_token_embeddings(len(tokenizer)) - # model=ForcePredictor(model) - # model=AtomGPTPredictorHiddenFeats(model_name=model_name, tokenizer=tokenizer) - set_seed() - model = AtomGPTPredictorLMhead( - model_name=model_name, tokenizer=tokenizer, latent_dim=latent_dim - ) - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric 
targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - train_feats=[] - val_feats=[] - test_feats=[] - for i in tqdm(dft_3d): - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - feat=[]#get_alignn_feats(atoms=atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - train_feats.append(feat) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - test_feats.append(feat) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - val_feats.append(feat) - print("test_texts:", len(test_texts)) - print("test_texts:", test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) - # optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) - # optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - extra_feats=train_feats, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - extra_feats=val_feats, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - extra_feats=test_feats, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = 
DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001) - # scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - 
train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = 
batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # 
f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - # box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]] - # coords = [[0, 0, 0], [0.25, 0.2, 0.25]] - # elements = ["Si", "Si"] - # Si = Atoms(lattice_mat=box, coords=coords, elements=elements) - # tmp=atoms_describer(Si) - # print(tmp) - # import sys - # sys.exit() - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "google/flan-t5-small" - model_name = "google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name = "xlnet/xlnet-base-cased" - model_name = "afmck/testing-llama-tiny" - model_name = "EleutherAI/gpt-neo-125m" - model_name = "meta-llama/Llama-2-7b-hf" - model_name = "stas/tiny-random-llama-2" - model_name = "ahxt/llama2_xs_460M_experimental" - model_name = "google-t5/t5-small" - model_name = "openai-community/gpt2-medium" - model_name = "gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - # num_epochs=300, - # pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - # pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt6", - # batch_size=5, - max_length=512, - # max_length=256, - batch_size=16, - latent_dim=1000, - # latent_dim=1024, - num_epochs=5000, - # batch_size=16 - ) - import sys - - sys.exit() - latent_dims = [ - 128, - 256, - 512, - 800, - 1024, - 1200, - 1500, - 2048, - 2500, - 3000, - 3500, - 4000, - ] - for i in latent_dims: - prefix = "lat_lat_" + str(i) - print(prefix) - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - prefix=prefix, - batch_size=16, - latent_dim=i, - num_epochs=150, - ) - 
max_lengths = [128, 256, 512, 640, 768, 896, 1000] - for i in max_lengths: - prefix = "max_lengt_" + str(i) - print(prefix) - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - prefix=prefix, - batch_size=16, - max_length=i, - num_epochs=150, - ) diff --git a/atomgpt/scripts/finetune7a.py.bak b/atomgpt/scripts/finetune7a.py.bak deleted file mode 100644 index d59d880..0000000 --- a/atomgpt/scripts/finetune7a.py.bak +++ /dev/null @@ -1,815 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from describe import atoms_describer -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict - -from tqdm import tqdm -import time -import json -import zipfile -from transformers import GPT2Config, GPT2Model, GPT2Tokenizer - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def atoms_describer( - atoms=[], - xrd_peaks=5, - xrd_round=1, - cutoff=4, - take_n_bomds=2, - include_spg=True, -): - """Describe an atomic 
structure.""" - if include_spg: - spg = Spacegroup3D(atoms) - theta, d_hkls, intens = XRD().simulate(atoms=(atoms)) - # x = atoms.atomwise_angle_and_radial_distribution() - # bond_distances = {} - # for i, j in x[-1]["different_bond"].items(): - # bond_distances[i.replace("_", "-")] = ", ".join( - # map(str, (sorted(list(set([round(jj, 2) for jj in j]))))) - # ) - dists = defaultdict(list) - elements = atoms.elements - for i in atoms.get_all_neighbors(r=cutoff): - for j in i: - key = "-".join(sorted([elements[j[0]], elements[j[1]]])) - dists[key].append(j[2]) - bond_distances = {} - for i, j in dists.items(): - dist = sorted(set([round(k, 2) for k in j])) - if len(dist) >= take_n_bomds: - dist = dist[0:take_n_bomds] - bond_distances[i] = ", ".join(map(str, dist)) - fracs = {} - for i, j in (atoms.composition.atomic_fraction).items(): - fracs[i] = round(j, 3) - info = {} - chem_info = { - "atomic_formula": atoms.composition.reduced_formula, - "prototype": atoms.composition.prototype, - "molecular_weight": round(atoms.composition.weight / 2, 2), - "atomic_fraction": (fracs), - "atomic_X": ", ".join( - map(str, [Specie(s).X for s in atoms.uniq_species]) - ), - "atomic_Z": ", ".join( - map(str, [Specie(s).Z for s in atoms.uniq_species]) - ), - } - struct_info = { - "lattice_parameters": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.abc]) - ), - "lattice_angles": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.angles]) - ), - # "spg_number": spg.space_group_number, - # "spg_symbol": spg.space_group_symbol, - "top_k_xrd_peaks": ", ".join( - map( - str, - sorted(list(set([round(i, xrd_round) for i in theta])))[ - 0:xrd_peaks - ], - ) - ), - "density": round(atoms.density, 3), - # "crystal_system": spg.crystal_system, - # "point_group": spg.point_group_symbol, - # "wyckoff": ", ".join(list(set(spg._dataset["wyckoffs"]))), - "bond_distances": bond_distances, - # "natoms_primitive": spg.primitive_atoms.num_atoms, - # "natoms_conventional": 
spg.conventional_standard_structure.num_atoms, - } - if include_spg: - struct_info["spg_number"] = spg.space_group_number - struct_info["spg_symbol"] = spg.space_group_symbol - struct_info["crystal_system"] = spg.crystal_system - struct_info["point_group"] = spg.point_group_symbol - struct_info["wyckoff"] = ", ".join(list(set(spg._dataset["wyckoffs"]))) - struct_info["natoms_primitive"] = spg.primitive_atoms.num_atoms - struct_info[ - "natoms_conventional" - ] = spg.conventional_standard_structure.num_atoms - info["chemical_info"] = chem_info - info["structure_info"] = struct_info - line = "The number of atoms are: " + str( - atoms.num_atoms - ) # +"., The elements are: "+",".join(atoms.elements)+". " - for i, j in info.items(): - if not isinstance(j, dict): - line += "The " + i + " is " + j + ". " - else: - # print("i",i) - # print("j",j) - for ii, jj in j.items(): - tmp = "" - if isinstance(jj, dict): - for iii, jjj in jj.items(): - tmp += iii + ": " + str(jjj) + " " - else: - tmp = jj - line += "The " + ii + " is " + str(tmp) + ". 
" - return line - - -def set_seed(): - os.environ["WANDB_ANONYMOUS"] = "must" - random_seed = 42 - random.seed(random_seed) - torch.manual_seed(random_seed) - np.random.seed(random_seed) - torch.cuda.manual_seed_all(random_seed) - try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) - except ImportError: - pass - torch.backends.cudnn.deterministic = True - torch.backends.cudnn.benchmark = False - os.environ["PYTHONHASHSEED"] = str(random_seed) - os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - torch.use_deterministic_algorithms(True) - - -def get_crystal_string_1225(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.1f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.2f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - extra = str(atoms.num_atoms) + "\n" + atoms.composition.reduced_formula - # crystal_str+=" "+extra - extra += "\n" + crystal_str - return extra - # return crystal_str - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - + " ".join([str(int(x)) for x in angles]) - + "@\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) + "&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if 
not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -class ForcePredictor(torch.nn.Module): - config = GPT2Config.from_pretrained("gpt2") - - def __init__(self, gpt2_model): - super(ForcePredictor, self).__init__() - self.gpt2 = gpt2_model - self.linear = torch.nn.Linear( - config.n_embd, 1 - ) # Assuming force is a 3D vector - - def forward(self, input_ids): - outputs = self.gpt2(input_ids) - last_hidden_states = outputs.last_hidden_state - force_pred = self.linear(last_hidden_states[:, -1, :]) - # print("force_pred",outputs.keys()) - return force_pred - - -class AtomGPTPredictorLMhead(torch.nn.Module): - def __init__( - self, model_name=None, n_out=1, latent_dim=1024, tokenizer="" - ): - - super(AtomGPTPredictorLMhead, self).__init__() - #random_seed = 42 - #random.seed(random_seed) - #torch.manual_seed(random_seed) - #np.random.seed(random_seed) - #torch.cuda.manual_seed_all(random_seed) - #try: - # import torch_xla.core.xla_model as xm - # xm.set_rng_state(random_seed) - #except ImportError: - # pass - #torch.backends.cudnn.deterministic = True - #torch.backends.cudnn.benchmark = False - #os.environ["PYTHONHASHSEED"] = str(random_seed) - #os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - #torch.use_deterministic_algorithms(True) - - self.model_name = model_name - self.n_out = n_out - self.latent_dim = latent_dim - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - 
# load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.config = model.config - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - torch.nn.Linear(latent_dim, n_out), - ) - self.model = model - - def forward(self, input_ids): - # outputs = self.model(input_ids) - if "t5" in model_name: - outputs = self.model(input_ids, decoder_input_ids=input_ids) - else: - outputs = self.model(input_ids) - return outputs - - -class AtomGPTPredictorHiddenFeats(torch.nn.Module): - def __init__(self, model_name=None, n_out=1, tokenizer=""): - super(AtomGPTPredictorHiddenFeats, self).__init__() - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.model = model - self.config = self.model.config - self.global_out = torch.nn.Linear(self.config.n_embd, n_out) - - def forward(self, input_ids): - outputs = self.model(input_ids) - print("outputs", outputs.keys()) - last_hidden_states = outputs.last_hidden_state - pred = self.linear(last_hidden_states[:, -1, :]) - return pred - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = 
data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - # print('Non tokenizer format') - # tokenizer = GPT2Tokenizer.from_pretrained(model_name) - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - # model.resize_token_embeddings(len(tokenizer)) - # model=ForcePredictor(model) 
- # model=AtomGPTPredictorHiddenFeats(model_name=model_name, tokenizer=tokenizer) - set_seed() - model = AtomGPTPredictorLMhead(model_name=model_name, tokenizer=tokenizer,latent_dim=latent_dim) - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in tqdm(dft_3d): - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - print("test_texts:", len(test_texts)) - print("test_texts:", test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) - #optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) - #optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = 
DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001) - # scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / 
len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - 
predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), 
jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - # box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]] - # coords = [[0, 0, 0], [0.25, 0.2, 0.25]] - # elements = ["Si", "Si"] - # Si = Atoms(lattice_mat=box, coords=coords, elements=elements) - # tmp=atoms_describer(Si) - # print(tmp) - # import sys - # sys.exit() - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "google/flan-t5-small" - model_name = "google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name = "xlnet/xlnet-base-cased" - model_name = "afmck/testing-llama-tiny" - model_name = "EleutherAI/gpt-neo-125m" - model_name = "meta-llama/Llama-2-7b-hf" - model_name = "stas/tiny-random-llama-2" - model_name = "ahxt/llama2_xs_460M_experimental" - model_name = "google-t5/t5-small" - model_name = "openai-community/gpt2-medium" - model_name = "gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - # num_epochs=300, - # pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - # pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt6", - # batch_size=5, - batch_size=16, - latent_dim=1024, - num_epochs=5000, - # batch_size=16 - ) - import sys - sys.exit() - latent_dims=[128,256,512,800,1024,1200,1500,2048,2500,3000,3500,4000] - for i in latent_dims: - prefix='lat_lat_'+str(i) - print(prefix) - run_atomgpt(model_name=model_name,benchmark_file=benchmark_file,prefix=prefix,batch_size=16,latent_dim=i,num_epochs=150) - diff --git a/atomgpt/scripts/finetune7alignn.py b/atomgpt/scripts/finetune7alignn.py deleted file mode 100644 index 9e02bb0..0000000 --- a/atomgpt/scripts/finetune7alignn.py +++ /dev/null @@ -1,907 +0,0 
@@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import loadjson,dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from describe import atoms_describer -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict - -from tqdm import tqdm -import time -import json -import zipfile -from transformers import GPT2Config, GPT2Model, GPT2Tokenizer -from jarvis.core.atoms import Atoms - -#from alignn.graphs import Graph - -#from alignn.pretrained import get_figshare_model - -import torch - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") - - -#model_alignn = get_figshare_model() -#model_alignn.to(device) - - - -def get_val(model, g, lg): - activation = {} - def getActivation(name): - # the hook signature - def hook(model, input, output): - activation[name] = output.detach() - return hook - h = model.readout.register_forward_hook(getActivation("readout")) - out = model([g, lg]) - h.remove() - return activation["readout"][0] - -df_afeats=pd.DataFrame(loadjson("feats.json")) - -def get_alignn_feats(jid=''): - return df_afeats[df_afeats['id']==jid]['pred'].values[0] - -def get_alignn_feats_1(model_alignn='',atoms=''): - g, lg = Graph.atom_dgl_multigraph(atoms) - x = get_val(model_alignn, g.to(device), lg.to(device)) - return x - -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - 
#default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def atoms_describer( - atoms=[], - xrd_peaks=5, - xrd_round=1, - cutoff=4, - take_n_bomds=2, - include_spg=True, -): - """Describe an atomic structure.""" - if include_spg: - spg = Spacegroup3D(atoms) - theta, d_hkls, intens = XRD().simulate(atoms=(atoms)) - # x = atoms.atomwise_angle_and_radial_distribution() - # bond_distances = {} - # for i, j in x[-1]["different_bond"].items(): - # bond_distances[i.replace("_", "-")] = ", ".join( - # map(str, (sorted(list(set([round(jj, 2) for jj in j]))))) - # ) - dists = defaultdict(list) - elements = atoms.elements - for i in atoms.get_all_neighbors(r=cutoff): - for j in i: - key = "-".join(sorted([elements[j[0]], elements[j[1]]])) - dists[key].append(j[2]) - bond_distances = {} - for i, j in dists.items(): - dist = sorted(set([round(k, 2) for k in j])) - if len(dist) >= take_n_bomds: - dist = dist[0:take_n_bomds] - bond_distances[i] = ", ".join(map(str, dist)) - fracs = {} - for i, j in (atoms.composition.atomic_fraction).items(): - fracs[i] = round(j, 3) - info = {} - chem_info = { - "atomic_formula": atoms.composition.reduced_formula, - "prototype": atoms.composition.prototype, - "molecular_weight": round(atoms.composition.weight / 2, 2), - "atomic_fraction": (fracs), - "atomic_X": ", ".join( - map(str, [Specie(s).X for s in atoms.uniq_species]) - ), - "atomic_Z": ", ".join( - map(str, [Specie(s).Z for s in atoms.uniq_species]) - ), - } - struct_info = { - "lattice_parameters": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.abc]) - ), - "lattice_angles": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.angles]) - ), - # "spg_number": spg.space_group_number, - # "spg_symbol": 
spg.space_group_symbol, - "top_k_xrd_peaks": ", ".join( - map( - str, - sorted(list(set([round(i, xrd_round) for i in theta])))[ - 0:xrd_peaks - ], - ) - ), - "density": round(atoms.density, 3), - # "crystal_system": spg.crystal_system, - # "point_group": spg.point_group_symbol, - # "wyckoff": ", ".join(list(set(spg._dataset["wyckoffs"]))), - "bond_distances": bond_distances, - # "natoms_primitive": spg.primitive_atoms.num_atoms, - # "natoms_conventional": spg.conventional_standard_structure.num_atoms, - } - if include_spg: - struct_info["spg_number"] = spg.space_group_number - struct_info["spg_symbol"] = spg.space_group_symbol - struct_info["crystal_system"] = spg.crystal_system - struct_info["point_group"] = spg.point_group_symbol - struct_info["wyckoff"] = ", ".join(list(set(spg._dataset["wyckoffs"]))) - struct_info["natoms_primitive"] = spg.primitive_atoms.num_atoms - struct_info[ - "natoms_conventional" - ] = spg.conventional_standard_structure.num_atoms - info["chemical_info"] = chem_info - info["structure_info"] = struct_info - line = "The number of atoms are: " + str( - atoms.num_atoms - ) # +"., The elements are: "+",".join(atoms.elements)+". " - for i, j in info.items(): - if not isinstance(j, dict): - line += "The " + i + " is " + j + ". " - else: - # print("i",i) - # print("j",j) - for ii, jj in j.items(): - tmp = "" - if isinstance(jj, dict): - for iii, jjj in jj.items(): - tmp += iii + ": " + str(jjj) + " " - else: - tmp = jj - line += "The " + ii + " is " + str(tmp) + ". 
" - return line - - -def set_seed(): - os.environ["WANDB_ANONYMOUS"] = "must" - random_seed = 42 - random.seed(random_seed) - torch.manual_seed(random_seed) - np.random.seed(random_seed) - torch.cuda.manual_seed_all(random_seed) - try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) - except ImportError: - pass - torch.backends.cudnn.deterministic = True - torch.backends.cudnn.benchmark = False - os.environ["PYTHONHASHSEED"] = str(random_seed) - os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - torch.use_deterministic_algorithms(True) - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - + " ".join([str(int(x)) for x in angles]) - + "@\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) + "&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -class AtomGPTDataset(Dataset): - def __init__( - self, - texts=[], - targets=[], - ids=[], - extra_feats=[], - tokenizer="", - max_length=128, - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - self.extra_feats = extra_feats - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def __len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - )['input_ids'].squeeze() - feats=torch.empty(1) - if self.extra_feats: - #print('fts',self.extra_feats[idx]) - feats = torch.tensor(np.array(self.extra_feats[idx]),dtype=torch.float32) - #print('feats',feats.shape) - #print('inputs',inputs,type(inputs)) - #inputs = 
torch.cat((inputs, feats),0) - return ( - inputs, - self.ids[idx], - feats, - torch.tensor(self.targets[idx], dtype=torch.float32), - - ) - - -class ForcePredictor(torch.nn.Module): - config = GPT2Config.from_pretrained("gpt2") - - def __init__(self, gpt2_model): - super(ForcePredictor, self).__init__() - self.gpt2 = gpt2_model - self.linear = torch.nn.Linear( - config.n_embd, 1 - ) # Assuming force is a 3D vector - - def forward(self, input_ids): - outputs = self.gpt2(input_ids) - last_hidden_states = outputs.last_hidden_state - force_pred = self.linear(last_hidden_states[:, -1, :]) - # print("force_pred",outputs.keys()) - return force_pred - - -class AtomGPTPredictorLMhead(torch.nn.Module): - def __init__( - self, model_name=None, n_out=1, latent_dim=1024,n_feats=0, tokenizer="" - ): - - super(AtomGPTPredictorLMhead, self).__init__() - # random_seed = 42 - # random.seed(random_seed) - # torch.manual_seed(random_seed) - # np.random.seed(random_seed) - # torch.cuda.manual_seed_all(random_seed) - # try: - # import torch_xla.core.xla_model as xm - # xm.set_rng_state(random_seed) - # except ImportError: - # pass - # torch.backends.cudnn.deterministic = True - # torch.backends.cudnn.benchmark = False - # os.environ["PYTHONHASHSEED"] = str(random_seed) - # os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - # torch.use_deterministic_algorithms(True) - - self.model_name = model_name - self.n_out = n_out - self.latent_dim = latent_dim - self.n_feats=n_feats - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.config = model.config - model.lm_head = 
torch.nn.Linear(model.config.hidden_size, latent_dim) - self.out = torch.nn.Linear(latent_dim, n_out) - if self.n_feats>0: - self.feature_layer = torch.nn.Linear(self.n_feats, self.latent_dim) - self.model = model - - def forward(self, input_ids,feats=[]): - # outputs = self.model(input_ids) - if "t5" in model_name: - outputs = self.model(input_ids, decoder_input_ids=input_ids) - else: - outputs = self.model(input_ids) - if self.n_feats>0: - feature_embedding = self.feature_layer(feats) - outputs+=feature_embedding - out=self.out(outputs,self.n_out) - return out - - -class AtomGPTPredictorHiddenFeats(torch.nn.Module): - def __init__(self, model_name=None, n_out=1, tokenizer=""): - super(AtomGPTPredictorHiddenFeats, self).__init__() - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.model = model - self.config = self.model.config - self.global_out = torch.nn.Linear(self.config.n_embd, n_out) - - def forward(self, input_ids): - outputs = self.model(input_ids) - print("outputs", outputs.keys()) - last_hidden_states = outputs.last_hidden_state - pred = self.linear(last_hidden_states[:, -1, :]) - return pred - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # 
dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" - # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - # print('Non tokenizer format') - # tokenizer = GPT2Tokenizer.from_pretrained(model_name) - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - # model.resize_token_embeddings(len(tokenizer)) - # 
model=ForcePredictor(model) - # model=AtomGPTPredictorHiddenFeats(model_name=model_name, tokenizer=tokenizer) - set_seed() - model = AtomGPTPredictorLMhead( - model_name=model_name, tokenizer=tokenizer, n_feats=256,latent_dim=latent_dim - ) - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - train_feats=[] - val_feats=[] - test_feats=[] - for i in tqdm(dft_3d): - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - feat=get_alignn_feats(i[id_tag]) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - train_feats.append(feat) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - test_feats.append(feat) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - val_feats.append(feat) - print("test_texts:", len(test_texts)) - print("test_texts:", test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) - # optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) - # optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - extra_feats=train_feats, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - extra_feats=val_feats, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - extra_feats=test_feats, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = 
DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001) - # scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0].squeeze() # .squeeze(0) - #input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - feats=batch[2] - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - feats=feats.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - feats=feats.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[-1].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # 
print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - #input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - input_ids = batch[0].squeeze() # .squeeze(0) - feats=batch[2] - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - feats=feats.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - feats=feats.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[-1].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - #input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - input_ids = batch[0].squeeze() # .squeeze(0) - feats=batch[2] - if "t5" 
in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - feats=feats.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - feats=feats.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[2] - targets = batch[-1].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - #input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - input_ids = batch[0].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - feats=feats.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device),feats=feats.to(device)).logits.squeeze().mean(dim=-1) - ) - 
ids = batch[1] - targets = batch[-1].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - # box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]] - # coords = [[0, 0, 0], [0.25, 0.2, 0.25]] - # elements = ["Si", "Si"] - # Si = Atoms(lattice_mat=box, coords=coords, elements=elements) - # tmp=atoms_describer(Si) - # print(tmp) - # import sys - # sys.exit() - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "google/flan-t5-small" - model_name = "google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name = "xlnet/xlnet-base-cased" - model_name = "afmck/testing-llama-tiny" - model_name = "EleutherAI/gpt-neo-125m" - model_name = "meta-llama/Llama-2-7b-hf" - model_name = "stas/tiny-random-llama-2" - model_name = "ahxt/llama2_xs_460M_experimental" - model_name = "google-t5/t5-small" - model_name = "openai-community/gpt2-medium" - model_name = "gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - # num_epochs=300, - # pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - # pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt6", - # batch_size=5, - max_length=512, - # max_length=256, - batch_size=16, - 
latent_dim=1000, - # latent_dim=1024, - num_epochs=5000, - # batch_size=16 - ) - import sys - - sys.exit() - latent_dims = [ - 128, - 256, - 512, - 800, - 1024, - 1200, - 1500, - 2048, - 2500, - 3000, - 3500, - 4000, - ] - for i in latent_dims: - prefix = "lat_lat_" + str(i) - print(prefix) - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - prefix=prefix, - batch_size=16, - latent_dim=i, - num_epochs=150, - ) - max_lengths = [128, 256, 512, 640, 768, 896, 1000] - for i in max_lengths: - prefix = "max_lengt_" + str(i) - print(prefix) - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - prefix=prefix, - batch_size=16, - max_length=i, - num_epochs=150, - ) diff --git a/atomgpt/scripts/finetune7b.py b/atomgpt/scripts/finetune7b.py deleted file mode 100644 index d43c4cf..0000000 --- a/atomgpt/scripts/finetune7b.py +++ /dev/null @@ -1,812 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import dumpjson -import sys -import argparse -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error - -# from describe import atoms_describer -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict - -from tqdm import tqdm -import time -import json -import zipfile -from transformers import GPT2Config, GPT2Model, GPT2Tokenizer - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--benchmark_file", - 
default="AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip", - # default="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae", - help="Benchmarks available in jarvis_leaderboard/benchmarks/*/*.zip", -) - - -def atoms_describer( - atoms=[], - xrd_peaks=5, - xrd_round=1, - cutoff=4, - take_n_bomds=2, - include_spg=True, -): - """Describe an atomic structure.""" - if include_spg: - spg = Spacegroup3D(atoms) - theta, d_hkls, intens = XRD().simulate(atoms=(atoms)) - # x = atoms.atomwise_angle_and_radial_distribution() - # bond_distances = {} - # for i, j in x[-1]["different_bond"].items(): - # bond_distances[i.replace("_", "-")] = ", ".join( - # map(str, (sorted(list(set([round(jj, 2) for jj in j]))))) - # ) - dists = defaultdict(list) - elements = atoms.elements - for i in atoms.get_all_neighbors(r=cutoff): - for j in i: - key = "-".join(sorted([elements[j[0]], elements[j[1]]])) - dists[key].append(j[2]) - bond_distances = {} - for i, j in dists.items(): - dist = sorted(set([round(k, 2) for k in j])) - if len(dist) >= take_n_bomds: - dist = dist[0:take_n_bomds] - bond_distances[i] = ", ".join(map(str, dist)) - fracs = {} - for i, j in (atoms.composition.atomic_fraction).items(): - fracs[i] = round(j, 3) - info = {} - chem_info = { - "atomic_formula": atoms.composition.reduced_formula, - "prototype": atoms.composition.prototype, - "molecular_weight": round(atoms.composition.weight / 2, 2), - "atomic_fraction": (fracs), - "atomic_X": ", ".join( - map(str, [Specie(s).X for s in atoms.uniq_species]) - ), - "atomic_Z": ", ".join( - map(str, [Specie(s).Z for s in atoms.uniq_species]) - ), - } - struct_info = { - "lattice_parameters": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.abc]) - ), - "lattice_angles": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.angles]) - ), - # "spg_number": spg.space_group_number, - # "spg_symbol": 
spg.space_group_symbol, - "top_k_xrd_peaks": ", ".join( - map( - str, - sorted(list(set([round(i, xrd_round) for i in theta])))[ - 0:xrd_peaks - ], - ) - ), - "density": round(atoms.density, 3), - # "crystal_system": spg.crystal_system, - # "point_group": spg.point_group_symbol, - # "wyckoff": ", ".join(list(set(spg._dataset["wyckoffs"]))), - "bond_distances": bond_distances, - # "natoms_primitive": spg.primitive_atoms.num_atoms, - # "natoms_conventional": spg.conventional_standard_structure.num_atoms, - } - if include_spg: - struct_info["spg_number"] = spg.space_group_number - struct_info["spg_symbol"] = spg.space_group_symbol - struct_info["crystal_system"] = spg.crystal_system - struct_info["point_group"] = spg.point_group_symbol - struct_info["wyckoff"] = ", ".join(list(set(spg._dataset["wyckoffs"]))) - struct_info["natoms_primitive"] = spg.primitive_atoms.num_atoms - struct_info[ - "natoms_conventional" - ] = spg.conventional_standard_structure.num_atoms - info["chemical_info"] = chem_info - info["structure_info"] = struct_info - line = "The number of atoms are: " + str( - atoms.num_atoms - ) # +"., The elements are: "+",".join(atoms.elements)+". " - for i, j in info.items(): - if not isinstance(j, dict): - line += "The " + i + " is " + j + ". " - else: - # print("i",i) - # print("j",j) - for ii, jj in j.items(): - tmp = "" - if isinstance(jj, dict): - for iii, jjj in jj.items(): - tmp += iii + ": " + str(jjj) + " " - else: - tmp = jj - line += "The " + ii + " is " + str(tmp) + ". 
" - return line - - -def set_seed(): - os.environ["WANDB_ANONYMOUS"] = "must" - random_seed = 42 - random.seed(random_seed) - torch.manual_seed(random_seed) - np.random.seed(random_seed) - torch.cuda.manual_seed_all(random_seed) - try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) - except ImportError: - pass - torch.backends.cudnn.deterministic = True - torch.backends.cudnn.benchmark = False - os.environ["PYTHONHASHSEED"] = str(random_seed) - os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - torch.use_deterministic_algorithms(True) - - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - -def get_crystal_string_t1(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "#\n" - + " ".join([str(int(x)) for x in angles]) - + "@\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) + "&" - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def 
__len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -class ForcePredictor(torch.nn.Module): - config = GPT2Config.from_pretrained("gpt2") - - def __init__(self, gpt2_model): - super(ForcePredictor, self).__init__() - self.gpt2 = gpt2_model - self.linear = torch.nn.Linear( - config.n_embd, 1 - ) # Assuming force is a 3D vector - - def forward(self, input_ids): - outputs = self.gpt2(input_ids) - last_hidden_states = outputs.last_hidden_state - force_pred = self.linear(last_hidden_states[:, -1, :]) - # print("force_pred",outputs.keys()) - return force_pred - - -class AtomGPTPredictorLMhead(torch.nn.Module): - def __init__( - self, model_name=None, n_out=1, latent_dim=1024, tokenizer="" - ): - - super(AtomGPTPredictorLMhead, self).__init__() - #random_seed = 42 - #random.seed(random_seed) - #torch.manual_seed(random_seed) - #np.random.seed(random_seed) - #torch.cuda.manual_seed_all(random_seed) - #try: - # import torch_xla.core.xla_model as xm - # xm.set_rng_state(random_seed) - #except ImportError: - # pass - #torch.backends.cudnn.deterministic = True - #torch.backends.cudnn.benchmark = False - #os.environ["PYTHONHASHSEED"] = str(random_seed) - #os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - #torch.use_deterministic_algorithms(True) - - self.model_name = model_name - self.n_out = n_out - self.latent_dim = latent_dim - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # 
device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.config = model.config - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - torch.nn.Linear(latent_dim, n_out), - ) - self.model = model - - def forward(self, input_ids): - # outputs = self.model(input_ids) - if "t5" in model_name: - outputs = self.model(input_ids, decoder_input_ids=input_ids) - else: - outputs = self.model(input_ids) - return outputs - - -class AtomGPTPredictorHiddenFeats(torch.nn.Module): - def __init__(self, model_name=None, n_out=1, tokenizer=""): - super(AtomGPTPredictorHiddenFeats, self).__init__() - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - model.resize_token_embeddings(len(tokenizer)) - self.model = model - self.config = self.model.config - self.global_out = torch.nn.Linear(self.config.n_embd, n_out) - - def forward(self, input_ids): - outputs = self.model(input_ids) - print("outputs", outputs.keys()) - last_hidden_states = outputs.last_hidden_state - pred = self.linear(last_hidden_states[:, -1, :]) - return pred - - -def run_atomgpt( - prefix="ss", - model_name="gpt2", - benchmark_file="AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip", - root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - batch_size=8, - max_length=512, - num_epochs=500, - latent_dim=512, - learning_rate=1e-3, - # learning_rate=1e-3, - test_each_run=True, - # learning_rate=5e-5, - pretrained_path="", -): - # Load pre-trained tokenizer - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # model = GPT2LMHeadModel.from_pretrained("gpt2") - - # dft_3d = data("dft_3d") - # root_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" 
- # benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip" - # benchmark_file = "AI-SinglePropertyPrediction-optb88vdw_bandgap-dft_3d-test-mae.csv.zip" - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop + ".json.zip" - temp2 = dataset + "_" + prop + ".json" - fname = os.path.join(root_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - - # train_atoms = [] - # val_atoms = [] - # test_atoms = [] - # train_targets = [] - # val_targets = [] - # test_targets = [] - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - # print('Non tokenizer format') - # tokenizer = GPT2Tokenizer.from_pretrained(model_name) - # config = GPT2Config.from_pretrained("gpt2") - # model = GPT2Model(config) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - # model.resize_token_embeddings(len(tokenizer)) - # model=ForcePredictor(model) - # model=AtomGPTPredictorHiddenFeats(model_name=model_name, tokenizer=tokenizer) - 
set_seed() - model = AtomGPTPredictorLMhead(model_name=model_name, tokenizer=tokenizer,latent_dim=latent_dim) - # tokenizer = GPT2Tokenizer.from_pretrained("gpt2") - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - - for i in tqdm(dft_3d): - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - tmp = get_crystal_string_t(atoms) - if i[id_tag] in train_ids: - train_texts.append(tmp) - train_targets.append(i[prop]) - train_ids_temp.append(i[id_tag]) - elif i[id_tag] in test_ids: - test_texts.append(tmp) - test_targets.append(i[prop]) - test_ids_temp.append(i[id_tag]) - elif i[id_tag] in val_ids: - val_texts.append(tmp) - val_targets.append(i[prop]) - val_ids_temp.append(i[id_tag]) - print("test_texts:", len(test_texts)) - print("test_texts:", test_texts[0]) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) - #optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) - #optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - - # val_dataset = train_dataset - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = 
DataLoader(test_dataset, batch_size=batch_size) - val_dataloader = test_dataloader - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001) - # scheduler = torch.optim.lr_scheduler.StepLR( - # optimizer, - # step_size=30, - # ) - print("train_data", len(train_texts)) - print("test_data", len(test_texts)) - output_dir = prefix + "_out_" + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - # if 't5' in model_name: - # decoder_input_ids = tokenizer("", return_tensors="pt").input_ids.to(device) - # decoder_input_ids = model._shift_right(decoder_input_ids) - # predictions = ( - # model(input_ids = input_ids.to(device),decoder_input_ids=decoder_input_ids).logits.squeeze().mean(dim=-1) - # ) - # else: - # predictions = ( - # model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - # ) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / 
len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - 
predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - # decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), 
jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - # box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]] - # coords = [[0, 0, 0], [0.25, 0.2, 0.25]] - # elements = ["Si", "Si"] - # Si = Atoms(lattice_mat=box, coords=coords, elements=elements) - # tmp=atoms_describer(Si) - # print(tmp) - # import sys - # sys.exit() - args = parser.parse_args(sys.argv[1:]) - benchmark_file = args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "google/flan-t5-small" - model_name = "google/flan-t5-base" - model_name = "mistralai/Mistral-7B-Instruct-v0.1" - model_name = "xlnet/xlnet-base-cased" - model_name = "afmck/testing-llama-tiny" - model_name = "EleutherAI/gpt-neo-125m" - model_name = "meta-llama/Llama-2-7b-hf" - model_name = "stas/tiny-random-llama-2" - model_name = "ahxt/llama2_xs_460M_experimental" - model_name = "google-t5/t5-small" - model_name = "openai-community/gpt2-medium" - model_name = "gpt2" - run_atomgpt( - model_name=model_name, - benchmark_file=benchmark_file, - # num_epochs=300, - # pretrained_path="xyz_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - # pretrained_path="ss_out_google/flan-t5-small_tinnet_N_ead/best_model.pt", - prefix="xyzt6", - # batch_size=5, - batch_size=16, - latent_dim=1024, - num_epochs=5000, - # batch_size=16 - ) - import sys - sys.exit() - latent_dims=[128,256,512,800,1024,1200,1500,2048,2500,3000,3500,4000] - for i in latent_dims: - prefix='lat_lat_'+str(i) - print(prefix) - run_atomgpt(model_name=model_name,benchmark_file=benchmark_file,prefix=prefix,batch_size=16,latent_dim=i,num_epochs=150) - diff --git a/atomgpt/scripts/gp2atom_km.py b/atomgpt/scripts/gp2atom_km.py deleted file mode 100644 index 0bee973..0000000 --- a/atomgpt/scripts/gp2atom_km.py +++ /dev/null @@ -1,171 +0,0 @@ -from typing 
import * -import pandas as pd -from transformers import GPT2Config, GPT2ForSequenceClassification, GPT2TokenizerFast, TrainingArguments, Trainer -from sklearn.metrics import mean_absolute_error, mean_squared_error -from jarvis.core.atoms import Atoms -from jarvis.io.vasp.inputs import Poscar -from jarvis.db.figshare import data -import torch,json -import numpy as np -from describe import atoms_describer -from sklearn.model_selection import train_test_split -import ast -import os -# torch.cuda.is_available = lambda : False -import argparse -from multiprocessing import Pool -from tqdm import tqdm - -# if output dir not exists, create it -if not os.path.exists('output'): - os.mkdir('output') - -parser = argparse.ArgumentParser() -parser.add_argument('--prop', type=str, required=True) -parser.add_argument('--modelname', type=str, default='gpt2') -parser.add_argument('--random_state', type=int, default=0) -parser.add_argument('--dataset_name', type=str, default='dft_3d_2021') - -args = parser.parse_args() - -prop = args.prop -modelname = args.modelname -random_state = args.random_state -dataset_name = args.dataset_name - -output_dir=f'output/{modelname}_{dataset_name}_{prop}' - -print('imports done') -print('torch.cuda.is_available',torch.cuda.is_available()) - -#%% -def process_data(i): - atoms = i['atoms'] - lattice_mat = np.round(np.array(atoms['lattice_mat']), decimals=4) - coords = np.round(np.array(atoms['coords']), decimals=4) - i['atoms'] = Atoms(lattice_mat=lattice_mat, elements=atoms['elements'], coords=coords, cartesian=atoms['cartesian']) - i['atoms'] = json.dumps(atoms_describer(i['atoms'])) - return i - -print('prop',prop,flush=True) - -# prop = 'exfoliation_energy' -df_csv = f'{dataset_name}_described.csv' -if os.path.exists(df_csv): - df = pd.read_csv(df_csv)[['atoms',prop]] -else: - dat = data(dataset_name) - - pool = Pool() - dd = [] - for result in tqdm(pool.imap(process_data, dat), total=len(dat)): - dd.append(result) - - df = pd.DataFrame(dd) - # df 
= df.set_index(df.columns[0]) - df = df.replace('na', '') - df = df.replace('',None) - df.to_csv(df_csv) - -# replace all values of "na" with numpy nan -df = df.dropna(subset=[prop]) - -# random split into train and test -train_dd, test_dd = train_test_split(df, test_size=0.2, random_state=random_state) -train_ids, test_ids = train_dd.index, test_dd.index -n_train, n_test = len(train_dd), len(test_dd) -print(n_train, n_test) - -# use the 'atoms' and 'prop' column to create a dataframe with 'text' and 'label' columns -print('MAD of test set',np.abs(df.loc[test_ids,prop]-df.loc[test_ids,prop].mean()).mean()) - -text = df['atoms'] -label = df[prop].apply(lambda x: [x]) - -train_df = pd.DataFrame({'text':text.loc[train_ids],'label':label.loc[train_ids]}) -test_df = pd.DataFrame({'text':text.loc[test_ids],'label':label.loc[test_ids]}) - -print('df created') - - -config = GPT2Config.from_pretrained( - modelname, - # 'gpt2-medium', - pad_token_id=50256, # eos_token_id - num_labels=1, -) -tokenizer = GPT2TokenizerFast.from_pretrained( - config.model_type, - padding=True, - truncation=True, - pad_token_id=config.pad_token_id, - pad_token="<|endoftext|>", # eos_token -) -tokenizer.pad_token -model = GPT2ForSequenceClassification(config) - - -print('model loaded') - - -def tokenize(df: pd.DataFrame, tokenizer: GPT2TokenizerFast) -> List[Dict[str, Any]]: - tokenized_df = pd.DataFrame( - df['text'].apply(tokenizer).tolist() - ) - return ( - pd.merge( - df, - tokenized_df, - left_index=True, - right_index=True, - ) - .drop(columns="text") - .to_dict("records") - ) - -train_ds = tokenize(train_df, tokenizer) -test_ds = tokenize(test_df, tokenizer) - -print('tokenized') - -def compute_metrics(pred): - labels = pred.label_ids - predictions = pred.predictions - return { - "mae": mean_absolute_error(labels, predictions), - #"mse": mean_squared_error(labels, predictions), - } - -training_args = TrainingArguments( - report_to="none", - evaluation_strategy="steps", - max_steps=1000, - 
eval_steps=50, - # per_device_train_batch_size=16, - # per_device_eval_batch_size=128, - metric_for_best_model="mse", - greater_is_better=False, - learning_rate=5e-5, - # going to delete all of this - output_dir=output_dir, - save_strategy="no", -) - -trainer = Trainer( - model=model, - args=training_args, - train_dataset=train_ds, - eval_dataset=test_ds, - tokenizer=tokenizer, - compute_metrics=compute_metrics -) - -print('trainer loaded') - -trainer.train() -# save model -trainer.save_model(f'{output_dir}/final_{modelname}_{dataset_name}_{prop}') -# save scores -scores = trainer.evaluate() -with open(f'{output_dir}/scores_{modelname}_{dataset_name}_{prop}.json','w') as f: - json.dump(scores,f) diff --git a/atomgpt/scripts/gpt.py b/atomgpt/scripts/gpt.py deleted file mode 100644 index 87cc2a8..0000000 --- a/atomgpt/scripts/gpt.py +++ /dev/null @@ -1,93 +0,0 @@ -#mean_absolute_error: 54.10120434782608 -import json -import numpy as np -from transformers import AutoTokenizer, AutoModelForSequenceClassification -from transformers import AutoTokenizer, GPT2Model -from sklearn.linear_model import LinearRegression -from sklearn.model_selection import train_test_split -from sklearn.metrics import mean_absolute_error, mean_squared_error -import torch -from jarvis.core.atoms import Atoms -from jarvis.io.vasp.inputs import Poscar -from jarvis.db.figshare import data -from sklearn.ensemble import RandomForestRegressor -import matplotlib.pyplot as plt - -# Load JSON data -def load_data_from_json(json_path): - with open(json_path, 'r') as file: - data = json.load(file) - return data - -# Preprocess data and convert text to embeddings -tag = 'formation_energy_peratom' -def preprocess_data(dat,prop='',model='gpt2'):#, model_name): - #tokenizer = AutoTokenizer.from_pretrained(model_name) - tokenizer = AutoTokenizer.from_pretrained(model) - #model = AutoModelForSequenceClassification.from_pretrained(model_name) - model = GPT2Model.from_pretrained(model) - - embeddings = [] - 
labels=[] - print(model) - for entry in dat: - try: - text=Poscar(Atoms.from_dict(entry['atoms'])).to_string() - #text = entry['text'] - inputs = tokenizer(text, return_tensors="pt") - with torch.no_grad(): - output = model(**inputs) - #print(output.keys(),output['past_key_values']) - emb = output.last_hidden_state.mean(dim=1).numpy().flatten() - #print('emb',emb,emb.shape) - embeddings.append(emb) - labels.append(entry[prop]) - #labels.append(entry['exfoliation_energy']) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - except Exception as exp: - print ('exp',exp,text,len(text)) - pass - - embeddings = np.vstack(embeddings) - #labels = np.array([entry['exfoliation_energy'] for entry in dat]) - return embeddings, labels - -# Main function -def main(): - dat = data('dft_3d') - dd=[] - prop = 'formation_energy_peratom'#'exfoliation_energy' - #prop = 'exfoliation_energy' - for i in dat: - if i[prop]!='na': #[0:10] - dd.append(i) - #dd=dd[0:10] - print('dd',len(dd)) - X, y = preprocess_data(dd,prop=prop)#, model_name) - - # Split the data into training and testing sets - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - - # Initialize and fit a linear regression model - regression_model = RandomForestRegressor() #LinearRegression() - regression_model.fit(X_train, y_train) - - # Predict using the test set - y_pred = regression_model.predict(X_test) - - # Evaluate the model - mse = mean_squared_error(y_test, y_pred) - mae = mean_absolute_error(y_test, y_pred) - print("mean_absolute_error:", mae) - plt.plot(y_test, y_pred,'.') - plt.savefig('plot.png') - plt.close() - #print("Mean Squared Error:", mse) - -if __name__ == "__main__": - main() -#info=[{"text":"Ram is a good boy","target":1},{"text":"Ravan is bad boy","target":0}] -#embeddings, labels = preprocess_data(info,"gpt2") -#print('embeddings',embeddings,embeddings.shape) 
-#print('labels',labels,labels.shape) diff --git a/atomgpt/scripts/gpt2_describer.py b/atomgpt/scripts/gpt2_describer.py deleted file mode 100644 index 3d96aa7..0000000 --- a/atomgpt/scripts/gpt2_describer.py +++ /dev/null @@ -1,93 +0,0 @@ -#mean_absolute_error: 64.72426134969325 -import json -import numpy as np -from transformers import AutoTokenizer, AutoModelForSequenceClassification -from transformers import AutoTokenizer, GPT2Model -from sklearn.linear_model import LinearRegression -from sklearn.model_selection import train_test_split -from sklearn.metrics import mean_absolute_error, mean_squared_error -import torch -from jarvis.core.atoms import Atoms -from jarvis.io.vasp.inputs import Poscar -from jarvis.db.figshare import data -from sklearn.ensemble import RandomForestRegressor -import matplotlib.pyplot as plt -from chemnlp.utils.describe import atoms_describer -# Load JSON data -def load_data_from_json(json_path): - with open(json_path, 'r') as file: - data = json.load(file) - return data - -# Preprocess data and convert text to embeddings -tag = 'formation_energy_peratom' -def preprocess_data(dat,prop='',model='gpt2'):#, model_name): - #tokenizer = AutoTokenizer.from_pretrained(model_name) - tokenizer = AutoTokenizer.from_pretrained(model) - #model = AutoModelForSequenceClassification.from_pretrained(model_name) - model = GPT2Model.from_pretrained(model) - - embeddings = [] - labels=[] - print(model) - for entry in dat: - try: - text=json.dumps(atoms_describer(atoms=Atoms.from_dict(entry['atoms']))) #Poscar(Atoms.from_dict(entry['atoms'])).to_string() - #text = entry['text'] - inputs = tokenizer(text, return_tensors="pt") - with torch.no_grad(): - output = model(**inputs) - #print(output.keys(),output['past_key_values']) - emb = output.last_hidden_state.mean(dim=1).numpy().flatten() - #print('emb',emb,emb.shape) - embeddings.append(emb) - labels.append(entry[prop]) - #labels.append(entry['exfoliation_energy']) - 
#embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - except Exception as exp: - print ('exp',exp,text,len(text)) - pass - - embeddings = np.vstack(embeddings) - #labels = np.array([entry['exfoliation_energy'] for entry in dat]) - return embeddings, labels - -# Main function -def main(): - dat = data('dft_3d') - dd=[] - prop = 'formation_energy_peratom'#'exfoliation_energy' - prop = 'exfoliation_energy' - for i in dat: - if i[prop]!='na': #[0:10] - dd.append(i) - #dd=dd[0:10] - print('dd',len(dd)) - X, y = preprocess_data(dd,prop=prop)#, model_name) - - # Split the data into training and testing sets - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - - # Initialize and fit a linear regression model - regression_model = RandomForestRegressor() #LinearRegression() - regression_model.fit(X_train, y_train) - - # Predict using the test set - y_pred = regression_model.predict(X_test) - - # Evaluate the model - mse = mean_squared_error(y_test, y_pred) - mae = mean_absolute_error(y_test, y_pred) - print("mean_absolute_error:", mae) - plt.plot(y_test, y_pred,'.') - plt.savefig('plot.png') - plt.close() - #print("Mean Squared Error:", mse) - -if __name__ == "__main__": - main() -#info=[{"text":"Ram is a good boy","target":1},{"text":"Ravan is bad boy","target":0}] -#embeddings, labels = preprocess_data(info,"gpt2") -#print('embeddings',embeddings,embeddings.shape) -#print('labels',labels,labels.shape) diff --git a/atomgpt/scripts/gpt2_describer1.py b/atomgpt/scripts/gpt2_describer1.py deleted file mode 100644 index 502269d..0000000 --- a/atomgpt/scripts/gpt2_describer1.py +++ /dev/null @@ -1,93 +0,0 @@ -import json -import numpy as np -from transformers import AutoTokenizer, AutoModelForSequenceClassification -from transformers import AutoTokenizer, GPT2Model -from sklearn.linear_model import LinearRegression -from sklearn.model_selection import 
train_test_split -from sklearn.metrics import mean_absolute_error, mean_squared_error -import torch -from jarvis.core.atoms import Atoms -from jarvis.io.vasp.inputs import Poscar -from jarvis.db.figshare import data -from sklearn.ensemble import RandomForestRegressor -import matplotlib.pyplot as plt -from chemnlp.utils.describe import atoms_describer -# Load JSON data -def load_data_from_json(json_path): - with open(json_path, 'r') as file: - data = json.load(file) - return data - -# Preprocess data and convert text to embeddings -tag = 'formation_energy_peratom' -def preprocess_data(dat,prop='',model='gpt2'):#, model_name): - #tokenizer = AutoTokenizer.from_pretrained(model_name) - tokenizer = AutoTokenizer.from_pretrained(model) - #model = AutoModelForSequenceClassification.from_pretrained(model_name) - model = GPT2Model.from_pretrained(model) - - embeddings = [] - labels=[] - print(model) - for entry in dat: - try: - text=json.dumps(atoms_describer(atoms=Atoms.from_dict(entry['atoms']))) #Poscar(Atoms.from_dict(entry['atoms'])).to_string() - #text = entry['text'] - inputs = tokenizer(text, return_tensors="pt") - with torch.no_grad(): - output = model(**inputs) - #print(output.keys(),output['past_key_values']) - emb = output.last_hidden_state.mean(dim=1).numpy().flatten() - #print('emb',emb,emb.shape) - embeddings.append(emb) - labels.append(entry[prop]) - #labels.append(entry['exfoliation_energy']) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - except Exception as exp: - print ('exp',exp,text,len(text)) - pass - - embeddings = np.vstack(embeddings) - #labels = np.array([entry['exfoliation_energy'] for entry in dat]) - return embeddings, labels - -# Main function -def main(): - dat = data('dft_3d') - dd=[] - prop = 'formation_energy_peratom'#'exfoliation_energy' - prop = 'exfoliation_energy' - model = "databricks/dolly-v2-3b" - for i in dat: - if i[prop]!='na': #[0:10] - 
dd.append(i) - #dd=dd[0:10] - print('dd',len(dd)) - X, y = preprocess_data(dd,prop=prop,model=model)#, model_name) - - # Split the data into training and testing sets - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - - # Initialize and fit a linear regression model - regression_model = RandomForestRegressor() #LinearRegression() - regression_model.fit(X_train, y_train) - - # Predict using the test set - y_pred = regression_model.predict(X_test) - - # Evaluate the model - mse = mean_squared_error(y_test, y_pred) - mae = mean_absolute_error(y_test, y_pred) - print("mean_absolute_error:", mae) - plt.plot(y_test, y_pred,'.') - plt.savefig('plot.png') - plt.close() - #print("Mean Squared Error:", mse) - -if __name__ == "__main__": - main() -#info=[{"text":"Ram is a good boy","target":1},{"text":"Ravan is bad boy","target":0}] -#embeddings, labels = preprocess_data(info,"gpt2") -#print('embeddings',embeddings,embeddings.shape) -#print('labels',labels,labels.shape) diff --git a/atomgpt/scripts/gpt2_robo.py b/atomgpt/scripts/gpt2_robo.py deleted file mode 100644 index 04ea53a..0000000 --- a/atomgpt/scripts/gpt2_robo.py +++ /dev/null @@ -1,116 +0,0 @@ -#mean_absolute_error: 64.72426134969325 -import json -import numpy as np -from transformers import AutoTokenizer, AutoModelForSequenceClassification -from transformers import AutoTokenizer, GPT2Model -from sklearn.linear_model import LinearRegression -from sklearn.model_selection import train_test_split -from sklearn.metrics import mean_absolute_error, mean_squared_error -import torch -from jarvis.core.atoms import Atoms -from jarvis.io.vasp.inputs import Poscar -from jarvis.db.figshare import data -from sklearn.ensemble import RandomForestRegressor -import matplotlib.pyplot as plt -from chemnlp.utils.describe import atoms_describer -# Load JSON data - -from pymatgen.core.structure import Structure -from robocrys import StructureCondenser, StructureDescriber - - -def 
get_robo(structure=None): -#structure = Structure.from_file("POSCAR") # other file formats also supported - -# alternatively, uncomment the lines below to use the MPRester object -# to fetch structures from the Materials Project database -# from pymatgen import MPRester -# structure = MPRester(API_KEY=None).get_structure_by_material_id("mp-856") - - condenser = StructureCondenser() - describer = StructureDescriber() - - #condensed_structure = condenser.condense_structure(structure) - #description = describer.describe(condensed_structure) - description = describer.describe(structure) - print(description) - return description - -def load_data_from_json(json_path): - with open(json_path, 'r') as file: - data = json.load(file) - return data - -# Preprocess data and convert text to embeddings -tag = 'formation_energy_peratom' -def preprocess_data(dat,prop='',model='gpt2'):#, model_name): - #tokenizer = AutoTokenizer.from_pretrained(model_name) - tokenizer = AutoTokenizer.from_pretrained(model) - #model = AutoModelForSequenceClassification.from_pretrained(model_name) - model = GPT2Model.from_pretrained(model) - - embeddings = [] - labels=[] - print(model) - for entry in dat: - try: - text=get_robo(Atoms.from_dict(entry['atoms']).pymatgen_converter()) #Poscar(Atoms.from_dict(entry['atoms'])).to_string() - #text=json.dumps(atoms_describer(atoms=Atoms.from_dict(entry['atoms']))) #Poscar(Atoms.from_dict(entry['atoms'])).to_string() - #text = entry['text'] - inputs = tokenizer(text, return_tensors="pt") - with torch.no_grad(): - output = model(**inputs) - #print(output.keys(),output['past_key_values']) - emb = output.last_hidden_state.mean(dim=1).numpy().flatten() - #print('emb',emb,emb.shape) - embeddings.append(emb) - labels.append(entry[prop]) - #labels.append(entry['exfoliation_energy']) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - #embeddings.append(output.last_hidden_state.mean(dim=1).numpy()) - except Exception as exp: - print 
('exp',exp,text,len(text)) - pass - - embeddings = np.vstack(embeddings) - #labels = np.array([entry['exfoliation_energy'] for entry in dat]) - return embeddings, labels - -# Main function -def main(): - dat = data('dft_3d') - dd=[] - prop = 'formation_energy_peratom'#'exfoliation_energy' - prop = 'exfoliation_energy' - for i in dat: - if i[prop]!='na': #[0:10] - dd.append(i) - #dd=dd[0:10] - print('dd',len(dd)) - X, y = preprocess_data(dd,prop=prop)#, model_name) - - # Split the data into training and testing sets - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - - # Initialize and fit a linear regression model - regression_model = RandomForestRegressor() #LinearRegression() - regression_model.fit(X_train, y_train) - - # Predict using the test set - y_pred = regression_model.predict(X_test) - - # Evaluate the model - mse = mean_squared_error(y_test, y_pred) - mae = mean_absolute_error(y_test, y_pred) - print("mean_absolute_error:", mae) - plt.plot(y_test, y_pred,'.') - plt.savefig('plot.png') - plt.close() - #print("Mean Squared Error:", mse) - -if __name__ == "__main__": - main() -#info=[{"text":"Ram is a good boy","target":1},{"text":"Ravan is bad boy","target":0}] -#embeddings, labels = preprocess_data(info,"gpt2") -#print('embeddings',embeddings,embeddings.shape) -#print('labels',labels,labels.shape) diff --git a/atomgpt/scripts/gpt2atom.py b/atomgpt/scripts/gpt2atom.py deleted file mode 100644 index bdd0b34..0000000 --- a/atomgpt/scripts/gpt2atom.py +++ /dev/null @@ -1,201 +0,0 @@ -from typing import * -import pandas as pd -from transformers import GPT2Config, GPT2ForSequenceClassification, GPT2TokenizerFast, TrainingArguments, Trainer -from sklearn.metrics import mean_absolute_error, mean_squared_error -from jarvis.core.atoms import Atoms -from jarvis.io.vasp.inputs import Poscar -from jarvis.db.figshare import data -import torch,json -import numpy as np -torch.cuda.is_available = lambda : False -# Mondegreen fun 
-# 0. is the misheard version -# 1. is the real version -# regression task -dat = data('dft_3d') -dd=[] -prop = 'formation_energy_peratom'#'exfoliation_energy' -prop = 'dfpt_piezo_max_dielectric' -prop = 'exfoliation_energy' -for i in dat: - if i[prop]!='na': #[0:10] - atoms=i['atoms'] - lattice_mat = np.round(np.array(atoms['lattice_mat']),decimals=4) - coords = np.round(np.array(atoms['coords']),decimals=4) - atoms=Atoms(lattice_mat=lattice_mat,elements=atoms['elements'],coords=coords,cartesian=atoms['cartesian'],props=atoms['props']) - i['atoms']=atoms.to_dict() - dd.append(i) - #dd=dd[0:10] -#dd=dd[10:22] -n_train=int(len(dd)*.8) -n_test=len(dd)-n_train -train_dd=dd[0:n_train] -test_dd=dd[-n_test:] - -train_df = pd.DataFrame([ - {"text": "Money for nothin' and chips for free", "label": [0.]}, - {"text": "Money for nothin' and your chicks for free", "label": [1.]}, - - {"text": "Every time you go away, you take a piece of meat with you", "label": [0.]}, - {"text": "Every time you go away take a piece of me with you", "label": [1.]}, - - {"text": "Sue Lawley", "label": [0.]}, - {"text": "So lonely", "label": [1.]}, - - {"text": "We built this city on sausage rolls", "label": [0.]}, - {"text": "We built this city on rock 'n' roll", "label": [1.]}, - - {"text": "Saving his life from this warm sausage tea", "label": [0.]}, - {"text": "Spare him his life from this monstrosity", "label": [1.]}, - - {"text": "See that girl, watch her scream, kicking the dancing queen", "label": [0.]}, - {"text": "See that girl, watch that scene, dig in the dancing queen", "label": [1.]}, - - {"text": "Excuse me while I kiss this guy", "label": [0.]}, - {"text": "Excuse me while I kiss the sky", "label": [1.]}, - - {"text": "Dancing queen, feel the beat from the tangerine", "label": [0.]}, - {"text": "Dancing queen, feel the beat from the tambourine", "label": [1.]}, - - {"text": "Sweet dreams are made of cheese", "label": [0.]}, - {"text": "Sweet dreams are made of these", "label": 
[1.]}, - - {"text": "Calling Jamaica", "label": [0.]}, - {"text": "Call me when you try to wake her", "label": [1.]}, - - {"text": "Or should I just keep chasing penguins", "label": [0.]}, - {"text": "Or should I just keep chasing pavements", "label": [1.]}, - - {"text": "All the lonely Starbucks lovers", "label": [0.]}, - {"text": "Got a long list of ex-lovers", "label": [1.]}, - - {"text": "I can see clearly now, Lorraine is gone", "label": [0.]}, - {"text": "I can see clearly now, the rain is gone", "label": [1.]}, - - {"text": "Gimme Gimme Gimme a man after midnight, take me to the doctors at the break of the day", "label": [0.]}, - {"text": "Gimme Gimme Gimme a man after midnight, take me through the darkness to the break of the day", "label": [1.]}, - - {"text": "Poppadom Peach", "label": [0.]}, - {"text": "Papa don’t preach", "label": [1.]}, - - {"text": "It doesn’t make a difference if we’re naked or not", "label": [0.]}, - {"text": "It doesn’t make a difference if we make it or not", "label": [1.]}, - - {"text": "I'm farting carrots", "label": [0.]}, - {"text": "I'm 14 carat", "label": [1.]}, - - {"text": "Then I saw her face, now I'm gonna leave her", "label": [0.]}, - {"text": "Then I saw her face, now I'm a believer", "label": [1.]}, - - {"text": "I want to hold your ham", "label": [0.]}, - {"text": "I want to hold your hand", "label": [1.]}, - - {"text": "Kicking your cat all over the place", "label": [0.]}, - {"text": "Kicking your can all over the place", "label": [1.]}, -]) - - -test_df = pd.DataFrame([ - {"text": "Blue seal in the sky with diamonds", "label": [0.]}, - {"text": "Lucy in the sky with diamonds", "label": [1.]}, - - {"text": "Here we are now, in containers", "label": [0.]}, - {"text": "Here we are now, entertain us", "label": [1.]}, - - {"text": "Let's pee in the corner, let's pee in the spotlight", "label": [0.]}, - {"text": "That's me in the corner, that's me in the spotlight", "label": [1.]}, - - {"text": "I remove umbilicals", 
"label": [0.]}, - {"text": "I believe in miracles", "label": [1.]}, - - {"text": "I like big butts in a can of limes", "label": [0.]}, - {"text": "I like big butts and I cannot lie", "label": [1.]}, -]) - - -mem=[] -for i in train_dd: - info={} - text=Poscar(Atoms.from_dict(i['atoms'])).to_string() - #text=(Atoms.from_dict(i['atoms'])).composition.reduced_formula - #text=json.dumps(i['atoms']) - info['text']=text - info['label']=[i[prop]] - mem.append(info) -train_df = pd.DataFrame(mem) - -mem=[] -for i in test_dd: - info={} - text=Poscar(Atoms.from_dict(i['atoms'])).to_string() - #text=(Atoms.from_dict(i['atoms'])).composition.reduced_formula - #text=json.dumps(i['atoms']) - info['text']=text - info['label']=[i[prop]] - mem.append(info) -test_df = pd.DataFrame(mem) - -config = GPT2Config.from_pretrained( - "gpt2", - pad_token_id=50256, # eos_token_id - num_labels=1, -) -tokenizer = GPT2TokenizerFast.from_pretrained( - config.model_type, - padding=True, - truncation=True, - pad_token_id=config.pad_token_id, - pad_token="<|endoftext|>", # eos_token -) -tokenizer.pad_token -model = GPT2ForSequenceClassification(config) - -def tokenize(df: pd.DataFrame, tokenizer: GPT2TokenizerFast) -> List[Dict[str, Any]]: - tokenized_df = pd.DataFrame( - df.text.apply(tokenizer).tolist() - ) - return ( - pd.merge( - df, - tokenized_df, - left_index=True, - right_index=True, - ) - .drop(columns="text") - .to_dict("records") - ) - -train_ds = tokenize(train_df, tokenizer) -test_ds = tokenize(test_df, tokenizer) - -def compute_metrics(pred): - labels = pred.label_ids - predictions = pred.predictions - - return { - "mae": mean_absolute_error(labels, predictions), - #"mse": mean_squared_error(labels, predictions), - } - -training_args = TrainingArguments( - report_to="none", - evaluation_strategy="steps", - max_steps=100, - eval_steps=10, - metric_for_best_model="mse", - greater_is_better=False, - # going to delete all of this - output_dir="kaggle", - save_strategy="no", -) - -trainer = 
Trainer( - model=model, - args=training_args, - train_dataset=train_ds, - eval_dataset=test_ds, - tokenizer=tokenizer, - compute_metrics=compute_metrics -) - -trainer.train() diff --git a/atomgpt/scripts/usloth_gen.py b/atomgpt/scripts/usloth_gen.py deleted file mode 100644 index 41fa37f..0000000 --- a/atomgpt/scripts/usloth_gen.py +++ /dev/null @@ -1,137 +0,0 @@ -from unsloth import FastLanguageModel -import torch -from datasets import load_dataset -from trl import SFTTrainer -from transformers import TrainingArguments - -max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally! -dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+ -load_in_4bit = ( - True # Use 4bit quantization to reduce memory usage. Can be False. -) - -# 4bit pre quantized models we support for 4x faster downloading + no OOMs. -fourbit_models = [ - "unsloth/mistral-7b-bnb-4bit", - "unsloth/mistral-7b-instruct-v0.2-bnb-4bit", - "unsloth/llama-2-7b-bnb-4bit", - "unsloth/llama-2-13b-bnb-4bit", - "unsloth/codellama-34b-bnb-4bit", - "unsloth/tinyllama-bnb-4bit", -] # More models at https://huggingface.co/unsloth - -nm = "unsloth/mistral-7b-bnb-4bit" -nm = fourbit_models[-2] -nm = fourbit_models[0] -model, tokenizer = FastLanguageModel.from_pretrained( - model_name=nm, # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B - max_seq_length=max_seq_length, - dtype=dtype, - load_in_4bit=load_in_4bit, - # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf -) - - -model = FastLanguageModel.get_peft_model( - model, - r=16, # Choose any number > 0 ! 
Suggested 8, 16, 32, 64, 128 - target_modules=[ - "q_proj", - "k_proj", - "v_proj", - "o_proj", - "gate_proj", - "up_proj", - "down_proj", - ], - lora_alpha=16, - lora_dropout=0, # Supports any, but = 0 is optimized - bias="none", # Supports any, but = "none" is optimized - use_gradient_checkpointing=True, - random_state=3407, - use_rslora=False, # We support rank stabilized LoRA - loftq_config=None, # And LoftQ -) - -alpaca_prompt = """Below is a description of a superconductor material.. - -### Instruction: -{} - -### Input: -{} - -### Output: -{}""" - -EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN - - -def formatting_prompts_func(examples): - instructions = examples["instruction"] - inputs = examples["input"] - outputs = examples["output"] - texts = [] - for instruction, input, output in zip(instructions, inputs, outputs): - # Must add EOS_TOKEN, otherwise your generation will go on forever! - text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN - texts.append(text) - return { - "text": texts, - } - - -dataset = load_dataset( - "json", data_files="alpaca_Tc_supercon.json", split="train" -) -dataset = dataset.map( - formatting_prompts_func, - batched=True, -) - -trainer = SFTTrainer( - model=model, - tokenizer=tokenizer, - train_dataset=dataset, - dataset_text_field="text", - max_seq_length=max_seq_length, - dataset_num_proc=2, - packing=False, # Can make training 5x faster for short sequences. 
- args=TrainingArguments( - per_device_train_batch_size=2, - gradient_accumulation_steps=4, - warmup_steps=5, - overwrite_output_dir=True, - # max_steps = 60, - learning_rate=2e-4, - fp16=not torch.cuda.is_bf16_supported(), - bf16=torch.cuda.is_bf16_supported(), - logging_steps=1, - optim="adamw_8bit", - weight_decay=0.01, - lr_scheduler_type="linear", - seed=3407, - output_dir="outputs", - num_train_epochs=10, - report_to="none", - ), -) - -trainer_stats = trainer.train() -model.save_pretrained("lora_model_m") -# alpaca_prompt = Copied from above -FastLanguageModel.for_inference(model) # Enable native 2x faster inference -inputs = tokenizer( - [ - alpaca_prompt.format( - "Below is a description of a superconductor material.", # instruction - "The chemical formula is YCI The Tc_supercon is 6.483. The spacegroup is 12. Generate atomic structure description with lattice lengths, angles, coordinates and atom types.", # input - "", # output - leave this blank for generation! - ) - ], - return_tensors="pt", -).to("cuda") - -outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True) -# outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True) -print("xyz", tokenizer.batch_decode(outputs)) diff --git a/atomgpt/scripts/usloth_prop.py b/atomgpt/scripts/usloth_prop.py deleted file mode 100644 index af02399..0000000 --- a/atomgpt/scripts/usloth_prop.py +++ /dev/null @@ -1,406 +0,0 @@ -from sklearn.metrics import mean_absolute_error -import pandas as pd -from unsloth import FastLanguageModel -import torch -from datasets import load_dataset -from trl import SFTTrainer -from transformers import TrainingArguments -import re -import os -import json -import zipfile -from jarvis.core.atoms import Atoms -from jarvis.db.figshare import data -from jarvis.db.jsonutils import loadjson, dumpjson -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -from 
collections import defaultdict - - -def atoms_describer( - atoms=[], - xrd_peaks=5, - xrd_round=1, - cutoff=4, - take_n_bomds=2, - include_spg=True, -): - """Describe an atomic structure.""" - if include_spg: - spg = Spacegroup3D(atoms) - theta, d_hkls, intens = XRD().simulate(atoms=(atoms)) - # x = atoms.atomwise_angle_and_radial_distribution() - # bond_distances = {} - # for i, j in x[-1]["different_bond"].items(): - # bond_distances[i.replace("_", "-")] = ", ".join( - # map(str, (sorted(list(set([round(jj, 2) for jj in j]))))) - # ) - dists = defaultdict(list) - elements = atoms.elements - for i in atoms.get_all_neighbors(r=cutoff): - for j in i: - key = "-".join(sorted([elements[j[0]], elements[j[1]]])) - dists[key].append(j[2]) - bond_distances = {} - for i, j in dists.items(): - dist = sorted(set([round(k, 2) for k in j])) - if len(dist) >= take_n_bomds: - dist = dist[0:take_n_bomds] - bond_distances[i] = ", ".join(map(str, dist)) - fracs = {} - for i, j in (atoms.composition.atomic_fraction).items(): - fracs[i] = round(j, 3) - info = {} - chem_info = { - "atomic_formula": atoms.composition.reduced_formula, - "prototype": atoms.composition.prototype, - "molecular_weight": round(atoms.composition.weight / 2, 2), - "atomic_fraction": (fracs), - "atomic_X": ", ".join( - map(str, [Specie(s).X for s in atoms.uniq_species]) - ), - "atomic_Z": ", ".join( - map(str, [Specie(s).Z for s in atoms.uniq_species]) - ), - } - struct_info = { - "lattice_parameters": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.abc]) - ), - "lattice_angles": ", ".join( - map(str, [round(j, 2) for j in atoms.lattice.angles]) - ), - # "spg_number": spg.space_group_number, - # "spg_symbol": spg.space_group_symbol, - "top_k_xrd_peaks": ", ".join( - map( - str, - sorted(list(set([round(i, xrd_round) for i in theta])))[ - 0:xrd_peaks - ], - ) - ), - "density": round(atoms.density, 3), - # "crystal_system": spg.crystal_system, - # "point_group": spg.point_group_symbol, - # "wyckoff": ", 
".join(list(set(spg._dataset["wyckoffs"]))), - "bond_distances": bond_distances, - # "natoms_primitive": spg.primitive_atoms.num_atoms, - # "natoms_conventional": spg.conventional_standard_structure.num_atoms, - } - if include_spg: - struct_info["spg_number"] = spg.space_group_number - struct_info["spg_symbol"] = spg.space_group_symbol - struct_info["crystal_system"] = spg.crystal_system - struct_info["point_group"] = spg.point_group_symbol - struct_info["wyckoff"] = ", ".join(list(set(spg._dataset["wyckoffs"]))) - struct_info["natoms_primitive"] = spg.primitive_atoms.num_atoms - struct_info["natoms_conventional"] = ( - spg.conventional_standard_structure.num_atoms - ) - info["chemical_info"] = chem_info - info["structure_info"] = struct_info - line = "The number of atoms are: " + str( - atoms.num_atoms - ) # +"., The elements are: "+",".join(atoms.elements)+". " - for i, j in info.items(): - if not isinstance(j, dict): - line += "The " + i + " is " + j + ". " - else: - # print("i",i) - # print("j",j) - for ii, jj in j.items(): - tmp = "" - if isinstance(jj, dict): - for iii, jjj in jj.items(): - tmp += iii + ": " + str(jjj) + " " - else: - tmp = jj - line += "The " + ii + " is " + str(tmp) + ". " - return line - - -def get_crystal_string_t(atoms): - lengths = atoms.lattice.abc # structure.lattice.parameters[:3] - angles = atoms.lattice.angles - atom_ids = atoms.elements - frac_coords = atoms.frac_coords - - crystal_str = ( - " ".join(["{0:.2f}".format(x) for x in lengths]) - + "\n" - + " ".join([str(int(x)) for x in angles]) - + "\n" - + "\n".join( - [ - str(t) + " " + " ".join(["{0:.3f}".format(x) for x in c]) - for t, c in zip(atom_ids, frac_coords) - ] - ) - ) - - # crystal_str = atoms_describer(atoms) + "\n*\n" + crystal_str - return crystal_str - - -def make_alpaca_json_gen(dataset=[], prop="Tc_supercon"): - alpaca_prompt = """Below is a description of a material.. 
- - ### Instruction: - {} - - ### Input: - {} - - ### Output: - {}""" - - mem = [] - all_ids = [] - for i in dataset: - if i[prop] != "na": - atoms = Atoms.from_dict(i["atoms"]) - info = {} - info["instruction"] = ( - "Below is a description of a superconductor material." - ) - info["input"] = ( - "The chemical formula is " - + atoms.composition.reduced_formula - + " The " - + prop - + " is " - + str(round(i[prop], 3)) - + ". The spacegroup is " - + i["spg_number"] - + "." - + " Generate atomic structure description with lattice lengths, angles, coordinates and atom types." - ) - info["output"] = get_crystal_string_t(atoms) - mem.append(info) - return mem - - -def make_alpaca_json_pred( - dataset=[], prop="Tc_supercon", id_tag="jid", ids=[] -): - alpaca_prompt = """Below is a description of a material.. - - ### Instruction: - {} - - ### Input: - {} - - ### Output: - {}""" - all_ids = [] - mem = [] - for i in dataset: - if i[prop] != "na" and i[id_tag] in ids: - atoms = Atoms.from_dict(i["atoms"]) - info = {} - info["instruction"] = ( - "Predict " + prop + " property of this material" - ) - info["input"] = get_crystal_string_t(atoms) - info["output"] = str(round(i[prop], 2)) - mem.append(info) - all_ids.append(i[id_tag]) - return alpaca_prompt, mem, all_ids - - -benchmark_file = ( - "AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip" -) -root_dir = "/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard" -method = benchmark_file.split("-")[0] -task = benchmark_file.split("-")[1] -prop = benchmark_file.split("-")[2] -dataset = benchmark_file.split("-")[3] -temp = dataset + "_" + prop + ".json.zip" -temp2 = dataset + "_" + prop + ".json" -fname = os.path.join(root_dir, "benchmarks", method, task, temp) -zp = zipfile.ZipFile(fname) -bench = json.loads(zp.read(temp2)) -dft_3d = data(dataset) -id_tag = "jid" -if "jid" in dft_3d[0]: - id_tag = "jid" -else: - id_tag = "id" - -# train_atoms = [] -# val_atoms = [] -# test_atoms = [] -# train_targets = 
[] -# val_targets = [] -# test_targets = [] -train_ids = list(bench["train"].keys()) -test_ids = list(bench["test"].keys()) -if "val" in bench: - val_ids = list(bench["val"].keys()) -else: - val_ids = test_ids -print("total", len(dft_3d)) -print("test_ids", len(test_ids)) -print("val_ids", len(val_ids)) -print("train_ids", len(train_ids)) -alpaca_prompt, train_data, train_ids = make_alpaca_json_pred( - dataset=dft_3d, prop=prop, id_tag=id_tag, ids=train_ids -) -alpaca_prompt, test_data, test_ids = make_alpaca_json_pred( - dataset=dft_3d, prop=prop, id_tag=id_tag, ids=test_ids -) -dumpjson(data=train_data, filename="train_data.json") -dumpjson(data=test_data, filename="test_data.json") -model_path = "lora_model_train" - -max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally! -dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+ -load_in_4bit = ( - True # Use 4bit quantization to reduce memory usage. Can be False. -) - -# 4bit pre quantized models we support for 4x faster downloading + no OOMs. -fourbit_models = [ - "unsloth/mistral-7b-bnb-4bit", - "unsloth/mistral-7b-instruct-v0.2-bnb-4bit", - "unsloth/llama-2-7b-bnb-4bit", - "unsloth/llama-2-13b-bnb-4bit", - "unsloth/codellama-34b-bnb-4bit", - "unsloth/tinyllama-bnb-4bit", -] # More models at https://huggingface.co/unsloth - -nm = "unsloth/mistral-7b-bnb-4bit" -nm = fourbit_models[-2] -# nm = fourbit_models[0] -model, tokenizer = FastLanguageModel.from_pretrained( - model_name=nm, # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B - max_seq_length=max_seq_length, - dtype=dtype, - load_in_4bit=load_in_4bit, - # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf -) - - -model = FastLanguageModel.get_peft_model( - model, - r=16, # Choose any number > 0 ! 
Suggested 8, 16, 32, 64, 128 - target_modules=[ - "q_proj", - "k_proj", - "v_proj", - "o_proj", - "gate_proj", - "up_proj", - "down_proj", - ], - lora_alpha=16, - lora_dropout=0, # Supports any, but = 0 is optimized - bias="none", # Supports any, but = "none" is optimized - use_gradient_checkpointing=True, - random_state=3407, - use_rslora=False, # We support rank stabilized LoRA - loftq_config=None, # And LoftQ -) - - -EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN - - -def formatting_prompts_func(examples): - instructions = examples["instruction"] - inputs = examples["input"] - outputs = examples["output"] - texts = [] - for instruction, input, output in zip(instructions, inputs, outputs): - # Must add EOS_TOKEN, otherwise your generation will go on forever! - text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN - texts.append(text) - return { - "text": texts, - } - - -dataset = load_dataset("json", data_files="train_data.json", split="train") -dataset = dataset.map( - formatting_prompts_func, - batched=True, -) - -trainer = SFTTrainer( - model=model, - tokenizer=tokenizer, - train_dataset=dataset, - dataset_text_field="text", - max_seq_length=max_seq_length, - dataset_num_proc=2, - packing=False, # Can make training 5x faster for short sequences. 
- args=TrainingArguments( - per_device_train_batch_size=2, - gradient_accumulation_steps=4, - warmup_steps=5, - overwrite_output_dir=True, - # max_steps = 60, - learning_rate=2e-4, - fp16=not torch.cuda.is_bf16_supported(), - bf16=torch.cuda.is_bf16_supported(), - logging_steps=1, - optim="adamw_8bit", - weight_decay=0.01, - lr_scheduler_type="linear", - seed=3407, - output_dir="outputs", - num_train_epochs=3, - report_to="none", - ), -) -trainer_stats = trainer.train() -model.save_pretrained(model_path) - -model_x, tokenizer = FastLanguageModel.from_pretrained( - model_name=model_path, # YOUR MODEL YOU USED FOR TRAINING - max_seq_length=max_seq_length, - dtype=dtype, - load_in_4bit=load_in_4bit, -) -FastLanguageModel.for_inference(model_x) # Enable native 2x faster inference - -# alpaca_prompt = You MUST copy from above! - -f = open("sloth_prop.csv", "w") -f.write("id,target,prediction\n") -for ii, i in enumerate(test_data): - inputs = tokenizer( - [ - alpaca_prompt.format( - "Predict " - + prop - + " property of this material", # instruction - i["input"], # input - "", # output - leave this blank for generation! 
- ) - ], - return_tensors="pt", - ).to("cuda") - - outputs = tokenizer.batch_decode( - model_x.generate(**inputs, max_new_tokens=64, use_cache=True) - )[0].split("### Output:\n")[-1] - floats = [float(j) for j in re.findall(r"\b\d+\.\d+\b", outputs)] - print(test_ids[ii], ",", i["output"], ",", floats[0]) - line = ( - str(test_ids[ii]) - + "," - + str(i["output"]) - + "," - + str(floats[0]) - + "\n" - ) - f.write(line) - # print(test_ids[ii], ",",i["output"].split("## Output:\\n")[1].split("")[0], ",",tokenizer.batch_decode(outputs)) -f.close() -df = pd.read_csv("sloth_prop.csv") -print("mae", mean_absolute_error(df["target"], df["prediction"])) diff --git a/atomgpt/train_id_prop.py b/atomgpt/train_id_prop.py deleted file mode 100644 index 78bcd03..0000000 --- a/atomgpt/train_id_prop.py +++ /dev/null @@ -1,716 +0,0 @@ -"""Module for fin tuning LLM model for materials chemsitry.""" - -from jarvis.db.figshare import data -import transformers -import torch -import random -from jarvis.db.jsonutils import loadjson, dumpjson -from torch.utils.data import DataLoader, Dataset -import numpy as np -import os -from jarvis.core.atoms import Atoms -import pandas as pd -from sklearn.metrics import mean_absolute_error -import json -from jarvis.db.figshare import get_jid_data -from jarvis.core.atoms import Atoms -from jarvis.analysis.structure.spacegroup import Spacegroup3D -from jarvis.analysis.diffraction.xrd import XRD -from jarvis.core.specie import Specie -import pprint -from collections import defaultdict -from tqdm import tqdm -import time -import json -import zipfile -from typing import Optional -from pydantic_settings import BaseSettings - - -class TrainingPropConfig(BaseSettings): - """Training config defaults and validation.""" - - id_prop_path: Optional[str] = "robo_desc.json.zip" - prefix: str = "atomgpt_run" - model_name: str = "gpt2" - batch_size: int = 16 - max_length: int = 512 - num_epochs: int = 500 - latent_dim: int = 1024 - learning_rate: float = 1e-3 - 
test_each_run: bool = True - include_struct: bool = False - pretrained_path: str = "" - seed_val: int = 42 - n_train: Optional[int] = None - n_val: Optional[int] = None - n_test: Optional[int] = None - output_dir: str = "out_temp" - train_ratio: Optional[float] = None - val_ratio: float = 0.1 - test_ratio: float = 0.1 - keep_data_order: bool = True - - -def get_id_train_val_test( - total_size=1000, - split_seed=123, - train_ratio=None, - val_ratio=0.1, - test_ratio=0.1, - n_train=None, - n_test=None, - n_val=None, - keep_data_order=True, -): - """Get train, val, test IDs.""" - if ( - train_ratio is None - and val_ratio is not None - and test_ratio is not None - ): - if train_ratio is None: - assert val_ratio + test_ratio < 1 - train_ratio = 1 - val_ratio - test_ratio - print("Using rest of the dataset except the test and val sets.") - else: - assert train_ratio + val_ratio + test_ratio <= 1 - # indices = list(range(total_size)) - if n_train is None: - n_train = int(train_ratio * total_size) - if n_test is None: - n_test = int(test_ratio * total_size) - if n_val is None: - n_val = int(val_ratio * total_size) - ids = list(np.arange(total_size)) - if not keep_data_order: - random.seed(split_seed) - random.shuffle(ids) - # np.random.shuffle(ids) - if n_train + n_val + n_test > total_size: - raise ValueError( - "Check total number of samples.", - n_train + n_val + n_test, - ">", - total_size, - ) - - # shuffle consistently with https://github.com/txie-93/cgcnn/data.py - # i.e. 
shuffle the index in place with standard library random.shuffle - # first obtain only valid indices - - # test_size = round(N * 0.2) - - # full train/val test split - # ids = ids[::-1] - id_train = ids[:n_train] - id_val = ( - ids[-(n_val + n_test) : -n_test] - if n_test > 0 - else ids[-(n_val + n_test) :] - ) # noqa:E203 - id_test = ids[-n_test:] if n_test > 0 else [] - return id_train, id_val, id_test - - -def make_id_prop( - benchmark_file="AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae.csv.zip", - desc_file="robo_desc.json.zip", - leaderboard_dir="/wrk/knc6/AFFBench/jarvis_leaderboard/jarvis_leaderboard", - # leaderboard_dir="/work/03943/kamalch/ls6/Software/atomgpt/jarvis_leaderboard/jarvis_leaderboard/", - output_dir="test_id_prop", -): - print("benchmark_file", benchmark_file) - method = benchmark_file.split("-")[0] - task = benchmark_file.split("-")[1] - prop_name = benchmark_file.split("-")[2] - dataset = benchmark_file.split("-")[3] - temp = dataset + "_" + prop_name + ".json.zip" - temp2 = dataset + "_" + prop_name + ".json" - fname = os.path.join(leaderboard_dir, "benchmarks", method, task, temp) - zp = zipfile.ZipFile(fname) - bench = json.loads(zp.read(temp2)) - dft_3d = data(dataset) - id_tag = "jid" - output_dir = prop_name + "_" + dataset - if "jid" in dft_3d[0]: - id_tag = "jid" - else: - id_tag = "id" - if not os.path.exists(output_dir): - os.makedirs(output_dir) - train_ids = list(bench["train"].keys()) - test_ids = list(bench["test"].keys()) - if "val" in bench: - val_ids = list(bench["val"].keys()) - else: - val_ids = test_ids - print("Saving files in", output_dir) - if ".zip" in desc_file: - zp = zipfile.ZipFile(desc_file) - dat = json.loads(zp.read(desc_file.split(".zip")[0].split("/")[-1])) - - else: - dat = loadjson(desc_file) - - dat2 = {} - for i in dat: - dat2[i["id"]] = i["desc"] - dft_3d2 = {} - for i in dft_3d: - dft_3d2[i[id_tag]] = i - mem = [] - for i in train_ids: - desc = dat2[i] - prop = dft_3d2[i][prop_name] - 
info = {} - info["id"] = i - info["desc"] = desc - info["prop"] = prop - mem.append(info) - for i in val_ids: - desc = dat2[i] - - prop = dft_3d2[i][prop_name] - info = {} - info["id"] = i - info["desc"] = desc - info["prop"] = prop - mem.append(info) - for i in test_ids: - desc = dat2[i] - prop = dft_3d2[i][prop_name] - info = {} - info["id"] = i - info["desc"] = desc - info["prop"] = prop - mem.append(info) - print("total", len(dft_3d)) - print("test_ids", len(test_ids)) - print("val_ids", len(val_ids)) - print("train_ids", len(train_ids)) - filename = os.path.join(output_dir, "id_prop_llm.json") - filename_config = os.path.join(output_dir, "config.json") - minfo = {} - minfo["n_train"] = len(train_ids) - minfo["n_val"] = len(val_ids) - minfo["n_test"] = len(test_ids) - minfo["id_prop_path"] = os.path.abspath(filename) - minfo["output_dir"] = os.path.abspath(output_dir) - - dumpjson(data=minfo, filename=filename_config) - dumpjson(data=mem, filename=filename) - return output_dir - - -## -os.environ["WANDB_ANONYMOUS"] = "must" -random_seed = 42 -random.seed(random_seed) -torch.manual_seed(random_seed) -np.random.seed(random_seed) -torch.cuda.manual_seed_all(random_seed) -try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) -except ImportError: - pass -torch.backends.cudnn.deterministic = True -torch.backends.cudnn.benchmark = False -os.environ["PYTHONHASHSEED"] = str(random_seed) -os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") -torch.use_deterministic_algorithms(True) -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") -# device = "cpu" - - -# Define a custom dataset class for regression -class AtomGPTDataset(Dataset): - def __init__( - self, texts=[], targets=[], ids=[], tokenizer="", max_length=128 - ): - self.texts = texts - self.targets = targets - self.tokenizer = tokenizer - self.max_length = max_length - if not ids: - ids = ["text-" + str(i) for i in range(len(texts))] - self.ids = ids - - def 
__len__(self): - return len(self.texts) - - def __getitem__(self, idx): - inputs = self.tokenizer( - self.texts[idx], - return_tensors="pt", - max_length=self.max_length, - padding="max_length", - truncation=True, - ) - # torch.tensor(inputs*10,dtype=inputs.dtype) - return ( - inputs, - self.ids[idx], - torch.tensor(self.targets[idx], dtype=torch.float32), - ) - - -# Example usage - - -def run_atomgpt(config_file="config.json"): - print("Running AtomGPT prop predictor.") - config = loadjson(config_file) - config = TrainingPropConfig(**config) - id_prop_path = config.id_prop_path - if ".zip" in id_prop_path: - zp = zipfile.ZipFile(id_prop_path) - dat = json.loads(zp.read(id_prop_path.split(".zip")[0])) - else: - dat = loadjson(id_prop_path) - print("len", len(dat)) - prefix = config.prefix - model_name = config.model_name - batch_size = config.batch_size - max_length = config.max_length - num_epochs = config.num_epochs - latent_dim = config.latent_dim - learning_rate = config.learning_rate - test_each_run = config.test_each_run - pretrained_path = config.pretrained_path - seed_val = config.seed_val - include_struct = config.include_struct - n_train = config.n_train - n_val = config.n_val - n_test = config.n_test - train_ratio = config.train_ratio - val_ratio = config.val_ratio - test_ratio = config.test_ratio - output_dir = config.output_dir - keep_data_order = config.keep_data_order - - f = open(os.path.join(config.output_dir, "config.json"), "w") - f.write(json.dumps(config.dict(), indent=4)) - f.close() - - id_train, id_val, id_test = get_id_train_val_test( - total_size=len(dat), - split_seed=seed_val, - train_ratio=train_ratio, - val_ratio=val_ratio, - test_ratio=test_ratio, - n_train=n_train, - n_test=n_test, - n_val=n_val, - keep_data_order=keep_data_order, - ) - - train_texts = [] - train_targets = [] - train_ids_temp = [] - val_texts = [] - val_targets = [] - val_ids_temp = [] - test_texts = [] - test_targets = [] - test_ids_temp = [] - train_info = [] - 
val_info = [] - test_info = [] - for ii, i in enumerate(dat): - if ii in id_train: - train_texts.append(i["desc"]) - train_targets.append(i["prop"]) - train_ids_temp.append(i["id"]) - train_info.append(i) - if ii in id_test: - test_texts.append(i["desc"]) - test_targets.append(i["prop"]) - test_ids_temp.append(i["id"]) - val_info.append(i) - if ii in id_val: - val_texts.append(i["desc"]) - val_targets.append(i["prop"]) - val_ids_temp.append(i["id"]) - test_info.append(i) - print("test_texts:", len(test_texts)) - print("val_texts example:", val_texts[0]) - print("test_texts example:", test_texts[0]) - - print("Train\n", pd.DataFrame(train_info)) - print("Val\n", pd.DataFrame(val_info)) - print("test\n", pd.DataFrame(test_info)) - - print("total", len(dat)) - print("test_ids", len(id_test)) - print("val_ids", len(id_val)) - print("train_ids", len(id_train)) - # model_name = "mistralai/Mistral-7B-Instruct-v0.1" - # model_name = "gpt2" - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) - - # batch_size = 16 - # max_length = 128 - # num_epochs = 100 - # learning_rate = 5e-5 - criterion = torch.nn.L1Loss() - # Define example regression data (texts and corresponding numeric targets) - """ - ############################## - ###Fast test### - train_texts = [ - "This is the first example text.", - "Second example is a bit longer than the first one, but still within the max length.", - "Third example is the longest among these three examples. 
It exceeds the max length and will be truncated.", - "Second example is a bit longer than the first one, but still within the max length.", - ] - train_targets = [10.2, 15.5, 20.1, 15.5] # Example regression targets - val_texts = test_texts = train_texts - val_targets = test_targets = train_targets - train_ids_temp=['a','b','c','d'] - val_ids_temp = test_ids_temp = train_ids_temp - batch_size = 2 - num_epochs = 3 - - ############################## - ############################## - """ - - # Fine-tune the last layer of GPT-2 for regression - # fine_tune_gpt2_regression(train_texts, train_targets, tokenizer) - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - # torch.nn.Linear(model.config.hidden_size, 1), - torch.nn.Linear(model.config.hidden_size, latent_dim), - # torch.nn.Linear( latent_dim,256), - # torch.nn.Transformer(d_model=latent_dim, nhead=1, num_encoder_layers=1, num_decoder_layers=1), - # torch.nn.Linear(latent_dim, latent_dim), - # torch.nn.Linear(latent_dim, latent_dim), - # torch.nn.ReLU(), - # torch.nn.LeakyReLU(), - # torch.nn.Dropout(p=0.2), - # torch.nn.TransformerEncoder(torch.nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4), num_layers=2), - # torch.nn.Linear(256, 1), - torch.nn.Linear(latent_dim, 1), - ) - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - # model.lm_head = torch.nn.Sequential(torch.nn.Linear( model.config.hidden_size, 256),torch.nn.SiLU(),torch.nn.Linear( 256, 1) ) - # set_seed(seed) - # set_deterministic() - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - 
optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - # Prepare datasets and dataloaders with data collator - # TODO: knc6 change later - train_dataset = AtomGPTDataset( - texts=train_texts, - targets=train_targets, - ids=train_ids_temp, - tokenizer=tokenizer, - max_length=max_length, - ) - test_dataset = AtomGPTDataset( - texts=val_texts, - targets=val_targets, - tokenizer=tokenizer, - ids=val_ids_temp, - max_length=max_length, - ) - val_dataset = AtomGPTDataset( - texts=test_texts, - targets=test_targets, - tokenizer=tokenizer, - ids=test_ids_temp, - max_length=max_length, - ) - train_dataloader = DataLoader(train_dataset, batch_size=batch_size) - val_dataloader = DataLoader(val_dataset, batch_size=batch_size) - test_dataloader = DataLoader(test_dataset, batch_size=batch_size) - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # output_dir = prefix + "_out" # + model_name + "_" + dataset + "_" + prop - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - # optimizer.zero_grad() - train_loss += loss.item() 
- scheduler.step() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - model.eval() - val_loss = 0 - t1 = time.time() - fname = os.path.join(output_dir, "val_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - with torch.no_grad(): - for batch in val_dataloader: - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - ids = batch[1] - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - f.write(line) - f.close() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - model.eval() - with torch.no_grad(): - if test_each_run: - t1_test = 
time.time() - # model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - test_loss = 0 - for batch in test_dataloader: - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - test_loss += loss.item() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - f.write(line) - test_loss = test_loss / len(test_dataloader) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - # mae, - test_loss, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results_final.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - optimizer.zero_grad() - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - 
decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - output_dir = make_id_prop() - run_atomgpt(config_file=output_dir + "/config.json") - # config_file="config.json" - # ) diff --git a/atomgpt/train_prop.py b/atomgpt/train_prop.py deleted file mode 100644 index 40272c3..0000000 --- a/atomgpt/train_prop.py +++ /dev/null @@ -1,416 +0,0 @@ -#!/usr/bin/env python -"""Module to train properties.""" -import transformers -from atomgpt.data.dataset import data_from_benchmark_file, data_from_id_prop -from atomgpt.config import TrainingPropConfig -import os -import json -import zipfile -import torch -from jarvis.db.figshare import data -import time -import pandas as pd -from sklearn.metrics import mean_absolute_error -import random -import numpy as np -import os -from jarvis.db.jsonutils import loadjson, dumpjson -import sys -import argparse -import pprint - -device = "cpu" -if torch.cuda.is_available(): - device = torch.device("cuda") - -parser = argparse.ArgumentParser(description="AtomGPT") -parser.add_argument( - "--config_file", - default="config.json", - help="Config file", -) - - -def set_seed(random_seed=42): - os.environ["WANDB_ANONYMOUS"] = 
"must" - # random_seed = 42 - random.seed(random_seed) - torch.manual_seed(random_seed) - np.random.seed(random_seed) - torch.cuda.manual_seed_all(random_seed) - try: - import torch_xla.core.xla_model as xm - - xm.set_rng_state(random_seed) - except ImportError: - pass - torch.backends.cudnn.deterministic = True - torch.backends.cudnn.benchmark = False - os.environ["PYTHONHASHSEED"] = str(random_seed) - os.environ["CUBLAS_WORKSPACE_CONFIG"] = str(":4096:8") - torch.use_deterministic_algorithms(True) - - -def run_atomgpt(config_file=""): - print("Running AtomGPT prop predictor.") - config = loadjson(config_file) - config = TrainingPropConfig(**config) - benchmark_file = config.benchmark_file - id_prop_path = config.id_prop_path - prefix = config.prefix - model_name = config.model_name - leaderboard_dir = config.leaderboard_dir - batch_size = config.batch_size - max_length = config.max_length - num_epochs = config.num_epochs - latent_dim = config.latent_dim - learning_rate = config.learning_rate - test_each_run = config.test_each_run - pretrained_path = config.pretrained_path - seed_val = config.seed_val - include_struct = config.include_struct - n_train = config.n_train - n_val = config.n_val - n_test = config.n_test - train_ratio = config.train_ratio - val_ratio = config.val_ratio - test_ratio = config.test_ratio - keep_data_order = config.keep_data_order - output_dir = config.output_dir - print("configs", pprint.pprint(config.dict())) - set_seed(random_seed=seed_val) - if "t5" in model_name: - model = transformers.T5ForConditionalGeneration.from_pretrained( - model_name - ) - else: - model = transformers.AutoModelForCausalLM.from_pretrained( - model_name, - low_cpu_mem_usage=True, - # load_in_8bit=False, - # torch_dtype=torch.float16, - # load_in_8bit=True, - # device_map="auto" - ) - # device = model.device - if "t5" in model_name: - tokenizer = transformers.T5Tokenizer.from_pretrained(model_name) - - else: - tokenizer = 
transformers.AutoTokenizer.from_pretrained(model_name) - - if tokenizer.pad_token is None: - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) - tokenizer.add_special_tokens({"unk_token": "#"}) - tokenizer.add_special_tokens({"unk_token": "&"}) - tokenizer.add_special_tokens({"unk_token": "@"}) - model.resize_token_embeddings(len(tokenizer)) - model.lm_head = torch.nn.Sequential( - torch.nn.Linear(model.config.hidden_size, latent_dim), - torch.nn.Linear(latent_dim, 1), - ) - if benchmark_file is not None: - ( - train_dataloader, - val_dataloader, - test_dataloader, - ) = data_from_benchmark_file( - benchmark_file=benchmark_file, - leaderboard_dir=leaderboard_dir, - tokenizer=tokenizer, - max_length=max_length, - batch_size=batch_size, - include_struct=include_struct, - ) - elif id_prop_path is not None: - train_dataloader, val_dataloader, test_dataloader = data_from_id_prop( - id_prop_path=id_prop_path, - tokenizer=tokenizer, - max_length=max_length, - split_seed=seed_val, - n_train=n_train, - n_val=n_val, - n_test=n_test, - train_ratio=train_ratio, - val_ratio=val_ratio, - test_ratio=test_ratio, - keep_data_order=keep_data_order, - batch_size=batch_size, - include_struct=include_struct, - calc_desc=False, - ) - else: - raise ValueError("Provide id_prop_path or benchmark_file") - - val_dataloader = test_dataloader # for now - if pretrained_path != "": - model.load_state_dict(torch.load(pretrained_path, map_location=device)) - model.to(device) - if torch.cuda.device_count() > 1: - device_ids = [d for d in range(torch.cuda.device_count())] - model = torch.nn.DataParallel(model, device_ids=device_ids).cuda() - criterion = torch.nn.L1Loss() - optimizer = transformers.AdamW(model.parameters(), lr=learning_rate) - steps_per_epoch = len(train_dataloader) - scheduler = torch.optim.lr_scheduler.OneCycleLR( - optimizer, - max_lr=learning_rate, - epochs=num_epochs, - steps_per_epoch=steps_per_epoch, - # pct_start=pct_start, - pct_start=0.3, - ) - # print("train_data", 
len(train_texts)) - # print("test_data", len(test_texts)) - # output_dir = prefix + "_out_" + model_name - if not os.path.exists(output_dir): - os.makedirs(output_dir) - best_loss = np.inf - tot_time_start = time.time() - train_history = [] - val_history = [] - for epoch in range(num_epochs): - model.train() - t1 = time.time() - for batch in train_dataloader: - optimizer.zero_grad() - train_loss = 0 - # train_result = [] - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - # print('train',predictions,targets) - loss.backward() - optimizer.step() - scheduler.step() - # optimizer.zero_grad() - train_loss += loss.item() - train_loss = train_loss / len(train_dataloader) - t2 = time.time() - train_time = round(t2 - t1, 3) - model.eval() - - # total_eval_mae_loss = 0 - # predictions_list = [] - # targets_list = [] - val_loss = 0 - t1 = time.time() - for batch in val_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - targets = batch[2].squeeze() - # print('val',predictions,targets) - loss = criterion( - predictions.squeeze(), targets.squeeze().to(device) - ) - val_loss += loss.item() - saving_tag = "" - if val_loss < best_loss: - best_loss = val_loss - best_model_name = "best_model.pt" - torch.save( - model.state_dict(), - os.path.join(output_dir, best_model_name), - ) - # print("Saving 
model for epoch", epoch) - saving_tag = " saving model:" + str(epoch) - val_loss = val_loss / len(val_dataloader) - t2 = time.time() - val_time = round(t2 - t1, 3) - train_history.append(train_loss) - val_history.append(val_loss) - history = os.path.join(output_dir, "history.json") - - dumpjson( - data={"train": train_history, "val": val_history}, filename=history - ) - mae = "" - if test_each_run: - t1_test = time.time() - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - - else: - predictions = ( - model( - input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - f.write(line) - t2_test = time.time() - test_time = round(t2_test - t1_test, 3) - f.close() - df = pd.read_csv(fname) - mae = mean_absolute_error(df["target"], df["predictions"]) - if mae == "": - print( - "Epoch, train loss, val loss, train_time, val_time", - epoch, - train_loss, - val_loss, - train_time, - val_time, - saving_tag, - ) - else: - print( - "Epoch, train loss, val loss, test loss, train_time, val_time, test_time", - epoch, - train_loss, - val_loss, - mae, - train_time, - val_time, - test_time, - saving_tag, - ) - - model.eval() - fname = os.path.join(output_dir, "test_results.csv") - f = open(fname, "w") - 
f.write("id,target,predictions\n") - for batch in test_dataloader: - with torch.no_grad(): - input_ids = batch[0]["input_ids"].squeeze() # .squeeze(0) - if "t5" in model_name: - predictions = ( - model( - input_ids.to(device), - decoder_input_ids=input_ids.to(device), - ) - .logits.squeeze() - .mean(dim=-1) - ) - else: - predictions = ( - model(input_ids.to(device)).logits.squeeze().mean(dim=-1) - ) - ids = batch[1] - targets = batch[2].squeeze() - if len(ids) == 1: - targets = [targets] - predictions = [predictions] - # ids=[ids] - for ii, jj, kk in zip(targets, predictions, ids): - # print(kk,ii.cpu().detach().numpy().tolist(),jj.cpu().detach().numpy().tolist()) - line = ( - str(kk) - + "," - + str(round(ii.cpu().detach().numpy().tolist(), 3)) - + "," - + str(round(jj.cpu().detach().numpy().tolist(), 3)) - + "\n" - ) - # f.write("%s, %6f, %6f\n" % (kk, ii.cpu().detach().numpy().tolist(), jj.cpu().detach().numpy().tolist())) - # print(line) - f.write(line) - f.close() - tot_time_end = time.time() - tot_time = tot_time_end - tot_time_start - print("tot_time", tot_time) - - -if __name__ == "__main__": - # box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]] - # coords = [[0, 0, 0], [0.25, 0.2, 0.25]] - # elements = ["Si", "Si"] - # Si = Atoms(lattice_mat=box, coords=coords, elements=elements) - # tmp=atoms_describer(Si) - # print(tmp) - # import sys - # sys.exit() - args = parser.parse_args(sys.argv[1:]) - config_file = args.config_file - # "AI-SinglePropertyPrediction-PBE_gap-halide_peroskites-test-mae.csv.zip" - # "AI-SinglePropertyPrediction-Tc_supercon-dft_3d-test-mae.csv.zip" - # id_prop_path = ( - # "/wrk/knc6/Software/mini_alignn/alignn/alignn/examples/sample_data" - # ) - # "AI-SinglePropertyPrediction-ead-tinnet_N-test-mae.csv.zip" - # "AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test-mae" - # args.benchmark_file - model_name = "facebook/opt-350m" - model_name = "mistralai/Mixtral-8x7B-v0.1" - model_name = "google/flan-t5-small" - 
model_name = "google/flan-t5-base"
-    model_name = "mistralai/Mistral-7B-Instruct-v0.1"
-    model_name = "google-t5/t5-small"
-    model_name = "xlnet/xlnet-base-cased"
-    model_name = "afmck/testing-llama-tiny"
-    model_name = "EleutherAI/gpt-neo-125m"
-    model_name = "openai-community/gpt2-medium"
-    model_name = "meta-llama/Llama-2-7b-hf"
-    model_name = "stas/tiny-random-llama-2"
-    model_name = "ahxt/llama2_xs_460M_experimental"
-    model_name = "gpt2"
-    run_atomgpt(
-        config_file=config_file,
-    )
diff --git a/dev-requirements.txt b/dev-requirements.txt
new file mode 100644
index 0000000..6989dc3
--- /dev/null
+++ b/dev-requirements.txt
@@ -0,0 +1,102 @@
+accelerate==0.31.0
+aiohttp==3.9.5
+aiosignal==1.3.1
+alignn==2024.4.20
+annotated-types==0.7.0
+ase==3.23.0
+async-timeout==4.0.3
+attrs==23.2.0
+autopep8==2.3.1
+bitsandbytes==0.43.1
+black==24.4.2
+certifi==2024.6.2
+cffi
+chardet==3.0.4
+charset-normalizer==3.3.2
+click==8.1.7
+contourpy==1.2.1
+cycler==0.12.1
+datasets==2.20.0
+dgl==1.1.1
+dill==0.3.8
+docstring_parser==0.16
+eval_type_backport==0.2.0
+filelock
+flake8==7.1.0
+fonttools==4.53.0
+frozenlist==1.4.1
+fsspec==2024.5.0
+gmpy2
+huggingface-hub==0.23.4
+idna==3.7
+importlib_resources==6.4.0
+jarvis-tools==2024.4.30
+Jinja2
+joblib==1.4.2
+kiwisolver==1.4.5
+lmdb==1.4.1
+markdown-it-py==3.0.0
+MarkupSafe
+matplotlib==3.9.0
+mccabe==0.7.0
+mdurl==0.1.2
+mpmath
+multidict==4.7.6
+multiprocess==0.70.16
+mypy-extensions==1.0.0
+networkx
+numpy==1.26.4
+packaging==24.1
+pandas==2.2.2
+pathspec==0.12.1
+peft==0.11.1
+pillow==10.3.0
+platformdirs==4.2.2
+psutil==6.0.0
+pyarrow==16.1.0
+pyarrow-hotfix==0.6
+pycodestyle==2.12.0
+pycparser
+pydantic==2.7.4
+pydantic-settings==2.3.3
+pydantic_core==2.18.4
+pydocstyle==6.3.0
+pyflakes==3.2.0
+Pygments==2.18.0
+pyparsing==2.4.7
+python-dateutil==2.9.0.post0
+python-dotenv==1.0.1
+pytz==2024.1
+PyYAML
+regex==2024.5.15
+requests==2.32.3
+rich==13.7.1
+safetensors==0.4.3
+scikit-learn==1.5.0
+scipy==1.13.1
+sentencepiece==0.2.0
+shtab==1.7.1
+six==1.16.0
+snowballstemmer==2.2.0
+spglib==2.4.0
+sympy
+threadpoolctl==3.5.0
+tokenizers==0.19.1
+tomli==2.0.1
+toolz==0.12.1
+torch==2.2.2
+torchdata==0.7.1
+tqdm==4.66.4
+transformers==4.41.2
+triton==2.2.0
+trl==0.8.6
+typing_extensions
+tyro==0.8.4
+tzdata==2024.1
+urllib3==2.2.2
+xformers==0.0.25.post1
+xmltodict==0.13.0
+xxhash==3.4.1
+yarl==1.9.4
+zipp==3.19.2
+
diff --git a/setup.py b/setup.py
index aef9d0e..1d5a8f4 100644
--- a/setup.py
+++ b/setup.py
@@ -23,7 +23,7 @@
         "sentencepiece"
     ],
-    scripts=["atomgpt/train_prop.py"],
+    # scripts=["atomgpt/train_prop.py"],
     long_description=long_description,
     long_description_content_type="text/markdown",
     url="https://github.com/usnistgov/atomgpt",