Cannot import LLMDataset (from dataloader.bigquery_pypi import LLMDataset) in pl_data.py, line 11 #2

Open
jswjc555 opened this issue May 31, 2024 · 1 comment

Comments

@jswjc555

Hello, I am stuck at Step 2: running training.

Could you please let me know whether LLMDataset, imported via "from dataloader.bigquery_pypi import LLMDataset", comes from a third-party dataloader library or is implemented within this project? My locally installed dataloader package is version 2.0, and the import fails against it. There is also no dataloader directory in the project, so pl_data cannot be imported or run. Am I missing the implementation of LLMDataset?

Looking forward to a response from the author. Thanks.

@xiaowu0162
Contributor

Hi,

Thank you for raising the issue. We missed a file when preparing the code. I will submit a PR to fix that. For now, you can create the file dataloader/bigquery_pypi.py with the following content:

import torch
from torch.utils.data import Dataset


class LLMDataset(Dataset):
    """Wraps pre-chunked, tokenized examples; each item has a 'token_ids' list."""

    def __init__(self, data):
        super().__init__()
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, ind):
        # Index the chunked data directly and convert the token ids to a tensor.
        source_tokens = torch.tensor(self.data[ind]['token_ids'])
        return {"input_ids": source_tokens}
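
Since the question was partly about whether dataloader resolves to an installed third-party package or to a directory inside the project, a small diagnostic along these lines may help confirm which file Python would actually load. This is only a sketch using the standard library; the helper name locate_module is hypothetical and not part of the project.

```python
import importlib.util


def locate_module(name):
    """Return where Python would load `name` from, or None if it cannot be found.

    Hypothetical helper for illustration; not part of the repository.
    """
    try:
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError if the parent is missing or not a package.
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin is not None:
        return spec.origin
    # Namespace packages (directories without __init__.py) have no origin.
    return str(list(spec.submodule_search_locations or []))


# Run from the project root. A path inside site-packages for "dataloader"
# would suggest the third-party package is shadowing the local directory.
print(locate_module("dataloader"))
print(locate_module("dataloader.bigquery_pypi"))
```

If the second call prints None, the file dataloader/bigquery_pypi.py is either missing or Python is resolving dataloader to a different location than the project directory.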
