Adding LRGB to the HuggingFace hub #10

clefourrier · 2023-02-08T07:32:50Z

Hi!
@migalkin suggested on Twitter adding your datasets to the HuggingFace hub, which I think is a super cool idea, so I'm opening this issue to see if you need any help with that!

Here is the step-by-step tutorial on how to do so.
Ping me if you need anything in the process 🤗

SauravMaheshkar · 2023-02-13T11:44:28Z

PascalVOC-SP

edge_wt_only_coord, slic = 10 | SauravMaheshkar/voc_superpixels_edge_wt_only_coord_10
edge_wt_only_coord, slic = 30 | SauravMaheshkar/voc_superpixels_edge_wt_only_coord_30

clefourrier · 2023-02-13T11:53:55Z

Amazing!
Do you want me to create an LRGB org on the hub so all datasets can be under the same namespace?

SauravMaheshkar · 2023-02-13T11:55:23Z

Yes thank you, that would be great. I also realized I uploaded .pickle files without processing. Will preprocess and update the dataset.

My HF Username is : SauravMaheshkar

clefourrier · 2023-02-13T12:02:08Z

I did and I added you to it! Once the datasets are correctly processed, feel free to transfer them to the org namespace!

@rampasek @vijaydwivedi75 Would one of you want to be an admin of that?
(If yes I would need your HuggingFace hub username)

SauravMaheshkar · 2023-02-13T13:21:10Z

I pre-processed and added all the PascalVOC datasets to the organization.

vijaydwivedi75 · 2023-02-14T05:48:58Z

Thanks a lot @SauravMaheshkar @clefourrier!

@clefourrier, sure. My username is vijaypradwi.
[I will check the steps linked in the comments above for the HF datasets, as I haven't used them before :')]

clefourrier · 2023-02-14T07:31:57Z

@vijaydwivedi75 Added you as admin!

Feel free to ask any questions you need here :)

SauravMaheshkar · 2023-02-15T10:20:30Z

I pre-processed and added all the COCO-SP datasets to the organization.

SauravMaheshkar · 2023-02-16T23:33:17Z

I pre-processed and added the peptides-functional dataset to the organization.

SauravMaheshkar · 2023-02-16T23:40:31Z

I pre-processed and added the peptides-structural dataset to the organization.

SauravMaheshkar · 2023-02-17T01:57:00Z

I pre-processed and added the PCQM-Contact dataset to the organization.

That's all the datasets done ✅ .

@clefourrier @vijaydwivedi75 can you folks go through the datasets and make sure they look good? Maybe then we can close this issue.

clefourrier · 2023-02-17T07:02:10Z

Thank you very much for your work! I think we're very close to being done, with just two last points:

1. Why do the datasets have a .pt extension?
2. Relatedly, how do you usually load them? To integrate cleanly with datasets, they could benefit from having a loading script; see this doc.

SauravMaheshkar · 2023-02-17T11:18:26Z

I ran the pre-processing scripts on all the datasets and they output *.pt files. I assumed we wanted to upload the pre-processed datasets instead of the raw files, right?
How would you propose we work on the loading script? Are there any pre-existing scripts in the lrgb repository that can be used as a reference?

clefourrier · 2023-02-17T11:26:29Z

Regarding 1, I might be missing context, since I don't know LRGB that well: which preprocessing scripts did you use? (We do usually want the pre-processed datasets.)

Regarding 2, it depends on 1, I'd need to understand better what the preprocessing does to give you a hand :)

SauravMaheshkar · 2023-02-17T11:29:01Z

All the datasets have a process function in their respective classes within the dataset/ dir. I simply ran that script and uploaded the processed datasets obtained as the output.

For 2, I'll refer to @vijaydwivedi75 for more context.

clefourrier · 2023-02-17T11:38:50Z

Ok, that's exactly what I needed, thank you!

I'm in a bit of a rush today, but I'll take the time to look at this in more depth on Monday (CET).

clefourrier · 2023-02-20T11:27:53Z

Hi @SauravMaheshkar !
After talking a bit internally, the simplest way to convert the files is to apply the following to the PyTorch files (it will automatically upload them as Dataset objects with similar properties):

from datasets import Dataset
import torch

# Placeholders: replace with your own values before running
pt_path = "<local path to pt file>"
dataset_name = "<dataset name>"
dataset_split = "<dataset split>"

# A torch dataset is a tuple which first describes the shape of the contents,
# then stores the contents; we want the actual contents
torch_dataset_info, torch_dataset = torch.load(pt_path)
hf_dataset = Dataset.from_dict(torch_dataset)

# This command requires you to be logged in to the hub, and then uploads the dataset automatically
hf_dataset.push_to_hub(f"LRGB/{dataset_name}", split=dataset_split)

I'm very sorry I did not notice earlier that the files were saved as PyTorch objects.
We could also develop loading scripts, but that is not the preferred solution in this case, as it would require people who want to use the datasets to 1) have PyTorch installed and 2) unpickle files on their machines.

clefourrier · 2023-02-21T14:23:49Z

Do you want to split the work, so that you convert half of them and I convert the other half?

SauravMaheshkar · 2023-02-21T14:25:01Z

Sure, thanks a lot! I can take the VOC superpixels and maybe you can take the COCO superpixels.

clefourrier · 2023-02-21T14:25:54Z

Perfect!

SauravMaheshkar · 2023-03-01T10:34:54Z

I ran into the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/saurav/github/data/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5245, in push_to_hub
    repo_info = dataset_infos[next(iter(dataset_infos))]
StopIteration

clefourrier · 2023-03-01T10:57:37Z

Hi @SauravMaheshkar, could you provide the full stack trace of the error, tell me on which dataset this occurs, and maybe print the hf_dataset object?

SauravMaheshkar · 2023-03-01T12:42:28Z

Sadly, that is the entire stack trace (apart from the progress bar):

Pushing dataset shards to the dataset hub: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4609.13it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/saurav/github/data/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5245, in push_to_hub
    repo_info = dataset_infos[next(iter(dataset_infos))]
StopIteration

clefourrier · 2023-03-01T15:10:27Z

Hi again! I pinged the people working on datasets, and your error message allowed us to identify a corner case when pushing to an already existing repo without dataset_info in the YAML tags, so thank you! 🤗

A fix is being merged. Once it's in datasets, you'll just have to update the library and try again, and it should work seamlessly.
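For readers wondering why the traceback ends in a bare StopIteration: calling `next(iter(...))` on an empty mapping raises exactly that exception. The snippet below is only a minimal sketch of this failure mode and of one possible guard; the helper name `first_info` is hypothetical, and this is not the actual fix merged into datasets.

```python
def first_info(dataset_infos):
    """Return the first value of a mapping, or None when the mapping is empty.

    Hypothetical helper for illustration only; not part of the datasets library.
    """
    if not dataset_infos:
        # An empty mapping has no first key, so there is nothing to return.
        return None
    return dataset_infos[next(iter(dataset_infos))]


# Reproduce the failure mode seen in the traceback with an empty mapping.
empty_infos = {}
try:
    empty_infos[next(iter(empty_infos))]
    raised = False
except StopIteration:
    raised = True
```

With the guard, an empty mapping yields `None` instead of an exception, so the caller can decide how to handle a repo that has no dataset_info yet.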

SauravMaheshkar · 2023-03-01T15:12:17Z

Oh great, glad to help ig 😅

clefourrier · 2023-04-14T13:39:33Z

Coming back to this!
@SauravMaheshkar if you want to try again, the conversion script works now 😃

I've converted the coco datasets with datasets 2.11.0, using:

from datasets import Dataset
import torch

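# Placeholders: replace with your own folder and dataset names before running
path_to_your_folder = "<path to your folder>"
dataset_names = ["<your dataset names>"]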

for dataset in dataset_names:
    for split in ["train", "val", "test"]:
        torch_dataset_info, torch_dataset = torch.load(
            f"/{path_to_your_folder}/{dataset}/{split}.pt"
        )
        hf_dataset = Dataset.from_dict(torch_dataset)
        hf_dataset.push_to_hub(f"LRGB/{dataset}", split=split)
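Once a dataset has been pushed this way, it can be read back with `load_dataset` without PyTorch installed. The following is a minimal sketch under assumptions: the helper name `load_lrgb_split` is hypothetical, the repository name is whatever was passed to `push_to_hub`, and the split names are assumed to match those used in the conversion loop above ("train", "val", "test").

```python
def load_lrgb_split(dataset_name, split="train"):
    """Load one split of a converted dataset from the LRGB namespace on the hub.

    Hypothetical helper; `dataset_name` must match the name used in push_to_hub.
    """
    # Imported lazily so that defining the helper does not require `datasets`.
    from datasets import load_dataset

    return load_dataset(f"LRGB/{dataset_name}", split=split)
```

Calling `load_lrgb_split("<dataset name>", split="val")` needs network access and the datasets library, and returns a Dataset object for that split.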

clefourrier · 2023-06-15T09:58:07Z

Hi @SauravMaheshkar, did you have the time to look at this?

lhoestq mentioned this issue Mar 1, 2023

Fix push_to_hub with no dataset_infos huggingface/datasets#5598

Merged
