Multiple call to transforms warm_up() #2317

Open
vince62s opened this issue Feb 16, 2023 · 2 comments

@vince62s
Member

@Zenglinxiao
When you implemented #1912 you added __setstate__ / __getstate__ logic for multiprocessing.

If I am not wrong (and @anderleich / @panosk faced the same issue), here is what is happening:

When using build_vocab.py there is a call to make_transforms() in the main process, and then we spawn n_threads processes. Because we pass the transforms created in main, the pickling/unpickling mechanism triggers another call to warm_up() in __setstate__, hence we could avoid the first call to warm_up() in make_transforms() (a simplified sketch of the mechanism is below).
Even when we use n_threads=1 we spawn another process, so the behavior is the same.
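
A minimal, self-contained sketch of that mechanism, using a dummy stand-in class (not the actual OpenNMT-py transform code):

import multiprocessing as mp

class DummyTransform:
    """Simplified stand-in for an OpenNMT-py transform (hypothetical)."""
    def __init__(self, opts):
        self.opts = opts

    def warm_up(self, vocabs=None):
        # imagine loading a big SentencePiece/BPE model here
        print(f"warm_up() in pid {mp.current_process().pid}")

    def __setstate__(self, state):
        self.__dict__.update(state)
        # unpickling in the spawned worker triggers a second warm_up()
        self.warm_up()

def worker(transform):
    pass  # unpickling the argument has already warmed it up

if __name__ == "__main__":
    t = DummyTransform(opts={})
    t.warm_up()  # first call, in the main process (make_transforms)
    p = mp.get_context("spawn").Process(target=worker, args=(t,))
    p.start()
    p.join()     # passing t to the worker pickles/unpickles it -> warm_up() again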

When we train, the story is a little different.
If we use num_workers=0 the DataLoader is not used, everything happens in the main process, hence calling warm_up() is required somewhere (currently in the make_transforms() of build_dynamic_dataset_iter).
If num_workers>0 then we fall back into the same situation as in build_vocab (see the small check below).
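
For reference, a tiny standalone check of the underlying PyTorch behavior (plain PyTorch, not OpenNMT-py code): with num_workers=0 everything stays in the main process and worker_init_fn is never called, while with num_workers>0 the dataset (and, in the real code, its transforms) is pickled into each spawned worker.

from torch.utils.data import DataLoader, IterableDataset

class Dummy(IterableDataset):
    def __iter__(self):
        yield from range(3)

def init_fn(worker_id):
    print(f"worker_init_fn called in worker {worker_id}")

if __name__ == "__main__":
    # num_workers=0: main process only, init_fn never runs
    list(DataLoader(Dummy(), batch_size=None, num_workers=0, worker_init_fn=init_fn))
    # num_workers=1: one worker is spawned, init_fn runs there and the dataset
    # is pickled/unpickled into that process
    list(DataLoader(Dummy(), batch_size=None, num_workers=1,
                    multiprocessing_context="spawn", worker_init_fn=init_fn))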

What do you think would be the best approach to avoid the double warm_up()? It is quite annoying for transforms that load big files.

cc @francoishernandez

@Zenglinxiao
Contributor

Hi @vince62s,
Sorry for the late reply due to limited bandwidth.

The transforms would warm_up twice in the build_vocab step, but not in the training step; that was already the case before version 3.0. We did not bother optimizing build_vocab as it was not an urgent issue at that time.

The changes introduced in release 3.0, switching to the DataLoader, make things different:

data_loader = DataLoader(data_iter, batch_size=None,
                         pin_memory=True,
                         multiprocessing_context="spawn",
                         num_workers=data_iter.num_workers,
                         worker_init_fn=data_iter._init_datasets,
                         prefetch_factor=opt.prefetch_factor)

The data_iter is already initialized with transform objects. Since num_workers is passed to the DataLoader, it spawns other processes for multiprocessing, and the transform objects are then pickled and unpickled through the __getstate__ and __setstate__ methods.

To mitigate this issue, we should only initialize/warm_up transform objects in the processes where they will be used, to avoid pickling and unpickling.

One way we can try is to delay make_transforms to DynamicDatasetIter._init_datasets:

def _init_datasets(self, worker_id):
    ...

def build_dynamic_dataset_iter(...):
    ...
    # transforms = make_transforms(opt, transforms_cls, vocabs)
    # delay this to DynamicDatasetIter._init_datasets
    corpora = get_corpora(opt, task)
    if corpora is None:
        assert task != CorpusTask.TRAIN, "only valid corpus is ignorable."
        return None
    data_iter = DynamicDatasetIter.from_opt(
        corpora, transforms_cls,  # pass the class rather than the objects
        vocabs, opt, task, copy=copy,
        stride=stride, offset=offset)
    ...
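
A rough sketch of what the delayed call could look like (hypothetical, assuming the iterator has access to opt, transforms_cls and vocabs as attributes):

def _init_datasets(self, worker_id):
    # build (and warm up) the transforms here, in the process that will
    # actually use them, so nothing is pickled across the process boundary
    self.transforms = make_transforms(self.opt, self.transforms_cls, self.vocabs)
    ...  # then build the datasets with self.transforms as before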

@vince62s
Member Author

vince62s commented Feb 27, 2023

I really thought your bandwidth was unlimited :)
thanks for your feedback, appreciated.

There is an issue with your solution: we need access to opt to build the transforms from the class.
The way we build data_iter with its from_opt method makes it difficult (or not very natural) to attach the full opt as an attribute of the DynamicDatasetIter.

Instead, what if we remove the warm_up() from make_transforms() and call it manually only in the case of num_workers=0?
In all other cases, __setstate__ will do the job.
Would that work?
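
Roughly like this (hypothetical sketch, assuming make_transforms() returns a dict of transform objects and warm_up() takes the vocabs):

transforms = make_transforms(opt, transforms_cls, vocabs)  # no warm_up() inside anymore
if opt.num_workers == 0:
    # everything stays in the main process: warm up explicitly here
    for transform in transforms.values():
        transform.warm_up(vocabs)
# with num_workers > 0 each spawned worker warms up via __setstate__ on unpickling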
