multiwoz v22 is very slow #4446

stephenroller · 2022-03-24T13:01:43Z

Bug description

It takes an extremely long time to load multiwoz v22. With the data already downloaded, the train set takes >200 seconds to get to display_data on my development machine.

There are two reasons for this.

We aren't lazy when loading TOD datasets

As far as I can tell, the TOD teachers load the full dataset into memory before enumerating. I believe this comes from the following line:

ParlAI/parlai/core/tod/tod_agents.py

Line 98 in 942952d

episodes = list(self.setup_episodes(self.fold))

Because we materialize all episodes into a list here, setup_data no longer gets the benefit of DialogTeacher's lazy generator:

ParlAI/parlai/core/tod/tod_agents.py

Line 693 in 942952d

for episode in self.generate_episodes():

Fixing this would make display_data fast on its own, since the second issue would no longer affect it. However, the n_shot handling complicates the fix.
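To illustrate the difference, here is a minimal sketch of eager versus lazy consumption of a generator. The `setup_episodes` function below is a hypothetical stand-in, not ParlAI's actual method, and the `log` argument exists only to count how many episodes were built.

```python
from typing import Iterator, List


def setup_episodes(n: int, log: List[int]) -> Iterator[dict]:
    """Hypothetical stand-in for an expensive per-episode loader."""
    for i in range(n):
        log.append(i)  # record that episode i was actually built
        yield {"id": i}


# Eager: list() forces every episode to be built before the first one is used.
eager_log: List[int] = []
eager = list(setup_episodes(1000, eager_log))
first_eager = eager[0]

# Lazy: only the episodes that are actually consumed get built.
lazy_log: List[int] = []
lazy = setup_episodes(1000, lazy_log)
first_lazy = next(lazy)

print(len(eager_log), len(lazy_log))
```

In the eager case all 1000 episodes are built before the first is available; in the lazy case only one is. This is why something like display_data, which only needs the first few examples, would benefit.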

We are very inefficient in looking up in the multiwoz database

multiwoz v22 has a ton of code to load the database so inform can be computed. After the database is loaded, we need to find entries corresponding to user requests.

We're spending some 92% of our time inside this method:

ParlAI/parlai/tasks/multiwoz_v22/agents.py

Lines 159 to 162 in 942952d

    
    def _get_find_api_response(self, intent, raw_slots, sys_dialog_act):
        """
        Get an API response out of the lookup databases.
        """

In particular, when we select from the database here:

ParlAI/parlai/tasks/multiwoz_v22/agents.py

Lines 196 to 205 in 942952d

    
    find = self.dbs[domain]
    for slot, values in slots.items():
        if slot == "arriveby":
            condition = find[slot] < values[0]
        elif slot == "leaveat":
            condition = find[slot] > values[0]
        else:
            condition = find[slot].isin(values)
        find = find[condition]

The issue is that line 203 performs a fully linear SELECT: we have to explicitly enumerate every row and check whether it matches one of our options. We then repeat this for every slot and value to keep narrowing the selection. 😱

To fix, we would need to build an index of (slot, value)->record_id and select from that (repeatedly reducing the set for the multiple conditions).
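A minimal sketch of such an index, assuming records are plain dicts of slot to value rather than the DataFrames the task actually uses. The names `build_index` and `select` are hypothetical. It covers only the equality case (`isin`); the range conditions on arriveby and leaveat would still need separate handling, for example a sorted list per slot.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]
Index = Dict[Tuple[str, str], Set[int]]


def build_index(records: List[Record]) -> Index:
    """Map each (slot, value) pair to the ids of the records containing it."""
    index: Index = defaultdict(set)
    for record_id, record in enumerate(records):
        for slot, value in record.items():
            index[(slot, value)].add(record_id)
    return index


def select(index: Index, n_records: int, slots: Dict[str, Iterable[str]]) -> Set[int]:
    """Intersect the candidate ids slot by slot instead of scanning every row."""
    matches = set(range(n_records))
    for slot, values in slots.items():
        allowed: Set[int] = set()
        for value in values:
            # .get avoids inserting empty entries into the defaultdict
            allowed |= index.get((slot, value), set())
        matches &= allowed
    return matches


records = [
    {"area": "north", "pricerange": "cheap"},
    {"area": "north", "pricerange": "expensive"},
    {"area": "south", "pricerange": "cheap"},
]
index = build_index(records)
result = select(index, len(records), {"area": ["north"], "pricerange": ["cheap", "moderate"]})
print(result)
```

The index is built once in a single pass over the records. After that, each query costs roughly the size of the matching id sets rather than a full scan per slot.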

Alternatively: if we could just move all this into the build.py and cache it, then we would do it all once the first time the dataset is loaded, and have fast loading forever after.
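A rough sketch of the caching idea, assuming the precomputed results can be pickled. The `load_or_build` helper, the cache file name, and `expensive_build` are all hypothetical, and this does not use ParlAI's actual build.py machinery.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, List


def load_or_build(cache_path: str, build_fn: Callable[[], Any]) -> Any:
    """Return the cached object if present; otherwise build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


build_calls: List[int] = []  # one entry per real (slow) build


def expensive_build() -> dict:
    """Hypothetical stand-in for the slow database lookups."""
    build_calls.append(1)
    return {"restaurant": [0, 2], "hotel": [1]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "api_responses.pkl")
    first = load_or_build(path, expensive_build)   # builds and writes the cache
    second = load_or_build(path, expensive_build)  # reads the cache instead

print(len(build_calls))
```

The slow work runs only on the first call; later loads just read the file. A real version would also need to invalidate the cache whenever the underlying data or lookup code changes.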


github-actions · 2022-04-24T00:09:52Z

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

wangxieric · 2023-04-18T17:49:29Z

@stephenroller Is there any solution to the above issue? I am also suffering from this slowdown.

wangxieric · 2023-04-19T10:14:13Z

@mojtaba-komeili So is this issue addressed by the updated code?

github-actions bot added the stale label Apr 24, 2022

stephenroller added never-stale and removed stale labels Apr 27, 2022

mojtaba-komeili closed this as completed Apr 18, 2023

klshuster added the Help Wanted label Apr 19, 2023

mojtaba-komeili reopened this Apr 19, 2023
