Add audformat.Scheme.replace_labels() #62

frankenjoe · 2021-04-26T19:29:59Z

Relates to #61

Currently, Database.update() fails when two different schemes with same ID are detected. This can be problematic if we have to extend a scheme with new values, e.g. we want to add data from new speakers. This adds Scheme.replace_labels(), which replaces the labels of a scheme. It will look for all columns that reference the scheme and change the dtype accordingly. When labels are removed, values are set to NaN.

Example

Initial database has labels 'a' and 'b'.

db = audformat.testing.create_db(minimal=True)
scheme = audformat.Scheme(labels=['a', 'b'])
db.schemes['scheme'] = scheme
audformat.testing.add_table(db, 'table', 'filewise')
print(scheme)
print(db['table']['scheme'].get())

dtype: str
labels: [a, b]
file
audio/001.wav    a
audio/002.wav    a
audio/003.wav    b
audio/004.wav    b
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['a', 'b']

New database has labels 'b' and 'c'.

db_new = audformat.testing.create_db(minimal=True)
scheme_new = audformat.Scheme(labels=['b', 'c'])
db_new.schemes['scheme'] = scheme_new
audformat.testing.add_table(db_new, 'table', 'filewise')
print(scheme_new)
print(db_new['table']['scheme'].get())

dtype: str
labels: [b, c]
file
audio/001.wav    c
audio/002.wav    b
audio/003.wav    c
audio/004.wav    c
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

Update fails because schemes do not match.

try:
    db.update(db_new, overwrite=True)
except Exception as ex:
    print(ex)

Cannot update database, found different value for 'db.schemes['scheme']':
dtype: str
labels: [a, b]
!=
dtype: str
labels: [b, c]

Now we replace the labels in original database to 'b' and 'c', this will set 'a' to Nan.

scheme.replace_labels(scheme_new.labels)
print(scheme)
print(db['table']['scheme'].get())

dtype: str
labels: [b, c]
file
audio/001.wav    NaN
audio/002.wav    NaN
audio/003.wav      b
audio/004.wav      b
audio/005.wav      b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

Now we can update the database.

db.update(db_new, overwrite=True)
print(db['table']['scheme'].get())

file
audio/001.wav    c
audio/002.wav    b
audio/003.wav    c
audio/004.wav    c
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

codecov · 2021-04-26T19:32:55Z

Codecov Report

Merging #62 (ad4d071) into master (9222dbd) will not change coverage.
The diff coverage is 100.0%.

Impacted Files	Coverage Δ
audformat/core/database.py	`100.0% <100.0%> (ø)`
audformat/core/scheme.py	`100.0% <100.0%> (ø)`

hagenw · 2021-04-27T07:05:17Z

I think your solution should be the right way to proceed.

There is one thing missing, your .update_labels() method does already force an overwrite, but this might not always be desired as we might also want to merge the two schemes, e.g.

db = audformat.testing.create_db(minimal=True)
scheme = audformat.Scheme(labels=['a', 'b'])
db.schemes['scheme'] = scheme
audformat.testing.add_table(db, 'table1', 'filewise')
print(scheme)
print(db['table1']['scheme'].get())

dtype: str                                           
labels: [a, b]                                                    
file                                                                
audio/001.wav    a                                                 
audio/002.wav    b
audio/003.wav    b                                                                   
audio/004.wav    b
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['a', 'b']

db_new = audformat.testing.create_db(minimal=True)
scheme_new = audformat.Scheme(labels=['b', 'c'])
db_new.schemes['scheme'] = scheme_new
audformat.testing.add_table(db_new, 'table2', 'filewise')
print(scheme_new)
print(db_new['table2']['scheme'].get())

dtype: str
labels: [b, c]
file
audio/001.wav    b
audio/002.wav    b
audio/003.wav    b
audio/004.wav    b
audio/005.wav    c
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

Merging those two databases work, but does not provide the desired result:

scheme.update_labels(scheme_new.labels)
print(scheme)

dtype: str               
labels: [b, c]

Here we would have needed the following result:

dtype: str               
labels: [a, b, c]

Because otherwise we get:

db.update(db_new, overwrite=True)
print(db['table1']['scheme'].get())

file
audio/001.wav    NaN
audio/002.wav      b
audio/003.wav      b
audio/004.wav      b
audio/005.wav      b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

hagenw · 2021-04-27T07:06:05Z

The same holds of cause if you want to merge the scheme for the same table, I just used two tables as it is easier with audformat.testing.add_table()

frankenjoe · 2021-04-27T09:47:35Z

There is one thing missing, your .update_labels() method does already force an overwrite

Ok, so by default we should have a union of the labels, right? Do we actually have to support the use-case to overwrite the labels at all then?

hagenw · 2021-04-27T09:52:16Z

There is one thing missing, your .update_labels() method does already force an overwrite

Ok, so by default we should have a union of the labels, right? Do we actually have to support the use-case to overwrite the labels at all then?

I think it might not be necessary, but could be helpful. In your example above the goal would most likely be to remove the 'a' label from the scheme as it is no longer used in the updated table.

frankenjoe · 2021-04-27T09:53:41Z

I think it might not be necessary, but could be helpful.

Ok, should we add a overwrite argument and set it to False by default?

hagenw · 2021-04-27T09:57:32Z

I think it might not be necessary, but could be helpful.

Ok, should we add a overwrite argument and set it to False by default?

I'm not sure if we can cover everything with one argument.
For example imagine the following example:

scheme1 = audformat.Scheme(labels={0: 'a', 1: 'b'})
scheme2 = audformat.Scheme(labels={1: 'c', 2: 'd'})

If we now use overwrite=True it is not clear to me what the result should be:

audformat.Scheme(labels={0: 'a', 1: 'c', 2: 'd'})

or

audformat.Scheme(labels={1: 'c', 2: 'd'})

hagenw · 2021-04-27T09:58:51Z

I think it might not be necessary, but could be helpful.

Ok, should we add a overwrite argument and set it to False by default?

I'm not sure if we can cover everything with one argument.
For example imagine the following example:
scheme1 = audformat.Scheme(labels={0: 'a', 1: 'b'})
scheme2 = audformat.Scheme(labels={1: 'c', 2: 'd'})
If we now use overwrite=True it is not clear to me what the result should be:
audformat.Scheme(labels={0: 'a', 1: 'c', 2: 'd'})
or
audformat.Scheme(labels={1: 'c', 2: 'd'})

The safe solution would maybe be:

audformat.Scheme(labels={0: 'a', 1: 'c', 2: 'd'})

as it doesn't seem harmful to have additional entries in the scheme.

frankenjoe · 2021-04-27T10:08:52Z

as it doesn't seem harmful to have additional entries in the scheme.

Ok, so with overwrite=False, {0: 'a', 1: 'b'} + {1: 'c', 2: 'd'} will raise an error, with overwrite=True the result is {0: 'a', 1: 'c', 2: 'd'}. And at least for now we do for the support the use-case where labels are dropped. Do you agree?

hagenw · 2021-04-27T10:18:59Z

as it doesn't seem harmful to have additional entries in the scheme.

Ok, so with overwrite=False, {0: 'a', 1: 'b'} + {1: 'c', 2: 'd'} will raise an error, with overwrite=True the result is {0: 'a', 1: 'c', 2: 'd'}. And at least for now we do for the support the use-case where labels are dropped. Do you agree?

I guess you mean:

And at least for now we do not support the use-case where labels are dropped

I agree.

There might be a way to decide if we should drop a label:
we could check if a label was used before in the database; if yes and the same label is no longer used in the updated database we might remove it.

But I'm still not sure if this will result in the desired behavior all the time. I guess it is better to not drop any label for now, and require manual action for that by the user.
BTW, is there a way for the user to get rid of an entry?
E.g., does the following work:

db.schemes['my-scheme'].labels.pop(0)

I guess not as it does not update the tables?

So, we might need to add another method that allows a user to remove a label then.

frankenjoe · 2021-04-27T10:28:24Z

I guess not as it does not update the tables?

No. And there are a lot of bad things a user can do at the moment, e.g. setting scheme.dtype to a different value. Bad! If we really want to avoid that we should make dtype and labels properties (and in the later case returning a copy:).

frankenjoe · 2021-04-27T10:29:25Z

we could check if a label was used before in the database; if yes and the same label is no longer used in the updated database we might remove it.

I think that is getting too complicated. If we want to give the user the option to remove labels, let's introduce drop_labels in addition to update_labels.

hagenw · 2021-04-27T10:32:20Z

we could check if a label was used before in the database; if yes and the same label is no longer used in the updated database we might remove it.

I think that is getting too complicated. If we want to give the user the option to remove labels, let's introduce drop_labels in addition to update_labels.

Yes, I agree and would vote for this. We should do it in a follow up merge request then.

hagenw · 2021-04-27T10:35:51Z

I guess not as it does not update the tables?

No. And there are a lot of bad things a user can do at the moment, e.g. setting scheme.dtype to a different value. Bad! If we really want to avoid that we should make dtype and labels properties (and in the later case returning a copy:).

You are right, we should make this more safe in another pull request as well. I created #63 to track this issue.

frankenjoe · 2021-04-27T10:54:32Z

Actually, I wonder if we should simply stick to the current implementation but rename update_labels() to replace_labels(). And then add update_labels() and maybe drop_labels() in another PR. Both we can easily implement based on replace_labels().

frankenjoe · 2021-04-27T12:17:43Z

Actually, I wonder if we should simply stick to the current implementation but rename update_labels() to replace_labels(). And then add update_labels() and maybe drop_labels() in another PR. Both we can easily implement based on replace_labels().

Ok, renamed. Please continue to review.

hagenw · 2021-04-27T12:21:04Z

Actually, I wonder if we should simply stick to the current implementation but rename update_labels() to replace_labels(). And then add update_labels() and maybe drop_labels() in another PR. Both we can easily implement based on replace_labels().

Ok, renamed. Please continue to review.

Having #63 in mind I'm not completely sure about replace_labels(), or if we should just make it a setter function, so instead of:

db.schemes['my-scheme'].replace_labels(my_labels)

you do

db.schemes['my-scheme'] = my_labels

But I would propose we just continue with your current solution, but don't make a new release until we added the update method and figured out how to handle #63

frankenjoe · 2021-04-27T16:26:46Z

Having #63 in mind I'm not completely sure about replace_labels(), or if we should just make it a setter function

I would argue that replace_labels() better expresses that you can only use it on a Scheme that already has labels. With a setter function it feels like you can always use it.

frankenjoe · 2021-04-27T16:30:05Z

until we added the update method

I just realized that one disadvantage of the update function will be that it will not work with the example in the description. The problem is if db has labels a, b and db_new has b, c after calling update_labels() db has a, b, c, which is again not compatible to db_new, so db.update(db_new) will still fail. Which was the original use-case we had in mind :)

hagenw · 2021-04-28T06:54:37Z

Having #63 in mind I'm not completely sure about replace_labels(), or if we should just make it a setter function

I would argue that replace_labels() better expresses that you can only use it on a Scheme that already has labels. With a setter function it feels like you can always use it.

OK, then we just stay with replace_labels() for now.

hagenw · 2021-04-28T06:55:55Z

until we added the update method

I just realized that one disadvantage of the update function will be that it will not work with the example in the description. The problem is if db has labels a, b and db_new has b, c after calling update_labels() db has a, b, c, which is again not compatible to db_new, so db.update(db_new) will still fail. Which was the original use-case we had in mind :)

Is there some workaround, or does that mean that db.update() has to call scheme.update_labels() internally?

frankenjoe · 2021-04-28T07:20:10Z

Is there some workaround, or does that mean that db.update() has to call scheme.update_labels() internally?

I guess the following will work:

joined_labels = set(scheme.labels + new_scheme.labels)  # in reality it's a bit more complicated since labels can also be a dict
scheme.replace_labels(joined_labels)
new_scheme.replace_labels(joined_labels)
db.update(new_db)

I guess we should add an option to Database.update() that does this automatically and the user only has do to do:

db.update(new_db, join_scheme_labels=True)

hagenw · 2021-04-28T07:22:21Z

I guess we should add an option to Database.update() that does this automatically and the user only has do to do:
db.update(new_db, join_scheme_labels=True) 

Agree

hagenw · 2021-04-28T07:23:57Z

To summarize:
We need Scheme.replace_labels(), but also code on top of it which we will implement in other pull request.

Which means this one is finished and I can take a final look, correct?

frankenjoe · 2021-04-28T07:28:04Z

Which means this one is finished and I can take a final look, correct?

yep!

audformat/core/scheme.py

Co-authored-by: Hagen Wierstorf <[email protected]>

frankenjoe added 3 commits April 26, 2021 20:32

Scheme: store reference to database

45bb6b7

TST: test Scheme.db

7e59477

Scheme: add update_labels

b96d8bd

frankenjoe requested a review from hagenw April 26, 2021 19:33

frankenjoe added 3 commits April 26, 2021 21:41

DOC: add docstring

5775cdd

DOC: docstring example

28dce56

raise error if dtype of new labels does not match dtype of scheme

d627dfb

Scheme: rename to replace_labels()

aca9fb8

frankenjoe changed the title ~~Update scheme labels~~ Replace scheme labels Apr 27, 2021

hagenw reviewed Apr 28, 2021

View reviewed changes

audformat/core/scheme.py Outdated Show resolved Hide resolved

audformat/core/scheme.py Outdated Show resolved Hide resolved

audformat/core/scheme.py Outdated Show resolved Hide resolved

frankenjoe and others added 4 commits April 28, 2021 14:13

Update audformat/core/scheme.py

491eebe

Co-authored-by: Hagen Wierstorf <[email protected]>

Update audformat/core/scheme.py

7158484

Co-authored-by: Hagen Wierstorf <[email protected]>

DOC: fix long line

4822edb

Scheme: remove db attribute

ad4d071

hagenw changed the title ~~Replace scheme labels~~ Add audformat.Scheme.replace_labels() Apr 28, 2021

hagenw merged commit 091fa56 into master Apr 28, 2021

hagenw deleted the update-scheme-labels branch April 28, 2021 12:30

hagenw mentioned this pull request Apr 29, 2021

Add audformat.utils.join_labels() #66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add audformat.Scheme.replace_labels() #62

Add audformat.Scheme.replace_labels() #62

frankenjoe commented Apr 26, 2021 •

edited

Loading

codecov bot commented Apr 26, 2021 •

edited

Loading

hagenw commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021 •

edited

Loading

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

frankenjoe commented Apr 27, 2021 •

edited

Loading

hagenw commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 28, 2021 •

edited

Loading

hagenw commented Apr 28, 2021

frankenjoe commented Apr 28, 2021

hagenw commented Apr 28, 2021

hagenw commented Apr 28, 2021

frankenjoe commented Apr 28, 2021

Add audformat.Scheme.replace_labels() #62

Add audformat.Scheme.replace_labels() #62

Conversation

frankenjoe commented Apr 26, 2021 • edited Loading

Example

codecov bot commented Apr 26, 2021 • edited Loading

Codecov Report

hagenw commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021 • edited Loading

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

frankenjoe commented Apr 27, 2021 • edited Loading

hagenw commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

frankenjoe commented Apr 27, 2021

hagenw commented Apr 28, 2021 • edited Loading

hagenw commented Apr 28, 2021

frankenjoe commented Apr 28, 2021

hagenw commented Apr 28, 2021

hagenw commented Apr 28, 2021

frankenjoe commented Apr 28, 2021

frankenjoe commented Apr 26, 2021 •

edited

Loading

codecov bot commented Apr 26, 2021 •

edited

Loading

frankenjoe commented Apr 27, 2021 •

edited

Loading

frankenjoe commented Apr 27, 2021 •

edited

Loading

hagenw commented Apr 28, 2021 •

edited

Loading