Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add audformat.Scheme.replace_labels() #62

Merged
merged 11 commits into from
Apr 28, 2021
Merged

Add audformat.Scheme.replace_labels() #62

merged 11 commits into from
Apr 28, 2021

Conversation

frankenjoe
Copy link
Collaborator

@frankenjoe frankenjoe commented Apr 26, 2021

Relates to #61

Currently, Database.update() fails when two different schemes with same ID are detected. This can be problematic if we have to extend a scheme with new values, e.g. we want to add data from new speakers. This adds Scheme.replace_labels(), which replaces the labels of a scheme. It will look for all columns that reference the scheme and change the dtype accordingly. When labels are removed, values are set to NaN.

Example

Initial database has labels 'a' and 'b'.

db = audformat.testing.create_db(minimal=True)
scheme = audformat.Scheme(labels=['a', 'b'])
db.schemes['scheme'] = scheme
audformat.testing.add_table(db, 'table', 'filewise')
print(scheme)
print(db['table']['scheme'].get())
dtype: str
labels: [a, b]
file
audio/001.wav    a
audio/002.wav    a
audio/003.wav    b
audio/004.wav    b
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['a', 'b']

New database has labels 'b' and 'c'.

db_new = audformat.testing.create_db(minimal=True)
scheme_new = audformat.Scheme(labels=['b', 'c'])
db_new.schemes['scheme'] = scheme_new
audformat.testing.add_table(db_new, 'table', 'filewise')
print(scheme_new)
print(db_new['table']['scheme'].get())
dtype: str
labels: [b, c]
file
audio/001.wav    c
audio/002.wav    b
audio/003.wav    c
audio/004.wav    c
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

Update fails because schemes do not match.

try:
    db.update(db_new, overwrite=True)
except Exception as ex:
    print(ex)
Cannot update database, found different value for 'db.schemes['scheme']':
dtype: str
labels: [a, b]
!=
dtype: str
labels: [b, c]

Now we replace the labels in original database to 'b' and 'c', this will set 'a' to Nan.

scheme.replace_labels(scheme_new.labels)
print(scheme)
print(db['table']['scheme'].get())
dtype: str
labels: [b, c]
file
audio/001.wav    NaN
audio/002.wav    NaN
audio/003.wav      b
audio/004.wav      b
audio/005.wav      b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

Now we can update the database.

db.update(db_new, overwrite=True)
print(db['table']['scheme'].get())
file
audio/001.wav    c
audio/002.wav    b
audio/003.wav    c
audio/004.wav    c
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

@codecov
Copy link

codecov bot commented Apr 26, 2021

Codecov Report

Merging #62 (ad4d071) into master (9222dbd) will not change coverage.
The diff coverage is 100.0%.

Impacted Files Coverage Δ
audformat/core/database.py 100.0% <100.0%> (ø)
audformat/core/scheme.py 100.0% <100.0%> (ø)

@frankenjoe frankenjoe requested a review from hagenw April 26, 2021 19:33
@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

I think your solution should be the right way to proceed.

There is one thing missing, your .update_labels() method does already force an overwrite, but this might not always be desired as we might also want to merge the two schemes, e.g.

db = audformat.testing.create_db(minimal=True)
scheme = audformat.Scheme(labels=['a', 'b'])
db.schemes['scheme'] = scheme
audformat.testing.add_table(db, 'table1', 'filewise')
print(scheme)
print(db['table1']['scheme'].get())
dtype: str                                           
labels: [a, b]                                                    
file                                                                
audio/001.wav    a                                                 
audio/002.wav    b
audio/003.wav    b                                                                   
audio/004.wav    b
audio/005.wav    b
Name: scheme, dtype: category
Categories (2, object): ['a', 'b']
db_new = audformat.testing.create_db(minimal=True)
scheme_new = audformat.Scheme(labels=['b', 'c'])
db_new.schemes['scheme'] = scheme_new
audformat.testing.add_table(db_new, 'table2', 'filewise')
print(scheme_new)
print(db_new['table2']['scheme'].get())
dtype: str
labels: [b, c]
file
audio/001.wav    b
audio/002.wav    b
audio/003.wav    b
audio/004.wav    b
audio/005.wav    c
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

Merging those two databases work, but does not provide the desired result:

scheme.update_labels(scheme_new.labels)
print(scheme)
dtype: str               
labels: [b, c]

Here we would have needed the following result:

dtype: str               
labels: [a, b, c]

Because otherwise we get:

db.update(db_new, overwrite=True)
print(db['table1']['scheme'].get())
file
audio/001.wav    NaN
audio/002.wav      b
audio/003.wav      b
audio/004.wav      b
audio/005.wav      b
Name: scheme, dtype: category
Categories (2, object): ['b', 'c']

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

The same holds of cause if you want to merge the scheme for the same table, I just used two tables as it is easier with audformat.testing.add_table()

@frankenjoe
Copy link
Collaborator Author

frankenjoe commented Apr 27, 2021

There is one thing missing, your .update_labels() method does already force an overwrite

Ok, so by default we should have a union of the labels, right? Do we actually have to support the use-case to overwrite the labels at all then?

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

There is one thing missing, your .update_labels() method does already force an overwrite

Ok, so by default we should have a union of the labels, right? Do we actually have to support the use-case to overwrite the labels at all then?

I think it might not be necessary, but could be helpful. In your example above the goal would most likely be to remove the 'a' label from the scheme as it is no longer used in the updated table.

@frankenjoe
Copy link
Collaborator Author

I think it might not be necessary, but could be helpful.

Ok, should we add a overwrite argument and set it to False by default?

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

I think it might not be necessary, but could be helpful.

Ok, should we add a overwrite argument and set it to False by default?

I'm not sure if we can cover everything with one argument.
For example imagine the following example:

scheme1 = audformat.Scheme(labels={0: 'a', 1: 'b'})
scheme2 = audformat.Scheme(labels={1: 'c', 2: 'd'})

If we now use overwrite=True it is not clear to me what the result should be:

audformat.Scheme(labels={0: 'a', 1: 'c', 2: 'd'})

or

audformat.Scheme(labels={1: 'c', 2: 'd'})

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

I think it might not be necessary, but could be helpful.

Ok, should we add a overwrite argument and set it to False by default?

I'm not sure if we can cover everything with one argument.
For example imagine the following example:

scheme1 = audformat.Scheme(labels={0: 'a', 1: 'b'})
scheme2 = audformat.Scheme(labels={1: 'c', 2: 'd'})

If we now use overwrite=True it is not clear to me what the result should be:

audformat.Scheme(labels={0: 'a', 1: 'c', 2: 'd'})

or

audformat.Scheme(labels={1: 'c', 2: 'd'})

The safe solution would maybe be:

audformat.Scheme(labels={0: 'a', 1: 'c', 2: 'd'})

as it doesn't seem harmful to have additional entries in the scheme.

@frankenjoe
Copy link
Collaborator Author

as it doesn't seem harmful to have additional entries in the scheme.

Ok, so with overwrite=False, {0: 'a', 1: 'b'} + {1: 'c', 2: 'd'} will raise an error, with overwrite=True the result is {0: 'a', 1: 'c', 2: 'd'}. And at least for now we do for the support the use-case where labels are dropped. Do you agree?

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

as it doesn't seem harmful to have additional entries in the scheme.

Ok, so with overwrite=False, {0: 'a', 1: 'b'} + {1: 'c', 2: 'd'} will raise an error, with overwrite=True the result is {0: 'a', 1: 'c', 2: 'd'}. And at least for now we do for the support the use-case where labels are dropped. Do you agree?

I guess you mean:

And at least for now we do not support the use-case where labels are dropped

I agree.

There might be a way to decide if we should drop a label:
we could check if a label was used before in the database; if yes and the same label is no longer used in the updated database we might remove it.

But I'm still not sure if this will result in the desired behavior all the time. I guess it is better to not drop any label for now, and require manual action for that by the user.
BTW, is there a way for the user to get rid of an entry?
E.g., does the following work:

db.schemes['my-scheme'].labels.pop(0)

I guess not as it does not update the tables?

So, we might need to add another method that allows a user to remove a label then.

@frankenjoe
Copy link
Collaborator Author

I guess not as it does not update the tables?

No. And there are a lot of bad things a user can do at the moment, e.g. setting scheme.dtype to a different value. Bad! If we really want to avoid that we should make dtype and labels properties (and in the later case returning a copy:).

@frankenjoe
Copy link
Collaborator Author

frankenjoe commented Apr 27, 2021

we could check if a label was used before in the database; if yes and the same label is no longer used in the updated database we might remove it.

I think that is getting too complicated. If we want to give the user the option to remove labels, let's introduce drop_labels in addition to update_labels.

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

we could check if a label was used before in the database; if yes and the same label is no longer used in the updated database we might remove it.

I think that is getting too complicated. If we want to give the user the option to remove labels, let's introduce drop_labels in addition to update_labels.

Yes, I agree and would vote for this. We should do it in a follow up merge request then.

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

I guess not as it does not update the tables?

No. And there are a lot of bad things a user can do at the moment, e.g. setting scheme.dtype to a different value. Bad! If we really want to avoid that we should make dtype and labels properties (and in the later case returning a copy:).

You are right, we should make this more safe in another pull request as well. I created #63 to track this issue.

@frankenjoe
Copy link
Collaborator Author

Actually, I wonder if we should simply stick to the current implementation but rename update_labels() to replace_labels(). And then add update_labels() and maybe drop_labels() in another PR. Both we can easily implement based on replace_labels().

@frankenjoe frankenjoe changed the title Update scheme labels Replace scheme labels Apr 27, 2021
@frankenjoe
Copy link
Collaborator Author

Actually, I wonder if we should simply stick to the current implementation but rename update_labels() to replace_labels(). And then add update_labels() and maybe drop_labels() in another PR. Both we can easily implement based on replace_labels().

Ok, renamed. Please continue to review.

@hagenw
Copy link
Member

hagenw commented Apr 27, 2021

Actually, I wonder if we should simply stick to the current implementation but rename update_labels() to replace_labels(). And then add update_labels() and maybe drop_labels() in another PR. Both we can easily implement based on replace_labels().

Ok, renamed. Please continue to review.

Having #63 in mind I'm not completely sure about replace_labels(), or if we should just make it a setter function, so instead of:

db.schemes['my-scheme'].replace_labels(my_labels)

you do

db.schemes['my-scheme'] = my_labels

But I would propose we just continue with your current solution, but don't make a new release until we added the update method and figured out how to handle #63

@frankenjoe
Copy link
Collaborator Author

Having #63 in mind I'm not completely sure about replace_labels(), or if we should just make it a setter function

I would argue that replace_labels() better expresses that you can only use it on a Scheme that already has labels. With a setter function it feels like you can always use it.

@frankenjoe
Copy link
Collaborator Author

until we added the update method

I just realized that one disadvantage of the update function will be that it will not work with the example in the description. The problem is if db has labels a, b and db_new has b, c after calling update_labels() db has a, b, c, which is again not compatible to db_new, so db.update(db_new) will still fail. Which was the original use-case we had in mind :)

@hagenw
Copy link
Member

hagenw commented Apr 28, 2021

Having #63 in mind I'm not completely sure about replace_labels(), or if we should just make it a setter function

I would argue that replace_labels() better expresses that you can only use it on a Scheme that already has labels. With a setter function it feels like you can always use it.

OK, then we just stay with replace_labels() for now.

@hagenw
Copy link
Member

hagenw commented Apr 28, 2021

until we added the update method

I just realized that one disadvantage of the update function will be that it will not work with the example in the description. The problem is if db has labels a, b and db_new has b, c after calling update_labels() db has a, b, c, which is again not compatible to db_new, so db.update(db_new) will still fail. Which was the original use-case we had in mind :)

Is there some workaround, or does that mean that db.update() has to call scheme.update_labels() internally?

@frankenjoe
Copy link
Collaborator Author

Is there some workaround, or does that mean that db.update() has to call scheme.update_labels() internally?

I guess the following will work:

joined_labels = set(scheme.labels + new_scheme.labels)  # in reality it's a bit more complicated since labels can also be a dict
scheme.replace_labels(joined_labels)
new_scheme.replace_labels(joined_labels)
db.update(new_db)

I guess we should add an option to Database.update() that does this automatically and the user only has do to do:

db.update(new_db, join_scheme_labels=True) 

@hagenw
Copy link
Member

hagenw commented Apr 28, 2021

I guess we should add an option to Database.update() that does this automatically and the user only has do to do:

db.update(new_db, join_scheme_labels=True) 

Agree

@hagenw
Copy link
Member

hagenw commented Apr 28, 2021

To summarize:
We need Scheme.replace_labels(), but also code on top of it which we will implement in other pull request.

Which means this one is finished and I can take a final look, correct?

@frankenjoe
Copy link
Collaborator Author

Which means this one is finished and I can take a final look, correct?

yep!

audformat/core/scheme.py Outdated Show resolved Hide resolved
audformat/core/scheme.py Outdated Show resolved Hide resolved
audformat/core/scheme.py Outdated Show resolved Hide resolved
@hagenw hagenw changed the title Replace scheme labels Add audformat.Scheme.replace_labels() Apr 28, 2021
@hagenw hagenw merged commit 091fa56 into master Apr 28, 2021
@hagenw hagenw deleted the update-scheme-labels branch April 28, 2021 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants