
emotion recognition webdata #102

Open
jiachengc opened this issue Sep 12, 2024 · 5 comments

Comments

@jiachengc

Thanks for this solid work. Have you released any preprocessed emotion recognition web datasets, such as RAVDESS or CREMA-D, or any data processing scripts so we can process the data ourselves? @knoriy @YuchenHui22314

@YuchenHui22314
Collaborator

Thanks for your comment, but as far as I know, we did not use emotion recognition datasets in the end. Good luck with your research!

@jiachengc
Author

> Thanks for your comment, but as far as I know, we did not use emotion recognition datasets in the end. Good luck with your research!

Thank you for your quick reply. I would like to ask a quick question: if my dataset is an audio emotion recognition dataset, such as TESS, then when I build the corresponding webdata, should I rewrite the 'text' field to contain the emotion label of the audio instead of an audio caption? For example:
```json
{
  "text": ["happy"],
  "tag": ["happy"],
  "original_data": {
    "title": "TESS - Toronto Emotional Speech Set",
    "description": "Dataset for emotion recognition from audio",
    "license": "TESS dataset license",
    "fname": "OAF_back_happy.flac",
    "category": "happy"
  }
}
```
I would do so in order to allow the model to output the corresponding emotion predictions. Looking forward to your reply. Thanks in advance! @YuchenHui22314
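To make the question concrete, here is a minimal stdlib sketch of packing such samples into a WebDataset-style tar shard (one `<key>.flac` plus one `<key>.json` per sample key). `write_emotion_shard` is a hypothetical helper, not part of the repo; it only illustrates the metadata layout asked about above.

```python
import io
import json
import tarfile


def write_emotion_shard(shard_path, samples):
    """Write (key, audio_bytes, emotion_label) triples as a WebDataset-style
    tar shard: each sample becomes a <key>.flac and a <key>.json member."""
    with tarfile.open(shard_path, "w") as tar:
        for key, audio_bytes, label in samples:
            # Emotion label stands in for the caption in the "text" field.
            meta = {
                "text": [label],
                "tag": [label],
                "original_data": {
                    "title": "TESS - Toronto Emotional Speech Set",
                    "fname": f"{key}.flac",
                    "category": label,
                },
            }
            for name, payload in (
                (f"{key}.flac", audio_bytes),
                (f"{key}.json", json.dumps(meta).encode("utf-8")),
            ):
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
```

Loading the shard back with the `webdataset` library should then yield the `.json` metadata alongside the audio for each key.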

@YuchenHui22314
Collaborator

Then the "text" should be a complete sentence, e.g. ["this is a happy sound"]. So you may want to come up with a way to construct a sentence from the emotion labels.
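One simple way to construct such sentences is a small templating function; this is a sketch of my own (`label_to_caption` is a hypothetical name, not from the repo):

```python
def label_to_caption(label: str) -> str:
    """Turn an emotion label like "happy" into a short caption sentence,
    choosing "a"/"an" based on the label's first letter."""
    article = "an" if label[:1].lower() in "aeiou" else "a"
    return f"this is {article} {label} sound"
```

For example, `label_to_caption("happy")` gives `"this is a happy sound"` and `label_to_caption("angry")` gives `"this is an angry sound"`.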

@jiachengc
Author

> Then the "text" should be a complete sentence, e.g. ["this is a happy sound"]. So you may want to come up with a way to construct a sentence from the emotion labels.

Thank you for your quick reply, I really appreciate it. I followed your suggestion and changed the text to captions like ['this is a happy sound'], and then used eval_linear_probe.py to fine-tune the last linear layer of the audio encoder. However, the results are quite poor on the IEMOCAP dataset, with an accuracy of around 55%. So far, I've tried a range of learning rates: [1e-2, 1e-3, 1e-4, 1e-5], weight decay values: [0.1, 0.01, 0.001, 0.001], and linear probe losses: [ce, mse]. None of these combinations has achieved accuracy beyond 55%. I'm feeling a bit lost about the next debugging direction and would greatly appreciate any suggestions you might have. Thanks in advance again! @YuchenHui22314
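For reference, a linear probe with a cross-entropy loss over frozen embeddings can be sanity-checked outside the repo's training code. This is a minimal NumPy sketch (not the eval_linear_probe.py implementation; the data here is synthetic) that sweeps the same learning-rate/weight-decay grid mentioned above and picks the best validation accuracy:

```python
import numpy as np


def train_linear_probe(X, y, n_classes, lr, weight_decay, epochs=300, seed=0):
    """Softmax linear probe trained with cross-entropy + L2 weight decay
    by full-batch gradient descent on frozen embeddings X."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                        # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / len(X)                  # d(mean CE)/d(logits)
        W -= lr * (X.T @ grad + weight_decay * W)
        b -= lr * grad.sum(axis=0)
    return W, b


def accuracy(W, b, X, y):
    return float(((X @ W + b).argmax(axis=1) == y).mean())


# Synthetic, well-separated 4-class embeddings as a stand-in for real features.
rng = np.random.default_rng(1)
n_classes, dim = 4, 32
centers = rng.normal(scale=4.0, size=(n_classes, dim))
ytr = rng.integers(0, n_classes, size=400)
yva = rng.integers(0, n_classes, size=200)
Xtr = centers[ytr] + rng.normal(size=(400, dim))
Xva = centers[yva] + rng.normal(size=(200, dim))

# Sweep the grid from the comment above and keep the best validation score.
best = max(
    (
        (lr, wd, accuracy(*train_linear_probe(Xtr, ytr, n_classes, lr, wd), Xva, yva))
        for lr in [1e-2, 1e-3, 1e-4, 1e-5]
        for wd in [0.1, 0.01, 0.001]
    ),
    key=lambda t: t[2],
)
print(f"best lr={best[0]}, wd={best[1]}, val acc={best[2]:.3f}")
```

If a probe like this also plateaus on real embeddings, the bottleneck is usually the frozen features rather than the probe hyperparameters, which is consistent with the suggestion below to check whether the pretraining setup matches the task.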

@YuchenHui22314
Collaborator

Oh ok. So you are doing supervised classification instead of contrastive pretraining. I thought that you wanted to add an emotion dataset as part of the pretraining data. "text" should be a sentence only in the pretraining process; when it comes to supervised classification, I am not familiar with the eval_linear_probe.py code. Maybe you could reach out to Ke Chen on this!
