Streamline/Optimize Orca Active Learning tool (OrcaAL) #30
Comments
Link to OrcaAL repository / home page looks incorrect. Is it https://github.com/orcasound/orcaal/?
Hi @valentina-s @scottveirs @yosoyjay, I am in the midst of writing my proposal and am deciding on which steps to focus on. I wanted to clarify a few matters:
I read that the ML API and endpoints are currently containerized and hosted on AWS LightSail. Can I clarify whether the model currently trains and predicts on a CPU, since from my research LightSail does not seem to support GPU instances? Also, is there currently any form of container orchestration used in AWS, e.g. ECS/EKS?
I understand that a PostgreSQL database is currently used to store annotations. However, I'm not sure where it is currently hosted; is it on AWS LightSail as well? Also, can I clarify the reason for wanting to switch over to a cloud-hosted database? Would that mean using a solution such as AWS RDS instead?
How is the acquisition of new unlabeled data currently handled? I don't seem to be able to find much info about it. Is it currently done manually, with a new set of data generated by the
Can I clarify if this refers to examples which the human annotators themselves are unsure of? If that is the case, I suppose that a separate database model (perhaps keeping track of the number of times an example is skipped, so that this can be a key used for querying) for storing these uncertain examples is required?
Would a plausible distance-based querying strategy be to use some distance measure (e.g. Manhattan or Euclidean) to find the unlabeled examples furthest from the means of both the 'orca' and 'non-orca' embeddings, and treat those as the most 'uncertain' examples?
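For concreteness, here is a minimal sketch of that idea, assuming the embeddings are already available as NumPy arrays (the array names, shapes, and the selection size are assumptions for illustration, not the actual OrcaAL code):

```python
import numpy as np

def furthest_from_class_means(unlabeled_emb, orca_emb, non_orca_emb,
                              n_query=20, metric="euclidean"):
    """Rank unlabeled embeddings by their distance from both class means.

    The intuition: examples far from the 'orca' mean *and* far from the
    'non-orca' mean are poorly represented by either class, so we treat
    them as the most 'uncertain' and queue them for annotation first.
    """
    orca_mean = orca_emb.mean(axis=0)
    non_orca_mean = non_orca_emb.mean(axis=0)

    if metric == "euclidean":
        dist = lambda x, m: np.linalg.norm(x - m, axis=1)
    else:  # Manhattan / L1
        dist = lambda x, m: np.abs(x - m).sum(axis=1)

    # An example's score is its distance to the *nearest* class mean;
    # a large score means it is far from both class prototypes.
    score = np.minimum(dist(unlabeled_emb, orca_mean),
                       dist(unlabeled_emb, non_orca_mean))
    return np.argsort(score)[::-1][:n_query]  # indices of the top-n candidates
```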
Do these new datasets refer to the labelled data, the unlabelled data, or both? Should this change be controllable from the OrcaAL site itself (i.e. visible to and changeable by annotators), or should it be more for internal usage?
Would this entail some form of A/B testing or canary deployment? The downside is that I foresee requiring two isolated databases (or more, depending on how many strategies we are testing) to contain the different annotations, and probably isolated S3 buckets as well to store the resulting sets of labelled data. This method might also take some time to test, especially if traffic to the OrcaAL site isn't very high.
Does this refer to handling concurrent loads and high traffic (load testing)? Some additional queries I have:
Thanks for all the questions @Benjamintdk!
Yes, the training is on CPU, and the app, database, and training all run on one instance. The containers are started separately. If the annotations are moved to a hosted database, and a training container is spun up only when needed, then one would need just one container for the app. It is important to consider the two use cases: 1) the Orcasound community: setting up the tool so that it works well for us; 2) users willing to set up OrcaAL on their own, who do not want to depend too much on individual AWS services.
Right now the database is in a Docker container on AWS LightSail, which is fine, but sometimes the containers stop working, so a hosted database (RDS) might be more reliable?
Currently, there are two folders in S3
Yes, the ones which are left when clicking the
It could be either labelled or unlabelled. Maybe not from the OrcaAL site at this stage. Even from the command line, I think the setup steps are pretty tightly tied to our AWS setup. So if a colleague comes with their own data and wants to set up OrcaAL, it might not be trivial to get started. This is a bit open for discussion @yosoyjay @scottveirs
We can compare the different querying strategies as a starter. Even without full deployment we can do some experiments. Kunal did some evaluation of the uncertainty strategy in a Jupyter notebook with a very small sample, but I think we now have slightly more labeled data. We also need to look at more proper metrics: right now we have accuracy, but the number of training and testing samples changes after each round, so one should be careful about how to interpret the results. We can also advertise some experiments in the Slack channel for a start, or make OrcaAL more visible to the citizen scientists.
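For example, an offline experiment could look roughly like the sketch below. The classifier, features, and labeled pool here are generic stand-ins, not the actual OrcaAL model; the point is the fixed held-out test set and a class-balance-aware metric, so scores are comparable across rounds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def simulate(X, y, strategy, seed_size=50, batch=20, rounds=10, rng=None):
    """Simulate active learning on an already-labeled pool.

    `strategy` picks which pool indices to 'annotate' next; labels are
    revealed from y to mimic human annotation. The test split is fixed
    once, so the metric is comparable across rounds.
    """
    rng = rng or np.random.default_rng(0)
    X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    labeled = list(rng.choice(len(X_pool), seed_size, replace=False))
    scores = []
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
        scores.append(balanced_accuracy_score(y_test, clf.predict(X_test)))
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        labeled += list(strategy(clf, X_pool, unlabeled, batch, rng))
    return scores

def uncertainty(clf, X, unlabeled, batch, rng):
    margin = np.abs(clf.predict_proba(X[unlabeled])[:, 1] - 0.5)
    return unlabeled[np.argsort(margin)[:batch]]        # closest to the decision boundary

def random_pick(clf, X, unlabeled, batch, rng):
    return rng.choice(unlabeled, batch, replace=False)  # baseline for comparison
```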
Maybe simply sending the same observation to more than one annotator to check for consistency. We have given demos at some events with a bunch of people: the app has not crashed, but we do not know how slow it is for the users. Maybe that should be on the list to investigate.
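If someone wants to investigate that, a minimal load-test sketch with Locust might look like this; the endpoint paths and payload are placeholders, not the actual OrcaAL routes:

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://<orcaal-host>
from locust import HttpUser, task, between

class Annotator(HttpUser):
    """Simulates an annotator fetching clips and posting labels."""
    wait_time = between(2, 8)  # seconds a user 'listens' before acting

    @task(3)
    def fetch_unlabeled_clip(self):
        # Placeholder endpoint for fetching the next clip/spectrogram to annotate
        self.client.get("/api/unlabeled")

    @task(1)
    def submit_label(self):
        # Placeholder endpoint and payload for posting an annotation
        self.client.post("/api/labels", json={"clip_id": 123, "label": "orca"})
```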
This is an interesting topic to investigate. We started with a small training set, so initially the new labels had a big effect, but with time a new batch may not help much, especially with the uncertainty metric. One should also be careful not to bias the model toward a small batch of observations. You may find some inspiration in this book.
We actually have two strategies for selecting the uncertain samples:
Possibly, we should collect some user feedback.
Just a note for everybody that all of the above topics do not need to be in one project 😅: these are just potential directions ⛵!
I agree with the assessment that the current implementation of the pipeline is pretty tightly coupled to AWS, but it wouldn't be too difficult to separate the AWS-specific deployment bits from the general, reusable code. This would make it easier for someone else to set up their own instance of OrcaAL. I think changing the current implementation to point to other data sources would be a bigger, though not too heavy, lift, but I think that would be a secondary concern.
Thanks @yosoyjay and @valentina-s for the replies, they'll be really helpful for my proposal! Regarding the datasets, I think I misinterpreted the point the first time I read it. It seems like the developer experience is something that can be improved upon a lot; I also remember seeing the ability to easily add other models for experimentation suggested for the previous GSoC run.
I want to further clarify what sort of data sources @yosoyjay might be referring to: would these be toy datasets that someone perhaps has in a zip file locally, in cloud storage in a separate S3 bucket, or even on another cloud platform (e.g. Google Cloud Storage on GCP)? I was thinking that if this were for local experimentation, then perhaps creating dev Docker containers might be a solution. There could be volume binds as access points for loading data and saving models on the local machine, and it would simplify installation and the non-trivial setup of the repo that Valentina mentioned.
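To make that concrete, here is a rough sketch of the kind of thin abstraction I had in mind (the function name and URIs are made up for illustration): the training code asks for audio files through one interface, and the backend can be either a local folder mounted as a Docker volume bind or an S3 prefix.

```python
import os
import boto3

def list_audio(source: str):
    """Yield local paths for audio clips, whether `source` is a local
    directory (e.g. a Docker volume bind) or an s3://bucket/prefix URI.
    """
    if source.startswith("s3://"):
        bucket, _, prefix = source[len("s3://"):].partition("/")
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                # Download each object to a temporary local path before yielding it
                local = os.path.join("/tmp", os.path.basename(obj["Key"]))
                s3.download_file(bucket, obj["Key"], local)
                yield local
    else:
        for name in sorted(os.listdir(source)):
            if name.endswith((".wav", ".mp3")):
                yield os.path.join(source, name)
```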
Yeah, absolutely. If this is something folks are interested in pursuing, we might also need to think a bit more about tracking different models.
I don't know the scope of the potential inputs. My response was prompted by @valentina-s mentioning "So if a colleague comes with their own data and wants to set up OrcaAL, it might not be trivial to get started." The issues you bring up are exactly why it was mentioned that this would require additional discussion, and why I thought it should be a secondary concern.
@valentina-s @yosoyjay @scottveirs another thing that I was considering is the possibility of incorporating data version control into the current pipeline. Looking through the code, I realize that while we keep track of the different model checkpoints, the data in the S3 bucket doesn't seem to be tracked. This would make it difficult to pinpoint exactly which batch(es) of data started to cause performance to degrade, should that happen.
Hi @yosoyjay @valentina-s @scottveirs bumping this as I'm not sure if it got missed out. Just to provide a little more context, I came across data versioning while doing a course (Full Stack Deep Learning) by UC Berkeley sometime back. |
@Benjamintdk versioning did come up in the discussions while building OrcaAL, but of course we did not have time for it. I remember looking at DVC, which was rather new at that time, and now it seems to be a whole suite of useful tools! I do not know exactly how they compute diffs of data, or whether there is potential for storage to explode size-wise with our data, but in our context versioning the labels of the audio segments which went into training should be enough.
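For example, even without DVC, something as simple as snapshotting the label set per training round would let us trace a bad round back to its exact batch of labels. A rough sketch (field and file names are illustrative, not the current schema):

```python
import hashlib
import json
import time

def snapshot_labels(rows, out_dir="label_manifests"):
    """Write one manifest per training round.

    `rows` is a list of dicts like {"clip": "...", "label": "orca"} pulled
    from the annotations database. The manifest records the rows plus a
    SHA-256 of their canonical form, so two rounds trained on identical
    labels share a hash, and a degrading round can be traced to its batch.
    """
    canonical = json.dumps(sorted(rows, key=lambda r: r["clip"]), sort_keys=True)
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    manifest = {
        "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "n_labels": len(rows),
        "sha256": digest,
        "rows": rows,
    }
    fname = f"{out_dir}/labels_{digest}.json"
    with open(fname, "w") as f:
        json.dump(manifest, f, indent=2)
    return fname  # store this name alongside the corresponding model checkpoint
```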
During the summers of 2020 and 2021, GSoC students worked on creating the OrcaAL Orca Active Learning tool (repo, demo), which integrates the efforts of human annotators and machine learning experts to create better ML training sets and algorithms. The initial setup was designed to handle one day of Orcasound data with one querying strategy. To make the app more flexible and scale the annotation process, there are several steps one can take to streamline the OrcaAL tool and its performance:
Expected outcomes: Improve a tool that accelerates the annotation of Orcasound audio data and the training of machine learning models to classify acoustic signals from orcas.
Required Skills: Python, Machine Learning, Docker
Bonus Skills: Cloud Computing, Flask
Mentors: Valentina, Jesse, Scott
Difficulty level: Hard
Project Size: 175 or 350 h
Resources:
Getting Started:
Follow the instructions to start the API