This is a starter kit for the Snakes Species Identification Challenge on AIcrowd.
In this challenge you will be provided with a dataset of RGB images of snakes, and their corresponding species (class). The goal is to train a classification model.
The difficulty of the challenge lies in the dataset characteristics: there can be high intra-class variance within certain classes and low inter-class variance between others, as shown in the examples in the Datasets section. In addition, images are not distributed evenly across classes: the class with the most images has 12201, while the class with the fewest has 17. (The top and bottom 15 classes, based on the training data, are shared below.)
scientific_name (top 15) | Count | scientific_name (bottom 15) | Count |
---|---|---|---|
thamnophis-sirtalis | 12201 | agkistrodon-laticinctus | 17 |
storeria-dekayi | 9708 | bitis-armata | 17 |
pantherophis-obsoletus | 8767 | bothrocophias-microphthalmus | 17 |
crotalus-atrox | 8679 | chironius-bicarinatus | 17 |
pituophis-catenifer | 7113 | geophis-hoffmanni | 17 |
nerodia-sipedon | 6436 | hebius-miyajimae | 17 |
agkistrodon-contortrix | 4946 | lycodon-effraenis | 17 |
diadophis-punctatus | 4406 | macrovipera-schweizeri | 17 |
lampropeltis-triangulum | 4326 | naja-pallida | 17 |
pantherophis-alleghaniensis | 4282 | philothamnus-punctatus | 17 |
nerodia-erythrogaster | 4276 | sibynophis-collaris | 17 |
lampropeltis-californiae | 4158 | spalerosophis-dolichospilus | 17 |
agkistrodon-piscivorus | 4011 | thamnophis-chrysocephalus | 17 |
opheodrys-aestivus | 3931 | agkistrodon-taylori | 18 |
crotalus-horridus | 3571 | boiga-beddomei | 18 |
The distribution of Snake Species and their image count looks as follows:
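If you want to reproduce such counts yourself, a minimal sketch with pandas follows; the tiny in-memory DataFrame below is a stand-in for the real `train_labels.csv` extracted from `train_labels.tar.gz`:

```python
import pandas as pd

# Toy stand-in for the real data/train_labels.csv
labels = pd.DataFrame({
    "scientific_name": ["thamnophis-sirtalis"] * 5 + ["bitis-armata"],
    "hashed_id": ["a1", "a2", "a3", "a4", "a5", "b1"],
})

# Images per species, sorted from most to least frequent
counts = labels["scientific_name"].value_counts()
print(counts)
print("imbalance ratio:", counts.iloc[0] / counts.iloc[-1])
```

On the real training labels, the same two lines reveal the class imbalance shown in the table above.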
To keep the barrier to entry low while demonstrating that an approach works well, we started with 85 species and 130150 images in the previous round, and these numbers increase with every round. The idea is to renew the challenge every 4 months in order to get closer to our final goal: an algorithm that best predicts which antivenin (if any) should be given for a specific image.
The datasets are available in the Resources section of the challenge page; following the links there, you will find the following files:
train_images.tar.gz
train_labels.tar.gz
validate_images.tar.gz
validate_labels.tar.gz
validate_images_small.tar.gz
validate_labels_small.tar.gz
Where:
- `train_images.tar.gz` untars into a folder containing 245185 images of snakes spread across 783 different snake species.
- `train_labels.tar.gz` untars into a CSV with the following structure:
scientific_name | country | hashed_id | continent |
---|---|---|---|
crotalus-pyrrhus | United States of America | f670636e2f | North America |
phyllorhynchus-decurtatus | United States of America | 5bfe5fa2ef | North America |
thamnophis-marcianus | United States of America | 94d2da23c9 | North America |
boa-constrictor | UNKNOWN | 871c3b709a | UNKNOWN |
crotalus-atrox | United States of America | e983981e77 | North America |
boa-imperator | Mexico | ba9f7def25 | North America |
masticophis-flagellum | United States of America | 2402a939c3 | North America |
coluber-constrictor | United States of America | af5eacaac1 | North America |
storeria-dekayi | United States of America | ba74f6f6c1 | North America |
With the following columns :
- `hashed_id` : Unique ID of a single image (the corresponding file is present in `train_images.tar.gz` as `{hashed_id}.jpg`)
- `scientific_name` : Unique class name for the image in question
- `country` : Country where the image was taken
- `continent` : Continent where the image was taken
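Putting these columns together, each row of the labels CSV can be resolved to an image file; a minimal sketch (the directory layout matches the one assumed in this README):

```python
import csv
import os

def iter_samples(labels_csv, images_dir):
    """Yield (image_path, scientific_name) pairs from a labels CSV."""
    with open(labels_csv, newline="") as f:
        for row in csv.DictReader(f):
            path = os.path.join(images_dir, row["hashed_id"] + ".jpg")
            yield path, row["scientific_name"]
```

For example, `iter_samples("data/train_labels.csv", "data/train_images")` would yield pairs like `("data/train_images/f670636e2f.jpg", "crotalus-pyrrhus")`.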
- `validate_images.tar.gz` expands into a folder of `.jpg` files representing a sample with a distribution similar to the test data. It is provided to help you run the validation phase locally against your model.
- `validate_labels.tar.gz` expands into a CSV file with the following structure:
scientific_name | country | hashed_id | continent |
---|---|---|---|
lycodon-ruhstrati | UNKNOWN | bac3ff1139 | UNKNOWN |
nerodia-sipedon | Canada | 516a58ab5c | North America |
naja-nigricollis | UNKNOWN | b8319014ed | UNKNOWN |
crotalus-atrox | United States of America | cfed281bac | North America |
symphimus-mayae | Mexico | d237616cb2 | North America |
pantherophis-vulpinus | United States of America | b581db474b | North America |
storeria-dekayi | United States of America | fad71aeca3 | North America |
opheodrys-aestivus | United States of America | 3fd4dea662 | North America |
storeria-dekayi | United States of America | 83402064a6 | North America |
The files `validate_images_small.tar.gz` and `validate_labels_small.tar.gz` are a small subset of the validation data, provided so you can try out your submission locally without heavy downloads.
Before moving on to the next phase, download the datasets from the links above and organize them in the `./data` folder with the following structure:
├── data
│ ├── validate_images_small
│ │ ├── 01978e1d8d.jpg
│ │ ├── 019d1e8cae.jpg
│ │ ├── 04a3809dda.jpg
│ │ ├── ..............
│ │ ├── ..............
│ │ ├── ..............
│ │ ├── fbb98a8213.jpg
│ │ ├── fc9fd55077.jpg
│ │ └── fce0ab02dd.jpg
│ └── validate_labels_small.csv
NOTE : The training related files and directories are excluded in the illustration above for simplicity.
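As a quick sanity check that the layout above is in place, a small helper sketch (the paths mirror the directory tree shown):

```python
import os

# Paths mirror the directory tree above
EXPECTED = [
    "data/validate_images_small",      # folder of {hashed_id}.jpg files
    "data/validate_labels_small.csv",  # labels for the small validation set
]

def missing_entries(expected=EXPECTED):
    """Return the expected paths that do not exist yet."""
    return [p for p in expected if not os.path.exists(p)]

print(missing_entries())
```

An empty list means the small validation set is in place.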
The predictions should be a valid CSV file with the same number of rows as the number of images in the test set (also listed in the `test_metadata` file). The header row should contain `hashed_id` followed by one column for each valid snake species in this round; every other row contains the `hashed_id` of a test case and its probability distribution across all the valid species. The `run.py` script has a list of the valid snake species for this round, which can also be derived from the `scientific_name` column in the `train_labels.csv` file. Overall, the file is expected to have 86 columns (1 for `hashed_id` and 1 for each of the 85 included snake species). The sum of the probabilities across all the snake-species columns should be < 1.0.
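The expected CSV shape can be sketched as follows; the species list and test IDs here are placeholders (the real ones come from `run.py` and the test metadata):

```python
import csv

species = ["thamnophis-sirtalis", "storeria-dekayi", "crotalus-atrox"]  # placeholder list
test_ids = ["01978e1d8d", "019d1e8cae"]  # placeholder: read these from the test metadata

with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["hashed_id"] + species)  # header: hashed_id + one column per species
    for hashed_id in test_ids:
        probs = [1.0 / len(species)] * len(species)  # uniform placeholder probabilities
        writer.writerow([hashed_id] + probs)
```

In a real submission the probabilities come from your model and the number of species columns matches the round's class list.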
A sample script which generates a random prediction for the whole test set is included in `run.py`; the inline comments there illustrate the expected structure. Please make sure to use the following environment variables:
AICROWD_TEST_IMAGES_PATH
AICROWD_TEST_METADATA_PATH
AICROWD_PREDICTIONS_OUTPUT_PATH
to get the paths to the test images and the test metadata, and the path where the prediction output is to be saved.
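In Python these can be read with `os.getenv`; the fallback paths here are assumptions for local runs (the evaluator sets the real values):

```python
import os

# Fallbacks point at the small validation set for local testing
test_images_path = os.getenv("AICROWD_TEST_IMAGES_PATH", "data/validate_images_small")
test_metadata_path = os.getenv("AICROWD_TEST_METADATA_PATH", "data/validate_labels_small.csv")
predictions_output_path = os.getenv("AICROWD_PREDICTIONS_OUTPUT_PATH", "predictions.csv")

print(test_images_path, test_metadata_path, predictions_output_path)
```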
To submit to the challenge you'll need to ensure you've set up an appropriate repository structure, create a private git repository at https://gitlab.aicrowd.com with the contents of your submission, and push a git tag corresponding to the version of your repository you'd like to submit.
We have created this sample submission repository which you can use as reference.
Each repository should have an aicrowd.json file with the following fields:
{
  "challenge_id": "snake-species-identification-challenge",
  "grader_id": "snake-species-identification-challenge",
  "authors": ["aicrowd-user"],
  "description": "Snakes Random Classification Agent",
  "debug": "false"
}
This file is used to identify your submission as part of the Snake Species Identification Challenge. You must use the `challenge_id` and `grader_id` specified above in your submission. You can enable "debug" mode for quicker submissions that run over only 100 test images, for integration testing; the scores of debug submissions are not counted toward the leaderboard.
You can specify your software environment by using all the available configuration options of repo2docker.
For example, to use Anaconda configuration files you can include an environment.yml file:
conda env export --no-build > environment.yml
It is important to include the `--no-build` flag, which allows your Anaconda config to be reproduced cross-platform.
The evaluator will use `/home/aicrowd/run.sh` as the entrypoint. Please remember to have a `run.sh` at the root of your repository which sets any necessary environment variables and executes your code. This repository includes a sample `run.sh` file.
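A minimal `run.sh` sketch (the `run.py` entrypoint matches the sample script in this kit; adapt it to however you launch your own code):

```shell
#!/bin/bash
# Minimal entrypoint sketch: the evaluator executes /home/aicrowd/run.sh,
# which should launch your prediction code.
python run.py
```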
To make a submission, you will have to create a private repository on https://gitlab.aicrowd.com.
You will have to add your SSH Keys to your GitLab account by following the instructions here. If you do not have SSH Keys, you will first need to generate one.
Then you can create a submission by adding the AIcrowd git remote and pushing a tag to it:
git clone https://github.com/AIcrowd/snake-species-identification-challenge-starter-kit snake-species-identification-challenge
cd snake-species-identification-challenge
# Add AICrowd git remote endpoint
git remote add aicrowd [email protected]:<YOUR_AICROWD_USER_NAME>/snake-species-identification-challenge.git
git push aicrowd master
# Create a tag for your submission and push
git tag -am "submission-v0.1" submission-v0.1
git push aicrowd master
git push aicrowd submission-v0.1
# Note : If the contents of your repository (latest commit hash) does not change,
# then pushing a new tag will not trigger a new evaluation.
You should now be able to see the details of your submission at: gitlab.aicrowd.com/<YOUR_AICROWD_USER_NAME>/snake-species-identification-challenge/issues
Best of Luck
- Sharada Mohanty ([email protected])
- Shivam Khandelwal ([email protected])