Would it be possible to use a pretrained model (such as Topview Mouse from DLC animal zoo)? #158

hummuscience · 2024-05-24T10:08:37Z

I have been playing around with lightning pose the past few days and quite impressed with the training speed and performance!

Coming from DeepLabCut, I am testing LP on videos of mice captured from the top view. As you probably know, the DLC animal zoo had a pretrained model for this scenario .

Would it be possible to use that as a backbone for LP instead of the typical resnets?

I am still new to your codebase, so I might have not understood it in depth yet and missing something here...

themattinthehatt · 2024-05-24T13:18:43Z

@hummuscience thanks for the kind words!

We currently do not offer any pretrained models, though we hope to start providing some within the year. Once DLC releases the dataset used to train their TopViewMouse network we can use that to train an LP version. I'll leave this issue open for now and update it once we start working on this.

We are also considering providing a pretrained model from the Facemap dataset (mouse face from different angles) and the CRIM13 dataset (top-down view of two mice, one black and one white).

If anybody has additional pretrained models they would like to see (and importantly, a pointer to a labeled dataset), please let us know!

hummuscience · 2024-05-27T10:12:22Z

I just checked the AnimalZoo preprint and the mentioned references. It doesn't seem that any of the actual labeled datasets are available, but in some cases, the videos are (see here for example: https://zenodo.org/records/3608658).

I wonder if it would make sense to run the TopViewMouse model on these videos, extract some output frames (refine them in case of errors), and then convert the project to a DLC project. It might take some time, but could be worth the effort (for me, at least). How many frames would you aim for in this case?

The facemap model would be quite useful, since many people working with head-fixed mice could benefit from it.

The CRIM13 model is also interesting, but wouldn't have that many applications since it's specific for assays with two animals (one white the other black). Unless, this is a more common assay, and I am not aware of that.

On the subject of multianimal tracking. Are there currently any possibilities with LP?

themattinthehatt · 2024-05-28T13:48:01Z

We've been in touch with the DLC folks and they plan to release the labeled datasets once the paper is out. However, it is not clear how long that will take so your suggestion is probably the quickest way to getting something workable. This isn't something we have the bandwidth to work on right now, but if you're up for trying it that would be great! I'd suggest labeling all the video frames with the TopViewMouse model, then doing the following:

compute motion energy (absolute difference between keypoints on consecutive frames, averaged over all keypoints)
remove frames with low motion energy (when the mouse is sitting still)
for the remaining frames, run a clustering algorithm (kmeans is fine) on the poses and selecting 1000 clusters. Then you can take 1 frame from each cluster to get 1000 labeled frames with a decent amount of pose variability (see a related implementation here, though note this runs kmeans on a pca embedding of the frames rather than the predicted keypoints).
refine errors in selected frames if necessary

I would also note that we find, even with the LP bells and whistles, that more labeled frames is always better. so if you're not too daunted by the refinement step, selecting 2k or 3k frames will almost certainly result in a more robust model than 1k frames. but 1k will certainly do a good job.

If you go down this route please let me know! Happy to keep discussing it with you.

Re: multi-animal tracking - this is something that we are working towards, but it will be a while before we have these features built into LP (unless the animals are visibly distinct, like in the CRIM13 dataset)

hummuscience · 2024-07-10T12:25:50Z

So, I started to implement this for some of my own videos and currently refining the predictions on 300 images to test if it improves things.

Meanwhile, it seems like the datasets from the SuperAnimal paper are public now (or at least, I found them on zenodo: MausHaus and the whole TopViewMouse dataset.

I tried the pretrained model on some images from the MausHaus and the BlackMice datasets, but the results were actually not as good as I expected. The expectation would be that the pretrained model would reproduce the labels from the training set. But it didn't (or rather, performed poorly).

Now there is also the entire TopViewMouse dataset. There, not all keypoints are labelled in all images (since they come from different datasets). I wondered if I can just go ahead and train LP with it like that.

The other issue is that the annotations are in a JSON file. I think I could manage to convert it to LP format though.

themattinthehatt · 2024-07-10T13:26:49Z

I looked at the TopViewMouse demo a year or so back and also found that it did not perform as well as expected on a top-view mouse video that looked very similar to one in the training dataset. Good to know that you could replicate this finding.

I still haven't had a chance to look into the TopVIewMouse dataset - I do remember that not all keypoints are labeled in all images, but are the keypoints at least named the same when they are in the same location across datasets? If so then you can definitely train and LP model on these, you would just leave the ground truth label empty where it doesn't exist and then LP ignores this keypoint during training.

It shouldn't be too hard to convert the annotations from JSON to LP format. If you end up doing this please let me know and we can discuss best ways to train the model!

hummuscience · 2024-07-10T13:55:42Z

The results are a bit better when one uses spatio-temporal adaptation. But not as I would expect.

Yes, the positions are the same. Then I will go ahead and try out the try. I will report :)

themattinthehatt · 2024-07-10T15:47:57Z

Awesome excited to see the results! I take it the labeled dataset doesn't have the associated videos as well? Maybe I can ask the Mathis's for that data, then we could extract the context frames and test out a context model as well.

hummuscience · 2024-07-12T14:48:06Z

Training is running 👍 will post once its done.

Yeah, the dataset doesn't contain any videos. I think some of the origin datasets could have videos (pranav 2018 maybe?). But yeah, it could be easier to ask Mathis's for the videos. Even though it is possible that they won't have them...

Would it be possible to inform the context model with a non-context model somehow? Maybe one could use the PCA?

Btw, I wrote a script that automatically extracts the context frames for each image. Could that be useful to add as a utility in the scripts folder?

hummuscience · 2024-07-15T08:59:38Z

First look at the training. It seems like I should be stopping the training earlier (150k?) or at least saving more checkpoints.

Test videos look very good :) I am thinking of training a DLC model witht he same dataset (maybe some shuffles?) to compare output. There are 300 additional frames not contained in the original TopViewMouse5k that come from my own datasets.

Will check the evaluation

I am quite new to LP so I am not so sure about the choices in the config.yaml file. I added it below.

data:
  image_orig_dims:
    height: 480
    width: 640
  image_resize_dims:
    height: 512
    width: 512
  data_dir: /mnt/hpc_slurm/home/abdelhaym/freely-moving/lp-models/all_mighty_mouse/
  video_dir: /mnt/hpc_slurm/home/abdelhaym/freely-moving/lp-models/all_mighty_mouse/videos/
  csv_file: CollectedData_all.csv
  downsample_factor: 2
  num_keypoints: 27
  keypoint_names:
  - nose
  - left_ear
  - right_ear
  - left_ear_tip
  - right_ear_tip
  - left_eye
  - right_eye
  - neck
  - mid_back
  - mouse_center
  - mid_backend
  - mid_backend2
  - mid_backend3
  - tail_base
  - tail1
  - tail2
  - tail3
  - tail4
  - tail5
  - left_shoulder
  - left_midside
  - left_hip
  - right_shoulder
  - right_midside
  - right_hip
  - tail_end
  - head_midpoint
  mirrored_column_matches: null
  columns_for_singleview_pca:
  - 1
  - 2
  - 7
  - 9
  - 13
  - 19
  - 20
  - 21
  - 22
  - 23
  - 24
  - 26
training:
  imgaug: dlc-top-down
  train_batch_size: 8
  val_batch_size: 48
  test_batch_size: 48
  train_prob: 0.8
  val_prob: 0.1
  train_frames: 1
  num_gpus: 1
  num_workers: 4
  early_stop_patience: 3
  unfreezing_epoch: 20
  min_epochs: 300
  max_epochs: 750
  log_every_n_steps: 10
  check_val_every_n_epoch: 5
  gpu_id: 0
  rng_seed_data_pt: 0
  rng_seed_model_pt: 0
  lr_scheduler: multisteplr
  lr_scheduler_params:
    multisteplr:
      milestones:
      - 150
      - 200
      - 250
      gamma: 0.5
model:
  losses_to_use: []
  backbone: resnet50_animal_ap10k
  model_type: heatmap
  heatmap_loss_type: mse
  model_name: test
  checkpoint: null
  lightning_pose_version: 1.4.0
dali:
  general:
    seed: 123456
  base:
    train:
      sequence_length: 64
    predict:
      sequence_length: 128
  context:
    train:
      batch_size: 16
    predict:
      sequence_length: 96
losses:
  pca_multiview:
    log_weight: 5.0
    components_to_keep: 3
    epsilon: null
  pca_singleview:
    log_weight: 5.0
    components_to_keep: 0.99
    epsilon: null
  temporal:
    log_weight: 5.0
    epsilon: 20.0
    prob_threshold: 0.05
eval:
  hydra_paths:
  - ' '
  predict_vids_after_training: true
  save_vids_after_training: true
  fiftyone:
    dataset_name: freemovetest
    model_display_names:
    - test_freemoving
    launch_app_from_script: false
    remote: true
    address: 127.0.0.1
    port: 5151
  test_videos_directory: /mnt/hpc_slurm/home/abdelhaym/freely-moving/lp-models/all_mighty_mouse/test_videos/
  saved_vid_preds_dir: null
  confidence_thresh_for_vid: 0.9
  video_file_to_plot: null
  pred_csv_files_to_plot:
  - ' '
callbacks:
  anneal_weight:
    attr_name: total_unsupervised_importance
    init_val: 0.0
    increase_factor: 0.01
    final_val: 1.0
    freeze_until_epoch: 0

hummuscience · 2024-07-15T15:49:31Z

I tried to run a semi-supervised model (PCA or temporal), but it fails due to the input images being of different sizes.

I could rescale the images and the key points, do you think that would make sense?

themattinthehatt · 2024-07-15T17:13:31Z

ah very cool, glad you were able to get this working!

I tried to run a semi-supervised model (PCA or temporal), but it fails due to the input images being of different sizes.

yes pca will require the frames to be the same sizes; but actually this is an interesting use case because even if you resized the frames to be the same size the size of the animal would vary a lot from dataset to dataset, so maybe PCA wouldn't work so well anyways. I'll have to give this some more thought.

config options

image_resize_dims: 512x512 is quite large, you might also try 256x256 or 384x384 and see if they work as well/better. model training/inference will certainly be faster with smaller resize dims. training can actually be more accurate in some cases too because the number of trainable parameters increases as you increase the resize dims, so I've seen (through other users) that even in 1000x1000 pixel images with a freely moving mouse resizing to something smaller than 512x512 is more accurate.
train_prob/val_prob: I would up the train_prob so you use more of the data for training. maybe use something like 0.95/0.05? that doesn't leave any leftover for test data, but it depends on how you want to test the model (could be on held-data that you labeled that isn't the CollectedData.csv file). also, there are a ton of frames in this dataset (>5k right?) so you could even go further to 0.98/0.02 or something (this would still give you >100 validation frames)
train_batch_size: if you decrease the resize_dims you could double this to 16 without memory issues I'm guessing; then the model will train faster

everything else looks good! I see you set imgaug to dlc-top-down already, which is great 👌

Would it be possible to inform the context model with a non-context model somehow? Maybe one could use the PCA?

the lack of context frames and the frames being different sizes means it is difficult to play with the context/semi-supervised features of the model. there's not a real easy workaround to either of these with this dataset that I can think of off the top of my head. So I guess I'd say use the fully supervised model first and see how that works for you?

First look at the training. It seems like I should be stopping the training earlier (150k?) or at least saving more checkpoints.

I'm kinda surprised the validation loss starts going up so early - but my intuition is that this is related to the big resize dim (512x512), I'm curious if this goes away with smaller resize dims.

hummuscience · 2024-07-16T11:23:15Z

Re-training now with your suggestions.

I have a RTX A6000, so I increased the batch size to 32 without any memory error (after 40 epochs).
Went with 256 x 256 size
And 0.97 train, 0.02 val which ends up with 100+ validation images and 50+ test

Interestingly, the model is not much faster, an epoch takes 1 minute

I was checking the loss development with Tensorboard, and it seems like the loss is reducing quicker than the first model I trained.

It could be due to the larger training data size and larger batch size that the loss is reducing faster and its generalizing better.

hummuscience · 2024-07-16T12:17:04Z

Going through the evaluation results of the first model I trained, it seems like the model doesn't generalize across datasets where less keypoints are labelled.

For example, in this dataset, only 4 keypoints were labelled. When we predict all 26 keypoints, the model does quite a bad job at it, with a high confidence.

Am I doing something wrong?

In the SuperAnimal paper, they mention using gradient masking of the heatmaps to deal with this issue. I have no clear idea what that means, though.

themattinthehatt · 2024-07-16T13:20:53Z

Interestingly, the model is not much faster, an epoch takes 1 minute

I guess this makes sense, you have fewer batches but each batch takes longer to process. Is your GPU utilization at or near 100% while training? If not you could also increase training.num_workers to something like 6 or 8 (depending on the number of cores you have) and that would speed training up a bit.

I was checking the loss development with Tensorboard, and it seems like the loss is reducing quicker than the first model I trained.

You are correct that this is due to the increased batch size - you can see in the first model that there is a big dip around 10k iterations. This is actually due to the unfreezing of the backbone weights. When you increase the batch size you take fewer steps per epoch, and so the backbone is unfrozen earlier (in terms of number of gradient steps). Actually in your case because there are so many frames you could probably reduce the parameter training.unfreeze_epoch to something like 5 instead of 20, but that doesn't invalidate any of your current results.

For example, in this dataset, only 4 keypoints were labelled. When we predict all 26 keypoints, the model does quite a bad job at it, with a high confidence.

Huh this is a bit unexpected, I would have guessed the model could generalize better than this. I'm curious if the 256x256 model generalizes better. This is a great test though! Do you see this kind of issue with other datatsets that have few labeled keypoints? I forget how exactly the SuperAnimal paper does the gradient masking but I'll look into that and get back to you.

hummuscience · 2024-07-16T14:10:41Z

I found the code that DLC uses for gradient masking in their dlc_pytorch branch

        for b in range(batch_size):
            for heatmap_idx, group_keypoints in enumerate(coords[b]):
                for keypoint in group_keypoints:
                    # FIXME: Gradient masking weights should be parameters
                    if keypoint[-1] == 0:
                        # full gradient masking
                        weights[b, heatmap_idx] = 0.0
                    elif keypoint[-1] == -1:
                        # full gradient masking
                        weights[b, heatmap_idx] = 0.0

                    elif keypoint[-1] > 0:
                        # keypoint visible
                        self.update(
                            heatmap=heatmap[b, :, :, heatmap_idx],
                            grid=grid,
                            keypoint=keypoint[..., :2],
                            locref_map=self.get_locref(locref_map, b, heatmap_idx),
                            locref_mask=self.get_locref(locref_mask, b, heatmap_idx),
                        )

If I understand this correctly, the gradient masking is to the weights before the backwards pass? But I am not sure, since in the SuperAnimal paper they mention that it is applied before the loss calculation.

In LP the masking is done during [the loss calculation]https://github.com/danbider/lightning-pose/blob/4967266feb59a8c08ff3c31d08f520d480cd10d1/lightning_pose/losses/losses.py#L149), as in, all NaNs are removed before loss calculation. Is that the same approach though?

Checking the SuperAnimal paper, there are some images from with vs. without masking

The methods part mentions the following:

Training naively on these projected annotations would harm the training stability, as the loss function penalizes undefined keypoints, as if they were not visible (i.e., occluded).

For stable training of our panoptic pose estimation model, we mask components of the loss function across keypoints. The keypoint mask $n_k$ is set to 1 if the keypoint $k$ is present in the annotation of the image and set to 0 if the keypoint is absent. We denote the predicted probability for keypoint $k$ at pixel $(i, j)$ as $p_k(i, j) \in [0, 1]$ and the respective label as $t_k(i, j) \in {0, 1}$, and formulate the masked $L_k$ error loss function as

$$ L_k = \sum_{k=1}^{m} n_k \sum_{i,j} |p_k(i, j) - t_k(i, j)|_z, $$

with $z=2$ for mean square error and $z=1$ for L1 loss (e.g., used for locref maps in DLCRNet) and the masked cross-entropy loss function as

$$ L_{CE} = - \sum_{k=1}^{m} \sum_{i,j} n_k t_k(i, j) \log p_k(i, j). $$

Note that we make distinct the difference between not annotated and not defined in the original dataset and we only mask undefined keypoints. This is important as, in the case of sideview animals, "not annotated" could also mean occluded/invisible. Adding masking to not annotated keypoints will encourage the model to assign high likelihood to occluded keypoints.

hummuscience · 2024-07-16T15:25:06Z

Does LP distinguish between "unlabelled" keypoints (because of occlusions) and keypoints that were not labelled at all?

themattinthehatt · 2024-07-16T15:34:29Z

No, currently LP does not distinguish between the two - if a ground truth label is missing it is dropped from the loss function with the remove_nans function you pointed out (though see "Why does the network produce high confidence values..." here: https://lightning-pose.readthedocs.io/en/latest/source/faqs.html).

So yes, on a first pass it appears that LP by default does the same gradient masking that the SuperAnimal paper implements. Looking at the DLC function you linked, weights does not refer to the weights of the model, but rather the mask applied to the loss ($n_k$ in their notation above).

themattinthehatt · 2024-07-16T15:42:36Z

Oh something else I just realized: the TopViewMouse dataset contains some datasets with multiple animals - how do you deal with this right now? LP cannot currently handle multi-animal pose estimation.

hummuscience · 2024-07-16T15:55:59Z

I removed the two datasets that have multiple animals (TriMouse and Golden Lab).

I also just realized that one of the Datasets (Kiehn Lab Openfield) actually doesn't have labels. So I will rerun the training without it

themattinthehatt · 2024-07-16T16:03:17Z

Great. Let me know how the generalization looks with the 256x256 model when done, I'm still scratching my head a bit about the bad performance in the frames you showed above.

hummuscience · 2024-07-16T18:21:49Z

This is the current state of the training (it still has the Kiehn Lab Openfield data though).

The train_heatmap_mse_loss is plateauing as well as the RMSE loss. Not sure what to think about the val loss.

Should I stop the training in this case? If I stop the training, will it save the checkpoint? (I set the config to save multiple checkpoints but realized that it is not implemented on the dynamic_crop branch)

themattinthehatt · 2024-07-17T00:32:28Z

the noisiness in the validation plot is weird, especially compared to the black line. is black 512x512 and magenta 256x256?

one thing you can do is hit the three dots in the upper right hand corner of these plots and change the y-axis to a log scale, that's typically more helpful the further you get in training - my guess is that the train_heatmap_mse_loss isn't plateauing yet, it's just decreasing on smaller scales.

the model should be saving out weights along the way, you'll find them in the tb_logs directory (you'll have to go down a couple more subdirectories).

themattinthehatt · 2024-07-17T00:39:03Z

btw if you're training on the dynamic_crop branch I would recommend switching over to main and pulling the latest updates, I recently made an update to the validation dataloader that might be relevant here (a5e3831). Previously the validation data was passing through the image augmentation pipeline, so the "validation" data was actually different on every single pass. In the most up-to-date main branch that has been fixed and now the validation data is not augmented. Obviously you were doing the same thing with the 512x512 network and didn't see the weird spikes in the validation loss but wouldn't hurt to remove that factor.

hummuscience · 2024-07-17T12:49:04Z

Alright, re-training now with the on the unsupervised multiview branch with 8 workers, and unfreeze epoch set to 5.

I will also have a look at how well the magenta model performed.

hummuscience · 2024-07-17T15:31:34Z

So far so good. Orange is the latest model.

hummuscience · 2024-07-18T10:07:26Z

(I wonder if we should move this to discussions?)

So, the model is trained and it seems to perform better than the previous ones. It might be due to the removal of the weird datasets though (2-3 mice and the one without labels).

I would have expected the early stopping to kick in at some point after 5.5 hours. But maybe I am misunderstanding how it works. how do I know which checkpoint was used for the inference in the end? I understand that lightning uses the checkpoint with the "best score" but I can't find out how this is determined...

The predictions on some datasets look quite good:

While on others, less so:

Specially this effect of all the predictions being "squished" into the central line of the animal of to the front of the animal.

One obvious difference betewen the two different types of datasets is the size of the animal. I thought this should only affect semi-supervised PCA methods, but I have a blank losses_to_use list in the config. For some weird reason though, my test_video predictions have a *_pca_singleview_error.csv file. Is the PCA still being computed and used? (I am on the semisupervised multiview branch).

This issue reminds me of the situation that DLC had with the SuperAnimal models and therefore adapted spatial-pyramid search during testing and video adaptation during inference of previously unseen data.

One mistake that I did was that I didn't adjust the train/va/test proportions after removing one of the datasets. I ended um with a small val/test dataset (80 images each) which did not contains all datasets. I wonder if it would make sense to perform the train/test/val split manually to make sure all datasets are represented in each split.

themattinthehatt · 2024-07-18T18:16:25Z

awesome, thanks so much for looking into this. yeah maybe we can switch over to the discord? https://discord.gg/tDUPdRj4BM

to answer your most recent questions here though:

if you set training.early_stopping to true in the config file the model doesn't actually perform early stopping (apologies for this, old nomenclature that we should update). rather, the trainer monitors the validation loss and saves out a model checkpoint every time the validation loss is lower than any previously recorded validation loss. the model is trained for the full number of epochs still. so when you load the weights you'll get the ones corresponding to that lowest validation loss epoch.
those first three images you shared look pretty good. do the turquoise markers correspond to the set of keypoints that were labeled for that particular dataset, and the blue markers the unlabeled keypoints? (if so that's very helpful). I'm not sure why the generalization to missing keypoints would work in some instances but not others, especially that last one where...well...it's just a dark mouse on a bright background.
regarding the PCA error file, if you set losses_to_use to be an empty list the model will not use the pca loss during training, but it will still automatically compute that loss on videos after running inference if the columns_for_pca_singleview isn't null
i still haven't read deeply into their video adaptation method, but I would think that's not necessary on frames from datasets that were used in the training
if you want to manually create a test dataset you can select a set of frames, remove those frames from the CollectedData.csv file, and place them in a file called CollectedData_new.csv. Then you can use all frames in CollectedData.csv for train/val. Our training function will automatically look for a file named CollectedData_new.csv (or, more generally, the name of the csv file used for training with "_new" appended to the end) and run inference on that. This is what we did for all of our OOD experiments in the LP paper, to have strict control over which frames were train/val vs OOD test.

hummuscience · 2024-08-15T12:23:51Z

I am currently thinking that the issues with the labels on the sides of the animal getting bad predictions has to do with the relative proportion of these labels vs. others (such as ears or tail base) in the whole dataset.

I will try to do some stratification with the train/test data or some oversampling. I haven't found hints of this being implemented in LP until now. Do you think it might make sense to add?

themattinthehatt · 2024-08-20T20:57:45Z

That's a good observation, it is indeed possible that oversampling would force the model to focus more on these less-frequent keypoints. One way to do this would be to implement a custom sampler for the pytorch data loader, which would allow upweighting of the labeled examples with more keypoints.

I don't have the bandwidth to work on this for the time being, but happy to discuss the details more if you're interested in trying it out.

hummuscience · 2024-08-21T13:47:54Z

Nice! Good to know there is a way to implement this directly, I will give it a look :)

In the meantime, I ran a few experiments. I took the whole datasets I am working with, created a OOD test-set (40 images from each modality/dataset), and the rest I used for training using a 90% train and 10% eval split.

I tried 4 different approaches.

Trained untouched ("Original")
"Oversampled": the images with the under-represented labels (created a "weight" per image based on how much under-represented labels it has) were oversampled while keeping the total number of images for training fixed. The final training set had a total of around 4400 images (1900 if duplicates are removed).
"Oversampled-Inflated": since I am "losing" precious training data if I oversample while keeping the number of images the same, I inflated the total number of images by 1.5, kept the original dataset, and then added more images, but there I oversampled the under-represented images
"Oversampled-2.5Inflated": The same as 3, but with 2.5 inflation, to makes sure even more under-represented images are present (in retrospect, I should have thought of a better way the bias the data more towards the under-represented labels).

Doing this lead to the following distributions of labels in the training data:

Here is the validation loss during training: (blue original, dark blue oversampled, pink inflated, purple 2.5 inflated)

I also saved a snapshot every 50 epochs from each model to run the eval on the OOD dataset and see the development:

Here is the median log pixel error of some keypoints (mainly the problematic ones and mouse_center/tail_base where are present in each dataset. The vertical lines show the "best" model as chosen by LP

I wanted to see if the models perform similarly on different origin datasets. Here is the log pixel error for "left_hip"

And tail_base:

It seems like the oversampling and inflation improves the model performance for these rare labels.

The varying performance between datasets is still a bit puzzling... Will have a closer look at the weights for each image, it might be that OFT/3CSI are over-represented for some reason.

hummuscience · 2024-08-21T14:08:16Z

The more intersting question (at least for me) is what happens with this "collapsing" of these left_ and right_ keypoints to the center. Here is an example of the same frame from 4 different models. The "no_topview" model is one that was trained only on in-house data (without the TopViewMouse dataset). You can see that the lateral keypoints are "collapsing" into the spine of the animal in the other models.

If I plot the distances between the left and right keypoints, moments where this "collapse" is happening are situations where the distance is close to 0. compared to the "no_topview" model (which this collapsing doesnt happen), all trained models show this. And they are not even consistent as to when it happens (even though it seems like frames where the animal stands up, narrowing its body).

Here are the distributions of the distance for all trained models:

In this case, it seems like inflation of the training dataset leads to a worse result.

This is however not the case for all datasets. The OFT dataset which benefitted from the oversampling/inflation above has this problem solved :)

So I might be getting closer to the solution. I think it has to do with how I am oversampling and weighing the data :)

themattinthehatt · 2024-08-21T14:55:24Z

@hummuscience wow this is awesome, nice work! some misc comments/questions:

the oversampled + inflated approach seems to be a simple (and potentially effective?) way forward. But I am also confused as to why the collapsing is still happening. In your final 3 plots above, are these results on labeled OOD frames or on a video?
does the collapsing occur on all datasets, or just the datasets that didn't include the additional left_ and right_ keypoints?
since you have your own in-house dataset, I'm curious if you see improved results by first fitting a model on the TopViewMouse dataset, and then fine-tuning on your own data
is there an easy way to oversample + inflate the data such that the distribution over keypoints in your first plot is uniform (or closer to uniform)? the oversample_inflated and oversample_2.5inflated distributions are much more similar to each other than the original and oversampled distributions

hummuscience · 2024-08-21T15:13:01Z

Results on two videos. One is from the ZN_free_moving dataset, the other from the OFT dataset. The collapsing only affects the ZN_free_moving (and the GK_sickness, but that is due to too little labelled frames), but not the OFT dataset.
Haven't tested on all datasets yet. I only have videos for OFT, EPM, ZN_free_moving, GK_sickness (last two are own projects). I could give them a try. All of these have labels for left_ and right_. But they lack some intermediate labels on the spine... I wonder if that is what is impacting the performance.
That was a thought I had. It is basically what I did with the DLC model in the beginning. There is no possibility of using the TopViewMouse model as backbone to train a semi-supervised/context model, right?
Yeah, I was thinking of how to do it. The issue is that some keypoints are present in every dataset (tail_base for example). So oversampling an image with left_ear_tip by giving it a higher weight would automatically lead to more tail_base keypoints in the dataset. Currently, I can't think of a better way to deal with this (unless I am missing some method). Do you have an idea?

themattinthehatt · 2024-08-21T16:07:45Z

yes! you won't be able to train a semi-supervised/context model directly with the TopViewMouse data, but you can definitely use that supervised/no-context backbone to to train a semi-supervised/context model on your own data (in the same way we use pretrained ImageNet/AP10k backbones for those models). All you need to do is train on the TopViewMouse and then point to those weights in the config file. The context model weights will be randomly initialized, but the ResNet-50 weights from TopViewMouse will be loaded in.
ah yes of course. Are you planning on labeling all possible points (including left_ and right_ keypoints) in your own data, or just sticking with the most common ones in the TopViewMouse dataset?

hummuscience · 2024-08-21T16:24:30Z

Sweeet! That's what I will try out then! First re-running the model with less val/test proportions and adding another dataset I was missing (with 1000ish images).
In my own datasets, I have all keypoints labelled. But not many labelled images (in total I have around 300).

hummuscience · 2024-08-21T16:26:31Z

The collapsing happens in all datasets, except for the OFT dataset... Funnily, the OFT dataset has quite a small number of labels to begin with. Very puzzling. I will do some digging to see if I did a mistake somewhere.

hummuscience · 2024-08-23T08:48:43Z

How do I actually do this?

"All you need to do is train on the TopViewMouse and then point to those weights in the config file. The context model weights will be randomly initialized, but the ResNet-50 weights from TopViewMouse will be loaded in."

I tried to pass the path to the ckpt file as backbone in the config, but that didn't work.

Now I went with resnet50_animal_ap10k as backbone and the checkpoint passed to the "checkpoint" argument. Is that the correct way?

themattinthehatt · 2024-08-23T14:46:18Z

Yes your second approach is the way to go! A little bit inefficient because it builds the backbone, then loads the ap10k weights, and then loads the weights from the checkpoint, but it's the most straightforward way to lay it out in the config file.

hummuscience · 2024-08-30T10:34:29Z

Here is a small update. I am not completely done with the tests but it seems like there is a solution to the “collapsing” issue.

The upper part is the pixel error on test images using different data as input (combined is GK + ZN, two different experiments).

The middle part is violin plots of the distance between left and right bodyparts in the same test video. The closer to 0, the worse the model.

I sorted them a bit by the average pixel error, to make it a bit easier to see patterns in the combinations.

I had a suspicion that something about the size of the input images was defining whether I see "collapsing" or not. So I ran some models with 512x512 instead of 256x256 as image_size. And this does the trick. I am not well enough versed in dealing with these models but it could be that reducing the size to 256x256 gets the keypoins "too close" to each other and the model has issues in learning to differentiate them?

All these models were run after pre-training on the TopViewMouse5k data with inflated+upsampled frames and 256x256 as image size. I also ran tests there after seeing these results, so I will post an update another time. But 512x512 seems to be a better choice for this data as it improves the result from datasets where the mice are small due to zoomed out images.

themattinthehatt · 2024-09-06T17:43:47Z

That's interesting, we've fit Lightning Pose on some fairly large images with freely moving mice and 256x256 has always worked well for us. But that has always been the single dataset/single set of keypoints setting, so maybe something about this heterogeneous dataset is different. Have you tried doing some of comparisons you did previously on the TopViewMouse5k data (with inflated+upsampled frames) looking at 256x256 versus 512x512? Might also be worth checking out 384x384 as an intermediate solution.

hummuscience · 2024-10-30T16:52:31Z

Just wanted to give a quick update about the current state of this.

I ended up training quite a few models to find out what works best for my data. Here is a summary using 3 different backbones of the DLC topviewmouse dataset and different combinations of losses. I am only using the OOD (so test) dataset for these plots. Side difference is how the ground truth and predicted difference between the sides compare. So lower values are better. The models are ordered according to how well they perform on the combination of pixel error, side difference, percent correct keypoints (PCK), and calibration error (to account for the over-confidence). I did not alter the parameters for the different losses.

One thing that is clear is that training with larger image size is generally performing better than with smaller images. The same goes for context models vs. normal models.

Surprisingly, combining both unsupervised losses is nit necessarily better than single losses.

What is still problematic is this "collapsing" or side switching. When checking predictions more closely, it seems to not only be single frames but actually happen "gradually" over multiple frames. I will change the colormap to make it more obvious and post an example of it.

As to the question of how the size of the images in the backbone and the size of the images in the "refinement" step affect the "collapsing". It seems like both have an affect. With the size in the refinement having a larger effect (in this case, its the raw distance between left and right in the predicted videos):

themattinthehatt · 2024-10-31T18:21:24Z

@hummuscience thanks for checking in, I continue to be very curious about these results. Some questions for clarification:

what am I looking at in the last row of the first plot, with all the colored circles? (and more generally, what are the columns across all the rows?)
what is ap10k vs DLC256 vs DLC512? I thought I understood DLC256 to be using a backbone pretrained on the DLC top-down data with resizing to 256x256, but then I don't understand what the entry in column=512, row=DLC256 is equivalent to in the final plot

And as a more general question, how far away are the models from being usable for answering your scientific question? How many frames have you ended up labeling so far from your own dataset?

hummuscience · 2024-11-01T10:53:13Z

So, I first trained 2 different 'backbones'. DLC256 and DLC512 which either resize to 256x256 or 512x512. I then used these backbones to train different models with my data of interest (freely moving mice). The backbones did not see any of my data and only used data from the DLC model.

These are the parameters/options chosen for the training of the model. I used hydra to try out different combinations (shown on the right). Different "backbones", different image sizes, different losses, context vs. normal model etc.
the first 3 rows are the backbones (as written on the right, its a bit crowded, sorry). The second mention of 512 and 256 is the image size used for "refining" step. The training step using my own data. So in the second image, a column of 512 and a row DLC256 means that the backbone was trained with 256x256 and the "refinement" was done with 512x512.

And to the general question. I am by now using the best performing model for my actual data (but removing the "side" keypoints as the switching/collapsing is still a problem for downstream use such as Keypoint Moseq etc.).

In total I have only labelled 250 images. I am thinking of adding more images to try and maybe improve this "switching" performance.

themattinthehatt · 2024-11-01T14:31:41Z

ah I see now! ok very cool. so the general trends are:

backbone: DLC256/DLC512 > ap10k
image size: 512 > 256
unsupervised losses: a bit mixed?
model type: mixed, maybe context is a bit better?

I still think it would be worth trying out a 384x384 model just to see how it fits into the mix.

hummuscience · 2024-11-04T07:25:04Z

Yeah. I think when it comes to unsupervised losses it's a bit mixed, while with model type, I would say context is a bit better. I might be choosing the wrong ways to compare the models, though. Please let me know if you have better ideas.

Also, the models were all trained with training.uniform_heatmaps_for_nan_keypoints: false
I am thinking of switching this option on as the models seem to be rather over-confident if not done (and that messes up downstream methods such as Keypoint-Moseqs error calibration).

The DLC team updated their Zenodo repo after I found an error in one of the datasets (which meant that I had to drop 800 frames from training the backbone before). https://zenodo.org/records/10618947

So I will retrain the backbone with the larger dataset and the different options of sampling to correct for the imbalance in labels and then run another round and update you :)

themattinthehatt · 2024-11-04T14:37:30Z

Oof sorry to hear about the issue with the super-animal dataset - yet another good reason to make data publicly available!

Regarding training.uniform_heatmaps_for_nan_keypoints, you should definitely make sure to set this to false when pretraining the backbone on the super-animal dataset, since different datasets have different keypoints annotated. But when finetuning on your own dataset, yes you can definitely set that param to true.

Regarding the model evaluation, one recommendation I have would be to look at snippets of held-out video data as well, and not just the labeled data. Since the collapsing sides is still an issue, this is fortunately a metric you can calculate on unlabeled data for all the models - just compute the pixel difference between two selected side keypoints for each frame. Then you could take 5-10 short clips of held-out mice, running inference with some of the models, and then comparing the side differences.

I would also recommend checking out our EKS post-processor - it requires training and inference with multiple models, but we've really found it to be helpful in cleaning up noise in the predictions (as long as the different models (trained on different subsets of the data) aren't all making the same mistake).

hummuscience · 2024-11-04T16:36:56Z

True. For the backbone, its bad because of the different labels. Will only activate it for the finetuning.

Regarding your suggestion, this is exactly what I did in the second plot. The distributions are from a single held-out video (15 minutes). Will add a few more videos to have more variability.

I tried the EKS post-processor. But as you said, since many of the models are doing similar mistakes, the outcomes were not worth the trouble. Will definitely check it out once things are solved.

themattinthehatt · 2024-11-04T22:11:26Z

Ah I see that makes a lot of sense - I assumed those were distances computed on labeled frames.

I'm surprised that the pca loss actually has a shorter distance. Is one of these distance distributions more "correct" than the others? Also, do you have a sense if the pca loss is good at signaling the side collapse issue?

It's interesting that multiple models would have the side collapse issue on the same frames. I guess that means there's something specific about the frames in which the collapse occurs - do you have any sense of this? (sorry if you mentioned this previously) If so, it would be helpful to identify such examples across different animals and add those frames to your labeled data.

themattinthehatt self-assigned this May 24, 2024

themattinthehatt added enhancement New feature or request app Question about the app rather than lightning-pose labels May 24, 2024

hummuscience mentioned this issue Jul 17, 2024

Error about varying image sizes in fully supervised model training #188

Closed

Would it be possible to use a pretrained model (such as Topview Mouse from DLC animal zoo)? #158

Would it be possible to use a pretrained model (such as Topview Mouse from DLC animal zoo)? #158

Comments

hummuscience commented May 24, 2024

themattinthehatt commented May 24, 2024

hummuscience commented May 27, 2024

themattinthehatt commented May 28, 2024 • edited Loading

hummuscience commented Jul 10, 2024

themattinthehatt commented Jul 10, 2024

hummuscience commented Jul 10, 2024

themattinthehatt commented Jul 10, 2024

hummuscience commented Jul 12, 2024

hummuscience commented Jul 15, 2024

hummuscience commented Jul 15, 2024

themattinthehatt commented Jul 15, 2024 • edited Loading

hummuscience commented Jul 16, 2024

hummuscience commented Jul 16, 2024

themattinthehatt commented Jul 16, 2024

hummuscience commented Jul 16, 2024 • edited Loading

hummuscience commented Jul 16, 2024

themattinthehatt commented Jul 16, 2024

themattinthehatt commented Jul 16, 2024

hummuscience commented Jul 16, 2024

themattinthehatt commented Jul 16, 2024

hummuscience commented Jul 16, 2024

themattinthehatt commented Jul 17, 2024

themattinthehatt commented Jul 17, 2024

hummuscience commented Jul 17, 2024

hummuscience commented Jul 17, 2024

hummuscience commented Jul 18, 2024

themattinthehatt commented Jul 18, 2024

hummuscience commented Aug 15, 2024

themattinthehatt commented Aug 20, 2024

hummuscience commented Aug 21, 2024

hummuscience commented Aug 21, 2024

themattinthehatt commented Aug 21, 2024

hummuscience commented Aug 21, 2024

themattinthehatt commented Aug 21, 2024

hummuscience commented Aug 21, 2024

hummuscience commented Aug 21, 2024

hummuscience commented Aug 23, 2024

themattinthehatt commented Aug 23, 2024

hummuscience commented Aug 30, 2024 • edited Loading

themattinthehatt commented Sep 6, 2024

hummuscience commented Oct 30, 2024 • edited Loading

themattinthehatt commented Oct 31, 2024

hummuscience commented Nov 1, 2024

themattinthehatt commented Nov 1, 2024

hummuscience commented Nov 4, 2024 • edited Loading

themattinthehatt commented Nov 4, 2024

hummuscience commented Nov 4, 2024

themattinthehatt commented Nov 4, 2024

themattinthehatt commented May 28, 2024 •

edited

Loading

themattinthehatt commented Jul 15, 2024 •

edited

Loading

hummuscience commented Jul 16, 2024 •

edited

Loading

hummuscience commented Aug 30, 2024 •

edited

Loading

hummuscience commented Oct 30, 2024 •

edited

Loading

hummuscience commented Nov 4, 2024 •

edited

Loading