use SAM model to generate the masks of photos and then The process of selectively merging the regions corresponding to these masks is supervised by CLIP, which measures the score of the categories recognized by CLIP in the image area. The merging process refers to the selective search algorithm. Due to the effective recognition ability of CLIP for small targets, this project can ultimately generate a segmented image of the given image, including masks and corresponding categories, which can be automatically annotated for the target segmented dataset, To replace manual annotation.
On the far left is the original image, in the middle is the automatic annotation (category not marked) made for this project, and on the right is the result of manual annotation. The annotation method is different, so it looks different. The recognition results are very similar.