feat: add open-clip #130

Open · long8v wants to merge 1 commit into main
Conversation

@long8v commented Jun 21, 2023

review open clip

@long8v left a comment

06/21: read the readme.md


Welcome to an open source implementation of OpenAI's [CLIP](https://arxiv.org/abs/2103.00020) (Contrastive Language-Image Pre-training).

The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset.
The point is to check whether the model is also robust to the distribution shift discussed in CLIP,
and to verify that the original CLIP models can be reproduced.

Specifically, a ResNet-50 model trained with our codebase on OpenAI's [15 million image subset of YFCC](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) achieves **32.7%** top-1 accuracy on ImageNet. OpenAI's CLIP model reaches **31.3%** when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/download) dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy.
To verify the reproduction, they trained a ResNet-50 with this codebase on the refined 15M YFCC subset and got 32.7% ImageNet accuracy; the original OpenAI CLIP model gets 31.3% on the same subset, so the reproduction looks successful.

We further this with a replication study on a dataset of comparable size to OpenAI's, [LAION-400M](https://arxiv.org/abs/2111.02114), and with larger datasets such as [LAION-2B](https://laion.ai/blog/laion-5b/) and [DataComp-1B](https://arxiv.org/abs/2304.14108) datasets. In addition, we study scaling behavior in a paper on [reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143).
They extend this with replication studies on LAION-400M and on larger datasets (LAION-2B, DataComp-1B), and show scaling laws for CLIP in a separate paper.

We have trained the following ViT CLIP models:
* ViT-B/32 on LAION-400M with an accuracy of **62.9%**, comparable to OpenAI's **63.2%**, zero-shot top-1 on ImageNet-1k
The model trained on LAION-400M performs almost on par with CLIP trained on its WIT-400M.


* ViT-B/32 on LAION-2B with an accuracy of **66.6%**.
Scaling the training set from 400M to 2B takes ViT-B/32 zero-shot accuracy from 62.9% to 66.6%.

Comment on lines +445 to +447
## Pretrained model details

### LAION-400M - https://laion.ai/laion-400-open-dataset
Detailed notes on the pretrained models. Training on the JUWELS supercomputer comes up repeatedly.
https://en.wikipedia.org/wiki/JUWELS

Comment on lines +465 to +473
#### ViT-B/16 224x224

The B/16 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 67.07.

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_clip_zeroshot_b16.png" width="700">

This was the first major train session using the updated webdataset 0.2.x code. A bug was found that prevented shards from being shuffled properly between nodes/workers each epoch. This was fixed part way through training (epoch 26) but likely had an impact.

ViT-B/16 was trained with 176 A100 (40 GB) GPUS for ~61 hours, 10700 GPU-hours. Batch size per GPU was 192 for a global batch size of 33792.
ViT-B/16

  • Trained on 176 A100s for ~61 hours; batch size per GPU was 192, so the global batch size was 33792.
  • There was a webdataset bug where shards were not shuffled across nodes/workers. It was fixed partway through training, but it may still have affected the result.

Comment on lines +475 to +488
#### ViT-B/16+ 240x240

The B/16+ 240x240 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 69.21.

This model is the same depth as the B/16, but increases the
* vision width from 768 -> 896
* text width from 512 -> 640
* the resolution 224x224 -> 240x240 (196 -> 225 tokens)

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_clip_zeroshot_b16_plus_240.png" width="700">

Unlike the B/16 run above, this model was a clean run with no dataset shuffling issues.

ViT-B/16+ was trained with 224 A100 (40 GB) GPUS for ~61 hours, 13620 GPU-hours. Batch size per GPU was 160 for a global batch size of 35840.
ViT-B/16+ 240x240

  • The resolution was increased to 240x240 (196 vision tokens -> 225 tokens),
  • and accordingly the vision encoder width went from 768 -> 896 and the text hidden dim from 512 -> 640.
  • Trained on 224 A100s for ~61 hours.
  • Batch size per GPU was 160, for a global batch size of 35840.

Comment on lines +511 to +525
#### ViT-L/14 224x224

A ViT-L/14 with a 75.3% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K

These weights use a different dataset mean and std than others. Instead of using the OpenAI mean & std, inception style normalization `[-1, 1]` is used via a mean and std of `[0.5, 0.5, 0.5]`. This is handled automatically if using `open_clip.create_model_and_transforms` from pretrained weights.

#### ViT-H/14 224x224

A ViT-H/14 with a 78.0% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K

#### ViT-g/14 224x224

A ViT-g/14 with a 76.6% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K

This model was trained with a shorter schedule than other LAION-2B models with 12B samples seen instead of 32+B. It matches LAION-400M training in samples seen. Many zero-shot results are lower as a result, but despite this it performs very well in some OOD zero-shot and retrieval tasks.
From ViT-L onward they trained on the supercomputer, so I'm not sure how long these would take on plain A100 clusters.

Comment on lines +528 to +543
#### ViT-B/32 roberta base

A ViT-B/32 with roberta base encoder with a 61.7% top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
This is the first openclip model using a HF text tower. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k/blob/main/unknown.png)

#### ViT-B/32 xlm roberta base

A ViT-B/32 with xlm roberta base encoder with a 62.33% top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
This is the first openclip model trained on the full laion5B dataset; hence the first multilingual clip trained with openclip. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k/blob/main/metrics.png)
A preliminary multilingual evaluation was run: 43% on imagenet1k italian (vs 21% for english B/32), 37% for imagenet1k japanese (vs 1% for english B/32 and 50% for B/16 clip japanese). It shows the multilingual property is indeed there as expected. Larger models will get even better performance.

#### ViT-H/14 xlm roberta large

A ViT-H/14 with xlm roberta large encoder with a 77.0% (vs 78% for the english equivalent) top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k

This model was trained following the [LiT](https://arxiv.org/abs/2111.07991) methodology: the image tower was frozen (initialized from english openclip ViT-H/14), the text tower was initialized from [xlm roberta large](https://huggingface.co/xlm-roberta-large) and unfrozen. This reduced training cost by a 3x factor.
ViT-B/32 + RoBERTa text encoder gets 61.7% accuracy, versus the earlier statement "We replicate OpenAI's results on ViT-B/32, reaching a top-1 ImageNet-1k zero-shot accuracy of 62.96%."
The other models train the text encoder from scratch; initializing it from pretrained RoBERTa drops ImageNet-1k from 62.96% to 61.7%. It apparently does better on some other tasks, though.


For the xlm roberta model they used the LiT approach: the English openclip ViT-H/14 image tower is frozen, and the text tower is initialized from xlm roberta and left unfrozen. This cut training cost by about 3x.

@long8v left a comment

23.06.27.
Browsing the src/training/data code.

@@ -0,0 +1,563 @@
import ast
Let's read src/training/data.py.

Comment on lines +15 to +21
import torchvision.datasets as datasets
import webdataset as wds
from PIL import Image
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler, IterableDataset, get_worker_info
from torch.utils.data.distributed import DistributedSampler
from webdataset.filters import _shuffle
from webdataset.tariterators import base_plus_ext, url_opener, tar_file_expander, valid_sample
Mostly package imports.

webdataset documentation: https://webdataset.github.io/webdataset/gettingstarted/
The basic idea is a package built to pull matching jpg / json files (same base name) out of tar shards, and it works the same whether the shards sit on local disk or in the cloud.

from itertools import islice

import webdataset as wds

url = "http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar"
url = f"pipe:curl -L -s {url} || true"

dataset = wds.WebDataset(url)

for sample in islice(dataset, 0, 3):
    for key, value in sample.items():
        print(key, repr(value)[:50])
    print()

It also supports simple transforms:

dataset = (
    wds.WebDataset(url)
    .shuffle(100)
    .decode("rgb")
    .to_tuple("jpg;png", "json")
)

for image, data in islice(dataset, 0, 3):
    print(image.shape, image.dtype, type(data))

Comment on lines +23 to +26
try:
    import horovod.torch as hvd
except ImportError:
    hvd = None
horovod: a package that helps with distributed training.
https://github.com/horovod/horovod

hvd = None


class CsvDataset(Dataset):
A Dataset class that loads from a csv. Nothing special.
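For reference, a minimal sketch of what a csv-backed image-text Dataset like this typically looks like (paraphrased from memory, not guaranteed to match the exact CsvDataset source; the column names are whatever --csv-img-key / --csv-caption-key point at):

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class SimpleCsvDataset(Dataset):
    """Hypothetical csv dataset: one row per (image path, caption) pair."""
    def __init__(self, input_filename, transforms, img_key, caption_key, sep="\t", tokenizer=None):
        df = pd.read_csv(input_filename, sep=sep)
        self.images = df[img_key].tolist()
        self.captions = df[caption_key].tolist()
        self.transforms = transforms
        self.tokenize = tokenizer

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        image = self.transforms(Image.open(str(self.images[idx])))
        text = self.tokenize([str(self.captions[idx])])[0]
        return image, text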

Comment on lines +74 to +93
def expand_urls(urls, weights=None):
    if weights is None:
        expanded_urls = wds.shardlists.expand_urls(urls)
        return expanded_urls, None
    if isinstance(urls, str):
        urllist = urls.split("::")
        weights = weights.split('::')
        assert len(weights) == len(urllist),\
            f"Expected the number of data components ({len(urllist)}) and weights({len(weights)}) to match."
        weights = [float(weight) for weight in weights]
        all_urls, all_weights = [], []
        for url, weight in zip(urllist, weights):
            expanded_url = list(braceexpand.braceexpand(url))
            expanded_weights = [weight for _ in expanded_url]
            all_urls.extend(expanded_url)
            all_weights.extend(expanded_weights)
        return all_urls, all_weights
    else:
        all_urls = list(urls)
        return all_urls, weights
A function that expands urls.
The readme describes a multiple-dataset feature (--train-data '/data/cc12m/cc12m-train-{0000..2175}.tar'::/data/LAION-400M/{00000..41455}.tar), and this is what implements it.
It uses braceexpand, which is what makes mixing (and resampling) multiple datasets possible, as shown below.
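To make the :: syntax concrete, here is what braceexpand does and how expand_urls (above) pairs it with per-source weights; the shard paths below are made up:

import braceexpand

# a brace pattern expands to an explicit list of shards
print(list(braceexpand.braceexpand("/data/cc12m/cc12m-train-{0000..0002}.tar")))
# ['/data/cc12m/cc12m-train-0000.tar', '/data/cc12m/cc12m-train-0001.tar', '/data/cc12m/cc12m-train-0002.tar']

# two '::'-separated sources with matching '::'-separated weights
urls, weights = expand_urls(
    "/data/cc12m/cc12m-train-{0000..0001}.tar::/data/LAION-400M/{00000..00001}.tar",
    weights="1.0::2.0",
)
# urls    -> 4 shard paths (2 per source)
# weights -> [1.0, 1.0, 2.0, 2.0], i.e. one weight per expanded shard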

Comment on lines +362 to +381
if is_train:
    if not resampled:
        pipeline.extend([
            detshuffle2(
                bufsize=_SHARD_SHUFFLE_SIZE,
                initial=_SHARD_SHUFFLE_INITIAL,
                seed=args.seed,
                epoch=shared_epoch,
            ),
            wds.split_by_node,
            wds.split_by_worker,
        ])
    pipeline.extend([
        # at this point, we have an iterator over the shards assigned to each worker at each node
        tarfile_to_samples_nothrow,  # wds.tarfile_to_samples(handler=log_and_continue),
        wds.shuffle(
            bufsize=_SAMPLE_SHUFFLE_SIZE,
            initial=_SAMPLE_SHUFFLE_INITIAL,
        ),
    ])
detshuffle2 across shards, then splitting by node / worker, and then shuffling samples within the shards.

Comment on lines +416 to +422
dataloader = wds.WebLoader(
    dataset,
    batch_size=None,
    shuffle=False,
    num_workers=args.workers,
    persistent_workers=args.workers > 0,
)
Wraps it in wds.WebLoader and returns the dataloader.

dataloader.num_batches = num_batches
dataloader.num_samples = num_samples

return DataInfo(dataloader=dataloader, shared_epoch=shared_epoch)
Returns a DataInfo holding the dataloader and the shared_epoch(?), and that's it.
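Roughly, DataInfo / SharedEpoch look like the sketch below (paraphrased, field names may differ slightly from the source). shared_epoch is a multiprocessing-safe counter, so dataloader worker processes can see the current epoch when they reseed the shard shuffle:

from dataclasses import dataclass
from multiprocessing import Value

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


class SharedEpoch:
    def __init__(self, epoch: int = 0):
        self.shared_epoch = Value('i', epoch)  # int shared across worker processes

    def set_value(self, epoch):
        self.shared_epoch.value = epoch

    def get_value(self):
        return self.shared_epoch.value


@dataclass
class DataInfo:
    dataloader: DataLoader
    sampler: DistributedSampler = None
    shared_epoch: SharedEpoch = None

    def set_epoch(self, epoch):
        # the training loop calls this once per epoch
        if self.shared_epoch is not None:
            self.shared_epoch.set_value(epoch)
        if self.sampler is not None and isinstance(self.sampler, DistributedSampler):
            self.sampler.set_epoch(epoch)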

Comment on lines +445 to +472
def get_csv_dataset(args, preprocess_fn, is_train, epoch=0, tokenizer=None):
    input_filename = args.train_data if is_train else args.val_data
    assert input_filename
    dataset = CsvDataset(
        input_filename,
        preprocess_fn,
        img_key=args.csv_img_key,
        caption_key=args.csv_caption_key,
        sep=args.csv_separator,
        tokenizer=tokenizer
    )
    num_samples = len(dataset)
    sampler = DistributedSampler(dataset) if args.distributed and is_train else None
    shuffle = is_train and sampler is None

    dataloader = DataLoader(
        dataset,
        batch_size=args.batch_size,
        shuffle=shuffle,
        num_workers=args.workers,
        pin_memory=True,
        sampler=sampler,
        drop_last=is_train,
    )
    dataloader.num_samples = num_samples
    dataloader.num_batches = len(dataloader)

    return DataInfo(dataloader, sampler)
Looks like csv mode only supports a single file?

Comment on lines +525 to +542
def get_dataset_fn(data_path, dataset_type):
    if dataset_type == "webdataset":
        return get_wds_dataset
    elif dataset_type == "csv":
        return get_csv_dataset
    elif dataset_type == "synthetic":
        return get_synthetic_dataset
    elif dataset_type == "auto":
        ext = data_path.split('.')[-1]
        if ext in ['csv', 'tsv']:
            return get_csv_dataset
        elif ext in ['tar']:
            return get_wds_dataset
        else:
            raise ValueError(
                f"Tried to figure out dataset type, but failed for extension {ext}.")
    else:
        raise ValueError(f"Unsupported dataset type: {dataset_type}")
Yep, seems so.
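e.g. (paths are made up):

get_dataset_fn("/data/cc3m/train.csv", "auto")     # -> get_csv_dataset ('csv' extension)
get_dataset_fn("/data/laion/00000.tar", "auto")    # -> get_wds_dataset ('tar' extension)
get_dataset_fn("whatever", "webdataset")           # -> get_wds_dataset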

@long8v left a comment

23.07.05
Browsing the main / train / loss / model code.

Comment on lines +9 to +20

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.checkpoint import checkpoint

from .hf_model import HFTextEncoder
from .modified_resnet import ModifiedResNet
from .timm_model import TimmModel
from .transformer import LayerNormFp32, LayerNorm, QuickGELU, Attention, VisionTransformer, TextTransformer
from .utils import to_2tuple
For the model: looks like the text encoder can come from HF and the vision encoder from timm.

Comment on lines +23 to +30
@dataclass
class CLIPVisionCfg:
    layers: Union[Tuple[int, int, int, int], int] = 12
    width: int = 768
    head_width: int = 64
    mlp_ratio: float = 4.0
    patch_size: int = 16
    image_size: Union[Tuple[int, int], int] = 224
The configs are dataclasses.
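So building a variant is just instantiating it with different fields, e.g. (values here are illustrative, roughly a ViT-L/14):

vision_cfg = CLIPVisionCfg(layers=24, width=1024, head_width=64, patch_size=14, image_size=224)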

ls_init_value: Optional[float] = None # layer scale initial value
patch_dropout: float = 0. # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
input_patchnorm: bool = False # whether to use dual patchnorm - would only apply the input layernorm on each patch, as post-layernorm already exist in original clip vit design
global_average_pool: bool = False # whether to global average pool the last embedding layer, instead of using CLS token (https://arxiv.org/abs/2205.01580)
Apparently using GAP is known to work better than the [cls] token
-> Better plain ViT baselines for ImageNet-1k
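i.e. the difference at the pooling step is roughly (a simplified sketch, not the library's exact pooling code):

# x: (batch, 1 + num_patches, width) output tokens of the ViT trunk
if global_average_pool:
    pooled = x.mean(dim=1)   # average over all tokens ("Better plain ViT baselines")
else:
    pooled = x[:, 0]         # take the [CLS] token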


ls_init_value: Optional[float] = None # layer scale initial value
patch_dropout: float = 0. # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
input_patchnorm: bool = False # whether to use dual patchnorm - would only apply the input layernorm on each patch, as post-layernorm already exist in original clip vit design
Whether to apply an input layernorm to each patch.

Post-layernorm already exists?

Hmm, it seems like the original ViT would have input_patchnorm = True, though?

Ah, I see: it applies a layernorm per patch to $x_p^i$ before the linear (fc) projection.

image_size: Union[Tuple[int, int], int] = 224

ls_init_value: Optional[float] = None # layer scale initial value
patch_dropout: float = 0. # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
FLIP: just dropping a fraction of the image patches during training.
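A minimal sketch of the idea (FLIP-style token dropping; this is a simplified stand-in, not the repository's own patch-dropout module):

import torch

def drop_patches(x: torch.Tensor, prob: float) -> torch.Tensor:
    """x: (batch, num_patches, width) patch tokens (CLS excluded). Keep a random (1 - prob) fraction."""
    if prob <= 0.:
        return x
    batch, num_patches, width = x.shape
    num_keep = max(1, int(num_patches * (1. - prob)))
    # pick a different random subset of patches to keep for every image in the batch
    keep_idx = torch.rand(batch, num_patches, device=x.device).topk(num_keep, dim=-1).indices
    return x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, width))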

@@ -0,0 +1,216 @@
import torch
Let's read the loss!

hvd = None


def gather_features(
Gathering the features.

Comment on lines +48 to +50
if gather_with_grad:
    all_image_features = torch.cat(torch.distributed.nn.all_gather(image_features), dim=0)
    all_text_features = torch.cat(torch.distributed.nn.all_gather(text_features), dim=0)

Comment on lines +51 to +55
else:
    gathered_image_features = [torch.zeros_like(image_features) for _ in range(world_size)]
    gathered_text_features = [torch.zeros_like(text_features) for _ in range(world_size)]
    dist.all_gather(gathered_image_features, image_features)
    dist.all_gather(gathered_text_features, text_features)
In the else branch it first allocates zero tensors of the right shape and then fills them via dist.all_gather?

    import torch.distributed.nn
    from torch import distributed as dist

Huh, what's actually different between these two?

Comment on lines +56 to +59
if not local_loss:
    # ensure grads for local rank when all_* features don't have a gradient
    gathered_image_features[rank] = image_features
    gathered_text_features[rank] = text_features
If local_loss is off, each rank's own features are put back into its slot of the gathered list.

Note that gradients do not flow through these gathered features!
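Putting the two branches side by side (a condensed sketch of the same logic; assumes torch.distributed is already initialized and rank/world_size come from the process group):

import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives

def gather_all(features, rank, world_size, gather_with_grad, local_loss):
    if gather_with_grad:
        # differentiable all_gather: gradients flow back to every rank's features
        return torch.cat(torch.distributed.nn.all_gather(features), dim=0)
    # plain all_gather: the gathered copies are detached from the autograd graph
    gathered = [torch.zeros_like(features) for _ in range(world_size)]
    dist.all_gather(gathered, features)
    if not local_loss:
        # put this rank's own (grad-carrying) tensor back into its slot
        gathered[rank] = features
    return torch.cat(gathered, dim=0)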

@long8v left a comment

23.07.06
Looked at the remaining configs and utils. Tomorrow, start from the ViT in open_clip/transformer.

@@ -0,0 +1,56 @@
# HF architecture dict:
arch_dict = {
hf_configs
The text encoders typically used here seem to be roberta / xlm-roberta / mt5 / bert.

return x.last_hidden_state[:, self.cls_token_position, :]


class HFTextEncoder(nn.Module):
Takes a Hugging Face text encoder and adds a few classes on top of it.

Comment on lines +119 to +122
if config is None:
    self.config = AutoConfig.from_pretrained(model_name_or_path)
    create_func, model_args = (AutoModel.from_pretrained, model_name_or_path) if pretrained else (
        AutoModel.from_config, self.config)
First, it loads the model.

Comment on lines +35 to +93
# TODO: ?last - for gpt-like models
_POOLERS = {}


def register_pooler(cls):
    """Decorator registering pooler class"""
    _POOLERS[_camel2snake(cls.__name__)] = cls
    return cls


@register_pooler
class MeanPooler(nn.Module):
    """Mean pooling"""

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        masked_output = x.last_hidden_state * attention_mask.unsqueeze(-1)
        return masked_output.sum(dim=1) / attention_mask.sum(-1, keepdim=True)


@register_pooler
class MaxPooler(nn.Module):
    """Max pooling"""

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        masked_output = x.last_hidden_state.masked_fill(attention_mask.unsqueeze(-1), -torch.inf)
        return masked_output.max(1).values


@register_pooler
class ClsPooler(nn.Module):
    """CLS token pooling"""

    def __init__(self, use_pooler_output=True):
        super().__init__()
        self.cls_token_position = 0
        self.use_pooler_output = use_pooler_output

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        if (self.use_pooler_output and
                isinstance(x, (BaseModelOutputWithPooling, BaseModelOutputWithPoolingAndCrossAttentions)) and
                (x.pooler_output is not None)
        ):
            return x.pooler_output

        return x.last_hidden_state[:, self.cls_token_position, :]


@register_pooler
class ClsLastHiddenStatePooler(nn.Module):
    """CLS token pooling
    NOTE: this is equivalent to ClsPooler above with use_pooler_output=False
    """

    def __init__(self):
        super().__init__()
        self.cls_token_position = 0

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        return x.last_hidden_state[:, self.cls_token_position, :]
These are the available poolers.
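A tiny usage example of the registry (the 'mean_pooler' key follows from _camel2snake(MeanPooler), and the mask convention is assumed to be 1 = real token, 0 = padding):

import torch
from transformers.modeling_outputs import BaseModelOutput

hidden = torch.randn(2, 4, 8)                      # (batch, seq, hidden)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])  # padding masked out with 0

pooler = _POOLERS['mean_pooler']()                 # MeanPooler was registered under 'mean_pooler'
out = pooler(BaseModelOutput(last_hidden_state=hidden), mask)
print(out.shape)                                   # torch.Size([2, 8]): mean over the unmasked tokens only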

Comment on lines +142 to +152
if (d_model == output_dim) and (proj is None):  # do we always need a proj?
    self.proj = nn.Identity()
elif proj == 'linear':
    self.proj = nn.Linear(d_model, output_dim, bias=False)
elif proj == 'mlp':
    hidden_size = (d_model + output_dim) // 2
    self.proj = nn.Sequential(
        nn.Linear(d_model, hidden_size, bias=False),
        nn.GELU(),
        nn.Linear(hidden_size, output_dim, bias=False),
    )
The part that pools and then projects.

Comment on lines +129 to +132
q, k, v = F.linear(x, self.in_proj_weight, self.in_proj_bias).chunk(3, dim=-1)
q = q.contiguous().view(L, N * self.num_heads, -1).transpose(0, 1)
k = k.contiguous().view(L, N * self.num_heads, -1).transpose(0, 1)
v = v.contiguous().view(L, N * self.num_heads, -1).transpose(0, 1)
Projects Q, K, V in one go and then splits them with chunk.

Comment on lines +134 to +137
if self.logit_scale is not None:
    attn = torch.bmm(F.normalize(q, dim=-1), F.normalize(k, dim=-1).transpose(-1, -2))
    logit_scale = torch.clamp(self.logit_scale, max=self.logit_scale_max).exp()
    attn = attn.view(N, self.num_heads, L, L) * logit_scale
Seems to be a value multiplied onto the attention scores.

Ah, so it's that one: $\frac{QK^T}{\sqrt{d_k}}$

Ah no, that one exists separately as self.scale... Not sure exactly what this is; I guess it swaps the fixed $\sqrt{d_k}$ scaling for something learnable.
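This looks like scaled cosine attention (similar to what Swin Transformer V2 uses): q and k are L2-normalized and a learnable, clamped temperature takes the place of the fixed 1/sqrt(d_k). A rough sketch of the two variants, not the module itself (the clamp ceiling is an assumed default):

import math
import torch
import torch.nn.functional as F

def attention_scores(q, k, logit_scale=None, logit_scale_max=math.log(1. / 0.01)):
    # q, k: (batch * heads, seq_len, head_dim)
    if logit_scale is None:
        # standard scaled dot-product attention with a fixed 1/sqrt(d_k) temperature
        return torch.bmm(q, k.transpose(-1, -2)) / math.sqrt(q.shape[-1])
    # cosine attention: normalize q and k, then apply a learnable (clamped) scale
    attn = torch.bmm(F.normalize(q, dim=-1), F.normalize(k, dim=-1).transpose(-1, -2))
    return attn * torch.clamp(logit_scale, max=logit_scale_max).exp()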

return x


class AttentionalPooler(nn.Module):
The attentional pooler from CoCa.

        norm_layer: Callable = LayerNorm
):
    super().__init__()
    self.query = nn.Parameter(torch.randn(n_queries, d_model))
The query is just a randomly initialized embedding.

Comment on lines +178 to +183
def forward(self, x: torch.Tensor):
    x = self.ln_k(x).permute(1, 0, 2)  # NLD -> LND
    N = x.shape[1]
    q = self.ln_q(self.query)
    out = self.attn(self._repeat(q, N), x, x, need_weights=False)[0]
    return out.permute(1, 0, 2)  # LND -> NLD
attn

  • Q: q repeated along K's batch dimension
  • K = V: x after the key layernorm
    Ah, so it's just MHA with learnable random-embedding queries.
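Usage-wise it just compresses a variable-length token sequence into a fixed number of query slots; the constructor arguments below (d_model, context_dim, n_head, n_queries) are my assumption of the signature:

import torch

pooler = AttentionalPooler(d_model=768, context_dim=768, n_head=8, n_queries=256)
tokens = torch.randn(4, 197, 768)   # (batch, seq, width) ViT output tokens
pooled = pooler(tokens)
print(pooled.shape)                 # torch.Size([4, 256, 768]): one vector per learned query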

@long8v left a comment

23.07.19.
Revisited the data resampling part.

yield dict(url=self.rng.choices(self.urls, weights=self.weights, k=1)[0])


def get_wds_dataset(args, preprocess_img, is_train, epoch=0, floor=False, tokenizer=None):
Re-reading the shuffle-related parts:

Comment on lines +349 to +355
if resampled:
    pipeline = [ResampledShards2(
        input_shards,
        weights=args.train_data_upsampling_factors,
        deterministic=True,
        epoch=shared_epoch,
    )]
When resampling, a ResampledShards2 instance is put into the pipeline; its implementation is

return _shuffle(src, self.bufsize, self.initial, rng)


class ResampledShards2(IterableDataset):
an object that inherits from IterableDataset and shuffles the urls (i.e., the shards).

Comment on lines +304 to +325
def __iter__(self):
    """Return an iterator over the shards."""
    if isinstance(self.epoch, SharedEpoch):
        epoch = self.epoch.get_value()
    else:
        # NOTE: this is epoch tracking is problematic in a multiprocess (dataloader workers or train)
        # situation as different workers may wrap at different times (or not at all).
        self.epoch += 1
        epoch = self.epoch
    if self.deterministic:
        # reset seed w/ epoch if deterministic
        if self.worker_seed is None:
            # pytorch worker seed should be deterministic due to being init by arg.seed + rank + worker id
            seed = pytorch_worker_seed(epoch)
        else:
            seed = self.worker_seed() + epoch
        self.rng.seed(seed)
    for _ in range(self.nshards):
        if self.weights is None:
            yield dict(url=self.rng.choice(self.urls))
        else:
            yield dict(url=self.rng.choices(self.urls, weights=self.weights, k=1)[0])
Basically all it does is build a seed and pick shards with the rng.
deterministic is the part that sets the seed; without it there is no fixed seed (totally random).
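The core of it is just a seeded random.Random doing weighted sampling with replacement over the shard urls, e.g.:

import random

urls = ["shard-000.tar", "shard-001.tar", "shard-002.tar"]
weights = [1.0, 1.0, 4.0]        # e.g. upsample the third source 4x

rng = random.Random()
rng.seed(1234 + 3)               # something like worker_seed + epoch, so each epoch resamples differently
picked = [rng.choices(urls, weights=weights, k=1)[0] for _ in range(6)]
# sampling WITH replacement: the same shard can show up several times per "epoch"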

Comment on lines +356 to +359
else:
    assert args.train_data_upsampling_factors is None,\
        "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."
    pipeline = [wds.SimpleShardList(input_shards)]
Otherwise it just uses SimpleShardList; that one can be the stock wds implementation as-is.

Comment on lines +364 to +373
pipeline.extend([
    detshuffle2(
        bufsize=_SHARD_SHUFFLE_SIZE,
        initial=_SHARD_SHUFFLE_INITIAL,
        seed=args.seed,
        epoch=shared_epoch,
    ),
    wds.split_by_node,
    wds.split_by_worker,
])
Calls detshuffle2. What that is:

But this still seems to operate per shard, i.e., the shuffling here is at the shard level.

_SAMPLE_SHUFFLE_INITIAL = 1000


class detshuffle2(wds.PipelineStage):
This is it:

Comment on lines +263 to +271
rng = random.Random()
if self.seed < 0:
    # If seed is negative, we use the worker's seed, this will be different across all nodes/workers
    seed = pytorch_worker_seed(epoch)
else:
    # This seed to be deterministic AND the same across all nodes/workers in each epoch
    seed = self.seed + epoch
rng.seed(seed)
return _shuffle(src, self.bufsize, self.initial, rng)
It doesn't do much: it derives a seed from the epoch and calls _shuffle.
Looking at the code, _shuffle is just shuffling, but it uses a buffer and shuffles within that buffer.
So with 10,000 samples and a buffer of 1,000, shuffling effectively happens within sliding windows of about 1,000 samples.

def _shuffle(data, bufsize=1000, initial=100, rng=None, handler=None):
    """Shuffle the data in the stream.

    This uses a buffer of size `bufsize`. Shuffling at
    startup is less random; this is traded off against
    yielding samples quickly.

    data: iterator
    bufsize: buffer size for shuffling
    returns: iterator
    rng: either random module or random.Random instance

    """
    if rng is None:
        rng = random.Random(int((os.getpid() + time.time()) * 1e9))
    initial = min(initial, bufsize)
    buf = []
    for sample in data:
        buf.append(sample)
        if len(buf) < bufsize:
            try:
                buf.append(next(data))  # skipcq: PYL-R1708
            except StopIteration:
                pass
        if len(buf) >= initial:
            yield pick(buf, rng)  # pick() removes and returns a random element of the buffer
    while len(buf) > 0:
        yield pick(buf, rng)

Comment on lines +371 to +372
wds.split_by_node,
wds.split_by_worker,
Ah, here is where it splits shards across nodes / workers.

def split_by_node(src, group=None):
    rank, world_size, worker, num_workers = utils.pytorch_worker_info(group=group)
    if world_size > 1:
        for s in islice(src, rank, None, world_size):
            yield s
    else:
        for s in src:
            yield s

Its role is just to partition the already-shuffled shards.

Comment on lines +377 to +380
wds.shuffle(
    bufsize=_SAMPLE_SHUFFLE_SIZE,
    initial=_SAMPLE_SHUFFLE_INITIAL,
),
wds.shuffle here is the sample-level shuffle within shards.

Comment on lines +382 to +387
else:
    pipeline.extend([
        wds.split_by_worker,
        # at this point, we have an iterator over the shards assigned to each worker
        wds.tarfile_to_samples(handler=log_and_continue),
    ])
In the else branch, only split_by_worker is applied.
