feat: add open-clip #130
base: main
Conversation
06.21. Read README.md.
Welcome to an open source implementation of OpenAI's [CLIP](https://arxiv.org/abs/2103.00020) (Contrastive Language-Image Pre-training).

The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset.
The aim is to check robustness to the distribution shift discussed in the CLIP paper, and to confirm that the original CLIP models can be reproduced.
Specifically, a ResNet-50 model trained with our codebase on OpenAI's [15 million image subset of YFCC](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) achieves **32.7%** top-1 accuracy on ImageNet. OpenAI's CLIP model reaches **31.3%** when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/download) dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy.
To check the reproduction: a ResNet-50 trained here on the refined 15M subset of YFCC reaches 32.7% on ImageNet, while the original OpenAI CLIP model gets 31.3% on the same subset, so the reproduction can be considered successful.
We further this with a replication study on a dataset of comparable size to OpenAI's, [LAION-400M](https://arxiv.org/abs/2111.02114), and with larger datasets such as [LAION-2B](https://laion.ai/blog/laion-5b/) and [DataComp-1B](https://arxiv.org/abs/2304.14108) datasets. In addition, we study scaling behavior in a paper on [reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143).
They extend this to the LAION-400M / LAION-2B / DataComp-1B datasets and show scaling laws for CLIP.
We have trained the following ViT CLIP models:
* ViT-B/32 on LAION-400M with a accuracy of **62.9%**, comparable to OpenAI's **63.2%**, zero-shot top-1 on ImageNet-1k
The model trained on LAION-400M performs almost on par with CLIP trained on OpenAI's WIT-400M.
* ViT-B/32 on LAION-2B with a accuracy of **66.6%**.
Scaling the training data from LAION-400M to LAION-2B lifts ImageNet zero-shot top-1 from 62.9% to 66.6%.
## Pretrained model details

### LAION-400M - https://laion.ai/laion-400-open-dataset
Detailed notes on the pretrained models. It comes up repeatedly that they were trained on the JUWELS supercomputer.
https://en.wikipedia.org/wiki/JUWELS
#### ViT-B/16 224x224

The B/16 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 67.07.

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_clip_zeroshot_b16.png" width="700">

This was the first major train session using the updated webdataset 0.2.x code. A bug was found that prevented shards from being shuffled properly between nodes/workers each epoch. This was fixed part way through training (epoch 26) but likely had an impact.

ViT-B/16 was trained with 176 A100 (40 GB) GPUS for ~61 hours, 10700 GPU-hours. Batch size per GPU was 192 for a global batch size of 33792.
ViT-B/16
- Trained with 176 A100s for ~61 hours; batch size per GPU was 192, for a global batch size of 33792 (quick check below).
- webdataset had a bug where shards were not shuffled across nodes/workers; it was fixed partway through training, but it may still have affected the results.
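A quick sanity check of the reported numbers (the rounding of the GPU-hours figure is my assumption):

```python
gpus = 176            # A100 40 GB
per_gpu_batch = 192
hours = 61

print(gpus * per_gpu_batch)  # 33792, matches the reported global batch size
print(gpus * hours)          # 10736, roughly the ~10700 GPU-hours quoted above
```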
#### ViT-B/16+ 240x240

The B/16+ 240x240 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 69.21.

This model is the same depth as the B/16, but increases the
* vision width from 768 -> 896
* text width from 512 -> 640
* the resolution 224x224 -> 240x240 (196 -> 225 tokens)

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_clip_zeroshot_b16_plus_240.png" width="700">

Unlike the B/16 run above, this model was a clean run with no dataset shuffling issues.

ViT-B/16+ was trained with 224 A100 (40 GB) GPUS for ~61 hours, 13620 GPU-hours. Batch size per GPU was 160 for a global batch size of 35840.
ViT-B/16+ 240x240
- The resolution was increased to 240x240 (196 vision tokens -> 225 tokens, checked below).
- Accordingly, the vision encoder width went from 768 -> 896 and the text hidden dim from 512 -> 640.
- Trained with 224 A100s for ~61 hours.
- Batch size per GPU of 160, for a global batch size of 35840.
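The token counts follow directly from the 16x16 patch size and the input resolution; a quick check:

```python
def num_patch_tokens(resolution: int, patch_size: int = 16) -> int:
    side = resolution // patch_size
    return side * side

print(num_patch_tokens(224))  # 196 patch tokens for B/16 at 224x224
print(num_patch_tokens(240))  # 225 patch tokens for B/16+ at 240x240
print(224 * 160)              # 35840, the global batch size (224 GPUs x 160 per GPU)
```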
#### ViT-L/14 224x224

A ViT-L/14 with a 75.3% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K

These weights use a different dataset mean and std than others. Instead of using the OpenAI mean & std, inception style normalization `[-1, 1]` is used via a mean and std of `[0.5, 0.5, 0.5]`. This is handled automatically if using `open_clip.create_model_and_transforms` from pretrained weights.

#### ViT-H/14 224x224

A ViT-H/14 with a 78.0% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K

#### ViT-g/14 224x224

A ViT-g/14 with a 76.6% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K

This model was trained with a shorted schedule than other LAION-2B models with 12B samples seen instead of 32+B. It matches LAION-400M training in samples seen. Many zero-shot results are lower as a result, but despite this it performs very well in some OOD zero-shot and retrieval tasks.
From ViT-L onward the models were trained on the supercomputer, so it's hard to say how long they would take on plain A100 clusters.
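Separately, on the normalization note in the ViT-L/14 section quoted above: those weights expect inception-style `[-1, 1]` inputs rather than the usual OpenAI CLIP statistics. A minimal illustration (the OpenAI mean/std values are the ones hard-coded in CLIP; the pretrained tag in the comment is my best guess at the one matching the HF repo above):

```python
from torchvision import transforms

# OpenAI CLIP normalization, used by most open_clip checkpoints
openai_norm = transforms.Normalize(
    mean=(0.48145466, 0.4578275, 0.40821073),
    std=(0.26862954, 0.26130258, 0.27577711),
)

# Inception-style normalization used by the ViT-L/14 LAION-2B weights:
# maps [0, 1] inputs to roughly [-1, 1]
inception_norm = transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))

# open_clip picks the right transform automatically when loading pretrained weights, e.g.
# model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="laion2b_s32b_b82k")
```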
#### ViT-B/32 roberta base

A ViT-B/32 with roberta base encoder with a 61.7% top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
This is the first openclip model using a HF text tower. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k/blob/main/unknown.png)

#### ViT-B/32 xlm roberta base

A ViT-B/32 with xlm roberta base encoder with a 62.33% top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
This is the first openclip model trained on the full laion5B dataset; hence the first multilingual clip trained with openclip. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k/blob/main/metrics.png)
A preliminary multilingual evaluation was run: 43% on imagenet1k italian (vs 21% for english B/32), 37% for imagenet1k japanese (vs 1% for english B/32 and 50% for B/16 clip japanese). It shows the multilingual property is indeed there as expected. Larger models will get even better performance.

#### ViT-H/14 xlm roberta large

A ViT-H/14 with xlm roberta large encoder with a 77.0% (vs 78% for the english equivalent) top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k

This model was trained following the [LiT](https://arxiv.org/abs/2111.07991) methodology: the image tower was frozen (initialized from english openclip ViT-H/14), the text tower was initialized from [xlm roberta large](https://huggingface.co/xlm-roberta-large) and unfrozen. This reduced training cost by a 3x factor.
For the xlm roberta model they used the LiT approach: the english openclip ViT-H/14 image tower is kept frozen while the text tower, initialized from xlm roberta, is left unfrozen. This cut training cost by about 3x.
- LiT: https://arxiv.org/pdf/2111.07991.pdf
Seems to be the paper showing that when aligning image and text, freezing the image side and fitting the text side to it works better.
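A minimal sketch of what LiT-style tuning looks like in PyTorch, assuming a CLIP-like model with `visual` and `text` submodules (the attribute names are assumptions for illustration, not this repo's exact API):

```python
import torch

def setup_lit(model: torch.nn.Module) -> list:
    """Freeze the image tower, keep the text tower trainable (LiT-style)."""
    for p in model.visual.parameters():   # assumed image-tower attribute
        p.requires_grad = False
    model.visual.eval()                   # no dropout/BN updates in the frozen tower

    # only the text-tower parameters go to the optimizer
    return [p for p in model.text.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(setup_lit(model), lr=1e-4)
```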
23.06.27.
Browsed the src/training/data side.
@@ -0,0 +1,563 @@

```python
import ast
```
Let's read src/training/data.py.
```python
import torchvision.datasets as datasets
import webdataset as wds
from PIL import Image
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler, IterableDataset, get_worker_info
from torch.utils.data.distributed import DistributedSampler
from webdataset.filters import _shuffle
from webdataset.tariterators import base_plus_ext, url_opener, tar_file_expander, valid_sample
```
Mostly package imports.
webdataset documentation: https://webdataset.github.io/webdataset/gettingstarted/
The concept is a package that reliably pulls the jpg / json entries sharing the same name out of tar files, and it works the same whether the shards live on local disk or in cloud storage:

```python
from itertools import islice

import webdataset as wds

url = "http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar"
url = f"pipe:curl -L -s {url} || true"
dataset = wds.WebDataset(url)

for sample in islice(dataset, 0, 3):
    for key, value in sample.items():
        print(key, repr(value)[:50])
    print()
```

Simple transforms are also supported:

```python
dataset = (
    wds.WebDataset(url)
    .shuffle(100)
    .decode("rgb")
    .to_tuple("jpg;png", "json")
)

for image, data in islice(dataset, 0, 3):
    print(image.shape, image.dtype, type(data))
```
```python
try:
    import horovod.torch as hvd
except ImportError:
    hvd = None
```
horovod -> a package that helps with distributed training.
https://github.com/horovod/horovod
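For context, a minimal sketch of how horovod typically plugs into a PyTorch training script (generic horovod usage, not necessarily how this repo wires it up):

```python
import horovod.torch as hvd
import torch

hvd.init()                               # start the horovod runtime
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# average gradients across workers and sync the initial state from rank 0
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```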
```python
hvd = None


class CsvDataset(Dataset):
```
A Dataset object that loads from a CSV file. Nothing special; roughly the shape sketched below.
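A simplified approximation of such a CSV image-text dataset (not the exact class in this diff; the column names and tokenizer argument are assumptions):

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class SimpleCsvDataset(Dataset):
    """Image/caption pairs listed in a CSV file, one pair per row."""

    def __init__(self, filename, transforms, img_key="filepath", caption_key="title",
                 sep="\t", tokenizer=None):
        df = pd.read_csv(filename, sep=sep)
        self.images = df[img_key].tolist()
        self.captions = df[caption_key].tolist()
        self.transforms = transforms
        self.tokenize = tokenizer

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        image = self.transforms(Image.open(str(self.images[idx])))
        if self.tokenize is not None:
            return image, self.tokenize([str(self.captions[idx])])[0]
        return image, str(self.captions[idx])
```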
```python
def expand_urls(urls, weights=None):
    if weights is None:
        expanded_urls = wds.shardlists.expand_urls(urls)
        return expanded_urls, None
    if isinstance(urls, str):
        urllist = urls.split("::")
        weights = weights.split('::')
        assert len(weights) == len(urllist),\
            f"Expected the number of data components ({len(urllist)}) and weights({len(weights)}) to match."
        weights = [float(weight) for weight in weights]
        all_urls, all_weights = [], []
        for url, weight in zip(urllist, weights):
            expanded_url = list(braceexpand.braceexpand(url))
            expanded_weights = [weight for _ in expanded_url]
            all_urls.extend(expanded_url)
            all_weights.extend(expanded_weights)
        return all_urls, all_weights
    else:
        all_urls = list(urls)
        return all_urls, weights
```
The function that expands URLs.
The README describes a multiple-dataset option (--train-data "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar"), and this is where it is implemented.
It uses braceexpand, which is what makes sampling across multiple shuffled datasets possible (small example below).
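A small illustration of what the `::`-separated spec expands to (the paths here are made up for the example):

```python
from braceexpand import braceexpand

spec = "/data/cc12m/cc12m-train-{0000..0002}.tar::/data/laion/{00000..00001}.tar"

all_urls = []
for component in spec.split("::"):           # one component per dataset
    all_urls.extend(braceexpand(component))  # expand the {a..b} ranges into shard paths

print(all_urls)
# ['/data/cc12m/cc12m-train-0000.tar', '/data/cc12m/cc12m-train-0001.tar',
#  '/data/cc12m/cc12m-train-0002.tar', '/data/laion/00000.tar', '/data/laion/00001.tar']
```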
```python
    if is_train:
        if not resampled:
            pipeline.extend([
                detshuffle2(
                    bufsize=_SHARD_SHUFFLE_SIZE,
                    initial=_SHARD_SHUFFLE_INITIAL,
                    seed=args.seed,
                    epoch=shared_epoch,
                ),
                wds.split_by_node,
                wds.split_by_worker,
            ])
        pipeline.extend([
            # at this point, we have an iterator over the shards assigned to each worker at each node
            tarfile_to_samples_nothrow,  # wds.tarfile_to_samples(handler=log_and_continue),
            wds.shuffle(
                bufsize=_SAMPLE_SHUFFLE_SIZE,
                initial=_SAMPLE_SHUFFLE_INITIAL,
            ),
        ])
```
Shard-level shuffling with detshuffle2, then splitting by node / worker -> then shuffling samples within the shards; conceptually the pipeline looks like the sketch below.
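A simplified stand-in built from stock webdataset stages (not the exact pipeline in this file, and the shard spec is an example):

```python
import webdataset as wds

urls = "/data/laion/{00000..00009}.tar"  # example shard spec

pipeline = wds.DataPipeline(
    wds.SimpleShardList(urls),     # expand the spec into a list of shard URLs
    wds.shuffle(100),              # shuffle the shard list
    wds.split_by_node,             # each node keeps a disjoint subset of shards
    wds.split_by_worker,           # each dataloader worker keeps a further subset
    wds.tarfile_to_samples(),      # expand each tar shard into samples
    wds.shuffle(1000),             # sample-level shuffle within a buffer
    wds.decode("pil"),
    wds.to_tuple("jpg;png", "txt"),
    wds.batched(64),
)
```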
```python
    dataloader = wds.WebLoader(
        dataset,
        batch_size=None,
        shuffle=False,
        num_workers=args.workers,
        persistent_workers=args.workers > 0,
    )
```
Wrap it with wds.WebLoader and return the dataloader.
```python
    dataloader.num_batches = num_batches
    dataloader.num_samples = num_samples

    return DataInfo(dataloader=dataloader, shared_epoch=shared_epoch)
```
Return a DataInfo holding the dataloader and the shared_epoch, and we're done.
```python
def get_csv_dataset(args, preprocess_fn, is_train, epoch=0, tokenizer=None):
    input_filename = args.train_data if is_train else args.val_data
    assert input_filename
    dataset = CsvDataset(
        input_filename,
        preprocess_fn,
        img_key=args.csv_img_key,
        caption_key=args.csv_caption_key,
        sep=args.csv_separator,
        tokenizer=tokenizer
    )
    num_samples = len(dataset)
    sampler = DistributedSampler(dataset) if args.distributed and is_train else None
    shuffle = is_train and sampler is None

    dataloader = DataLoader(
        dataset,
        batch_size=args.batch_size,
        shuffle=shuffle,
        num_workers=args.workers,
        pin_memory=True,
        sampler=sampler,
        drop_last=is_train,
    )
    dataloader.num_samples = num_samples
    dataloader.num_batches = len(dataloader)

    return DataInfo(dataloader, sampler)
```
The CSV path seems to support only a single file?
```python
def get_dataset_fn(data_path, dataset_type):
    if dataset_type == "webdataset":
        return get_wds_dataset
    elif dataset_type == "csv":
        return get_csv_dataset
    elif dataset_type == "synthetic":
        return get_synthetic_dataset
    elif dataset_type == "auto":
        ext = data_path.split('.')[-1]
        if ext in ['csv', 'tsv']:
            return get_csv_dataset
        elif ext in ['tar']:
            return get_wds_dataset
        else:
            raise ValueError(
                f"Tried to figure out dataset type, but failed for extension {ext}.")
    else:
        raise ValueError(f"Unsupported dataset type: {dataset_type}")
```
Yeah, looks that way.
23.07.05
Looked through the main / train / loss / model code.
```python
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.checkpoint import checkpoint

from .hf_model import HFTextEncoder
from .modified_resnet import ModifiedResNet
from .timm_model import TimmModel
from .transformer import LayerNormFp32, LayerNorm, QuickGELU, Attention, VisionTransformer, TextTransformer
from .utils import to_2tuple
```
The model part: the text encoder can apparently come from HF and the vision encoder from timm.
```python
@dataclass
class CLIPVisionCfg:
    layers: Union[Tuple[int, int, int, int], int] = 12
    width: int = 768
    head_width: int = 64
    mlp_ratio: float = 4.0
    patch_size: int = 16
    image_size: Union[Tuple[int, int], int] = 224
```
The configs are dataclasses.
```python
    ls_init_value: Optional[float] = None  # layer scale initial value
    patch_dropout: float = 0.  # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
    input_patchnorm: bool = False  # whether to use dual patchnorm - would only apply the input layernorm on each patch, as post-layernorm already exist in original clip vit design
    global_average_pool: bool = False  # whether to global average pool the last embedding layer, instead of using CLS token (https://arxiv.org/abs/2205.01580)
```
Apparently using GAP is known to work better than using the [CLS] token
-> Better plain ViT baselines for ImageNet-1k; a quick contrast of the two poolings is sketched below.
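A quick contrast of the two pooling options on a ViT token sequence (the shapes are assumptions for illustration):

```python
import torch

tokens = torch.randn(8, 197, 768)        # (batch, 1 CLS token + 196 patch tokens, width)

cls_pooled = tokens[:, 0]                # take the CLS token embedding
gap_pooled = tokens[:, 1:].mean(dim=1)   # global average pool over the patch tokens instead

print(cls_pooled.shape, gap_pooled.shape)  # torch.Size([8, 768]) torch.Size([8, 768])
```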
```python
    input_patchnorm: bool = False  # whether to use dual patchnorm - would only apply the input layernorm on each patch, as post-layernorm already exist in original clip vit design
```
Whether or not to apply an input layernorm to each patch.

So the post-layernorm is already there in the original CLIP ViT design?
```python
    patch_dropout: float = 0.  # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
```
FLIP: simply dropping image patches during training; a minimal sketch of the idea follows.
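A minimal sketch of random patch dropout (FLIP-style), assuming tokens of shape (batch, 1 + num_patches, width) with the CLS token at index 0 always kept; this is an illustration, not the repo's exact PatchDropout module:

```python
import torch

def patch_dropout(tokens: torch.Tensor, drop_prob: float) -> torch.Tensor:
    """Randomly keep a subset of patch tokens during training."""
    if drop_prob <= 0.:
        return tokens
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    batch, num_patches, width = patches.shape
    num_keep = max(1, int(num_patches * (1. - drop_prob)))

    # per-sample random choice of which patch indices to keep
    keep_idx = torch.rand(batch, num_patches, device=tokens.device).argsort(dim=-1)[:, :num_keep]
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, width)
    patches = patches.gather(dim=1, index=keep_idx)

    return torch.cat([cls_tok, patches], dim=1)

# patch_dropout(torch.randn(8, 1 + 196, 768), drop_prob=0.5).shape  # -> (8, 1 + 98, 768)
```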
@@ -0,0 +1,216 @@

```python
import torch
```
Let's read the loss!
```python
hvd = None


def gather_features(
```
Gather the features!
```python
        if gather_with_grad:
            all_image_features = torch.cat(torch.distributed.nn.all_gather(image_features), dim=0)
            all_text_features = torch.cat(torch.distributed.nn.all_gather(text_features), dim=0)
```
If gather_with_grad, use torch.distributed.nn.all_gather.
https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather
```python
        else:
            gathered_image_features = [torch.zeros_like(image_features) for _ in range(world_size)]
            gathered_text_features = [torch.zeros_like(text_features) for _ in range(world_size)]
            dist.all_gather(gathered_image_features, image_features)
            dist.all_gather(gathered_text_features, text_features)
```
Otherwise, pre-allocate zero tensors of the right shape and fill them with dist.all_gather?
import torch.distributed.nn
from torch import distributed as dist
Huh, what's the difference between these two?
```python
            if not local_loss:
                # ensure grads for local rank when all_* features don't have a gradient
                gathered_image_features[rank] = image_features
                gathered_text_features[rank] = text_features
```
If not local_loss, each rank writes its own features back into the gathered lists.
Because the gathered features don't carry gradients!
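To spell out the difference asked about above: torch.distributed.all_gather fills pre-allocated tensors and is not autograd-aware, while torch.distributed.nn.all_gather returns tensors that gradients can flow through; that is why the non-grad path re-inserts the local tensors. A hedged sketch of the pattern (assumes an already-initialized process group, e.g. under torchrun):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # differentiable collectives live here

def gather_with_local_grad(features: torch.Tensor, gather_with_grad: bool) -> torch.Tensor:
    world_size, rank = dist.get_world_size(), dist.get_rank()

    if gather_with_grad:
        # autograd-aware: gradients flow back to every rank's features
        return torch.cat(torch.distributed.nn.all_gather(features), dim=0)

    # plain all_gather: the gathered copies are detached from the autograd graph
    gathered = [torch.zeros_like(features) for _ in range(world_size)]
    dist.all_gather(gathered, features)
    gathered[rank] = features  # re-insert the local tensor so the local slice still carries grad
    return torch.cat(gathered, dim=0)
```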
23.07.06
Looked at the remaining configs and utils. Tomorrow it probably makes sense to continue from the ViT in open_clip/transformer.
@@ -0,0 +1,56 @@

```python
# HF architecture dict:
arch_dict = {
```
hf_configs
The text encoders typically used here seem to be roberta / xlm-roberta / mt5 / bert.
```python
        return x.last_hidden_state[:, self.cls_token_position, :]


class HFTextEncoder(nn.Module):
```
It pulls in a huggingface text encoder and adds a few classes around it.
```python
        if config is None:
            self.config = AutoConfig.from_pretrained(model_name_or_path)
            create_func, model_args = (AutoModel.from_pretrained, model_name_or_path) if pretrained else (
                AutoModel.from_config, self.config)
```
First, it loads the model.
```python
# TODO: ?last - for gpt-like models
_POOLERS = {}


def register_pooler(cls):
    """Decorator registering pooler class"""
    _POOLERS[_camel2snake(cls.__name__)] = cls
    return cls


@register_pooler
class MeanPooler(nn.Module):
    """Mean pooling"""

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        masked_output = x.last_hidden_state * attention_mask.unsqueeze(-1)
        return masked_output.sum(dim=1) / attention_mask.sum(-1, keepdim=True)


@register_pooler
class MaxPooler(nn.Module):
    """Max pooling"""

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        masked_output = x.last_hidden_state.masked_fill(attention_mask.unsqueeze(-1), -torch.inf)
        return masked_output.max(1).values


@register_pooler
class ClsPooler(nn.Module):
    """CLS token pooling"""

    def __init__(self, use_pooler_output=True):
        super().__init__()
        self.cls_token_position = 0
        self.use_pooler_output = use_pooler_output

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        if (self.use_pooler_output and
            isinstance(x, (BaseModelOutputWithPooling, BaseModelOutputWithPoolingAndCrossAttentions)) and
            (x.pooler_output is not None)
        ):
            return x.pooler_output

        return x.last_hidden_state[:, self.cls_token_position, :]


@register_pooler
class ClsLastHiddenStatePooler(nn.Module):
    """CLS token pooling
    NOTE: this is equivalent to ClsPooler above with use_pooler_output=False
    """

    def __init__(self):
        super().__init__()
        self.cls_token_position = 0

    def forward(self, x: BaseModelOutput, attention_mask: TensorType):
        return x.last_hidden_state[:, self.cls_token_position, :]
```
These are the available poolers.
```python
        if (d_model == output_dim) and (proj is None):  # do we always need a proj?
            self.proj = nn.Identity()
        elif proj == 'linear':
            self.proj = nn.Linear(d_model, output_dim, bias=False)
        elif proj == 'mlp':
            hidden_size = (d_model + output_dim) // 2
            self.proj = nn.Sequential(
                nn.Linear(d_model, hidden_size, bias=False),
                nn.GELU(),
                nn.Linear(hidden_size, output_dim, bias=False),
            )
```
The part that does the pooling and then the projection.
```python
        q, k, v = F.linear(x, self.in_proj_weight, self.in_proj_bias).chunk(3, dim=-1)
        q = q.contiguous().view(L, N * self.num_heads, -1).transpose(0, 1)
        k = k.contiguous().view(L, N * self.num_heads, -1).transpose(0, 1)
        v = v.contiguous().view(L, N * self.num_heads, -1).transpose(0, 1)
```
Project everything at once, then split into Q, K, V with chunk (toy example below).
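A toy example of the fused projection pattern (dimensions are made up):

```python
import torch
import torch.nn.functional as F

L, N, d = 10, 4, 64             # sequence length, batch, model width
x = torch.randn(L, N, d)
w_qkv = torch.randn(3 * d, d)   # one weight matrix for Q, K and V together
b_qkv = torch.zeros(3 * d)

q, k, v = F.linear(x, w_qkv, b_qkv).chunk(3, dim=-1)  # one matmul, then split the last dim in three
print(q.shape, k.shape, v.shape)                      # each: torch.Size([10, 4, 64])
```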
```python
        if self.logit_scale is not None:
            attn = torch.bmm(F.normalize(q, dim=-1), F.normalize(k, dim=-1).transpose(-1, -2))
            logit_scale = torch.clamp(self.logit_scale, max=self.logit_scale_max).exp()
            attn = attn.view(N, self.num_heads, L, L) * logit_scale
```
Seems to be a value that gets multiplied into the attention scores.

Ah, so that's what it is.

Ah no, that one exists separately as self.scale... not sure what this one is. (Since q and k are L2-normalized here, this looks like a learned per-head temperature for cosine-similarity attention, separate from the fixed 1/sqrt(d) scale.)
```python
        return x


class AttentionalPooler(nn.Module):
```
The attentional pooler that appears in CoCa.
```python
            norm_layer: Callable = LayerNorm
    ):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, d_model))
```
The queries are just randomly initialized embeddings.
```python
    def forward(self, x: torch.Tensor):
        x = self.ln_k(x).permute(1, 0, 2)  # NLD -> LND
        N = x.shape[1]
        q = self.ln_q(self.query)
        out = self.attn(self._repeat(q, N), x, x, need_weights=False)[0]
        return out.permute(1, 0, 2)  # LND -> NLD
```
The attention:
- Q: the query embeddings repeated to match K's batch dimension
- K = V: x after the key layer norm
Ah, so it's just MHSA with randomly initialized learned queries.
23.07.19.
Took another look at the data resampling part.
```python
                yield dict(url=self.rng.choices(self.urls, weights=self.weights, k=1)[0])


def get_wds_dataset(args, preprocess_img, is_train, epoch=0, floor=False, tokenizer=None):
```
Re-reading the shuffling-related code:
```python
    if resampled:
        pipeline = [ResampledShards2(
            input_shards,
            weights=args.train_data_upsampling_factors,
            deterministic=True,
            epoch=shared_epoch,
        )]
```
When resampling, a ResampledShards2 instance is put at the front of the pipeline; its implementation is:
```python
        return _shuffle(src, self.bufsize, self.initial, rng)


class ResampledShards2(IterableDataset):
```
An object that inherits from IterableDataset and shuffles the URLs (i.e. the shards).
```python
    def __iter__(self):
        """Return an iterator over the shards."""
        if isinstance(self.epoch, SharedEpoch):
            epoch = self.epoch.get_value()
        else:
            # NOTE: this is epoch tracking is problematic in a multiprocess (dataloader workers or train)
            # situation as different workers may wrap at different times (or not at all).
            self.epoch += 1
            epoch = self.epoch
        if self.deterministic:
            # reset seed w/ epoch if deterministic
            if self.worker_seed is None:
                # pytorch worker seed should be deterministic due to being init by arg.seed + rank + worker id
                seed = pytorch_worker_seed(epoch)
            else:
                seed = self.worker_seed() + epoch
            self.rng.seed(seed)
        for _ in range(self.nshards):
            if self.weights is None:
                yield dict(url=self.rng.choice(self.urls))
            else:
                yield dict(url=self.rng.choices(self.urls, weights=self.weights, k=1)[0])
```
Basically all it does is build a seed and pick shards with the rng.
deterministic is the part that sets the seed; without it there is no fixed seed (totally random).
```python
    else:
        assert args.train_data_upsampling_factors is None,\
            "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."
        pipeline = [wds.SimpleShardList(input_shards)]
```
Otherwise just a SimpleShardList; that one can apparently use the stock wds implementation as-is.
```python
        pipeline.extend([
            detshuffle2(
                bufsize=_SHARD_SHUFFLE_SIZE,
                initial=_SHARD_SHUFFLE_INITIAL,
                seed=args.seed,
                epoch=shared_epoch,
            ),
            wds.split_by_node,
            wds.split_by_worker,
        ])
```
detshuffle2 gets called. What that is:

Note that this still seems to operate on shards, i.e. the shuffling here is at the shard level.
```python
_SAMPLE_SHUFFLE_INITIAL = 1000


class detshuffle2(wds.PipelineStage):
```
Here it is:
```python
        rng = random.Random()
        if self.seed < 0:
            # If seed is negative, we use the worker's seed, this will be different across all nodes/workers
            seed = pytorch_worker_seed(epoch)
        else:
            # This seed to be deterministic AND the same across all nodes/workers in each epoch
            seed = self.seed + epoch
        rng.seed(seed)
        return _shuffle(src, self.bufsize, self.initial, rng)
```
Nothing much going on: it just derives a seed from the epoch and calls _shuffle.
Looking at the code, _shuffle is plain shuffling, but it works over a buffer, so samples are only shuffled within a rolling buffer-sized window: with 10,000 samples and a buffer of 1,000, mixing happens only within that 1,000-sample window.

```python
def _shuffle(data, bufsize=1000, initial=100, rng=None, handler=None):
    """Shuffle the data in the stream.

    This uses a buffer of size `bufsize`. Shuffling at
    startup is less random; this is traded off against
    yielding samples quickly.

    data: iterator
    bufsize: buffer size for shuffling
    returns: iterator
    rng: either random module or random.Random instance
    """
    if rng is None:
        rng = random.Random(int((os.getpid() + time.time()) * 1e9))
    initial = min(initial, bufsize)
    buf = []
    for sample in data:
        buf.append(sample)
        if len(buf) < bufsize:
            try:
                buf.append(next(data))  # skipcq: PYL-R1708
            except StopIteration:
                pass
        if len(buf) >= initial:
            yield pick(buf, rng)
    while len(buf) > 0:
        yield pick(buf, rng)
```
```python
            wds.split_by_node,
            wds.split_by_worker,
```
Ah, this is where the split across nodes / workers happens.

```python
def split_by_node(src, group=None):
    rank, world_size, worker, num_workers = utils.pytorch_worker_info(group=group)
    if world_size > 1:
        for s in islice(src, rank, None, world_size):
            yield s
    else:
        for s in src:
            yield s
```

It just deals the already-shuffled shards out to each node.
```python
            wds.shuffle(
                bufsize=_SAMPLE_SHUFFLE_SIZE,
                initial=_SAMPLE_SHUFFLE_INITIAL,
            ),
```
wds.shuffle
This one is the sample-level shuffle within shards.
```python
    else:
        pipeline.extend([
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker
            wds.tarfile_to_samples(handler=log_and_continue),
        ])
```
In the else branch, only split_by_worker is applied.
review open clip