
[20230312] Weekly AI ArXiv 만담 Season 2 - Episode 9 #75

Open · scene-the-ella opened this issue Mar 7, 2023 · 5 comments

@scene-the-ella

No description provided.

@jungwoo-ha (Owner)

jungwoo-ha commented Mar 10, 2023

News

ArXiv

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  • A system report on hooking visual foundation models up to the ChatGPT API (from MSRA)
  • Provides a wide range of capabilities by using visual foundation models such as BLIP, Stable Diffusion, and ControlNet
  • Adds a Prompt Manager module to manage the interaction between ChatGPT, the VFMs, and the user conversation (a rough sketch of the idea follows below)
  • Source code:
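To get a feel for what a Prompt Manager has to do, here is a purely illustrative Python sketch of routing between a chat LLM and registered visual foundation models. This is not the actual Visual ChatGPT code; the tool names, the chat() wrapper, and the ACTION dispatch format are assumptions made for this example.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VisualTool:
    name: str          # e.g. "image_captioning" (BLIP) or "text_to_image" (Stable Diffusion)
    description: str   # shown to the chat model so it can pick the right tool
    run: Callable[[str], str]

class PromptManager:
    """Keeps the conversation history and tells the chat model which visual tools exist."""

    def __init__(self, chat: Callable[[str], str], tools: Dict[str, VisualTool]):
        self.chat = chat                    # assumed wrapper around the ChatGPT API
        self.tools = tools
        self.history: List[str] = []

    def system_prompt(self) -> str:
        tool_lines = "\n".join(f"- {t.name}: {t.description}" for t in self.tools.values())
        return ("You can call visual foundation models. Available tools:\n" + tool_lines +
                "\nTo call one, reply exactly: ACTION <tool_name> <input>")

    def step(self, user_message: str) -> str:
        self.history.append(f"User: {user_message}")
        reply = self.chat(self.system_prompt() + "\n" + "\n".join(self.history))
        while reply.startswith("ACTION"):   # route the request to a visual foundation model
            _, tool_name, tool_input = reply.split(maxsplit=2)
            observation = self.tools[tool_name].run(tool_input)
            self.history.append(f"Observation from {tool_name}: {observation}")
            reply = self.chat(self.system_prompt() + "\n" + "\n".join(self.history))
        self.history.append(f"Assistant: {reply}")
        return reply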

Scaling up GANs for Text-to-Image Synthesis

  • GigaGAN, created by Minguk Kang (POSTECH) during an internship at Adobe Research!
  • Successfully scales up a GAN (to 1B parameters) that does text-to-image at higher quality than diffusion-family models
  • To make that work:
    • The generator G is fundamentally multi-scale
      • Text goes through a frozen CLIP encoder plus additional learnable attention layers; the global text feature is fed together with the latent Z through the mapping network to produce the style code W
      • Local text features are used as conditioning at each scale of the generator
      • The image-upsampling components consist of conv, self-attention, and cross-attention
      • The key ingredient appears to be sample-adaptive kernel selection, which chooses convolution kernels dynamically based on the text condition (see the sketch after this list)
    • The discriminator D is also multi-scale, with separate text and image branches
      • Text conditioning is applied at each scale
      • Each image scale consists of conv + self-attention blocks and makes an independent real/fake prediction
  • As expected of a GAN, it is very fast: a 512 x 512 image in 0.13 seconds!!!
  • Project page: https://mingukkang.github.io/GigaGAN/
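A rough sketch of what sample-adaptive kernel selection could look like, just to illustrate picking per-sample convolution kernels from a bank using the text/style conditioning. This is not the official GigaGAN implementation; layer sizes, names, and the softmax-mixing detail are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleAdaptiveConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, cond_dim: int, n_kernels: int = 8, k: int = 3):
        super().__init__()
        # A bank of candidate convolution kernels.
        self.bank = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        # Predict per-sample mixing weights from the text/style conditioning vector.
        self.selector = nn.Linear(cond_dim, n_kernels)
        self.padding = k // 2

    def forward(self, x, cond):
        b, c, h, w = x.shape
        mix = F.softmax(self.selector(cond), dim=-1)                 # (B, n_kernels)
        kernel = torch.einsum("bn,noikl->boikl", mix, self.bank)     # per-sample kernel (B, out, in, k, k)
        out_ch = kernel.shape[1]
        # Apply a different kernel to each sample via a grouped convolution.
        x = x.reshape(1, b * c, h, w)
        kernel = kernel.reshape(b * out_ch, c, *kernel.shape[-2:])
        y = F.conv2d(x, kernel, padding=self.padding, groups=b)
        return y.reshape(b, out_ch, h, w)

layer = SampleAdaptiveConv(in_ch=64, out_ch=64, cond_dim=512)
x = torch.randn(2, 64, 32, 32)
cond = torch.randn(2, 512)      # e.g. a pooled CLIP text feature
print(layer(x, cond).shape)     # torch.Size([2, 64, 32, 32])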

PaLM-E: An Embodied Multimodal Language Model

  • PaLM (540B) + ViT (22B, for image recognition) + sensor embeddings --> robot control!
  • Google flexing, basically...
  • Also significant from a continual-learning perspective (a toy sketch of the token-interleaving idea follows below)
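The core mechanical idea, as I read it, is to project ViT patch features and robot sensor states into the LLM's token-embedding space and interleave them with text tokens. A toy sketch below; all names and dimensions are made up for illustration and this is not the actual PaLM-E code.

import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    """Toy sketch: continuous image/sensor 'tokens' interleaved with text tokens."""

    def __init__(self, llm_embed: nn.Embedding, llm_body: nn.Module,
                 vit_dim: int = 1024, state_dim: int = 12, d_model: int = 4096):
        super().__init__()
        self.llm_embed = llm_embed                         # language-model token embeddings
        self.llm_body = llm_body                           # transformer stack over embeddings
        self.img_proj = nn.Linear(vit_dim, d_model)        # ViT patch features -> LLM space
        self.state_proj = nn.Linear(state_dim, d_model)    # robot sensor state -> LLM space

    def forward(self, text_ids, image_feats, sensor_state):
        img_tok = self.img_proj(image_feats)                    # (B, n_patches, d_model)
        state_tok = self.state_proj(sensor_state).unsqueeze(1)  # (B, 1, d_model)
        txt_tok = self.llm_embed(text_ids)                      # (B, n_text, d_model)
        # The LLM then decodes text (e.g. a robot plan) from the mixed sequence.
        return self.llm_body(torch.cat([img_tok, state_tok, txt_tok], dim=1))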

@gyunggyung

gyunggyung commented Mar 11, 2023

A quick taste of some light news

  1. News: recent papers demonstrating GPT-4-like capabilities
  2. NEXT AI: Yann André LeCun argues that simply scaling up LLMs will not get us to AGI. He says we need models that work more like the human brain: components that hold a model of the world, produce instinctive reactions, perform deliberate reasoning, and a part that coordinates all of this. (That said, I don't agree with his argument 100%!)

LLaMA

Sharing the latest news

LLMs on a MacBook

https://github.com/gyunggyung/KoChatLLaMA.cpp
https://www.facebook.com/groups/1272877526915876/permalink/1277329939803968/

llama.cpp
Inference of Facebook's LLaMA model in pure C/C++

Hot topics

Description

The main goal is to run the model using 4-bit quantization on a MacBook.

  • Plain C/C++ implementation without dependencies
  • Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
  • AVX2 support for x86 architectures
  • Mixed F16 / F32 precision
  • 4-bit quantization support
  • Runs on the CPU

This was hacked in an evening - I have no idea if it works correctly. Please do not make conclusions about the models based on the results from this implementation. For all I know, it can be completely wrong. This project is for educational purposes and is not going to be maintained properly. New features will probably be added mostly through community contributions, if any.
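For intuition, here is a small Python sketch of block-wise 4-bit quantization with one scale per block, roughly in the spirit of the q4_0 format mentioned above. The real ggml layout, rounding, and bit-packing differ, so treat this only as an illustration of the idea.

import numpy as np

BLOCK = 32  # values per quantization block

def quantize_q4(weights: np.ndarray):
    w = weights.reshape(-1, BLOCK).astype(np.float32)
    # One scale per block: map the largest magnitude onto the 4-bit signed range [-8, 7].
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit codes (stored here in int8)
    return q, scale.astype(np.float16)

def dequantize_q4(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

if __name__ == "__main__":
    w = np.random.randn(4096 * 32).astype(np.float32)
    q, s = quantize_q4(w)
    err = np.abs(dequantize_q4(q, s) - w).mean()
    print(f"mean abs quantization error: {err:.4f}")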


Here is a typical run using LLaMA-7B:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1678486056
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:
1) Select a domain name and web hosting plan
2) Complete a sitemap
3) List your products
4) Write product descriptions
5) Create a user account
6) Build the template
7) Start building the website
8) Advertise the website
9) Provide email support
10) Submit the website to search engines
A website is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user's browser.
The web pages are stored in a web server. The web server is also called a host. When the website is accessed, it is retrieved from the server and displayed on the user's computer.
A website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user's screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones.
Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
The website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user’s screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones. Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
A website is an address of a website. It is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user’s browser.
A website is known as a website when it is hosted

main: mem per token = 14434244 bytes
main:     load time =  1332.48 ms
main:   sample time =  1081.40 ms
main:  predict time = 31378.77 ms / 61.41 ms per token
main:    total time = 34036.74 ms

And here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:

Usage

Here are the steps for the LLaMA-7B model:

# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128

Limitations of LLaMA and directions for improvement

Coming soon.

Google USM

Our encoder incorporates over 300 languages through pre-training. We demonstrate the effectiveness of the pre-trained encoder by fine-tuning on multilingual speech data from YouTube Captions. The supervised YouTube data covers 73 languages with, on average, less than 3,000 hours of data per language. Despite the limited supervised data, the model achieves an average word error rate (WER; lower is better) of under 30% across the 73 languages, a milestone we have never reached before. For en-US, USM has a 6% relatively lower WER than the current internal state-of-the-art model. Finally, we compare against Whisper (large-v2), a recently released large model trained on more than 400k hours of labeled data. For the comparison, we use only the 18 languages that Whisper can decode successfully with a WER below 40%. On these 18 languages, our model has on average a 32.7% relatively lower WER than Whisper.

--
USM supports all 73 languages in the YouTube Captions' Test Set and outperforms Whisper on the languages it can support with lower than 40% WER. Lower WER is better.

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
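For intuition on the "random-projection quantization" used in pre-training (the BEST-RQ-style idea), here is a minimal sketch: a frozen random projection plus a frozen random codebook turn each speech frame into a discrete target for BERT-style masked prediction. Dimensions and names are made up for illustration; this is not Google's implementation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
FEAT_DIM, PROJ_DIM, CODEBOOK = 80, 16, 8192   # log-mel dim, projection dim, number of codes

# Both the projection and the codebook are random and frozen (never trained).
projection = torch.randn(FEAT_DIM, PROJ_DIM)
codebook = F.normalize(torch.randn(CODEBOOK, PROJ_DIM), dim=-1)

def rpq_labels(features: torch.Tensor) -> torch.Tensor:
    """Map each speech frame to the index of its nearest random codebook vector.

    These indices serve as discrete targets for masked prediction by the speech
    encoder; no quantizer learning is needed.
    """
    proj = F.normalize(features @ projection, dim=-1)      # (T, PROJ_DIM)
    dists = torch.cdist(proj, codebook)                    # (T, CODEBOOK)
    return dists.argmin(dim=-1)                            # (T,) target ids

frames = torch.randn(200, FEAT_DIM)    # 200 frames of (e.g.) log-mel features
print(rpq_labels(frames)[:10])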
arXivGPT summary ("default" prompt used):
The paper "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages" describes a single large model, the Universal Speech Model (USM), that performs automatic speech recognition (ASR) across 100+ languages by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million hours spanning over 300 languages and fine-tuning on a smaller labeled dataset.

Key insights and lessons learned from the paper include:

Multilingual pre-training with random-projection quantization and speech-text modality matching can achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
USM exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages compared to the Whisper model, despite using a labeled training set 1/7th the size of Whisper's training set.
USM significantly reduces model complexity and inference latency compared to traditional approaches that require multiple language-specific models.
The paper highlights the importance of a large, diverse multilingual dataset for pre-training and fine-tuning the model, as well as the effectiveness of random-projection quantization and speech-text modality matching.
Three questions to ask the authors:

How does USM compare to other large-scale multilingual speech recognition models, such as Facebook's wav2vec and wav2vec 2.0 models?
Have you explored using USM for other speech-related tasks, such as speaker identification or emotion recognition?
Can USM be extended to handle low-resource languages with limited labeled training data, and if so, what techniques might be effective?
Three suggestions for related topics or future research directions:

Investigate the transfer learning capabilities of USM for other natural language processing tasks, such as text classification or named entity recognition.
Explore the impact of additional pre-training tasks on USM's performance, such as masked language modeling or sequence-to-sequence translation.
Investigate the effectiveness of USM for speech recognition in noisy or adverse acoustic environments.


MuAViC

Worth checking which one is actually better. API application submitted. Roughly: Google for most languages and multilingual use, Meta for the outlier cases?

@nick-jhlee

nick-jhlee commented Mar 11, 2023

(Back after a long while..)

Upcoming Conferences/Deadlines

Papers (emphasis on diffusion models)

  • Communication-Efficient Collaborative Heterogeneous Bandits in Networks

    • KAIST AI (a paper I'm a co-author on)
    • In a setting with multiple bandits learning collaboratively, we develop an algorithm that is far more communication-efficient than the conventional network-flooding approach while giving up only a minimal amount of performance!
    • This was my first paper in the networking area, and the field turned out to be really fun haha. If I get the chance, I'd like to keep writing papers in this direction.. haha
  • Dropout Reduces Underfitting

    • FAIR (Meta AI), UC Berkeley, MBZUAI
    • We've always used dropout to prevent overfitting, but in today's era of exploding data we should actually be worrying about underfitting..?!
    • Q. Can dropout be used to mitigate underfitting?
    • A. Yes, if used early on! Used later, it reduces overfitting!
    • How? It aligns the mini-batch gradient more closely with the direction of the true (whole-dataset) gradient!
    • Backed up experimentally (ImageNet + ViT, Mixer-S, ConvNeXt, ...); a minimal sketch follows below

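A minimal sketch of the "early dropout" recipe as I read it: dropout is switched on only for the first chunk of training, then off. The cutoff epoch and rate below are placeholders, not the paper's exact settings.

import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    """Set the drop probability of every Dropout module in the model."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

EARLY_EPOCHS, EARLY_P = 20, 0.1   # placeholder values

def train(model, loader, optimizer, epochs=300):
    for epoch in range(epochs):
        # Early dropout: keep dropout on only while mini-batch gradients are noisiest,
        # which (per the paper) better aligns them with the whole-dataset gradient.
        set_dropout(model, EARLY_P if epoch < EARLY_EPOCHS else 0.0)
        for x, y in loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()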

  • Diffusion Models are Minimax Optimal Distribution Estimators

    • University of Tokyo & RIKEN AIP (a fairly hot paper out of Japan, the land of statistics craftsmanship)
    • TL;DR: statistical learning theory for diffusion modeling!
      • Just how good a distribution estimator is diffusion??
      • How tightly can the error of approximating the score with a neural net be bounded, and how does that error affect the diffusion process??
    • Shows a minimax-optimal sample complexity (minimax: the maximum error an arbitrary fixed estimator can incur, minimized over the choice of estimator)
    • TL;DR: diffusion is a statistically excellent distribution estimator.
  • Understanding the Diffusion Objective as a Weighted Integral of ELBOs

    • Google Research, Google Brain (Kingma is on the paper)
    • The original work (Sohl-Dickstein et al., ICML'15) used a likelihood-based loss, the ELBO, but the SOTA papers that followed use different losses, and still do!
      • score matching (Song & Ermon, NeurIPS'19)
      • noise prediction (Ho et al., NeurIPS'20)
    • Q. So is the ELBO now irrelevant to diffusion?
    • A. No! It turns out the weighted (SOTA) losses we use are equal to the ELBO on *noise-perturbed data*!
    • More interesting (and more important):
      • If the weighting function is monotone in time, the objective maximizes this ELBO; non-monotone weightings break the equivalence!
        • e.g., v-prediction (Salimans & Ho, ICLR 2022) is non-monotone!
      • Accordingly, the authors reach SOTA-comparable performance using a fairly simple monotone weighting
    • So what's the deal??
      • The newfound equivalence between monotonic weighting and the ELBO with data augmentation allows for a direct apples-to-apples comparison of diffusion models with other likelihood-based models.
      • For example, it allows one to optimize other likelihood-based models, such as autoregressive transformers, towards the same objective as monotonically weighted diffusion models.
      • This would shine light on whether diffusion models are better or worse than other model types, as measured in terms of their objective as opposed to FID scores.
  • Consistency Models

    • OpenAI (with Yang Song!)
    • TL;DR: a new family of generative models that achieve high sample quality without adversarial training
    • How? Train a network that maps any point at any time to the initial point of its ODE trajectory! (a minimal sketch follows below)
    • Self-consistency: points on the same trajectory map to the same initial point.
    • Consistency models allow us to generate data samples (initial points of ODE trajectories, e.g., x0 in Fig. 1) by converting random noise vectors (endpoints of ODE trajectories, e.g., xT in Fig. 1) with only one network evaluation.
      • Importantly, ... we can improve sample quality and perform zero-shot data editing at the cost of more compute, similar to what iterative refinement enables for diffusion models.

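A toy sketch of the self-consistency idea and one-step sampling. The parameterization, boundary condition, and time discretization are simplified relative to the paper, so treat this as an illustration only.

import torch
import torch.nn.functional as F

def consistency_sample(f_theta, shape, T=80.0):
    """One-step generation: push pure noise at time T through the consistency model."""
    x_T = torch.randn(shape) * T
    return f_theta(x_T, torch.full((shape[0],), T))

def _bcast(t, x):
    # Reshape a per-sample time value so it broadcasts over image dimensions.
    return t.view(-1, *([1] * (x.dim() - 1)))

def consistency_training_step(f_theta, f_ema, x0, t_n, t_np1):
    """Enforce self-consistency between two adjacent points on the same trajectory."""
    noise = torch.randn_like(x0)
    x_tn = x0 + _bcast(t_n, x0) * noise      # same trajectory, earlier time
    x_tnp1 = x0 + _bcast(t_np1, x0) * noise  # same trajectory, later time
    with torch.no_grad():
        target = f_ema(x_tn, t_n)            # EMA "teacher" network
    return F.mse_loss(f_theta(x_tnp1, t_np1), target)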

@jwlee-neubla

jwlee-neubla commented Mar 12, 2023

News

Papers

  • Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
    • Problems with MAE
      • Language tokens carry informative representations, but image pixel values do not (and are redundant)
      • Long training time (1600 epochs) caused by the small, asymmetric decoder architecture
      • Because of these issues, MAE fails to learn sufficiently strong representations (see the figures below)
    • Improves MAE by learning both high-level features and low-level (RGB pixel) features; a rough sketch of the combined objective follows below

[figures from the paper omitted; the reference figure shows the original masked autoencoder]
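A rough sketch of what the combined objective could look like: MAE's pixel reconstruction plus a feature-mimicking term toward a frozen high-level teacher (e.g. a CLIP image encoder). Which tokens each term attaches to and the loss weight follow the paper; the masks and weight here are placeholders.

import torch
import torch.nn.functional as F

def mr_mae_loss(pred_pixels, target_pixels, recon_mask,
                student_feats, teacher_feats, mimic_mask, mimic_weight=1.0):
    """recon_mask / mimic_mask: (B, N) with 1 for the patches each loss applies to."""
    rm, mm = recon_mask.float(), mimic_mask.float()
    # Low-level branch: MAE-style pixel reconstruction, averaged over selected patches.
    pixel_loss = (((pred_pixels - target_pixels) ** 2).mean(-1) * rm).sum() / rm.sum()
    # High-level branch: mimic frozen teacher features via cosine distance.
    mimic = 1 - F.cosine_similarity(student_feats, teacher_feats.detach(), dim=-1)
    mimic_loss = (mimic * mm).sum() / mm.sum()
    return pixel_loss + mimic_weight * mimic_loss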

@veritas9872

veritas9872 commented Mar 12, 2023

Hyena Hierarchy: Towards Larger Convolutional Language Models

Blog: https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
ArXiv: https://arxiv.org/abs/2302.10866
GitHub: https://github.com/HazyResearch/safari

