Google Research Datasets

Datasets released by Google Research

1.1k followers
Mountain View, CA
http://research.google

Pinned

natural-questions Public

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

Python, 1k stars, 156 forks
conceptual-captions Public

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

Shell, 542 stars, 27 forks
Objectron Public

Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

Jupyter Notebook, 2.3k stars, 261 forks
wit Public

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

1.1k stars, 44 forks
paws Public

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

Python, 560 stars, 54 forks
dstc8-schema-guided-dialogue Public

The Schema-Guided Dialogue Dataset

Python, 580 stars, 129 forks

Repositories

Showing 10 of 170 repositories

cultural_familiarity_annotations Public
The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.

1 Apache-2.0 1 0 0 Updated Aug 6, 2025
tydiqa-wana Public

Jupyter Notebook 0 Apache-2.0 0 0 0 Updated Jul 30, 2025
conceptual-12m Public
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

397 20 5 0 Updated Jul 14, 2025
sanpo_dataset Public

Python 43 Apache-2.0 2 5 3 Updated Jun 27, 2025
common-crawl-domain-names Public
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

18 MIT 2 0 0 Updated Jun 16, 2025
rag_conflicts Public
CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding conflict type label, and, for specific types, the ground-truth correct answer.

7 Apache-2.0 1 1 0 Updated Jun 11, 2025
wit-retrieval Public

5 0 1 0 Updated Jun 5, 2025
Amplify_SSA Public
An annotated dataset of 8,091 adversarial queries in seven Sub-Saharan African languages.

Jupyter Notebook 0 1 0 0 Updated May 1, 2025
egotempo Public

Jupyter Notebook 26 CC-BY-4.0 0 3 0 Updated Apr 26, 2025
artydiqa Public
ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA where models find answer spans or identify unanswerable questions, and a QG task involving formulating questions from context and answer pairs.

0 0 0 0 Updated Apr 23, 2025

Most used topics

nlp deep-learning nlp-machine-learning deep-neural-networks wikipedia
