[Feat] Download dataset by using MIM&OpenDataLab (open-mmlab#1630)
* add dataset.index

* update preprocess shell

* update shell

* update docs

* update docs
Ezra-Yu authored Jun 30, 2023
1 parent 8afad77 commit 59c0777
Showing 7 changed files with 100 additions and 1 deletion.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,4 +1,5 @@
include requirements/*.txt
include mmpretrain/.mim/model-index.yml
include mmpretrain/.mim/dataset-index.yml
recursive-include mmpretrain/.mim/configs *.py *.yml
recursive-include mmpretrain/.mim/tools *.py *.sh
11 changes: 11 additions & 0 deletions dataset-index.yml
@@ -0,0 +1,11 @@
imagenet1k:
  dataset: ImageNet-1K
  download_root: data
  data_root: data/imagenet
  script: tools/dataset_converters/odl_imagenet1k_preprocess.sh

cub:
  dataset: CUB-200-2011
  download_root: data
  data_root: data/CUB_200_2011
  script: tools/dataset_converters/odl_cub_preprocess.sh
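Each index entry pairs a dataset with a preprocessing script, and both scripts in this commit read `$1` as the download directory and `$2` as the data root. The sketch below shows one way those fields could be turned into a command line. It is an illustration only: `DATASET_INDEX` is a hand-written mirror of the YAML above, `build_preprocess_command` is a hypothetical helper, and the assumption that `download_root` and `data_root` are passed in that order is inferred from the scripts rather than from MIM's source.

```python
from typing import Dict, List

# Hand-written mirror of dataset-index.yml, so the sketch needs no YAML parser.
DATASET_INDEX: Dict[str, Dict[str, str]] = {
    "imagenet1k": {
        "dataset": "ImageNet-1K",
        "download_root": "data",
        "data_root": "data/imagenet",
        "script": "tools/dataset_converters/odl_imagenet1k_preprocess.sh",
    },
    "cub": {
        "dataset": "CUB-200-2011",
        "download_root": "data",
        "data_root": "data/CUB_200_2011",
        "script": "tools/dataset_converters/odl_cub_preprocess.sh",
    },
}


def build_preprocess_command(name: str) -> List[str]:
    """Hypothetical helper: build the argv for a dataset's preprocess script.

    Assumes the script takes the download directory as $1 and the data
    root as $2, as the two scripts in this commit do.
    """
    entry = DATASET_INDEX[name]
    return ["bash", entry["script"], entry["download_root"], entry["data_root"]]


if __name__ == "__main__":
    print(" ".join(build_preprocess_command("cub")))
```

The returned list could be handed to `subprocess.run`; whether MIM does so in this form is not shown in the commit.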
25 changes: 25 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -140,12 +140,37 @@ For a complete example about how to use the `CustomDataset`, please see [How to

ImageNet has multiple versions, but the most commonly used one is [ILSVRC 2012](http://www.image-net.org/challenges/LSVRC/2012/). It can be accessed with the following steps.

`````{tabs}
````{group-tab} Download by MIM
MIM supports downloading the ImageNet dataset from [OpenDataLab](https://opendatalab.com/) and preprocessing it with a single command.
_You need to register an account on the [OpenDataLab official website](https://opendatalab.com/) and log in via the CLI._
```Bash
# install OpenDataLab CLI tools
pip install -U opendatalab
# log in to OpenDataLab; register first if you don't have an account
odl login
# download and preprocess with MIM, preferably from the $MMPreTrain directory
mim download mmpretrain --dataset imagenet1k
```
````
````{group-tab} Download from Official Source
1. Register an account and log in to the [download page](http://www.image-net.org/download-images).
2. Find the download links for ILSVRC2012 and download the following two files:
   - ILSVRC2012_img_train.tar (~138GB)
   - ILSVRC2012_img_val.tar (~6.3GB)
3. Untar the downloaded files.
````
`````
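Step 3 of the official-source route needs some care for the training archive, because `ILSVRC2012_img_train.tar` contains one inner tar file per class rather than the images themselves. The following is a minimal sketch of that two-level extraction using only the standard library. The function name `extract_imagenet_train` and the output layout (one folder per class, named after the inner archive) are assumptions made for illustration, not part of MMPreTrain.

```python
import tarfile
from pathlib import Path


def extract_imagenet_train(train_tar: str, out_dir: str) -> int:
    """Unpack the outer archive, then each per-class archive into its own folder.

    Returns the number of class folders created. Assumes out_dir holds no
    unrelated *.tar files, since every *.tar found there is treated as a
    per-class archive.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # First level: the outer archive yields one tar file per class.
    with tarfile.open(train_tar) as outer:
        outer.extractall(out)

    count = 0
    for class_tar in sorted(out.glob("*.tar")):
        # Second level: unpack each class archive into <out_dir>/<class id>/.
        class_dir = out / class_tar.stem
        class_dir.mkdir(exist_ok=True)
        with tarfile.open(class_tar) as inner:
            inner.extractall(class_dir)
        class_tar.unlink()  # drop the inner archive once unpacked
        count += 1
    return count
```

A call such as `extract_imagenet_train("ILSVRC2012_img_train.tar", "data/imagenet/train")` would produce the subfolder layout; the destination path here is only an example.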

### The Directory Structure of the ImageNet dataset

We support two ways of organizing the ImageNet dataset: Subfolder Format and Text Annotation File Format.
25 changes: 25 additions & 0 deletions docs/zh_CN/user_guides/dataset_prepare.md
@@ -138,12 +138,37 @@ train_dataloader = dict(

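ImageNet has multiple versions, but the most commonly used one is [ILSVRC 2012](http://www.image-net.org/challenges/LSVRC/2012/). It can be used with the following steps.

`````{tabs}
````{group-tab} Download by MIM
MIM supports downloading the ImageNet dataset from [OpenDataLab](https://opendatalab.com/) and preprocessing it with a single command.
_You need to register an account on the [OpenDataLab official website](https://opendatalab.com/) and log in from the command line._
```Bash
# install the opendatalab library
pip install -U opendatalab
# log in to OpenDataLab; if you have not registered yet, register on the official website
odl login
# download the dataset with MIM, preferably from the $MMPreTrain directory
mim download mmpretrain --dataset imagenet1k
```
````
````{group-tab} Download from the Official Website
1. Register an account and log in to the [download page](http://www.image-net.org/download-images).
2. Find the download links for ILSVRC2012 and download the following two files:
   - ILSVRC2012_img_train.tar (~138GB)
   - ILSVRC2012_img_val.tar (~6.3GB)
3. Extract the downloaded images.
````
`````

### The Directory Structure of the ImageNet Dataset

We support two ways of organizing the ImageNet dataset: subfolder format and text annotation file format.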
2 changes: 1 addition & 1 deletion setup.py
@@ -117,7 +117,7 @@ def add_mim_extension():
     else:
         return

-    filenames = ['tools', 'configs', 'model-index.yml']
+    filenames = ['tools', 'configs', 'model-index.yml', 'dataset-index.yml']
     repo_path = osp.dirname(__file__)
     mim_path = osp.join(repo_path, 'mmpretrain', '.mim')
     os.makedirs(mim_path, exist_ok=True)
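The hunk only shows the file list and the creation of the `.mim` directory, so what happens to each entry afterwards is not visible here. As a rough illustration of why `dataset-index.yml` has to be in that list, the sketch below stages each listed file or directory into `<repo>/mmpretrain/.mim` by plain copying. The function `stage_mim_files` is hypothetical and simplified; it is not the body of `add_mim_extension`.

```python
import os
import os.path as osp
import shutil
from typing import List, Sequence


def stage_mim_files(
    repo_path: str,
    package: str = "mmpretrain",
    filenames: Sequence[str] = (
        "tools", "configs", "model-index.yml", "dataset-index.yml"),
) -> List[str]:
    """Hypothetical helper: copy listed entries into <repo>/<package>/.mim.

    Entries that do not exist in the repository are skipped. Returns the
    names that were actually staged, in the order they were listed.
    """
    mim_path = osp.join(repo_path, package, ".mim")
    os.makedirs(mim_path, exist_ok=True)

    staged = []
    for name in filenames:
        src = osp.join(repo_path, name)
        if not osp.exists(src):
            continue
        dst = osp.join(mim_path, name)
        if osp.isdir(src):
            # dirs_exist_ok requires Python 3.8 or newer.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copyfile(src, dst)
        staged.append(name)
    return staged
```

Under this reading, an index file missing from `filenames` would never reach the packaged `.mim` directory, which matches the `MANIFEST.in` change in the same commit.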
15 changes: 15 additions & 0 deletions tools/dataset_converters/odl_cub_preprocess.sh
@@ -0,0 +1,15 @@
#!/usr/bin/env bash

set -x

DOWNLOAD_DIR=$1
DATA_ROOT=$2

# extract all downloaded archives
cat "$DOWNLOAD_DIR"/CUB-200-2011/raw/*.tar.gz | tar -xvz -C "$DOWNLOAD_DIR"

# move data into DATA_ROOT, creating it first if it does not exist
mkdir -p "$DATA_ROOT"
mv -f "$DOWNLOAD_DIR"/CUB-200-2011/CUB-200-2011/* "$DATA_ROOT"/

# remove the downloaded files that are no longer needed
rm -R "$DOWNLOAD_DIR"/CUB-200-2011/
22 changes: 22 additions & 0 deletions tools/dataset_converters/odl_imagenet1k_preprocess.sh
@@ -0,0 +1,22 @@
#!/usr/bin/env bash

set -x

DOWNLOAD_DIR=$1
DATA_ROOT=$2

# extract all downloaded archive parts
cat "$DOWNLOAD_DIR"/ImageNet-1K/raw/*.tar.gz.* | tar -xvz -C "$DOWNLOAD_DIR"

# move images into DATA_ROOT (data/imagenet in dataset-index.yml)
mkdir -p "$DATA_ROOT"
mv "$DOWNLOAD_DIR"/ImageNet-1K/{train,val,test} "$DATA_ROOT"

# download the meta annotation files
wget -P "$DATA_ROOT" https://download.openmmlab.com/mmclassification/datasets/imagenet/meta/caffe_ilsvrc12.tar.gz

# extract the meta annotation files into the 'meta' folder
mkdir -p "$DATA_ROOT/meta"
tar -xzvf "$DATA_ROOT/caffe_ilsvrc12.tar.gz" -C "$DATA_ROOT/meta"

# remove the downloaded files that are no longer needed
rm -R "$DOWNLOAD_DIR/ImageNet-1K"
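The first step of the script relies on the download being one gzip-compressed tar split into several parts: `cat` joins the parts in glob order and `tar` reads the joined stream. A standard-library sketch of the same idea follows, for readers who want to see the mechanism without a shell. `ChainedReader` and `extract_split_archive` are hypothetical names, and sorting the part names lexicographically is an assumption about how the parts are numbered.

```python
import tarfile
from pathlib import Path
from typing import List, Optional, BinaryIO


class ChainedReader:
    """Read several files back to back, as `cat part1 part2 ...` would."""

    def __init__(self, paths: List[Path]) -> None:
        self._paths = list(paths)
        self._current: Optional[BinaryIO] = None

    def read(self, size: int = -1) -> bytes:
        chunks = []
        remaining = size
        while remaining != 0:
            if self._current is None:
                if not self._paths:
                    break  # every part has been consumed
                self._current = open(self._paths.pop(0), "rb")
            data = self._current.read(remaining)
            if not data:
                # Current part is exhausted; move on to the next one.
                self._current.close()
                self._current = None
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)

    def close(self) -> None:
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_split_archive(parts_dir, pattern: str, dest) -> int:
    """Extract a split .tar.gz whose parts match `pattern`; return part count."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no parts matching {pattern!r} in {parts_dir}")
    Path(dest).mkdir(parents=True, exist_ok=True)
    reader = ChainedReader(parts)
    try:
        # "r|gz" reads the archive as a forward-only stream, like a pipe.
        with tarfile.open(fileobj=reader, mode="r|gz") as archive:
            archive.extractall(dest)
    finally:
        reader.close()
    return len(parts)
```

Streaming avoids writing a joined copy of the archive to disk, which matters for an archive the size of ImageNet. A call mirroring the script would be `extract_split_archive("data/ImageNet-1K/raw", "*.tar.gz.*", "data")`.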
