- Create an S3 bucket in your AWS account.
- Create an EC2 instance and attach an IAM role that has permission to access the S3 bucket (the role is set under Advanced details while creating the EC2 instance). A sketch of the policy follows.
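For the role's permissions, something like the following inline policy is enough for the upload script later in this post. This is a minimal sketch using boto3; the role and policy names here are hypothetical (the role can just as well be created in the console), and the bucket name matches the one used in the upload script below.

import json

import boto3

iam = boto3.client("iam")

# Allow object uploads/downloads plus bucket listing. The bucket name
# matches the upload script below; the role/policy names are hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": "arn:aws:s3:::deependu-my-personal-projects/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::deependu-my-personal-projects",
        },
    ],
}

iam.put_role_policy(
    RoleName="ec2-s3-dataset-role",
    PolicyName="s3-dataset-access",
    PolicyDocument=json.dumps(policy),
)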
- Install the kaggle CLI on the EC2 instance.
pip install kaggle
- Store the authentication key on the EC2 instance (see the kaggle api docs); the placement commands follow the sample below.
- Sample API key as stored:
ubuntu@ip-172-31-88-59:~$ cat .kaggle/kaggle.json
{"username":"deependu__","key":"YOUR_API_KEY"}
- Download the dataset from Kaggle:
kaggle datasets download -d adityajn105/flickr8k
- Once the data is downloaded on the EC2 instance, extract the zip.
sudo apt install unzip
unzip flickr-image-dataset.zip -d ~/image-caption-dataset
- Upload the extracted dataset to AWS S3.
import os
from os import walk

import boto3

s3 = boto3.client("s3")

image_folder = "flickr30k_images"
destination = "datasets"
bucket = "deependu-my-personal-projects"

# Walk the image folder and upload every file, mirroring the local
# layout under the "datasets/" prefix in S3.
for dirpath, dirnames, filenames in walk(image_folder):
    for idy, file_name in enumerate(filenames):
        print("\n" + "-" * 70)
        print(f"{idy=}")
        local_path = os.path.join(dirpath, file_name)
        s3_path = os.path.join(destination, local_path)
        print(f"uploading to s3 path: {s3_path}")
        print(f"uploading file: {local_path}")
        s3.upload_file(local_path, bucket, s3_path)

# The captions file sits next to the image folder; upload it as well.
file_name = "results.csv"
s3.upload_file(file_name, bucket, os.path.join(destination, file_name))

print("done")
- Verify the dataset is uploaded to the S3 bucket; a small check script follows.
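One way to verify is to compare the object count under the upload prefix with the local file count. A minimal sketch, reusing the bucket and prefix from the upload script above:

import os

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Count objects under the destination prefix used by the upload script.
uploaded = 0
for page in paginator.paginate(
    Bucket="deependu-my-personal-projects", Prefix="datasets/flickr30k_images/"
):
    uploaded += len(page.get("Contents", []))

# Compare against the number of files on disk.
local = len(os.listdir("flickr30k_images"))
print(f"uploaded={uploaded} local={local}")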
- The data was mostly clean, but one of the captions was missing from the CSV file, which caused a lot of issues that were hard to track down.
- So that row was removed from the CSV file, and the updated CSV was uploaded to S3.
- Here's the Jupyter notebook that was used to remove the row and make some minor changes to the CSV file: notebook link. A rough sketch of the cleanup step follows.
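The notebook linked above is the actual cleanup; as a rough sketch of what it does, assuming the pipe-delimited layout of the Flickr30k results.csv (the column names here are assumptions -- adjust them to the actual header):

import pandas as pd

# results.csv in the Flickr30k dump is pipe-delimited; strip stray spaces
# from the header before filtering. Column names are assumed, not verified.
df = pd.read_csv("results.csv", sep="|")
df.columns = [c.strip() for c in df.columns]

# Drop the row(s) with a missing caption and write the cleaned file back,
# ready to be re-uploaded to S3.
cleaned = df.dropna(subset=["comment"])
print(f"removed {len(df) - len(cleaned)} row(s)")
cleaned.to_csv("results.csv", sep="|", index=False)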