Upload a Kaggle dataset to an S3 bucket using an EC2 instance 🤴🏻

  1. Create an S3 bucket in your AWS account.
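If you'd rather create the bucket from code than the console, here's a minimal boto3 sketch; the bucket name matches the one used in the upload script further down:

import boto3

s3 = boto3.client("s3")

# Bucket names are globally unique; outside us-east-1 you must also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
s3.create_bucket(Bucket="deependu-my-personal-projects")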
  2. Create an EC2 instance and attach an IAM role that has permission to access the S3 bucket (set under Advanced details when launching the instance). (Screenshot: ec2-role-s3.)
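A quick sanity check (not part of the original steps): run this on the instance to confirm boto3 is picking up the attached role rather than local access keys:

import boto3

# Should print an ARN containing "assumed-role/<your-role-name>" when the
# instance role is attached correctly.
print(boto3.client("sts").get_caller_identity()["Arn"])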
  3. Install the Kaggle CLI on the EC2 instance:
pip install kaggle
  4. Store the Kaggle authentication key on the EC2 instance (see the Kaggle API docs); the CLI expects it at ~/.kaggle/kaggle.json with owner-only permissions (chmod 600 ~/.kaggle/kaggle.json).
  5. Sample API key as stored:
ubuntu@ip-172-31-88-59:~$ cat .kaggle/kaggle.json
{"username":"deependu__","key":"YOUR_API_KEY"}
  6. Download the dataset from Kaggle:
kaggle datasets download -d adityajn105/flickr8k
  7. Once the data is downloaded on the EC2 instance, extract the zip:
sudo apt install unzip
unzip flickr-image-dataset.zip -d ~/image-caption-dataset
  8. Upload the extracted dataset to AWS S3:
import os

import boto3

s3 = boto3.client("s3")

image_folder = "flickr30k_images"
destination = "datasets"  # S3 key prefix
bucket = "deependu-my-personal-projects"

# Walk the dataset folder and upload every file, mirroring the local
# layout under the "datasets/" prefix in S3.
for dirpath, dirnames, filenames in os.walk(image_folder):
    for idy, file_name in enumerate(filenames):
        print("\n" + "-" * 70)
        print(f"{idy=}")
        local_path = os.path.join(dirpath, file_name)  # dirpath, not image_folder, so nested folders work
        s3_path = os.path.join(destination, local_path)
        print(f"uploading file: {local_path}")
        print(f"uploading to s3 path: {s3_path}")
        s3.upload_file(local_path, bucket, s3_path)

print("done")
  9. Verify the dataset was uploaded to the S3 bucket, e.g. with the listing snippet below.
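One way to verify without leaving the terminal (a sketch using the same bucket and prefix as the upload script above):

import boto3

s3 = boto3.client("s3")

# List the first few objects under the "datasets/" prefix.
resp = s3.list_objects_v2(
    Bucket="deependu-my-personal-projects",
    Prefix="datasets/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
print("objects listed:", resp["KeyCount"])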

One issue that caused a lot of headaches 🤯


  • The data was mostly clean, but one of the captions was missing from the CSV file, which caused a lot of issues while tracking the problem down.

  • So that row was removed from the CSV file, and the updated CSV file was uploaded to S3.

  • Here's the Jupyter notebook that was used to remove the row and make some minor changes to the CSV file: notebook link
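For reference, a minimal pandas sketch of that cleanup; the pipe delimiter and the assumption that the caption is the last column are guesses about the layout of results.csv, so adjust to the actual file:

import pandas as pd

# Delimiter and column position are assumptions, not taken from the notebook.
df = pd.read_csv("results.csv", sep="|")
print("rows before:", len(df))

# Drop any row whose caption (assumed to be the last column) is missing.
df = df.dropna(subset=[df.columns[-1]])
print("rows after:", len(df))

df.to_csv("results_clean.csv", sep="|", index=False)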