This is an attempt at building a generic CSV to dataset solution for offline exploration and management of various CSV files.
- Many CSV files are available on open data sites such as Data.gov and Kaggle.com.
- Managing these as CSV files in a file/folder structure on your local disk is complex and consumes a lot of disk space.
- Finding these random CSV files across the internet and bookmarking them is also time-consuming.
- What if you had a locally deployed web app to upload the files and add metadata about them?
- What if you could achieve a 60%-90%+ compression of the data from its original file size using the Apache Avro format with binary compression instead of CSV?
- What if you could expose the data stored in the Avro files through a generic, JSON-friendly REST API or GraphQL interface?
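One way to get a feel for the compression claim is to measure how much plain deflate shrinks a CSV sample. The sketch below is not Avro and is not taken from this project: `CsvCompressionEstimate` and its methods are hypothetical names, and it uses only the JDK's `java.util.zip` classes. Actual savings with Avro depend on the data, the schema, and the codec chosen, so treat the result as a rough indication only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper (not part of the project): estimates deflate savings on CSV text.
final class CsvCompressionEstimate {

    // Returns the size in bytes of the UTF-8 encoded text after deflate compression.
    static int deflatedSize(String csvText) {
        byte[] raw = csvText.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // A caller-supplied Deflater is not released by the stream, so end it here.
            deflater.end();
        }
        return sink.size();
    }

    // Returns the fraction of the original size saved (0.8 means 80% smaller).
    static double savings(String csvText) {
        int original = csvText.getBytes(StandardCharsets.UTF_8).length;
        if (original == 0) {
            return 0.0;
        }
        return 1.0 - (double) deflatedSize(csvText) / original;
    }
}
```

Calling `CsvCompressionEstimate.savings(text)` on the contents of a sample CSV file gives a quick number to compare against the size of the Avro file produced from the same data.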
The project is still under development, but some initial technical design thoughts are:
- Use Spring Boot to host a server-side web application with Spring's embedded Tomcat web server and Thymeleaf templates for server-side data rendering.
- Make CSV upload, preview, and storage into an Avro file a seamless experience.
- Use a generic data model that can handle CSV files of up to 100 MB and 50 columns. The file size limit can be changed via a property.
- The web application lets you "Manage Datasets": it reads the data from Avro files on disk, previews it quickly, and lets you add metadata/tags to organize the datasets.
- Based on each dataset created from a CSV file, automatically make the data available via a REST API or GraphQL interface.
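As a rough illustration of the generic data model idea above, the sketch below stores every row as a map from column name to string value and rejects inputs with more than 50 columns. It is a minimal sketch under stated assumptions, not the project's actual model: `GenericDataset`, `MAX_COLUMNS`, and `fromCsv` are hypothetical names, and the naive comma split stands in for OpenCSV only to keep the example dependency-free (it does not handle quoted fields).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical generic model: every value is kept as a string, keyed by column name.
final class GenericDataset {
    // Column limit mentioned in the design notes; the constant name is an assumption.
    static final int MAX_COLUMNS = 50;

    final List<String> columns;
    final List<Map<String, String>> rows = new ArrayList<>();

    GenericDataset(List<String> columns) {
        if (columns.size() > MAX_COLUMNS) {
            throw new IllegalArgumentException("Too many columns: " + columns.size());
        }
        this.columns = List.copyOf(columns);
    }

    // Adds one row; missing trailing values are stored as empty strings.
    // Note: duplicate column names would overwrite each other in the map.
    void addRow(List<String> values) {
        if (values.size() > columns.size()) {
            throw new IllegalArgumentException("Row has more values than columns");
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            row.put(columns.get(i), i < values.size() ? values.get(i) : "");
        }
        rows.add(row);
    }

    // Naive parser for illustration only: splits on commas and ignores quoting.
    static GenericDataset fromCsv(String csvText) {
        String[] lines = csvText.split("\\R");
        GenericDataset dataset = new GenericDataset(List.of(lines[0].split(",", -1)));
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            dataset.addRow(List.of(lines[i].split(",", -1)));
        }
        return dataset;
    }
}
```

Because each row is a plain map, a structure like this serializes naturally to JSON, which is one reason a generic model fits the REST/GraphQL goal.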
This project uses
- JDK - Azul Zulu JDK 17
- IDE - IntelliJ IDEA Community Edition
- Build - Gradle 7.5
- Spring Boot 3.0.0
- Thymeleaf for HTML templates
- OpenCSV 5.7.0
- Apache Spark 2.13
- Apache Avro 1.11
- JUnit 4.13.2
- Git clone the repo via HTTPS or SSH.
- Import the project into IntelliJ IDEA Community Edition.
- Refresh the Gradle tasks or run `./gradlew build` from the command line.
- From IDEA, "Run" the main class `Csv2DatasetApplication` to start the embedded Tomcat server.
- Access the application at http://localhost:8080
The project was created from Spring Initializr.
This is where you can upload any CSV file to process into Avro files.
This page shows you a preview of the rows before you process the import. It also allows you to specify a dataset name and other metadata to store for the Avro schema.
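A preview like this only needs the first few lines of the upload, so the whole file does not have to be read. The sketch below shows that idea with JDK classes only; `CsvPreview` and `firstLines` are hypothetical names rather than the project's code, and reading line by line assumes no quoted field contains a line break.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the header line plus a limited number of data lines.
final class CsvPreview {

    // Returns the header and at most maxRows data lines, stopping early
    // so the remainder of a large file is never read.
    static List<String> firstLines(Reader source, int maxRows) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // "<= maxRows" leaves room for the header line in addition to the data lines.
            while (lines.size() <= maxRows && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

For an uploaded file, the `Reader` could wrap the upload's input stream, and the returned lines could then be split into cells for display in the preview table.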
This page has a dropdown of all the imported Avro files. Clicking "View" shows the file metadata and the full data, read with Avro's DataFileReader, which makes the page load quickly.