This project leverages Component's dataset of Bandcamp transactions from September 9, 2020 to October 2, 2020 to gain insights into sales and evaluate the effectiveness of Bandcamp's business model. The insights obtained here are similar to those presented in Component's report, The Chaos Bazaar. While that report likely provides deeper insights than this project, the project owner undertook it to hone their data engineering skills, including pipeline orchestration, ELT with DBT, Terraform, and other related techniques.
With that said, I would also like to express my appreciation to DataTalks.Club for providing me with the opportunity to learn data engineering in a structured and effective way, as well as for offering many chances to practice and improve my skills. Thanks to their resources and support, I have been able to learn and test my abilities in Docker, pipeline orchestration, Apache Spark for batch processing, DBT for ELT, Kafka for stream processing, and related techniques.
- Terraform for IaC
- Prefect for pipeline orchestration
- Polars for batch processing
- BigQuery for data warehouse
- DBT for ELT
- Looker Studio for reporting and visualization
- `_id`: Unique identifier combining the sale's URL and UTC timestamp
- `transaction_date`: Transaction datetime
- `url`: Path to the item on Bandcamp
- `artist_name`: Name of the artist
- `album_title`: Title of the album, if applicable
- `item_type`: Type of object transacted: physical album, digital album, or digital track
- `slug_type`: Type of object transacted: album, track, or merch
- `country`: Country of the buyer
- `item_price`: Item price set by the seller, in the seller's currency
- `amount_paid`: Amount paid, in the seller's currency
- `currency`: Currency used by the seller
- `item_price_usd`: Item price converted to US dollars
- `amount_paid_usd`: Amount paid converted to US dollars
- `amount_overpaid_usd`: Amount voluntarily overpaid by the buyer, in US dollars
- `paid_to_price_ratio`: Ratio of amount paid to item price
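To make the column list concrete, here is a minimal Polars schema sketch of the dataset. The dtypes are illustrative assumptions; the types actually used by the pipeline may differ.

```python
import polars as pl

# Approximate schema of the Bandcamp sales dataset described above.
# Dtypes are assumptions for illustration, not the pipeline's authoritative types.
SALES_SCHEMA = {
    "_id": pl.Utf8,                   # sale URL + UTC timestamp
    "transaction_date": pl.Datetime,  # transaction datetime
    "url": pl.Utf8,                   # path to the item on Bandcamp
    "artist_name": pl.Utf8,
    "album_title": pl.Utf8,           # null for non-album items
    "item_type": pl.Utf8,             # physical album, digital album, or digital track
    "slug_type": pl.Utf8,             # album, track, or merch
    "country": pl.Utf8,
    "item_price": pl.Float64,         # seller's currency
    "amount_paid": pl.Float64,        # seller's currency
    "currency": pl.Utf8,
    "item_price_usd": pl.Float64,
    "amount_paid_usd": pl.Float64,
    "amount_overpaid_usd": pl.Float64,
    "paid_to_price_ratio": pl.Float64,
}

# Example usage with a hypothetical local extract:
# df = pl.read_parquet("data/bandcamp_sales.parquet").cast(SALES_SCHEMA)
```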
To ensure that the project runs smoothly, you should have the following prerequisites in place:
- A GCP Account
- A GCP Service Account file with owner rights.
To set up the project, perform these steps:
- Clone the repository and place the service account file in the `config/` folder as `.secret.json`.
- Prepare your GCP infrastructure:
  - You can manually create a GCS bucket and a BQ dataset named `bandcamp`.
  - Alternatively, you can use Terraform:
    - Navigate to the `terraform/` directory using `cd`.
    - Modify the configuration in `terraform/variables.tf`.
    - Execute `terraform init`, then `terraform plan`, and review the plan. Finally, execute `terraform apply`.
- Set up a virtual environment using either virtualenv or Anaconda. Install all dependencies listed in `requirements.txt`, and activate the environment.
- To use Prefect for orchestration, do the following:
  - Navigate to the `orchest/` directory using `cd`, and install the flows by running `pip install -e .` in the virtual environment.
  - Create two Prefect blocks in Orion: a GCP Credentials block that points to the service account file, and a Local File System block that points to the local data folder where the data files are located (a sketch for registering these blocks from code is shown after these setup steps).
  - Modify the constant variables in `flows/web_2_local_2_gcs.py` and `flows/gcs_2_bq.py` as needed.
  - Deploy all flows by running the following commands:

    ```
    prefect deployment build -a flows/web_2_local_2_gcs.py:main --name web_2_gcs
    prefect deployment build -a flows/gcs_2_bq.py:main --name gcs_2_bq
    ```

  - Run a Prefect agent, then execute the `web_2_gcs` and `gcs_2_bq` flows in order, either from Prefect Orion or from the command line.
- To use DBT for ELT, do the following:
  - Modify `~/.dbt/profiles` using the GCP service account file, based on the instructions in the DBT BigQuery setup section of the documentation.
  - Execute `dbt run` in the `dbt/` directory.
  - To generate documentation, execute `dbt docs generate`, then `dbt docs serve`.
- To visualize the data with Looker Studio, follow these steps:
  - Open Looker Studio using the same account where your BigQuery dataset is located.
  - Connect the data and create visualizations.
  - You're done!
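As mentioned in the Prefect step above, the two blocks can also be registered from code instead of the Orion UI. This is a minimal sketch; the block names (`gcp-creds`, `local-data`) and paths are placeholders and should match whatever the flows' constants expect.

```python
# Hypothetical one-off script to register the two Prefect blocks from code
# rather than through the Orion UI. Block names and paths are placeholders.
from prefect.filesystems import LocalFileSystem
from prefect_gcp import GcpCredentials

# GCP Credentials block pointing to the service account file
GcpCredentials(service_account_file="config/.secret.json").save(
    "gcp-creds", overwrite=True
)

# Local File System block pointing to the local data folder
LocalFileSystem(basepath="data/").save("local-data", overwrite=True)
```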
- The columns chosen for partitioning and clustering, along with the reasoning behind those choices, can be seen in `orchest/flows/gcs_2_bq.py`.
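Below is a minimal sketch of a partitioned and clustered load from GCS into BigQuery using the google-cloud-bigquery client. The partition column (`transaction_date`), clustering columns (`country`, `item_type`), and the bucket/table names are illustrative assumptions; the actual choices and their reasoning live in `orchest/flows/gcs_2_bq.py`.

```python
# Illustrative sketch only: partitioned and clustered load from GCS into BigQuery.
# Column, bucket, and table names are assumptions, not the project's actual values.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    # Partition by transaction date so date-ranged queries scan less data
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="transaction_date",
    ),
    # Cluster on columns commonly used in filters and aggregations
    clustering_fields=["country", "item_type"],
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/bandcamp/*.parquet",  # hypothetical GCS path
    "your-project.bandcamp.sales",          # hypothetical table id
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```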
- Prefect
  - One script to deploy all flows instead of using the CLI (see the sketch after this list).
  - Find a way to put hard-coded constants somewhere else, maybe in Prefect blocks.
  - Find a way to utilize DaskTaskRunner with Polars for blazingly fast ETL.
- DBT
  - Proper docs.
  - Add tests.
- Looker Studio
  - Utilize tables derived from the fact table.
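For the first Prefect item above, one option is a small script built on Prefect's `Deployment.build_from_flow` helper. This is only a sketch; it assumes both modules expose a `main` flow (as the CLI commands in the setup steps suggest), and the import paths and deployment names are placeholders.

```python
# Hypothetical deploy-all script replacing the per-flow CLI commands.
# Assumes both modules expose a `main` flow; import paths and names are placeholders.
from prefect.deployments import Deployment

from flows.web_2_local_2_gcs import main as web_2_local_2_gcs
from flows.gcs_2_bq import main as gcs_2_bq

FLOWS = [
    (web_2_local_2_gcs, "web_2_gcs"),
    (gcs_2_bq, "gcs_2_bq"),
]

if __name__ == "__main__":
    for flow, name in FLOWS:
        # apply=True registers the deployment with the Prefect API,
        # mirroring the -a flag of `prefect deployment build`
        Deployment.build_from_flow(flow=flow, name=name, apply=True)
```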