II. Data Storage
The data used in this project come from a MySQL database managed by our partner BarefootLaw. They consist of messages exchanged between the organization and its beneficiaries. The MySQL database is queried to produce the Raw layer (1). An Intermediate layer (2) and a Primary layer (3) are then produced and stored on the virtual machine hosted by BarefootLaw. Files in those layers are stored as Parquet files (.pq extension) in the /datadrive folder.
The raw data are never edited. The intermediate layer contains data from multiple tables that are cleaned and saved in the format that the DSSG team needs. The primary data layer is created from the intermediate layer and produces a single table for all messages. Data in the primary layer will be used in training/testing the Machine Learning (ML) algorithms.
Storage details can be found and edited in the conf/catalog.yml file of this repository.
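As a quick reference, the dataset names and paths listed on this page can be summarized as a small lookup. This is only an illustrative Python sketch: the `CATALOG` dictionary and the `layer_of` helper are hypothetical names, and the snippet does not reproduce the actual contents or syntax of conf/catalog.yml.

```python
from pathlib import PurePosixPath

# Root folder on the virtual machine, as stated on this page.
DATA_ROOT = PurePosixPath("/datadrive")

# Hypothetical in-memory summary of the datasets listed on this page.
# The real definitions live in conf/catalog.yml and may carry more options.
CATALOG = {
    "intermediate_fb_messages": DATA_ROOT / "int" / "fb_messages.pq",
    "intermediate_received_sms": DATA_ROOT / "int" / "received_sms.pq",
    "intermediate_sent_sms": DATA_ROOT / "int" / "sent_sms.pq",
    "primary_fb_conversations": DATA_ROOT / "prm" / "fb_conversations.pq",
    "primary_sms": DATA_ROOT / "prm" / "sms.pq",
    "primary_messages": DATA_ROOT / "prm" / "messages.pq",
}


def layer_of(dataset_name: str) -> str:
    """Return the layer folder ("int" or "prm") that holds a dataset."""
    return CATALOG[dataset_name].parent.name
```

For example, `layer_of("primary_messages")` returns the name of the folder that contains the merged messages table.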
The raw data are queried from the BarefootLaw MySQL database. The following tables are queried:
- raw_fb_messages: Table that contains Facebook messages exchanged between BarefootLaw and their beneficiaries
- raw_received_sms: Table that contains SMS messages received by BarefootLaw
- raw_sent_sms: Table that contains SMS messages sent by BarefootLaw
In the Intermediate layer, the raw data are processed so that Facebook messages and SMS are stored in the same format, with matching column names. Columns that are not used in the model are dropped at this stage. The Parquet files are saved at the following paths:
- intermediate_fb_messages: /datadrive/int/fb_messages.pq
- intermediate_received_sms: /datadrive/int/received_sms.pq
- intermediate_sent_sms: /datadrive/int/sent_sms.pq
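The harmonization step described above can be sketched as follows. This is a minimal illustration on plain Python dictionaries, not the project's actual code: every column name in `FB_COLUMNS` and `SMS_COLUMNS`, the shared schema, and the `to_shared_format` helper are assumptions made for the example, and the real pipeline may well operate on data frames instead.

```python
# Hypothetical mappings from raw column names to a shared schema.
# The real raw tables may use different column names.
FB_COLUMNS = {
    "message_id": "id",
    "from_name": "sender",
    "message": "text",
    "created_time": "sent_at",
}
SMS_COLUMNS = {
    "sms_id": "id",
    "phone": "sender",
    "body": "text",
    "received_at": "sent_at",
}


def to_shared_format(row: dict, column_map: dict, channel: str) -> dict:
    """Rename the mapped columns, drop all others, and tag the channel."""
    # Only columns present in column_map survive, so unused columns are dropped.
    out = {new: row[old] for old, new in column_map.items()}
    out["channel"] = channel
    return out
```

After this step, a Facebook row and an SMS row expose the same keys, which is what allows later layers to treat both channels uniformly.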
In the Primary layer, Facebook messages and SMS are processed to build question-answer pairs. The two resulting tables are then merged to produce the full messages table. The Parquet files are saved at the following paths:
- primary_fb_conversations: /datadrive/prm/fb_conversations.pq
- primary_sms: /datadrive/prm/sms.pq
- primary_messages (all messages): /datadrive/prm/messages.pq
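One possible way to build question-answer pairs and merge the two channels is sketched below. The pairing rule (each incoming message is matched with the next outgoing reply in the same conversation) is an assumption, as are the record keys `conversation_id`, `sent_at`, `direction`, and `text` and the function names; the project's actual logic may differ.

```python
def build_qa_pairs(messages: list) -> list:
    """Pair each incoming message with the next outgoing reply.

    Each message is a dict with hypothetical keys: conversation_id,
    sent_at (sortable timestamp), direction ("received" or "sent"), text.
    """
    pairs = []
    pending = {}  # conversation_id -> text of the latest unanswered question
    ordered = sorted(messages, key=lambda m: (m["conversation_id"], m["sent_at"]))
    for msg in ordered:
        conv = msg["conversation_id"]
        if msg["direction"] == "received":
            # A newer incoming message replaces an earlier unanswered one.
            pending[conv] = msg["text"]
        elif conv in pending:
            pairs.append(
                {
                    "conversation_id": conv,
                    "question": pending.pop(conv),
                    "answer": msg["text"],
                }
            )
        # Outgoing messages with no pending question are ignored.
    return pairs


def merge_channels(fb_pairs: list, sms_pairs: list) -> list:
    """Concatenate both channels into one table, tagging the source channel."""
    tagged_fb = [dict(pair, channel="facebook") for pair in fb_pairs]
    tagged_sms = [dict(pair, channel="sms") for pair in sms_pairs]
    return tagged_fb + tagged_sms
```

Keeping a `channel` field in the merged table is a design choice made for this sketch so that the origin of each pair remains traceable after the merge.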