
II. Data Storage

Maren Eckhoff edited this page Sep 2, 2019 · 1 revision

Overview

Data used in this project come from a MySQL database managed by our partner BarefootLaw. They consist of messages exchanged between the organization and its beneficiaries. The MySQL database is queried to produce the Raw layer (1). An Intermediate layer (2) and a Primary layer (3) are then produced and stored on the virtual machine hosted by BarefootLaw. Files in those layers are stored as Parquet files (with a .pq extension) under the /datadrive folder.

The raw data are never edited. The intermediate layer contains data from multiple tables that are cleaned and saved in the format that the DSSG team needs. The primary layer is built from the intermediate layer and combines all messages into a single table. Data in the primary layer will be used to train and test the Machine Learning (ML) algorithms.

Storage details can be found and edited in the conf/catalog.yml file of this repository.


1. Raw Layer

The raw data is queried from the BarefootLaw MySQL database. The following tables are queried:

  • raw_fb_messages: Table that contains Facebook messages exchanged between BarefootLaw and their beneficiaries
  • raw_received_sms: Table that contains SMS messages received by BarefootLaw
  • raw_sent_sms: Table that contains SMS messages sent by BarefootLaw
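
As a rough illustration of a read-only extraction step, the sketch below pulls every row from each of the three tables through a generic Python DB-API connection. It assumes the names listed above are the actual table names in the database. The function name `load_raw_tables` is hypothetical, and the project's real loading mechanism (configured in conf/catalog.yml) may differ.

```python
# Assumption: the three names below match the MySQL table names.
RAW_TABLES = ("raw_fb_messages", "raw_received_sms", "raw_sent_sms")


def load_raw_tables(conn, tables=RAW_TABLES):
    """Read every row of each raw table; nothing is written back."""
    raw = {}
    for table in tables:
        # Table names cannot be bound as SQL parameters, so only
        # names from the fixed allow-list are interpolated.
        if table not in RAW_TABLES:
            raise ValueError(f"unexpected table name: {table!r}")
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM {table}")
        columns = [col[0] for col in cur.description]
        raw[table] = [dict(zip(columns, row)) for row in cur.fetchall()]
        cur.close()
    return raw
```

Any DB-API 2.0 connection object (for example one returned by a MySQL driver) should work here, since only `cursor()`, `execute()`, `description` and `fetchall()` are used. Issuing only SELECT statements keeps the step consistent with the rule that raw data are never edited.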

2. Intermediate Layer

The raw data is processed to ensure Facebook messages and SMS are stored in the same format, with matching column names. Columns that are not being used in the model are dropped at this stage. The parquet files are saved at the following paths:

  • intermediate_fb_messages: /datadrive/int/fb_messages.pq
  • intermediate_received_sms: /datadrive/int/received_sms.pq
  • intermediate_sent_sms: /datadrive/int/sent_sms.pq
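
The harmonization described in this section can be sketched as a rename-and-drop step. The example below is a minimal sketch, not the project's code: every column name in it (`message_text`, `body`, `contact`, and so on) is hypothetical, because this page does not list the real schemas, and rows are represented as plain dictionaries for simplicity.

```python
# Hypothetical mappings from source column names to shared names.
# The real column names are not documented on this page.
FB_COLUMNS = {
    "message_text": "text",
    "created_time": "timestamp",
    "sender_id": "contact",
}
SMS_COLUMNS = {
    "body": "text",
    "received_at": "timestamp",
    "phone": "contact",
}


def to_intermediate(rows, column_map):
    """Rename mapped columns and drop every column not in the mapping."""
    return [
        {new: row[old] for old, new in column_map.items()}
        for row in rows
    ]
```

After this step, Facebook and SMS rows share the same keys, so later stages can treat both sources uniformly.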

3. Primary Layer

In the Primary layer, Facebook messages and SMS are each processed to build question-answer pairs. The resulting Facebook and SMS tables are then merged to produce the full messages table. The parquet files are saved at the following paths:

  • primary_fb_conversations: /datadrive/prm/fb_conversations.pq
  • primary_sms: /datadrive/prm/sms.pq
  • primary_messages (all messages): /datadrive/prm/messages.pq
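
This page does not specify how question-answer pairs are built, so the sketch below uses one simple pairing rule as an assumption: each received message is matched with the earliest later reply sent to the same contact. The function names and the `contact`, `timestamp`, `text` and `source` fields are all hypothetical.

```python
def build_qa_pairs(received, sent):
    """Pair each received message with the earliest later reply
    to the same contact (an assumed rule, for illustration only)."""
    pairs = []
    for question in sorted(received, key=lambda m: m["timestamp"]):
        replies = [
            m for m in sent
            if m["contact"] == question["contact"]
            and m["timestamp"] > question["timestamp"]
        ]
        if replies:
            answer = min(replies, key=lambda m: m["timestamp"])
            pairs.append({
                "contact": question["contact"],
                "question": question["text"],
                "answer": answer["text"],
            })
    return pairs


def merge_sources(fb_pairs, sms_pairs):
    """Combine both channels into one table, tagging each row's origin."""
    return (
        [dict(p, source="facebook") for p in fb_pairs]
        + [dict(p, source="sms") for p in sms_pairs]
    )
```

Tagging each row with its source keeps the channel recoverable after the merge, which can be useful when evaluating a model separately on Facebook and SMS data.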