-
Notifications
You must be signed in to change notification settings - Fork 4
III. Data Preparation
Maren Eckhoff edited this page Sep 2, 2019
·
1 revision
This page discusses the processing of raw data, data cleaning steps for creating the intermediate layer, and the data pre-processing steps for creating the primary data layer used in subsequent modelling steps.
The data used in this project live on BarefootLaw's MySQL database. The following tables are imported:
-
raw_fb_messages:
messages
table, which contains Facebook messages exchanged between Barefootlaw and their beneficiaries. As of August 2019, this table contained over 37,000 records, and was structured as follows:
id |
thread_id |
user_id |
body |
created_at |
updated_at |
deleted_at |
from |
message_type |
facebook_message_id |
message_sent_date |
case_id |
---|---|---|---|---|---|---|---|---|---|---|---|
INTEGER | INTEGER | INTEGER | STRING | TIMESTAMP | TIMESTAMP | TIMESTAMP | STRING | STRING | INTEGER | TIMESTAMP | INTEGER |
-
raw_received_sms:
received_sms
table, which contains SMS messages received by BarefootLaw. As of August 2019, this table contained a bit over 1,000 messages.
id |
sender |
message |
date_sent |
created_at |
updated_at |
recipient |
message_type |
is_read |
case_id |
---|---|---|---|---|---|---|---|---|---|
INTEGER | STRING | STRING | TIMESTAMP | TIMESTAMP | TIMESTAMP | STRING | STRING | STRING | INTEGER |
-
raw_sent_sms:
sent_sms
table, which contains SMS messages sent by BarefootLaw. As of August 2019, this table contained close to 16,000 messages.
id |
recipient |
message |
created_at |
updated_at |
date_sent |
template_id |
staff_id |
sender |
message_type |
---|---|---|---|---|---|---|---|---|---|
INTEGER | STRING | STRING | TIMESTAMP | TIMESTAMP | TIMESTAMP | STRING | STRING | STRING | STRING |
-
raw_cases:
cases
table, which contains text and metadata for the incoming requests. As of August 2019, this table contained close to 4,000 rows.
id |
customer_id |
beneficiary_interest |
case_category |
consultation_highlights |
case_status |
follow_up_required |
date_of_next_followup |
mode_of_next_followup |
created_at |
updated_at |
created_by |
laywer_summary |
service_delivery |
user_facebook_id |
facebook_message_id |
type |
created_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INTEGER | INTEGER | STRING | INTEGER | STRING | STRING | STRING | TIMESTAMP | STRING | TIMESTAMP | TIMESTAMP | INTEGER | STRING | STRING | INTEGER | INTEGER | STRING | STRING |
In this step, columns from the raw data tables are dropped and others renamed to create consistency across tables. Null rows are also removed. Tables are reformatted as follows:
- intermediate_fb_messages:
thread_id |
body |
from |
message_sent_date |
case_id |
---|---|---|---|---|
INTEGER | STRING | STRING | TIMESTAMP | INTEGER |
- intermediate_received_sms:
sender |
body |
message_date_sent |
case_id |
---|---|---|---|
STRING | STRING | TIMESTAMP | INTEGER |
- intermediate_sent_sms: Table that contains SMS messages sent by BarefootLaw
recipient |
body |
message_date_sent |
---|---|---|
STRING | STRING | TIMESTAMP |
The goal of this step is to extract a single table of question-answer pairs from the intermediate layers. Two main challenges are addressed in the restructure_intermediate_input
function:
- A single Facebook thread can cover multiple conversations spanning over a long period of time. A message is flagged as starting a new conversation if more than 10 days have passed since the last message.
- A single question on Facebook can be asked over a succession of messages. Multiple messages from a beneficiary or BarefootLaw are systematically concatenated within a conversation.
In this step:
- The
restructure_intermediate_input
function is applied tointermediate_fb_conversations
to create theprimary_fb_conversations
table. - The
intermediate_received_sms
andintermediate_sent_sms
are also concatenated before applying therestructure_intermediate_input
function to create theprimary_sms
table - The
primary_messages
table is created by concatenatingprimary_fb_conversations
andprimary_sms
. It is structured as follows:
thread_id |
num_of_conversations |
num_of_messages |
question |
beneficiary_name |
question_asked_time |
days_since_last_message |
answer |
answer_given_time |
days_taken_to_respond |
---|---|---|---|---|---|---|---|---|---|
INTEGER | INTEGER | INTEGER | STRING | STRING | TIMESTAMP | INTEGER | STRING | TIMESTAMP | INTEGER |