Skip to content

III. Data Preparation

Maren Eckhoff edited this page Sep 2, 2019 · 1 revision

Overview


This page discusses the processing of raw data, data cleaning steps for creating the intermediate layer, and the data pre-processing steps for creating the primary data layer used in subsequent modelling steps.



1. Importing Raw Data


The data used in this project live on BarefootLaw's MySQL database. The following tables are imported:


  • raw_fb_messages: messagestable, which contains Facebook messages exchanged between Barefootlaw and their beneficiaries. As of August 2019, this table contained over 37,000 records, and was structured as follows:
id thread_id user_id body created_at updated_at deleted_at from message_type facebook_message_id message_sent_date case_id
INTEGER INTEGER INTEGER STRING TIMESTAMP TIMESTAMP TIMESTAMP STRING STRING INTEGER TIMESTAMP INTEGER

  • raw_received_sms: received_sms table, which contains SMS messages received by BarefootLaw. As of August 2019, this table contained a bit over 1,000 messages.
id sender message date_sent created_at updated_at recipient message_type is_read case_id
INTEGER STRING STRING TIMESTAMP TIMESTAMP TIMESTAMP STRING STRING STRING INTEGER

  • raw_sent_sms: sent_sms table, which contains SMS messages sent by BarefootLaw. As of August 2019, this table contained close to 16,000 messages.
id recipient message created_at updated_at date_sent template_id staff_id sender message_type
INTEGER STRING STRING TIMESTAMP TIMESTAMP TIMESTAMP STRING STRING STRING STRING

  • raw_cases: cases table, which contains text and metadata for the incoming requests. As of August 2019, this table contained close to 4,000 rows.
id customer_id beneficiary_interest case_category consultation_highlights case_status follow_up_required date_of_next_followup mode_of_next_followup created_at updated_at created_by laywer_summary service_delivery user_facebook_id facebook_message_id type created_date
INTEGER INTEGER STRING INTEGER STRING STRING STRING TIMESTAMP STRING TIMESTAMP TIMESTAMP INTEGER STRING STRING INTEGER INTEGER STRING STRING

2. Creating the Intermediate Layer


In this step, columns from the raw data tables are dropped and others renamed to create consistency across tables. Null rows are also removed. Tables are reformatted as follows:

  • intermediate_fb_messages:
thread_id body from message_sent_date case_id
INTEGER STRING STRING TIMESTAMP INTEGER

  • intermediate_received_sms:
sender body message_date_sent case_id
STRING STRING TIMESTAMP INTEGER

  • intermediate_sent_sms: Table that contains SMS messages sent by BarefootLaw
recipient body message_date_sent
STRING STRING TIMESTAMP


3. Creating the Primary Layer


The goal of this step is to extract a single table of question-answer pairs from the intermediate layers. Two main challenges are addressed in the restructure_intermediate_input function:

  • A single Facebook thread can cover multiple conversations spanning over a long period of time. A message is flagged as starting a new conversation if more than 10 days have passed since the last message.
  • A single question on Facebook can be asked over a succession of messages. Multiple messages from a beneficiary or BarefootLaw are systematically concatenated within a conversation.

In this step:

  • The restructure_intermediate_input function is applied to intermediate_fb_conversations to create the primary_fb_conversations table.
  • The intermediate_received_sms and intermediate_sent_sms are also concatenated before applying the restructure_intermediate_inputfunction to create the primary_sms table
  • The primary_messages table is created by concatenating primary_fb_conversations and primary_sms. It is structured as follows:
thread_id num_of_conversations num_of_messages question beneficiary_name question_asked_time days_since_last_message answer answer_given_time days_taken_to_respond
INTEGER INTEGER INTEGER STRING STRING TIMESTAMP INTEGER STRING TIMESTAMP INTEGER