
Database


2018 Bus Data

Overview of data provided

pending

Setting up the environment

The first step of the process was setting up the environment we would be using. This involved installing MySQL and Anaconda on the VM so that we could store and analyse the data.

The next step of the process involved figuring out how to run a Jupyter notebook from within the VM on UCD's server and access it on my local computer. After several failed attempts I found this page, which solved the problem.

I felt it would be helpful to have some sort of GUI when working with the SQL data, so the final step in setting up our environment was connecting MySQL Workbench to the database hosted on our VM. We accomplished this using Workbench's standard TCP/IP over SSH connection method.

Inserting data into the database

The data we were given consists of three files, one of which is over 10GB in size, and there are a number of interesting challenges in dealing with this much data. The VM's RAM was not sufficient to load the entire file into memory at once, so analysing all the data in a single dataframe wasn't really an option.

Initially, I attempted to load the data into the database by splitting it into chunks and inserting these chunks into the DB iteratively using the following code:

    import pandas as pd
    from sqlalchemy import create_engine

    # connection details below are placeholders, not the project's real credentials
    engine = create_engine('mysql+pymysql://user:password@localhost/bus_db')

    # read the 10GB+ file lazily in chunks of 1,000,000 rows rather than all at once
    df_chunk = pd.read_csv('rt_leavetimes_DB_2018.txt', sep=';', chunksize=1000000)

    for i, chunk in enumerate(df_chunk):
        print(i)  # track progress
        chunk.to_sql('rt_leavetimes', con=engine, if_exists='append')

This was taking far too long, however, and the Jupyter kernel kept breaking, so I started looking for another method.

Next, I attempted to load the entire file into a table using the LOAD DATA INFILE SQL command. At first I ran into the following error:
"ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement "
After a bit of googling, I realised the file had to be in the /var/lib/mysql-files/ directory. Once I moved the file, this appeared to be working, so I left the terminal running overnight on my computer in the hope that the table would be populated by morning.
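
For reference, the directory the server will accept files from can be checked by querying the secure_file_priv variable. A minimal Python sketch, with placeholder connection details:

    # Check which directory MySQL's --secure-file-priv restricts LOAD DATA INFILE to.
    # Connection details are placeholders, not the project's real credentials.
    import pymysql

    conn = pymysql.connect(host="localhost", user="user",
                           password="password", database="bus_db")
    with conn.cursor() as cursor:
        cursor.execute("SHOW VARIABLES LIKE 'secure_file_priv'")
        print(cursor.fetchone())  # e.g. ('secure_file_priv', '/var/lib/mysql-files/')
    conn.close()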

Unfortunately, in the morning I woke up to an error telling me there was "insufficient memory" and that no rows had been inserted into the table. At this point I decided to split the file up into smaller files. The file had 116 million rows, so I created 116 files of 1 million rows each and inserted them into the table one by one, running the following command for each file: LOAD DATA INFILE '/var/lib/mysql-files/{file_name}' INTO TABLE rt_leavetimes FIELDS TERMINATED BY ';';. This worked well, but running the command 116 times by hand was laborious, and if I were to do it again I would create a bash script (or equivalent) to do this programmatically, along the lines of the sketch below.
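
A rough sketch of such a loop, written in Python rather than bash and using pymysql; the split-file naming scheme and connection details are assumptions, not what was actually used:

    # Hypothetical loop over the 116 split files, loading each into the table.
    # File names and connection details are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="user",
                           password="password", database="bus_db")
    try:
        with conn.cursor() as cursor:
            for i in range(116):
                path = f"/var/lib/mysql-files/leavetimes_{i:03d}.txt"
                cursor.execute(
                    f"LOAD DATA INFILE '{path}' "
                    "INTO TABLE rt_leavetimes FIELDS TERMINATED BY ';'"
                )
                conn.commit()
                print(f"loaded {path}")
    finally:
        conn.close()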

Creating derived and supplementary data

Once the three data files we had been provided with were in our database, we realised that a number of pieces of information were missing which we would need to create our application.

The first of these we tackled was the 2018 weather data. We needed this because we believed it would be an important feature in the ML model we were building from the 2018 bus data. We created a CSV file with the weather for each hour of 2018 using the OpenWeatherMap API. This CSV file was then loaded into the database.
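
To give a sense of how such a file could be put together, here is a rough sketch using the requests library. The endpoint, parameters and response fields are assumptions based on OpenWeatherMap's hourly history API and may differ from what we actually used:

    # Sketch: build an hourly 2018 weather CSV for Dublin from OpenWeatherMap.
    # The endpoint, parameters and response structure are assumptions - check the
    # OpenWeatherMap documentation for the plan/API you have access to.
    import csv
    from datetime import datetime, timedelta, timezone

    import requests

    API_KEY = "your_api_key"      # placeholder
    LAT, LON = 53.3498, -6.2603   # Dublin city centre

    start = datetime(2018, 1, 1, tzinfo=timezone.utc)
    end = datetime(2019, 1, 1, tzinfo=timezone.utc)

    with open("weather_2018.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "temp", "rain_1h", "description"])
        day = start
        while day < end:
            # one request per day of hourly observations (assumed endpoint)
            resp = requests.get(
                "https://history.openweathermap.org/data/2.5/history/city",
                params={
                    "lat": LAT, "lon": LON, "type": "hour",
                    "start": int(day.timestamp()),
                    "end": int((day + timedelta(days=1)).timestamp()),
                    "appid": API_KEY,
                },
            )
            resp.raise_for_status()
            for obs in resp.json().get("list", []):
                writer.writerow([
                    obs["dt"],
                    obs["main"]["temp"],
                    obs.get("rain", {}).get("1h", 0),
                    obs["weather"][0]["description"],
                ])
            day += timedelta(days=1)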

The next important piece of data not present in the three files provided was the information for each bus stop. We needed the coordinates and name of each bus stop in order to provide useful information to the user. To get this, we created a CSV file using this API from Smart Dublin. Details of how the CSV file was created can be found in the scripts/Stop_id_csv_maker.py file in this repository.
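
As a rough illustration of the approach (the actual implementation is in scripts/Stop_id_csv_maker.py), something along these lines would fetch the stop details and write them to a CSV; the endpoint and field names shown here are assumptions:

    # Sketch: fetch bus stop details and write them to a CSV.
    # The endpoint and field names are assumptions; see scripts/Stop_id_csv_maker.py
    # for how the file was actually generated.
    import csv

    import requests

    resp = requests.get(
        "https://data.smartdublin.ie/cgi-bin/rtpi/busstopinformation",
        params={"format": "json"},
    )
    resp.raise_for_status()
    stops = resp.json().get("results", [])

    with open("bus_stops.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["stop_id", "name", "latitude", "longitude"])
        for stop in stops:
            writer.writerow([
                stop.get("stopid"),
                stop.get("fullname"),
                stop.get("latitude"),
                stop.get("longitude"),
            ])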

bus stops on each route... pending