This is my final project for the Data Input and Manipulation course at Georgia Tech. The main purpose of this course is to scrape data sets from the web (I used Selenium and BeautifulSoup) and to use Python libraries such as numpy and pandas to clean and analyze the data. I also used the plotly library to create visualizations of my cleaned data.
How did the COVID pandemic impact consumer spending in the United States? Are these changes in consumer behavior, if any, consistent across different countries? Did the changes in spending differ by spending category, such as essential versus non-essential goods?
I scraped three websites and downloaded one CSV file. The CSV file, which I downloaded from the US government travel site, contained flight information such as departure and arrival cities, number of passengers, and cost per flight for different years dating back to 1993.
The first source I scraped was the JSON API from the OECD. It provided the final expenditure of households for each year and for different countries.
My second source was a table from the National Center for Biotechnology Information. This table gave a breakdown of food spending categories by demographic group.
My final source was the list of country codes from IBAN. It mainly supported the geographic visualization of my data.
My datasets had a few inconsistencies:
- In my downloaded CSV file, two columns, "Geocoded_City1" and "Geocoded_City2", had NaN values in certain rows even though other rows for the same cities had valid values. To fix this inconsistency, I created a dictionary (geo_dict) with each "citymarketid" as a key and its geolocation as the value. Then, using the apply method and a lambda expression, I went through each row of my DataFrame and, whenever the row's "citymarketid" was in the dictionary, set "Geocoded_City1" and "Geocoded_City2" to the corresponding dictionary value, which filled in the NaN values.
- In my downloaded CSV file, "Geocoded_City1" and "Geocoded_City2" each held the city name and its coordinates, separated by "\n". These coordinates would be useful for data visualization, so I decided to split each of these columns into three separate columns: one for the city name, one for latitude, and one for longitude. I used .str.split("\n", expand=True) to separate the columns and cast the coordinates to floats while keeping the city name as a string.
- My first scraped source returned a JSON dictionary whose values were lists containing a mix of floats, 0s, and nulls. I checked for non-zero floats within the lists and used only those for my pandas DataFrame.
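
As a rough sketch of the three cleaning steps above, the snippet below runs them on toy data. It is not the project's actual code: the helper names, the "citymarketid_1" column name, the "(lat, lon)" layout of the coordinate text, the shape of the JSON dictionary, and all sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fill_geocodes(df, id_col, geo_col):
    """Fill NaN geocodes from other rows that share the same city market id."""
    known = df.dropna(subset=[geo_col])
    # Map each city market id to a geocode seen elsewhere in the data.
    geo_dict = dict(zip(known[id_col], known[geo_col]))
    out = df.copy()
    # Keep the original value when the id is not in the dictionary.
    out[geo_col] = out.apply(
        lambda row: geo_dict.get(row[id_col], row[geo_col]), axis=1
    )
    return out


def split_geocode(df, geo_col, prefix):
    """Split 'name\\n(lat, lon)' text into name, latitude, and longitude columns."""
    out = df.copy()
    parts = out[geo_col].str.split("\n", expand=True)
    out[f"{prefix}_name"] = parts[0]
    # Assumed coordinate layout: "(lat, lon)".
    coords = parts[1].str.strip("()").str.split(",", expand=True)
    out[f"{prefix}_lat"] = coords[0].str.strip().astype(float)
    out[f"{prefix}_lon"] = coords[1].str.strip().astype(float)
    return out


def nonzero_floats(values):
    """Keep only the non-zero floats from a list that also holds 0s and None."""
    return [v for v in values if isinstance(v, float) and v != 0]


# Toy rows standing in for the downloaded CSV (illustrative values only).
flights = pd.DataFrame({
    "citymarketid_1": [1, 1, 2],
    "Geocoded_City1": ["Atlanta, GA\n(33.75, -84.39)", np.nan, np.nan],
})
filled = fill_geocodes(flights, "citymarketid_1", "Geocoded_City1")
cleaned = split_geocode(filled, "Geocoded_City1", "city1")

# Toy dictionary standing in for the parsed JSON response (hypothetical shape).
raw = {"USA": [None, 0, 1.5, 2.5], "CAN": [0, None, 3.0]}
spending = pd.DataFrame(
    [(country, v) for country, vals in raw.items() for v in nonzero_floats(vals)],
    columns=["country", "value"],
)
```

Rows whose city market id never appears with a valid geocode (the third toy row here) stay NaN after the fill, so they would still need separate handling before plotting.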