- Analyze the IMD data and focus on frequently flooded areas
- Predict the number of human casualties
- Estimate the distribution of displaced people and use it to propose warehouse locations
Dataset: The India Flood Inventory, a geospatial dataset developed in collaboration with the India Meteorological Department (IMD). It records floods in India, including fatalities, damage, and other relevant parameters. However, a large share of the data is missing and has to be handled before further analysis.
A glimpse of the columns, with their non-null counts, data types, and percentage of missing values.
The following columns are missing most of their data (more than 80%): Location, Latitude, Longitude, Severity, Area affected, Human injured, Human displaced, Animal fatality, and Event source.
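A summary like the one described above can be produced with a few lines of pandas. This is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are illustrative assumptions, not the project's code or data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column non-null count, dtype and percentage of missing values."""
    summary = pd.DataFrame({
        "non_null": df.notna().sum(),
        "dtype": df.dtypes.astype(str),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Most incomplete columns first
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame with made-up values, only to show the shape of the output
toy = pd.DataFrame({
    "State": ["Assam", "Kerala", None, "Bihar", "Assam", "Odisha"],
    "Severity": [np.nan, 2.0, np.nan, np.nan, np.nan, np.nan],
})
report = missing_summary(toy)

# Columns that are missing more than 80% of their values
mostly_missing = report.index[report["pct_missing"] > 80].tolist()
```

Sorting by `pct_missing` puts the columns that are candidates for dropping or imputation at the top of the report.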
- Created Start month and Start year from the start date, and End month and End year from the end date, then dropped the start date and end date columns. In about 3% of the rows the start date was later than the end date, so those rows were dropped.
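The date handling described above might look roughly like this in pandas. The column names (`Start Date`, `End Date` and the derived ones) are assumptions for illustration, and the toy rows are made up.

```python
import pandas as pd

def add_date_parts(df, start_col="Start Date", end_col="End Date"):
    """Derive month/year columns, drop inverted date ranges, drop the raw dates."""
    out = df.copy()
    start = pd.to_datetime(out[start_col], errors="coerce")
    end = pd.to_datetime(out[end_col], errors="coerce")

    out["Start Month"] = start.dt.month
    out["Start Year"] = start.dt.year
    out["End Month"] = end.dt.month
    out["End Year"] = end.dt.year

    # Comparisons with missing dates evaluate to False, so only rows that
    # are provably inverted (start later than end) are removed here.
    valid = ~(start > end)
    return out.loc[valid].drop(columns=[start_col, end_col]).reset_index(drop=True)

# Made-up rows; the second one has its dates the wrong way round
toy = pd.DataFrame({
    "Start Date": ["2005-07-26", "2010-09-10", "2018-08-08"],
    "End Date": ["2005-07-30", "2010-09-01", "2018-08-20"],
})
cleaned = add_date_parts(toy)
```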
- The Main cause column was converted to lowercase and stripped of surrounding whitespace. Inspecting the entries revealed two problems that need to be addressed:
a) Entries such as 'flood' and 'floods', or 'heavy rain' and 'heavy rains', refer to the same thing. Many entries have this problem.
b) Apart from the most frequently occurring values, such as heavy rain and flood, many entries occur only once and are long free-text strings.
Further preprocessing therefore included punctuation removal and word lemmatization. I kept the first 14 unique values and replaced all others with 'other'.
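The cleaning steps above can be sketched with the standard library alone. Two assumptions are made here: a crude plural-stripping helper (`naive_singular`, hypothetical) stands in for a real lemmatizer, and "the first 14 unique values" is read as the 14 most frequent labels after cleaning.

```python
import string
from collections import Counter

def naive_singular(word):
    """Crude stand-in for a lemmatizer: strips a trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize_cause(text):
    """Lowercase, strip whitespace, remove punctuation, singularize each word."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(naive_singular(w) for w in text.split())

def bucket_causes(causes, keep=14):
    """Keep the `keep` most frequent cleaned labels; map the rest to 'other'."""
    cleaned = [normalize_cause(c) for c in causes]
    top = {label for label, _ in Counter(cleaned).most_common(keep)}
    return [c if c in top else "other" for c in cleaned]

# Made-up entries; keep=2 only to make the effect visible on a tiny list
raw = [" Floods", "flood", "Heavy rains.", "heavy rain",
       "breach of embankment after cyclone"]
bucketed = bucket_causes(raw, keep=2)
```

In practice a proper lemmatizer would handle irregular plurals and verb forms that this stand-in misses.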
- We needed the latitude and longitude of each place in order to do the geospatial analysis. Going through the data in these columns, the following observations can be made:
a) Very few rows have values in the latitude and longitude columns, so the district and state data have to be converted into coordinates.
b) We first consider the dataframe of rows where latitude and longitude are null, with the columns district, state, and location.
c) In this data, district is null in 410 rows and state is null in 356 rows. State is always present when district is present, and a further 54 rows contain only state information.
d) The remaining rows, which contain neither coordinates nor district-state information, have data in the Location column.
We geocode the data in the df_geo dataframe, that is, we convert each address into a location on the map.
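The fallback order implied by the observations above (district and state first, then state alone, then the free-text location) can be sketched as follows. The function names, the `", India"` suffix, and the injected `geocode` callable are assumptions for illustration; the fake geocoder and its coordinates are placeholders so the sketch runs without network access.

```python
from typing import Callable, Dict, Optional, Tuple

Coords = Optional[Tuple[float, float]]

def build_query(district, state, location):
    """Prefer 'district, state'; fall back to state alone, then to Location."""
    if district and state:
        return f"{district}, {state}, India"
    if state:
        return f"{state}, India"
    if location:
        return f"{location}, India"
    return None

def geocode_rows(rows, geocode: Callable[[str], Coords]):
    """Geocode (district, state, location) tuples, caching repeated queries."""
    cache: Dict[str, Coords] = {}
    results = []
    for district, state, location in rows:
        query = build_query(district, state, location)
        if query is None:
            results.append(None)
            continue
        if query not in cache:
            cache[query] = geocode(query)
        results.append(cache[query])
    return results

# Fake geocoder with placeholder coordinates, so the sketch runs offline
calls = []
def fake_geocode(query):
    calls.append(query)
    lookup = {"Kamrup, Assam, India": (26.3, 91.6), "Kerala, India": (10.4, 76.4)}
    return lookup.get(query)

rows = [("Kamrup", "Assam", None), (None, "Kerala", None), ("Kamrup", "Assam", "Guwahati")]
coords = geocode_rows(rows, fake_geocode)
```

With geopy, for example, `geocode` could wrap `Nominatim(user_agent=...).geocode` and return `(loc.latitude, loc.longitude)` when a match is found. Caching matters because many rows share the same district and state, and public geocoding services are rate limited.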
After fetching the coordinates, we created some plots to visualize the flood locations on a sample of 500 points.
The sample maps show that the north-eastern and southern parts of India are the most affected by floods, so we can also do a region-specific geospatial analysis. Taking a window
- Severity, Area Affected, Human Fatality, Human Injured, Human Displaced, Animal Fatality, Description of Casualties/injured, Extent of damage:
The EM_DAT event source has no information on the extent of the floods, so we drop the corresponding rows. We also restrict the coordinates to a range that narrows the study down to the areas where most of the floods occur. As the marker cluster map shows, this is the north-eastern part, which we take into consideration.
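A filter of this kind could be written as below. The column names and, in particular, the latitude/longitude bounds are placeholders chosen for illustration; they are not the window used in the project.

```python
import pandas as pd

def filter_region(df, lat_range, lon_range,
                  source_col="Event Source", drop_source="EM_DAT"):
    """Drop rows from one event source and keep rows inside a lat/lon box."""
    keep = (
        (df[source_col] != drop_source)
        & df["Latitude"].between(*lat_range)    # bounds are inclusive
        & df["Longitude"].between(*lon_range)
    )
    return df.loc[keep].reset_index(drop=True)

# Made-up rows: one inside the box, one from EM_DAT, one outside the box
toy = pd.DataFrame({
    "Event Source": ["DFO", "EM_DAT", "IMD"],
    "Latitude": [26.2, 26.5, 10.0],
    "Longitude": [91.7, 92.0, 76.3],
})

# Placeholder bounds, NOT the project's actual window
north_east = filter_region(toy, lat_range=(22.0, 29.5), lon_range=(88.0, 97.5))
```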
The DFO source contains complete information on Severity, Area affected, Human fatality, and Human displaced, whereas the IMD source contains some information on Human fatality, Human injured, Animal fatality, description, and extent.
Predicted Human Fatality with an R² of 0.55 after feature selection, using an artificial neural network.
Next, estimated the number of people displaced. Used KNNImputer to fill in the missing values and achieved an R² score of 0.58 using linear regression.
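The impute-then-regress step can be sketched with scikit-learn as follows. The data here is synthetic, and the assumption that the imputer is applied to the feature matrix (and fitted on the training split only) is ours; the project's actual features and settings are not shown in this write-up.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flood features and the displaced-people target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Knock out roughly 20% of the feature values to mimic missing data
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.25, random_state=0
)

# Fit the imputer on the training split only, then reuse it on the test split
imputer = KNNImputer(n_neighbors=5)
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

model = LinearRegression().fit(X_train_filled, y_train)
r2 = r2_score(y_test, model.predict(X_test_filled))
```

Fitting the imputer only on the training rows keeps information from the test rows out of the model, so the reported R² is not inflated.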
We plotted several heat maps to show how the impact of the floods was spread across the state of Assam and its neighbouring region.
To determine the distribution of displaced people, we assumed that it follows the census population report. The last detailed census took place in 2011 (the 2021 census could not be held because of COVID), which gave the following heatmap of the Assam population:
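Under this assumption, each town receives a share of the displaced people proportional to its census population. A minimal sketch, with made-up town names and numbers:

```python
def allocate_displaced(total_displaced, populations):
    """Split a displaced-people total across towns in proportion to population."""
    total_pop = sum(populations.values())
    return {
        town: total_displaced * pop / total_pop
        for town, pop in populations.items()
    }

# Made-up populations, only to illustrate the proportional split
census = {"Town A": 50_000, "Town B": 30_000, "Town C": 20_000}
shares = allocate_displaced(10_000, census)
```

Here Town A holds half of the population, so it is assigned half of the 10,000 displaced people.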
Each warehouse corresponds to one cluster. The warehouse is placed at the cluster center, which minimizes the within-cluster sum of squares, i.e. the squared L2 distance between the cluster center and the points in its cluster.
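Minimizing the within-cluster sum of squares is the objective of k-means, so the idea can be sketched with scikit-learn's `KMeans`. Weighting each town by its population is an assumption made here to tie in the census data; the coordinates and populations are made up, and treating latitude/longitude as planar coordinates is only an approximation over a small area.

```python
import numpy as np
from sklearn.cluster import KMeans

def propose_warehouses(coords, population, n_warehouses, seed=0):
    """Cluster towns (weighted by population); centers are candidate sites."""
    km = KMeans(n_clusters=n_warehouses, n_init=10, random_state=seed)
    labels = km.fit_predict(coords, sample_weight=population)
    return km.cluster_centers_, labels

# Made-up town coordinates (lat, lon) and populations
towns = np.array([[26.0, 91.0], [26.1, 91.1], [26.9, 92.0], [27.0, 92.1]])
population = np.array([1000, 3000, 2000, 2000])

centers, labels = propose_warehouses(towns, population, n_warehouses=2)

# Sort by latitude so the result does not depend on cluster numbering
centers_sorted = centers[np.argsort(centers[:, 0])]
```

With weights, each center is the population-weighted mean of its towns, so a warehouse is pulled toward the larger settlements in its cluster.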
We then narrowed the analysis down to the Kamrup district of Assam, for two main reasons: (i) the 2011 census has town-level population information for more towns in Kamrup than in other districts; (ii) focusing on one district gives a more detailed solution to the problem, since it would be meaningless for a whole state to have only 5-6 warehouses.
In the context of disaster response planning in Assam, the Voronoi diagram delineates the boundaries of influence for each cluster centroid, effectively partitioning the geographical area into regions that are closest to each respective centroid.
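A point lies in the Voronoi cell of a centroid exactly when that centroid is its nearest one, so cell membership can be checked without constructing the diagram. A minimal sketch with made-up centroids and points:

```python
import math

def nearest_centroid(point, centroids):
    """Index of the closest centroid, i.e. the Voronoi cell containing the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Made-up warehouse centroids and town locations (lat, lon)
centroids = [(26.0, 91.0), (26.5, 92.0)]
towns = [(26.1, 91.2), (26.4, 91.9)]

assignments = [nearest_centroid(t, centroids) for t in towns]
```

Each town is served by the warehouse whose index it is assigned, which is the same partition the Voronoi diagram draws on the map.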