Pandas is an open-source library built for working with relational or labeled data easily and intuitively. It provides a wide range of functions for cleaning and processing data. Pandas is built on top of NumPy, which makes it interoperable with many other libraries in the NumPy ecosystem, including but not limited to Matplotlib, xarray and sparse. Pandas is especially useful in machine learning, and its applications are not limited to numerical data: it is also very useful for manipulating textual data, which comes in handy when working in NLP. Its key features include:
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Easy handling and manipulation of textual data.
To install pandas, open a terminal and run the following command:
pip3 install pandas
If pip is not already installed, you may see an error message. If that happens, install pip first; on Debian or Ubuntu, use the following command in the terminal:
sudo apt-get install python3-pip
In this tutorial we will start from the basic operations that pandas allows and work our way up to some manipulation and inference techniques that are essential for working in NLP.
import pandas as pd
We will be using a CSV file named Tweets.csv to show the different applications of pandas in NLP. To load the CSV file:
data = pd.read_csv("Tweets.csv")
This loads the CSV file into the variable 'data' as a DataFrame.
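For readers without the file at hand, the loading step can be tried on a small in-memory stand-in. This is only a sketch: the rows are made up, `sample` and `csv_text` are hypothetical names, and only four of the columns used later in this tutorial are included.

```python
import io

import pandas as pd

# Hypothetical stand-in for Tweets.csv: made-up rows, subset of columns.
csv_text = """airline_sentiment,negativereason,airline,text
positive,,Virgin America,Great flight!
negative,Late Flight,United,Delayed again
neutral,,Delta,What time is boarding?
negative,Customer Service Issue,US Airways,Nobody answers the phone
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV text are read as NaN.
print(sample)
```

Using a separate name such as `sample` keeps the real `data` DataFrame untouched while experimenting.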
To use the functions at our disposal effectively, we need to understand the data that we are going to manipulate. Pandas provides an array of functions and attributes for this purpose.
To simply view the dataframe:
The output looks something like this:
This provides a bird's-eye view of the whole dataset and can be used at any stage of data cleaning or manipulation to get a sense of how your changes are affecting the whole DataFrame.
To get the number of rows and columns in the DataFrame, we use the shape attribute:
The output shows that the DataFrame has 14640 rows and 15 columns:
The info() method is used to get a summary of the whole DataFrame in an easy-to-read manner:
The output shows the summary of the whole DataFrame:
The dtypes attribute is used to find the data type of the values in each column:
The output displays all the datatypes in the DataFrame:
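The inspection steps above can be tried together on a small stand-in DataFrame. This is a minimal sketch: the three rows are made up, and `sample` is a hypothetical name chosen so that the real `data` is left alone.

```python
import pandas as pd

# Hypothetical stand-in data; the real DataFrame has 14640 rows and 15 columns.
sample = pd.DataFrame({
    "airline": ["Virgin America", "United", "Delta"],
    "airline_sentiment": ["positive", "negative", "neutral"],
    "retweet_count": [0, 2, 1],
})

print(sample.head())   # first rows (at most five) of the DataFrame
print(sample.shape)    # (rows, columns) tuple; an attribute, so no parentheses
sample.info()          # prints column names, non-null counts and dtypes
print(sample.dtypes)   # data type of each column
```

Note that `info()` prints its summary directly, while `shape` and `dtypes` are attributes whose values have to be printed or displayed.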
To use the available data effectively, we often need to extract the parts of it that satisfy certain constraints. Pandas provides easy access to functions that enable us to do this.
To extract a specific column, we write:
negr = data['negativereason']
The output shows the values in the 'negativereason' column:
To extract the rows in which a column satisfies a certain constraint, we write:
posrec = data[data['airline_sentiment'] == 'positive']
The output contains all the rows in which the column 'airline_sentiment' has the 'positive' entry:
Note: Pandas allows the constraint to be anything from a numerical comparison to a string match, which makes many NLP tasks extremely easy to do with pandas.
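As a self-contained illustration of column selection and row filtering, the sketch below uses a made-up four-row DataFrame. The names `sample`, `reasons`, `positive_rows` and `neg_retweeted` are hypothetical, and the combined string-plus-number condition is an extra example, not taken from the tutorial's own code.

```python
import pandas as pd

# Hypothetical stand-in rows using column names from the tutorial's dataset.
sample = pd.DataFrame({
    "airline_sentiment": ["positive", "negative", "negative", "neutral"],
    "negativereason": [None, "Late Flight", "Bad Flight", None],
    "retweet_count": [0, 3, 1, 0],
})

# Selecting a single column returns a Series.
reasons = sample["negativereason"]

# A boolean mask on a string column keeps only the matching rows.
positive_rows = sample[sample["airline_sentiment"] == "positive"]

# String and numeric conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses.
neg_retweeted = sample[
    (sample["airline_sentiment"] == "negative") & (sample["retweet_count"] > 1)
]
print(neg_retweeted)
```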
After the data of interest has been extracted, some cleaning is required before it can be fed into language models or other algorithms. To achieve this, we will be using pandas again.
The describe() method is used to gather summary information about a pandas Series or DataFrame:
new = data['airline_sentiment']
The output is a description of the 'new' series:
More often than not, we are faced with data whose type is incompatible with our purposes, so we have to change the datatype to fit the requirements. For this we use the astype() method:
The output is the contents of 'new' converted to the string datatype:
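The two steps above, describing a Series and converting its datatype, can be sketched on made-up values as follows. The Series contents and the names `sentiment`, `summary`, `counts` and `as_text` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical text column.
sentiment = pd.Series(
    ["positive", "negative", "negative", "neutral"], name="airline_sentiment"
)

# For non-numeric data, describe() reports count, unique, top and freq.
summary = sentiment.describe()
print(summary)

# Hypothetical numeric column converted to strings with astype().
counts = pd.Series([0, 3, 1], name="retweet_count")
as_text = counts.astype(str)
print(as_text.tolist())  # ['0', '3', '1']
```

For a numeric Series, `describe()` reports statistics such as the mean, standard deviation and quartiles instead.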
NaN stands for 'Not a Number'; it is a special floating-point value that pandas uses to mark missing data. Many datasets contain NaN values, so it is essential to deal with them before we proceed any further.
To find out if there are any NaN values in the series we use the following method:
The output is False, which means that this particular column does not have any NaN values:
When we employ the same method on another column:
The output is :
To find out the number of NaN values in the series we use the following method:
The output is the number of NaN values in the series:
To replace the NaN values in a series with the input of our choice we use the fillna method:
The output is the modified series where the NaN values have been replaced with the mean of the whole column:
When we see the .isnull output after the operation:
The output is:
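The NaN checks and the mean-based replacement described above can be reproduced on a small numeric Series. This is a sketch under assumptions: the values are made up, and the names `confidence` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing values.
confidence = pd.Series([1.0, np.nan, 0.5, np.nan], name="negativereason_confidence")

print(confidence.isnull().any())  # True: at least one value is NaN
print(confidence.isnull().sum())  # 2: number of NaN values

# mean() skips NaN by default, so the mean here is (1.0 + 0.5) / 2 = 0.75.
filled = confidence.fillna(confidence.mean())

print(filled.isnull().any())      # False: no NaN values remain
```

`fillna()` returns a new Series, so the result has to be assigned to a variable to be kept.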
One of the central goals of NLP is to extract useful information from a given set of words. To achieve this to some extent, we will use pandas to break down the textual data and find the patterns hidden in it.
Histograms are among the most useful tools for exploring NLP data, as they give a picture of what the data is telling us. To get started, we will look at the general public opinion of all the airlines:
pos = data[data['airline_sentiment'] == 'positive'].shape[0]
neg = data[data['airline_sentiment'] == 'negative'].shape[0]
neu = data[data['airline_sentiment'] == 'neutral'].shape[0]
This stores the number of rows for each value in the 'airline_sentiment' column. Next we will import matplotlib to plot the histogram:
import matplotlib.pyplot as plt
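Now we will plot the values using matplotlib:
# The x positions 10, 15 and 20 are illustrative; the third argument is the bar width.
plt.bar(10, pos, 3, label="Positive")
plt.bar(15, neg, 3, label="Negative")
plt.bar(20, neu, 3, label="Neutral")
plt.legend()
plt.ylabel('Number of examples')
plt.title('Proportion of examples')
plt.show()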
The output is the following histogram, which shows that an overwhelming number of people find the airlines unsatisfactory:
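The three counts computed above can also be obtained in a single call with `value_counts()`. The sketch below assumes a made-up five-row column; `sample`, `counts` and `shares` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for the 'airline_sentiment' column.
sample = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "negative", "neutral", "negative"],
})

# value_counts() returns one count per distinct value, sorted by frequency.
counts = sample["airline_sentiment"].value_counts()
pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]

# normalize=True returns proportions instead of raw counts.
shares = sample["airline_sentiment"].value_counts(normalize=True)
print(counts)
print(shares)
```

This avoids filtering the DataFrame once per category and also works when the set of categories is not known in advance.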
To plot histograms, we do not need to call matplotlib's plotting functions directly, as pandas provides built-in plotting methods (which use matplotlib under the hood). To find the biggest reason why people find airline services unsatisfactory, we will plot a histogram of the different reasons and their frequencies, this time using pandas:
fig = plt.figure(figsize = (20,20))
ax = fig.gca()
negr.hist(ax = ax)
The code above produces a histogram on a 20x20-inch figure:
As we can see in the histogram, 'customer service issue' is the biggest reason behind the dissatisfaction.
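For a text column such as this one, an alternative is to count the categories first and draw the counts as a bar chart. This is a sketch of that swapped-in approach, not the method used above; the values are made up, and `reasons` and `reason_counts` are hypothetical names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the 'negativereason' column, including a missing value.
reasons = pd.Series(
    ["Late Flight", "Customer Service Issue", None, "Customer Service Issue", "Bad Flight"],
    name="negativereason",
)

# value_counts() drops missing values by default and sorts by frequency.
reason_counts = reasons.value_counts()

fig, ax = plt.subplots(figsize=(8, 4))
reason_counts.plot(kind="bar", ax=ax)  # one bar per distinct reason
ax.set_ylabel("Number of tweets")
ax.set_title("Reasons for negative sentiment")
fig.tight_layout()
```

In an interactive session, remove the `matplotlib.use("Agg")` line and call `plt.show()` to display the chart.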
To further break the data down, we need to find out the level of dissatisfaction with each airline. This allows us to infer where the changes need to be made.
The unique() method is used to determine the unique values present in a column. We will use it to find out which airlines have been polled:
The output is the list of airlines present in the dataset:
Now we will plot the proportion of the result for each airline, starting from Virgin America:
va = data[data['airline'] == 'Virgin America']['airline_sentiment']
The output is the following histogram:
US Airways
Breaking the data down shows that, except for Virgin America, all the other airlines in this dataset are viewed very poorly by the public.
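The per-airline breakdown can also be summarised as a table instead of one plot per airline. The sketch below assumes made-up rows for two airlines and uses `pd.crosstab`, which this tutorial does not otherwise cover; `sample`, `airlines` and `table` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in rows for two airlines.
sample = pd.DataFrame({
    "airline": ["Virgin America", "Virgin America", "United", "United", "United"],
    "airline_sentiment": ["positive", "negative", "negative", "negative", "neutral"],
})

# unique() returns the distinct values in order of first appearance.
airlines = sample["airline"].unique()
print(airlines)

# Share of each sentiment per airline; normalize="index" makes each row sum to 1.
table = pd.crosstab(sample["airline"], sample["airline_sentiment"], normalize="index")
print(table)
```

Each row of `table` corresponds to one airline, so the airlines can be compared side by side.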
This is the end of this tutorial. The methods and functions used here are extremely useful in real-life situations, where dealing with large textual datasets can otherwise be a hassle. Pandas is an extremely powerful tool for handling any dataset; hopefully, after reading this, you can use the skills learned to tackle some of the problems in NLP.