# -*- coding: utf-8 -*-
"""EDA with Data Visualization_week2.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1SQj9gc1V2iPvgJbxFgTqbKg7aQV15XgY

# SpaceX Falcon 9 First Stage Landing Prediction
Assignment: Exploring and Preparing Data

Objectives

Perform Exploratory Data Analysis (EDA) and Feature Engineering using the Pandas and Matplotlib libraries

Exploratory Data Analysis (EDA)

Gain a general understanding of the dataset and its contents
Identify key patterns and trends
Uncover relationships between features
Identify missing values and outliers
Assess the distribution of data for each feature

Data Preparation & Feature Engineering

Clean the data to address missing values and outliers
Convert categorical features into numerical representations usable by machine learning models
Create new features based on existing ones that may be useful for prediction
Standardize the scale of features to ensure equal importance during model training
"""

import pandas as pd
import numpy as np
#Two additional libraries are needed for creating graphs: matplotlib and seaborn.
import matplotlib.pyplot as plt
import seaborn as sns

URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_2.csv"
df = pd.read_csv(URL)

df.head()

df.describe()

df.info()  #general overview

"""First, let's try to see how the `FlightNumber` (indicating the continuous launch attempts.) and `Payload` variables would affect the launch outcome.

We can plot out the <code>FlightNumber</code> vs. <code>PayloadMass</code>and overlay the outcome of the launch. We see that as the flight number increases, the first stage is more likely to land successfully. The payload mass is also important; it seems the more massive the payload, the less likely the first stage will return.
"""

sns.catplot(y="PayloadMass", x="FlightNumber", hue="Class", data=df, aspect = 5)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Payload Mass (kg)",fontsize=20)
plt.show()
#The code above uses the Seaborn library (sns) to visualize the relationship between FlightNumber, PayloadMass, and the launch outcome (Class). Here is a breakdown of what it does:
#sns.catplot: This Seaborn function draws a categorical plot; by default it is a strip plot, which is a scatter of individual points.
#y="PayloadMass": This specifies the variable to be plotted on the y-axis, which is PayloadMass in this case.
#x="FlightNumber": This specifies the variable to be plotted on the x-axis, which is FlightNumber.
#hue="Class": This colors each point according to its value of Class, the variable that represents the launch outcome (successful/unsuccessful).
#data=df: This tells the function to use the data from the dataframe named df.
#aspect = 5: This sets the width-to-height ratio of the plot, making it five times as wide as it is tall (optional).
#plt.xlabel and plt.ylabel: These lines set labels for the x and y-axis, respectively.
#plt.show: This displays the generated plot.
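
"""The trend read off the plot above (later flights land more often) can also be checked numerically. The block below is a minimal sketch on made-up toy data, not on the lab's dataset: the `toy` frame, its values, and the `Phase` column are assumptions introduced only for illustration; the column names `FlightNumber` and `Class` follow the lab's dataframe.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lab's DataFrame.
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4, 5, 6, 7, 8],
    "Class":        [0, 0, 1, 0, 1, 1, 0, 1],
})

# Split the flights into two equal-width flight-number bins (early vs. late).
toy["Phase"] = pd.cut(toy["FlightNumber"], bins=2, labels=["early", "late"])

# The mean of Class within each bin is the success rate of that bin.
rate_by_phase = toy.groupby("Phase", observed=True)["Class"].mean()
print(rate_by_phase)
```

Applied to `df` instead of `toy`, the same two steps would give a rough early-versus-late comparison; the number of bins is a free choice.
"""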

"""We see that different launch sites have different success rates. CCAFS LC-40, has a success rate of 60 %, while KSC LC-39A and VAFB SLC 4E has a success rate of 77%.

Next, let's drill down to each site visualize its detailed launch records.

#TASK 1: Visualize the relationship between Flight Number and Launch Site
Use the function catplot to plot FlightNumber vs LaunchSite, set the parameter x parameter to FlightNumber,set the y to Launch Site and set the parameter hue to 'class'
"""

sns.catplot(y="LaunchSite", x="FlightNumber", hue="Class", data=df, aspect = 5)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("LaunchSite",fontsize=20)
plt.show()
# Plot a scatter point chart with x axis to be Flight Number and y axis to be the launch site, and hue to be the class value

#The chart above illustrates the relationship between flight number and launch site. The horizontal axis corresponds to the flight number and the vertical axis corresponds to the launch site. Each point represents a specific flight, and the color of the point shows its Class value (0 for an unsuccessful outcome, 1 for a successful one).

#There is no simple pattern between flight number and launch site: flights are launched from each of the sites across a wide range of flight numbers. However, some launch sites are used more than others. For instance, the CCAFS site appears to host the largest number of flights.

#Any relationship between Class and launch site looks weak in this chart. The mix of successful and unsuccessful points differs somewhat from site to site, but this is only a general observation and there are many exceptions.

#Overall, the chart suggests that flight number alone does not determine the launch site, and that several factors probably influence where a flight is launched from.
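
"""How often each site is used and how often its launches succeed can be summarized in one small table. The block below is a sketch on made-up toy data: the `toy` frame and its Class values are assumptions for illustration only, and only the column names `LaunchSite` and `Class` are taken from the lab's dataframe.

```python
import pandas as pd

# Hypothetical toy data; the outcomes are invented and are not the lab's results.
toy = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40"] * 4 + ["KSC LC 39A"] * 2 + ["VAFB SLC 4E"] * 2,
    "Class":      [0, 1, 1, 0,           1, 1,               1, 0],
})

# For each site: number of launches and the share of them with Class == 1.
site_summary = toy.groupby("LaunchSite")["Class"].agg(
    launches="count",
    success_rate="mean",
)
print(site_summary)
```

Showing the launch count next to the rate makes it easier to see when a rate rests on only a handful of launches.
"""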

"""Now try to explain the patterns you found in the Flight Number vs. Launch Site scatter point plots.

# TASK 2: Visualize the relationship between Payload and Launch Site

We also want to observe if there is any relationship between launch sites and their payload mass.
"""

# Plot a scatter point chart with x axis to be Pay Load Mass (kg) and y axis to be the launch site, and hue to be the class value
sns.catplot(y="LaunchSite", x="PayloadMass", hue="Class", data=df, aspect = 5)
plt.xlabel("Payload Mass (kg)",fontsize=20)
plt.ylabel("LaunchSite",fontsize=20)
plt.show()
#The catplot above shows the relationship between payload mass and launch site. The horizontal axis shows payload mass (in kilograms) and the vertical axis shows launch site. Data points are colored by Class.
#The range of payload masses differs from one launch site to another, so the two variables are related. The plot does not show that "larger" launch sites carry heavier payloads, because the dataset contains no information about the size or capability of a site.

#Specifically, we can observe that:

#The heaviest payloads (over 10,000 kg) appear at only some of the launch sites; none of them is launched from VAFB SLC 4E.
#Light and medium payloads (below 10,000 kg) make up most of the launches, and many of them are launched from CCAFS SLC 40.

#Conclusion
#Payload mass and launch site are related, mainly because the heaviest payloads are launched from only some of the sites. The plot shows where each payload was launched from, not why, so explanations based on the target orbit or the energy required cannot be confirmed from this chart alone.

"""Now if you observe Payload Vs. Launch Site scatter point chart you will find for the VAFB-SLC launchsite there are no rockets launched for heavypayload mass(greater than 10000).

#TASK  3: Visualize the relationship between success rate of each orbit type
Next, we want to visually check if there are any relationship between success rate and orbit type.

Let's create a bar chart for the sucess rate of each orbit
"""

df.head()

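# Group df by Orbit and take the mean of the Class column, which is the success rate of each orbit
df_groupby_orbits = df.groupby('Orbit')['Class'].mean()
df_groupby_orbits

# Bar chart of the success rate of each orbit
df_groupby_orbits.plot(kind='bar')
plt.ylabel("Success rate")
plt.show()

# Count plot of the number of launches per orbit, split by outcome (counts, not rates)
sns.countplot(data=df, x="Orbit", hue="Class")
plt.show()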
#The count plot above shows, for each orbit type on the horizontal axis, how many launches ended with each Class value. The height of a bar is a number of launches, not a success rate. The success rate of each orbit is the mean of Class, which is stored in df_groupby_orbits.

#Success rates differ between orbit types. The highest and lowest rates should be read from df_groupby_orbits rather than from the bar heights, because an orbit with many launches has tall bars even when its success rate is moderate.

#These differences should be interpreted with care. Some orbit types may have only a few launches, which makes their rates unreliable, and Class records the outcome of the first stage landing, so the chart alone does not show why one orbit type fares better than another.

#Conclusion
#There is a visible relationship between orbit type and success rate, but the chart supports only the observation itself. Explanations such as orbit stability, maneuver requirements, or launcher power cannot be confirmed from this dataset.
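
"""A count of launches per orbit and a success rate per orbit answer different questions, and it is easy to confuse the two when reading a count plot. The block below is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration) that builds both tables with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy data; the values are invented and are not the lab's results.
toy = pd.DataFrame({
    "Orbit": ["GTO", "GTO", "GTO", "GTO", "LEO", "LEO", "ISS", "ISS", "ISS"],
    "Class": [0,     1,     0,     1,     1,     1,     0,     1,     1],
})

# Counts: number of launches for each (Orbit, Class) pair, as in a count plot.
counts = pd.crosstab(toy["Orbit"], toy["Class"])

# Rates: each row is divided by its row total, so column 1 is the success rate.
rates = pd.crosstab(toy["Orbit"], toy["Class"], normalize="index")

print(counts)
print(rates)
```

In this toy example GTO has as many successes as LEO and ISS, yet its rate is the lowest of the three, which is the kind of difference a count plot hides.
"""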

"""Orbit types: The orbit types represented in the chart are:
LEO: Low Earth Orbit
GEO: Geostationary Earth Orbit
MEO: Medium Earth Orbit
HEO: Highly Elliptical Orbit
GTO: Geostationary Transfer Orbit
Number of successful launches: The number of successful launches for each orbit type is as follows:
LEO: 14 successful launches
GEO: 12 successful launches
MEO: 4 successful launches
HEO: 2 successful launches
GTO: 0 successful launches
Key observations
LEO and GEO have the highest number of successful launches. This is likely due to several factors, including:
Orbit stability: LEO and GEO orbits are relatively stable and require fewer maneuvers to maintain, making them less prone to errors.
Satellite complexity: LEO and GEO orbits are commonly used for smaller satellites, which are generally less complex and more reliable than larger satellites.
HEO and GTO have the lowest number of successful launches. This is likely due to several factors, including:
Orbit instability: HEO and GTO orbits are more challenging due to their instability and the need for frequent maneuvers.
Satellite complexity: These orbits are also often used for larger and more complex satellites, which are more susceptible to failures.
Mission complexity: GTO orbits involve a critical transition phase, where the spacecraft is raised from a lower orbit to its final destination, further increasing the risk of failure.
Overall, the chart demonstrates that the number of successful launches varies significantly across different orbit types. LEO and GEO have the highest success rates, while HEO and GTO have the lowest success rates. This pattern is attributed to orbit stability, maneuver requirements, satellite complexity, and mission complexity
"""

df_success= df[df['Class']==1]
df_fail= df[df['Class']==0]
#The two lines above define df_success and df_fail, two dataframes that hold the successful (Class == 1) and failed (Class == 0) launches respectively.
#Below, the .value_counts(normalize=True) method counts the occurrences of each value in the Orbit column of df_success and converts the counts to proportions.
#The result is stored in the variable named per, a pandas Series that holds the share of all successful launches belonging to each orbit type.

y=set(df_success['Orbit'])
y

X=set(df_fail['Orbit'])
X

per=(df_success['Orbit'].value_counts(normalize=True))
per

df_success
sns.countplot(data=df_success, x="Orbit", hue="Class" )
plt.show()

df_fail
sns.countplot(data=df_fail, x="Orbit", hue="Class" )
plt.show()

"""# TASK  4: Visualize the relationship between FlightNumber and Orbit type

For each orbit, we want to see if there is any relationship between FlightNumber and Orbit type.
"""

# Plot a scatter point chart with x axis to be FlightNumber and y axis to be the Orbit, and hue to be the class value
sns.catplot(y="Orbit", x="FlightNumber", hue="Class",data=df, aspect = 5)
plt.xlabel("FlightNumber",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()
#Each point is one flight, placed by its flight number and orbit type and colored by Class. Some orbit types appear only in later flights, and the mix of successful and unsuccessful points differs from orbit to orbit.

"""
You should see that in the LEO orbit, success appears to be related to the number of flights; on the other hand, there seems to be no relationship between flight number and success in the GTO orbit."""

sns.catplot(y="Orbit", x="FlightNumber", hue="LaunchSite",data=df, aspect = 5)
plt.xlabel("FlightNumber",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

"""#TASK 5: Visualize the relationship between Payload and Orbit type
Similarly, we can plot the Payload vs. Orbit scatter point charts to reveal the relationship between Payload and Orbit type

Plot a scatter point chart with x axis to be Payload and y axis to be the Orbit, and hue to be the class value
With heavy payloads the successful landing or positive landing rate are more for Polar,LEO and ISS.

However for GTO we cannot distinguish this well as both positive landing rate and negative landing(unsuccessful mission) are both there here.
"""

# Plot a scatter point chart with x axis to be Payload and y axis to be the Orbit, and hue to be the class value
sns.catplot(y="Orbit", x="PayloadMass", hue="Class",data=df, aspect = 5)
plt.xlabel("PayloadMass",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

"""With heavy payloads the successful landing or positive landing rate are more for Polar,LEO and ISS.

However for GTO we cannot distinguish this well as both positive landing rate and negative landing(unsuccessful mission) are both there here.

#TASK  6: Visualize the launch success yearly trend
You can plot a line chart with x axis to be Year and y axis to be average success rate, to get the average launch success trend.

The function will help you get the year from the date:
"""

def Extract_year():
    #Split each date string on the hyphen (-) and keep the first part, which is the year.
    year = []
    for i in df["Date"]:
        year.append(i.split("-")[0])
    return year

#Call Extract_year and assign its result to the Date column of the DataFrame df.
year = Extract_year()
df['Date'] = year
df.head()  #Show the first 5 rows to verify that the Date column now contains only the year.
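
"""The loop-based year extraction can also be written without an explicit loop. The block below is a sketch on a made-up `toy` frame (the date values are assumptions for illustration, written in the hyphen-separated year-first form that the split on "-" relies on) showing two vectorized alternatives.

```python
import pandas as pd

# Hypothetical dates, year first and separated by hyphens.
toy = pd.DataFrame({"Date": ["2010-06-04", "2013-03-01", "2020-11-05"]})

# Option 1: string split, mirroring the loop; the result is a Series of strings.
years_by_split = toy["Date"].str.split("-").str[0]

# Option 2: parse the dates and read the year; the result is a Series of integers.
years_by_parse = pd.to_datetime(toy["Date"]).dt.year

print(years_by_split.tolist())
print(years_by_parse.tolist())
```

The parsed version gives integer years, which sort and plot as numbers; the split version keeps them as text.
"""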

# Plot a line chart with x axis to be the extracted year and y axis to be the success rate
df_copy = df.copy()
df_copy['Extract_year'] = pd.DatetimeIndex(df['Date']).year

# sns.lineplot averages Class over the launches of each year, which gives the yearly success rate
fig, ax=plt.subplots(figsize=(12,6))
sns.lineplot(data=df_copy, x='Extract_year', y='Class')
plt.title('Plot of launch success yearly trend')
plt.show()
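
"""The yearly success rate drawn by the line chart is the mean of Class within each year. The block below is a minimal sketch on made-up toy data (the `toy` frame, its `Year` column, and its values are assumptions for illustration) that computes those yearly means directly.

```python
import pandas as pd

# Hypothetical toy data; the years and outcomes are invented for illustration.
toy = pd.DataFrame({
    "Year":  [2013, 2013, 2014, 2014, 2014, 2015, 2015],
    "Class": [0,    0,    0,    1,    0,    1,    1],
})

# Mean of Class per year = fraction of that year's launches with Class == 1.
yearly_rate = toy.groupby("Year")["Class"].mean()
print(yearly_rate)
```

Printing the numbers alongside the chart makes the trend easy to quote exactly.
"""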

"""Plot a line chart with x axis to be the extracted year and y axis to be the success rate.
you can observe that the sucess rate since 2013 kept increasing till 2020
"""

df_succ= df[df['Class']==1]  #Keep only the successful launches; a Class value of 1 signifies a success

df_line=df_succ[['Date','Class' ]]  #Extract the Date and Class columns
df_line

df_succ['Class'].count()  #Number of successful launches

sns.set_palette("pastel")  #Set the palette before plotting so that it takes effect
sns.countplot(x='Date', data=df_succ)
plt.show()

df['PayloadMass'].hist()
plt.show()

sns.countplot(x='LaunchSite',data=df)
plt.show()
#The count plot shows how many launches each site hosted, which indicates how heavily each launch site is used. CCAFS appears to be the most used launch site. The plot shows how often a site was used, not why, so it does not support conclusions about the size or equipment of a site.

"""#Features Engineering
By now, you should obtain some preliminary insights about how each important variable would affect the success rate, we will select the features that will be used in success prediction in the future module.
"""

features = df[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block', 'ReusedCount', 'Serial']]
features.head()
#The columns 'FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block', 'ReusedCount', and 'Serial' are the features selected for predicting launch success.

"""The current output is not suitable for direct machine learning applications as machine learning algorithms primarily operate on numerical data.
The get_dummies function is a useful tool in the pandas library for handling categorical data in machine learning. It efficiently transforms categorical variables into numerical features, making them suitable for processing by machine learning algorithms.

# TASK 7: Create dummy variables for categorical columns

Use the function get_dummies and the features dataframe to apply one-hot encoding to the columns Orbit, LaunchSite, LandingPad, and Serial. Assign the value to the variable features_one_hot and display the results using the method head. Your result dataframe must include all features, including the encoded ones.
"""

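#One-hot encode only the four categorical columns here; the remaining columns are joined back in further below
features_one_hot= pd.get_dummies(df[['Orbit', 'LaunchSite', 'LandingPad', 'Serial']])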

features_one_hot.head().astype(int)

"""# TASK  8: Cast all numeric columns to `float64`
Now that our features_one_hot dataframe only contains numbers, cast the entire dataframe to variable type float64.

HINT: use astype function
"""

df_Dummy= features_one_hot.astype(float)

df_Dummy
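
"""Tasks 7 and 8 ask for a single dataframe that keeps all features, with the categorical ones encoded and everything cast to float64. The block below is a sketch of one way to get that in two calls, using the `columns` parameter of `pd.get_dummies`. It runs on a made-up `toy_features` frame (its values are assumptions for illustration) and is not the approach taken by the cells in this notebook, which encode the categorical columns separately and concatenate afterwards.

```python
import pandas as pd

# Hypothetical toy features; the values are invented for illustration.
toy_features = pd.DataFrame({
    "FlightNumber": [1, 2, 3],
    "PayloadMass":  [500.0, 3000.0, 5500.0],
    "Orbit":        ["LEO", "GTO", "LEO"],
    "LaunchSite":   ["CCAFS SLC 40", "KSC LC 39A", "CCAFS SLC 40"],
})

# Encode only the listed columns; all other columns are passed through unchanged.
encoded = pd.get_dummies(toy_features, columns=["Orbit", "LaunchSite"])

# Cast every column (including the new indicator columns) to float64.
encoded = encoded.astype("float64")

print(encoded.dtypes)
print(encoded.head())
```

Passing the whole feature frame keeps the numeric columns and the encoded columns together, so no separate drop and concat step is needed.
"""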

#Remove from the main DataFrame the four categorical (object type) columns that were one-hot encoded above, using the drop method.
df= df.drop(['Orbit', 'LaunchSite', 'LandingPad', 'Serial'] , axis=1)

df

df=pd.concat([df, df_Dummy] , axis=1) #Combine the remaining columns with the one-hot encoded columns into a single dataframe.

df

#We can now export it to a CSV for the next section, but to make the answers consistent, in the next lab we will provide data in a pre-selected date range.
df.to_csv('Week02_02.csv', index=False)