Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated Data Processing and Analysis #365

Open
sarthakforwet opened this issue Oct 30, 2024 · 1 comment
Open

Automated Data Processing and Analysis #365

sarthakforwet opened this issue Oct 30, 2024 · 1 comment
Assignees

Comments

@sarthakforwet
Copy link

sarthakforwet commented Oct 30, 2024

@DhanushNehru, I would like to add the following script -

A Python Script to perform data analysis and processing for a Regression task given a dataset and the target variable as input.

@sarthakforwet sarthakforwet changed the title @dhaunshNehru Automated Data Processing and Analysis Oct 30, 2024
@TheProgrammer-creator
Copy link

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

1. Load the dataset

def load_data(file_path):
data = pd.read_csv(file_path)
print("Data successfully loaded.")
return data

2. Exploratory Data Analysis

def exploratory_data_analysis(data, target_variable):
print("First 5 rows:\n", data.head())
print("Data information:\n", data.info())
print("Descriptive statistics:\n", data.describe())

# Check correlation
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

# Target variable distribution
sns.histplot(data[target_variable], kde=True)
plt.title(f"Distribution of Target Variable: {target_variable}")
plt.show()

3. Data Preprocessing

def preprocess_data(data, target_variable):
# Separate features and target variable
X = data.drop(columns=[target_variable])
y = data[target_variable]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data preprocessing complete.")
return X_train, X_test, y_train, y_test

4. Train the Model

def train_model(X_train, y_train):
model = LinearRegression()
model.fit(X_train, y_train)
print("Model trained successfully.")
return model

5. Evaluate the Model

def evaluate_model(model, X_test, y_test):
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R^2 Score: {r2}")

# Visualize Predictions
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Values")
plt.ylabel("Predictions")
plt.title("Actual Values vs. Predictions")
plt.show()

Main function to run the entire script

def main(file_path, target_variable):
data = load_data(file_path)
exploratory_data_analysis(data, target_variable)
X_train, X_test, y_train, y_test = preprocess_data(data, target_variable)
model = train_model(X_train, y_train)
evaluate_model(model, X_test, y_test)

Execute script by specifying the file path and target variable

file_path = "path_to_your_dataset.csv"
target_variable = "your_target_variable_name"
main(file_path, target_variable)
Explications:
Load the dataset: Reads data from a CSV file.
Exploratory Data Analysis (EDA): Displays initial data, statistics, and correlation matrix for understanding relationships.
Data Preprocessing: Splits the dataset into training and testing sets and applies standardization.
Train the Model: Trains a linear regression model on the training data.
Evaluate the Model: Calculates Mean Squared Error and R^2 Score and plots predictions vs. actual values for model performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants