This project aims to build a machine learning pipeline for predicting stock prices using various features extracted from historical data. The pipeline consists of several components, including data ingestion, data visualization, feature engineering, and model building.
In this project, we have integrated MLOps practices to enhance the efficiency, reproducibility, and collaboration of our machine learning workflow. The primary tools utilized for MLOps in this project are DVC (Data Version Control) and MLflow.
DVC is an open-source version control system designed specifically for managing machine learning projects. It allows data scientists and engineers to track changes in data, models, and experiments, similar to how Git manages code. Here are some key features and benefits of using DVC:
- Data Versioning: DVC enables us to version our datasets, ensuring that we can easily revert to previous versions if needed.
- Pipeline Management: DVC allows us to define and manage complex data pipelines.
- Integration with Git: DVC integrates seamlessly with Git, allowing us to keep our data and code in sync.
dvc init # Initialize DVC in the project
dvc add data/raw_data.csv # Add the raw data file to DVC
git add data/raw_data.csv.dvc data/.gitignore # Track the DVC file in Git
MLflow is a powerful platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment. It provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. Key features of MLflow include:
- Experiment Tracking: MLflow allows us to log parameters, metrics, and artifacts for each experiment run.
- Model Registry: MLflow provides a centralized model registry, enabling us to manage and version our machine learning models.
- Integration with Various Frameworks: MLflow supports various machine learning libraries and frameworks, making it flexible for different use cases.
import mlflow
with mlflow.start_run():
mlflow.log_param("model_type", "RandomForest")
mlflow.log_metric("mse", mean_squared_error(y_true, y_pred))
mlflow.sklearn.log_model(model, "model")
The project is organized into several key scripts, each responsible for a specific aspect of the pipeline:
- Data Ingestion:
data_ingestion.py
- Data Visualization:
data_visualization.py
- Feature Engineering:
feature_engineering.py
- Model Building:
model_building.py
The data_ingestion.py
script is responsible for loading and transforming the
dataset. It includes robust logging to track the data loading process.
import pandas as pd
import logging
logger = logging.getLogger('data_ingestion')
def load_data(data_url: str) -> pd.DataFrame:
df = pd.read_csv(data_url)
logger.debug('Data loaded successfully.')
return df
In data_visualization.py
, we create insightful visualizations to understand
the data better.
import seaborn as sns
import matplotlib.pyplot as plt
def visualize_distribution(df: pd.DataFrame, stock_name: str):
sns.histplot(df[df['Stock_Name'] == stock_name]['Price'], kde=True)
plt.title(f'Distribution of {stock_name} Prices')
plt.show()
The feature_engineering.py
script handles the creation of new features that
enhance model performance.
from sklearn.preprocessing import LabelEncoder
def convert_stock_names_to_ids(df: pd.DataFrame, column: str) -> pd.DataFrame:
le = LabelEncoder()
df[column] = le.fit_transform(df[column])
return df
Finally, in model_building.py
, we build and evaluate our models.
from sklearn.ensemble import RandomForestRegressor
def train_model(X_train, y_train):
model = RandomForestRegressor()
model.fit(X_train, y_train)
return model
By integrating DVC and MLflow into our stock price prediction pipeline, we enhance our ability to manage data and track experiments efficiently. This project not only demonstrates the power of MLOps but also serves as a robust framework for predicting stock prices. Join us in exploring the exciting world of financial predictions!