diff --git a/01_materials/labs/.ipynb_checkpoints/01_setup-checkpoint.ipynb b/01_materials/labs/.ipynb_checkpoints/01_setup-checkpoint.ipynb new file mode 100644 index 000000000..6b0966a70 --- /dev/null +++ b/01_materials/labs/.ipynb_checkpoints/01_setup-checkpoint.ipynb @@ -0,0 +1,447 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Production 1: Setting Up A Repo" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "+ Working with code in production is hard. Rarely will we have the chance to work on a greenfield development and define all of its specifications.\n", + "+ Sometimes, we may be offered the option of scrapping a system and starting from scratch. This option should be considered carefully and, most of the time, rejected.\n", + "+ Working with legacy code will be the norm:\n", + "\n", + "    - Legacy code includes our own code.\n", + "    - Legacy code may have been written by colleagues with different approaches, philosophies, and skills.\n", + "    - Legacy code may have been written for old technology.\n", + "\n", + "+ Most of the time, legacy code works and *this* is the reason we are working with it.\n", + "\n", + "## Software Entropy\n", + "\n", + "+ Software entropy is the natural evolution of code towards chaos.\n", + "+ Messy code is a natural consequence of change:\n", + "\n", + "    - Requirements change.\n", + "    - Technology changes.\n", + "    - Business processes change.\n", + "    - People change.\n", + "\n", + "+ Software entropy can be managed. Some techniques include:\n", + "\n", + "    - Apply a code style.\n", + "    - Reduce inconsistency.\n", + "    - Continuous refactoring.\n", + "    - Apply reasonable architectures.\n", + "    - Apply design patterns.\n", + "    - Testing and CI/CD.\n", + "    - Documentation.\n", + "\n", + "+ *Technical debt* is future work that is owed to fix issues with the current codebase.\n", + "+ Technical debt has principal and interest: complexity spreads, and what was a simple *duct tape* solution becomes the source of complexity in downstream consumers.\n", + "+ ML systems are complex: they involve many components, and the interaction among those components determines the behaviour of the system. Adding further complexity through poor software development practices can and should be avoided.\n", + "+ Building ML systems is, most of the time, a team sport. Our tools should be designed for collaboration." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Reference Architecture" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What are we building?\n", + "\n", + "+ [Agrawal and others](https://arxiv.org/abs/1909.00084) propose the reference architecture below.\n", + "\n", + "![Flock reference architecture (Agrawal et al., 2019)](./images/01_flock_ref_arhitecture.png)\n", + "\n", + "+ Throughout the course, we will write the Python code for the different components of this architecture. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Repo File Structure\n", + "\n", + "+ A simple standard file structure can go a long way in reducing entropy.
Some frameworks will impose file structures, but generally the following pattern works well:\n", + "\n", + "```\n", + "./\n", + "./data/\n", + "./env/\n", + "./logs/\n", + "./src/\n", + "./tests/\n", + "...\n", + "./docs/\n", + "./.gitignore\n", + "./readme.md\n", + "./requirements.txt\n", + "...\n", + "```\n", + "\n", + "+ A few notes on the file structure:\n", + "\n", + " - `./data/` is a general data folder which is generally subdivided in a namespace. \n", + " \n", + " * This is an optional component and meant to hold development data or a small feature store.\n", + " * Include in `.gitignore`. \n", + "\n", + " - `./env/` is the Python virtual environment. It is included in `.gitignore`.\n", + " - `./logs/` is the location of log files. It is included in `.gitignore`.\n", + " - `./src/` is the source folder. \n", + "\n", + " * Contains most of the modules and module folders.\n", + " * It is the reference directory to all relative paths.\n", + "\n", + " - `./src/.env` contains environment variable definitions. \n", + "\n", + " * We will add settings to this file and read them throughout our setup.\n", + " * A single convenient location to maintain connection strings to DB, directory locations, and settings.\n", + " \n", + " - `./readme.md` a description of the project and general guidance on where to find information in the repo.\n", + " - `./requirements.txt` Python libraries that are required for this repo.\n", + " - `./.gitignore` includes all files that should be ignored in change control. \n", + "\n", + "+ The pattern lends itself to standard .gitignore files that include `./env/` " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Source Control\n", + "\n", + "\n", + "\n", + "## Git and Github\n", + "\n", + "+ Git is a version control system that lets you manage and keep track of your source code history.\n", + "+ If you have not done so, please get an account on [Github](https://github.com/) and setup SSH authentication:\n", + "\n", + " - Check for [existing SSH keys](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/checking-for-existing-ssh-keys).\n", + " - If needed, create an [SSH Key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent).\n", + " - [Add SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account#adding-a-new-ssh-key-to-your-account) to your Github account.\n", + "\n", + "+ If you need a refresher of Git commands, a good reference is [Pro Git](https://git-scm.com/book/en/v2) (Chacon and Straub, 2014).\n", + "\n", + "## What do we include in a commit?\n", + "\n", + "* Generally, we will use Git to maintain data transformation and movement *code*.\n", + "* It is good practice to not use Git to maintain data inputs or outputs of any kind. \n", + "* Some exceptions include: settings, experimental notebooks used to document design choices. \n", + "* Things to avoid putting in a repo: Personal Identity Information (PII), passwords and keys.\n", + "\n", + "## Version Control System Best Practices\n", + "\n", + "+ Commit early and commit often.\n", + "+ Use meaningful commits:\n", + "\n", + " - The drawback of commiting very frequently is that there will be incomplete commits, errors and stepbacks in the commit messages. 
Commit messages include: \"Committing before switching to another task\", \"Oops\", \"Undoing previous idea\", \"Fire alarm\", etc.\n", + " - In Pull Requests, squash commits and write meaningful messages. \n", + "\n", + "+ Apply a branch strategy.\n", + "+ Submit clean pull requests: verify that latest branch is merged and review conflicts.\n", + "\n", + "## Commit Messages\n", + "\n", + "+ Clear commit messages help document your code and allow you to trace the reaoning behind design decisions. \n", + "+ A few guidelines for writing commit messages:\n", + "\n", + " - Use markdown: Github interprets commit messages as markdown.\n", + " - First line is a subject:\n", + "\n", + " * No period at the end.\n", + " * Use uppercase as appropriate.\n", + " \n", + " - Write in imperative form in the subject line and whenever possible:\n", + "\n", + " * Do: \"Add connection to db\", \"Connect to db\"\n", + " * Do not: \"This commit adds a connection to db\", \"Connection to db added\"\n", + "\n", + " - The body of the message should explain why the change was made and not what was changed.\n", + "\n", + " * Diff will show changes in the code, but not the reasoning behind it.\n", + "\n", + " - Same rules apply for Pull Requests.\n", + "\n", + "+ Many of these points are taken from [How to Write a Git Commit Message](https://cbea.ms/git-commit/) by Chris Beams." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Branching Strategies\n", + "\n", + "+ When working standalone or in a team, you should consider your [branching strategy](https://www.atlassian.com/agile/software-development/branching).\n", + "+ A branching strategy is a way to organize the progression of code in your repo. \n", + "+ In [trunk-based branching strategy](https://www.atlassian.com/continuous-delivery/continuous-integration/trunk-based-development), each developer works based on the *trunk* or *main* branch.\n", + "\n", + "![(Ryaboy, 2021)](./images/01_trunk_based_development.png)\n", + "\n", + "+ After each bug fix, enhancement, or upgrade is complete, the change is integrated to *main*.\n", + "+ Generally, part of a larger Continuous Integration/Continuous Deployment (CI/CD) process." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## VS Code and Git\n", + "\n", + "+ An Interactive Development Environment (IDE) is software to help you code. \n", + "+ IDEs are, ultimately, a matter of personal taste, but there are advantages to using the popular solutions:\n", + "\n", + " - Active development and bug fixes.\n", + " - Plugin and extension ecosystems.\n", + " - Active community for help, support, tutorials, etc.\n", + "\n", + "+ Avoid the *l33t coder* trap: *vim* and *emacs* may work for some, but *nano* and VS Code are great solutions too. \n", + "+ Reference: [Using Git source control in VS Code](https://code.visualstudio.com/docs/sourcecontrol/overview).\n", + "+ A few tips:\n", + "\n", + " - From the source control menu, one can easily stage files, commit, and push/pull to origin.\n", + "\n", + " - Other commands can be accessed via the command pallete (`Ctrl + Shift + P`). For instance, one can select or create a new branch using the option *Git: Checkout to*." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python Virtual Environments\n", + "\n", + "+ There are many reasons to control our development environment, including version numbers for Python and all the libraries that we are using:\n", + "\n", + "    - Reproducibility: we want to be able to reproduce our process in a production environment with as little change as possible. \n", + "    - Backup and archiving: saving our work in a way that can be used in the future, despite Python and libraries evolving.\n", + "    - Collaboration: working with colleagues on different portions of the code requires everyone to have a standard platform to run the codebase.\n", + "\n", + "+ We can achieve the objectives above in many ways, including virtualizing our environments, packaging our code in containers, and using virtual machines, among others.\n", + "+ Most of the time, creating a virtual environment will be part of the initial development setup. This virtual environment will help us *freeze* the Python version and the versions of the libraries we use. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting up the environment\n", + "\n", + "### Using venv\n", + "\n", + "+ The simplest way to add a new virtual environment is to use the command: `python -m venv env`.\n", + "+ This command will create a new virtual environment in the subfolder `./env`.\n", + "+ To *activate* this environment use `./env/Scripts/Activate.ps1` (Windows) or `source ./env/bin/activate` (Linux/macOS).\n", + "+ Optionally, consider the Python extension for VS Code, which can activate the environment automatically for you.\n", + "\n", + "### Conda\n", + "\n", + "+ [Conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html) is a command line tool for package and environment management.\n", + "+ From the terminal, create a virtual environment with: `conda create -n <env_name>`. For example, `conda create -n scale2prod` creates a new environment called `scale2prod`.\n", + "+ Activate the environment with `conda activate <env_name>`. For example, `conda activate scale2prod`.\n", + "+ Other useful commands are:\n", + "\n", + "    - Verify conda installation: `conda info` or `conda -V`\n", + "    - List current environments: `conda info --envs`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set Up a Logger\n", + "\n", + "+ We will use Python's logging module and will provision our standard loggers through our first module.\n", + "+ The module is located in `./05_src/logger.py`.\n", + "+ Our notebooks will need to add `../05_src/` to their path and load environment variables from `../05_src/.env`. Notice that these paths are based on the notebook's location. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Logger highlights\n", + "\n", + "A few highlights about `./05_src/logger.py`:\n", + "\n", + "+ This logger has two handlers: \n", + "\n", + "    - A `FileHandler` that saves logs to files with datetime-stamped names.\n", + "    - A `StreamHandler` that outputs messages to stdout.\n", + "\n", + "+ Each handler sets its own format.
\n", + "+ The log directory and log level are obtained from the environment.\n", + "+ According to the [Advanced Logging Tutorial](https://docs.python.org/2/howto/logging.html#logging-advanced-tutorial): \n", + "\n", + " >\"A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows: \n", + " >\n", + " >`logger = logging.get_logger(__name__)`.\n", + " >\n", + " >This means that logger names track the package/module hierarchy, and it’s intuitively obvious where events are logged just from the logger name.\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the code below to verify that your setup is working." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext dotenv\n", + "%dotenv " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "sys.path.append(\"../../05_src\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from logger import get_logger\n", + "_logs = get_logger(__name__)\n", + "_logs.info(\"Hello world!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Docker to Set Up a Dev DB\n", + "\n", + "+ For our work, we need an environment that resembles the production environment as closely as possible. \n", + "+ One way to achieve this is to use containers and containerized application. \n", + "+ Without going into the details, you can think of a container as software that encapsulates the key features of an operating system, a programming language, and the application code.\n", + "+ Containers are meant to be portable across operating systems: a container will work the same regardless if the underlying Docker application is installed in a Windows, Linux or Mac machine.\n", + "+ Containers are not Virtual Machines.\n", + "+ Docker is a popular application that implement containers.\n", + "\n", + "## What is Docker?\n", + "\n", + "+ From product documentation:\n", + "\n", + "> Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications. By taking advantage of Docker's methodologies for shipping, testing, and deploying code, you can significantly reduce the delay between writing code and running it in production.\n", + "\n", + "## General Procedure\n", + "\n", + "+ To setup a DB Server using containers, we will do the following:\n", + "\n", + "1. Download an image from [Docker Hub](https://hub.docker.com/).\n", + "2. If required, set up a volume to [persist data](https://docs.docker.com/guides/walkthroughs/persist-data/).\n", + "3. Redirect ports as needed.\n", + "4. Start the container.\n", + "\n", + "+ We will follow these steps two times: once to setup a PostgreSQL server and again to start the PgAdmin4 service.\n", + "\n", + "## Starting the Containers\n", + "\n", + "+ To run the process above, first navigate to the `./05_src/db/` folder, then run `docker compose up -d`. \n", + "+ The flag `-d` indicates that we will do a headless run. \n", + "+ Notice that the containers are set to always restart. You can remove the option or turn the containers off manually. 
Be aware that if you leave this option on, the containers will run any time your Docker desktop restarts.\n", + "\n", + "## Stopping the Containers\n", + "\n", + "+ To stop the containers use (from `./05_src/db/`): `docker compose stop`.\n", + "+ Alternatively, you can bring all images down including their volumes with: `docker compose down -v`. \n", + "\n", + " - The `-v` flag removes volumes. \n", + " - It is the best option when you are do not need the data any more because **it will delete the data in your DB **. \n", + "\n", + "\n", + "## Connecting to PgAdmin\n", + "\n", + "+ PgAdmin4 is management software for PostgreSQL Server.\n", + "+ You can open the local implementation by navigating to [http://localhost:5051](http://localhost:5051/). You will find a screen like the one below.\n", + "\n", + "![](./images/01_pgadmin_login.png)\n", + "\n", + "+ Login using the credentials specified in the file `./05_src/db/.env`. Notice there are two sets of credentials, use the ones for PgAdmin4. After authentication, you will see a screen like the one below.\n", + "\n", + "![](./images/01_pgadmin_initial.png)\n", + "\n", + "+ Click on \"Add New Server\":\n", + "\n", + " - In the *General* Tab, under Name enter: localhost. \n", + " - Under the *Connection* Tab, use Host name *db* (this is the name of the service in the docker compose file). \n", + " - Username and password are the ones found in the `./05_src/db/.env` file.\n", + "\n", + "\n", + "\n", + "## Learn More\n", + "\n", + "+ Containers and containerization are topics well beyond the scope of this course. However, we will use containerized applications to help us implement certain patterns. \n", + "+ If you are interested in Docker, a good place to start is the [Official Docker Guides](https://docs.docker.com/get-started/overview/). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# On Jupyter Notebooks\n", + "\n", + "+ Jupyter Notebooks are great for drafting code, fast experimentation, demos, documentation, and some prototypes.\n", + "+ They are not great for production code and not great for experiment tracking." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Note about Copilot\n", + "\n", + "+ AI-assisted coding is a reality. I would like your opinions about the use of this technology.\n", + "+ I will start the course with Copilot on, but if it becomes too distracting, I will be happy to turn it off. \n", + "+ Copilot is a nice tool, but it is not for everyone. If you are starting to code or are trying to level up, I recommend that you leave AI assistants (Copilot, ChatGPT, etc.) for later." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/01_materials/labs/live_production/.ipynb_checkpoints/02_data_engineering_lp-checkpoint.ipynb b/01_materials/labs/live_production/.ipynb_checkpoints/02_data_engineering_lp-checkpoint.ipynb new file mode 100644 index 000000000..a047e4ce1 --- /dev/null +++ b/01_materials/labs/live_production/.ipynb_checkpoints/02_data_engineering_lp-checkpoint.ipynb @@ -0,0 +1,481 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# What are we doing?\n", + "\n", + "## Objectives \n", + "\n", + "\n", + "* Build a data pipeline that downloads price data from the internet, stores it locally, transforms it into return data, and stores the feature set.\n", + " - Getting the data.\n", + " - Schemas and index in dask.\n", + "\n", + "* Explore the parquet format.\n", + " - Reading and writing parquet files.\n", + " - Read datasets that are stored in distributed files.\n", + " - Discuss dask vs pandas as a small example of big vs small data.\n", + " \n", + "* Discuss the use of environment variables for settings.\n", + "* Discuss how to use Jupyter notebooks and source code concurrently. \n", + "* Logging and using a standard logger.\n", + "\n", + "## About the Data\n", + "\n", + "+ We will download the prices for a list of stocks.\n", + "+ The source is Yahoo Finance and we will use the API provided by the library yfinance.\n", + "\n", + "\n", + "## Medallion Architecture\n", + "\n", + "+ The architecture that we are thinking about is called Medallion by [DataBricks](https://www.databricks.com/glossary/medallion-architecture). It is an ELT type of thinking, although our data is well-structured.\n", + "\n", + "![Medallion Architecture (DataBicks)](../images/02_medallion_architecture.png)\n", + "\n", + "+ In our case, we would like to optimize the number of times that we download data from the internet. \n", + "+ Ultimately, we will build a pipeline manager class that will help us control the process of obtaining and transforming our data.\n", + "\n", + "![](../images/02_target_pipeline_manager.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Download Data from Yahoo Finance\n", + "\n", + "Yahoo Finance provides information about public stocks in different markets. The library yfinance gives us access to a fair bit of the data in Yahoo Finance. 
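As a preview of the kind of call the setup and download cells below are meant to hold, here is a minimal sketch (the ticker `MSFT` and the one-year window are only illustrative):

```python
import yfinance as yf

# Price history for a single ticker, returned as a pandas DataFrame
px = yf.Ticker("MSFT").history(period="1y")
px.head()
```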
\n", + "\n", + "These steps are based on the instructions in:\n", + "\n", + "+ [yfinance documentation](https://pypi.org/project/yfinance/)\n", + "+ [Tutorial in geeksforgeeks.org](https://www.geeksforgeeks.org/get-financial-data-from-yahoo-finance-with-python/)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "+ If required, install: `python -m pip install yfinance`.\n", + "+ To download the price history of a stock, first use the following setup:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A few things to notice in the code chunk above:\n", + "\n", + "+ Libraries are ordered from high-level to low-level libraries from the package manager (pip in this case, but could be conda, poetry, etc.)\n", + "+ The command `sys.path.append(\"../05_src/)` will add the `../05_src/` directory to the path in the Notebook's kernel. This way, we can use our modules as part of the notebook.\n", + "+ Local modules are imported at the end. \n", + "+ The function `get_logger()` is called with `__name__` as recommended by the documentation." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, to download the historical price data for a stock, we could use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parametrize the download\n", + "\n", + "+ Generally, we will look to separate every parameter and setting from functions.\n", + "+ If we had a few stocks, we could cycle through them. We need a place to store the list of tickers (a db or file, for example).\n", + "+ Store a csv file with a few stock tickers. The location of the file is a setting, the contents of this file are parameters.\n", + "+ Use **environment variables** to pass parameters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Collecting padas data frames\n", + "\n", + "+ From the [documentation](https://pandas.pydata.org/docs/user_guide/merging.html):\n", + "\n", + "> [`concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas.concat) makes a full copy of the data, and iteratively reusing `concat()` can create unnecessary copies. Collect all DataFrame or Series objects in a list before using `concat()`.\n", + "\n", + "+ We can string operation togethers using dot operations. Enclose the line in parenthesis and add linebreaks for readability." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Reliability\n", + "\n", + "+ Keppelman (2017) defines *reliability* as:\n", + "\n", + " - A system should continue to work correctly. \n", + " - To work correctly means performing the correct function at the desired level of performance, even in the face of adversity such as hardware or software faults, and even human error. \n", + "\n", + "+ *Faults* are things that can go wrong.\n", + "+ Sytems that can cope with (certain types of) faults are called *fault-tolerant* or *resilient*.\n", + "+ A fault is different than a failure. 
\n", + " \n", + " - A *fault* occurs when a component of the system deviates from spec.\n", + " - A *failure* is when the system stops providing the required service to the user.\n", + "\n", + "+ In our simple example, we handle the fault that occurs when one ticker is not found and log it using *warning*.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Storing Data in CSV\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "+ We have some data. How do we store it?\n", + "+ We can compare two options, CSV and Parqruet, by measuring their performance:\n", + "\n", + " - Time to save.\n", + " - Space required." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Save Data to Parquet\n", + "\n", + "### Dask \n", + "\n", + "We can work with with large data sets and parquet files. In fact, recent versions of pandas support pyarrow data types and future versions will require a pyarrow backend. The pyarrow library is an interface between Python and the Appache Arrow project. The [parquet data format](https://parquet.apache.org/) and [Arrow](https://arrow.apache.org/docs/python/parquet.html) are projects of the Apache Foundation.\n", + "\n", + "However, Dask is much more than an interface to Arrow: Dask provides parallel and distributed computing on pandas-like dataframes. It is also relatively easy to use, bridging a gap between pandas and Spark. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Parquet files and Dask Dataframes\n", + "\n", + "+ Parquet files are immutable: once written, they cannot be modified.\n", + "+ Dask DataFrames are a useful implementation to manipulate data stored in parquets.\n", + "+ Parquet and Dask are not the same: parquet is a file format that can be accessed by many applications and programming languages (Python, R, PowerBI, etc.), while Dask is a package in Python to work with large datasets using distributed computation.\n", + "+ **Dask is not for everything** (see [Dask DataFrames Best Practices](https://docs.dask.org/en/stable/dataframe-best-practices.html)). \n", + "\n", + " - Consider cases suchas small to larrge joins, where the small dataframe fits in memory, but the large one does not. \n", + " - If possible, use pandas: reduce, then use pandas.\n", + " - Pandas performance tips apply to Dask.\n", + " - Use the index: it is beneficial to have a well-defined index in Dask DataFrames, as it may speed up searching (filtering) the data. A one-dimensional index is allowed.\n", + " - Avoid (or minimize) full-data shuffling: indexing is an expensive operations. \n", + " - Some joins are more expensive than others. 
\n", + "\n", + " * Not expensive:\n", + "\n", + " - Join a Dask DataFrame with a pandas DataFrame.\n", + " - Join a Dask DataFrame with another Dask DataFrame of a single partition.\n", + " - Join Dask DataFrames along their indexes.\n", + "\n", + " * Expensive:\n", + "\n", + " - Join Dask DataFrames along columns that are not their index.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How do we store prices?\n", + "\n", + "+ We can store our data as a single blob. This can be difficult to maintain, especially because parquet files are immutable.\n", + "+ Strategy: organize data files by ticker and date. Update only latest month.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Why would we want to store data this way?\n", + "\n", + "+ Easier to maintain. We do not update old data, only recent data.\n", + "+ We can also access all files as follows." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load, Transform and Save " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load\n", + "\n", + "+ Parquet files can be read individually or as a collection.\n", + "+ `dd.read_parquet()` can take a list (collection) of files as input.\n", + "+ Use `glob` to get the collection of files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Transform\n", + "\n", + "+ This transformation step will create a *Features* data set. In our case, features will be stock returns (we obtained prices).\n", + "+ Dask dataframes work like pandas dataframes: in particular, we can perform groupby and apply operations.\n", + "+ Notice the use of [an anonymous (lambda) function](https://realpython.com/python-lambda/) in the apply statement." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lazy Exection\n", + "\n", + "What does `dd_rets` contain?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "+ Dask is a lazy execution framework: commands will not execute until they are required. \n", + "+ To trigger an execution in dask use `.compute()`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Save\n", + "\n", + "+ Apply transformations to calculate daily returns\n", + "+ Store the enriched data, the silver dataset, in a new directory.\n", + "+ Should we keep the same namespace? All columns?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A few notes\n", + "\n", + "# Jupyter? \n", + "\n", + "+ We have drafted our code in a Jupyter Notebook. 
\n", + "+ Finalized code should be written in Python modules.\n", + "\n", + "## Object oriented programming?\n", + "\n", + "+ We can use classes to keep parameters and functions together.\n", + "+ We *could* use Object Oriented Programming, but parallelization of data manipulation and modelling tasks benefits from *Functional Programming*.\n", + "+ An Idea: \n", + "\n", + " - [Data Oriented Programming](https://blog.klipse.tech/dop/2022/06/22/principles-of-dop.html).\n", + " - Use the class to bundle together parameters and functions.\n", + " - Use stateless operations and treat all data objects as immutable (we do not modify them, we overwrite them).\n", + " - Take advantage of [`@staticmethod`](https://realpython.com/instance-class-and-static-methods-demystified/).\n", + "\n", + "The code is in `./05_src/`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our original design was:\n", + "\n", + "![](../images/02_target_pipeline_manager.png)\n", + "\n", + "Our resulting interface is:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/02_activities/assignments/.env b/02_activities/assignments/.env new file mode 100644 index 000000000..292e1ee99 --- /dev/null +++ b/02_activities/assignments/.env @@ -0,0 +1,18 @@ +# Logs +LOG_DIR=../../07_logs/ +LOG_LEVEL=INFO + +TICKERS=../../05_src/data/tickers/sp500_wiki.csv +TEMP_DATA=../../05_src/data/temp/ +SRC_DIR =../../05_src/ +PRICE_DATA=../../05_src/data/prices/ +FEATURES_DATA=../../05_src/data/features/stock_features.parquet +FEATURES_DATA_2=../../05_src/data/features/features_assignment.parquet + +DB_URL=postgresql://postgres:HumanAfterAll@localhost:5432/model_db +ARTIFACTS_DIR=../../05_src/data/artifacts/ + +CREDIT_DATA=../../05_src/data/credit/cs-training.csv + +LOGISTIC_REGRESSION_PG=../../05_src/config/logistic_regression_pg.json +SVM_PG=../../05_src/config/svm_pg.json diff --git a/02_activities/assignments/.ipynb_checkpoints/assignment_1-checkpoint.ipynb b/02_activities/assignments/.ipynb_checkpoints/assignment_1-checkpoint.ipynb new file mode 100644 index 000000000..29d0b8687 --- /dev/null +++ b/02_activities/assignments/.ipynb_checkpoints/assignment_1-checkpoint.ipynb @@ -0,0 +1,188 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Working with parquet files\n", + "\n", + "## Objective\n", + "\n", + "+ In this assignment, we will use the data downloaded with the module `data_manager` to create features.\n", + "\n", + "(11 pts total)\n", + "\n", + "## Prerequisites\n", + "\n", + "+ This notebook assumes that price data is available to you in the environment variable `PRICE_DATA`. 
If you have not done so, then execute the notebook `production_2_data_engineering.ipynb` to create this data set.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "+ Load the environment variables using dotenv. (1 pt)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code below.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import dask.dataframe as dd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "+ Load the environment variable `PRICE_DATA`.\n", + "+ Use [glob](https://docs.python.org/3/library/glob.html) to find the path of all parquet files in the directory `PRICE_DATA`.\n", + "\n", + "(1pt)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from glob import glob\n", + "\n", + "# Write your code below.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For each ticker and using Dask, do the following:\n", + "\n", + "+ Add lags for variables Close and Adj_Close.\n", + "+ Add returns based on Adjusted Close:\n", + " \n", + " - `returns`: (Adj Close / Adj Close_lag) - 1\n", + "\n", + "+ Add the following range: \n", + "\n", + " - `hi_lo_range`: this is the day's High minus Low.\n", + "\n", + "+ Assign the result to `dd_feat`.\n", + "\n", + "(4 pt)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code below.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "+ Convert the Dask data frame to a pandas data frame. \n", + "+ Add a rolling average return calculation with a window of 10 days.\n", + "+ *Tip*: Consider using `.rolling(10).mean()`.\n", + "\n", + "(3 pt)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code below.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please comment:\n", + "\n", + "+ Was it necessary to convert to pandas to calculate the moving average return?\n", + "+ Would it have been better to do it in Dask? Why?\n", + "\n", + "(1 pt)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Criteria\n", + "\n", + "The [rubric](./assignment_1_rubric_clean.xlsx) contains the criteria for grading." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Submission Information\n", + "\n", + "🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.\n", + "\n", + "### Submission Parameters:\n", + "* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`\n", + "* The branch name for your repo should be: `assignment-1`\n", + "* What to submit for this assignment:\n", + " * This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.\n", + "* What the pull request link should look like for this assignment: `https://github.com//production/pull/`\n", + " * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. 
This helps the technical facilitator and learning support staff review your submission easily.\n", + "\n", + "Checklist:\n", + "- [ ] Created a branch with the correct naming convention.\n", + "- [ ] Ensured that the repository is public.\n", + "- [ ] Reviewed the PR description guidelines and adhered to them.\n", + "- [ ] Verify that the link is accessible in a private browser window.\n", + "\n", + "If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/02_activities/assignments/assignment_1.ipynb b/02_activities/assignments/assignment_1.ipynb index 29d0b8687..0b80c5581 100644 --- a/02_activities/assignments/assignment_1.ipynb +++ b/02_activities/assignments/assignment_1.ipynb @@ -26,17 +26,18 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code below.\n", - "\n" + "%load_ext dotenv\n", + "%dotenv " ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -55,7 +56,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -63,7 +64,10 @@ "from glob import glob\n", "\n", "# Write your code below.\n", - "\n" + "\n", + "px_dir = os.getenv(\"PRICE_DATA\")\n", + "px_glob = glob(px_dir+\"*/*.parquet/*.parquet\")\n", + "#px_glob" ] }, { @@ -88,12 +92,23 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code below.\n", - "\n" + "dd_px = dd.read_parquet(px_glob)\n", + "import numpy as np\n", + "dd_feat = dd_px.groupby('ticker', group_keys=False\n", + " ).apply(\n", + " lambda x: x.assign(\n", + " Close_lag_1 = x['Close'].shift(1),\n", + " Adj_Close_lag_1 = x['Adj Close'].shift(1))\n", + " ).assign(\n", + " returns = lambda x: x['Adj Close']/x['Adj_Close_lag_1'] - 1,\n", + " hi_lo_range = lambda x: x['High'] - x['Low']\n", + " )\n", + "#dd_feat" ] }, { @@ -109,12 +124,23 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code below.\n", - "\n" + "df = dd_feat.compute()\n", + "#df\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df['10_day_avg']=df['returns'].rolling(10).mean()\n", + "#df" ] }, { @@ -124,7 +150,16 @@ "Please comment:\n", "\n", "+ Was it necessary to convert to pandas to calculate the moving average return?\n", + "> In our case the time taken using pandas was virtually instant which is a good thing. In general DASK can be used to perform this aggregation task as well.\n", "+ Would it have been better to do it in Dask? Why?\n", + "> Since in pandas this aggregation took my machine less than half a tenth of a second to compute the rolling average, it would have been irrelevant to use DASK. 
\n", + "\n", + ">Per ['DASK documentation: \"DataFrame Best Practices\"'](https://docs.dask.org/en/stable/dataframe-best-practices.html) \n", + "\n", + ">\"**Use Pandas** For data that fits into RAM, pandas can often be faster and easier to use than Dask DataFrame. While “Big Data” tools can be exciting, they are almost always worse than normal data tools while those remain appropriate\" \n", + "\n", + ">The machine used to compute this has 16GB of RAM, hence a user with less RAM, below the size of the df computed might benefit in using Dask to perform the 10day window aggregation\n", + "\n", "\n", "(1 pt)" ] @@ -180,7 +215,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.0" + "version": "3.9.19" } }, "nbformat": 4, diff --git a/05_src/.ipynb_checkpoints/logger-checkpoint.py b/05_src/.ipynb_checkpoints/logger-checkpoint.py new file mode 100644 index 000000000..fd3ec157c --- /dev/null +++ b/05_src/.ipynb_checkpoints/logger-checkpoint.py @@ -0,0 +1,32 @@ +import logging +from datetime import datetime + +from dotenv import load_dotenv +import os + +load_dotenv() + +LOG_DIR = os.getenv('LOG_DIR', './logs/') +LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO') + +def get_logger(name, log_dir = LOG_DIR, log_level = LOG_LEVEL): + + ''' + Set up a logger with the given name and log level. + ''' + _logs = logging.getLogger(name) + if not os.path.exists(log_dir): + os.makedirs(log_dir) + + f_handler = logging.FileHandler(os.path.join(log_dir, f'{ datetime.now().strftime("%Y%m%d_%H%M%S") }.log')) + f_format = logging.Formatter('%(asctime)s, %(name)s, %(filename)s, %(lineno)d, %(funcName)s, %(levelname)s, %(message)s') + f_handler.setFormatter(f_format) + _logs.addHandler(f_handler) + + s_handler = logging.StreamHandler() + s_format = logging.Formatter('%(asctime)s, %(filename)s, %(lineno)d, %(levelname)s, %(message)s') + s_handler.setFormatter(s_format) + _logs.addHandler(s_handler) + + _logs.setLevel(log_level) + return _logs \ No newline at end of file