Due date: October 25, 2024 (11:59pm)
This homework will be released in three parts.
- The first part (released Monday, Oct 14) reviews some of the basics of data processing in the context of Pandas (input, cleaning, validation, and manipulation of DataFrames).
- The second part (released Wednesday, Oct 16) is about performance measuring & comparison, and will explore several design choices which can affect performance.
- Finally, the third part (released Friday, Oct 18) is your "project proposal". Like the rest of the homeworks in this class, homework 1 comes with a project component. This task will ask you to choose a basic domain and dataset for your project, including manually collecting a small sample dataset of 10 data points.
Clone this repository to your own machine (or open up a Codespace), then open up and complete `part1.py`.
Parts 2 and 3 will be made available throughout the week.
For now, please work on your repo on your own machine or in Codespaces. Further instructions on how to submit will be released with Part 3. Submission will be either via Gradescope or via GitHub Classroom.
If you get stuck, please ask a question on Piazza!
To receive credit for your work, please follow these guidelines.
- Make sure that you `git commit` and `git push` your latest code to your personal repository. This is how you will "submit" your code. Go to `github.com/<your repository link>` online to see if the changes are there; if you see the latest, most up-to-date version, then you are good to go.
- Make sure that `python3 part1.py`, `python3 part2.py`, and `python3 part3.py` run successfully with no errors, and the same for `pytest part1.py`, `pytest part2.py`, and `pytest part3.py`. We cannot give credit to code that doesn't run.
- Each part should produce, when run, a corresponding answers file `part<n>-answers.txt`, i.e. `part1-answers.txt`, `part2-answers.txt`, and `part3-answers.txt`, and corresponding plots in `plots`. Please commit these output files along with your project and ensure that they are regenerated when the code is run.
- Don't rename any functions or methods or change the function signatures unless asked to do so.
- As discussed in the syllabus, a small number of points on each homework (at most 10% of the grade) are reserved for style points. Here are some thoughts to consider: are your variable names chosen appropriately? Have you added comments with `#` or docstrings with `"""` where appropriate? Have you removed any obsolete, unused code blocks, functions, or variables?
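The run and output guidelines above can be checked locally before you push. The following is a minimal sketch, not an official grading script: the helper names `missing_answer_files` and `run_all_parts` are hypothetical, and it assumes that `python -m pytest part<n>.py` is an acceptable stand-in for `pytest part<n>.py` and that the answers files are written to the repository root.

```python
import subprocess
import sys
from pathlib import Path

# The three parts named in the guidelines above.
PARTS = ("part1", "part2", "part3")


def missing_answer_files(repo_dir="."):
    """Return the expected part<n>-answers.txt files that are absent from repo_dir."""
    repo = Path(repo_dir)
    expected = [f"{part}-answers.txt" for part in PARTS]
    return [name for name in expected if not (repo / name).is_file()]


def run_all_parts(repo_dir="."):
    """Run each part as a script and under pytest; return the commands that failed."""
    failed = []
    for part in PARTS:
        script = f"{part}.py"
        commands = (
            [sys.executable, script],                  # like `python3 part1.py`
            [sys.executable, "-m", "pytest", script],  # assumed equivalent of `pytest part1.py`
        )
        for command in commands:
            result = subprocess.run(command, cwd=repo_dir)
            if result.returncode != 0:
                failed.append(" ".join(command))
    return failed
```

Calling `run_all_parts()` and then `missing_answer_files()` from the repository root should return two empty lists once all three parts are complete. Until Parts 2 and 3 are released, failures for `part2.py` and `part3.py` are expected. The sketch does not check the `plots` output or whether your latest commit has been pushed.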
Many thanks to Hassnain (the TA) and the data science course at LUMS (CS 334 taught by Dr. Mobin Javed) for the data and some of the exercises that were used in Part 1 of this homework assignment.