This post serves as a technical journal for the development process of the
concluding stretch of Automating Quantifying the Commons, a project initiative
for the 2024 Google Summer of Code program. Please visit **[Part 1][part1]** for more context
if you haven't already done so.

At the point of the midterm evaluation, I successfully completed Phases 1, 2, and 3
(`fetch`, `process`, and `report`) of the Google Custom Search (GCS) data source, with working `README` report generation
for each quarter. My documented goal for the second half of the period was to complete baseline automation
software for these processes across all data sources.


## Development Process
***

### I. Midpoint Reassessment
If you read my previous post, you might have seen that my next steps involved completing the phases for the remaining data sources.
However, I soon realized that the GCS phases, along with the base analysis and visualization code from the Data Discovery Program,
already serve as a standard reference for these tasks. Given that the primary goal of this project is to develop automation software
for these phases, my mentor suggested shifting the focus of the final time period towards programming the Git functions for automation.
This approach, though it would require more time and effort, would ensure that anyone working on the remaining data sources could easily integrate
them using the existing code as a reference.

### II. GitHub Actions Development
We chose GitHub Actions to host our CI/CD workflows. Workflows are written in YAML, and since I had never used YAML before,
I needed to learn and familiarize myself with this new technology. As with learning any new technology, there were challenges,
particularly in developing the Git automation: for example, I would encounter errors during workflow runs with no clear way to debug them.
These challenges are why my mentor emphasized focusing on the Git programming first.

In my previous post, I shared three strategies that helped me familiarize myself with new technology during the first half of the summer. Here, I’m sharing two additional strategies that were particularly useful for GitHub Actions programming:

1. **GitHub Actions Extension for Visual Studio Code:** As I was using VSCode for development, I initially struggled to debug issues during workflow runs. Discovering the GitHub Actions Extension for VSCode was a game-changer. The extension highlights issues directly in the workflow file, making it much easier to diagnose and fix problems. I highly recommend searching for extensions relevant to whatever development task you're working on, as having the right tools can make programming much easier.

2. **Creating Mini-Tasks for Experimentation:** I set up my own GitHub repository with minimal, functional code to experiment with GitHub Actions in a low-risk environment (a sketch of such a sandbox workflow follows this list). This approach facilitated easier debugging and comparison, helping me understand why certain things weren’t working. Although I gained more repository privileges after being accepted for GSoC, I still didn’t have the same access level as my mentor. By using a separate repository, I gained a better understanding of GitHub Actions and was able to interpret error logs more effectively. For instance, I realized that the automation wasn’t working initially due to outdated repository secrets, which I was able to deduce from the logs without having access to the secrets themselves.
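
For illustration, a sandbox workflow in a personal repository can be as small as the sketch below. The file name and steps are hypothetical, not taken from the Quantifying repository; triggering on every push gives fast feedback while learning to read error logs:

```yaml
# .github/workflows/experiment.yml (hypothetical sandbox workflow)
name: Actions Experiment

on:
  push: # run on every push for immediate feedback
  workflow_dispatch: # also allow manual runs from the Actions tab

jobs:
  experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Inspect the event context
        # Dumping the event payload shows what information
        # a workflow actually receives from GitHub
        run: cat "$GITHUB_EVENT_PATH"
```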

After successfully compiling the initial steps, I focused on refining the scripts for optimal performance. I moved the commit functions into a shared module, which reduced the risk of crashes by allowing functions to be called within individual scripts rather than directly in the YAML workflow. Once the workflows ran successfully, I implemented Cron functions to schedule them quarterly.
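
To make this concrete, here is a minimal sketch of what such a scheduled workflow can look like. The file name, script path, Python version, and secret name are hypothetical placeholders rather than the exact ones in the Quantifying repository:

```yaml
# .github/workflows/quarterly-fetch.yml (hypothetical name and path)
name: Quarterly Fetch

on:
  schedule:
    # Cron runs at 00:00 UTC on the 1st of January, April, July, and October
    - cron: "0 0 1 1,4,7,10 *"
  workflow_dispatch: # allow manual runs while testing

jobs:
  fetch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run fetch script
        # The script calls the shared commit function itself, so the
        # YAML stays thin and there is less scope for the workflow to crash
        run: python scripts/1-fetch/gcs_fetch.py
        env:
          GCS_DEVELOPER_KEY: ${{ secrets.GCS_DEVELOPER_KEY }}
```

Keeping the commit logic in the Python layer rather than the YAML also means it can be tested and reused by every data source's script.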

### III. Engineering a Custom Error Handling and Exception System
A key innovation in this project was the creation of a custom `QuantifyingException` class tailored specifically for the unique needs of the data pipeline.
While testing this system across all phases, I made sure to purposely include "errors" in the data and scripts to confirm that each phase caught, logged, and reported them as intended.
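
As a rough sketch of the idea (the class body, module path, and call site here are illustrative assumptions, not the exact implementation), the custom exception carries pipeline-specific context, such as an exit code, that generic exceptions would lose:

```python
# shared/exceptions.py (illustrative module path)
import logging
import sys

logger = logging.getLogger(__name__)


class QuantifyingException(Exception):
    """Exception for failures specific to the Quantifying data pipeline."""

    def __init__(self, message, exit_code=1):
        self.exit_code = exit_code
        super().__init__(message)


def run_fetch_phase():
    # A phase raises the custom exception with a meaningful exit code
    raise QuantifyingException("GCS fetch failed: API quota exceeded", exit_code=2)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        run_fetch_phase()
    except QuantifyingException as e:
        # One uniform place to log pipeline errors and exit cleanly
        logger.error(e)
        sys.exit(e.exit_code)
```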

Upon completion of a robust error and exception handling system, I completed all phase outlines of the remainder of the data sources. For fetching data from these sources, I have developed
codebases combining the GCS fetch system and the original Data Discovery API fetching for a complete fetching system. However, it should be noted that I have not actually fetched data from these
APIs using the new codebase, as Timid Robot will undertake an initiative to add GitHub bots for the API keys after the GSoC period. This is a matter of best practice, as it is fundamental
to create dedicated accounts for clear API usage and automated git commits. Therefore, these fetch files may need to be slightly tweaked after that, which will be discussed in **Next Steps**.
However, I have made sure to utilize fake data to ensure that the third phase successfully generates reports within the respective README file for ALL data sources.
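
As an illustrative sketch of that fallback (the function, environment variable, and data values are assumptions, not the actual implementation), a fetch script can return placeholder data whenever real credentials are not yet available, so the downstream `process` and `report` phases still have something to run against:

```python
import os


def fetch_license_counts(api_key=None):
    """Return license counts from the live API, or fake data when no
    API key is available (e.g., before the bot accounts are created)."""
    api_key = api_key or os.environ.get("API_KEY")
    if not api_key:
        # Placeholder data keeps the process/report phases testable
        return {"CC BY 4.0": 1000, "CC BY-SA 4.0": 500, "CC0 1.0": 250}
    # Live fetching is deferred until dedicated bot accounts exist
    raise NotImplementedError("Enable after GitHub bot API keys are set up")
```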

### IV. Finalized Flow of System + Data
In Part 1, I had shared the initial data flow diagram (DFD) for arranging the codebase. By the end of the program, however, the DFD and the overall system had solidified into something different.
Below is the final data flow diagram, which establishes an official framework for future endeavors.

![DFD](Final DFD.png)

## Final Conclusions
***

### I. All Deliverables Completed Over the Course of the Program

Although this 12-week period allowed significant expansion of the Quantifying codebase, there were still time and resource constraints that we had to consider; primarily, the lack
of data we could collect using the given APIs over this time period. However, as mentioned earlier, with strategic implementation I was still able to complete the summer goal of developing baseline
automation software for data gathering, flow, and report generation, ensuring the scripts run on a quarterly basis. The **Next Steps** section will elaborate on how this software will be solidified over
the upcoming quarters and years.

130+ commits, 7,615+ net code additions, and 360+ hours of work later, I present ten pivotal deliverables that I have completed over the summer period:

| Deliverable | Description|
| ------------- | ------------- |


### II. Acknowledgements, Impact, Next Steps
This project would not have been possible without the constant guidance and insights of my mentors: **[Timid Robot Zehta][timid-robot]** (lead), **[Shafiya Heena][shafiya]** (supporting), and **[Sara Lovell][sara]** (supporting).
I appreciate how they created a safe space for working from the very beginning. I never felt hesitant to ask questions
and never felt out of place in the organization, despite my introductory-level skillset at the start. In fact, this openness allowed
me to undertake side projects that facilitated my growth. I truly believe that being able to work in an environment like this
has played a large role in my ability to perform well, and it was the sole reason behind the success of my work this summer.
As for overall impact, it is very evident that Creative Commons is integral to facilitating the sharing and utilization of creative works worldwide. With over 2.5 billion
licenses globally, Creative Commons and its open-source initiatives hold heavy impact, promising to empower researchers, policymakers, and stakeholders with up-to-date insights into the global
usage patterns of public domain and CC-licensed content. Therefore, I'm looking forward to witnessing the direct influence this project holds in paving the way for future advancements
in leveraging open content licenses globally. I am extremely grateful and honored to be able to play such a major role in contributing to this organization, and am excited to see
the future contributions I facilitate alongside other CC open-source developers.

As for next steps, I am opening several post-GSoC issues in the Quantifying repository that can be worked on by any open-source contributor.
These issues cover some of the necessary adjustments that will need to be made as we pass certain time periods and expand the codebase.
If you're interested in getting involved, please visit the **[Issues][issues]** page linked for your convenience.
Your contributions will be invaluable as we continue to enhance and expand this project,
and I’m eager to see the innovative solutions and improvements that will unfold these upcoming years!

## Additional Readings
***

- [Automating Quantifying the Commons: Part 1][part1] | Author: Naisha Sinha | Jul. 2024
- [Data Science Discovery: Quantifying the Commons][quantifying] | Author: Dun-Ming Huang (Brandon Huang) | Dec. 2022

[quantifying]: https://opensource.creativecommons.org/blog/entries/2022-12-07-berkeley-quantifying/
[logging]: https://github.com/creativecommons/quantifying/pull/97
[AWS-whitepaper]: https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/distributed-data-management.html
[meta-whitepaper]: https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/
[timid-robot]: https://opensource.creativecommons.org/blog/authors/TimidRobot/
[documentation]: https://unmarred-gym-686.notion.site/Automating-Quantifying-the-Commons-Documentation-441056ae02364d8a9a51d5e820401db5?pvs=4
[shafiya]: https://opensource.creativecommons.org/blog/authors/shafiya/
[sara]: https://opensource.creativecommons.org/blog/authors/sara/
[part1]: https://opensource.creativecommons.org/blog/entries/2024-07-10-automating-quantifying/
[issues]: https://github.com/creativecommons/quantifying/issues


