Pipeline standards #17
base: main
Thanks for putting this together, Reggie! I like what you've noted already and have made some requests to add some sections.
- Establish clear metrics for a successful pipeline.

## Choose the right tools and technologies
Depending on the data type, volume, and velocity, choose appropriate tools and technologies. For example:
It might be helpful to distinguish tools that are possible when building GFE-based pipelines vs. cloud-based pipelines.
- Data orchestration: Apache Airflow

## Scalability and flexibility
Where possible, design the pipeline to be easily scaled up or down and to adapt to changes in data types and data formats.
I agree, and there needs to be a balance between adaptability/scalability and how long it takes to deliver what's needed. I suggest adding a few example questions to guide people in this consideration. For instance:
- How will implementing a certain scaling capability or data handling flexibility affect the project timeline and code complexity / maintainability?
- What cost constraints are there?
In addition, are there minimum guidelines on flexibility and scalability? For instance, anything about the use of regex or minimizing hard coding?
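
To illustrate what a guideline on regex and hard coding might look like, here is a minimal sketch. It assumes a hypothetical `PipelineConfig` class, a hypothetical file-naming pattern, and a hypothetical `parse_filename` helper; none of these names come from the PIR code base or the PR itself.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings kept in one place instead of scattered literals."""

    # Assumed naming convention for input files, e.g. "report_2023.csv".
    file_pattern: str = r"^report_(?P<year>\d{4})\.(?P<ext>csv|xlsx)$"
    max_rows_per_batch: int = 10_000


def parse_filename(name: str, config: PipelineConfig) -> Optional[dict]:
    """Return the year and extension if the name matches the configured pattern."""
    match = re.match(config.file_pattern, name)
    if match is None:
        return None  # The caller decides whether to skip or flag the file.
    return {"year": int(match.group("year")), "ext": match.group("ext")}
```

With this shape, supporting a new file format means changing `file_pattern` in one place rather than editing every function that inspects a filename.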
## Monitoring and optimizing
Continuously monitor the performance of the pipeline and seek opportunities to optimize data processing times, reduce costs, and improve data quality. Implement monitoring and logging to track the performance and health of the pipeline. Alerts should be set up for failures or significant performance degradations. Logs can include assessments of data quality and any major errors or inconsistencies caught during data quality checks.

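The logging and alerting described above could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `run_step`, the `alert` callback, and the `max_seconds` threshold are hypothetical names, and a real pipeline would likely route alerts through its orchestration tool.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("pipeline")


def run_step(
    name: str,
    step: Callable[[], Any],
    alert: Callable[[str], None],
    max_seconds: float = 60.0,
) -> Any:
    """Run one pipeline step, log its duration, and alert on failure or slowness."""
    start = time.perf_counter()
    try:
        result = step()
    except Exception as exc:
        # Log the traceback, notify, then re-raise so the failure is not hidden.
        logger.exception("step %s failed", name)
        alert(f"{name} failed: {exc}")
        raise
    elapsed = time.perf_counter() - start
    logger.info("step %s finished in %.2fs", name, elapsed)
    if elapsed > max_seconds:
        # Hypothetical threshold standing in for "significant performance degradation".
        alert(f"{name} took {elapsed:.1f}s (limit {max_seconds:.1f}s)")
    return result
```

The `alert` argument is any callable that accepts a message, so the same wrapper can append to a list in tests and post to a chat channel or email in production.
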
## Ensure security and compliance
Let's include a link to the HHS approved software list (noting that the link is only accessible within HHS).
## Scalability and flexibility
Where possible, design the pipeline to be easily scaled up or down and to adapt to changes in data types and data formats.

## Implement data quality checks
In this (and all of the subsequent sections), can you add a sub-section for Examples and start by linking to relevant parts of the PIR code base? For future projects, we can similarly add links -- though some will be to private repos, which is okay.
@@ -0,0 +1,42 @@
# Pipeline Best Practices
Please also add to the README
- How will success be measured?
- Establish clear metrics for a successful pipeline.

## Choose the right tools and technologies
Please add a section on building iteratively, ensuring there are constant demos and syncs with the client. You can adapt it from the lessons learned doc.
Thank you @janejuenyang! @skalaga-arch is spearheading the work here, so I'll let him take the lead on these changes, but I'll review and contribute, especially the items from the lessons learned.
No description provided.