add ability to switch off/on creation of parquet dwh #1074

Open
mozzy11 wants to merge 8 commits into base: master

mozzy11
Collaborator

@mozzy11 mozzy11 commented May 29, 2024

Fixes #1073

Add the ability to switch off creation of a Parquet DWH when syncing between FHIR servers.

Added a createParquetDwh flag to the controller to switch creation of the Parquet DWH on or off.

E2E test

Added e2e tests for syncing from one HAPI FHIR server to another using the pipeline controller, in both FULL and INCREMENTAL modes, while switching creation of the Parquet DWH on and off.

TESTED:
Tested locally by syncing between FHIR servers with the Parquet DWH switched off and on.

Checklist: I completed these to help reviewers :)

  • I have read and will follow the review process.

  • I am familiar with Google Style Guides for the language I have coded in.

    No? Please take some time and review Java and Python style guides.

  • My IDE is configured to follow the Google code styles.

    No? Unsure? -> configure your IDE.

  • I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)

  • I ran mvn clean package right before creating this pull request and added all formatting changes to my commit.

  • All new and existing tests passed.

  • My pull request is based on the latest changes of the master branch.

    No? Unsure? -> execute command git pull --rebase upstream master

@codecov-commenter

codecov-commenter commented May 29, 2024

Codecov Report

Attention: Patch coverage is 37.50000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 51.99%. Comparing base (42c5e30) to head (e296d6c).

| Files | Patch % | Lines |
|---|---|---|
| ...java/com/google/fhir/analytics/DataProperties.java | 40.00% | 1 Missing and 2 partials ⚠️ |
| ...a/com/google/fhir/analytics/ConvertResourceFn.java | 0.00% | 0 Missing and 1 partial ⚠️ |
| ...a/com/google/fhir/analytics/FetchSearchPageFn.java | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1074      +/-   ##
============================================
- Coverage     52.03%   51.99%   -0.04%     
  Complexity      653      653              
============================================
  Files            89       89              
  Lines          5396     5402       +6     
  Branches        708      710       +2     
============================================
+ Hits           2808     2809       +1     
- Misses         2325     2328       +3     
- Partials        263      265       +2     

@mozzy11
Collaborator Author

mozzy11 commented May 30, 2024

cc @bashir2

@mozzy11
Collaborator Author

mozzy11 commented Jul 8, 2024

cc @bashir2

@bashir2
Collaborator

bashir2 commented Jul 9, 2024

cc @bashir2

Sorry @mozzy11 this fell off my radar (perhaps due to DevDays); thanks for the reminder. I'll provide some feedback by tomorrow.

Collaborator

@bashir2 bashir2 left a comment

Thanks @mozzy11 for this change. I had a look and made some comments but in general I feel we need to think deeper about the implications of skipping Parquet file generation; I feel there are still scenarios not covered in your change (beyond what I have commented below) but need to think more about this.

@@ -73,6 +73,9 @@ fhirdata:
# that directory too, such that files created by the pipelines are readable by
# the Thrift Server, e.g., `setfacl -d -m o::rx dwh/`.
dwhRootPrefix: "dwh/controller_DEV_DWH"
#Whether to create a Parquet DWH or not.In case of syncying between a FHIR server to FHIR server ,
Collaborator

nit: Please also break long lines at 80 chars for YAML files (I know we have not followed it everywhere in this file but we should).

Suggested change
- #Whether to create a Parquet DWH or not.In case of syncying between a FHIR server to FHIR server ,
+ # Whether to create a Parquet DWH or not. In case of syncing from a FHIR server to another, if Parquet files are not needed, their generation can be switched off by this flag.

@@ -260,4 +260,10 @@ public interface FhirEtlOptions extends BasePipelineOptions {
String getSourceNDJsonFilePattern();

void setSourceNDJsonFilePattern(String value);

@Description("Flag to switch off/on creation of a parquet DWH")
Collaborator

Suggested change
- @Description("Flag to switch off/on creation of a parquet DWH")
+ @Description("Flag to switch off/on creation of parquet files; can be turned off when syncing from a FHIR server to another.")

@@ -25,6 +25,8 @@ fhirdata:
# fhirServerUrl: "http://hapi-server:8080/fhir"
dbConfig: "config/hapi-postgres-config_local.json"
dwhRootPrefix: "/dwh/controller_DWH"
#Whether to create a Parquet DWH or not
Collaborator

You can probably drop this comment as we have a reference to pipelines/controller/config/application.yaml at the top for all comments.

@@ -0,0 +1,59 @@
#
Collaborator

It seems that most of the content of this directory is a copy of config dir. Can you reuse those config files and only override the values that you need to change, e.g., with command-line arguments?

id: 'Bring down controller and Spark containers for FHIR server to FHIR server sync'
args: [ '-f', './docker/compose-controller-spark-sql-single.yaml', 'down' ,'-v']

# Resetting Sink FHIR server
Collaborator

It seems that these new tests are adding 15+ minutes to the e2e test run-time; I think changes in PR #947 had a similar effect too and we should try to reduce this. How about doing the sync test in one of the scenarios only and see how much it reduces the run-time? Maybe we can have only one scenario where sync is on and Parquet generation is off. Please also make sure that the incremental mode is tested in that scenario.

@@ -200,7 +202,7 @@ public void setup() throws SQLException, ProfileException {
oAuthClientSecret,
fhirContext);
fhirSearchUtil = new FhirSearchUtil(fetchUtil);
- if (!Strings.isNullOrEmpty(parquetFile)) {
+ if (createParquetDwh) {
Collaborator

Do we have a sanity check if createParquetDwh is true but parquetFile is null or empty?
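
A minimal sketch of the kind of check this question is asking about. The class name `ParquetOptionCheck` and the method `validate` are hypothetical and not part of this PR; plain Java is used here instead of the Guava helpers seen in the hunk above so that the snippet stands alone.

```java
/** Hypothetical helper, for illustration only; not code from this PR. */
final class ParquetOptionCheck {
  private ParquetOptionCheck() {}

  /**
   * Fails fast when Parquet creation is requested but no output path is set.
   * When createParquetDwh is false, parquetFile is not inspected at all.
   */
  static void validate(boolean createParquetDwh, String parquetFile) {
    if (createParquetDwh && (parquetFile == null || parquetFile.isEmpty())) {
      throw new IllegalStateException(
          "createParquetDwh is true but the Parquet output path is null or empty");
    }
  }
}
```

Calling such a check before the `if (createParquetDwh)` branch would surface a misconfiguration at setup time rather than when the first file is written.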

@@ -138,6 +143,7 @@ void validateProperties() {
logger.info("Using FHIR-search mode since dbConfig is not set.");
}
Preconditions.checkState(!createHiveResourceTables || !thriftserverHiveConfig.isEmpty());
Preconditions.checkState(!createHiveResourceTables || createParquetDwh);
Collaborator

I think there are more config sanity checks that need to be done, e.g., when we are not generating Parquet files, generation of views should be disabled as well.
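
One way the additional checks could look, as a sketch only. `createParquetDwh` and `createHiveResourceTables` come from the hunk above; the `generateViews` parameter and the class `DwhConfigCheck` are assumptions made for illustration, since the actual view-related property names are not shown in this conversation.

```java
/** Hypothetical validation helper, for illustration only; not code from this PR. */
final class DwhConfigCheck {
  private DwhConfigCheck() {}

  /** Rejects configurations that depend on Parquet output while it is switched off. */
  static void validate(
      boolean createParquetDwh, boolean createHiveResourceTables, boolean generateViews) {
    // Mirrors the check added in this PR: Hive tables are built on top of Parquet files.
    checkState(
        !createHiveResourceTables || createParquetDwh,
        "createHiveResourceTables requires createParquetDwh");
    // Assumed extra rule from the review comment: views also need Parquet output.
    checkState(!generateViews || createParquetDwh, "view generation requires createParquetDwh");
  }

  private static void checkState(boolean condition, String message) {
    if (!condition) {
      throw new IllegalStateException(message);
    }
  }
}
```

Attaching a message to each check makes a rejected configuration easier to diagnose than a bare state failure.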

@@ -213,6 +219,8 @@ PipelineConfig createBatchOptions() {
Instant.now().toString().replace(":", "-").replace("-", "_").replace(".", "_");
options.setOutputParquetPath(dwhRootPrefix + TIMESTAMP_PREFIX + timestampSuffix);

options.setCreateParquetDwh(createParquetDwh);
Collaborator

Have you tested the incremental pipeline when this flag is turned off? In particular, does the mergerPipelines here work fine? I think we need extra logic in PipelineManager to handle these edge cases.

Development

Successfully merging this pull request may close these issues.

Add ability to turn on /off parquet file generation in case of syncying fhir to fhir server
3 participants