Add ability to switch off/on creation of Parquet DWH #1074
base: master
Changes from all commits:
- 9ffab3a
- ef22c53
- f37bbeb
- fbcf38c
- 564d68a
- f2da0ea
- 66ad736
- e296d6c
```diff
@@ -25,6 +25,8 @@ fhirdata:
   # fhirServerUrl: "http://hapi-server:8080/fhir"
   dbConfig: "config/hapi-postgres-config_local.json"
   dwhRootPrefix: "/dwh/controller_DWH"
+  #Whether to create a Parquet DWH or not
+  createParquetDwh: true
   incrementalSchedule: "0 0 * * * *"
   purgeSchedule: "0 30 * * * *"
   numOfDwhSnapshotsToRetain: 2
```

Reviewer comment (on the new comment line): You can probably drop this comment as we have a reference to …
New file (+59 lines):

Reviewer comment (on the file): It seems that most of the content of this directory is a copy of …

```yaml
#
# Copyright 2020-2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# See `pipelines/controller/config/application.yaml` for full documentation
# of these options.
# This config is meant to be used by `compose-controller-spark-sql.yaml`.
fhirdata:
  # 172.17.0.1 is an example docker network interface ip address;
  # `hapi-server` is another docker example where a container with that name is
  # running on the same docker network.
  # fhirServerUrl: "http://172.17.0.1:8091/fhir"
  # fhirServerUrl: "http://hapi-server:8080/fhir"
  dbConfig: "config/hapi-postgres-config_local.json"
  dwhRootPrefix: "/dwh/controller_DWH"
  #Whether to create a Parquet DWH or not
  createParquetDwh: false
  incrementalSchedule: "0 0 * * * *"
  purgeSchedule: "0 30 * * * *"
  numOfDwhSnapshotsToRetain: 2
  # There is no Questionnaire in our test FHIR server, but it is added to
  # prevent regression of https://github.com/google/fhir-data-pipes/issues/785.
  # TODO: add resource table creation to e2e tests.
  resourceList: "Patient,Encounter,Observation,Questionnaire,Condition,Practitioner,Location,Organization,DiagnosticReport,Immunization,MedicationRequest,PractitionerRole,Procedure"
  numThreads: 1
  autoGenerateFlinkConfiguration: true
  createHiveResourceTables: false
  #thriftserverHiveConfig: "config/thriftserver-hive-config_local.json"
  #hiveResourceViewsDir: "config/views"
  # structureDefinitionsPath: "config/profile-definitions"
  structureDefinitionsPath: "classpath:/r4-us-core-definitions"
  fhirVersion: "R4"
  rowGroupSizeForParquetFiles: 33554432 # 32mb
  #viewDefinitionsDir: "config/views"
  #sinkDbConfigPath: "config/hapi-postgres-config_local_views.json"
  sinkFhirServerUrl: "http://sink-server:8080/fhir"
  #sinkUserName: "hapi"
  #sinkPassword: "hapi123"
  recursiveDepth: 1

# Enable spring boot actuator end points, use "*" to expose all endpoints, or a comma-separated
# list to expose selected ones
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus,pipeline-metrics
```
New file (+31 lines):

```yaml
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# To use this config, FLINK_CONF_DIR env. var should be set to the parent dir.

# This is needed to prevent an "Insufficient number of network buffers"
# exception when running the merger on large input with many workers.
taskmanager.memory.network.max: 256mb

# This is needed to be able to process large resources, otherwise in JDBC
# mode we may get the following exception:
# "The record exceeds the maximum size of a sort buffer ..."
taskmanager.memory.managed.size: 256mb

# This is to make pipeline.run() non-blocking with FlinkRunner; unfortunately
# this is overwritten in `local` mode: https://stackoverflow.com/a/74416240
execution.attached: false

# This is required to track the pipeline metrics when FlinkRunner is used.
execution.job-listeners: com.google.fhir.analytics.metrics.FlinkJobListener
```
New file (+9 lines):

```json
{
  "jdbcDriverClass": "org.postgresql.Driver",
  "databaseService": "postgresql",
  "databaseHostName": "hapi-fhir-db",
  "databasePort": "5432",
  "databaseUser": "admin",
  "databasePassword": "admin",
  "databaseName": "hapi"
}
```
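Config files like the one above are typically turned into a JDBC connection URL. A sketch of that mapping (the helper name and the field-to-URL scheme are assumptions for illustration, not the pipeline's actual loader):

```java
import java.util.Map;

public class JdbcUrlSketch {

    // Hypothetical helper: assembles a JDBC URL from the fields of a
    // hapi-postgres-config-style JSON file like the one above.
    static String buildJdbcUrl(Map<String, String> cfg) {
        return String.format("jdbc:%s://%s:%s/%s",
            cfg.get("databaseService"),
            cfg.get("databaseHostName"),
            cfg.get("databasePort"),
            cfg.get("databaseName"));
    }

    public static void main(String[] args) {
        Map<String, String> cfg = Map.of(
            "databaseService", "postgresql",
            "databaseHostName", "hapi-fhir-db",
            "databasePort", "5432",
            "databaseName", "hapi");
        System.out.println(buildJdbcUrl(cfg)); // jdbc:postgresql://hapi-fhir-db:5432/hapi
    }
}
```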
```diff
@@ -21,7 +21,6 @@
 import ca.uhn.fhir.parser.IParser;
 import com.cerner.bunsen.exception.ProfileException;
 import com.google.common.annotations.VisibleForTesting;
-import com.google.common.base.Strings;
 import com.google.fhir.analytics.JdbcConnectionPools.DataSourceConfig;
 import com.google.fhir.analytics.model.DatabaseConfiguration;
 import com.google.fhir.analytics.view.ViewApplicationException;

@@ -88,6 +87,8 @@ abstract class FetchSearchPageFn<T> extends DoFn<T, KV<String, Integer>> {
   protected final String parquetFile;

+  protected final Boolean createParquetDwh;
+
   private final int secondsToFlush;

   private final int rowGroupSize;

@@ -130,6 +131,7 @@ abstract class FetchSearchPageFn<T> extends DoFn<T, KV<String, Integer>> {
     this.oAuthClientSecret = options.getFhirServerOAuthClientSecret();
     this.stageIdentifier = stageIdentifier;
     this.parquetFile = options.getOutputParquetPath();
+    this.createParquetDwh = options.isCreateParquetDwh();
     this.secondsToFlush = options.getSecondsToFlushParquetFiles();
     this.rowGroupSize = options.getRowGroupSizeForParquetFiles();
     this.viewDefinitionsDir = options.getViewDefinitionsDir();

@@ -200,7 +202,7 @@ public void setup() throws SQLException, ProfileException {
         oAuthClientSecret,
         fhirContext);
     fhirSearchUtil = new FhirSearchUtil(fetchUtil);
-    if (!Strings.isNullOrEmpty(parquetFile)) {
+    if (createParquetDwh) {
       parquetUtil =
           new ParquetUtil(
               fhirContext.getVersion().getVersion(),
```

Reviewer comment (on the `if (createParquetDwh)` change): Do we have a sanity check if …
```diff
@@ -260,4 +260,10 @@ public interface FhirEtlOptions extends BasePipelineOptions {
   String getSourceNDJsonFilePattern();

   void setSourceNDJsonFilePattern(String value);
+
+  @Description("Flag to switch off/on creation of a parquet DWH")
+  @Default.Boolean(true)
+  Boolean isCreateParquetDwh();
+
+  void setCreateParquetDwh(Boolean value);
 }
```
```diff
@@ -73,6 +73,9 @@ fhirdata:
   # that directory too, such that files created by the pipelines are readable by
   # the Thrift Server, e.g., `setfacl -d -m o::rx dwh/`.
   dwhRootPrefix: "dwh/controller_DEV_DWH"
+  # Whether to create a Parquet DWH or not. In case of syncing from one FHIR
+  # server to another, generation of the Parquet DWH can be switched off/on.
+  createParquetDwh: true

   # The schedule for automatic incremental pipeline runs.
   # Uses the Spring CronExpression format, i.e.,
```

Reviewer comment (on the new comment lines): nit: Please also break long lines at 80 chars for YAML files (I know we have not followed it everywhere in this file but we should).
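The `incrementalSchedule` and `purgeSchedule` values use Spring's CronExpression format, which has six fields (seconds, minutes, hours, day-of-month, month, day-of-week) rather than Unix cron's five, so `"0 0 * * * *"` fires at the top of every hour. A quick field-count sketch (not Spring's actual parser):

```java
public class CronFieldsSketch {

    // Spring cron expressions have exactly six whitespace-separated fields;
    // the leading field is seconds, which Unix cron lacks.
    static int fieldCount(String cron) {
        return cron.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(fieldCount("0 0 * * * *"));  // 6 -> hourly, at second 0
        System.out.println(fieldCount("0 30 * * * *")); // 6 -> half past every hour
    }
}
```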
```diff
@@ -118,6 +118,8 @@ public class DataProperties {
   private int recursiveDepth;

+  private boolean createParquetDwh;
+
   @PostConstruct
   void validateProperties() {
     CronExpression.parse(incrementalSchedule);

@@ -127,6 +129,9 @@ void validateProperties() {
         "At least one of fhirServerUrl or dbConfig should be set!");
     Preconditions.checkState(fhirVersion != null, "FhirVersion cannot be empty");

+    Preconditions.checkArgument(
+        !Strings.isNullOrEmpty(dwhRootPrefix), "dwhRootPrefix is required!");
+
     if (!Strings.isNullOrEmpty(dbConfig)) {
       if (!Strings.isNullOrEmpty(fhirServerUrl)) {
         logger.warn("Both fhirServerUrl and dbConfig are set; ignoring fhirServerUrl!");

@@ -138,6 +143,7 @@ void validateProperties() {
       logger.info("Using FHIR-search mode since dbConfig is not set.");
     }
     Preconditions.checkState(!createHiveResourceTables || !thriftserverHiveConfig.isEmpty());
+    Preconditions.checkState(!createHiveResourceTables || createParquetDwh);
   }

   private PipelineConfig.PipelineConfigBuilder addFlinkOptions(FhirEtlOptions options) {
```

Reviewer comment (on the new check): I think there are more config sanity checks that need to be done, e.g., when we are not generating Parquet files, generation of views should be disabled as well.

```diff
@@ -213,6 +219,8 @@ PipelineConfig createBatchOptions() {
         Instant.now().toString().replace(":", "-").replace("-", "_").replace(".", "_");
     options.setOutputParquetPath(dwhRootPrefix + TIMESTAMP_PREFIX + timestampSuffix);

+    options.setCreateParquetDwh(createParquetDwh);
+
     PipelineConfig.PipelineConfigBuilder pipelineConfigBuilder = addFlinkOptions(options);

     // Get hold of thrift server parquet directory from dwhRootPrefix config.
```

Reviewer comment (on `options.setCreateParquetDwh(...)`): Have you tested the incremental pipeline when this flag is turned off? In particular, does the …

```diff
@@ -230,6 +238,7 @@ List<ConfigFields> getConfigParams() {
     return List.of(
         new ConfigFields("fhirdata.fhirServerUrl", fhirServerUrl, "", ""),
         new ConfigFields("fhirdata.dwhRootPrefix", dwhRootPrefix, "", ""),
+        new ConfigFields("fhirdata.createParquetDwh", String.valueOf(createParquetDwh), "", ""),
         new ConfigFields("fhirdata.incrementalSchedule", incrementalSchedule, "", ""),
         new ConfigFields("fhirdata.purgeSchedule", purgeSchedule, "", ""),
         new ConfigFields(
```
Reviewer comment (on the PR): It seems that these new tests are adding 15+ minutes to the e2e test run-time; I think changes in PR #947 had a similar effect too and we should try to reduce this. How about doing the sync test in one of the scenarios only and see how much it reduces the run-time? Maybe we can have only one scenario where sync is on and Parquet generation is off. Please also make sure that the incremental mode is tested in that scenario.