feat: support DocumentReference URL attachments #172
Merged
Changes from all commits
In the errors module, a new exit code for when a FHIR server URL is needed but was not provided:

```diff
@@ -16,3 +16,4 @@
 TASK_SET_EMPTY = 21
 ARGS_CONFLICT = 22
 ARGS_INVALID = 23
+FHIR_URL_MISSING = 24
```
The remaining hunks are from `etl.py`:
```diff
@@ -11,12 +11,12 @@
 import sys
 import tempfile
 import time
-from typing import List, Type
+from typing import Iterable, List, Type
 from urllib.parse import urlparse

 import ctakesclient

-from cumulus import common, context, deid, errors, loaders, store, tasks
+from cumulus import common, context, deid, errors, fhir_client, loaders, store, tasks
 from cumulus.config import JobConfig, JobSummary
```
```diff
@@ -27,27 +27,22 @@
 ###############################################################################


-async def load_and_deidentify(
-    loader: loaders.Loader, selected_tasks: List[Type[tasks.EtlTask]]
-) -> tempfile.TemporaryDirectory:
+async def load_and_deidentify(loader: loaders.Loader, resources: Iterable[str]) -> tempfile.TemporaryDirectory:
     """
     Loads the input directory and does a first-pass de-identification

     Code outside this method should never see the original input files.

     :returns: a temporary directory holding the de-identified files in FHIR ndjson format
     """
-    # Grab a list of all required resource types for the tasks we are running
-    required_resources = set(t.resource for t in selected_tasks)
-
     # First step is loading all the data into a local ndjson format
-    loaded_dir = await loader.load_all(list(required_resources))
+    loaded_dir = await loader.load_all(list(resources))

     # Second step is de-identifying that data (at a bulk level)
     return await deid.Scrubber.scrub_bulk_data(loaded_dir.name)


-def etl_job(config: JobConfig, selected_tasks: List[Type[tasks.EtlTask]]) -> List[JobSummary]:
+async def etl_job(config: JobConfig, selected_tasks: List[Type[tasks.EtlTask]]) -> List[JobSummary]:
     """
     :param config: job config
     :param selected_tasks: the tasks to run
```
```diff
@@ -58,7 +53,7 @@ def etl_job(config: JobConfig, selected_tasks: List[Type[tasks.EtlTask]]) -> List[JobSummary]:
     scrubber = deid.Scrubber(config.dir_phi)
     for task_class in selected_tasks:
         task = task_class(config, scrubber)
-        summary = task.run()
+        summary = await task.run()
         summary_list.append(summary)

         path = os.path.join(config.dir_job_config(), f"{summary.label}.json")
```

Review comment on lines -61 to +56: Note that this is still not running tasks in parallel, but just making the task runners able to run async code themselves. (Parallel tasks is a whole other discussion with its own difficulties.)
```diff
@@ -195,6 +190,9 @@ def make_parser() -> argparse.ArgumentParser:
         metavar="PATH",
         help="Bearer token for custom bearer authentication",
     )
+    export.add_argument(
+        "--fhir-url", metavar="URL", help="FHIR server base URL, only needed if you exported separately"
+    )
     export.add_argument("--since", help="Start date for export from the FHIR server")
     export.add_argument("--until", help="End date for export from the FHIR server")

```
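As a usage sketch (the `cumulus.etl` import path, the positional directory order, and all values are assumptions inferred from the `args.*` attributes in this diff, not confirmed by it), the new flag lets a previously exported local folder still reach a server for attachments:

```python
import asyncio

from cumulus import etl  # assumed import path for the etl.py shown here

# Hypothetical invocation: the input is local ndjson from an earlier export,
# so --fhir-url tells the client where DocumentReference attachments live.
asyncio.run(
    etl.main([
        "./exported-ndjson",  # dir_input (assumed positional order)
        "./output",           # dir_output
        "./phi",              # dir_phi
        "--fhir-url", "https://fhir.example.com/R4",
        "--smart-client-id", "my-client-id",
        "--smart-jwks", "./jwks.json",
    ])
)
```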
```diff
@@ -213,6 +211,39 @@ def make_parser() -> argparse.ArgumentParser:
     return parser


+def create_fhir_client(args, root_input, resources):
+    client_base_url = args.fhir_url
+    if root_input.protocol in {"http", "https"}:
+        if args.fhir_url and not root_input.path.startswith(args.fhir_url):
+            print(
+                "You provided both an input FHIR server and a different --fhir-url. Try dropping --fhir-url.",
+                file=sys.stderr,
+            )
+            raise SystemExit(errors.ARGS_CONFLICT)
+        client_base_url = root_input.path
+
+    try:
+        try:
+            # Try to load client ID from file first (some servers use crazy long ones, like SMART's bulk-data-server)
+            smart_client_id = common.read_text(args.smart_client_id).strip() if args.smart_client_id else None
+        except FileNotFoundError:
+            smart_client_id = args.smart_client_id
+
+        smart_jwks = common.read_json(args.smart_jwks) if args.smart_jwks else None
+        bearer_token = common.read_text(args.bearer_token).strip() if args.bearer_token else None
+    except OSError as exc:
+        print(exc, file=sys.stderr)
+        raise SystemExit(errors.ARGS_INVALID) from exc
+
+    return fhir_client.FhirClient(
+        client_base_url,
+        resources,
+        client_id=smart_client_id,
+        jwks=smart_jwks,
+        bearer_token=bearer_token,
+    )
+
+
 async def main(args: List[str]):
     parser = make_parser()
     args = parser.parse_args(args)
```
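To make the URL-resolution rules concrete, here are a few hypothetical inputs and what `create_fhir_client` does with each, traced from the logic above (all values invented for illustration):

```python
# dir_input = "https://fhir.example.com/R4", no --fhir-url
#   -> base URL taken from the input itself
#
# dir_input = "https://fhir.example.com/R4/Group/xyz",
# --fhir-url = "https://fhir.example.com/R4"
#   -> input starts with --fhir-url, so no conflict; the input path wins
#
# dir_input = "https://fhir.example.com/R4",
# --fhir-url = "https://other.example.com/R4"
#   -> SystemExit(errors.ARGS_CONFLICT), since two different servers were named
#
# dir_input = "./local-ndjson", --fhir-url = "https://fhir.example.com/R4"
#   -> base URL is --fhir-url; local data, server only needed for attachments
```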
```diff
@@ -233,45 +264,49 @@ async def main(args: List[str]):
     job_context = context.JobContext(root_phi.joinpath("context.json"))
     job_datetime = common.datetime_now()  # grab timestamp before we do anything

-    if args.input_format == "i2b2":
-        config_loader = loaders.I2b2Loader(root_input, args.batch_size)
-    else:
-        config_loader = loaders.FhirNdjsonLoader(
-            root_input,
-            client_id=args.smart_client_id,
-            jwks=args.smart_jwks,
-            bearer_token=args.bearer_token,
-            since=args.since,
-            until=args.until,
-        )
-
     # Check which tasks are being run, allowing comma-separated values
     task_names = args.task and set(itertools.chain.from_iterable(t.split(",") for t in args.task))
     task_filters = args.task_filter and list(itertools.chain.from_iterable(t.split(",") for t in args.task_filter))
     selected_tasks = tasks.EtlTask.get_selected_tasks(task_names, task_filters)

-    # Pull down resources and run the MS tool on them
-    deid_dir = await load_and_deidentify(config_loader, selected_tasks)
-
-    # Prepare config for jobs
-    config = JobConfig(
-        args.dir_input,
-        deid_dir.name,
-        args.dir_output,
-        args.dir_phi,
-        args.input_format,
-        args.output_format,
-        comment=args.comment,
-        batch_size=args.batch_size,
-        timestamp=job_datetime,
-        tasks=[t.name for t in selected_tasks],
-    )
-    common.write_json(config.path_config(), config.as_json(), indent=4)
-    common.print_header("Configuration:")
-    print(json.dumps(config.as_json(), indent=4))
+    # Grab a list of all required resource types for the tasks we are running
+    required_resources = set(t.resource for t in selected_tasks)
+
+    # Create a client to talk to a FHIR server.
+    # This is useful even if we aren't doing a bulk export, because some resources like DocumentReference can still
+    # reference external resources on the server (like the document text).
+    # If we don't need this client (e.g. we're using local data and don't download any attachments), this is a no-op.
+    client = create_fhir_client(args, root_input, required_resources)
+
+    async with client:
+        if args.input_format == "i2b2":
+            config_loader = loaders.I2b2Loader(root_input, args.batch_size)
+        else:
+            config_loader = loaders.FhirNdjsonLoader(root_input, client, since=args.since, until=args.until)
+
+        # Pull down resources and run the MS tool on them
+        deid_dir = await load_and_deidentify(config_loader, required_resources)
+
+        # Prepare config for jobs
+        config = JobConfig(
+            args.dir_input,
+            deid_dir.name,
+            args.dir_output,
+            args.dir_phi,
+            args.input_format,
+            args.output_format,
+            client,
+            comment=args.comment,
+            batch_size=args.batch_size,
+            timestamp=job_datetime,
+            tasks=[t.name for t in selected_tasks],
+        )
+        common.write_json(config.path_config(), config.as_json(), indent=4)
+        common.print_header("Configuration:")
+        print(json.dumps(config.as_json(), indent=4))

-    # Finally, actually run the meat of the pipeline! (Filtered down to requested tasks)
-    summaries = etl_job(config, selected_tasks)
+        # Finally, actually run the meat of the pipeline! (Filtered down to requested tasks)
+        summaries = await etl_job(config, selected_tasks)

     # Print results to the console
     common.print_header("Results:")
```
Review comment: The changes in this file are the actual feature change -- all other changes are mostly just refactoring to bring `FhirClient` out from a bulk export implementation detail up to a core piece of the `etl.py` machinery.
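Reading the call sites in `etl.py` above, the surface area of that promoted client looks roughly like the skeleton below. This is a sketch reconstructed from this diff alone, not the PR's actual `cumulus/fhir_client.py`, which also has to implement the SMART/JWKS and bearer-token authentication paths:

```python
from typing import Iterable, Optional


class FhirClient:
    """Sketch of the interface implied by etl.py's use of fhir_client.FhirClient."""

    def __init__(
        self,
        url: Optional[str],
        resources: Iterable[str],
        client_id: Optional[str] = None,
        jwks: Optional[dict] = None,
        bearer_token: Optional[str] = None,
    ):
        # url may be None when running purely from local data; in that case
        # the context manager below is effectively a no-op.
        self._url = url
        self._resources = list(resources)
        self._client_id = client_id
        self._jwks = jwks
        self._bearer_token = bearer_token

    async def __aenter__(self) -> "FhirClient":
        # e.g. open an HTTP session and authenticate, if a server URL was given
        return self

    async def __aexit__(self, *exc_info) -> None:
        # e.g. close the HTTP session
        pass
```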