From network drives to de-coupled data - what's standing in our way! (and where can it bring the most benefit) #5
epijim started this conversation in roundtable
Replies: 1 comment
added to the agenda
Proposal
Traditionally, statistical programmers worked with data on network drives (e.g. ACL-permissioned data in folders). In the data science space, this has largely been replaced by APIs such as GraphQL, ODBC for SQL, or REST for many object stores.
In our company, RWD moved 100% to databases and S3 ~8 years ago. This has helped enable a culture where data users across the company work from source, even if outside the core Scientific Computing Environment platform (e.g. Posit Connect, HPCs, Spotfire), rather than making local copies of data. It has also allowed new capabilities, like doing joins and pulls across TBs of data in seconds.
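To make "working from source" concrete, here is a minimal sketch of pushing a join down to the database instead of copying files locally. It is only an illustration: an in-memory SQLite database stands in for a real warehouse reached over ODBC, and the table names, columns, and the `visits_per_site` helper are all hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a shared warehouse.
# The tables, columns, and rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, site TEXT);
    CREATE TABLE visits (patient_id INTEGER, visit_date TEXT);
    INSERT INTO patients VALUES (1, 'A'), (2, 'B'), (3, 'A');
    INSERT INTO visits VALUES
        (1, '2023-01-05'), (1, '2023-02-10'), (3, '2023-03-01');
    """
)


def visits_per_site(connection):
    """Run the join and aggregation in the database; only the summary comes back."""
    query = """
        SELECT p.site, COUNT(*) AS n_visits
        FROM patients AS p
        JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.site
        ORDER BY p.site
    """
    return connection.execute(query).fetchall()


summary = visits_per_site(conn)
```

The point of the sketch is that the caller never holds a local copy of either table. Only the aggregated rows cross the connection, which is what makes the same pattern workable at much larger scale on a real engine.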
We have migrated to parquet for clinical trial data - but still store data in mounted filesystem-like interfaces. Can we migrate clinical trial data access to APIs - do we expect the same benefits? Where should we focus? What experiences have people had across companies? What are the opportunities of moving from folder hierarchies to tags (and nested tags)?
Expected impact
In the 2023 round tables, the SCE discussion flagged having to support network drive based workflows as one of the biggest blockers SCE leads face when trying to modernise our platforms. There would be value in diving deeper into this topic.
Prior discussions/work
Several companies have experience with this. Even those offering 'network drive like' end-user experiences may have learnings to share - e.g. using FSx for Lustre's API to expose S3 data as mounted drives.
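As a conversation starter for the folder-versus-tag question in the proposal, here is a rough sketch of looking up datasets by tags (including nested tags) instead of by path. Everything in it is an assumption made for illustration: the `Dataset` class, the `find` helper, the tag names, and the `s3://example-bucket/...` URIs do not describe any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    uri: str          # opaque location; carries no meaning for the user
    tags: frozenset   # e.g. {"study/S1/adam", "domain/adsl"}


def matches(dataset_tag: str, wanted: str) -> bool:
    """A nested tag like 'study/S1/adam' matches itself and any ancestor tag."""
    return dataset_tag == wanted or dataset_tag.startswith(wanted + "/")


def find(catalog, *wanted):
    """Return sorted URIs of datasets carrying every wanted tag (or a descendant)."""
    return sorted(
        d.uri
        for d in catalog
        if all(any(matches(t, w) for t in d.tags) for w in wanted)
    )


# Hypothetical catalog: three parquet objects with flat, meaningless keys.
CATALOG = [
    Dataset("s3://example-bucket/0001.parquet",
            frozenset({"study/S1/adam", "domain/adsl", "status/final"})),
    Dataset("s3://example-bucket/0002.parquet",
            frozenset({"study/S1/sdtm", "domain/dm", "status/final"})),
    Dataset("s3://example-bucket/0003.parquet",
            frozenset({"study/S2/adam", "domain/adsl", "status/draft"})),
]

everything_for_s1 = find(CATALOG, "study/S1")
final_adsl = find(CATALOG, "domain/adsl", "status/final")
```

A folder hierarchy forces one ordering (for example study, then standard, then domain), so "all final ADSL datasets across studies" needs a directory walk. With tags the same dataset can be reached along any combination of dimensions, and nested tags keep the drill-down feel of folders where that is still useful.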
Would you be willing to potentially facilitate this discussion?
Maybe/No/ask me again later