Need functionality that facilitates cross study analysis #145

Open
satagopam7 opened this issue Jan 22, 2020 · 7 comments
Labels
calculation (Issue concerning statistics), server (Issues concerning the back end), UI (Issues concerning the user interface)

Comments

@satagopam7

We need some functionality on both the frontend and the backend to support cross-study data pooling and/or comparison. I can provide more details if needed.

@sherzinger added the calculation, server, and UI labels on Jan 22, 2020
@sherzinger
Member

I thought about this a bit yesterday evening. Given the current tools and structure of Ada, I can propose this approach:

  1. We give the user the possibility to merge two datasets into a third one. This could be done via a simple form where the user can select DataSet1, DataSet2, and the field name to use for merging (e.g. sampleID). Of course this needs to be properly designed so the user understands "why" and "how", without additional training.

  2. In the background we use an existing or new (should be easy to implement) data transformation to achieve this. It is however important that we create a new field "source_data_set".

  3. The user now has the option to create views, charts, or analyses that use the "source_data_set" field if separation is needed (e.g. Age Boxplot).

This is technically the most straightforward approach I can think of.
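
A minimal sketch of steps 2 and 3, written in plain Python purely for illustration (it is not Ada's transformation code). The function names `pool_data_sets` and `group_by_source` and the sample studies are hypothetical; only the "source_data_set" field name comes from the proposal above.

```python
from typing import Dict, List

Row = Dict[str, object]


def pool_data_sets(data_sets: Dict[str, List[Row]]) -> List[Row]:
    """Concatenate the rows of several data sets, tagging each row with its origin."""
    pooled: List[Row] = []
    for name, rows in data_sets.items():
        for row in rows:
            tagged = dict(row)                 # copy, so the source rows stay untouched
            tagged["source_data_set"] = name   # the field proposed in step 2
            pooled.append(tagged)
    return pooled


def group_by_source(rows: List[Row], field: str) -> Dict[str, list]:
    """Collect the values of one field per source data set (e.g. as box plot input)."""
    groups: Dict[str, list] = {}
    for row in rows:
        value = row.get(field)
        if value is not None:
            groups.setdefault(row["source_data_set"], []).append(value)
    return groups


# Hypothetical sample data standing in for two studies.
study_a = [{"sampleID": "A1", "age": 61}, {"sampleID": "A2", "age": 70}]
study_b = [{"sampleID": "B1", "age": 55}]

pooled = pool_data_sets({"study_a": study_a, "study_b": study_b})
ages_by_study = group_by_source(pooled, "age")
```

Here `ages_by_study` holds one list of ages per study, which is the kind of separation the "Age Boxplot" example in step 3 would need.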

@satagopam7
Author

satagopam7 commented Jan 23, 2020 via email

@sherzinger
Member

Hey Venkata,

given the current design of Ada, comparing datasets without merging them into a third one could take considerable effort. The views, tree, analyses, dictionaries, filters, etc. are all centred around the currently selected data set. Pulling (part of) another dataset into these features would require design (and probably architectural) changes to all of them.

We could however think about making the process of merging invisible to the user.
I was thinking about a tab/button "Compare Datasets" that, when clicked, allows you to pick two datasets. We could also add a "Stop comparison" button, which would delete the merged dataset.
The user would not even know that they operate on a new data set.

@satagopam7
Author

satagopam7 commented Jan 23, 2020 via email

@peterbanda
Member

For cross-study comparison there are essentially three options:

  1. Merge transformation - This is exactly the option Sascha correctly pointed out. It is currently supported and works out of the box (although it is allowed only for admins). This transformation can be applied to any number of data sets (not just two), where compatibility is checked by field types. There are two flavors: 1) All the fields are merged automatically by name. Note that not all fields need to be present in each data set (some could be unique). 2) The fields are linked manually by name. Regardless of the transformation type, the result is a new data set with its own dictionary, views, filters, etc.

  2. Data source virtualization - This is something I have presented a couple of times; it would combine several repo sources into one and effectively hide them. The final union would again implement the CRUD repo interface, so it would be pluggable wherever a single data source is currently supported. This has already been reported in "Introduce a virtual repo defined as an union of partial repos" (#4). The main implementation problem would be sorting (e.g. for box plots); the offset and limit operators could also be tricky. For Akka streaming, the best approach that preserves order would be to employ the mergeSorted function, currently used for the optimized linking transformation (not yet released).

  3. Multi Source Visualization - We can of course allow different widgets (charts) in a single view to be generated from different sources, which would mean integration at the visualization level (as was done by Fractalis). This can be supported rather quickly (some ad-hoc experiments along these lines worked), but the resulting artifacts/views would not fit into a single data set abstraction. Therefore new metadata, tree node/type, controllers, and permissions would need to be introduced (kind of ad hoc). Also, to allow a closer comparison of field values from different data sets (studies), introducing a multi-field distribution widget would be quite handy (low-hanging fruit). Already reported in "Extend the numeric distribution widget to support multiple fields (at once) with grouping" (#50).
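
To illustrate why sorting, offset, and limit are the tricky part of option 2, here is a rough Python sketch of an order-preserving union. It uses the standard library's `heapq.merge` as a stand-in for the sorted-merge idea; it is not Akka's mergeSorted and not Ada's repo interface. The names `sorted_source` and `find_union` are hypothetical, and the sketch assumes every partial source can return its rows already sorted by the requested field.

```python
import heapq
from itertools import islice
from typing import Dict, Iterator, List, Optional

Row = Dict[str, object]


def sorted_source(rows: List[Row], sort_field: str) -> Iterator[Row]:
    """Stand-in for a partial repo that can return its rows sorted by one field."""
    return iter(sorted(rows, key=lambda row: row[sort_field]))


def find_union(
    sources: List[List[Row]],
    sort_field: str,
    offset: int = 0,
    limit: Optional[int] = None,
) -> List[Row]:
    """Merge pre-sorted partial sources lazily, then apply a global offset and limit."""
    streams = [sorted_source(rows, sort_field) for rows in sources]
    # heapq.merge keeps the global order as long as every input is sorted.
    merged = heapq.merge(*streams, key=lambda row: row[sort_field])
    stop = None if limit is None else offset + limit
    # Offset and limit only make sense on the merged stream, not per source.
    return list(islice(merged, offset, stop))


# Hypothetical partial repos.
repo_a = [{"id": "a1", "age": 70}, {"id": "a2", "age": 50}]
repo_b = [{"id": "b1", "age": 60}, {"id": "b2", "age": 40}]

page = find_union([repo_a, repo_b], "age", offset=1, limit=2)
```

The offset and limit cannot simply be pushed down to the partial sources: it is not known in advance how many of the skipped rows come from each source, so in the worst case every source has to supply up to offset + limit rows before the global window can be cut.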

Naturally, as is probably implied, proper harmonization is expected for all the presented options. Moreover, solutions 1 and 3 don’t necessarily require matching fields to have the same names, whereas solution 2 would most likely require that (to have a clean implementation).
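
As a rough illustration of the manual-linking flavor of option 1, the Python sketch below merges fields whose names differ between studies and rejects links whose field types disagree. Everything in it is an assumption made for the example: the `merge_linked` function, the shape of the `links` mapping, and the type labels "Integer", "Enum", and "String" are hypothetical and do not reflect Ada's actual transformation or dictionary format. The "source_data_set" column follows the suggestion made earlier in this thread.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def merge_linked(data_sets: Dict[str, dict], links: Dict[str, Dict[str, str]]) -> List[Row]:
    """Merge data sets using explicit links: target field -> {data set: source field}."""
    # Compatibility check: all source fields linked to one target must share a type.
    for target, sources in links.items():
        types = {data_sets[ds]["fields"][field] for ds, field in sources.items()}
        if len(types) > 1:
            raise ValueError(f"Incompatible types for '{target}': {sorted(types)}")

    merged: List[Row] = []
    for ds_name, ds in data_sets.items():
        for row in ds["rows"]:
            out: Row = {"source_data_set": ds_name}
            for target, sources in links.items():
                field: Optional[str] = sources.get(ds_name)
                # A field missing from this data set simply stays empty.
                out[target] = row.get(field) if field is not None else None
            merged.append(out)
    return merged


# Hypothetical studies whose age fields are named differently.
studies = {
    "study_a": {"fields": {"age": "Integer", "sex": "Enum"},
                "rows": [{"age": 61, "sex": "F"}]},
    "study_b": {"fields": {"age_years": "Integer"},
                "rows": [{"age_years": 55}]},
}
field_links = {
    "age": {"study_a": "age", "study_b": "age_years"},
    "sex": {"study_a": "sex"},  # present in only one study
}

merged_rows = merge_linked(studies, field_links)
```

In this sketch a field that exists in only one study is kept and left empty for the others, which mirrors the note that not all fields need to be present in each data set.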

@sherzinger
Member

sherzinger commented Jan 24, 2020

Hi @peterbanda,

Could you show us (maybe in the meeting next week?) what you did with option 2? I don't think I've seen that yet.

Regarding option 3: This is exactly the type of issue I was referring to further up, albeit in less technical language. Technically, injecting some data into a widget is relatively easy, as you mentioned. The problems come from everything else:

  • How do filters work in this case?
  • Do we have to prefix every single field with the source dataset, in order to show to the user where the field came from?
  • What happens when you update the view by clicking on the widgets?
  • What happens when you update the view by clicking on the foreign study field e.g. in the pie chart?
  • Will it still be possible to save a view?
  • How do we indicate everywhere in the UI which studies the filters, fields, and views belong to?
  • What happens if a multi-dataset view is saved (somehow) and access to the dataset is removed?
  • How do you specify filters for the fields you pull in from the other dataset?
  • Do we need to modify every single widget, such that they can account for the new dimension "source study" alongside e.g. "gender"?

These are just some of the questions that came to my mind, and this is largely just UI design. As you correctly pointed out, this would also need to be addressed on an architectural level in many locations.

Maybe limiting option 3 to single analyses/charts (not within a view!) would be doable in a reasonable amount of time if that satisfies the requirements?

And just to underline the fact: Option 1 is already there. We can compare datasets. It just needs to be wrapped in a user-friendly interface.

@sherzinger
Member

Note: I discussed the issue with Venkata and I think we came to an agreement.
I'll prepare a mockup for the next meeting, so we can talk about it in detail.

3 participants