
clean DB #669

Open
sjanssen2 opened this issue May 5, 2017 · 20 comments

Comments

@sjanssen2
Contributor

sjanssen2 commented May 5, 2017

Before we add the new relation "source", we need to clean our DB to make it easier for us and for new devs to understand.
I spent the last two days sifting through our codebases (master branches) of american-gut-web and labadmin to work out which tables are no longer used. I am quite confident about my findings, but I definitely need a review by two people before we actually drop the selected tables.

The following tables are not mentioned anywhere (grep'ing case-insensitively for their raw names):
ag_animal_survey
ag_human_survey
ag_import_stats_tmp
ag_map_markers
ag_participant_exceptions
ag_survey_answer
ag_survey_multiples
ag_survey_multiples_backup
barcode_exceptions
controlled_vocabs
controlled_vocab_values
survey_response_types (update: used by another active table: survey_question_response_type)

I cannot see why these two tables are needed:
table "duplicate_consents":
labadmin/knimin/lib/data_access.py, format_survey_data(), line 726:
this is the only place where we READ from this table. If there are no WRITES, why do we need this static information?

table "promoted_survey_ids":
american-gut-web/amgut/lib/ag_data_access.py, deleteAGParticipantSurvey(), line 325:
this is the only place where we DELETE from this table, when a survey (and, I figure, the consent) is deleted. Can't we simply drop the full table, i.e. for ALL users?

Please @josenavas @wasade review soon, since this is really time-critical to getting the metadata pulldown right.

@sjanssen2
Contributor Author

Here is how I got the number of occurrences of a table name in our codebases:

get all tables:

psql --dbname=ag_test --password --host=localhost --username=sjanssen -c "\dt *." | grep -v "^ pg_catalog" | grep -v "information_schema" | grep -v "^ public" | grep "| table |" | cut -d "|" -f 1,2 | tr -d " " | tr "|" "." > ~/Desktop/tables.txt

check occurrence of a table name:

grep "ag_map_markers" $(find . -name "*.py" | xargs) -c | cut -d ":" -f 2 | sum.pl

list all code files:

files=$(find . -name "*.py" | xargs);

get counts of table-name occurrences in the code files:

files=$(find . -name "*.py" | xargs); for tabname in $(cat ~/Desktop/tables.txt | cut -d "." -f 2); do res=$(grep "$tabname" $files -c -i | cut -d ":" -f 2 | sum.pl); echo "$res $tabname"; done | cut -d ":" -f 2 | sort -n

sum.pl is a small Perl script:

#!/usr/bin/env perl
# sum.pl: sum up all purely numeric lines read from STDIN, echo everything else.
use strict;
use warnings;

my $sum = 0;
while (my $line = <STDIN>) {
    if ($line =~ m/^(-?\d+\.?\d*)$/) {
        $sum += $1;
    } elsif ($line =~ m/^(-?\d+\.?\d*[eE][-+]?\d+)$/) {
        $sum += $1;    # Perl numifies scientific notation directly; no eval needed
    } else {
        print $line;
    }
}

print STDOUT "SUM: $sum\n";
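For reference, the same per-table tally could be sketched in Python, which sidesteps the shell-quoting pitfalls of the pipeline above (the function name and paths here are illustrative, not from either codebase):

```python
import os

def count_occurrences(table, root="."):
    """Case-insensitively count mentions of a table name in all .py files
    under root -- a rough equivalent of the grep pipeline above."""
    needle = table.lower()
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            if not fname.endswith(".py"):
                continue
            with open(os.path.join(dirpath, fname), errors="ignore") as fh:
                total += fh.read().lower().count(needle)
    return total

# e.g.: for line in open("tables.txt"):
#           name = line.strip().split(".")[-1]
#           print(count_occurrences(name), name)
```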

@wasade
Member

wasade commented May 5, 2017

Thanks, @sjanssen2. Reviewing right now.

@wasade
Member

wasade commented May 5, 2017

My guess on the duplicates table is that it was intended to accommodate people who used multiple participant names (e.g. "daniel mcdonald 1", "daniel mcdonald 2"); see this, which was implicated in the "blame" log of data access.

I'm fairly certain the promoted table was a temporary one created when we shifted from the old survey to the "new" survey a few years back. I don't see a reason to retain that table.

Still checking the others.

@wasade
Member

wasade commented May 5, 2017

ag_animal_survey -> old survey
ag_human_survey -> old survey
ag_import_stats_tmp -> from oracle, associated with original kit creation mechanisms, drop
ag_map_markers -> cache from when we were allowed to use map pins, drop
ag_participant_exceptions -> I think legacy for tracking juvenile approvals
ag_survey_answer -> old survey
ag_survey_multiples -> old survey
ag_survey_multiples_backup -> old survey
barcode_exceptions -> I do not have this in my test database
controlled_vocabs -> empty in the test db, drop
controlled_vocab_values -> empty in the test db, drop

@sjanssen2
Contributor Author

Re: duplicates table: if someone is using two different "participant names", doesn't that mean we have different "sources"? A barcode can then only be assigned to one source. Could it be the case that this table tries to circumvent this 1:1 relation?

@josenavas
Copy link
Member

I may need to do some digging through emails, but maybe @wasade, @EmbrietteH, or @jwdebelius remember something about this. When we discovered the problem with participants having multiple participant names, we tried to consolidate the participant names into a single one. The ones that were clear (like Daniel 1, Daniel 2) were consolidated; however, for the unclear ones we said that we would contact the participants - but I don't think we actually did. I'm wondering if that table holds those unclear relationships?

@sjanssen2
Contributor Author

But how would we distinguish cases where someone merely added several "participants" by being sloppy from situations where a user distributed his/her barcodes over several genuine "participants", i.e. human or animal sources, e.g. when swabbing a family?

@sjanssen2
Contributor Author

BTW, this table is only used to determine the HOST_SUBJECT_ID for the current metadata pulldown.

@josenavas
Member

@sjanssen2 HOST_SUBJECT_ID is the same as "participant" or "source"; the duplicates mean the same participant provided 2 names. For example, I wanted to fill in the survey twice, so I added myself as a human source for my first barcode and then added myself again as another source for barcode 2. Technically the source is the same, since I'm a single participant; however, in the system they're represented as 2 different consents. For keeping longitudinal sampling we need that table.

@wasade
Member

wasade commented May 5, 2017

Important distinction: a "source" is either animal, human, or environmental. And yes, it looks like the duplicates table is working around the 1:1.

I do not think "The ones that were clear (like Daniel 1, Daniel 2) were consolidated" was ever done.

@wasade
Member

wasade commented May 5, 2017

...but the duplicates table could be used for longitudinal?

@sjanssen2
Contributor Author

OK, so we have two scenarios:

  1. longitudinal samples, where we need to abuse the system and add further sources to be able to create several surveys (for different time points).
  2. sampling e.g. a family, where we actually swab different people.

Is there any means in the interface for a user to indicate whether he/she is using 1) or 2)?

@josenavas
Member

@wasade I think you're right - they were not consolidated, but rather used the duplicates table.

@sjanssen2 as far as I know, no - however, I think that is diverging into a discussion that is not critical to solving the present issue?

@sjanssen2
Contributor Author

@josenavas I'd say it is very critical to this issue, because it puts the finger right on our open wound: we have no clear concept of a "source" and a "timeseries". Currently, these seem to be wrongly intermingled!

@sjanssen2
Contributor Author

sjanssen2 commented May 5, 2017

In my opinion we need to model the following organization:

A user is a real-life person who contributes to one or more kits and accesses our web interface. Through the kits, the user acquires a set of barcodes.
The user can then create one or several sources, either human, animal, or environmental. Every source should have one or more timepoints.
From what I have heard about our IRB restrictions, I'd say a survey is assigned to one timepoint of one source (together with the consent).

Right now, we do not really support longitudinal studies AND suffer from the unclear relation between "participant_name" = "source" and "survey" & "barcode".
We do log the timestamp of the barcodes, don't we? So we could go over the DB once and allocate multiple barcodes for one (ag_login_id, participant_name) combination to several timepoints.

This implies a major overhaul of our DB, which I cannot do alone and/or within the given limited time. So for now, I'd say we assume "participant_names" are distinct sources and we only add the many-to-many relation for "sources" into the DB.
What do you think?
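The "allocate barcodes to timepoints" idea could be sketched like this; the row layout and names (ag_login_id, participant_name, logged dates) are assumptions for illustration, not the actual schema:

```python
from collections import defaultdict
from datetime import date

def allocate_timepoints(rows):
    """Assign an ordinal timepoint to each barcode, per (login, participant)
    pair, ordered by the logged timestamp -- a sketch, not production code."""
    by_source = defaultdict(list)
    for login, name, barcode, logged_on in rows:
        by_source[(login, name)].append((logged_on, barcode))
    timepoints = {}
    for entries in by_source.values():
        for i, (_, barcode) in enumerate(sorted(entries), start=1):
            timepoints[barcode] = i
    return timepoints

# Hypothetical barcode log: (ag_login_id, participant_name, barcode, logged_on)
rows = [
    ("login-1", "alice", "000001", date(2016, 1, 5)),
    ("login-1", "alice", "000002", date(2016, 6, 9)),
    ("login-2", "bob",   "000003", date(2016, 3, 2)),
]
print(allocate_timepoints(rows))  # {'000001': 1, '000002': 2, '000003': 1}
```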

@wasade
Member

wasade commented May 5, 2017

I agree with @josenavas, I think attempting to resolve the longitudinal sampling is scope creep for the present issue of just being able to pull down all of the metadata associated with a barcode where the person took multiple distinct surveys (e.g. the main one, the fermented foods one, etc.).

@josenavas
Member

I think the issue is different:

  • As far as I understand, the current problem is the lack of ability to relate the multiple different surveys (i.e. main, surfers, fermented...) to a given barcode. This has nothing to do with longitudinal sampling.
  • The second problem is efficiently representing time series for a single participant in which some values of the main survey change.

We need to solve the first problem. If we add the sources table that acts as the many-to-many relationship, as @wasade points out, we solve the first one but not the second one. However, it replicates the current behavior, as there is no limitation on adding the same person twice as a source (as is currently done). Note that in my opinion this problem arises due to a disparity between the DB and the GUI. The GUI exposes an object (source) that doesn't exist in the DB, and duct tape was put in place to support this by passing the login id and the participant name. When the multiple surveys were added, instead of fixing this problem, more duct tape was put in place. The duct tape couldn't hold anymore and now the bathroom is full of water...

Fixing the timeseries problem is a big task that requires a lot of thinking and a big refactor of the DB. Note that we discussed this at the last American Gut code sprint and after 3 days we didn't reach a consensus, so this is a hard problem. Pulling down the metadata is way easier and more time-pressing.
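A minimal sketch of such a many-to-many "source" relation, using an in-memory SQLite DB for illustration (all table and column names below are hypothetical, not the proposed schema):

```python
import sqlite3

# One source can be linked to several surveys (main, fermented foods, ...),
# which is the relationship the metadata pulldown needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (
        source_id        INTEGER PRIMARY KEY,
        ag_login_id      TEXT NOT NULL,
        participant_name TEXT NOT NULL,
        UNIQUE (ag_login_id, participant_name)
    );
    -- the many-to-many link between sources and surveys
    CREATE TABLE source_survey (
        source_id INTEGER REFERENCES source(source_id),
        survey_id TEXT NOT NULL,
        PRIMARY KEY (source_id, survey_id)
    );
""")
conn.execute("INSERT INTO source VALUES (1, 'login-1', 'alice')")
conn.executemany("INSERT INTO source_survey VALUES (?, ?)",
                 [(1, "main"), (1, "fermented_foods"), (1, "surfers")])
n = conn.execute("SELECT COUNT(*) FROM source_survey WHERE source_id = 1").fetchone()[0]
print(n)  # 3 surveys attached to a single source
```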

@wasade
Member

wasade commented May 5, 2017

"This implies a major overhaul of our DB, which I cannot do alone and/or within the given limited time. So for now, I'd say we assume "participant_names" are distinct sources and we only add the many-to-many relation for "sources" into the DB.
What do you think?"

I think that is a good direction

sjanssen2 mentioned this issue May 5, 2017
@sjanssen2
Contributor Author

OK, I pushed a PR to drop the mentioned tables (except duplicate_consent).

@wasade
Member

wasade commented May 5, 2017 via email
