
clean DB #669

Open
sjanssen2 opened this issue May 5, 2017 · 20 comments

Comments

@sjanssen2
Contributor

sjanssen2 commented May 5, 2017

Before we add the new relation "source", we need to clean our DB to make it easier for us and for new devs to understand.
I spent the last two days sifting through our codebases (master branches) of american-gut-web and labadmin to work out which tables are no longer used. I am quite confident about my findings, but I definitely need a review by two people before we actually drop the selected tables.

The following tables are not mentioned anywhere (grep'ing case-insensitively for their raw names):
ag_animal_survey
ag_human_survey
ag_import_stats_tmp
ag_map_markers
ag_participant_exceptions
ag_survey_answer
ag_survey_multiples
ag_survey_multiples_backup
barcode_exceptions
controlled_vocabs
controlled_vocab_values
survey_response_types (update: used by another active table: survey_question_response_type)

I cannot see why these two tables are needed:
table "duplicate_consents":
labadmin/knimin/lib/data_access.py, format_survey_data(), line 726:
this is the only place where we READ from this table. If there are no WRITES, why do we need this static information?

table "promoted_survey_ids":
american-gut-web/amgut/lib/ag_data_access.py, deleteAGParticipantSurvey(), line 325:
this is the only place where we DELETE from this table, when a survey (and, I figure, the consent) is deleted. Can't we simply drop the full table, i.e. for ALL users?

Please @josenavas @wasade review soon, since this is really time-critical to getting the metadata pulldown right.

@sjanssen2
Contributor Author

Here is how I got the number of occurrences of a table name in our codebases:

get all tables:

psql --dbname=ag_test --password --host=localhost --username=sjanssen -c "\dt *." | grep -v "^ pg_catalog" | grep -v "information_schema" | grep -v "^ public" | grep "| table |" | cut -d "|" -f 1,2 | tr -d " " | tr "|" "." > ~/Desktop/tables.txt

check occurrence of a table name:

grep "ag_map_markers" $(find . -name "*.py" | xargs) -c | cut -d ":" -f 2 | sum.pl

list all code files:

files=$(find . -name "*.py" | xargs);

get counts of table-name occurrences in the code files:

files=$(find . -name "*.py" | xargs); for tabname in $(cat ~/Desktop/tables.txt | cut -d "." -f 2); do res=$(grep "$tabname" $files -c -i | cut -d ":" -f 2 | sum.pl); echo "$res $tabname"; done | cut -d ":" -f 2 | sort -n

sum.pl is a small Perl script:

#!/usr/bin/env perl
# sum.pl: sum up all purely numeric lines read from STDIN, echo everything else.
use strict;
use warnings;

my $sum = 0;
while (my $line = <STDIN>) {
    if ($line =~ m/^(-?\d+\.?\d*)$/) {
        $sum += $1;
    } elsif ($line =~ m/^(-?\d+\.?\d*[eE][-+]?\d+)$/) {
        $sum += $1;    # Perl numifies scientific notation directly; no eval needed
    } else {
        print $line;
    }
}

print STDOUT "SUM: $sum\n";
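For reference, the same per-table tally could be sketched in Python, which sidesteps the shell-quoting pitfalls of the pipeline above (the function name and paths here are illustrative, not from either codebase):

```python
import os

def count_occurrences(table, root="."):
    """Case-insensitively count mentions of a table name in all .py files
    under root -- a rough equivalent of the grep pipeline above."""
    needle = table.lower()
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            if not fname.endswith(".py"):
                continue
            with open(os.path.join(dirpath, fname), errors="ignore") as fh:
                total += fh.read().lower().count(needle)
    return total

# e.g.: for line in open("tables.txt"):
#           name = line.strip().split(".")[-1]
#           print(count_occurrences(name), name)
```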

@wasade
Member

wasade commented May 5, 2017

Thanks, @sjanssen2. Reviewing right now.

@wasade
Member

wasade commented May 5, 2017

My guess on the duplicates table is that it was intended to accommodate people who used multiple participant names (e.g. "daniel mcdonald 1", "daniel mcdonald 2"); see this, which was implicated in the "blame" log of data access.

I'm fairly certain the promoted table was a temporary one created when we shifted from the old survey to the "new" survey a few years back. I don't see a reason to retain that table.

Still checking the others.

@wasade
Member

wasade commented May 5, 2017

ag_animal_survey -> old survey
ag_human_survey -> old survey
ag_import_stats_tmp -> from oracle, associated with original kit creation mechanisms, drop
ag_map_markers -> cache from when we were allowed to use map pins, drop
ag_participant_exceptions -> I think legacy for tracking juvenile approvals
ag_survey_answer -> old survey
ag_survey_multiples -> old survey
ag_survey_multiples_backup -> old survey
barcode_exceptions -> I do not have this in my test database
controlled_vocabs -> empty in the test db, drop
controlled_vocab_values -> empty in the test db, drop

@sjanssen2
Contributor Author

Re: duplicates table: if someone is using two different "participant names", doesn't that mean we have different "sources"? A barcode can then only be assigned to one source. Could it be the case that this table tries to circumvent this 1:1 relation?

@josenavas
Copy link
Member

I may need to do some digging through emails, but maybe @wasade, @EmbrietteH, or @jwdebelius remember something about this. When we discovered the problem with participants having multiple participant names, we tried to consolidate the participant names into a single one. The ones that were clear (like Daniel 1, Daniel 2) were consolidated; however, for the unclear ones we said that we would contact the participants - but I don't think we actually did. I'm wondering if that table holds those unclear relationships?

@sjanssen2
Contributor Author

But how would we distinguish cases where someone merely added several "participants" by being sloppy from situations where a user distributed his/her barcodes over several genuine "participants", i.e. human or animal sources, e.g. when swabbing a family?

@sjanssen2
Contributor Author

BTW, this table is only used to determine the HOST_SUBJECT_ID for the current metadata pulldown.

@josenavas
Member

@sjanssen2 HOST_SUBJECT_ID is the same as "participant" or "source"; the duplicates mean the same participant provided 2 names. For example, I wanted to fill in the survey twice, so I added myself as a human source for my first barcode and then added myself again as another source for barcode 2. Technically the source is the same, since I'm a single participant; however, in the system they're represented as 2 different consents. For keeping longitudinal sampling we need that table.

@wasade
Member

wasade commented May 5, 2017

Important distinction: a "source" is either animal, human, or environmental. And yes, it looks like the duplicates table is working around the 1:1.

I do not think "The ones that were clear (like Daniel 1, Daniel 2) were consolidated" was ever done.

@wasade
Member

wasade commented May 5, 2017

...but the duplicates table could be used for longitudinal?

@sjanssen2
Contributor Author

OK, so we have two scenarios:

  1. longitudinal samples, where we need to abuse the system and add further sources to be able to create several surveys (for different time points).
  2. sampling e.g. a family, where we actually swab different people.

Is there any means in the interface for a user to indicate whether he/she is using 1) or 2)?

@josenavas
Member

@wasade I think you're right - they were not consolidated, but rather used the duplicates table.

@sjanssen2 as far as I know, no - however, I think that is diverging into a discussion that is not critical to solving the present issue?

@sjanssen2
Contributor Author

@josenavas I'd say it is very critical to this issue, because it puts the finger right on our open wound: we have no clear concept of a "source" and a "timeseries". Currently, these seem to be wrongly intermingled!

@sjanssen2
Contributor Author

sjanssen2 commented May 5, 2017

In my opinion we need to model the following organization:

A user is a real-life person who contributes to one or more kits and accesses our web interface. Through the kits, the user acquires a set of barcodes.
The user can then create one or several sources, either human, animal, or environmental. Every source should have one or more timepoints.
From what I have heard about our IRB restrictions, I'd say a survey is assigned to one timepoint of one source (together with the consent).

Right now, we do not really support longitudinal studies AND suffer from the unclear relation between "participant_name" = "source" and "survey" & "barcode".
We do log the timestamp of the barcodes, don't we? So we could go over the DB once and allocate multiple barcodes for one (ag_login_id, participant_name) combination to several timepoints.

This implies a major overhaul of our DB, which I cannot do alone and/or within the given limited time. So for now, I'd say we assume "participant_names" are distinct sources and we only add the many-to-many relation for "sources" into the DB.
What do you think?
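The "allocate barcodes to timepoints" idea could be sketched like this; the row layout and names (ag_login_id, participant_name, logged dates) are assumptions for illustration, not the actual schema:

```python
from collections import defaultdict
from datetime import date

def allocate_timepoints(rows):
    """Assign an ordinal timepoint to each barcode, per (login, participant)
    pair, ordered by the logged timestamp -- a sketch, not production code."""
    by_source = defaultdict(list)
    for login, name, barcode, logged_on in rows:
        by_source[(login, name)].append((logged_on, barcode))
    timepoints = {}
    for entries in by_source.values():
        for i, (_, barcode) in enumerate(sorted(entries), start=1):
            timepoints[barcode] = i
    return timepoints

# Hypothetical barcode log: (ag_login_id, participant_name, barcode, logged_on)
rows = [
    ("login-1", "alice", "000001", date(2016, 1, 5)),
    ("login-1", "alice", "000002", date(2016, 6, 9)),
    ("login-2", "bob",   "000003", date(2016, 3, 2)),
]
print(allocate_timepoints(rows))  # {'000001': 1, '000002': 2, '000003': 1}
```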

@wasade
Member

wasade commented May 5, 2017

I agree with @josenavas, I think attempting to resolve the longitudinal sampling is scope creep for the present issue of just being able to pull down all of the metadata associated with a barcode where the person took multiple distinct surveys (e.g. the main one, the fermented foods one, etc.).

@josenavas
Member

I think the issue is different:

  • As far as I understand, the current problem is the lack of ability to relate the multiple different surveys (i.e. main, surfers, fermented...) to a given barcode. This has nothing to do with longitudinal sampling.
  • The second problem is efficiently representing time series for a single participant in which some values of the main survey change.

We need to solve the first problem. If we add the sources table that acts as the many-to-many relationship, as @wasade points out, we solve the first one but not the second one. However, it replicates the current behavior, as there is no limitation on adding the same person twice as a source (as is currently done). Note that in my opinion this problem arises due to a disparity between the DB and the GUI. The GUI exposes an object (source) that doesn't exist in the DB, and duct tape was put in place to support this by passing the login id and the participant name. When the multiple surveys were added, instead of fixing this problem, more duct tape was put in place. The duct tape couldn't hold anymore and now the bathroom is full of water...

Fixing the timeseries problem is a big task that requires a lot of thinking and a big refactor of the DB. Note that we discussed this at the last American Gut code sprint and after 3 days we didn't reach a consensus, so this is a hard problem. Pulling down the metadata is way easier and more time-pressing.
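A minimal sketch of such a many-to-many "source" relation, using an in-memory SQLite DB for illustration (all table and column names below are hypothetical, not the proposed schema):

```python
import sqlite3

# One source can be linked to several surveys (main, fermented foods, ...),
# which is the relationship the metadata pulldown needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (
        source_id        INTEGER PRIMARY KEY,
        ag_login_id      TEXT NOT NULL,
        participant_name TEXT NOT NULL,
        UNIQUE (ag_login_id, participant_name)
    );
    -- the many-to-many link between sources and surveys
    CREATE TABLE source_survey (
        source_id INTEGER REFERENCES source(source_id),
        survey_id TEXT NOT NULL,
        PRIMARY KEY (source_id, survey_id)
    );
""")
conn.execute("INSERT INTO source VALUES (1, 'login-1', 'alice')")
conn.executemany("INSERT INTO source_survey VALUES (?, ?)",
                 [(1, "main"), (1, "fermented_foods"), (1, "surfers")])
n = conn.execute("SELECT COUNT(*) FROM source_survey WHERE source_id = 1").fetchone()[0]
print(n)  # 3 surveys attached to a single source
```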

@wasade
Member

wasade commented May 5, 2017

"This implies a major overhaul of our DB, which I cannot do alone and/or within the given limited time. So for now, I'd say we assume "participant_names" are distinct sources and we only add the many-to-many relation for "sources" into the DB.
What do you think?"

I think that is a good direction

sjanssen2 mentioned this issue May 5, 2017
@sjanssen2
Contributor Author

OK, I pushed a PR to drop the mentioned tables (except duplicate_consent).

@wasade
Member

wasade commented May 5, 2017 via email
