-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clean DB #669
Comments
here is how I got the number of occurrences of a table name in our codebases: get all tables:psql --dbname=ag_test --password --host=localhost --username=sjanssen -c "\dt *." | grep -v "^ pg_catalog" | grep -v "information_schema" | grep -v "^ public" | grep "| table |" | cut -d "|" -f 1,2 | tr -d " " | tr "|" "." > ~/Desktop/tables.txt check occurence of table name:grep "ag_map_markers" list all code files:files= get counts about table name occurence in code files:files= sum.plIs a small Perl script:
|
Thanks, @sjanssen2. Reviewing right now. |
My guess on the duplicates table is that it was intended to accommodate people who used multiple participant names (eg "daniel mcdonald 1", "daniel mcdonald 2"), see this which was implicated in the "blame" log data access. I'm fairly certain the promoted table was a temporary one created when we shifted from the old survey to the "new" survey a few years back. I don't see a reason to retain that table. Still checking the others. |
ag_animal_survey -> old survey |
@Duplicates table: If someone is using two different "participant names" doesn't that mean that we have different "sources"? A barcode can than only be assigned to one source. Could it be the case that this table tries to circumvent this 1:1 relation? |
I may need to do some digging on emails, but maybe @wasade, @EmbrietteH or @jwdebelius remember something about this. When we discovered the problem with participants with multiple participant names we tried to consolidate the participant name into a single one. The ones that were clear (like Daniel 1, Daniel 2) were consolidated, however for the unclear ones we said that we will contact the participants - but I don't think we actually did. I'm wondering if that table holds those unclear relationships? |
but how should we know if someone only added several "participants" by being sloppy from situations where a user distributed his/her barcodes over several "participants", i.e. human or animal sources if he/she e.g. swapped a family? |
btw. this table is only used to determine the HOST_SUBJECT_ID for the current metadata pulldown. |
@sjanssen2 HOST_SUBJECT_ID is the same as "participant" or as "source" the duplicates means the same participant that provided 2 names. For example, I wanted to fill the survey twice, so I added myself as a human source for my first barcode and the added my self again as another source for barcode 2. Technically, the source is the same since I'm a single participant, however in the system they're represented as 2 different consents. For keeping longitudinal sampling we need that table. |
Important distinction: a "source" is either animal, human, or environmental. And yes, it looks like the duplicates table is working around the 1:1. I do not think "The ones that were clear (like Daniel 1, Daniel 2) were consolidated" was ever done. |
...but the duplicates table could be used for longitudinal? |
OK, so we have to scenarios:
Is there any means in the interface for a user to indicate if he/she is using 1) or 2) ? |
@wasade I think you're right - they're not consolidated, but rather used the duplicates table. @sjanssen2 as far as I know no - however I think that is diverging to a discussion that is not critical to solve the present issue? |
@josenavas I'd say it is very critical to this issue, because it puts the finger into our open wound: we have no clear concept of a "source" and a "timeseries". Currently, this seems to be wrongly intermingled! |
In my opinion we need to model the following organization: A user is real life person who contribute to one or more kits and access our web interface. Through the kits he acquires a set of barcodes. Right know, we do not really support longitudinal studies AND suffer from the unclear relation between "participant_name" = "source" and "survey"&"barcode". This implies a major overhaul of our DB, which I cannot do alone and/or within the given limited time. So for know, I'd say, we assume "participant_names" are distinct sources and we only add the many-to-many relation for "sources" into the DB. |
I agree with @josenavas, I think attempting to resolve the longitudinal sampling is scope creep for the present issue of just being able to pull down all of the metadata associated with a barcode where the person took multiple distinct surveys (eg the main one, the fermented foods one, etc). |
I think the issue is different:
We need to solve the first problem. If we add the sources table that acts as the many-to-many relationship as @wasade points out we solve the first one but not the second one. However, it replicates the current behavior as there is no limitation to add the same person twice as source (as it is currently done). Note that in my opinion this problem raises due to a disparity between the DB and the GUI. The GUI exposes an object (source) that it doesn't exist in the DB, and duct tape was put in place to support this by passing the login id and the participant name. When the multiple surveys were added, instead of fixing this problem, more duct tape was put in place. The duct tape couldn't hold anymore and now the bathroom is full of water... Fixing the timeseries problem is a big problem that requires a lot of thinking and a big refactor of the DB. Note that we discussed this on the last american gut code sprint and after 3 days we didn't reach a consensus, so this is a hard problem. Pulling down the metadata is way easier and time pressing. |
"This implies a major overhaul of our DB, which I cannot do alone and/or within the given limited time. So for know, I'd say, we assume "participant_names" are distinct sources and we only add the many-to-many relation for "sources" into the DB. I think that is a good direction |
OK, I pushed a PR to drop the mentioned tables (except duplicate_consent). |
Thanks!
…On Fri, May 5, 2017 at 10:29 AM, Stefan Janssen ***@***.***> wrote:
OK, I pushed a PR to drop the mentioned tables (except duplicate_consent).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#669 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAc8ssi7p7P11CE1Sk4jU5P1U9zJtwKrks5r21yKgaJpZM4NSE7x>
.
|
Before we add in the new relation "source" we need to clean our DB to make it easier to understand for us and additional new devs.
I spent the last two days sifting through our codebase (master branches) of american-gut-web and labadmin and understand which tables are no longer used. I am quite confident about my findings, but I definitively need a two persons review before we actually drop the selected tables.
The following tables are not mentioned (by grep'ing case insensitive their raw name):
ag_animal_survey
ag_human_survey
ag_import_stats_tmp
ag_map_markers
ag_participant_exceptions
ag_survey_answer
ag_survey_multiples
ag_survey_multiples_backup
barcode_exceptions
controlled_vocabs
controlled_vocab_values
survey_response_types(update: used by other active table: survey_question_response_type)I cannot see why those two tables are needed, because...
table "duplicate_consents":
labadmin/knimin/lib/data_access.py: format_survey_data(): line 726:
here is the only place that we READ from this table. If there are no WRITES, why do we need this static information?
table "promoted_survey_ids":
american-gut-web/amgut/lib/ag_data_access.py: deleteAGParticipantSurvey(): line 325
here is the only place where we DELETE from this table, when a survey (and I figure the consent) is deleted. Can't we simply delete the full table, i.e. for ALL users?
Please @josenavas @wasade review soon, since this is really time critical to get the metadata pulldown right.
The text was updated successfully, but these errors were encountered: