Feature request: fermentation products #144

Open
morganx opened this issue Apr 20, 2022 · 9 comments
Labels
curation enhancement New feature or request

Comments

@morganx

morganx commented Apr 20, 2022

It's very useful to be able to pull glucose fermentation products out of Bergey data, for example because human microbiome-related publications regularly claim 'X organism is a butyrate producer', and this is only sometimes correct.

Here's a linked example of a Bergey chapter (https://drive.google.com/file/d/1L1sgkZClTp3NjDW3RgAnZvzx4fiobyCG/view?usp=sharing). I've highlighted some of the different ways this data is displayed for various bugs. You need to download it to see highlights (sorry) - Google doc view does not show them. It would be challenging to parse out this data, but really useful to include as a phenotype if possible in future.

@lwaldron
Member

We have never actually parsed the Bergey's full-text like this, only abstracts. Why did you select Table 3 instead of Table 2 or instead of both? Could we focus just on the tables? There are PDF table parsers that, depending on how the PDF is built, might be able to automate this. Otherwise it will be a Natural Language Processing project.

@lwaldron lwaldron added the enhancement New feature or request label Apr 20, 2022
@morganx
Author

morganx commented Apr 20, 2022

@lwaldron I don't believe every chapter is formatted consistently to the point that Table 1 or Table 2 would always have fermentation products. I highlighted several different examples because I don't always find the info in the same place.

I haven't thought much about how to solve this problem because I didn't even realize you were working on it until we talked today, but it might be possible to at least curate a medium-to-large chunk just by parsing tables with "fermentation products".
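As a rough illustration of the "parse tables with 'fermentation products'" idea, here is a minimal sketch. It assumes a PDF table parser has already turned a table into a list of rows (each row a list of cell strings) with organisms as columns. The example table, its organism names, and the `fermentation_products` helper are all hypothetical and not taken from any Bergey chapter.

```python
import re

# Hypothetical parsed table: each row is a list of cell strings, as a PDF
# table parser might return. The organisms and values are made up.
EXAMPLE_TABLE = [
    ["Characteristic", "Organism A", "Organism B"],
    ["Gram stain", "+", "-"],
    ["Fermentation products from glucose", "acetate, butyrate", "lactate"],
]

# Match row labels that mention fermentation products, ignoring case.
LABEL_PATTERN = re.compile(r"fermentation\s+products", re.IGNORECASE)


def fermentation_products(table):
    """Return {organism: [products]} from rows labeled as fermentation products."""
    if not table:
        return {}
    header, *rows = table
    organisms = header[1:]  # assumes the first column holds the row labels
    result = {}
    for row in rows:
        if not row or not LABEL_PATTERN.search(row[0]):
            continue
        for organism, cell in zip(organisms, row[1:]):
            # Split a cell such as "acetate, butyrate" into separate products.
            products = [p.strip().lower() for p in re.split(r"[,;]", cell) if p.strip()]
            result.setdefault(organism, []).extend(products)
    return result


if __name__ == "__main__":
    print(fermentation_products(EXAMPLE_TABLE))
```

Since chapters are not formatted consistently, a label match like this would only catch tables that use that wording; anything reported in running text would still need manual curation.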

@lwaldron
Member

Yeah, at the level of tables, a semi-automated process could look something like the following. It doesn't have to be 3 people; this is just to show how the work could be broken up, though Person 1 would probably be more of a microbiome expert than Persons 2 and 3:

  1. Person 1 creates a written TODO list of tables to be ingested.
  2. Person 2 splits the PDFs so that each of those tables sits in a 1-page PDF of its own.
  3. Person 3 runs a PDF table parser on each of those files, does some manual cleanup, and does sanity checks against the original PDFs.

If you would start noting tables you'd like to see ingested, I'll put them on a priority list for as soon as I can find a person with bandwidth. It doesn't necessarily have to be just fermentation products, and once we're pulling those from a table like this we might as well pull the other rows too.
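To make the hand-offs between the three roles concrete, here is a minimal sketch of what the TODO list (step 1) and an automated part of the sanity checks (step 3) could look like. The `TableTask` fields, the CSV layout, and the chapter name and page number in the usage example are assumptions for illustration only. Cutting the PDFs and running the table parser are left out because they depend on which PDF tools are chosen.

```python
import csv
import io
from dataclasses import dataclass, asdict

FIELDS = ["chapter", "table_label", "page", "expected_columns", "status"]


@dataclass
class TableTask:
    """One entry in Person 1's TODO list (hypothetical layout)."""
    chapter: str
    table_label: str
    page: int              # 1-based page of the table in the source PDF
    expected_columns: int  # column count read off the original table
    status: str = "todo"


def write_manifest(tasks):
    """Serialize the TODO list as CSV text so it can be shared and edited."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for task in tasks:
        writer.writerow(asdict(task))
    return buffer.getvalue()


def sanity_check(task, parsed_rows):
    """Return a list of problems found in a parsed table (empty if none)."""
    problems = []
    if not parsed_rows:
        problems.append("no rows parsed")
    for i, row in enumerate(parsed_rows, start=1):
        if len(row) != task.expected_columns:
            problems.append(
                f"row {i}: expected {task.expected_columns} cells, got {len(row)}"
            )
    return problems


if __name__ == "__main__":
    # Chapter name and page number are placeholders, not real references.
    task = TableTask("Example chapter", "Table 3", page=12, expected_columns=3)
    print(write_manifest([task]))
    print(sanity_check(task, [["a", "b", "c"], ["d", "e"]]))
```

A column-count check like this only catches parser failures such as merged or dropped cells; comparing the values against the original PDF would still be a manual step.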

@JonathanYe3
Collaborator

I would like to take a shot at being Person 2 and Person 3! I just need Person 1 to direct me to the tables, since I'm not a microbiome expert.

@kbeckenrode
Contributor

@JonathanYe3 Let me know what you need when you get around to this.

@JonathanYe3
Collaborator

Thanks @kbeckenrode! All I need is a list of tables that we want to parse and clean up. Can @lwaldron confirm whether this PDF (https://drive.google.com/file/d/1L1sgkZClTp3NjDW3RgAnZvzx4fiobyCG/view?usp=sharing) is the only one we're working with?

@kbeckenrode
Contributor

@JonathanYe3 We'd like to capture any table that describes fermentation. Let me know if you need help.

@JonathanYe3
Collaborator

Yup, thank you! I think Eric Yu is working hard on this project so best of luck to him.

@sdgamboa sdgamboa added this to BugPhyzz Dec 2, 2022
@sdgamboa sdgamboa moved this to Todo in BugPhyzz Dec 2, 2022
@sdgamboa
Contributor

Link from related issue #204. I think this was the output from the web scraping.
