Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construct a phylogenetic tree from Kraken2 in order to use in phyloseq object #46

Open
magibc opened this issue Mar 10, 2022 · 10 comments

Comments

@magibc
Copy link

magibc commented Mar 10, 2022

Hi,
It's my first time analysing metagenomics from a shotgun experiment but before to post this issue I tried some strategies to overcome the topic described below.

I would like to construct a phyloseq object from Kraken2 + Bracken output but containing a phylogenetic tree in the phyloseq object in order to be able to calculate Unifrac distances in downstream analysis.

Could you recommend me some software to create a phylogenetic tree from Kraken2 output?

Thanks on advance for your help/hints,

Magí.

@magibc magibc changed the title kreport2mpa.py and combine_mpa.py to phyloseq object Construct a phylogenetic tree from Kraken2 in order to use in phyloseq object Mar 15, 2022
@damioresegun
Copy link

I was trying to do this recently too but as I understand it, you can't make it directly from a Kraken output as that only have taxonomic information. For phylogenetic information, you need to carry out a multiple sequence alignment and then carry out bootstrap calculation. Its somewhat computationally difficult to do and I honestly gave up on doing that for my samples. In terms of software that could make one, you could try MEGA although in my experience, that works for small-ish datasets.

For phyloseq, you can make a taxonomic tree that could help in visualisation however, it doesn't really do much in providing phylogenetic information that you can analyse.

If you do manage to find something that works for metagenomic data, let me know!

@CMagnoBR
Copy link

CMagnoBR commented Dec 12, 2022

Hey guys!
Let me know too! Recently Im looking for a method for infer phylogenetic tree, and produce a newick tree file for others purposes.
Using Kraken2, we can extract the mapped sequences using "--classified-out" option. However, there aren`t any method to use these sequences for phylogenetic inference. The head of the reads has the TaxID classified by the Kraken2. I believe that is the way. Suggestions?

Regards.

@CMagnoBR
Copy link

Hi!
Past 1 year, there is some advance for this question??
Thanks!

@paulzierep
Copy link

I need to look into this, but should it not be possible to create a tree from the tax ids using e.g. ete3: http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html
In any case the tree is optional for phyloseq so you can use most functions without the tree.

@cxkBio
Copy link

cxkBio commented Oct 2, 2024

I need to look into this, but should it not be possible to create a tree from the tax ids using e.g. ete3: http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html In any case the tree is optional for phyloseq so you can use most functions without the tree.

Have you implemented getting phylogenetic trees from Kraken?

@paulzierep
Copy link

Since we cannot use sequences to create the tree, it can only work using the taxi id information from ncbi.
Based on this me and chatGPT ;) made this script:

from ete3 import NCBITaxa

# Initialize NCBITaxa object
ncbi = NCBITaxa()

# Function to parse Kraken2 report and extract TaxIDs
def parse_kraken2_report(file_path):
    taxid_set = set()
    with open(file_path, 'r') as file:
        for line in file:
            fields = line.strip().split("\t")
            if len(fields) > 4:
                taxid = fields[4]  # TaxID is in the 5th column
                taxid_set.add(int(taxid))  # Add to the set to avoid duplicates
    return list(taxid_set)

# Function to build a tree from TaxIDs
def build_taxonomic_tree(taxids):
    # Fetch the taxonomic tree based on the TaxIDs
    tree = ncbi.get_topology(taxids)
    return tree

# Function to visualize the tree
def show_tree(tree):
    tree.show()

# Optionally, you can export the tree to Newick or an image
def save_tree(tree, output_file):
    tree.write(format=1, outfile=output_file)
    tree.render(output_file.replace(".nwk", ".png"), w=183, units="mm")

# Main workflow
kraken2_report_file = "path_to_your_kraken2_report.txt"  # Update this path
taxids = parse_kraken2_report(kraken2_report_file)
print(f"Found {len(taxids)} unique TaxIDs.")

tree = build_taxonomic_tree(taxids)

# Show the tree interactively
show_tree(tree)

# Optionally, save the tree to a Newick file and an image
save_tree(tree, "kraken2_taxonomic_tree.nwk")

Since for phyloseq one needs matching ids in the OTU table and the tree you would need to use ncbi id / abundance for the OTU table input and the newick tree from the script. Let me know if that works.

@cxkBio
Copy link

cxkBio commented Oct 2, 2024

Since we cannot use sequences to create the tree, it can only work using the taxi id information from ncbi. Based on this me and chatGPT ;) made this script:

from ete3 import NCBITaxa

# Initialize NCBITaxa object
ncbi = NCBITaxa()

# Function to parse Kraken2 report and extract TaxIDs
def parse_kraken2_report(file_path):
    taxid_set = set()
    with open(file_path, 'r') as file:
        for line in file:
            fields = line.strip().split("\t")
            if len(fields) > 4:
                taxid = fields[4]  # TaxID is in the 5th column
                taxid_set.add(int(taxid))  # Add to the set to avoid duplicates
    return list(taxid_set)

# Function to build a tree from TaxIDs
def build_taxonomic_tree(taxids):
    # Fetch the taxonomic tree based on the TaxIDs
    tree = ncbi.get_topology(taxids)
    return tree

# Function to visualize the tree
def show_tree(tree):
    tree.show()

# Optionally, you can export the tree to Newick or an image
def save_tree(tree, output_file):
    tree.write(format=1, outfile=output_file)
    tree.render(output_file.replace(".nwk", ".png"), w=183, units="mm")

# Main workflow
kraken2_report_file = "path_to_your_kraken2_report.txt"  # Update this path
taxids = parse_kraken2_report(kraken2_report_file)
print(f"Found {len(taxids)} unique TaxIDs.")

tree = build_taxonomic_tree(taxids)

# Show the tree interactively
show_tree(tree)

# Optionally, save the tree to a Newick file and an image
save_tree(tree, "kraken2_taxonomic_tree.nwk")

Since for phyloseq one needs matching ids in the OTU table and the tree you would need to use ncbi id / abundance for the OTU table input and the newick tree from the script. Let me know if that works.

Thank you so much for your help. I'll try it. Best wishes to you!

@paulzierep
Copy link

You are welcome, let me know if it works, I will probably make a Galaxy tool out of the script at one point and make this a easy to use workflow ... should also work with other read-based classifiers

@cxkBio
Copy link

cxkBio commented Oct 2, 2024

You are welcome, let me know if it works, I will probably make a Galaxy tool out of the script at one point and make this a easy to use workflow ... should also work with other read-based classifiers

Perhaps you could give Bracken, Phanta, and metaphlan a try based on reads. Additionally, metaphlan includes way for drawing phylogenetic trees, so you could make comparisons. For now, due to data processing issues, I'm unable to test your code at once.

@CMagnoBR
Copy link

CMagnoBR commented Oct 15, 2024

Wow! Thank you guys!
Let me know if this tool has been implemented as a Galaxy tool. By the way, I thank if this is implemented at Galaxy Europe server.

Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants