How to add more marker genes #28

Open
mudymudy opened this issue May 15, 2023 · 10 comments

Comments

@mudymudy

Dear developers,

I've been using r2t successfully when I run it employing marker genes from the OMA db.
My problem is that I want to create a more specific or sensitive tree using Clostridium marker genes that I obtained from the OMA website, together with some Clostridium amino acid sequences that I downloaded from UniProt (from C. septicum and C. chauvoei, which are not present in the OMA database).

Is that possible? I did run OMA standalone software with those sequences and it gave me the OG.fa files, which are the OMA group files. But when I use those sequences with r2t, it says that those sequences are not nucleotide sequences.
Then I realised that the OMA marker genes have both AA and nucleotide sequences. Is there any way to include new genomes that are not present in the OMA db? I would like to add some C. septicum and C. chauvoei to my tree but I can't find a way to do it.

Thank you in advance!

Bruno

@sinamajidian
Contributor

Hi Bruno @mudymudy

We are glad you were able to run read2tree successfully.

As you mentioned, read2tree needs both amino acid and nucleotide sequences to be able to map the sequencing reads onto protein/gene markers.

Yes, it is possible to generate gene markers for your set of species using OMA standalone, and we describe how in our wiki here. Note that the viral dataset is an example and the instructions are valid for eukaryotes. Another point is that the instructions are based on the FASTA record ID format of NCBI RefSeq. I think UniProt does not provide the nucleotide sequences, as mentioned here.

You mention that there are some species in the OMA browser that you can benefit from. I'm afraid that you cannot "directly" use those OGs and add new species to them. If you are considering <50 species, I would suggest running OMA standalone from scratch using all AA sequences of the species (and keeping their nt sequences with similar FASTA record IDs for read2tree).
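
As a rough illustration of the point about keeping matching FASTA record IDs, here is a minimal Python sketch that compares the record IDs of an amino acid FASTA file with those of a nucleotide FASTA file. It assumes the ID is the first whitespace-delimited token after '>'. The function names are hypothetical and are not part of read2tree or OMA.

```python
def fasta_ids(lines):
    """Return the set of record IDs (first token after '>') found in FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def compare_ids(aa_lines, nt_lines):
    """Return (IDs only in the AA file, IDs only in the nt file) as sorted lists."""
    aa_ids = fasta_ids(aa_lines)
    nt_ids = fasta_ids(nt_lines)
    return sorted(aa_ids - nt_ids), sorted(nt_ids - aa_ids)
```

Both functions accept any iterable of lines, so open file handles for your own AA and nt files can be passed in directly. Two empty lists mean every ID appears in both files.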

But if you have more species and limited computational resources, you can first use this export All-All section to download the all-vs-all comparison of those species in OMA. Then follow the description here, under the section Adding/Updating new genomes.

Alternatively, you can use the sequencing reads of C. septicum and C. chauvoei instead of their proteome and run read2tree in multi-species mode.

Please let me know which part is not clear and we'll try our best to describe it in detail.

Best regards,
Sina

@mudymudy
Author

mudymudy commented May 22, 2023

Dear Sina,

Thank you for your reply!
I followed your instructions and I generated the OG files. When I run read2tree
using this command:

read2tree --standalone_path marker_genes --output_path output --reference --dna_reference data/all_cdna_out.fa

I get this error message:

2023-05-22 12:25:52,725 - read2tree.Aligner - INFO - OG1355 with error Sequences must all be the same length

This is just one example, but many OGs give the same message. Do you know whether this is a critical error or one that can be ignored, and whether there is a way to fix it?

Thank you again!!

@sinamajidian
Contributor

You're welcome.
It would be great if you could share with us the full log/error stored in mplog.log file, and also 5-10 lines of 01_ref_ogs_aa/OG1355.fa and 01_ref_ogs_dna/OG1355.fa. Thanks!
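
For anyone who wants to inspect such a pair of files themselves first, below is a minimal Python sketch that reads the AA and DNA versions of one OG and lists records whose lengths look inconsistent. It rests on an assumption that is not stated in this thread: that each DNA record should be exactly 3 times the protein length, optionally plus one stop codon. The function names are hypothetical, and this is not how read2tree itself performs the check.

```python
def read_fasta(lines):
    """Parse FASTA lines into {record_id: sequence}, joining wrapped sequence lines."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def length_mismatches(aa, dna):
    """Return [(id, aa_len, dna_len)] for records that fail the 3x length assumption.

    dna_len is None when the ID is missing from the DNA records.
    """
    bad = []
    for rid, prot in aa.items():
        prot_len = len(prot.rstrip("*"))  # ignore a trailing stop symbol
        nt = dna.get(rid)
        if nt is None:
            bad.append((rid, prot_len, None))
        elif len(nt) not in (3 * prot_len, 3 * prot_len + 3):
            bad.append((rid, prot_len, len(nt)))
    return bad
```

Calling read_fasta on open handles for 01_ref_ogs_aa/OG1355.fa and 01_ref_ogs_dna/OG1355.fa and passing both results to length_mismatches would list any record worth a closer look.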

@mudymudy
Author

mudymudy commented May 23, 2023

Thank you for your reply!

The first lines of 01_ref_ogs_aa/OG1355.fa:

s0000|JACKWY010000007.1.MBB6715677.1.2612_OG1355 s0000|JACKWY010000007.1.MBB6715677.1.2612 [s0000]
MTFEVVGKNVNRLDGVEKVTGRAKYTDDFFERDMLIGKVLRSPYAHAIVKNIDTTKALALEGVEAVITYKDLPKIKFATAGHPWSLDPSHRDIDDRLILTDKARFVGDAIAAVVASDELIAEKALKLIEVEYEVLPHILKAEDAIKEDAPIIHEERPNNILSTFGSETGKVEDDMKNAHKIFKGVYETSIVQHCHMENHTAYSYVDSNGRIVIISSTQIPHIVRRIVGQALGMSWGNIRVIKPCVGGGFGNKQDVVIEPLVAAMSLAVHGKPVRYALSREECFIDTRTRHGMKIKFNTAVSKDGKLLGLDIENLVNNGAYASHGHSVAMSAGGKFRPLYNFNSIKYSPTTVYTNLPVAGAMRGYGAPQMCFALESHLDDIARELNIDPIEFRKANLIKEGYIDPLSKNVVRSFVLPECIDKGKELIKWDEKKRKYKNQKGDKRRGVGMACFSYFSGTHPVALETAGARIVMNQDGSIQLQIGATEIGQGSDTVFGQMAAECIGLPIDMVHVVETHDTDITPFDTGSYASRQTFVAGAVVKKAAMEVRDKVLTFASNNCGLNKDELDIVNCEIIEKRLGRRICSLEDIAMESYYDRIKCCPITSDTSANVRMNAIAYGVTYAEVEVDIKTGEIEVLEIYNVHDSGIIMNHKLAEGQVDGGVSMGLGYALSEQMLFDEKTGRLLNDNLLDYKLQTIMDTPTINSAFIEKYEPAGSFGQKSLGENTTVSPAPAIRNAVLDAIGIGFNRIPMNPQSVFEKIKESGLVYEGEKENV
s0004|LWAE01000002.1.KZL92195.1.1929_OG1355 s0004|LWAE01000002.1.KZL92195.1.1929 [s0004]
MTYKVIGNSVNRVDAIAKVTGKAKYVDDFFERDMLVGKVLRSPYAHAIVKNIDVSRAKALNGVEAVITHMDLPKIKFSTAGHPWSLDPDHRDIEDRLILTDKARFVGDGVAAVIAVNELIAEKALKLIEVEYEILPHVIDPEEAIKPGAPVVHEERPNNIISSFGAEYGDIEAEFKNSDYVFEGIYETSIVQHCHIENHTSYAYIDTDGRIVIISSTQIPHIVKRIVGQALGLPWGRIRVIKPYVGGGFGNKQDVIIEPLTAAMTLAVKGRPVRYRMTREEAFIDTRTRHAMKFSLKTAVSKEGKLTGIYVGDIVNNGAYASHGHSVAMSAGSKFRPLYNFRSIKFDPKTVYTNLPTAGAMRGYGVPQICFALESHLDDIAREMNIDPIEFRNQNLISAGHMDPLTKNVVRTFGIPECIEKGKELINWDEKKKRYKNQSGERRRGIGMACFSYLSGTHPVALELAGARIIMNQDGSVQLQVSAAEIGQGSDTVLAQMAAEVLGLSMDMVHVIASQDTDVSPFDTGAYASRQTFVTGAAVKKAAVEVRQKVLELAMKKTGLCGDELDMQDAQIIEKRTGRVVCSLEDIAMESYYDRVNASPISSDITENVRINATAYGVTFAEVEVDMKVGKIEVLEIYNVHDSGVIINPKLAEGQVNGGVSMGLGYALSEQLLFDKKTGKPLNNNLLDYKLQTILDTPEIGVAFVEKYEPAGSFGQKSLGENPSISPAPAIRNAVLDATGIAFNKIPMNPQAVFEKFKEAGLL
s0040|CYZV01000038.1.CUO66340.1.2896_OG1355 s0040|CYZV01000038.1.CUO66340.1.2896 [s0040]
MSYKILGKSVNRVDAIAKVTGKAKYAEDYFEREMLVGKVLRSTYAHAKIKNIFIDDALSLDGVEAVITYKDLPNIRFATAGHPYSLDKNHRDVEDRLILTNKARYVGDAIAAVIAKDEIIAQKALKLIKVDYDILDAVFNTEDAIAEGAPIIHEDKPNNIIASSKIEIGDINEEFKNADYVFEGEYETSIVQHCQLECQNAYAYVDENNRIVIVTSTQIPHIVRRIVGQALNIPIGRIRVIKPFVGGGFGNKQDVIIEPLTAAMTLAVNGRPVRLSLDREEVFASTRTRHAMKYKIRTAISKDGKLLAISMNNQVNNGAYASHGHSIAMSAASKF
RPLYSFKAIEVKPTTVYTNLPTAGAMRGYGIPQVCFALESHLDDIAIKLNMDPIKFRENNFISLGYEDPLSGIKVRSFGVKECIKKGKDLIRWDEKKKRYKNITGNKRRGVGMACFSYFSGTYPVSLEIAGARIVMNQDGSVQLQVGATEIGQGSDTIFSEMVAEVLGIDIEKVNVISIQDTDITPFDTGAYASRQSFVTGAAVKKAAKEVKNKVLEIASRKCGLNIDELDIREGIIIEKKMGDEICSLSDIALKSYYDREFANPITTDISENVKINAVAYGVTFAEVEVDIETGKIKVLEIYNVHDSGKILNRKIAEGQVEGGVSMGLGYALSEQMLFNEKTGQPLNNNLLDYKLQTILDTPKIGVDFVETIDPAGSFGQKSLGENPTISPAPAIRNAVLDATGVAFNKIPMNPQSVFEKFKEGGLI

Then the first lines of 01_ref_ogs_dna/OG1355.fa:

s0000|JACKWY010000007.1.MBB6715677.1.2612_OG1355
ATGACATTTGAAGTAGTTGGTAAAAATGTAAATAGACTTGATGGAGTTGAAAAGGTAACTGGCAGGGCAAAGTACACAGATGATTTTTTTGAACGAGATATGCTAATAGGAAAAGTTCTAAGAAGCCCTTATGCACATGCTATAGTAAAAAATATTGATACTACTAAGGCATTAGCTTTAGAAGGAGTGGAAGCTGTAATAACTTATAAGGATTTACCCAAAATAAAGTTTGCAACAGCAGGACATCCATGGTCATTAGATCCAAGTCATAGGGATATTGATGATAGATTAATTTTAACTGATAAAGCAAGATTTGTTGGAGATGCTATTGCAGCAGTAGTTGCAAGTGATGAATTAATAGCAGAGAAAGCACTTAAATTAATTGAAGTAGAGTATGAGGTCTTACCACATATATTAAAGGCAGAAGATGCAATAAAAGAAGATGCACCAATAATACATGAAGAAAGACCTAATAATATTTTAAGTACATTTGGATCAGAAACTGGAAAAGTTGAAGATGATATGAAAAATGCTCACAAGATATTTAAAGGAGTTTATGAAACTAGCATAGTTCAACATTGCCATATGGAAAATCATACTGCATATTCTTATGTAGATAGTAATGGAAGAATAGTTATTATATCATCAACACAAATTCCTCATATAGTAAGAAGAATTGTAGGTCAAGCTCTTGGAATGTCCTGGGGGAATATTAGAGTAATAAAGCCTTGTGTTGGCGGAGGTTTTGGAAATAAGCAAGATGTTGTAATAGAACCTCTAGTTGCAGCTATGTCACTGGCAGTTCATGGTAAACCAGTAAGATATGCATTAAGTAGAGAAGAGTGCTTTATAGATACAAGAACTAGGCATGGAATGAAAATAAAATTTAATACCGCAGTTTCTAAAGATGGAAAGTTATTAGGATTAGATATTGAGAATTTAGTTAATAATGGAGCTTATGCATCTCATGGTCATTCAGTAGCTATGAGCGCAGGTGGGAAATTTAGACCATTATATAATTTTAATTCAATAAAATACTCACCTACAACTGTATATACAAACTTACCAGTTGCAGGAGCTATGAGAGGATATGGAGCACCACAAATGTGCTTTGCTTTAGAAAGTCATTTAGATGATATAGCAAGAGAACTTAATATTGATCCAATAGAATTTAGAAAAGCAAACTTAATAAAAGAGGGCTATATAGACCCATTAAGCAAAAATGTAGTAAGGTCATTTGTTCTGCCAGAATGTATAGATAAAGGTAAGGAACTAATAAAATGGGATGAAAAAAAGAGAAAATATAAAAATCAAAAGGGAGATAAAAGAAGAGGAGTTGGAATGGCTTGCTTTAGTTATTTTAGTGGGACACATCCAGTTGCACTTGAGACCGCAGGGGCAAGAATAGTAATGAATCAAGATGGATCTATTCAATTACAAATTGGTGCTACAGAAATAGGACAGGGTAGTGATACAGTATTCGGTCAAATGGCAGCTGAATGTATAGGTTTACCTATAGACATGGTACATGTTGTAGAAACACATGACACAGATATAACACCTTTTGATACTGGATCTTATGCATCAAGACAAACATTTGTAGCAGGAGCAGTTGTGAAAAAAGCAGCAATGGAAGTAAGAGATAAAGTGCTTACATTTGCAAGTAATAATTGTGGTTTAAATAAAGATGAATTAGATATTGTAAACTGTGAAATTATTGAAAAGAGGTTAGGGAGAAGAATTTGTTCTTTAGAAGATATAGCAATGGAATCTTATTATGATAGAATAAAATGTTGTCCTATTACTAGTGATACATCTGCAAATGTTAGAATGAACGCAATAGCATATGGAGTTACATATGCAGAAGTTGAAGTTGATATAAAAACAGGAGAAATAGAGGTTTTAGAAATATATAACGTTCATGATTCTGGTATTATTATGAATCATAAGCTAGCTGAGGGTCAAGTTGATGGTGGAGTAAGTATGGGACTTGGATATGC
ATTATCAGAACAAATGCTTTTTGATGAAAAAACAGGAAGATTATTAAATGACAATCTTTTAGATTATAAGTTGCAAACTATAATGGATACACCTACTATTAATTCAGCATTTATTGAAAAGTATGAACCAGCAGGAAGTTTTGGACAAAAATCTCTTGGAGAAAATACAACAGTATCACCAGCGCCGGCAATAAGAAATGCAGTGCTAGATGCAATAGGAATAGGATTTAATAGAATTCCAATGAATCCACAATCAGTATTTGAAAAAATTAAAGAATCTGGACTCGTGTATGAAGGAGAGAAAGAGAATGTT
s0004|LWAE01000002.1.KZL92195.1.1929_OG1355
ATGACATATAAAGTAATAGGTAATAGTGTTAATAGAGTTGATGCAATTGCCAAAGTTACTGGTAAAGCCAAATATGTAGATGACTTTTTTGAACGAGATATGCTAGTAGGAAAAGTTCTAAGAAGTCCTTATGCTCATGCAATTGTAAAAAATATAGATGTAAGCAGGGCAAAGGCGTTAAATGGTGTGGAGGCCGTAATTACTCACATGGACTTGCCTAAAATTAAGTTTTCAACAGCTGGACATCCTTGGTCGTTGGACCCTGACCATAGGGATATAGAGGATAGACTTATTTTAACAGATAAGGCTCGTTTTGTAGGTGATGGAGTTGCAGCTGTTATTGCCGTAAATGAATTGATTGCAGAGAAAGCATTAAAACTAATCGAAGTTGAATATGAGATTCTTCCCCATGTTATAGATCCAGAGGAGGCAATCAAACCAGGTGCACCTGTGGTACATGAAGAAAGACCTAATAATATCATAAGTTCCTTCGGTGCAGAGTATGGTGATATTGAAGCAGAATTTAAGAATAGTGATTATGTTTTTGAAGGGATATATGAAACTAGTATTGTTCAACATTGTCACATAGAAAACCATACTTCCTATGCCTATATAGATACCGATGGGCGTATTGTAATTATATCTTCAACACAGATACCTCATATTGTTAAAAGAATAGTGGGGCAGGCTCTCGGACTGCCTTGGGGTAGAATAAGAGTTATCAAGCCCTATGTTGGAGGAGGATTTGGAAACAAACAGGATGTTATTATTGAGCCTTTAACCGCAGCTATGACTCTCGCTGTAAAGGGTAGGCCTGTTCGATATAGAATGACAAGGGAGGAGGCTTTTATAGACACACGAACACGGCATGCTATGAAATTTAGCTTAAAAACAGCAGTTTCAAAGGAAGGAAAGCTTACTGGAATATATGTAGGAGATATAGTGAATAACGGAGCATATGCTTCACATGGTCACTCAGTTGCCATGAGTGCTGGAAGTAAGTTCAGACCGCTATATAATTTTAGGTCTATAAAATTTGATCCTAAAACAGTTTATACAAATTTACCGACGGCCGGTGCAATGAGAGGTTATGGGGTGCCGCAGATATGTTTTGCATTAGAAAGTCATTTAGATGATATTGCTCGTGAAATGAATATTGACCCTATAGAATTTAGAAATCAAAATTTAATTTCTGCAGGGCATATGGATCCGCTGACTAAAAATGTAGTTCGTACTTTTGGAATTCCTGAATGTATTGAAAAAGGTAAAGAATTAATAAATTGGGATGAAAAGAAAAAAAGATATAAGAATCAAAGTGGCGAAAGGAGAAGAGGCATTGGCATGGCATGTTTTAGTTATTTGTCAGGTACTCATCCAGTAGCCTTAGAACTTGCAGGAGCTAGAATAATAATGAATCAAGACGGCTCCGTTCAGCTTCAGGTTTCTGCAGCTGAAATCGGCCAGGGCAGTGATACTGTATTAGCTCAAATGGCTGCTGAAGTTCTTGGTTTATCAATGGATATGGTACATGTTATTGCATCACAAGACACAGATGTTTCGCCTTTTGACACAGGTGCATATGCGTCCAGGCAGACCTTTGTTACAGGAGCTGCTGTTAAGAAAGCAGCAGTAGAGGTGAGACAAAAAGTGCTAGAGCTGGCCATGAAAAAGACAGGCCTTTGCGGTGATGAATTGGATATGCAAGATGCTCAAATTATTGAAAAAAGAACAGGAAGAGTTGTCTGTTCATTAGAGGATATTGCTATGGAATCCTATTATGACAGGGTTAATGCCTCACCAATTAGCAGTGATATCACAGAGAATGTAAGAATAAATGCAACAGCATATGGGGTTACCTTTGCTGAGGTAGAAGTGGATATGAAGGTAGGAAAAATAGAAGTACTTGAAATTTATAATGTTCATGACTCGGGAGTAATAATTAACCCCAAACTAGCAGAGGGACAGGTGAATGGAGGAGTAAGCATGGGCTTAGGCTATGC
TCTGTCAGAGCAATTATTATTTGATAAAAAAACAGGCAAACCATTAAATAACAACTTATTGGATTATAAGCTACAAACAATTTTAGATACACCGGAAATTGGAGTGGCTTTTGTTGAAAAGTATGAACCTGCAGGATCCTTTGGGCAGAAGTCTCTAGGTGAGAATCCATCTATTTCTCCCGCACCTGCAATTCGAAATGCTGTTTTAGATGCAACAGGTATAGCATTTAATAAAATTCCTATGAATCCACAGGCTGTCTTTGAAAAGTTCAAGGAAGCAGGGTTACTG
s0040|CYZV01000038.1.CUO66340.1.2896_OG1355
ATGTCATATAAGATACTAGGGAAAAGTGTAAATAGAGTTGATGCAATTGCCAAAGTAACAGGAAAAGCAAAATATGCTGAGGATTATTTTGAAAGAGAAATGTTAGTAGGAAAGGTGCTTAGAAGTACTTATGCACATGCTAAGATAAAAAATATATTTATAGATGATGCTTTAAGTTTAGATGGAGTTGAAGCTGTAATAACATATAAGGATTTACCTAACATAAGATTTGCAACTGCAGGTCATCCTTATTCTTTAGATAAAAATCATAGAGATGTTGAAGATAGATTGATATTAACTAATAAAGCAAGATATGTAGGGGATGCAATTGCAGCTGTAATTGCAAAAGATGAGATAATTGCACAAAAAGCATTAAAGCTTATAAAGGTTGATTATGATATTTTAGATGCTGTATTTAATACAGAAGATGCAATAGCTGAAGGAGCCCCTATTATTCATGAAGATAAACCAAATAATATTATAGCCTCTTCTAAGATAGAAATTGGAGATATAAATGAAGAATTTAAAAATGCTGACTACGTATTTGAAGGTGAATATGAAACTAGTATTGTTCAACATTGCCAATTAGAATGTCAAAATGCATATGCATATGTTGATGAGAATAATAGGATTGTAATAGTAACATCAACTCAAATACCTCATATAGTAAGAAGAATTGTAGGACAAGCATTAAATATACCAATAGGAAGAATTAGGGTAATAAAACCTTTCGTTGGTGGAGGATTTGGTAATAAGCAAGATGTTATAATTGAACCACTAACAGCAGCAATGACTTTAGCTGTTAATGGTAGACCTGTAAGACTTTCATTAGATAGGGAAGAAGTTTTTGCATCTACAAGAACAAGACATGCCATGAAGTATAAAATAAGAACTGCTATATCTAAGGATGGAAAATTATTAGCCATTTCTATGAATAATCAGGTAAATAATGGGGCATATGCATCTCATGGGCATTCTATTGCAATGAGTGCAGCAAGTAAATTTAGACCATTATACTCTTTTAAAGCTATTGAAGTGAAACCTACTACAGTATATACCAATTTACCTACAGCAGGAGCAATGAGGGGATATGGTATTCCACAAGTTTGCTTTGCTTTAGAAAGCCATCTAGATGATATTGCAATTAAATTAAATATGGATCCAATAAAGTTTAGAGAAAATAATTTTATAAGTTTAGGATATGAGGATCCATTAAGTGGAATAAAGGTGAGATCATTTGGAGTTAAAGAGTGTATAAAAAAAGGGAAAGACTTAATTAGATGGGATGAAAAAAAGAAAAGGTATAAAAATATTACTGGAAACAAAAGAAGAGGGGTAGGAATGGCATGCTTTAGCTATTTTTCAGGAACGTATCCTGTATCATTAGAAATAGCAGGAGCAAGAATTGTAATGAATCAGGATGGTTCAGTTCAACTGCAAGTTGGTGCAACTGAAATTGGACAAGGAAGTGATACAATTTTTAGTGAGATGGTAGCTGAAGTACTTGGAATAGATATTGAAAAGGTTAATGTTATTTCTATTCAAGATACGGATATTACTCCATTTGATACAGGAGCATATGCATCAAGACAAAGTTTTGTTACAGGGGCAGCTGTAAAAAAGGCTGCAAAAGAAGTTAAAAATAAAGTGCTTGAAATTGCAAGTAGGAAATGTGGATTAAATATAGATGAATTAGACATAAGAGAGGGAATAATAATAGAAAAAAAAATGGGTGATGAAATTTGTTCATTAAGCGATATAGCATTAAAGTCATATTATGATAGAGAATTTGCAAATCCAATTACTACTGATATATCTGAAAATGTAAAAATTAATGCCGTAGCATATGGTGTTACCTTTGCAGAAGTTGAAGTTGATATTGAAACTGGAAAGATAAAGGTACTAGAAATTTATAATGTTCATGATTCAGGAAAAATATTAAATAGGAAAATAGCAGAGGGACAAGTTGAAGGTGGAGTAAGTATGGGGCTTGGTTATGC
TTTAAGTGAACAAATGCTATTTAATGAAAAAACAGGACAACCATTAAATAATAATTTATTGGATTATAAACTTCAAACAATTTTAGATACACCTAAAATAGGAGTTGATTTTGTTGAAACTATTGATCCTGCAGGAAGTTTTGGTCAAAAGTCATTAGGAGAAAATCCTACAATTTCACCAGCTCCAGCTATAAGAAATGCAGTTTTAGATGCAACTGGAGTAGCTTTTAATAAAATACCTATGAATCCTCAAAGTGTATTTGAAAAATTTAAAGAAGGTGGATTAATA

The log file:
mplog.log

Thank you so much!!

@sinamajidian
Contributor

Thanks for sharing the log file.

I guess the results of two runs are in the log file (based on the dates). It seems that in the first run you didn't use --dna_reference, so read2tree tried to download the cDNA from OMA. It may be that in your second run read2tree reused some files from the previous run, which might have resulted in an MSA with a different size.

I would suggest using a new name for the output folder with --output_path whenever a run fails. Alternatively, you could start from a new folder containing the marker_genes folder and all_cdna_out.fa, and run this:

read2tree --standalone_path marker_genes --output_path output --reference --dna_reference all_cdna_out.fa

To make the process faster, you may want to use only the 50-200 largest OGs from the marker_genes folder. No change is needed for all_cdna_out.fa.
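
One possible way to pick such a subset is sketched below in Python. It assumes that "largest" means the OG files with the most sequences and that the marker files end in .fa; both are assumptions, and the function names are hypothetical. The selected files are copied to a new folder so the original marker_genes folder stays untouched.

```python
import shutil
from pathlib import Path


def count_records(path):
    """Count FASTA records (lines starting with '>') in one file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def largest_ogs(marker_dir, n):
    """Return the n .fa files in marker_dir with the most records, largest first."""
    files = sorted(Path(marker_dir).glob("*.fa"))
    return sorted(files, key=count_records, reverse=True)[:n]


def copy_largest_ogs(marker_dir, out_dir, n=200):
    """Copy the n largest OG files into out_dir and return their file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    selected = largest_ogs(marker_dir, n)
    for path in selected:
        shutil.copy(path, out / path.name)
    return [path.name for path in selected]
```

For example, copy_largest_ogs("marker_genes", "marker_genes_top200", n=200) would create a reduced folder (the output folder name here is only a placeholder) that could then be passed to --standalone_path.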

Hope it helps. Please keep us updated.

@mudymudy
Author

mudymudy commented Jun 1, 2023

Thank you for your reply. I ran OMA again to create the marker genes and everything went fine. Then when I run:

read2tree --standalone_path marker_genes --output_path output --reference --dna_reference all_cdna_out.fa

I get the same error. This is the mplog.log of that specific run.

mplog.log

Thank you for your time!

@mudymudy
Author

mudymudy commented Jun 7, 2023

Hi @sinamajidian

I couldn't make read2tree work with my own generated marker genes, but I followed your other suggestion: I added extra fastq files and ran it in multi-species mode, and that did the job.

But I have another question. When I export marker genes from the OMA browser and select, say, all of the E. coli organisms plus a random one, e.g. Moranella endobia, would that make Moranella the outgroup? Or do I have to specify the outgroup somehow in the read2tree step?

Thank you again for your help!

@sinamajidian
Contributor

Sorry I haven't gotten back to you yet; we are discussing the case internally and will update you ASAP.

About your question: read2tree doesn't ask for outgroup information. However, the inferred tree (built by IQ-TREE as part of read2tree) is unrooted, so the rooting should be done afterwards using the outgroup species. You can do it with the ete3 package or visually with phylo.io.

@alpae
Member

alpae commented Jun 8, 2023

Hi @mudymudy

I had a look at the mplog and your sequences. It seems to me that there is an inconsistency: your OG1355 contains only 3 sequences, but the log says 8 sequences.

From the formatting of the sequences above, it could be that you also put a '>' on the lines with the actual sequence. Could that be true? It seems rather unlikely, as the problem occurs only on a few OGs...
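
A quick way to test this hypothesis on a file is sketched below in Python. It counts header lines and flags headers that look like sequence data, using a heuristic that is purely an assumption: a header whose text after '>' consists only of uppercase letters is treated as suspicious. Real record IDs such as the ones above contain digits and '|', so they are not flagged, but a short all-letter ID would be. The function name is hypothetical.

```python
def inspect_fasta(lines):
    """Return (number of header lines, 1-based line numbers of suspicious headers).

    A header is suspicious when everything after '>' is uppercase letters only,
    which suggests a sequence line that was accidentally prefixed with '>'.
    """
    headers = 0
    suspicious = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line.startswith(">"):
            continue
        headers += 1
        body = line[1:]
        if body.isalpha() and body.isupper():
            suspicious.append(number)
    return headers, suspicious
```

If the header count for OG1355.fa is 8 rather than 3 and the extra headers are all flagged, that would support the explanation above.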

Cheers Adrian

@sinamajidian
Contributor

Dear @mudymudy

Regarding creating gene markers using OMA, we are keen to investigate the issue further. I'm wondering whether it is possible to share with us the dataset that you are working on. Specifically, the nucleotide and amino acid sequences of Clostridium (I guess) that you used with OMA standalone. Then, we can run OMA and read2tree to find the issue. (You could also email us at sina.majidian gmail.com)

Thank you in advance!
