Input file problem #124

Open
ShevchenkoAlla opened this issue Nov 18, 2024 · 4 comments
Labels
question Further information is requested

Comments

@ShevchenkoAlla

Hello,
I have an output file from the analysis of a 16S dataset: the picrust2 plugin was run in qiime2, and I then converted the biom file to tsv format. Now I am trying to visualise the results and have run into some issues. I have attached a picture of the tsv table; I think something is wrong with it.

First I tried:

"""""
library(readr)
library(ggpicrust2)
library(tibble)
library(tidyverse)
library(ggprism)
library(patchwork)

# Load necessary data: abundance data and metadata

abundance_file <- "/home/output/path_exported/ko_feature_table.biom.tsv"
metadata <- read_delim(
"/home/sample-metadata.tsv",
delim = "\t",
escape_double = FALSE,
trim_ws = TRUE
)

# Run ggpicrust2 with input file path

results_file_input <- ggpicrust2(file = abundance_file,
metadata = metadata,
group = "Sample type", # For example dataset, group = "Environment"
reference = "Controls",
pathway = "KO",
daa_method = "LinDA",
ko_to_kegg = TRUE,
order = "pathway_class",
p_values_bar = TRUE,
x_lab = "pathway_name")

metadata$`Sample type` <- as.factor(metadata$`Sample type`)
levels(metadata$`Sample type`)
"""
I got the following errors:
"""
Starting the ggpicrust2 analysis...

Converting KO to KEGG...

Loading data from file...
Rows: 10556 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # Constructed from biom file

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Performing pathway differential abundance analysis...

Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: #SampleID . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running LinDA analysis...
Error in relevel.factor(LinDA_metadata_df$Group_group_nonsense_, ref = reference) :
'ref' must be an existing level
In addition: Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)

metadata$`Sample type` <- as.factor(metadata$`Sample type`)
levels(metadata$`Sample type`)
[1] "categorical" "Controls" "Permafrost"

# Run ggpicrust2 with input file path

results_file_input <- ggpicrust2(file = abundance_file,
                                 metadata = metadata,
                                 group = "Sample type",
                                 reference = "Controls",
                                 pathway = "KO",
                                 daa_method = "LinDA",
                                 ko_to_kegg = TRUE,
                                 order = "pathway_class",
                                 p_values_bar = TRUE,
                                 x_lab = "pathway_name")

Starting the ggpicrust2 analysis...

Converting KO to KEGG...

Loading data from file...
Rows: 10556 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # Constructed from biom file

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Performing pathway differential abundance analysis...

Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: #SampleID . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running LinDA analysis...
Error in relevel.factor(LinDA_metadata_df$Group_group_nonsense_, ref = reference) :
'ref' must be an existing level
In addition: Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)

"""""""
Screenshot from 2024-11-17 02-12-16

"""""""
I am not sure what this error means.
When I checked with problems(), nothing was reported, yet the input file does not seem to be read properly, even though it exists and the paths are written correctly.

""""
problems(metadata)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>

problems(results_file_input)
Error: object 'results_file_input' not found
problems(abundance_file)
""""""
So I tried a different way:

"""""""""""
metadata <- read_delim("/home/sample-metadata.tsv", delim = "\t", escape_double = FALSE, trim_ws = TRUE)
Rows: 24 Columns: 6
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): #SampleID, Composition, Freezing time, Sample type
dbl (2): count_reads, Layers

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

"""""""'''
Then:
"""""""""

kegg_abundance <- ko2kegg_abundance("/home/output/path_exported/ko_feature_table.biom.tsv")
Loading data from file...
Rows: 10556 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # Constructed from biom file

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)

"""""""""

Next I tried:

""""""
ko_abundance_file <- "/home/output/path_exported/ko_feature_table.biom.tsv"

kegg_abundance <- ko2kegg_abundance(ko_abundance_file)

Loading data from file...
Rows: 10556 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # Constructed from biom file

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
""""""""""""'""
Here I end up with an empty "kegg_abundance" variable in RStudio.

I think something is wrong with the input tsv table, but I can't work out what it is or how to fix it.

I would appreciate any help.
Thank you for your time
Best,
Alla

@cafferychen777
Owner

Hi Alla,

After reviewing your error messages, I think the issue stems from your TSV file format. The log shows the file being read as a single column named "# Constructed from biom file", which means the first line of the file is a comment line and is causing the parsing issues.

Could you try:

  1. Opening the ko_feature_table.biom.tsv file
  2. Removing the first line that starts with "#"
  3. Saving the file and running your analysis again

This should resolve the parsing error and allow ggpicrust2 to properly read your feature table.
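
If you would rather not edit the file by hand, the same fix can be sketched in base R. This is only an illustration: the file contents below are a made-up stand-in for ko_feature_table.biom.tsv, and the temporary paths are placeholders for your own.

```r
# Made-up stand-in for ko_feature_table.biom.tsv (assumed layout:
# one comment line, then a "#OTU ID" header row, then KO rows).
raw_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "# Constructed from biom file",
  "#OTU ID\t1A\t1B",
  "K00001\t10\t0",
  "K00002\t3.5\t7"
), raw_path)

lines <- readLines(raw_path)

# Drop only the leading comment line; the "#OTU ID" header row must stay.
if (length(lines) > 0 && startsWith(lines[1], "# Constructed from biom file")) {
  lines <- lines[-1]
}

clean_path <- tempfile(fileext = ".tsv")
writeLines(lines, clean_path)

# Sanity check: the cleaned file should now parse into one ID column
# plus one column per sample, instead of a single column.
ko_table <- read.delim(clean_path, check.names = FALSE, comment.char = "")
print(dim(ko_table))
print(names(ko_table))
```

The path of the cleaned file is then what you would pass as the abundance file.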

Let me know if you need any further assistance.

Best regards,
Chen

@ShevchenkoAlla
Author

ShevchenkoAlla commented Nov 18, 2024

Hi,
Thank you so much.
It worked :)

But I still can't complete the analysis; I got other errors after that:

""""'''

# Run ggpicrust2 with input file path

results_file_input <- ggpicrust2(file = abundance_file,
                                 metadata = metadata,
                                 group = "Sample type", # For example dataset, group = "Environment"
                                 reference = "Controls",
                                 pathway = "KO",
                                 daa_method = "LinDA",
                                 ko_to_kegg = TRUE,
                                 order = "pathway_class",
                                 p_values_bar = TRUE,
                                 x_lab = "pathway_name")

Starting the ggpicrust2 analysis...

Converting KO to KEGG...

Loading data from file...
Rows: 10543 Columns: 25
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): #OTU ID
dbl (24): 1A, 1B, 1C, 2A, 2B, 2C, 3AA, 3AB, 3AC, 3BA, 3BB, 3BC, 4A, 4B, 4C, 5A, 5B, 5C, 5D, 5E, ...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|==========================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 31.63 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Performing pathway differential abundance analysis...

Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: #SampleID . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running LinDA analysis...
Performing LinDA analysis...
0 features are filtered!
The filtered data has 24 samples and 299 features will be tested!
Imputation approach is used.
Fit linear models ...
Completed.
Processing LinDA results...
LinDA analysis is complete.
Success: Found 222 statistically significant biomarker(s) in the dataset.
Annotating pathways...

Starting pathway annotation...
DAA results data frame is not null. Proceeding...
KO to KEGG is set to TRUE. Proceeding with KEGG pathway annotations...
We are connecting to the KEGG database to get the latest results, please wait patiently.

The number of statistically significant pathways exceeds the database's query limit. Please consider breaking down the analysis into smaller queries or selecting a subset of pathways for further investigation.

Returning DAA results filtered annotation data frame...
Creating pathway error bar plots...

The following pathways are missing annotations and have been excluded: ko05340, ko00564, ko00562, ko00563, ko03030, ko00561, ko00440, ko00250, ko04062, ko00740, ko00195, ko04650, ko03450, ko00920, ko00311, ko00310, ko04146, ko00600, ko04140, ko04142, ko00604, ko04260, ko05142, ko04540, ko04710, ko04712, ko00909, ko00513, ko05110, ko04974, ko04976, ko00450, ko01051, ko00565, ko00904, ko00524, ko00300, ko00905, ko00402, ko03440, ko00750, ko00950, ko05140, ko00592, ko00591, ko00590, ko00062, ko04662, ko03070, ko00253, ko03060, ko04370, ko04730, ko04740, ko00380, ko00500, ko05120, ko04666, ko04966, ko05322, ko04964, ko05320, ko04962, ko04960, ko04660, ko00625, ko00624, ko00627, ko00626, ko00623, ko00622, ko00270, ko04380, ko00941, ko00943, ko00100, ko00945, ko01057, ko01056, ko05016, ko01058, ko04145, ko00071, ko00072, ko04360, ko05219, ko05218, ko05216, ko05215, ko05213, ko05211, ko01055, ko00902, ko05330, ko00534, ko04910, ko00531, ko04916, ko00533, ko00532, ko00360, ko00633, ko00363, ko00364, ko05130, ko00121, ko04914, ko00130, ko03050, ko00361, ko00040, ko00730, ko00362, ko01040, ko00603, ko03018, ko04270, ko00281, ko00280, ko03013, ko04626, ko05200, ko00601, ko03015, ko00312, ko05143, ko00523, ko00520, ko00521, ko05146, ko00052, ko00051, ko00400, ko04020, ko00350, ko00480, ko00643, ko00640, ko00720, ko00120, ko00965, ko04614, ko04340, ko00980, ko00410, ko00983, ko05150, ko00791, ko05131, ko04711, ko00020, ko00710, ko00196, ko02060, ko00340, ko00785, ko00550, ko00650, ko03320, ko04744, ko04745, ko00522, ko04612, ko04621, ko04620, ko04623, ko04622, ko04971, ko00460, ko04970, ko00830, ko00780, ko00511, ko00970, ko00030, ko00232, ko00230, ko04120, ko04350, ko00540, ko03022, ko03020, ko00982, ko04630, ko03010, ko05100, ko00331, ko05310, ko00908, ko04930, ko04320, ko03430, ko00906, ko00901, ko04520, ko00903, ko00471, ko00472, ko00473, ko04510, ko00942, ko04810, ko04210, ko00240, ko04012, ko04011, ko00944, ko04113, ko04640, ko04310, ko03420, ko04912, ko00670, ko04672, 
ko04920, ko05160, ko04144, ko00930, ko04112, ko04720, ko04722, ko04075
You can use the 'pathway_annotation' function to add annotations for these pathways.
The 'method' column in the 'daa_results_df' data frame contains more than one method. Please filter it to contain only one method.
The 'group1' or 'group2' column in the 'daa_results_df' data frame contains more than one group. Please filter each to contain only one group.
Error in pathway_errorbar(abundance = abundance, daa_results_df = daa_sub_method_results_df, :
Visualization with 'pathway_errorbar' cannot be performed because there are no features with statistical significance. For possible solutions, please check the FAQ section of the tutorial.

"""""""""""
Something is confusing here.
First it reports significant differences, then it says there are none.
I also don't have a 'daa_results_df' data frame,
and I'm not sure what it is.
Can you explain or suggest something?

I checked the metadata file and it looks OK to me:

metadata$`Sample type` <- as.factor(metadata$`Sample type`)

levels(metadata$`Sample type`)
[1] "Controls" "Permafrost"

Thank you,
Best,
Alla

@cafferychen777
Owner

Hi Alla,

I see you've resolved the first issue but are now encountering errors with pathway annotations and visualizations. From the error messages, it seems the all-in-one ggpicrust2() function is having trouble handling your dataset.

I suggest using our step-by-step pipeline instead, which gives you more control over each stage of the analysis. You can follow these steps:

  1. First, convert KO to KEGG:
kegg_abundance <- ko2kegg_abundance(file = abundance_file)
  2. Then perform differential abundance analysis:
daa_results_df <- pathway_daa(abundance = kegg_abundance, 
                             metadata = metadata, 
                             group = "Sample type", 
                             daa_method = "LinDA",
                             select.taxa = NULL,
                             reference = "Controls")
  3. Add pathway annotations:
pathway_annotation_df <- pathway_annotation(pathway = "KO",
                                          daa_results_df = daa_results_df, 
                                          ko_to_kegg = TRUE)
  4. Finally, create the visualization:
pathway_errorbar_plot <- pathway_errorbar(abundance = kegg_abundance,
                                        daa_results_df = daa_results_df,
                                        pathway_annotation_df = pathway_annotation_df,
                                        group = "Sample type",
                                        p_values_threshold = 0.05,
                                        order = "pathway_class",
                                        select_pathway = NULL,
                                        p_value_bar = TRUE,
                                        x_lab = "pathway_name")

This approach will help you identify where exactly the analysis might be failing and give you more flexibility in adjusting parameters at each step.
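
Before the DAA step, it can also help to confirm that every abundance column has a metadata row and that the reference level really exists in the group column. A minimal sketch in base R on made-up data; `check_inputs` is a hypothetical helper written for this example, not a ggpicrust2 function:

```r
# Hypothetical helper (not part of ggpicrust2): stops with a readable
# message if the sample IDs or the reference level do not line up.
check_inputs <- function(abundance, metadata, id_col, group, reference) {
  ids <- as.character(metadata[[id_col]])
  missing_in_meta <- setdiff(colnames(abundance), ids)
  if (length(missing_in_meta) > 0) {
    stop("Samples without metadata: ", paste(missing_in_meta, collapse = ", "))
  }
  if (!reference %in% unique(as.character(metadata[[group]]))) {
    stop("Reference level '", reference, "' not found in column '", group, "'")
  }
  invisible(TRUE)
}

# Made-up inputs shaped like the ones in this thread.
abundance <- matrix(
  1:4, nrow = 2,
  dimnames = list(c("ko00010", "ko00020"), c("1A", "1B"))
)
metadata <- data.frame(
  "#SampleID" = c("1A", "1B"),
  "Sample type" = c("Controls", "Permafrost"),
  check.names = FALSE
)

ok <- check_inputs(abundance, metadata, "#SampleID", "Sample type", "Controls")

# A reference level that is not in the group column is reported as an error.
bad <- try(
  check_inputs(abundance, metadata, "#SampleID", "Sample type", "Missing"),
  silent = TRUE
)
```

A check like this would surface a "'ref' must be an existing level" problem before LinDA is ever called.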

Let me know if you need any clarification or run into other issues.

Best regards,
Chen

@cafferychen777 cafferychen777 added the question Further information is requested label Nov 18, 2024
@ShevchenkoAlla
Author

Hello Chen,
Thank you so much for your response!

I tried to follow your instructions and ran into the following problems:

"""""""

daa_results_df <- pathway_daa(abundance = kegg_abundance,
                              metadata = metadata,
                              group = "Sample type",
                              daa_method = "LinDA",
                              select.taxa = NULL,
                              reference = "Controls")

Error in pathway_daa(abundance = kegg_abundance, metadata = metadata, :
unused argument (select.taxa = NULL)
""""""""""'
So I removed ".taxa" and left "select = NULL":

""""""""'
daa_results_df <- pathway_daa(abundance = kegg_abundance,
                              metadata = metadata,
                              group = "Sample type",
                              daa_method = "LinDA",
                              select = NULL,
                              reference = "Controls")

Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: #SampleID . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running LinDA analysis...
Performing LinDA analysis...
Registered S3 method overwritten by 'rmutil':
method from
print.response httr
0 features are filtered!
The filtered data has 24 samples and 299 features will be tested!
Imputation approach is used.
Fit linear models ...
Completed.
Processing LinDA results...
LinDA analysis is complete.

pathway_annotation_df <- pathway_annotation(pathway = "KO",
                                            daa_results_df = daa_results_df,
                                            ko_to_kegg = TRUE)

Starting pathway annotation...
DAA results data frame is not null. Proceeding...
KO to KEGG is set to TRUE. Proceeding with KEGG pathway annotations...
We are connecting to the KEGG database to get the latest results, please wait patiently.

The number of statistically significant pathways exceeds the database's query limit. Please consider breaking down the analysis into smaller queries or selecting a subset of pathways for further investigation.

Returning DAA results filtered annotation data frame...
""'''''''''
Next I tried again:

""""
pathway_annotation_df <- pathway_annotation(pathway = "KO",
                                            daa_results_df = daa_results_df,
                                            ko_to_kegg = TRUE)

Starting pathway annotation...
DAA results data frame is not null. Proceeding...
KO to KEGG is set to TRUE. Proceeding with KEGG pathway annotations...
We are connecting to the KEGG database to get the latest results, please wait patiently.

The number of statistically significant pathways exceeds the database's query limit. Please consider breaking down the analysis into smaller queries or selecting a subset of pathways for further investigation.

Returning DAA results filtered annotation data frame...
""""""""""'

If I understand correctly, I have too many pathways in the data frame and need to reduce them.
Is there an easy way to reduce it to, e.g., the top 100 pathways, or something like that?
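
One possible way to do this kind of reduction, sketched in base R on a made-up table: it assumes the DAA results have a `feature` column and a `p_adjust` column, which should be checked against the real `daa_results_df` with `names()` first. The values below are invented for illustration only.

```r
# Made-up DAA results; the column names "feature" and "p_adjust"
# are an assumption about what the real data frame contains.
daa_results_df <- data.frame(
  feature  = c("ko00010", "ko00020", "ko00030", "ko00040", "ko00051"),
  p_adjust = c(0.20, 0.001, 0.03, 0.0004, 0.04),
  stringsAsFactors = FALSE
)

top_n <- 3  # would be 100 for the case asked about here

# Keep significant rows, sort by adjusted p-value, take the first top_n.
sig <- daa_results_df[daa_results_df$p_adjust < 0.05, ]
sig <- sig[order(sig$p_adjust), ]
top_df <- head(sig, top_n)

print(top_df)
```

The reduced table could then be passed on to the annotation step in place of the full results.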

Also, in pathway_annotation_df I have N/A in the pathway_name, description and other columns, so I can't even show those names on a heatmap, for example.

Maybe there is a way to just get the pathway names and groups and visualise them against the sample names, without any statistics, just taking some number of the top pathways?

Thank you for your time.
I appreciate your help.
Best,
Alla
