diff --git a/.nojekyll b/.nojekyll index f87bc4a3..8e93a49d 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -ac05d81b \ No newline at end of file +c5714985 \ No newline at end of file diff --git a/accessing-ancient-metagenomic-data.html b/accessing-ancient-metagenomic-data.html index 84860b85..682dfbf2 100644 --- a/accessing-ancient-metagenomic-data.html +++ b/accessing-ancient-metagenomic-data.html @@ -2,13 +2,13 @@ - + -17  Accessing Ancient Metagenomic Data – Introduction to Ancient Metagenomics +16  Accessing Ancient Metagenomic Data – Introduction to Ancient Metagenomics @@ -1798,11 +1792,11 @@

-

13.5.2 Deamination patterns

+
+

12.5.2 Deamination patterns

metaDMG can perform a numerical optimisation of the deamination frequencies (C→T, G→A) using the binomial or beta-binomial likelihood models, where the latter can deal with a large amount of variance (overdispersion). This function is called the metaDMG-cpp dfit or damage estimates function.

metaDMG-cpp dfit 
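To make the difference between the two likelihood models concrete, here is a minimal Python sketch of a binomial and a beta-binomial log-likelihood for a single position. The mean/concentration parameterisation (`mu`, `phi`), the function names, and the example counts are assumptions chosen for illustration; this is not metaDMG's own implementation.

```python
import math


def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binom_logpmf(k, n, p):
    """Log-probability of k substitutions in n observations, binomial model."""
    return log_choose(n, k) + k * math.log(p) + (n - k) * math.log(1 - p)


def betabinom_logpmf(k, n, mu, phi):
    """Beta-binomial log-probability with mean mu and concentration phi.

    Smaller phi means more variance (overdispersion) around mu.
    """
    a, b = mu * phi, (1 - mu) * phi
    return log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)


# Hypothetical counts: 30 C→T substitutions at a position covered 100 times,
# evaluated against an expected damage frequency of 0.1.
k, n, mu = 30, 100, 0.1
print(binom_logpmf(k, n, mu))
print(betabinom_logpmf(k, n, mu, phi=10))
```

With these toy numbers the beta-binomial log-probability is less negative than the binomial one, which is the sense in which it tolerates counts that scatter widely around the expected frequency.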
-

metaDMG-cpp dfit allows us to estimate the four fit parameters of the damage model (Figure 13.5):

+

metaDMG-cpp dfit allows us to estimate the four fit parameters of the damage model (Figure 12.5):

-
-

13.5.3 Ancient metagenomic dataset

-

In this section, we will use 6 metagenomic libraries downsampled with eukaryotes reads from the study by (Zampirolo et al. 2023) (Figure 13.9). The libraries originate from sediment samples of the Velký Mamut’ák rock shelter located in Northern Bohemia (Czech Republic) and covering the period between the Late Neolithic (~6100-5300 cal. BP) to more recent times (800 cal BP).

+
+

12.5.3 Ancient metagenomic dataset

+

In this section, we will use 6 metagenomic libraries downsampled with eukaryotes reads from the study by (Zampirolo et al. 2023) (Figure 12.9). The libraries originate from sediment samples of the Velký Mamut’ák rock shelter located in Northern Bohemia (Czech Republic) and covering the period between the Late Neolithic (~6100-5300 cal. BP) to more recent times (800 cal BP).

-Figure 13.9: Screenshot of preprint of the source dataset by (Zampirolo et al. 2023) +Figure 12.9: Screenshot of preprint of the source dataset by (Zampirolo et al. 2023)
-
-

13.5.4 Ancient metagenomics with metaDMG-cpp: the workflow

+
+

12.5.4 Ancient metagenomics with metaDMG-cpp: the workflow

This section covers the metaDMG analysis, which involves taxonomic classification of the reads starting from sorted SAM files, damage estimation, and compilation of the final metaDMG output.

To begin, the raw SAM files that we will use as input to metaDMG for this exercise are stored in metadmg.

We also need the taxonomy files, which are in the folder metadmg/small_taxonomy/, these include names.dmp, nodes.dmp and small_accession2taxid.txt.gz.

@@ -1986,8 +1980,8 @@

done done

-
-

13.5.5 Investigating the final output with R

+
+

12.5.5 Investigating the final output with R

We first inspect our metaDMG output by manually navigating to the folder metadmg/ and clicking on “Open folder”. We can then double-click on the tsv file concatenated_metaDMGfinal.tsv to view it.

We will now investigate the tsv table produced by metaDMG to authenticate damage patterns, visualise the relationship between the damage and the significance, and the degree of damage through depth and time.

R packages for this exercise are located in our original conda environment authentication.

@@ -2009,10 +2003,10 @@

library(purrr) library(ggpubr) -
-

13.5.5.1 Deamination patterns

+
+

12.5.5.1 Deamination patterns

We run the damage plot to visualise the deamination patterns along the forward and reverse strands, and we save the results for each taxon detected in the samples.

-

We will use the function get_dmg_decay_fit to visualise damage pattern (Figure 13.10). The function is saved in metadmg/script/, so we only need to run the following command to recall it:

+

We will use the function get_dmg_decay_fit to visualise damage pattern (Figure 12.10). The function is saved in metadmg/script/, so we only need to run the following command to recall it:

source("metadmg/script/get_dmg_decay_fit.R")
@@ -2121,7 +2115,7 @@

} } -

We load our metaDMG output data (tsv file) and generate the damage plots as seen in Figure 13.10 using the function get-damage.

+

We load our metaDMG output data (tsv file) and generate the damage plots as seen in Figure 12.10 using the function get-damage.

df <- read.csv("metadmg/concatenated_metaDMGfinal.tsv",  sep = "\t")
 
@@ -2181,16 +2175,16 @@ 

-Figure 13.10: Deamination patterns for sheep (Ovis) and beech (Fagus) reads. +Figure 12.10: Deamination patterns for sheep (Ovis) and beech (Fagus) reads.

-
-

13.5.6 Amplitude of damage vs Significance

+
+

12.5.6 Amplitude of damage vs Significance

We provide an R script to investigate the main statistics.

-

Here we visualise the amplitude of damage (A) and its significance (Zfit), for the full dataset but filtering it to a minimum of 100 reads and at the genus level (Figure 13.11).

+

Here we visualise the amplitude of damage (A) and its significance (Zfit), for the full dataset but filtering it to a minimum of 100 reads and at the genus level (Figure 12.11).

#Subset dataset at genus level
 dt2 <- df %>% filter(grepl("\\bgenus\\b", rank))
@@ -2232,14 +2226,14 @@ 

-Figure 13.11: Amplitude of damage (A) vs significance (Zfit) for animals and plants. +Figure 12.11: Amplitude of damage (A) vs significance (Zfit) for animals and plants.
-
-

13.5.7 Amplitude of damage and mean fragment length through time

-

Here we visualise the amplitude of damage (A) and the mean length of the fragments (mean_rlen) by depth and by date (BP) for the full dataset but filtering it to a minimum of 100 reads and at the genus level (Figure 13.12).

+
+

12.5.7 Amplitude of damage and mean fragment length through time

+

Here we visualise the amplitude of damage (A) and the mean length of the fragments (mean_rlen) by depth and by date (BP) for the full dataset but filtering it to a minimum of 100 reads and at the genus level (Figure 12.12).

#Import the metadata 
 depth_data <- read.csv ("metadmg/figures/depth_data.csv", sep = ",")
@@ -2288,7 +2282,7 @@ 

-Figure 13.12: Amplitude of damage (A) and mean fragment length (mean_rlen) through time. +Figure 12.12: Amplitude of damage (A) and mean fragment length (mean_rlen) through time.
@@ -2322,8 +2316,8 @@

-

13.6 (Optional) clean-up

+
+

12.6 (Optional) clean-up

Let’s clean up our working directory by removing all the data and output from this chapter.

The command below will remove the /<PATH>/<TO>/authentication as well as all of its contents.

@@ -2346,8 +2340,8 @@

Then delete the conda environment.

conda remove --name authentication --all -y

-
-

13.7 Summary

+
+

12.7 Summary

In addition, we:

  • Processed bam files with metaDMG to generate taxonomic profiles and damage estimates from a metagenomic dataset
  • @@ -2360,20 +2354,20 @@

    -

    13.8 Acknowledgments

    +
    +

    12.8 Acknowledgments

    We thank Mikkel Winther Pedersen and Antonio Fernandez Guerra for their contribution to the development of the metaDMG section.

    -

@@ -212,25 +206,25 @@ @@ -247,25 +241,25 @@ @@ -282,13 +276,13 @@ @@ -305,13 +299,13 @@ @@ -328,7 +322,7 @@ @@ -345,31 +339,31 @@ @@ -404,7 +398,7 @@

Authors

-

The creation of this text book was developed through a series of summer schools run by the SPAAM community, and financially supported by the Werner Siemens-Stiftung. The have contributed to the development of this textbook.

+

The creation of this text book was developed through a series of summer schools run by the SPAAM community, and financially supported by the Werner Siemens-Stiftung. The have contributed to the development of this textbook. Contributors to the textbook are listed below.

@@ -413,89 +407,104 @@

Authors

- + - + - + + + + + + - - + + - - + + - - - - - - + - + - - - + + + - + - - - - - - + - - - - - - + - + - + - + + + + + + + + + + + + + + + + - + + + + + + + + + + +
2022-20232022-2024 🇬🇧 James Fellows Yates is an archaeology-trained biomolecular archaeologist and convert to palaeogenomics, and is recently pivoting to bioinformatics. He specialises in ancient metagenomics analysis, generating tools and high-throughput approaches and high-quality pipelines for validating and analysing ancient (oral) microbiomes and palaeogenomic data.
2022-20232022-2024 🇺🇸 Christina Warinner is Group Leader of Microbiome Sciences at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany, and Associate Professor of Anthropology at Harvard University. She serves on the Leadership Team of the Max Planck-Harvard Research Center for the Archaeoscience of the Ancient Mediterranean (MHAAM), and is a Professor in the Faculty of Biological Sciences at Friedrich Schiller University in Jena, Germany. Her research focuses on the use of metagenomics and paleoproteomics to better understand past human diet, health, and the evolution of the human microbiome.
2022-20232024🇵🇱🇩🇰 Aleksandra Laura Pach is a bioinformatician currently working on a PhD at the Schroeder and Racimo groups at section for Ecology and Evolution, Globe institute, University of Copenhagen. Her research centers around ancient metagenomics specifically focusing on inference of eukaryotes and methodologies.
2022-2024 🇪🇸 Aida Andrades Valtueña is a geneticist interested in pathogen evolution, with particular interest in prehistoric pathogens. She has been exploring new methods to analyse ancient pathogen data to understand their past function and ecology to inform models of pathogen emergence.
2022-2023
2022-2024 🇩🇪 Alexander Herbig is a bioinformatician and group leader for Computational Pathogenomics at the Max Planck Institute for Evolutionary Anthropology. His main interest is in studying the evolution of human pathogens and in methods development for pathogen detection and bacterial genomics.
2022-2023
2022-2024 🇩🇪 Alex Hübner is a computational biologist, who originally studied biotechnology, before switching to evolutionary biology during his PhD. For his postdoc in the Warinner lab, he focuses on investigating whether new methods in the field of modern metagenomics can be directly applied to ancient DNA data. Here, he is particularly interested in the de novo assembly of ancient metagenomic sequencing data and the subsequent analysis of its results.
2022-2023🇩🇪 Alina Hiss is a PhD student in the Computational Pathogenomics group at the Max Planck Institute for Evolutionary Anthropology. She is interested in the evolution of human pathogens and working on material from the Carpathian basin to gain insights about the presence and spread of pathogens in the region during the Early Medieval period.
2022-20232022-2024 🇫🇷 Arthur Kocher initially trained as a veterinarian. He then pursued a PhD in the field of disease ecology, during which he studied the impact of biodiversity changes on the transmission of zoonotic diseases using molecular tools such as DNA metabarcoding. During his Post-Docs, he extended his research focus to evolutionary aspects of pathogens, which he currently investigates using ancient genomic data and Bayesian phylogenetics.
2022-20232022-2024 🇩🇪 Clemens Schmid is a computational archaeologist pursuing a PhD in the group of Stephan Schiffels at the department of Archaeogenetics at the Max Planck Institute for Evolutionary Anthropology. He is trained both in archaeology and computer science and currently develops computational methods for the spatiotemporal co-analysis of archaeological and ancient genomic data. He worked in research projects on the European Neolithic, Copper and Bronze age and maintains research software in R, C++ and Haskell.
2022🇺🇸 Irina Velsko is a postdoc in the Microbiome group of the department of Archaeogenetics at the Max Planck Institute for Evolutionary Anthropology. She did her PhD work on oral microbiology and immunology of the living, and now works on oral microbiomes of the living and the dead. Her work focuses on the evolution and ecology of dental plaque biofilms, both modern and ancient, and the complex interplay between microbiomes and their hosts.2024🇮🇹 Giulia Zampirolo is a postdoctoral researcher specialised in ancient metagenomics, with a background in both archaeology and paleogenomics. During her PhD at the Centre for GeoGenetics (University of Copenhagen), she utilised sedimentary ancient DNA across different types of deposits to investigate past human-environment interactions. Currently based at the Globe Institute, University of Copenhagen, she works in Kristine Bohmann’s research group, contributing to the project SEACHANGE, which explores the evolution of past marine ecosystems and biodiversity. Her research interests lie in the interdisciplinary intersection of archaeology, ancient DNA, and ecology to trace human-induced environmental changes over time.
2022-20232022-2024 🇫🇷 Maxime Borry is a doctoral researcher in bioinformatics at the Max Planck Institute for Evolutionary Anthropology in Germany. After an undergraduate in life sciences and a master in Ecology, followed by a master in bioinformatics, he is now working on the completion of his PhD, focused on developing new tools and data analysis of ancient metagenomic samples.
2022🇺🇸 Megan Michel is a PhD student jointly affiliated with the Archaeogenetics Department at the Max Planck Institute for Evolutionary Anthropology and the Human Evolutionary Biology Department at Harvard University. Her research focuses on using computational genomic analyses to understand how pathogens have co-evolved with their hosts over the course of human history.
2022-20232022-2024 🇷🇺 Nikolay Oskolkov is a bioinformatician at Lund University and the bioinformatics platform of SciLifeLab, Sweden. He defended his PhD in theoretical physics in 2007, and switched to life sciences in 2012. His research interests include mathematical statistics and machine learning applied to genetics and genomics, single cell and ancient metagenomics data analysis.
2022🇦🇺 Sebastian Duchene is an Australian Research Council Fellow at the Doherty Institute for Infection and Immunity at the University of Melbourne, Australia. Prior to joining the University of Melbourne he obtained his PhD and conducted postdoctoral work at the University of Sydney. His research is in molecular evolution and epidemiology of infectious pathogens, notably viruses and bacteria, and developing Bayesian phylodynamic methods.
2022-20232022-2024 🇬🇷 Thiseas Lamnidis is a human population geneticist interested in European population history after the Bronze Age. To gain the required resolution to differentiate between Iron Age European populations, he is developing analytical methods based on the sharing of rare variation between individuals. He has also contributed to pipelines that streamline the processing and analysis of genetic data in a reproducible manner, while also facilitating dissemination of information among interdisciplinary colleagues.
20232023-2024 🇳🇱 Kevin Nota has a PhD in molecular paleoecology from Uppsala University. Currently he is a postdoc in the Max Planck Research Group for Ancient Environmental Genomics. His main research interest is in population genomics from ancient environmental samples.
20232023-2024 🇦🇹 Meriam Guellil is an expert in ancient microbial phylogenomics and metagenomics, particularly of human pathogens. She is particularly interested in the study of diseases that are invisible in the archaeological and osteological record, and the study of their evolution throughout human history. Her previous research includes studies on microbial species such as Yersinia pestis, Haemophilus influenzae, Borrelia recurrentis and Herpes simplex 1.🇦🇹 Meriam Guellil is an expert in ancient microbial phylogenomics and metagenomics, particularly of human pathogens. She is particularly interested in the study of diseases that are invisible in the archaeological and osteological record, and the study of their evolution throughout human history. Her previous research includes studies on microbial species such as Yersinia pestis, Haemophilus influenzae, Borrelia recurrentis and Herpes simplex 1.
2024🇩🇪 Tessa Zeibig is a trained Microbiologist, she’s currently pursuing her PhD in the Computational Pathogenomics Group led by Alexander Herbig at the Max Planck Institute for Evolutionary Anthropology. Her focus is on human viruses, seeking to understand their evolution and development, using ancient genomic data.
2024🇨🇱 Vilma Pérez is a microbial ecologist at the Australian Centre for Ancient DNA, University of Adelaide. Her research focuses on reconstructing microbial communities from present and past environments using environmental DNA techniques. Her objective is to use this information as bioindicators, providing insights into how environments have changed or responded to disturbances over time.
2022-2023🇩🇪 Alina Hiss is a PhD student in the Computational Pathogenomics group at the Max Planck Institute for Evolutionary Anthropology. She is interested in the evolution of human pathogens and working on material from the Carpathian basin to gain insights about the presence and spread of pathogens in the region during the Early Medieval period.
2023 🇩🇪 Robin Warner is a MSc bioinformatics student at the Leipzig University. He is currently writing his master’s thesis in the Max Planck Research Group for Ancient Environmental Genomics about the comparison of ancient sedimentary DNA capture methods and shotgun sequencing.🇩🇪 **Robin Warner* is a MSc bioinformatics student at the Leipzig University. He is currently writing his master’s thesis in the Max Planck Research Group for Ancient Environmental Genomics about the comparison of ancient sedimentary DNA capture methods and shotgun sequencing.
2022🇺🇸 Irina Velsko is a postdoc in the Microbiome group of the department of Archaeogenetics at the Max Planck Institute for Evolutionary Anthropology. She did her PhD work on oral microbiology and immunology of the living, and now works on oral microbiomes of the living and the dead. Her work focuses on the evolution and ecology of dental plaque biofilms, both modern and ancient, and the complex interplay between microbiomes and their hosts.
2022🇺🇸 Megan Michel is a PhD student jointly affiliated with the Archaeogenetics Department at the Max Planck Institute for Evolutionary Anthropology and the Human Evolutionary Biology Department at Harvard University. Her research focuses on using computational genomic analyses to understand how pathogens have co-evolved with their hosts over the course of human history.
diff --git a/bare-bones-bash.html b/bare-bones-bash.html index e0dffa5b..836087d4 100644 --- a/bare-bones-bash.html +++ b/bare-bones-bash.html @@ -2,13 +2,13 @@ - + -7  Introduction to the Command Line – Introduction to Ancient Metagenomics +6  Introduction to the Command Line – Introduction to Ancient Metagenomics - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
- -
- -
- - -
- - - -
- -
-
-

5  Introduction to Evolutionary Biology

-
- - - -
- -
-
Author
-
-

Sebastian Duchene

-
-
- - - -
- - - -
- - -
-
-
- -
-
-Important -
-
-
-

🚧 This page is still under construction 🚧

-
-
-
-
-
- -
-
-Important -
-
-
-

This chapter has not been updated since the 2022 edition of this book.

-
-
-

This chapter is still under construction. For the slides and recorded lecture version of this session, please see the Werner Siemens-Stiftung funded SPAAM Summer School: Introduction to Ancient Metagenomics website.

- - - -
- - -
-
- -
- - - - - \ No newline at end of file diff --git a/introduction-to-metagenomics.html b/introduction-to-metagenomics.html index 1850affa..4cb45f98 100644 --- a/introduction-to-metagenomics.html +++ b/introduction-to-metagenomics.html @@ -2,7 +2,7 @@ - + @@ -186,17 +186,11 @@ 4  Introduction to Microbial Genomics - - @@ -213,25 +207,25 @@ @@ -248,25 +242,25 @@ @@ -283,13 +277,13 @@ @@ -306,13 +300,13 @@ @@ -329,7 +323,7 @@ @@ -346,31 +340,31 @@ diff --git a/introduction-to-microbial-genomics.html b/introduction-to-microbial-genomics.html index c3dba836..f7ce9cec 100644 --- a/introduction-to-microbial-genomics.html +++ b/introduction-to-microbial-genomics.html @@ -2,7 +2,7 @@ - + @@ -31,7 +31,7 @@ - + @@ -186,17 +186,11 @@ 4  Introduction to Microbial Genomics - - @@ -213,25 +207,25 @@ @@ -248,25 +242,25 @@ @@ -283,13 +277,13 @@ @@ -306,13 +300,13 @@ @@ -329,7 +323,7 @@ @@ -346,31 +340,31 @@ @@ -1209,8 +1203,8 @@

- - 5  Introduction to Evolutionary Biology + + 5  Introduction to Environmental DNA diff --git a/introduction-to-ngs-sequencing.html b/introduction-to-ngs-sequencing.html index 7166db86..b895f8e0 100644 --- a/introduction-to-ngs-sequencing.html +++ b/introduction-to-ngs-sequencing.html @@ -2,7 +2,7 @@ - + @@ -206,17 +206,11 @@ 4  Introduction to Microbial Genomics - - @@ -233,25 +227,25 @@ @@ -268,25 +262,25 @@ @@ -303,13 +297,13 @@ @@ -326,13 +320,13 @@ @@ -349,7 +343,7 @@ @@ -366,31 +360,31 @@ diff --git a/phylogenomics.html b/phylogenomics.html index 67fc11a9..5e67bed9 100644 --- a/phylogenomics.html +++ b/phylogenomics.html @@ -2,13 +2,13 @@ - + -16  Phylogenomics – Introduction to Ancient Metagenomics +15  Phylogenomics – Introduction to Ancient Metagenomics @@ -1706,11 +1700,11 @@

There are many ways to turn a DataFrame back into a dictionary (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html#pandas.DataFrame.to_dict), which might be very handy for certain purposes.

-
-

9.4 Data exploration

+
+

8.4 Data exploration

The data for this tutorial/walkthrough is from a customer personality analysis of a company trying to better understand how to modify their product catalogue. Here is the link to the original source (https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis) for more information.

-
-

9.4.1 Columns

+
+

8.4.1 Columns

To display all the column names from the imported DataFrame, the attribute columns can be called.

df.columns
@@ -1790,8 +1784,8 @@

-

9.4.2 Inspecting the DataFrame

+
+

8.4.2 Inspecting the DataFrame

To quickly check how many rows and columns the DataFrame has, we can access the shape attribute.

df.shape
@@ -1871,23 +1865,23 @@

-
- @@ -2488,8 +2482,8 @@

-

9.4.3 Accessing rows and columns

+
+

8.4.3 Accessing rows and columns

It is possible to access parts of the data in DataFrames in different ways. The first method is sub-setting rows using the row name and column name. This can be done with .loc, which locates row(s) by providing the row name and column name [row, column]. When the rows are not named, the row index can be used instead. To print the second row, this would be index 1 since the index in Python starts at 0. To print all the columns, the : is used.

df.loc[1, :]
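As a minimal sketch of the same idea on a self-contained table, the example below builds a small toy DataFrame (the values are invented for illustration and are not the tutorial data) and selects from it with .loc.

```python
import pandas as pd

# Toy stand-in for the tutorial table; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "Education": ["Graduation", "PhD", "Master"],
        "Income": [58138.0, 46344.0, 71613.0],
    }
)

# Second row (index label 1), all columns: returns a Series.
second_row = toy.loc[1, :]

# A single cell: row label 1, column "Income".
income_of_second = toy.loc[1, "Income"]

# Label slices with .loc include the end label, so 0:1 gives two rows.
first_two_rows = toy.loc[0:1, ["Education"]]

print(second_row)
print(income_of_second)
print(first_two_rows)
```

Note that, unlike ordinary Python slicing, a .loc slice includes its end label.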
@@ -2542,23 +2536,23 @@

-
- @@ -3211,8 +3205,8 @@

-

9.4.4 Conditional subsetting

+
+

8.4.4 Conditional subsetting

So far, all the subsetting has been based on row names and column names. However, in many cases, it is more helpful to look only at data that contain certain items. This can be done using conditional subsetting, which is based on Boolean values True or False. pandas will interpret a series of True and False values by printing only the rows or columns where a True is present and ignoring all rows or columns with a False.

For example, if we are only interested in individuals in the table who graduated, we can test each string in the column Education to see if it is equal (==) to Graduation. This will return a series with Boolean values True or False.

education_is_grad = (df["Education"] == "Graduation")
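The mechanics of a Boolean mask can be sketched on a toy table as follows. The data and the income threshold are invented for illustration; only the masking pattern itself mirrors the text above.

```python
import pandas as pd

# Toy stand-in for the tutorial table; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "Education": ["Graduation", "PhD", "Graduation"],
        "Income": [58138.0, 46344.0, 26646.0],
    }
)

# Comparing a column to a value gives one True/False per row.
mask = toy["Education"] == "Graduation"

# Indexing with the mask keeps only the rows where it is True.
graduates = toy[mask]

# Conditions are combined with & (and) or | (or); each needs its own parentheses.
well_paid_graduates = toy[(toy["Education"] == "Graduation") & (toy["Income"] > 50000)]

print(mask.tolist())
print(graduates)
print(well_paid_graduates)
```

The parentheses around each condition matter because & binds more tightly than == and > in Python.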
@@ -3335,23 +3329,23 @@ 

-
- @@ -4087,23 +4081,23 @@

-
- @@ -4826,8 +4820,8 @@

-
-

9.4.5 Describing a DataFrame

+
+

8.4.5 Describing a DataFrame

Sometimes it is nice to get a quick overview of the data in a table, such as means and counts. Pandas has a native function to do just that: it will output a count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum for each numeric column.

df.describe()
@@ -4842,23 +4836,23 @@

-
- @@ -5575,8 +5569,8 @@

-
-

9.4.6 Getting summary statistics on grouped data

+
+

8.4.6 Getting summary statistics on grouped data

Pandas is equipped with lots of useful functions which make complicated tasks very easy and fast. One of these functions is .groupby() with the argument by=..., which will group a DataFrame using a categorical column (for example Education or Marital_Status). This makes it possible to perform operations on a group directly without the need for subsetting. For example, the mean income for each Education level in the DataFrame can be obtained by specifying the grouping column with .groupby(by='Education'), selecting the column to operate on with ["Income"], and then calling the .mean() function.

df.groupby(by="Education")["Income"].mean()
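A small self-contained sketch shows what the grouped mean computes. The toy incomes are invented so that the group means are easy to check by hand; the extra .agg() call is included only to illustrate that several statistics can be requested at once.

```python
import pandas as pd

# Toy data, invented for illustration: two people per education level.
toy = pd.DataFrame(
    {
        "Education": ["Graduation", "Graduation", "PhD", "PhD"],
        "Income": [40000.0, 60000.0, 50000.0, 70000.0],
    }
)

# One mean per group; the result is a Series indexed by the group label.
mean_income = toy.groupby(by="Education")["Income"].mean()

# Several statistics at once; the result is a DataFrame with one column per statistic.
summary = toy.groupby(by="Education")["Income"].agg(["mean", "count"])

print(mean_income)
print(summary)
```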
@@ -5601,8 +5595,8 @@

-

9.4.7 Subsetting Questions and Exercises

+
+

8.4.7 Subsetting Questions and Exercises

Here are several exercises to practise conditional subsetting. Try them first before looking at the answers.

@@ -5869,8 +5863,8 @@

-

9.5 Dealing with missing data

+
+

8.5 Dealing with missing data

In large tables, it is often important to check if there are columns or rows that have missing data. pandas represents missing data with NA (Not Available). To identify these missing values, pandas provides the .isna() function. This function checks every cell in the DataFrame and returns a DataFrame of the same shape, where each cell contains a Boolean value: True if the original cell contains NA, and False otherwise.

df.isna()
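The Boolean table returned by .isna() is usually summarised rather than read cell by cell. The sketch below uses an invented toy table with two gaps and counts the missing values per column. The follow-up .dropna() call is one possible way to discard incomplete rows and is shown here as an assumption about what one might do next, not as the tutorial's prescribed step.

```python
import pandas as pd

# Toy table with one missing value in each column (None becomes a missing value).
toy = pd.DataFrame(
    {
        "Income": [58138.0, None, 71613.0],
        "Education": ["Graduation", "PhD", None],
    }
)

# Same shape as the input, True wherever a value is missing.
missing = toy.isna()

# True counts as 1 when summed, so this gives the number of gaps per column.
missing_per_column = toy.isna().sum()

# Keep only the rows without any missing value.
complete_rows = toy.dropna()

print(missing)
print(missing_per_column)
print(complete_rows)
```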
@@ -5885,23 +5879,23 @@

-
- @@ -6662,10 +6656,10 @@

-

9.6 Combining data

-
-

9.6.1 Concatenation exercises

+
+

8.6 Combining data

+
+

8.6.1 Concatenation exercises

Data is very often present in multiple tables. Think, for example, about a taxonomy table giving count data per sample. One way to combine multiple datasets is through concatenation, which either combines all columns or rows of multiple DataFrames. The function in Pandas that does just that is called .concat. This command combines two DataFrames by appending all rows or columns: .concat([first_dataframe, second_dataframe]).

In the DataFrame, there are individuals with the education levels Graduation, Master, Basic, and 2n Cycle. PhD is missing; however, there is data on people with the education level PhD in another table called phd_data.tsv.

With everything learned so far, and basic information on the .concat() function, try to read in the data from ../phd_data.tsv and concatenate it to the existing df.
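Before attempting the exercise, it may help to see row-wise concatenation on two tiny tables. The DataFrames below are invented stand-ins, not the tutorial files, and the use of ignore_index=True is a design choice for this sketch rather than a requirement.

```python
import pandas as pd

# Two toy tables with the same columns; the values are invented for illustration.
first = pd.DataFrame({"ID": [1, 2], "Education": ["Graduation", "Master"]})
second = pd.DataFrame({"ID": [3], "Education": ["PhD"]})

# Stack the rows of both tables; ignore_index=True renumbers the rows 0..n-1
# instead of keeping each table's original row labels.
stacked = pd.concat([first, second], ignore_index=True)

print(stacked)
```

Without ignore_index=True the row labels of both inputs are kept, so the combined table would contain the label 0 twice.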

@@ -6737,23 +6731,23 @@

-
- @@ -7520,23 +7514,23 @@

-
- @@ -8292,8 +8286,8 @@

-

9.6.2 Merging

+
+

8.6.2 Merging

Besides concatenating two DataFrames, there is another powerful function for combining data from multiple sources: .merge(). This function is especially useful when we have different types of related data in separate tables. For example, we might have a taxonomy table with count data per sample and a metadata table in another DataFrame.

The pandas function .merge() allows us to combine these DataFrames based on a common column. This column must exist in both DataFrames and contain similar values.

To illustrate the .merge() function, we will create a new DataFrame and merge it with the existing one. Let’s rank the different education levels from 1 to 5 in a new DataFrame and merge this with the existing DataFrame.
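As a minimal sketch of merging on a shared column, the example below joins a toy table of people with a toy ranking table. The rank values and the column name Level are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy table of individuals; the values are invented for illustration.
people = pd.DataFrame({"ID": [1, 2, 3], "Education": ["Basic", "PhD", "Master"]})

# Hypothetical ranking of education levels; "Level" is an assumed column name.
ranking = pd.DataFrame({"Education": ["Basic", "Master", "PhD"], "Level": [1, 4, 5]})

# Match rows on the shared "Education" column; how="left" keeps every row of
# `people` in its original order and adds the matching Level.
merged = people.merge(ranking, on="Education", how="left")

print(merged)
```

A left merge is used here so that no individual is dropped if their education level happens to be missing from the ranking table; such rows would simply get a missing Level.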

@@ -8419,23 +8413,23 @@

-
- @@ -9159,12 +9153,12 @@

-

9.7 Data visualisation

+
+

8.7 Data visualisation

Just looking at DataFrames is nice and useful, but in many cases, it is easier to look at data in graphs. The function that can create plots directly from DataFrames is .plot(). The .plot() function uses the plotting library matplotlib by default in the background. There are other plotting libraries such as Plotnine which will be shown further in the tutorial.

The only arguments .plot() requires are kind=..., and the plot axes x=... and y=.... The kind argument specifies the type of plot, such as hist for histogram, bar for bar plot, and scatter for scatter plot. Check out the Pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for more plot kinds and useful syntax. There are many aesthetic functions that can help to create beautiful plots. These functions, such as .set_xlabel() or .set_title(), are added to the plot, as shown in the examples below.

-
-

9.7.1 Histogram

+
+

8.7.1 Histogram

ax = merged_df.plot(kind="hist", y="Income")
 ax.set_xlabel("Income")
 ax.set_title("Histogram of income")
@@ -9180,14 +9174,14 @@

-

Results in Figure 9.3.

+

Results in Figure 8.3.

-Figure 9.3: This is the histogram of income that should appear with we run the code above. +Figure 8.3: This is the histogram of income that should appear with we run the code above.
@@ -9291,14 +9285,14 @@

-

Results in Figure 9.4.

+

Results in Figure 8.4.

-Figure 9.4: To “fix” the histogram, the one person with the income of 666666 is removed, making the plot look a lot neater. +Figure 8.4: To “fix” the histogram, the one person with the income of 666666 is removed, making the plot look a lot neater.
@@ -9309,8 +9303,8 @@

-
-

9.7.2 Bar plot

+
+

8.7.2 Bar plot

Instead of making a plot from the original DataFrame we can use the groupby and mean methods to make a plot with summary statistics.

grouped_by_education = merged_df.groupby(by="Education")["Income"].mean()
 grouped_by_education
@@ -9351,14 +9345,14 @@

-

Results in Figure 9.5.

+

Results in Figure 8.5.

-Figure 9.5: Barplot of the mean income for each education level. +Figure 8.5: Barplot of the mean income for each education level.
@@ -9366,8 +9360,8 @@

-

9.7.3 Scatter plot

+
+

8.7.3 Scatter plot

Another kind of plot is the scatter plot, which needs two columns for the x and y axis.

ax = df.plot(kind="scatter", x="MntWines", y="MntFruits")
 ax.set_title("Wine purchases and Fruit purchases")
@@ -9383,14 +9377,14 @@

-

Results in Figure 9.6.

+

Results in Figure 8.6.

-Figure 9.6: A scatter plot with wine purchases on the x-axis and fruit purchases on the y-axis. +Figure 8.6: A scatter plot with wine purchases on the x-axis and fruit purchases on the y-axis.
@@ -9424,8 +9418,8 @@

-
-

9.8 Plotnine

+
+

8.8 Plotnine

Plotnine is the Python clone of ggplot2, which is very powerful and is great if we are already familiar with the ggplot2 syntax!

from plotnine import *
(ggplot(merged_df, aes("Education", "MntWines", fill="Education"))
@@ -9442,14 +9436,14 @@ 

-

Results in Figure 9.7.

+

Results in Figure 8.7.

-Figure 9.7: Boxplot with the amount spent on wine per education. +Figure 8.7: Boxplot with the amount spent on wine per education.
@@ -9472,30 +9466,30 @@

-

Results in Figure 9.8.

+

Results in Figure 8.8.

-Figure 9.8: Plot of the income of people born after 1900, faceted by marital status, and filled by education level. +Figure 8.8: Plot of the income of people born after 1900, faceted by marital status, and filled by education level.

-
-

9.8.1 Advanced Questions and Exercises

+
+

8.8.1 Advanced Questions and Exercises

Now that we are familiar with Python, pandas, and plotting, we can put these skills into practice. There are two data tables from AncientMetagenomeDir which contain metadata from metagenomes. By using the code in the tutorial, we should be able to explore the datasets and make some fancy plots.

file names:
 sample_table_url
 library_table_url
-
-

9.9 Summary

+
+

8.9 Summary

In this chapter, we have started exploring the basics of data analysis using Python with the versatile Pandas library. We wrote Python code and executed it in a Jupyter Notebook. With just a handful of functions such as .read_csv(), .loc[], .drop(), .merge(), .concat() and .plot(), we have done data manipulation, calculated summary statistics, and plotted the data.

The takeaway messages therefore are:

    @@ -9504,8 +9498,8 @@

    -

    9.10 (Optional) clean-up

    +
    +

    8.10 (Optional) clean-up

    Let’s clean up your working directory by removing all the data and output from this chapter.

    When closing your jupyter notebook(s), say no to saving any additional files.

    Press ctrl + c on your terminal, and type y when requested. Once completed, the command below will remove the /<PATH>/<TO>/python-pandas directory as well as all of its contents.

    @@ -9956,12 +9950,12 @@

    diff --git a/r-tidyverse.html b/r-tidyverse.html index 44a06e78..04a2df28 100644 --- a/r-tidyverse.html +++ b/r-tidyverse.html @@ -2,13 +2,13 @@ - + -8  Introduction to R and the Tidyverse – Introduction to Ancient Metagenomics +7  Introduction to R and the Tidyverse – Introduction to Ancient Metagenomics @@ -2385,26 +2379,26 @@

    -Table 11.6: A five column table of TAXIDs and the organisms corresponding relative abundance, and the attached taxonomic path associated with the TAXID, but also the rank and name of the particular taxonomic ID, filtered to only species +Table 10.6: A five column table of TAXIDs and the organisms corresponding relative abundance, and the attached taxonomic path associated with the TAXID, but also the rank and name of the particular taxonomic ID, filtered to only species
    -
    - @@ -2910,7 +2904,7 @@

    -Table 11.7: Reconstruction of the first two column taxonomic profile but with species-level organism names rather than TAXIDs +Table 10.7: Reconstruction of the first two column taxonomic profile but with species-level organism names rather than TAXIDs
    @@ -2954,7 +2948,7 @@

    -Table 11.8: Reconstruction of the first two column taxonomic profile but with phylum-level organism names rather than TAXIDs +Table 10.8: Reconstruction of the first two column taxonomic profile but with phylum-level organism names rather than TAXIDs

    @@ -3019,35 +3013,35 @@

    -

    11.10 Bringing together ancient and modern samples

    +
    +

    10.10 Bringing together ancient and modern samples

    Now let’s load our modern reference samples

    -

    First at the phylum level (Table 11.9)

    +

    First at the phylum level (Table 10.9)

    modern_phylums = pd.read_csv("../data/curated_metagenomics/modern_sources_phylum.csv", index_col=0)
     modern_phylums.head()
    -Table 11.9: Taxonomic profiles at phylum level of multiple modern samples +Table 10.9: Taxonomic profiles at phylum level of multiple modern samples
    -
    - @@ -3638,31 +3632,31 @@

    modern_species = pd.read_csv("../data/curated_metagenomics/modern_sources_species.csv", index_col=0)

    -

    As usual, we always check if our data has been loaded correctly (Table 11.10)

    +

    As usual, we always check if our data has been loaded correctly (Table 10.10)

    modern_species.head()
    -Table 11.10: Taxonomic profiles at species level of multiple modern samples +Table 10.10: Taxonomic profiles at species level of multiple modern samples
    -
    - @@ -4280,38 +4274,38 @@

    -

    11.10.1 Time to merge !

    +
    +

    10.10.1 Time to merge !

    Now, let’s merge our ancient sample with the modern data in one single table. For that, we’ll use the pandas merge function which will merge the two tables together, using the index as the merge key.

    all_species = ancient_species.merge(modern_species, left_index=True, right_index=True, how='outer').fillna(0)
     all_phylums = ancient_phylums.merge(modern_phylums, left_index=True, right_index=True, how='outer').fillna(0)
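
To make the effect of the outer merge on the index more concrete, here is a minimal sketch using two tiny made-up tables. The taxon and sample names (`taxon_a`, `ancient_1`, and so on) are hypothetical and are not taken from the tutorial data.

```python
import pandas as pd

# Two toy abundance tables indexed by taxon (all names and values are made up)
ancient = pd.DataFrame({"ancient_1": [10, 5]}, index=["taxon_a", "taxon_b"])
modern = pd.DataFrame({"modern_1": [3, 7]}, index=["taxon_b", "taxon_c"])

# An outer merge on the index keeps every taxon seen in either table.
# Taxa missing from one table get NaN, which fillna(0) turns into 0.
merged = ancient.merge(
    modern, left_index=True, right_index=True, how="outer"
).fillna(0)

print(merged)
```

A taxon that is present in only one of the two tables ends up with an abundance of 0 in the other, which is usually what we want before computing diversity metrics across all samples.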
    -

    Finally, let’s load the metadata, which contains the information about the modern samples (Table 11.11).

    +

    Finally, let’s load the metadata, which contains the information about the modern samples (Table 10.11).

    metadata = pd.read_csv("../data/metadata/curated_metagenomics_modern_sources.csv")
     
     metadata.head()
    -Table 11.11: Various metadata information about the samples in this example +Table 10.11: Various metadata information about the samples in this example
    -
    - @@ -4931,12 +4925,12 @@

    -
    -

    11.11 Comparing ancient and modern samples

    -
    -

    11.11.1 Taxonomic composition

    +
    +

    10.11 Comparing ancient and modern samples

    +
    +

    10.11.1 Taxonomic composition

    One common plot in microbiome papers is a stacked barplot, often at the phylum or family level.

    -

    First, we’ll do some renaming, to make the value of the metadata variables a bit easier to understand (Table 11.12)

    +

    First, we’ll do some renaming, to make the value of the metadata variables a bit easier to understand (Table 10.12)

    group_info = pd.concat(
         [
             (
    @@ -4974,7 +4968,7 @@ 

    -Table 11.12: Table of samples and their group +Table 10.12: Table of samples and their group

    @@ -5050,26 +5044,26 @@

    -Table 11.13: Table of the raw multi-sample taxonomic table +Table 10.13: Table of the raw multi-sample taxonomic table
    -
    - @@ -5802,7 +5796,7 @@

    Now, we need transform this (Table 11.13) in the tidy format, with the melt function.

    +

    Now, we need transform this (Table 10.13) in the tidy format, with the melt function.

    tidy_phylums = (
         all_phylums
         .transpose()
    @@ -5820,11 +5814,11 @@ 

    -

    11.12 Let’s make some plots

    +
    +

    10.12 Let’s make some plots

    We first import plotnine

    from plotnine import *
    -

    And then run plotnine to a barplot of the mean abundance per group (Figure 11.6).

    +

    And then run plotnine to a barplot of the mean abundance per group (Figure 10.6).

    ggplot(tidy_phylums, aes(x='group', y='relative_abundance', fill='Phylum')) \
     + geom_bar(position='stack', stat='identity') \
     + ylab('mean abundance') \
    @@ -5836,7 +5830,7 @@ 


    -Figure 11.6: Stacked bar chart of ancient, non-westernised, and westernised sample groups on the X axis columns, and mean abundance percentage on the Y-axis. The legend and stacks of the bar represent different phyla each with a different colour +Figure 10.6: Stacked bar chart of ancient, non-westernised, and westernised sample groups on the X axis columns, and mean abundance percentage on the Y-axis. The legend and stacks of the bar represent different phyla each with a different colour

    @@ -5870,13 +5864,13 @@

    < -
    -

    11.13 Ecological diversity

    -
    -

    11.13.1 Alpha diversity

    +
    +

    10.13 Ecological diversity

    +
    +

    10.13.1 Alpha diversity

    Alpha diversity is the measure of diversity within each sample. It is used to estimate how many species are present in a sample, and how diverse they are. We’ll use the Python library scikit-bio (http://scikit-bio.org/) to compute it, and the plotnine (https://plotnine.readthedocs.io/) library (a Python port of ggplot2 (https://ggplot2.tidyverse.org/reference/ggplot.html)) to visualise the results.

    import skbio
    -

    Let’s compute the species richness, the Shannon index, and Simpson index of diversity (Table 11.14)

    +

    Let’s compute the species richness, the Shannon index, and Simpson index of diversity (Table 10.14)

    shannon = skbio.diversity.alpha_diversity(
         metric="shannon", counts=all_species.transpose(), ids=all_species.columns
     )
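
To see what these indices measure, the following is a minimal pure-Python sketch of the textbook definitions of richness, the Shannon index, and the Simpson index. It is an illustration and not the scikit-bio implementation. The function names are hypothetical, and the logarithm base of 2 is an assumption, since the default base may differ between scikit-bio versions.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts, base=2.0):
    """Shannon index H = -sum(p_i * log(p_i)); the base is an assumption."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def simpson_index(counts):
    """Simpson index of diversity, 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Four equally abundant taxa (made-up counts)
even_sample = [10, 10, 10, 10]
print(richness(even_sample), shannon_index(even_sample), simpson_index(even_sample))
```

For a perfectly even community the Shannon and Simpson indices are at their maximum for the given richness. Both drop as a few taxa come to dominate the sample.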
    @@ -5893,7 +5887,7 @@ 

    -Table 11.14: Table of the shannon, simpson, and richness alpha diversity indicies for a subset of samples +Table 10.14: Table of the shannon, simpson, and richness alpha diversity indicies for a subset of samples

    @@ -5996,7 +5990,7 @@

    -Table 11.15: Table of the shannon, simpson, and richness alpha diversity indicies for a subset of samples but with the group metadata +Table 10.15: Table of the shannon, simpson, and richness alpha diversity indicies for a subset of samples but with the group metadata

    @@ -6106,14 +6100,14 @@

    -

    And as always, we need it in tidy format (Table 11.16) for plotnine.

    +

    And as always, we need it in tidy format (Table 10.16) for plotnine.

    alpha_diversity = alpha_diversity.melt(id_vars='group', value_name='alpha diversity', var_name='diversity_index', ignore_index=False)
     
     alpha_diversity
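
As a small illustration of what `melt` with `ignore_index=False` does, here is a sketch on a made-up two-sample table. The sample names and values are hypothetical and only serve to show the change of shape.

```python
import pandas as pd

# Wide table: one row per sample, one column per diversity index (made-up values)
wide = pd.DataFrame(
    {
        "group": ["ancient", "westernized"],
        "shannon": [3.1, 2.4],
        "richness": [120, 80],
    },
    index=["sample_1", "sample_2"],
)

# Long (tidy) table: one row per (sample, diversity index) pair.
# ignore_index=False keeps the sample names as the index.
tidy = wide.melt(
    id_vars="group",
    var_name="diversity_index",
    value_name="alpha diversity",
    ignore_index=False,
)

print(tidy)
```

Each sample now appears once per diversity index, which is the layout plotnine expects when faceting by `diversity_index`.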
    -Table 11.16: Table of the shannon, simpson, and richness alpha diversity indicies for a subset of samples but with the group metadata but in long-form tidy format +Table 10.16: Table of the shannon, simpson, and richness alpha diversity indicies for a subset of samples but with the group metadata but in long-form tidy format

    @@ -6209,7 +6203,7 @@

    -

    We now make a violin plot to compare the alpha diversity for each group, faceted by the type of alpha diversity index (Figure 11.7).

    +

    We now make a violin plot to compare the alpha diversity for each group, faceted by the type of alpha diversity index (Figure 10.7).

    g = ggplot(alpha_diversity, aes(x='group', y='alpha diversity', color='group'))
     g += geom_violin()
     g += geom_jitter()
    @@ -6225,7 +6219,7 @@ 

    -Figure 11.7: Three groups of violin plots of an ancient sample, westernised samples and non-westernised samples (x-axis) of the alpha diversity (y-axis) calculated for richness, shannon and simpson alpha indicies +Figure 10.7: Three groups of violin plots of an ancient sample, westernised samples and non-westernised samples (x-axis) of the alpha diversity (y-axis) calculated for richness, shannon and simpson alpha indicies
    @@ -6259,8 +6253,8 @@

    -
    -

    11.13.2 Beta diversity

    +
    +

    10.13.2 Beta diversity

    The Beta diversity is the measure of diversity between a pair of samples. It is used to compare the diversity between samples and see how they relate.

    We will compute the beta diversity using the Bray-Curtis dissimilarity.

    beta_diversity = skbio.diversity.beta_diversity(metric='braycurtis', counts=all_species.transpose(), ids=all_species.columns, validate=True)
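
For intuition, the Bray-Curtis dissimilarity between two samples is the sum of the absolute differences in abundance divided by the total abundance of both samples. The sketch below implements that standard formula in pure Python for a single pair of samples. The function name `bray_curtis` and the counts are hypothetical, and this is not the scikit-bio code path.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


# Made-up counts for two taxa in two samples
sample_1 = [6, 4]
sample_2 = [2, 8]
print(bray_curtis(sample_1, sample_2))
```

The value ranges from 0, for two samples with identical abundances, to 1, for two samples that share no taxa.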
    @@ -6317,26 +6311,26 @@

    -Table 11.17: Table of principal coordinates (columns) for each of the samples (rows) +Table 10.17: Table of principal coordinates (columns) for each of the samples (rows)
    -
    - @@ -7069,7 +7063,7 @@

    -

    Let’s look at the variance explained by the first axes by using a scree plot (Figure 11.8).

    +

    Let’s look at the variance explained by the first axes by using a scree plot (Figure 10.8).

    var_explained = pcoa.proportion_explained[:9].to_frame(name='variance explained').reset_index().rename(columns={'index':'PC'})
     
     ggplot(var_explained, aes(x='PC', y='variance explained', group=1)) \
    @@ -7082,7 +7076,7 @@ 

    -Figure 11.8: Scree plot describing the variance explained (Y-axis), for each Principal Componanent (X-axis), with a curved line from PC1 having highest variance to lowest on PC9. +Figure 10.8: Scree plot describing the variance explained (Y-axis), for each Principal Componanent (X-axis), with a curved line from PC1 having highest variance to lowest on PC9.
    @@ -7097,7 +7091,7 @@

    )
     pcoa_embed['group'] = pcoa_embed['group'].replace({'yes':'non_westernized','no':'westernized', pd.NA:'ERR5766177'})
    -

    Let’s first look at these components with 2D plots (Figure 11.9, Figure 11.10)

    +

    Let’s first look at these components with 2D plots (Figure 10.9, Figure 10.10)

    ggplot(pcoa_embed, aes(x='PC1', y='PC2', color='group')) \
     + geom_point() \
     + theme_classic() \
    @@ -7108,7 +7102,7 @@ 

    -Figure 11.9: Principal Coordinate Analysis plot of PC1 (X-axis) and PC2 (Y-axis), with three groups of points in the scatter plot - blue circles of westernised data points in the bottom left, overlapping with green circles of non-westernised datapoints in the top right, and the single ancient sample as a red circle falling in between the two on the right of the overlap +Figure 10.9: Principal Coordinate Analysis plot of PC1 (X-axis) and PC2 (Y-axis), with three groups of points in the scatter plot - blue circles of westernised data points in the bottom left, overlapping with green circles of non-westernised datapoints in the top right, and the single ancient sample as a red circle falling in between the two on the right of the overlap
    @@ -7122,7 +7116,7 @@

    -Figure 11.10: Principal Coordinate Analysis plot of PC1 (X-axis) and PC3 (Y-axis), a similar overlap between westernised/non-westernised individuals and position of the ancient sample as in the PC1-PC2 PCoA, however this time in a horseshoe shape from bottom left for the westernised data points, curving up to the top of PC3 at a peak, and then falling again at the top of PC1 +Figure 10.10: Principal Coordinate Analysis plot of PC1 (X-axis) and PC3 (Y-axis), a similar overlap between westernised/non-westernised individuals and position of the ancient sample as in the PC1-PC2 PCoA, however this time in a horseshoe shape from bottom left for the westernised data points, curving up to the top of PC3 at a peak, and then falling again at the top of PC1
    @@ -7149,7 +7143,7 @@

    -

    Finally, we can also visualise this distance matrix using a clustered heatmap, where pairs of sample with a small beta diversity are clustered together (Figure 11.11).

    +

    Finally, we can also visualise this distance matrix using a clustered heatmap, where pairs of sample with a small beta diversity are clustered together (Figure 10.11).

    import seaborn as sns
     import scipy.spatial as sp, scipy.cluster.hierarchy as hc

    We set the color in seaborn to match the color palette we’ve used so far.

    @@ -7169,7 +7163,7 @@

    -Figure 11.11: Sample-by-sample clustered heatmap, with tree representation of the clustering on the left and top of the heatmap +Figure 10.11: Sample-by-sample clustered heatmap, with tree representation of the clustering on the left and top of the heatmap
    @@ -7204,8 +7198,8 @@

    -
    -

    11.14 (Optional) clean-up

    +
    +

    10.14 (Optional) clean-up

    Let’s clean up your working directory by removing all the data and output from this chapter.

    When closing your jupyter notebook(s), say no to saving any additional files.

    Press ctrl + c on your terminal, and type y when requested. Once completed, the command below will remove the /<PATH>/<TO>/taxonomic-profiling directory as well as all of its contents.

    @@ -7229,8 +7223,8 @@

    To delete the conda environment

    conda remove --name taxonomic-profiling --all -y

    -
    -

    11.15 Summary

    +
    +

    10.15 Summary

    In this practical session we

    • Looked at how to process the raw sequencing data to focus only on the non-human reads
    • @@ -7251,8 +7245,8 @@

      -

      11.16 References

      +
      +

      10.16 References

      @@ -7724,7 +7718,7 @@

      diff --git a/tools.html b/tools.html index a8ddc7cc..76a8bc0a 100644 --- a/tools.html +++ b/tools.html @@ -2,12 +2,12 @@ - + -21  Tools – Introduction to Ancient Metagenomics +20  Tools – Introduction to Ancient Metagenomics