Update main.tex
Removing Least-Square
JustGag authored Oct 14, 2024
1 parent e2368d8 commit c74757f
Showing 1 changed file with 4 additions and 52 deletions.
56 changes: 4 additions & 52 deletions papers/Gagnon_Kebe_Tahiri/main.tex
@@ -215,7 +215,7 @@ \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}}

In our case, we configured the \textit{aPhyloGeo} software as follows: $pairwiseAligner$ for sequence alignment; $\text{Hamming distance}$ to measure simple dissimilarities between sequences of identical length; the $\text{Wider Fit by elongating with Gap (starAlignment)}$ algorithm, which takes alignment gaps into account and is often necessary when the sequences contain major deletions or insertions; $\text{windows\_size}$: 1 nucleotide (nt); and finally, $\text{step\_size}$: 10 nt. The last two settings imply that, for each 1 nt window, a phylogenetic tree is produced using the nucleotide of each Cumacea specimen; the window is then moved by 10 nt, and a new tree is created. Each window in the alignment therefore yields one genetic tree: if there are $n$ windows, there will be $n$ phylogenetic trees. The genetic trees are stored in an object called $T_1$, while the spatial and ecological trees are stored in another object called $T_2$.

\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and regional trees using Robinson-Foulds distance \citep{robinson_comparison_1981}, normalized Robinson-Foulds distance, Euclidean distance, and Least Squares distance. These contribute to understanding the correspondence between Cumacea genetic sequences and their habitat. The approach also takes bootstrapping into account \citep{koshkarov_phylogeography_2022}. The results of these metrics were obtained using the functions $least\_square(tree1, tree2)$, $robinson\_foulds(tree1, tree2)$, $euclidean\_dist(tree1, tree2)$ from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). Those for the normalized Robinson-Foulds distance were obtained with the function $robinson\_foulds(tree1, tree2)$ (see the last line of code in \autoref{lst:robinsonFoulds}). The metric output tells us which of our attributes has the greatest divergence of phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}).
\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and regional trees using Robinson-Foulds distance \citep{robinson_comparison_1981}, normalized Robinson-Foulds distance and Euclidean distance. These contribute to understanding the correspondence between Cumacea genetic sequences and their habitat. The approach also takes bootstrapping into account \citep{koshkarov_phylogeography_2022}. The results of these metrics were obtained using the functions $robinson\_foulds(tree1, tree2)$ and $euclidean\_dist(tree1, tree2)$ from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). Those for the normalized Robinson-Foulds distance were obtained with the function $robinson\_foulds(tree1, tree2)$ (see the last line of code in \autoref{lst:robinsonFoulds}). The metric output tells us which of our attributes has the greatest divergence of phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}).

In addition to identifying the specific attribute, a sliding-window approach enables the precise localization of subtle sequences with high rates of genetic mutation \citep{koshkarov_phylogeography_2022}. This method requires shifting a fixed-size window over the alignment of genetic sequences, allowing phylogenetic trees to be reconstructed for each part of the sequence. It therefore allows us to recognize changes in evolutionary relationships along the 16S rRNA mitochondrial gene region of Cumacea species. This method is essential for determining whether Cumacea-specific gene sequences in this region of their genome may be affected by certain ecological or spatial attributes of their habitat (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}).
\end{enumerate}
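
The sliding-window procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the \textit{aPhyloGeo} implementation: the function names (`sliding_windows`, `hamming`, `window_distance_matrices`), the dictionary-based alignment, and the toy sequences are all hypothetical, and the sketch stops at the pairwise Hamming distances from which a tree would be built for each window.

```python
from itertools import combinations


def sliding_windows(alignment, window_size, step_size):
    """Yield (start, end, sub_alignment) for each window of an alignment.

    alignment: dict mapping specimen name -> aligned sequence (equal lengths).
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    for start in range(0, length - window_size + 1, step_size):
        end = start + window_size
        yield start, end, {name: seq[start:end] for name, seq in alignment.items()}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def window_distance_matrices(alignment, window_size, step_size):
    """Return one dict of pairwise Hamming distances per window."""
    results = []
    for start, end, sub in sliding_windows(alignment, window_size, step_size):
        dist = {
            (p, q): hamming(sub[p], sub[q])
            for p, q in combinations(sorted(sub), 2)
        }
        results.append((start, end, dist))
    return results


# Toy alignment (hypothetical data, not the Cumacea sequences)
toy = {"sp1": "ACGTACGTAC", "sp2": "ACGTTCGTAC", "sp3": "ACGAACGTTC"}
for start, end, dist in window_distance_matrices(toy, window_size=4, step_size=3):
    print(start, end, dist)
```

Each tuple printed corresponds to one window; in the workflow described above, one genetic tree would be inferred per window from such distances.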
@@ -345,55 +345,7 @@ \subsubsection{Euclidean distance}\label{euclidean}
return ed
\end{lstlisting}

\subsubsection{Least Squares distance}\label{LS}
The Least Squares (LS) distance is the sum of the squared differences between the leaf-to-leaf phylogenetic distances in the two sets of trees ($T_1$ and $T_2$) (see Equation \eqref{eq:ls} and \autoref{lst:LeastSquare}). As with the Euclidean distance, the distance between each pair of leaves in the genetic trees is compared with that of the habitat attribute trees \citep{czarna2006topology, balaban2020apples}. This metric allows us to measure the topological dissimilarity between the two sets of trees based on the length of the branches (i.e., mutation rate) and to understand how the different habitat attributes influence the topological structure of phylogenetic trees. An LS value for a specific window that is high relative to the other windows may indicate a structural discrepancy between this DNA sequence and the tree built from a habitat attribute. In that case, we cannot conclude with certainty that the two are correlated, since the genetic variations in this window are inconsistent with the variations in this habitat parameter.

\begin{equation}\label{eq:ls}
d_{\text{LS}}(T_1, T_2) = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left(d_{T_1}(i,j) - d_{T_2}(i,j)\right)^2
\end{equation}

where $d_{\text{LS}}(T_1, T_2)$ is the Least Squares distance between the two sets of trees, $d_{T_1}(i,j)$ and $d_{T_2}(i,j)$ are the distances between leaves $i$ and $j$ in $T_1$ and $T_2$, respectively, and $n$ is the number of leaves.
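
As a small worked illustration (the numbers are invented for the example and do not come from the Cumacea data), consider $n = 3$ leaves with leaf-to-leaf distances $2$, $4$, and $4$ for the pairs $(1,2)$, $(1,3)$, and $(2,3)$ in $T_1$, and $4$, $4$, and $6$ for the same pairs in $T_2$. Summing the squared differences over the three pairs gives
\[
d_{\text{LS}}(T_1, T_2) = (2-4)^2 + (4-4)^2 + (4-6)^2 = 4 + 0 + 4 = 8 .
\]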

%\autoref{lst:LeastSquare}.
\begin{lstlisting}[label=lst:LeastSquare, language=Python, caption=Python function for calculating the LS distance in the aPhyloGeo package]
from itertools import combinations

def least_square(tree1, tree2):
    """
    Compute the Least Squares distance between two trees.

    Both trees are assumed to expose get_terminals(), find_any() and
    distance(), as Bio.Phylo trees do, and to share the same leaf names.

    Parameters:
    - tree1: Genetic trees.
    - tree2: Atmospheric, ecosystemic, and spatial trees.

    Returns:
    - ls: The Least Squares distance between the two sets of trees.
    """

    # Initialize the Least Squares distance to zero
    ls = 0.0

    # Extract the names of the terminal leaves (species) of the first tree
    leaves_name = [leaf.name for leaf in tree1.get_terminals()]

    # Iterate over each unordered pair of leaves exactly once
    for i, j in combinations(leaves_name, 2):
        # Calculate the distance between the pair of leaves in tree1
        d1 = tree1.distance(tree1.find_any(i), tree1.find_any(j))
        # Calculate the distance between the same pair of leaves in tree2
        d2 = tree2.distance(tree2.find_any(i), tree2.find_any(j))
        # Accumulate the squared difference, as in the LS equation
        ls += (d1 - d2) ** 2

    return ls
\end{lstlisting}

Interestingly, Euclidean distance and LS distance are more sensitive to quantitative differences in branch length and subtle tree topology, making them suitable for identifying detailed correlations between genetic fluctuations and those of habitat parameters \citep{czarna2006topology, choi2009comparison}. They can therefore be used to study fine divergences between trees, enabling nuanced identification of the effects of habitat attributes on the genetic structure of species \citep{czarna2006topology, choi2009comparison}.
Interestingly, Euclidean distance is more sensitive to quantitative differences in branch length and subtle tree topology, making it suitable for identifying detailed correlations between genetic fluctuations and those of habitat parameters \citep{czarna2006topology, choi2009comparison}. It can therefore be used to study fine divergences between trees, enabling nuanced identification of the effects of habitat attributes on the genetic structure of species \citep{czarna2006topology, choi2009comparison}.

As for the Robinson-Foulds distances (normalized or not), although they are widely applied in evolutionary biology, they are less sensitive to slight topological dissimilarities, which makes them less accurate for identifying fine correlations between genetics and habitat parameters \citep{smith2019bayesian, smith2020information}. This is due to their purely structural nature: they compare tree topologies and do not take branch lengths into account \citep{smith2019bayesian, smith2020information}.
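
The contrast between the two families of metrics can be illustrated with a minimal, self-contained sketch. It relies on assumptions that are not part of the original text: a toy tree encoding (nested tuples of the form `(children or leaf name, branch length)`), hypothetical helper names (`robinson_foulds_toy`, `path_length_distance`, `cherry`), and a path-length distance defined as the square root of the summed squared differences between leaf-to-leaf distances. It does not call or reproduce the \textit{aPhyloGeo} functions.

```python
from math import sqrt


def leaf_names(node):
    """Return the set of leaf names below a node."""
    content, _ = node
    if isinstance(content, str):
        return frozenset([content])
    return frozenset().union(*(leaf_names(child) for child in content))


def nontrivial_splits(tree):
    """Collect the non-trivial leaf bipartitions induced by internal edges."""
    all_leaves = leaf_names(tree)
    found = set()

    def visit(node):
        content, _ = node
        if isinstance(content, str):
            return frozenset([content])
        below = frozenset().union(*(visit(child) for child in content))
        if 1 < len(below) < len(all_leaves) - 1:
            # Store one canonical side so that each bipartition is counted once
            found.add(min(below, all_leaves - below, key=sorted))
        return below

    visit(tree)
    return found


def robinson_foulds_toy(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(nontrivial_splits(t1) ^ nontrivial_splits(t2))


def pairwise_path_lengths(tree):
    """Map each unordered leaf pair to the branch-length sum of its path."""
    pairs = {}

    def visit(node):
        content, _ = node
        if isinstance(content, str):
            return {content: 0.0}
        seen = {}
        for child in content:
            depths = {leaf: d + child[1] for leaf, d in visit(child).items()}
            for a, da in seen.items():
                for b, db in depths.items():
                    pairs[frozenset((a, b))] = da + db
            seen.update(depths)
        return seen

    visit(tree)
    return pairs


def path_length_distance(t1, t2):
    """Square root of the summed squared differences of leaf-pair distances."""
    p1, p2 = pairwise_path_lengths(t1), pairwise_path_lengths(t2)
    return sqrt(sum((p1[k] - p2[k]) ** 2 for k in p1))


def cherry(a, b, la=1.0, lb=1.0, stem=1.0):
    """Build a two-leaf subtree with the given branch lengths."""
    return ([(a, la), (b, lb)], stem)


t_ref = ([cherry("A", "B"), cherry("C", "D")], 0.0)
t_long = ([cherry("A", "B", lb=3.0), cherry("C", "D")], 0.0)  # same topology
t_swap = ([cherry("A", "C"), cherry("B", "D")], 0.0)          # other topology

print(robinson_foulds_toy(t_ref, t_long), path_length_distance(t_ref, t_long))
print(robinson_foulds_toy(t_ref, t_swap), path_length_distance(t_ref, t_swap))
```

In this toy setting, lengthening a single branch leaves the topology-based distance at zero while the path-length distance becomes positive, whereas swapping two leaves changes both.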

@@ -448,13 +400,13 @@ \section{Results}\label{results}
\begin{figure}[]
\centering
\includegraphics[width=0.7\textwidth]{figure5.png}
\caption{Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. The distance variations are studied to establish the potential dissimilarity between the 16S rRNA mitochondrial gene region of 62 Cumacea specimens and the variability in wind speed (m/s) at the start of sampling. \label{fig:fig6}}
\caption{Analysis of fluctuations in three distance metrics using multiple sequence alignment (MSA): a) Robinson-Foulds distance, b) normalized Robinson-Foulds distance, and c) Euclidean distance. The distance variations are studied to establish the potential dissimilarity between the 16S rRNA mitochondrial gene region of 62 Cumacea specimens and the variability in wind speed (m/s) at the start of sampling. \label{fig:fig6}}
\end{figure}

\begin{figure}[]
\centering
\includegraphics[width=0.7\textwidth]{figure6.png}
\caption{Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. These distances aim to determine the degree of dissimilarity between the 16S rRNA mitochondrial gene region of 62 Cumacea specimens and the variation in O\textsubscript{2} concentration (mg/L) at the sampling sites. \label{fig:fig7}}
\caption{Analysis of fluctuations in three distance metrics using multiple sequence alignment (MSA): a) Robinson-Foulds distance, b) normalized Robinson-Foulds distance, and c) Euclidean distance. These distances aim to determine the degree of dissimilarity between the 16S rRNA mitochondrial gene region of 62 Cumacea specimens and the variation in O\textsubscript{2} concentration (mg/L) at the sampling sites. \label{fig:fig7}}
\end{figure}

The divergence between the genetic sequences and two attributes, one climatic (wind speed (m/s) at the start of sampling) and the other environmental (O\textsubscript{2} concentration (mg/L)), is presented in Figure \ref{fig:fig6} and Figure \ref{fig:fig7}. All the attributes given in the first step of the \autoref{aPhyloGeo-software} section were analyzed, and their scripts and figures will soon be available under $img$ and $script$ on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}. We report only these two attributes, as they showed the most interesting mutation rates. Using the three metrics described in \autoref{metrics}, we noticed that the Euclidean distance is particularly sensitive to our data, showing considerable sequence variation at MSA positions 560-569 nucleotides (nt) (Euclidean distance: 0.86; see Figure \ref{fig:fig6}c) and 1210-1219 nt (Euclidean distance: 1.23; see Figure \ref{fig:fig7}c). Unlike the other windows for this metric in the two figures (see Figure \ref{fig:fig6}c and Figure \ref{fig:fig7}c), the fluctuations in wind speed (m/s) at the start of sampling and in O\textsubscript{2} concentration (mg/L) do not appear to explain the variations in these two specific sequences. This could indicate the absence of directional selection on these sequences by these habitat attributes, the presence of local selective pressures not considered in our analysis, or the predominance of other evolutionary factors (e.g., genetic drift or biotic interactions) over these two parameters for these two sequences. On the other hand, this may suggest that these two attributes influence divergent adaptation (i.e., genetic diversification) rather than convergent adaptation of these Cumacea, reflecting unique evolutionary responses to these specific ecological pressures.
These results are consistent with the aim of our study, which is to identify the Cumacea genetic region that diverges most as a function of habitat attribute, to determine whether this is due to divergent local adaptation or other evolutionary processes.