This is a repository for a scientific project at the Institute of Bioinformatics, 2022.

AnnaToi01/EC_genes_BI_Project_2022

Search for homologs of egg-cell specific genes, study of their expression patterns and regulatory elements for the creation of effective constructs for genetic engineering

Students:
Elena Grigoreva (github, telegram)
Anna Toidze (github, telegram)

Supervisors:
Maria Logacheva, Skoltech
Artem Kasianov, IITP RAS

Project slides

Introduction

Choosing a promoter for Cas nucleases is an important step in genome editing. Constitutive promoters such as 35S are mostly used for genetic engineering in plants, as they drive high levels of expression in all cell types. However, using germline-specific promoters is a more effective approach, because it leads to greater homogeneity and to a decrease of target mutations across the generated cell lines.

EC1.1 and EC1.2 are A. thaliana genes from the Egg Cell family that are specifically and highly expressed in egg cells. Using the promoters of these genes was shown to significantly improve genome editing (Wang et al., 2015).

No similar promoters are known in other plants. However, since homologous genes can have similar functions, we hypothesized that EC homologs could have similar expression patterns and that their promoters could also be effective.

So the aim of our project is to find functional analogs of EC genes in different crops and model plants and explore their expression patterns and regulatory elements.

Table of Contents

  1. Pipeline
    1. Downloading genomes, annotations and amino acid sequences
    2. Searching for orthogroups containing EC1 gene family
    3. Alignment and phylogenetic analysis of these orthogroups
    4. Gene expression patterns analysis
    5. Searching for regulatory elements in upstream sites of the gene-orthologs
    6. Conclusion
  2. Literature
  3. Setup
  4. Software Requirements

Pipeline:

  • Download genomes, annotations and amino acid sequences;
  • Search for orthogroups containing EC1 gene family;
  • Alignment and phylogenetic analysis of these orthogroups;
  • Gene expression patterns analysis;
  • Search for regulatory elements in upstream sites of the gene-orthologs.

Downloading genomes, annotations and amino acid sequences

We downloaded genomes, annotations and amino acid sequences for 53 plant species. All species and the corresponding source links are listed in the table Species_table.xlsx. We used the Ensembl Plants (releases 52 and 53), PLAZA, MBKBASE and Phytozome databases.

Searching for orthogroups containing EC1 gene family

To find orthologs of the EC1 genes, we used OrthoFinder v2.5.4.

Before running OrthoFinder, we divided the species into several groups to reduce memory usage and speed up computation. The list of groups and species can be found in Groups.csv. The amino acid sequences for each group were put in a folder groups/i/, where i is the number of the group.

The script for running OrthoFinder on all groups is located in the ./OrthoFinder_launch/ folder.
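As an illustration of the per-group launch, here is a minimal Python sketch. It assumes the `orthofinder` executable is on PATH and that the groups/i/ layout is as described above. The function names and the thread count are hypothetical, and the actual script in ./OrthoFinder_launch/ may work differently.

```python
import subprocess
from pathlib import Path


def build_orthofinder_commands(groups_dir, threads=8):
    """Return one OrthoFinder command per numbered group folder."""
    # Only numeric sub-folders (groups/1/, groups/2/, ...) are treated as groups.
    group_dirs = sorted(
        (p for p in Path(groups_dir).iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    commands = []
    for group in group_dirs:
        # -f: folder with protein FASTA files; -t: number of parallel search threads.
        commands.append(["orthofinder", "-f", str(group), "-t", str(threads)])
    return commands


def run_all(groups_dir, threads=8):
    """Run the groups one after another so that only one group is in memory at a time."""
    for cmd in build_orthofinder_commands(groups_dir, threads):
        subprocess.run(cmd, check=True)
```

Calling `run_all("groups")` would process the groups sequentially. Building the commands separately from running them makes it easy to inspect them first.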

The code for analysis of the OrthoFinder output is located in ./OrthoGroups_analysis/OrthoFinder_results_analysis.ipynb.

According to the OrthoFinder results, the EC1.1 and EC1.2 genes belong to one orthogroup. We extracted all genes that were in the same orthogroup as the EC genes (201 genes) and examined their protein and nucleotide sequences, as well as annotations, for further analysis.
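The orthogroup lookup can be sketched roughly as follows. The sketch assumes the standard OrthoFinder Orthogroups.tsv layout: tab-separated, first column is the orthogroup ID, and one column per species holds comma-separated gene IDs. The function name and the gene IDs in the usage note are hypothetical placeholders; the notebook in ./OrthoGroups_analysis/ may do this differently.

```python
import csv


def find_orthogroups(tsv_handle, query_genes):
    """Map each query gene to the ID of the orthogroup that contains it."""
    found = {}
    reader = csv.reader(tsv_handle, delimiter="\t")
    next(reader, None)  # skip the header row (Orthogroup, species names)
    for row in reader:
        if not row:
            continue
        orthogroup, cells = row[0], row[1:]
        # Each species cell holds zero or more gene IDs separated by commas.
        genes = {g.strip() for cell in cells for g in cell.split(",") if g.strip()}
        for query in query_genes:
            if query in genes:
                found[query] = orthogroup
    return found
```

For example, `find_orthogroups(open("Orthogroups.tsv"), ["geneA", "geneB"])` returns a dictionary in which two query genes from the same orthogroup map to the same ID.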

The file with protein sequences of EC genes from different species is ./OrthoGroups_analysis/conc_protein_seq.fa.

Alignment and phylogenetic analysis of these orthogroups

The code for phylogenetic analysis and all resulting files are located in the ./Phylogenetic_analysis folder.

To align the protein sequences, we tried three aligners: Muscle, MAFFT and ClustalO. ClustalO showed the best coverage.

[Figure: MAFFT alignment]
[Figure: Muscle alignment]
[Figure: ClustalO alignment]
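The text does not define how coverage was measured. One simple way to compare alignments numerically is the fraction of non-gap cells in the alignment matrix. The sketch below uses that definition as an assumption, and its function names are hypothetical.

```python
def read_fasta(handle):
    """Read an aligned FASTA file into a dict {sequence name: aligned sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_coverage(aligned):
    """Fraction of non-gap characters over all cells of the alignment (assumed metric)."""
    total = sum(len(seq) for seq in aligned.values())
    if total == 0:
        return 0.0
    gaps = sum(seq.count("-") for seq in aligned.values())
    return (total - gaps) / total
```

Applying `alignment_coverage(read_fasta(open(path)))` to the output of each aligner gives one number per alignment, which can be compared alongside visual inspection.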

The resulting alignment was used to build a phylogenetic tree. The tree was constructed with IQ-TREE v2.2.0_beta by the maximum likelihood method with the ultrafast bootstrap approximation. Amborella trichopoda was chosen as the outgroup.
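A tree-building call of this kind could look roughly like the sketch below. It assumes the `iqtree2` executable is on PATH and uses the IQ-TREE 2 options `-s` (alignment), `-B` (ultrafast bootstrap replicates), `-o` (outgroup sequence name) and `--prefix` (output file prefix). The number of replicates, the file names and the outgroup sequence name are hypothetical; the exact command used in the project is in ./Phylogenetic_analysis.

```python
import subprocess


def build_iqtree_command(alignment, outgroup, prefix, bootstrap=1000):
    """Assemble an IQ-TREE 2 command for an ML tree with ultrafast bootstrap."""
    return [
        "iqtree2",
        "-s", alignment,       # input multiple sequence alignment
        "-B", str(bootstrap),  # ultrafast bootstrap replicates (assumed value)
        "-o", outgroup,        # name of the outgroup sequence in the alignment
        "--prefix", prefix,    # prefix for all output files (e.g. <prefix>.treefile)
    ]


def run_iqtree(alignment, outgroup, prefix, bootstrap=1000):
    """Run IQ-TREE and raise an error if it exits with a non-zero status."""
    subprocess.run(build_iqtree_command(alignment, outgroup, prefix, bootstrap), check=True)
```

The outgroup name must match a sequence header in the alignment, for example the ID of the Amborella trichopoda gene.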

The resulting tree can be found in the file ./Phylogenetic_analysis/clustalo_not_trimmed_iqtree_bootstrap.treefile.

To visualize the tree, we used the R package ggtree. The script for drawing the tree is ./Phylogenetic_analysis/tree_drawing.R.

[Figure: phylogenetic tree]

There are two clades with very high support on the tree, which roughly correspond to the EC1.1 and EC1.2 gene families. Inside the clades, genes are grouped according to species phylogeny. Together, these clades contain the majority of genes from both dicots and monocots. This could imply that the duplication leading to the emergence of EC1.1 and EC1.2 occurred in the early stages of the evolution of flowering plants, even before the divergence of dicots and monocots. The structure of the tree within each clade is more or less consistent with the phylogeny of flowering plants. However, genes from monocots are present in only one clade, EC1.1. This suggests that the common ancestor of the monocots likely lost the paralog corresponding to EC1.2.

The group outside these clades probably contains genes that are not EC1.1 or EC1.2 orthologs (and are placed there due to, e.g., long-branch attraction).

Some species have several orthologs of EC genes. The next step is to figure out which of these genes are most similar to the Arabidopsis genes in their expression pattern.

Gene expression patterns analysis

To determine the expression patterns of the orthologs we found, we searched for open-access transcriptomic data based on RNA-seq analysis. All databases used are listed in ./Transctriptional_databases.xlsx.

Because transcription data were absent or of poor quality for some species, only 20 species and 53 gene orthologs were taken for further analysis. According to their expression profiles, all genes were divided into three groups. The genes and their expression profiles are in the ./Gene_expression_data.csv file.

| Group number | Group name | Number of genes | Description |
|---|---|---|---|
| 1 | generative | 23 | As for EC genes: expression in female generative organs is the highest and specific |
| 2 | non-specific | 29 | Expression in female generative organs is present, but not only there and/or not the highest |
| 3 | vegetative | 5 | No expression in female generative organs; only in vegetative parts of plants |
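The grouping rule in the table can be approximated in code. The sketch below is only one possible formalization: the expression threshold `min_expr`, the tissue names and the function name are assumptions, and the groups in ./Gene_expression_data.csv may have been assigned by other criteria or by manual inspection.

```python
def classify_expression(profile, female_tissues, min_expr=1.0):
    """Assign a gene to an expression group.

    profile: dict {tissue name: expression value, e.g. TPM}
    female_tissues: set of tissue names counted as female generative organs
    min_expr: assumed threshold below which a gene counts as not expressed
    """
    female_max = max(
        (v for t, v in profile.items() if t in female_tissues), default=0.0
    )
    other_max = max(
        (v for t, v in profile.items() if t not in female_tissues), default=0.0
    )
    if female_max >= min_expr and other_max < min_expr:
        return "generative"    # expressed only in female generative organs
    if female_max >= min_expr:
        return "non-specific"  # expressed there, but in other tissues too
    if other_max >= min_expr:
        return "vegetative"    # expressed only outside female generative organs
    return "unclassified"      # no expression above the threshold anywhere
```

With this rule, a gene expressed in both ovules and leaves falls into the non-specific group regardless of which value is higher, which matches the "not only there and/or not the highest" description.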

Searching for regulatory elements in upstream sites of the gene-orthologs

The next step was to search for patterns shared by the genes within each of the three groups. From the JASPAR 2022 database, we took all known plant-specific motif sequences (656 motifs).

FIMO v5.4.1 was used to search for these motifs in the 500 bp upstream region of the orthologs. Nucleotide FASTA files with the 500 bp upstream sequences were grouped by expression pattern: ./Searching_motifs/generative_group1.fasta, ./Searching_motifs/non-specific_group2.fasta and ./Searching_motifs/vegetative_group3.fasta.

After that, heatmaps reflecting the presence of different motifs in the upstream sequences of each group were made.
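Extracting an upstream region requires handling the gene strand. The sketch below shows one way to do it, assuming 1-based inclusive gene coordinates as in GFF annotations and measuring the region from the annotated gene start. The function names are hypothetical, and the project may have extracted the sequences with other tools.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, gene_start, gene_end, strand, length=500):
    """Return up to `length` bp upstream of a gene, oriented 5'->3' on the gene strand.

    gene_start and gene_end are 1-based inclusive coordinates on the forward strand.
    """
    if strand == "+":
        # Upstream lies before the gene start on the forward strand.
        begin = max(0, gene_start - 1 - length)
        return chrom_seq[begin:gene_start - 1]
    if strand == "-":
        # Upstream lies after the gene end on the forward strand; flip it.
        stop = min(len(chrom_seq), gene_end + length)
        return reverse_complement(chrom_seq[gene_end:stop])
    raise ValueError("strand must be '+' or '-'")
```

If the gene is closer than `length` bp to the end of the sequence, the function returns a shorter region rather than failing.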

GENERATIVE
[Figure: heatmap_group1]

NON-SPECIFIC
[Figure: heatmap_group2]

VEGETATIVE
[Figure: heatmap_group3]

We also looked at motifs that are present in more than 50% of genes in group 1 (only female reproductive organs) and group 2 (female reproductive organs and other plant tissues).

| Motif | Group | Expressed during | Process |
|---|---|---|---|
| ZHD9 | generative | leaves, flowering, embryo, senescence | glucosinolate metabolic process |
| DOF3.6 | both | roots | enhances binding of OBF TFs to OCS |
| CDF5 | both | leaves, flowering, petal differentiation & expansion | links circadian oscillation and photoperiodism, accumulation delays flowering |
| DOF5.1 | both | vascular tissues | adaxial-abaxial polarity, auxin response |
| DOF5.8 | both | leaves | vein network formation, auxin response |
| Zm00001d027846 | non-specific | leaf protoplast | - |
| DOF3.4 | non-specific | leaves, embryo, senescence | cell wall modification, regulation of cell cycle, transcription, auxin response |
| DOF1 | non-specific | leaves, flowering, embryo | regulation of transcription |
| BPC1 | non-specific | pollen, leaves, flowering, embryo, senescence, petal | ovule identity |
| DOF4.2 | non-specific | fruit | cotyledon and seed coat development, shoot formation |
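The 50% filter described above can be sketched as follows. The input is assumed to be a list of (motif, sequence name) pairs, for example taken from the motif and sequence columns of a FIMO results table. The function name and the toy data in the usage note are hypothetical.

```python
from collections import defaultdict


def motifs_above_fraction(hits, n_sequences, min_fraction=0.5):
    """Return {motif: fraction of sequences with at least one hit} for frequent motifs.

    hits: iterable of (motif_id, sequence_name) pairs; repeated hits of one motif
          in the same sequence are counted once.
    n_sequences: total number of upstream sequences in the group.
    """
    sequences_per_motif = defaultdict(set)
    for motif, sequence in hits:
        sequences_per_motif[motif].add(sequence)
    result = {}
    for motif, sequences in sequences_per_motif.items():
        fraction = len(sequences) / n_sequences
        if fraction > min_fraction:  # strictly more than 50% by default
            result[motif] = fraction
    return result
```

With toy input `[("m1", "g1"), ("m1", "g2"), ("m1", "g3"), ("m2", "g1")]` and four sequences in the group, only `m1` passes the filter. Running the function once per expression group and intersecting the resulting keys gives the motifs shared between groups.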

So, no consistent patterns for motif presence were observed for any group.

Conclusion

No single motif accounts for the specific expression in generative organs. Possibly, it is only enabled by a specific combination of different motifs and for each gene this combination is unique.

Literature

  • Wang, ZP., Xing, HL., Dong, L. et al. Egg cell-specific promoter-controlled CRISPR/Cas9 efficiently generates homozygous mutants for multiple target genes in Arabidopsis in a single generation. Genome Biol 16, 144 (2015).

Setup

Install Anaconda if not already installed (see Instructions).

  1. Create a virtual environment via conda using the file conda_requirements.txt.
$ conda create --name <env_name> --file conda_requirements.txt
  2. Activate it.
$ conda activate <env_name>
  3. Install the necessary libraries.
$ pip install -r requirements.txt
  4. Follow the instructions for installing OrthoFinder v2.5.4 on its GitHub page.
  5. For phylogenetic analysis, install the following tools using conda (e.g. MUSCLE conda) or from their GitHub pages (e.g. MUSCLE GitHub):
    • MUSCLE - v5.1
    • MAFFT - v7.505
    • Clustal Omega - v1.2.3
    • Ugene - v41
    • IQ-TREE - v2.0.3
    • trimAl - v1.4.rev15

Software Requirements

  • Python 3.8
  • Ubuntu 21.04
  • Bash
  • R 4.1.2
