data_processing

Custom python package to assist in data processing and database interactions.

Dependencies

Matplotlib - plotting library
Matplotlib-Venn - creates venn diagrams
MySQL.Connector - connects to mySQL database
Pandas - dataframes
Paramiko - SSH connection
PyODBC - connects to MSSQL database
SciPy - scientific computing functions
SSHTunnel - SSH tunnel

Files

dna_functions.py

Common functions for dealing with DNA and RNA sequences.

Functions:

rev_comp - reverse complements the passed in DNA or RNA sequence
transcribe - transcribes the passed in DNA sequence to RNA
rev_trans - reverse transcribes the passed in RNA sequence to DNA
dna_to_rna - directly converts DNA to RNA by replacing T with U
rna_to_dna - directly converts RNA to DNA by replacing U with T

edit_db.py

Functions to connect to and edit a MSSQL or MySQL database. All of the specifications for the database connection should be included in a config.csv file. This file contains two columns: parameters and values. The parameters are:

default_db [optional]: default database to connect to, could also pass db_name as parameter of functions
host_ip: IP address of the database
host_user_name: user name on server
mysql_user_name [optional]: MySQL user name
db_prefix: if the database has a prefix, can add this, otherwise use empty string
key_loc [optional]: location of the ssh key, if ssh tunneling is necessary to connect

Functions:

connect_db - creates a conncection to a MySQL or MSSQL database
- sql_version: version of SQL of the database (ie MSSQL or MySQL)
- db_name [optional]: name of the database to connect to. If MySQL, db_prefix will be added automatically; default: None
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
close_db_connection - closes the connection to the database
- connection: database connection
- server [optional]: server connection tunnel; default: None
excute_with_error_check - executes the passed in string with error checking
- execute_str: SQL string to execute
- connection: connection to the database
- cursor: database cursor
- sql_version: version of SQL of the database (ie MSSQL or MySQL)
check_table_exists - checks if a table with the passed in name already exists in the database
- table_name: table name to check if exists
- cursor: database cursor
make_table - Creates a new table with the passed in columns. Several checks are run to ensure database integrity will not be unknowingly compromised (see flow chart)
- table_name: name of the table to create
- columns_dict: dictionary of column names with a list of conditions (ie 'NOT NULL') and datatypes
- other_conditions [optional]: a list of other conditions for the table, such as primary key and foreign keys
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
check_fk - checks if the table has a foreign key referencing it
- table: name of the table to check
- cursor: database cursor
- sql_version: version of SQL of the database (ie MSSQL or MySQL)
delete_table - drops table from the database after checking for foreign key dependencies
- table: name of the table to drop
- connection: connection to the database
- cursor: database cursor
- sql_version: version of SQL of the database (ie MSSQL or MySQL)
clear_table - truncates the table. This function ignores foreign key constraints, so make sure to clear referencing table first!
- table: name of the table to clear
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
add_column - adds a column to an existing table
- column_name: name of column to add
- column_type: type of the new column
- table: name of the table to add the column to
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
make_row - creates a new row with the passed in data for the given table
- insert_dict: dictionary of column name: value to insert
- table: name of the table to add the row to
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
make_many_rows - Instead of inserting a single row into the database, inserts up to 950 in one statement, looping until all rows are inserted. Returns True if successful.
- insert_dict: dictionary of column name: list of value to insert
- table: name of the table to add the rows to
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
update_row - Updates a row in the table
- update_dict: dictionary of column names and values to be updated
- condition_dict: dictionary of column names and values to identify the row to be updated
- table: name of table with row to be updated
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
update_many_rows - Updates multiple rows in the table
- update_dict: dictionary of column names and list of values to be updated
- condition_dict: dictionary of column names and list of values to identify the rows to be updated
- table: name of table with rows to be updated
- db_name [optional]: name of the database to connect to; default: None, will be replaced with config file default during connection
- sql_version [optional]: version of SQL of the database (ie MSSQL or MySQL); default: MSSQL
- firewall [optional]: if behind a firewall, will use ssh tunneling to connect to database; default: False
- key_loc [optional]: file location of the ssh key for accessing MySQL database; default: None, will be replaced with config file default during connection
- db_pwd [optional]: database user password; default: None
- key_pwd [optional]: password for the ssh key; default: None
format_value - formats the value for use in a SQL statement. Adds '' to strings and convertes python's None to NULL
- value: value to format

fold_change.py

Calculates the fold change and the log2 fold change.

Functions:

log_fold_change - Calculates the log2 fold change between experimental column(s) and the average of control columns and returns dataframe
- df: pandas dataframe
- control_list: list of column names of control samples
- experimental_col: column name or list of column names of experimental/treatment sample(s)
fold_change - Calculates the fold change between experimental column(s) and the average of control columns and returns dataframe
- df: pandas dataframe
- control_list: list of column names of control samples
- experimental_col: column name or list of column names of experimental/treatment sample(s)

make_venn.py

Creates venn diagrams.

Function:

make_diagram
- group_list: list of lists of values in the given group
- name_list: list of names for each group
- title: title for the venn diagram
- output_file: name of file to save diagram to
- return_names [optional]: returns the names/values in overlap of all groups; default False
- figsize [optional]: size (in inches) of figure; default (8,6)
- high_qual [optional]: returns a higher quality (800 dpi) figure; default False

normalize.py

The functions inside this file are divided by normalization method.

Median Normalization

med_norm - median normalizes a dataframe
- df: pandas dataframe of read counts

RPM Normalization

rpm_norm_df - read per million normalizes the dataframe
- df: pandas dataframe of read counts
rpm_norm_ser - reads per million normalizes the series
- ser: pandas series of read counts

RPKM Normalization

rpkm_norm - read per kilobase per million normalizes the reads
- df: pandas dataframe of read counts
- gene_len_file: file with lenght of genes
get_gene_len_gtf - adds together the exon lengths minus the overlap for each gene
- gtf_file: file with exon genomic locations in gtf format
- out_file: file to output the lengths to

TMM Normalization

tmm_norm - trimmed mean of M-values normalizes the reads. Was adapted from EdgeR's calcNormFactors function.
- df: pandas dataframe of read counts
- ref_samp: reference sample column name
- trim_fc_perc [optional]: percentage of top and bottom fold change values to trim; default 30
- trim_abs_perc [optional]: percentage of top and bottom absolute expression values to trim; default 5

read_counting.py

Counts the reads which align to mature miRNAs or sgRNAs.

functions:

create_mir_dict - Creates a dictionary of chromosomes which hold a positive strand and negative strand dictionary which each hold a dictionary with genomic locations +/- 5 bp as keys and mature miRNA IDs of the miRNAs which align to that location as values. This dictionary includes not only the genomic location of the mature miRNA, but also the genomic locations in hg38 which are identical to the mature miRNA. Also creates a dictionary with every matID as keys and 0 as values to intialize counter object
find_mir_match - Counts the reads which overlap with mature miRNAs. Returns the read counter and a QC counter.
- f_name: location of the sam file with the aligned reads
- chrom_dict: miRNA dictionary created by create_mir_dict
- mat_ids: dictionary of all mature read counts with 0 key values to initialize read counter. Created by create_mir_dict.
count_sgrna - Counts the reads which align to sgRNAs. Returns a pandas dataframe of read counts and a summary dataframe.
- fnameList: list of sam files with aligned reads from amplicon sequencing
- sampleNameList: list of sample names, one per file in the same order
- sgRNANameList: list of names of sgRNAs in the library

TrimAndAlign class (trim_align.py)

This class allows for NGS reads to be trimmed and aligned on a server which runs Btrim, Bowtie, Bowtie2 and TopHat. Creation of a TrimAndAlign object results in a connection to the server and creation of a log file. The connection and log file are closed when the object is deleted.

Functions:

fileToServer: moves file to server
- curLocation: current location of the file
- fname: name of the file on the server
- ext [optional]: file extension on server; default ""
fileFromServer: moves file from server
- locLocation: local directory to transfer file to
- fname: name of the file to transfer
- ext [optional]: file extension on server; default ""
trim: trims the reads using Btrim
- trimName: file with the sequences to be trimmed from the reads (details here)
- sampName: name of sample file to trim (.fastq)
- param [optional]: Btrim parameters to use (details here); default "-l 16"
makeIndex: creates a Bowtie2 index
makeIndex_bowtie: creates a Bowtie index
align: aligns reads using Bowtie2
align_bowtie: aligns reads using Bowtie
align_tophat: aligns mRNA-seq reads using TopHat
cleanUp: deletes all the files with sample name from the server

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data_processing		data_processing
readme_images		readme_images
.gitignore		.gitignore
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

data_processing

Dependencies

Files

Table of Contents

dna_functions.py

Functions:

edit_db.py

Functions:

fold_change.py

Functions:

make_venn.py

Function:

normalize.py

Median Normalization

RPM Normalization

RPKM Normalization

TMM Normalization

read_counting.py

functions:

TrimAndAlign class (trim_align.py)

Functions:

About

Uh oh!

Releases

Packages

Languages

nhu-github/data_processing

Folders and files

Latest commit

History

Repository files navigation

data_processing

Dependencies

Files

Table of Contents

dna_functions.py

Functions:

edit_db.py

Functions:

fold_change.py

Functions:

make_venn.py

Function:

normalize.py

Median Normalization

RPM Normalization

RPKM Normalization

TMM Normalization

read_counting.py

functions:

TrimAndAlign class (trim_align.py)

Functions:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages