diff --git a/04_more_visualizations.ipynb b/04_more_visualizations.ipynb deleted file mode 100644 index fac9fdc..0000000 --- a/04_more_visualizations.ipynb +++ /dev/null @@ -1,676 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Outline\n", - "\n", - "Why this tutorial will be useful:\n", - "- This tutorial provides a basic introduction into working with tables\n", - "- You will solely need to change the code at one position marked with ```# <- adapt code here```, however you are encouraged to experiment further\n", - "\n", - "Main purpose of tutorial:\n", - "- Ensure that software works\n", - "\n", - "Within this tutorial you will:\n", - "- Load gene expression data\n", - "- Filter the gene expression data for human brains\n", - "- Create a time-resolved clustermap - a table where similar genes will be grouped together, and the gene expression represented by colors" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Loading file\n", - "The Python programming language essentially acts as a powerful glue that can stitch different tools together. One tool for working with tables is pandas. In order to use pandas as a tool, we will at first need to import it so that it becomes available to pyhton. Per convention, we give it the name pd" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To execute the code contained in the \"cell\" (grey box) below, press SHIFT and RETURN." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "pd comes with several functions, which will form specific tasks. To see all of them, click on the pd below and press SHIFT and TAB. \n", - "\n", - "A useful function, to import tables is .read_csv. Functions take function-specific arguments. To see the required arguments, click onto the function name (e.g.: .rad_csv, and press SHIFT and TAB and TAB (again))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "On the next line you will need to adjust the code, to point it toward the location on your computer, which contains \"chaperome_expression_rpkm_1_1_orthologs.csv\", a file, which you can download from canvas (Class-3-Thursday, Oct3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "my_table = pd.read_csv(\n", - " filepath_or_buffer='/Users/tstoeger/Box/chaperome_student_course/2020/material/chaperome_developmental_expression_rpkm_1_1_orthologs.csv' # <- adapt code here`\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "What just happened?\n", - "- You called pandas, through its name pd, to read the table with expression data.\n", - "- The equal sign defines the order in which the code is executed. At first the right side is executed, and the result of this execution is than stored in a \"variable\" on the left hand side. You could give this variable any variable name, as long as you always use the same name when talking about it within your code. In this case, we give the variable the name my_table" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Let us look at content, using the .head function\n", - "# Btw, # marks \"comments\", which will not be executed\n", - "my_table.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Well done! But something appears funny. There is a column called \"Unnamed: 0\" that apperas to duplicate the \"index\" (the left most values, which are likely displayed in bold on your computer). Let us drop \"Unnamed: 0\"." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "my_table = my_table.drop(\n", - " labels='Unnamed: 0', \n", - " axis='columns')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Have a look at the data by plotting the first few rows with .head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Filtering for human brain" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Let us at first inspect, which organisms are present\n", - "my_table['organism'].unique()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Which tissues are present?\n", - "my_table['tissue'].unique()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# let us define variables that state, which organism and tissue\n", - "# we would currently like to investigate (this will later facilitate\n", - "# looking at other tissues or organs)\n", - "tissue_of_interest = 'Brain'\n", - "organism_of_interest = 'Human'\n", - "\n", - "is_tissue_of_interest = my_table['tissue'] == tissue_of_interest\n", - "\n", - "is_organism_of_interest = my_table['organism'] == organism_of_interest\n", - "\n", - "use_these_records = is_tissue_of_interest & is_organism_of_interest\n", - "\n", - "filtered_table = my_table.loc[use_these_records, :]\n", - "\n", - "pivotted_table = filtered_table.pivot(\n", - " index='gene',\n", - " columns='age',\n", - " values='median_RPKM'\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pivotted_table.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This already looks good. But if you scroll the table to the right, you will notice something funny. The columns are not ordered in a meaningful way. It would appear nicer, if the columns wer ordered by timepoint, going from the earliest time to the latest. Let us at first see, which ages we have." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pivotted_table.columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Luckily the above is not too long. We could manually reorder them in a meaningful way." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "preferred_order = [\n", - " '4wpc', \n", - " '5wpc', \n", - " '7wpc', \n", - " '8wpc', \n", - " '9wpc',\n", - " '10wpc', \n", - " '11wpc', \n", - " '12wpc', \n", - " '13wpc', \n", - " '16wpc', \n", - " '18wpc', \n", - " '19wpc', \n", - " '20wpc',\n", - " 'newborn',\n", - " 'toddler',\n", - " 'infant', \n", - " 'school',\n", - " 'teenager',\n", - " 'youngAdult',\n", - " 'youngMidAge',\n", - " 'olderMidAge', \n", - " 'senior']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pivotted_table = pivotted_table.reindex(columns=preferred_order)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pivotted_table.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Visualize" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# note that the following line is specific to notebooks\n", - "# and will not be required on most computers. It serves\n", - "# as a safety mechanism to ensure that the figures will\n", - "# be shown in this notebook rather than elsewhere on\n", - "# your computer\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Seaborn is a powerfull tool for visualizaiton\n", - "\n", - "import seaborn as sns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let us visualize the data as a clustermap, which pairs similar samples. This is only one of many options of seaborn. For more ideas visit: https://seaborn.pydata.org/examples/index.html" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.clustermap(\n", - " data=pivotted_table)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Visualize nicer\n", - "\n", - "The above visualization has a few problems:\n", - "- genes are grouped by the absolute number of transcript molecules (RPKM) rather than their change over time\n", - "- Columns loose information on time" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To make genes with differen expression levels comparable, we need to normalize. One way of normalization is z-scoring (https://en.wikipedia.org/wiki/Standard_score), which normalizes sampels according to mean and median. Let us build a custom function, which can z-score our data. This function will call numpy, a toolbox for mathematical operations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def zscore(input_values):\n", - " z_scored = (input_values-np.mean(input_values)) / (np.std(input_values))\n", - " return(z_scored)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "z_scored_table = pivotted_table.apply(\n", - " func=zscore,\n", - " axis='columns'\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.clustermap(\n", - " data=z_scored_table)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Above already looks better, we see patterns emerging. However the colors are not nice. The nicest would be to use a divergent colormap where values close to 0 (those close to the mean of a gene) are white" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.clustermap(\n", - " data=z_scored_table,\n", - " cmap='bwr', # blue white red colormap\n", - " vmin=-3, # fix the lowest value that will be covered by the colormap,\n", - " vmax=3, # fix the highest value that will be covered by the colormap\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wow! The looks even better. Now we see clearly, which time points have a reduced (blue) or elevated (red) expression when compared to all other timepoints of the same gene.\n", - "\n", - "Yet there is more that we can do. For instance we can suppress the clustering of the columsn to keep our origianl order (which will correspond to the age)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.clustermap(\n", - " data=z_scored_table,\n", - " cmap='bwr', # blue white red colormap\n", - " vmin=-3, # fix the lowest value that will be covered by the colormap,\n", - " vmax=3, # fix the highest value that will be covered by the colormap\n", - " col_cluster=False # Avoid clustering columns\n", - ")\n", - "\n", - "\n", - "# Export as pdf\n", - "# import matplotlib.pyplot as plt\n", - "# import matplotlib as mpl\n", - "# mpl.rcParams['pdf.fonttype'] = 42\n", - "# mpl.rcParams['font.family'] = 'Arial'\n", - "# plt.savefig('/Users/tstoeger/Desktop/test.pdf', bbox_inches='tight')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Conclusion\n", - "\n", - "Hopefully this tutorial has served as a brief introduction into computationally working with tables and visualize teh expression of genes. It looks as if some chaperones become downregulated after birth, whereas others become upregulated. Changing the visualization and data normalization can reveal distinct patterns of your data. You are among the first people to know about birth separting two types of regulation of chaperones, and you just discoverd something new!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "my_table.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Add time course visualization" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this case want to create a lineplot where the x-axis is chronologic time, and y-axis is raw counts." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "x_axis = chronologic_ages = [\n", - " -40*7+7*4, #'4wpc', \n", - " -40*7+7*4, #'5wpc', \n", - " -40*7+7*7, #'7wpc', \n", - " -40*7+7*8, #'8wpc', \n", - " -40*7+7*9, #'9wpc',\n", - " -40*7+7*10, #'10wpc', \n", - " -40*7+7*11, #'11wpc', \n", - " -40*7+7*12, #'12wpc', \n", - " -40*7+7*13, #'13wpc', \n", - " -40*7+7*16, #'16wpc', \n", - " -40*7+7*18, #'18wpc', \n", - " -40*7+7*19, #'19wpc', \n", - " -40*7+7*20, #'20wpc',\n", - " 0, #'newborn',\n", - " 365*2, #'toddler',\n", - " 365*5, #'infant', \n", - " 365*10, #'school',\n", - " 365*16, #'teenager',\n", - " 365*25, #'youngAdult',\n", - " 365*35, #'youngMidAge',\n", - " 365*50, #'olderMidAge', \n", - " 365*70, #'senior']\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "gene_of_interest = 'CCT4'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pivotted_table.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "y_axis = pivotted_table.loc[gene_of_interest, :]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Import another libary for plotting\n", - "# This library - in contrast to seaborn - provides access to\n", - "# elemental parts of drawing images\n", - "import matplotlib.pyplot as plt " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.plot(x_axis, y_axis)\n", - "plt.xlabel('Age (days)', fontsize=20)\n", - "plt.ylabel('RPKM of '+ gene_of_interest, fontsize=20)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# More visualizations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(20, 10))\n", - "sns.boxenplot(\n", - " x='age',\n", - " y='rpkm',\n", - " data=z_scored_table.stack().to_frame('rpkm').rename_axis(['gene', 'age']).reset_index(),\n", - " order=preferred_order,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false - }, - "outputs": [], - "source": [ - "plt.figure(figsize=(20, 10))\n", - "sns.boxplot(\n", - " x='age',\n", - " y='rpkm',\n", - " data=z_scored_table.stack().to_frame('rpkm').rename_axis(['gene', 'age']).reset_index(),\n", - " order=preferred_order,\n", - " notch=True\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.distplot(z_scored_table.loc[:, '4wpc'], bins=np.arange(-4, 4, 0.5), color='blue')\n", - "sns.distplot(z_scored_table.loc[:, 'senior'], bins=np.arange(-4, 4, 0.5), color='red')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def get_tissue(organism_of_interest, tissue_of_interest):\n", - "\n", - "\n", - " is_tissue_of_interest = my_table['tissue'] == tissue_of_interest\n", - "\n", - " is_organism_of_interest = my_table['organism'] == organism_of_interest\n", - "\n", - " use_these_records = is_tissue_of_interest & is_organism_of_interest\n", - "\n", - " filtered_table = my_table.loc[use_these_records, :]\n", - "\n", - " pivotted_table = filtered_table.pivot(\n", - " index='gene',\n", - " columns='age',\n", - " values='median_RPKM'\n", - " )\n", - " \n", - " return pivotted_table" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from scipy.stats import spearmanr" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "first = 'Brain'\n", - "second = 'Cerebellum'\n", - "\n", - "first_tissue = get_tissue('Human', first)\n", - "second_tissue = get_tissue('Human', second)\n", - "\n", - "def corr(x):\n", - " c = spearmanr(range(len(x)), x)[0]\n", - " return c\n", - "\n", - "d = pd.merge(\n", - " first_tissue.apply(corr, 1).to_frame('brain').reset_index(),\n", - " second_tissue.apply(corr, 1).to_frame('cerebellum').reset_index()).set_index('gene') #.apply(lambda x: np.log10(x))\n", - "\n", - "plt.figure(figsize=(5, 5))\n", - "plt.scatter(d['brain'], d['cerebellum'])\n", - "plt.xlabel(first, fontsize=20)\n", - "plt.ylabel(second, fontsize=20)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}