Frictionless JSON Data Packages for Life Science
Here is a demo of how dataflows:
- converts a wide table to a long table
- converts a long table to a wide table
- automatically generates a package descriptor in both cases
$ make install
$ python3 layouts/wide.py
$ python3 layouts/long.py
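To make the two layouts concrete, here is a minimal standard-library sketch of the wide-to-long and long-to-wide conversions. It is not the dataflows code in layouts/wide.py or layouts/long.py; the function names, the `offset` argument, the "sample"/"value" column names, and the toy values are all assumptions made for illustration.

```python
def wide_to_long(header, rows, offset):
    """Unpivot: keep the first `offset` columns, emit one row per (entity, sample)."""
    id_fields, sample_fields = header[:offset], header[offset:]
    long_header = id_fields + ["sample", "value"]
    long_rows = []
    for row in rows:
        for sample, value in zip(sample_fields, row[offset:]):
            long_rows.append(row[:offset] + [sample, value])
    return long_header, long_rows


def long_to_wide(long_header, long_rows):
    """Pivot: rebuild one column per sample from the trailing (sample, value) pair."""
    id_fields = long_header[:-2]
    samples = []  # sample names in first-seen order
    table = {}    # entity key -> {sample: value}; dicts preserve insertion order
    for row in long_rows:
        key, sample, value = tuple(row[:-2]), row[-2], row[-1]
        if sample not in samples:
            samples.append(sample)
        table.setdefault(key, {})[sample] = value
    header = id_fields + samples
    rows = [list(key) + [values.get(s) for s in samples]
            for key, values in table.items()]
    return header, rows


# Toy matrix (invented values): 2 descriptor fields, then one field per sample.
header = ["gene_id", "length", "s1", "s2"]
rows = [["g1", "100", "5", "7"], ["g2", "250", "0", "3"]]

long_header, long_rows = wide_to_long(header, rows, offset=2)
wide_header, wide_rows = long_to_wide(long_header, long_rows)
```

With `offset=2` the two descriptor fields are repeated on every long row, and pivoting the long table back should reproduce the original wide table.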
- GSE60450_Lactation-GenewiseCounts.csv: gene expression matrix with a simple layout, where the first 2 fields are molecular entity descriptors and the remaining fields hold read counts per sample (GSE60450).
- P1-SARS-CoV2_Virus_FPKM.csv: gene expression matrix with a simple layout, where the first field is the molecular entity identifier and the remaining fields hold FPKM measures per sample (file produced during the Elixir BioHackathon on COVID19).
- GSE52778_All_Sample_FPKM_Matrix.csv: gene expression matrix with a more complex structure, where the first 9 fields (columns [A-I]) are molecular entity descriptors, followed by 4 sets of 3 fields matching the 4 experimental conditions and 3 quantitation types (including FPKM measures) (columns [J-Y]), then the individual experimental conditions (per cell line) (columns [Z-AO]), from NCBI GEO experiment GSE52778.
- allow 'flattening of multi-row headers' as documented by @lilwinfree
!pip install tabulator
then:
from tabulator import Stream
with Stream('oxford.csv', headers=[1,2], multiline_headers_joiner='.') as stream:
    print(stream.headers)
['Gene Name', 'Sample_id1.mean', 'Sample_id1.standard dev', 'Sample_id2.mean', 'Sample_id2.standard dev']
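For readers without tabulator installed, the flattening idea can be sketched in plain Python. This is an illustrative approximation, not tabulator's implementation: `flatten_headers` is a hypothetical helper, and it assumes that a blank cell in an upper header row is a merged cell continuing the value to its left.

```python
def flatten_headers(header_rows, joiner="."):
    """Collapse several header rows into one list of joined column names."""
    filled = []
    for i, row in enumerate(header_rows):
        if i < len(header_rows) - 1:
            # Forward-fill blanks in every row except the last one, so a
            # label spanning several columns is repeated over each of them.
            current, out = "", []
            for cell in row:
                current = cell or current
                out.append(current)
            row = out
        filled.append(row)
    # Join the non-empty parts of each column from top to bottom.
    return [joiner.join(part for part in column if part)
            for column in zip(*filled)]


# Two header rows shaped like the example above (layout assumed).
header_rows = [
    ["Gene Name", "Sample_id1", "", "Sample_id2", ""],
    ["", "mean", "standard dev", "mean", "standard dev"],
]
flat = flatten_headers(header_rows, joiner=".")
print(flat)
```

Columns that only have a label in one row, such as Gene Name, keep that label without any joiner.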
- pivot/unpivot operation: expand on the existing code provided by @roll to enable an offset parameter (fixing the number of fields associated with molecular entity descriptors) and 2 additional parameters: one to obtain the number of experimental conditions or unique samples, and another listing the number of quantitation types. Important: possibly agree on a conventional separator to detect flattened headers, as in Sample_id1.standard dev. The separator could be selected from [".", "|", "__"].
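A rough sketch of what such an unpivot could look like is given below. It is not @roll's code or a dataflows processor: the function names, the `offset` and `separator` arguments, and the output column names are hypothetical, and the values in the toy row are invented.

```python
SEPARATORS = [".", "|", "__"]  # candidate separators listed above


def detect_separator(fields, candidates=SEPARATORS):
    """Return the first candidate present in every field, or None."""
    for sep in candidates:
        if fields and all(sep in field for field in fields):
            return sep
    return None


def unpivot_flattened(header, rows, offset, separator=None):
    """Unpivot flattened 'sample<sep>quantitation type' columns after `offset`."""
    value_fields = header[offset:]
    separator = separator or detect_separator(value_fields)
    if separator is None:
        raise ValueError("no common separator found in value fields")
    # Split on the last occurrence so sample names may contain the separator.
    pairs = [field.rsplit(separator, 1) for field in value_fields]
    if any(len(pair) != 2 for pair in pairs):
        raise ValueError("separator missing from at least one value field")
    # Unique samples and quantitation types, in first-seen order.
    samples = list(dict.fromkeys(pair[0] for pair in pairs))
    quant_types = list(dict.fromkeys(pair[1] for pair in pairs))
    long_header = header[:offset] + ["sample", "quantitation_type", "value"]
    long_rows = [
        row[:offset] + [sample, qtype, value]
        for row in rows
        for (sample, qtype), value in zip(pairs, row[offset:])
    ]
    return long_header, long_rows, samples, quant_types


header = ["Gene Name", "Sample_id1.mean", "Sample_id1.standard dev",
          "Sample_id2.mean", "Sample_id2.standard dev"]
rows = [["geneA", "10.0", "1.5", "12.0", "2.0"]]  # invented values

long_header, long_rows, samples, quant_types = unpivot_flattened(
    header, rows, offset=1)
print(len(samples), len(quant_types))
```

Splitting on the last occurrence of the separator is a design choice: it tolerates sample names that contain the separator, but assumes quantitation type names never do, which is one reason to agree on a separator such as "__" that is unlikely to appear in either.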