Repackage xpore & add Yuk Kei's CI #231

Merged
merged 13 commits into from
Nov 7, 2024

@lrauschning lrauschning commented Nov 5, 2024

Basically what it says on the tin: this modernizes the packaging, as setup.py is considered deprecated now.
I also took the opportunity to clean up some parts of the code, update some dependencies (e.g. numpy and pandas to version 2), and add the CI Yuk Kei had in the upstream dev branch.

@yuukiiwa yuukiiwa changed the base branch from master to dev November 5, 2024 05:56
Collaborator

@yuukiiwa yuukiiwa left a comment

Hi Leon, (tagging you here @lrauschning)

Can you post screenshots of the following in the PR comments (just for record keeping) before I merge this into xpore's dev branch:

  • head diffmod.table from running the 1) master branch 2) your branch
  • wc -l diffmod.table from running the 1) master branch 2) your branch
  • md5sum /absolute/path/to/diffmod.table from running the 1) master branch 2) your branch
  • ls -lh dataprep from running the 1) master branch 2) your branch

Thanks!
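
For record keeping, the line count, checksum, and first rows requested above could also be gathered with a short script instead of screenshots. This is only a sketch: `summarize_table` is a hypothetical helper, not part of xpore, and it assumes the table is a plain UTF-8 text file.

```python
import hashlib
from pathlib import Path


def summarize_table(path, n_head=10):
    """Summarize a text table the way `wc -l`, `md5sum` and `head` would.

    Hypothetical helper for illustration only; not part of xpore.
    """
    data = Path(path).read_bytes()
    return {
        # `wc -l` counts newline characters, so do the same here.
        "lines": data.count(b"\n"),
        # Hex digest over the raw bytes, comparable to `md5sum` output.
        "md5": hashlib.md5(data).hexdigest(),
        # First n_head lines, like `head -n n_head`.
        "head": data.decode("utf-8").splitlines()[:n_head],
    }
```

Calling it once per branch, e.g. `summarize_table("/absolute/path/to/diffmod.table")`, gives two dictionaries that can be pasted into the PR and compared side by side.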

xpore/__init__.py (outdated, resolved)
xpore/scripts/dataprep.py (resolved)
@lrauschning
Author

Can have a look after lunch.
Maybe it's worth adding this to the CI as well, if it's a common benchmark example?

@lrauschning
Author

Here's the output:

$ head out-dev/diffmod.table
id,position,kmer,diff_mod_rate_KO_vs_WT,pval_KO_vs_WT,z_score_KO_vs_WT,mod_rate_KO-rep1,mod_rate_WT-rep1,coverage_KO-rep1,coverage_WT-rep1,mu_unmod,mu_mod,sigma2_unmod,sigma2_mod,conf_mu_unmod,conf_mu_mod,mod_assignment,t-test
ENSG00000114125,141738508,CCAGG,-0.041489868336121075,0.20201901785494158,-1.2758203904833725,0.023132820665607474,0.06462268900172854,134.00000000000006,68.0,94.94913647797989,85.87930929612114,14.510990123734313,45.51691240535464,0.5361200322781137,0.24474815107073977,lower,0.04212320399650848
ENSG00000114125,141745295,TTCTT,-0.03870717460869445,0.06756172250483479,-1.82791891048328,6.024023806942085e-06,0.038713198632501396,166.0,83.00000000000003,80.33651626309049,82.1394426454296,1.0011378750120743,7.324054063639398,0.6267971979757309,0.17471174042799487,higher,0.03567486852179156
ENSG00000114125,141738383,GCCTC,-0.08741863585377607,0.0030003789851071865,-2.9676990947043413,0.9125617570719916,0.9999803929257677,91.99999999999994,51.0,67.43631848054436,71.96066111714177,2.6937457313028075,3.535665915543322,0.8137359670570272,0.09109626873234387,higher,0.03980864694917237
ENSG00000114125,141738447,GGAAC,0.22070056273017538,2.8259009033624178e-06,4.683084397529611,0.28110704865449937,0.060406485924323995,140.0,73.0,122.26719347747873,114.51784832140928,19.497050305486347,43.431952534778574,0.9444809035049347,0.1854605379079559,lower,0.0782285612237539
ENSG00000114125,141745213,GGTGA,-0.04895918583970506,0.1977770783411129,-1.2879106196447487,0.05719457791414195,0.10615376375384701,163.99999999999991,84.99999999999996,93.52854547465117,106.21403521348968,14.266012104047485,88.63375736012213,0.9127732211493238,0.23029828683575682,higher,0.053252008424165165
ENSG00000114125,141738362,GGAGA,-0.09121796018197537,0.09762525461398275,-1.6564770219398404,0.1514171728441054,0.24263513302608078,157.00000000000003,82.99999999999997,114.69930865485345,128.02879655871783,15.18983298719446,18.558211585309266,0.40574518357597034,0.13073443352500366,higher,0.09733049212490888
ENSG00000114125,141738288,CTAGC,-0.30952318364200965,0.02177238486476282,-2.294315465542227,0.06994906160691057,0.3794722452489202,26.000000000000004,14.999999999999998,97.48325969880833,92.40900566603999,4.816181402612806,20.201261772764163,0.9613325862232216,0.1956951816890401,lower,0.020138073228277553
ENSG00000114125,141745258,GGTCC,0.15585004188569423,0.017461778055440018,2.3768375168690623,0.49689049374036737,0.34104045185467313,151.0,85.0,115.60430303831595,107.07396361506318,8.493780193457651,35.09559172284622,0.93622455135727,0.1374527958043858,lower,0.08332617377417274
ENSG00000114125,141745356,GAACA,0.09750274502151615,0.05347080001489446,1.9310990861337336,0.22843676967197132,0.13093402465045517,154.00000000000003,80.99999999999999,94.30848605931507,98.95816309697946,2.192556166536036,17.120340020982063,0.7254657205312687,0.31897375277795514,higher,0.052676191757621685
$ head out-2.1.0/diffmod.table
id,position,kmer,diff_mod_rate_KO_vs_WT,pval_KO_vs_WT,z_score_KO_vs_WT,mod_rate_KO-rep1,mod_rate_WT-rep1,coverage_KO-rep1,coverage_WT-rep1,mu_unmod,mu_mod,sigma2_unmod,sigma2_mod,conf_mu_unmod,conf_mu_mod,mod_assignment,t-test
ENSG00000114125,141745411,GGGAC,0.6603353324794022,3.758712658923191e-72,17.963565234455608,0.6603470969084745,1.1764429072257123e-05,166.00000000000003,85.0,117.41669369241039,120.33535678971649,8.510801119098604,2.684688440614087,0.7829390787300419,0.5187827079131453,higher,2.625272285355091e-05
ENSG00000114125,141745639,ATGCT,-0.05695677312897491,0.08950922720954697,-1.69799223599655,0.0334091757577756,0.09036594888675051,169.00000000000003,87.99999999999999,88.53752091396795,83.64748432092723,4.022203526774115,10.00696161123963,0.5507397130082655,0.33421436059515663,lower,0.033164091912884205
ENSG00000114125,141745619,AGTTT,0.1803102016876974,0.006385672530150852,2.7272908496965593,0.5887406429610382,0.4084304412733408,161.99999999999997,84.0,124.85592538961889,116.8351393006027,9.847210888300896,38.23200041021144,0.9450836303999456,0.31635128570389703,lower,0.08100058410008491
ENSG00000114125,141738461,ATGTG,-0.08283409437214623,0.07255412466757753,-1.795625329635554,0.08762065439905659,0.1704547487712028,159.0,87.00000000000001,92.49154277974945,81.51892160630808,6.526415689649383,13.982362915258369,0.1522511497646898,0.14124115744534568,lower,0.04777504211350653
ENSG00000114125,141745369,GATGA,0.11291199953096898,0.07462472673430996,1.782764009310659,0.6967268116976741,0.5838148121667052,169.00000000000003,87.99999999999999,81.03129193768758,77.40730067881358,4.026887258102841,2.5754615434858454,0.6809357519206776,0.599412675739899,lower,0.08139582473486048
ENSG00000114125,141745258,GGTCC,0.15585004188569423,0.017461778055440018,2.3768375168690623,0.49689049374036737,0.34104045185467313,151.0,85.0,115.60430303831595,107.07396361506318,8.493780193457651,35.09559172284622,0.93622455135727,0.1374527958043858,lower,0.08332617377417274
ENSG00000114125,141745423,AGCCT,0.17469232243692462,0.006829601408666953,2.705040734267908,0.5639179827648111,0.38922566032788647,171.0,87.00000000000001,109.72719812527666,113.54804421126003,2.2039059297948684,4.215094577052336,0.6522060289013658,0.4773000748362811,higher,0.017376232614522556
ENSG00000114125,141745412,GGACT,-0.8330423850231179,2.6283106205487696e-124,-23.713313212365144,0.1159927294214769,0.9490351144445948,166.99999999999997,78.00000000000003,123.61837581578368,117.61408644753972,6.01839910059391,18.125538290614394,0.9647605939786407,0.19439542774440577,lower,1.6419568752813653e-19
ENSG00000114125,141745693,TTTAA,0.008479709534743607,0.2568713336773576,1.1338169638625555,0.00849269621041437,1.2986675670761797e-05,150.99999999999986,77.0,85.0772030370484,76.54008633879455,9.54859049333411,105.0640824210423,0.79561546929721,0.0013211429271707978,lower,0.03999133079676653

$ wc -l out-dev/diffmod.table
130 out-dev/diffmod.table
$ wc -l out-2.1.0/diffmod.table
130 out-2.1.0/diffmod.table

$ md5sum out-dev/diffmod.table
0510288c2e711807dc18af8eb2f24f30  out-dev/diffmod.table
$ md5sum out-2.1.0/diffmod.table
f1c41d03dd52f830b3ded8e43256639d  out-2.1.0/diffmod.table


$ e data/HEK293T-METTL3-KO-rep1/dataprep-dev
.rw-r--r--   70 mat  5 Nov 16:23 data.index
.rw-r--r-- 1,7M mat  5 Nov 16:23 data.json
.rw-r--r--   89 mat  5 Nov 16:23 data.log
.rw-r--r--   52 mat  5 Nov 16:23 data.readcount
.rw-r--r--  12k mat  5 Nov 16:23 eventalign.index
$ e data/HEK293T-METTL3-KO-rep1/dataprep-2.1.0
.rw-r--r--   70 mat  6 Nov 13:29 data.index
.rw-r--r-- 1,7M mat  6 Nov 13:29 data.json
.rw-r--r--   89 mat  6 Nov 13:29 data.log
.rw-r--r--   52 mat  6 Nov 13:29 data.readcount
.rw-r--r--  12k mat  6 Nov 13:29 eventalign.index
$ e data/HEK293T-WT-rep1/dataprep-dev
.rw-r--r--   99 mat  6 Nov 13:26 data.index
.rw-r--r-- 952k mat  6 Nov 13:26 data.json
.rw-r--r--  111 mat  6 Nov 13:26 data.log
.rw-r--r--   68 mat  6 Nov 13:26 data.readcount
.rw-r--r-- 6,4k mat  6 Nov 13:26 eventalign.index
$ e data/HEK293T-WT-rep1/dataprep-2.1.0
.rw-r--r--   99 mat  6 Nov 13:29 data.index
.rw-r--r-- 952k mat  6 Nov 13:29 data.json
.rw-r--r--  111 mat  6 Nov 13:29 data.log
.rw-r--r--   68 mat  6 Nov 13:29 data.readcount
.rw-r--r-- 6,4k mat  6 Nov 13:29 eventalign.index

@lrauschning
Author

lrauschning commented Nov 6, 2024

The different md5sums seem to be due to a different ordering of the kmers; all of the lines are identical:

$ cat out-dev/diffmod.table out-2.1.0/diffmod.table | sort | uniq | wc -l
130

I suspect the ordering is non-deterministic due to multithreading.

Edit: Yes, after sorting, the files are exactly the same:

$ diff <(sort out-dev/diffmod.table) <(sort out-2.1.0/diffmod.table)
(no output)

I guess if we extend the CI to cover this test case, we should be careful to include some kind of sorting step.
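
One way such an order-independent check might look in a CI test is sketched below. `same_rows_any_order` is a hypothetical helper, not something that exists in xpore, and it compares rows as exact strings, so it assumes both runs print identical floating-point values (as they did here).

```python
from collections import Counter
from pathlib import Path


def same_rows_any_order(path_a, path_b):
    """Return True if two tables have the same header and the same data rows.

    Row order is ignored; duplicate rows are counted, so a row appearing
    twice in one file must appear twice in the other.
    Hypothetical helper for illustration only; not part of xpore.
    """
    lines_a = Path(path_a).read_text().splitlines()
    lines_b = Path(path_b).read_text().splitlines()
    if not lines_a or not lines_b:
        # Two empty files match; an empty and a non-empty file do not.
        return lines_a == lines_b
    # The header must stay on the first line; only data rows may be reordered.
    same_header = lines_a[0] == lines_b[0]
    return same_header and Counter(lines_a[1:]) == Counter(lines_b[1:])
```

For the files above this would be called as `same_rows_any_order("out-dev/diffmod.table", "out-2.1.0/diffmod.table")`. Using a `Counter` rather than a set keeps the check strict about duplicated rows, which `sort | uniq` would hide.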

@yuukiiwa yuukiiwa merged commit 94a3ead into GoekeLab:dev Nov 7, 2024
1 check passed