Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test long read giraffe for short reads #4378

Closed
wants to merge 1,177 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
1177 commits
Select commit Hold shift + click to select a range
d959100
New defaults for downsampling parameters
xchang1 May 21, 2024
dd532ac
Elide empty tree positions
adamnovak May 21, 2024
45cbf80
Figure out that I'm probably not getting the pin point for the offset…
adamnovak May 21, 2024
8353331
New parameters for illumina
xchang1 May 22, 2024
8247fe8
Merge remote-tracking branch 'upstream/lr-giraffe' into HEAD
xchang1 May 22, 2024
220e416
Use pin offset for deduplicating chains and making dotplots
adamnovak May 22, 2024
9f40a4e
Make sure parameter types match
xchang1 May 22, 2024
d3326dd
Change default min unique node fraction to 0 to turn it off
adamnovak May 22, 2024
74c0b4d
Merge remote-tracking branch 'xian/lr-giraffe' into lr-giraffe
adamnovak May 23, 2024
7f52ae4
Log chaining info when showing work
adamnovak May 24, 2024
4295c42
Add new illumina parameters
xchang1 May 26, 2024
c66f54d
Add a way to give points for non-gap bases in transitions in fragment…
adamnovak May 29, 2024
e0e91f2
Quiet deduplication debugging
adamnovak May 29, 2024
93db068
Add some wiggle room when comparing score and coverage of zipcode trees
xchang1 May 31, 2024
8e13416
Note when parts of a dumped chain go off of all the reference paths
adamnovak May 31, 2024
6777d8f
Use doubles for comparing ziptree scores
xchang1 Jun 3, 2024
b939bfb
Fix command line downsmapling parameters to match what they're doing
xchang1 Jun 3, 2024
4ee2592
Merge pull request #4303 from xchang1/fix-lr-params
xchang1 Jun 3, 2024
5a877d1
Try not doing chaining for short reads
xchang1 Jun 3, 2024
0b800e2
New illumina parameters
xchang1 Jun 4, 2024
213eb3c
Add a limit to how many fragments get turned directly into chains
xchang1 Jun 4, 2024
39a7462
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jun 5, 2024
d78c972
Set chainier R10 parameters and swap count and length presets for HiF…
adamnovak Jun 5, 2024
59f5395
Add new default parameters for illumina
xchang1 Jun 6, 2024
3d849a9
Adopt R10 parameters with lower fragment gap cost and no fragment sco…
adamnovak Jun 6, 2024
15f1027
Merge remote-tracking branch 'upstream/lr-giraffe' into sr-no-chaining
xchang1 Jun 7, 2024
26de73f
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Jun 7, 2024
1b5a89e
Add score to vg filter output
adamnovak Jun 7, 2024
b0e64a0
Make chain middle DP limit large and configurable, and allow sorting …
adamnovak Jun 7, 2024
9c1ed64
Move fragmenting out into a helper function
xchang1 Jun 7, 2024
00db723
Move chaining into helper function
xchang1 Jun 7, 2024
3eed45c
Move chain stats into helper function
xchang1 Jun 7, 2024
a21d707
Move alignment from chains into helper function
xchang1 Jun 7, 2024
189e5c2
Move picking mappings into a helper function and fix the multiplicities
xchang1 Jun 10, 2024
6285f9d
Fix multiplicity
xchang1 Jun 10, 2024
cd99c5a
Don't redo alignment if we find at least two good full length gapless…
xchang1 Jun 10, 2024
a54f1f7
Fix bug
xchang1 Jun 10, 2024
755872f
Track gapless extensions in the funnel
xchang1 Jun 10, 2024
f2a6e10
Merge remote-tracking branch 'origin/lr-giraffe' into HEAD
xchang1 Jun 10, 2024
e5fe49c
Merge branch 'exact-name' into lr-giraffe
adamnovak Jun 11, 2024
8748ba6
Clear refpos
xchang1 Jun 12, 2024
10023da
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jun 13, 2024
5a22368
Get chain scores to carry through operation counts
adamnovak Jun 13, 2024
883402b
I think this makes it track full length gapless extensions through th…
xchang1 Jun 13, 2024
0540ffb
Tie in a score-upper-bound-could-matter-for-mapq filter
adamnovak Jun 13, 2024
de97014
Don't count insertions against score upper bound
adamnovak Jun 13, 2024
45d8652
Quiet debugging
adamnovak Jun 13, 2024
50e2588
Don't do extra gapless extension if any previous tree had full length…
xchang1 Jun 14, 2024
a7c729b
Change mis-estimated score from an error to a warning
adamnovak Jun 14, 2024
2da39f1
Revert score bound filter and operation tracking but leave double ite…
adamnovak Jun 14, 2024
e8c4072
Apply item scale and bonus to first item in chain
adamnovak Jun 14, 2024
02e6a00
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jun 14, 2024
201e99b
Merge upstream
xchang1 Jun 14, 2024
aa85ca3
Fix merge error
xchang1 Jun 14, 2024
d975e0c
Update illumina parameters
xchang1 Jun 17, 2024
1013272
Merge remote-tracking branch 'upstream/lr-giraffe' into HEAD
xchang1 Jun 18, 2024
2b8fd39
New illumina parameters
xchang1 Jun 20, 2024
6cf4875
Skip alignment if we get at least one full length gapless extension (…
xchang1 Jun 20, 2024
fa5c26e
Make sure not to do extra fragmenting if we found a full-length gaple…
xchang1 Jun 20, 2024
dc32f58
Get multiplicity of trees properly for full-length extensions
xchang1 Jun 20, 2024
6046431
Teach vg inject to compute base-level alignment scores
adamnovak Jun 24, 2024
a1ff5d0
Merge remote-tracking branch 'origin/no-tail-surject' into lr-giraffe
adamnovak Jun 25, 2024
5d15607
Merge remote-tracking branch 'origin/no-tail-surject' into lr-giraffe
adamnovak Jun 25, 2024
7378234
Make surject generate tail softclips in the original graph space
adamnovak Jun 25, 2024
c029ba2
Apply R10 parameters to HiFi to search from
adamnovak Jun 26, 2024
ef2a138
Adopt HiFi parameters that look competitive with minimap2 on the 10k …
adamnovak Jun 26, 2024
1b52afe
Stop applying a score scale *and* a score window to HiFi reads
adamnovak Jun 27, 2024
ac748fa
Add debugging code to find and dump problems for adjacent indels
adamnovak Jun 28, 2024
c380167
Report offending alignment better
adamnovak Jun 28, 2024
74ed7bf
Remove old check function and add test
adamnovak Jun 28, 2024
5c91a96
Check for validity around simplify
adamnovak Jun 28, 2024
5d15bed
Isolate problem in a small unit test
adamnovak Jun 28, 2024
0555273
Stop using generic mapping cutter and just move insertions
adamnovak Jun 28, 2024
3277498
Quiet debugging
adamnovak Jun 28, 2024
d8803f0
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Jul 1, 2024
22c6e69
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 1, 2024
3a1cd38
Fix Mac build
adamnovak Jul 1, 2024
accbe34
Fix compatibility with Mac sed by using a nonempty backup extension
adamnovak Jul 1, 2024
88c3ff6
Add in offset even when the alignment starts past the end of the anch…
adamnovak Jul 1, 2024
a6dfc43
Add --max-graph-scale/-g option to surject and make giving up scale w…
adamnovak Jul 3, 2024
01ac6ee
Check zipcode tree distances in snarls properly
xchang1 Jul 9, 2024
dd468e8
Turn off random tests
xchang1 Jul 9, 2024
0d6723d
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Jul 9, 2024
fcdff75
Update to current libbdsg
adamnovak Jul 9, 2024
8d0878a
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 9, 2024
b0ece9b
Make crashes know what subcommand we tried to run
adamnovak Jul 9, 2024
4d149ed
Prefer to link vg's own built htslib dylib over the system one on Mac
adamnovak Jul 9, 2024
6d7f4b6
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Jul 9, 2024
a3b296e
Use libvgio that doesn't multithread decompression and make sure we a…
adamnovak Jul 10, 2024
93c3f7a
Use payload less for old short read code path
xchang1 Jul 10, 2024
783e452
Use the distance index a bit less
xchang1 Jul 10, 2024
425c4cf
Stop making new decoders
xchang1 Jul 10, 2024
6e8f889
Get parent depth from zipcodes but idk if its any faster
xchang1 Jul 10, 2024
7fc2f1f
Save chain component
xchang1 Jul 10, 2024
4620f16
Make docs build not have untracked files laying around in submodules
adamnovak Jul 10, 2024
da73e20
Add code reference
adamnovak Jul 10, 2024
89e74ac
Make crash code actually include stringstream
adamnovak Jul 10, 2024
45d4987
Cache more stuff in the seed from the zipcodes
xchang1 Jul 11, 2024
d8c6e18
Try getting all values at once but it doesnt work so Im going to try …
xchang1 Jul 12, 2024
0244761
Get all cache values from zipcodes once
xchang1 Jul 13, 2024
729c322
Dont use distance index for getting is_root_snarl
xchang1 Jul 13, 2024
33b7e87
Add the zipcode to the clustering problems and use to get parents
xchang1 Jul 13, 2024
412dcec
Acually use zipcode to get handle
xchang1 Jul 13, 2024
2bdcb1a
Stop expecting the parent of a snarl to be a root
xchang1 Jul 13, 2024
ea7a0ed
Use zipcodes for getting chain parent
xchang1 Jul 13, 2024
6a693ad
Use zipcodes for distances
xchang1 Jul 14, 2024
536c7d7
Add distance check
xchang1 Jul 14, 2024
7fa624d
Don't copy seed to cluster
xchang1 Jul 15, 2024
83d1ffa
Fix is_root_snarl
xchang1 Jul 15, 2024
f79ad09
Reserve space in some vectors
xchang1 Jul 15, 2024
276d60b
Take out unused map
xchang1 Jul 15, 2024
5e4fea4
Take out another unused hash set
xchang1 Jul 15, 2024
acff6f6
Reserve more and fix indenting
xchang1 Jul 15, 2024
310beb7
Reserve memory for zipcode
xchang1 Jul 16, 2024
73477be
Reserve memory for decoders
xchang1 Jul 16, 2024
f5d0c4b
Use zipcode for snarl length
xchang1 Jul 16, 2024
a2dc51f
Reserve more memory
xchang1 Jul 16, 2024
c912afb
Add cluster checking
xchang1 Jul 17, 2024
ee5b6b6
Find minimizer hit count by walking through minimziers ordered by rea…
xchang1 Jul 17, 2024
a7bb77c
Revert "Find minimizer hit count by walking through minimziers ordere…
xchang1 Jul 17, 2024
02a3369
Add hacky way of dealing with multicomponent chains
xchang1 Jul 18, 2024
a410321
Revert "Add hacky way of dealing with multicomponent chains"
xchang1 Jul 19, 2024
3b536dd
Merge commit '094b956' into lr-giraffe
adamnovak Jul 19, 2024
381001c
Merge commit '755a634' into lr-giraffe
adamnovak Jul 19, 2024
11fc884
Merge commit 'dfac868' into lr-giraffe
adamnovak Jul 19, 2024
adc6476
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 19, 2024
e9aac7e
Merge commit 'b55be13' into lr-giraffe
adamnovak Jul 19, 2024
e551487
Merge remote-tracking branch 'origin/unlimited-anchors' into lr-giraffe
adamnovak Jul 19, 2024
eb1e302
Make a string identifier for a snarl tree node
xchang1 Jul 19, 2024
b177d07
Fix zipcode identifiers
xchang1 Jul 21, 2024
636abe8
Add chain component to zipcodes
xchang1 Jul 22, 2024
a4da073
Use zipcodes to get chain component for nodes
xchang1 Jul 22, 2024
6e287f5
Add get_chain_component to zipcodes
xchang1 Jul 22, 2024
6565acb
Get chain component from zipcodes for payload
xchang1 Jul 22, 2024
c476d12
Use zipcode chain component
xchang1 Jul 22, 2024
9e2153f
Add failing unit test
xchang1 Jul 22, 2024
2c49a23
Get chain component for irregular snarls
xchang1 Jul 22, 2024
ce9b0ad
Fix debug code
xchang1 Jul 22, 2024
da5f891
Add chain component count to chains zipcodes
xchang1 Jul 23, 2024
3b4855e
Add unit test for looping zipcodes and fix bugs
xchang1 Jul 23, 2024
d135aab
Use zipcodes for chain component in clustering
xchang1 Jul 23, 2024
ddef07f
Use regular snarls to skip finding distances
xchang1 Jul 24, 2024
9761748
Add external connectivity to zipcodes
xchang1 Jul 24, 2024
74d4122
Take out end_in net_handle_t
xchang1 Jul 24, 2024
804c44d
Fix payload with new values
xchang1 Jul 24, 2024
18b4e7e
Add external connectivity for root nodes
xchang1 Jul 24, 2024
56fbd52
Use zipcode for external connectivity
xchang1 Jul 24, 2024
f2bd83a
Don't use distance index for ordering children in chains
xchang1 Jul 24, 2024
07de5de
Fix orientation for simple snarls since we took them out
xchang1 Jul 24, 2024
5203413
Start taking out net_handle_t's from the payload
xchang1 Jul 24, 2024
461de64
Take out grandparent handle
xchang1 Jul 24, 2024
8655513
Add net identifier strings but don't actually use them
xchang1 Jul 25, 2024
5170df0
Use net identifiers as keys for all the lookups
xchang1 Jul 25, 2024
e541b48
Mostly take out finding net handles until they're needed
xchang1 Jul 25, 2024
985ea32
Fix identifiers but its not working
xchang1 Jul 25, 2024
9b36204
Fix parent
xchang1 Jul 26, 2024
d23653a
Fix getting the node identifier
xchang1 Jul 26, 2024
6e09743
Fix getting handle to trivial chain
xchang1 Jul 26, 2024
e433831
Include chain component in identifier
xchang1 Jul 26, 2024
fd9a049
Fix getting net handle for child of root snarl
xchang1 Jul 26, 2024
27f47cf
Fix clustering the root snarl
xchang1 Jul 27, 2024
7e74771
Get chain sorting values for snarls earlier and sort properly for two…
xchang1 Jul 29, 2024
a4eadd3
Get parent from children
xchang1 Jul 29, 2024
bc642aa
Take out distance index when not used
xchang1 Jul 29, 2024
1a32937
Take the distance index out of more functions that don't use it
xchang1 Jul 29, 2024
2a77f66
Use distance index for chain values less
xchang1 Jul 29, 2024
d2fc0a3
turn off debug
xchang1 Jul 29, 2024
82f714e
Get identifier at the same time as payload
xchang1 Jul 30, 2024
0adfea0
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jul 30, 2024
aa87f36
Undo using identifier strings as keys and fix some bugs
xchang1 Jul 31, 2024
b9a2010
Put zipcode and decoder together
xchang1 Jul 31, 2024
a18edec
Fix unit tests
xchang1 Jul 31, 2024
831f231
Fix reserving decoder length
xchang1 Jul 31, 2024
2923cde
Add an int vector that uses a minimal bit width for storing stuff
xchang1 Jul 31, 2024
2bbd62b
Merge branch 'read-quietly' into lr-giraffe
adamnovak Jul 31, 2024
595cafb
Use new int vectors for zipcodes but it doesn't work yet
xchang1 Aug 1, 2024
c5385a3
Merge remote-tracking branch 'origin/low-cplx-context' into lr-giraffe
adamnovak Aug 1, 2024
de6c76f
Fix zipcodes
xchang1 Aug 2, 2024
7aa1fe7
Revert using minint vectors
xchang1 Aug 3, 2024
39ed6c8
Make decoder use fewer bits
xchang1 Aug 3, 2024
116dc01
Only store zipcodes in a separate file
xchang1 Aug 3, 2024
eace413
Serialize zipcode and decoder
xchang1 Aug 3, 2024
a9dffbe
Revert "Serialize zipcode and decoder"
xchang1 Aug 3, 2024
05480d0
Revert "Only store zipcodes in a separate file"
xchang1 Aug 3, 2024
0207c82
Undo putting zipcode and decoder together
xchang1 Aug 3, 2024
69d48f0
Fix typo
xchang1 Aug 3, 2024
e12b23a
Use distance index less for chain values
xchang1 Aug 3, 2024
f4f10f3
Put decoder back with zipcode
xchang1 Aug 5, 2024
d72d232
Serialize decoder
xchang1 Aug 5, 2024
c4a2e48
Actually serialize the decoder
xchang1 Aug 5, 2024
7fd62d2
Put decoder into zipcode payload
xchang1 Aug 5, 2024
020cbb4
Actually serialize the decoder
xchang1 Aug 5, 2024
06a6b04
Add an unpacked zipcode but it doesn't compile yet
xchang1 Aug 6, 2024
d5a9e4d
Get everything to compile
xchang1 Aug 6, 2024
dfde735
Add unit tests and fix bug for unpacked zip codes
xchang1 Aug 6, 2024
1ecc8ae
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Aug 7, 2024
7cc9f4c
Use the unpacked zipcode for clustering instead of the old payload
xchang1 Aug 7, 2024
2c955f9
Use the length of the last chain component as the length of a multico…
xchang1 Aug 8, 2024
cf6b5be
Use unpacked zipcode in SnarlTreeNodeProblem instead of zipcode
xchang1 Aug 8, 2024
e5fe8b6
Reserve memory for unpacked zipcode
xchang1 Aug 9, 2024
8ccb110
Reserve memory
xchang1 Aug 9, 2024
a73b80e
Take out chain component
xchang1 Aug 9, 2024
4286171
Fix bug getting best distanecs
xchang1 Aug 9, 2024
434a861
Make vg filter see empty list annotations as not there
adamnovak Aug 9, 2024
007df92
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Aug 9, 2024
dbc77d7
Merge remote-tracking branch 'origin/left-align-better' into lr-giraffe
adamnovak Aug 9, 2024
5beb1be
Get the length of a cyclic chain
xchang1 Aug 10, 2024
a0cbb23
Turn off debug
xchang1 Aug 10, 2024
d7e2553
Get values from unpacked zipcode for children of problems
xchang1 Aug 10, 2024
98a5518
Take out unused chain component ints
xchang1 Aug 11, 2024
f5ad687
Reserve memory when getting zipcodes from payload
xchang1 Aug 12, 2024
d02f4f7
Remove unused ints and reserve memory for unpacked zipcode
xchang1 Aug 12, 2024
2762b05
Reserve less memory because it made it slower
xchang1 Aug 12, 2024
f5e5638
Undo unpacking zipcode to get back to 020cbb
xchang1 Aug 13, 2024
c172816
Fix bug getting minimum distances
xchang1 Aug 13, 2024
aff0cc6
Get chains last component length and get chain length from zipcode
xchang1 Aug 13, 2024
7ed8f6e
Make a decoded code type (but not for roots) and use it for building …
xchang1 Aug 13, 2024
a600bbc
Add getters and setters for unpacked codes
xchang1 Aug 13, 2024
4b994b6
Move build info headers from phony targets to top-level shell invocat…
adamnovak Aug 13, 2024
5b33b5e
Move build info headers from phony targets to top-level shell invocat…
adamnovak Aug 13, 2024
41be671
Move version headers into a directory that will always exist
adamnovak Aug 13, 2024
cc418dc
Merge branch 'update-version-fast' into lr-giraffe
adamnovak Aug 13, 2024
aaf3cb9
Remove extra tabs
adamnovak Aug 13, 2024
bd32d15
Remove extra tabs
adamnovak Aug 13, 2024
3cbd7be
Add functions for interpreting one level of the zipcode
xchang1 Aug 13, 2024
72025a0
Add unit tests for unpacked codes
xchang1 Aug 13, 2024
e00dd49
Merge remote-tracking branch 'origin/left-align-better' into lr-giraffe
adamnovak Aug 13, 2024
ae6b91a
Cut direct, broken static flag to version header dependency
adamnovak Aug 13, 2024
7e23f9d
Cut direct, broken static flag to version header dependency
adamnovak Aug 13, 2024
096000a
Use chain_minimum_length in zipcode
xchang1 Aug 14, 2024
0356878
Make unacking const
xchang1 Aug 14, 2024
dca72e7
Used unpacked zipcode for getting chain values
xchang1 Aug 14, 2024
01e7b4d
Use unpacked zipcode for getting snarl values and fix but I missed fi…
xchang1 Aug 14, 2024
62f6c38
Get parent with distance index if its faster
xchang1 Aug 14, 2024
b0c0234
Make a map from connected component number to net handle
xchang1 Aug 14, 2024
a4410a9
Make map from node id to net handle
xchang1 Aug 14, 2024
45cba14
Revert "Make map from node id to net handle"
xchang1 Aug 14, 2024
17120fb
Reserve memory for children
xchang1 Aug 14, 2024
b0f1f70
Revert "Reserve memory for children"
xchang1 Aug 14, 2024
083f2b3
Set up Docker build and CI to get the version build right
adamnovak Aug 14, 2024
d173607
Set up Docker build and CI to get the version build right
adamnovak Aug 14, 2024
1248b0f
Remove wayward tabs
adamnovak Aug 14, 2024
2ce314b
Remove wayward tabs
adamnovak Aug 14, 2024
08218c6
Report errors better for vg inject failures
adamnovak Aug 14, 2024
ae5c467
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe-sr
xchang1 Aug 16, 2024
68a7716
Take out random failed unit test
xchang1 Aug 16, 2024
aa3a4e7
Don't complain about not having an environment version file on initia…
adamnovak Aug 16, 2024
b9630c0
Change Gitlab Docker builds to use make version
adamnovak Aug 16, 2024
0e7d66f
Merge branch 'update-version-fast' into lr-giraffe
adamnovak Aug 16, 2024
50c99a6
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe-sr
xchang1 Aug 16, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@ include/
obj/
cpp/
.pre-build
src/vg_git_version.hpp
src/vg_environment_version.hpp
*.o
vg
*.a
Expand Down
5 changes: 2 additions & 3 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -122,8 +122,7 @@ build-job:
- PLATFORMS=linux/amd64
- THREADS=8
- DOCKER_TAG=ci-${CI_PIPELINE_IID}-${CI_COMMIT_SHA}
- make include/vg_git_version.hpp
- cat include/vg_git_version.hpp
- make version
# Connect so we can upload our images
- docker login -u "${CI_REGISTRY_USER}" -p "${CI_REGISTRY_PASSWORD}" "${CI_REGISTRY}"
# Note that A LOCAL CACHE CAN ONLY HOLD ONE TAG/TARGET AT A TIME!
Expand Down Expand Up @@ -165,7 +164,7 @@ production-build-job:
- THREADS=8 # Oversubscribe since the ARM build will take way longer anyway.
# Determine what we should be tagging vg Dockers as. If we're running on a Git tag we want to use that. Otherwise push over the tag we made already.
- if [[ ! -z "${CI_COMMIT_TAG}" ]]; then DOCKER_TAG="${CI_COMMIT_TAG}" ; else DOCKER_TAG="ci-${CI_PIPELINE_IID}-${CI_COMMIT_SHA}"; fi
- make include/vg_git_version.hpp
- make version
# Make sure ARM emulation is available.
- if [[ "${CI_BUILDKIT_DRIVER}" != "kubernetes" ]] ; then docker run --privileged --rm tonistiigi/binfmt --install all || true ; fi
# TODO: deduplicate this code with normal build above
Expand Down
16 changes: 6 additions & 10 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,11 @@ FROM base AS build
ARG THREADS=8
ARG TARGETARCH

# If you didn't `make version` berfore building the Docker, you can provide a
# version value here to claim to be.
ARG VG_GIT_VERSION
ENV VG_GIT_VERSION=${VG_GIT_VERSION:-unknown}

RUN echo build > /stage.txt

RUN apt-get -qq -y update && \
Expand Down Expand Up @@ -56,22 +61,13 @@ RUN find . -name CMakeCache.txt | xargs rm -f
COPY Makefile /vg/Makefile
RUN . ./source_me.sh && CXXFLAGS="$(if [ -z "${TARGETARCH}" ] || [ "${TARGETARCH}" = "amd64" ] ; then echo " -march=nehalem "; fi)" CFLAGS="$(if [ -z "${TARGETARCH}" ] || [ "${TARGETARCH}" = "amd64" ] ; then echo " -march=nehalem "; fi)" make -j $((THREADS < $(nproc) ? THREADS : $(nproc))) deps

# Bring in the sources, which we need in order to build
# Bring in the sources, which we need in order to build.
COPY src /vg/src

# Build all the object files for vg, but don't link.
# Also pass the arch here
RUN . ./source_me.sh && CXXFLAGS="$(if [ -z "${TARGETARCH}" ] || [ "${TARGETARCH}" = "amd64" ] ; then echo " -march=nehalem "; fi)" make -j $((THREADS < $(nproc) ? THREADS : $(nproc))) objs

# Bring in any includes we pre-made, like the git version, if present
COPY include /vg/include

# Make sure version introspection is up to date
RUN rm -f obj/version.o && . ./source_me.sh && CXXFLAGS="$(if [ -z "${TARGETARCH}" ] || [ "${TARGETARCH}" = "amd64" ] ; then echo " -march=nehalem "; fi)" make -j $((THREADS < $(nproc) ? THREADS : $(nproc))) obj/version.o

# Announce the version file, which must exist by now
RUN ls /vg/include && cat /vg/include/vg_git_version.hpp

# Do the final build and link, knowing the version. Trim down the resulting binary but make sure to include enough debug info for profiling.
RUN . ./source_me.sh && CXXFLAGS="$(if [ -z "${TARGETARCH}" ] || [ "${TARGETARCH}" = "amd64" ] ; then echo " -march=nehalem "; fi)" make -j $((THREADS < $(nproc) ? THREADS : $(nproc))) static && strip -d bin/vg

Expand Down
198 changes: 113 additions & 85 deletions Makefile

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion deps/dozeu
Submodule dozeu updated 2 files
+87 −8 dozeu.h
+65 −58 example.c
2 changes: 1 addition & 1 deletion deps/sublinear-Li-Stephens
Submodule sublinear-Li-Stephens updated 1 files
+28 −28 makefile
4 changes: 4 additions & 0 deletions doc/publish-docs.sh
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,10 @@ COMMIT_AUTHOR_EMAIL="[email protected]"
# We expect GITLAB_SECRET_FILE_DOCS_SSH_KEY to come in from the environment,
# specifying the private deploy key we will use to get at the docs repo.

# Make sure no submodules have untracked files from caching
# See <https://gist.github.com/nicktoumpelis/11214362#file-repo-rinse-sh-L2>
git submodule foreach --recursive git clean -xfd

# Find all the submodules that Doxygen wants to look at and make sure we have
# those.
cat Doxyfile | grep "^INPUT *=" | cut -f2 -d'=' | tr ' ' '\n' | grep "^ *deps" | sed 's_ *\(deps/[^/]*\).*_\1_' | sort | uniq | xargs -n 1 git submodule update --init --recursive
Expand Down
60 changes: 52 additions & 8 deletions scripts/giraffe-facts.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,8 +98,10 @@ def parse_args(args):
parser = argparse.ArgumentParser(description=__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter)

parser.add_argument("--input", type=argparse.FileType('r'), default=sys.stdin,
help="line-oriented JSON GAM to process")
parser.add_argument("input", type=str,
help="GAM to process")
parser.add_argument("--vg", type=str, default="vg",
help="vg binary to use")
parser.add_argument("outdir",
help="directory to place output in")

Expand Down Expand Up @@ -286,11 +288,44 @@ def add_in_stats(destination, addend):

def read_line_oriented_json(lines):
"""
For each line in the given stream, yield it as a parsed JSON object.
For each line in the given iterable of lines (such as a stream), yield it as a parsed JSON object.
"""

for line in lines:
yield json.loads(line)
line = line.strip()
if len(line) > 0:
yield json.loads(line)


def read_read_views(vg, filename):
"""
Given a vg binary and a filename, iterate over subsets of the parsed read dicts for each read in the file.

The subsets will have the annotation and time_used fields.
"""

# Extract just the annotations and times of reads as JSON, with a # header
# We don't know all the annotation field names in advance so we have to dump them all.
filter_process = subprocess.Popen([vg, "filter", "--tsv-out", "annotation;time_used", filename], stdout=subprocess.PIPE)

lines = iter(filter_process.stdout)
# Drop header line
next(lines)

for line in lines:
# Parse the TSV and reconstruct a view of the full read dict.
line = line.decode('utf-8')
line = line.strip()
if len(line) == 0:
continue
parts = line.split("\t")
assert len(parts) == 2
read = {"annotation": json.loads(parts[0]), "time_used": float(parts[1])}

yield read

return_code = filter_process.wait()
assert return_code == 0

class Table(object):
"""
Expand Down Expand Up @@ -916,11 +951,20 @@ def main(args):
# Count all the reads
read_count = 0

# Record mapping parameters from at least one read
# Record mapping parameters from special magic GAM chunk, if any, or a read
params = None

for read in read_line_oriented_json(options.input):


# Get the params from a magic chunk.
# TODO: This is a whole pass through a possibly big file!
params_json = subprocess.check_output([options.vg, "view", "--extract-tag", "PARAMS_JSON", "--first", options.input]).decode('utf-8')
lines = params_json.split("\n")
for parsed_params in read_line_oriented_json(lines):
if params is None:
params = parsed_params

for read in read_read_views(options.vg, options.input):
# For the data we need on each read

if params is None:
# Go get the mapping parameters
params = sniff_params(read)
Expand Down
2 changes: 1 addition & 1 deletion scripts/giraffe-wrangler.sh
Original file line number Diff line number Diff line change
Expand Up @@ -195,7 +195,7 @@ if [[ ! -z "${SIM_GAM}" ]] ; then

# Compute loss stages
# Let giraffe facts errors out
vg view -aj "${WORK}/mapped.gam" | scripts/giraffe-facts.py "${WORK}/facts" >"${WORK}/facts.txt"
scripts/giraffe-facts.py "${WORK}/mapped.gam" "${WORK}/facts" >"${WORK}/facts.txt"
fi

if [[ ! -z "${REAL_FASTQ}" ]] ; then
Expand Down
93 changes: 57 additions & 36 deletions scripts/make_pbsim_reads.sh
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
#!/usr/bin/env bash
# make_pbsim_reads.sh: script to simulate reads with pbsim2.
# Mostly theoretical; records commands that would have worked better than what was actually run
# Intended to run on UCSC Courtyard/Plaza systems
# Intended to run on UCSC behind-the-firewall systems
# You may also need to CFLAGS=-fPIC pip3 install --user bioconvert

set -ex
Expand All @@ -10,27 +9,34 @@ set -ex
# You can set these in the environment to override them and I don't have to write a CLI option parser.
# See https://stackoverflow.com/a/28085062

# Graph to simulate from. Can be S3 URLs or local file paths.
# Graph to simulate from. Can be S3 URLs or local file paths. If GRAPH_GBZ_URL
# is set, GRAPH_XG_URL and GRAPH_GBWT_URL are not used.
: "${GRAPH_XG_URL:=s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.xg}"
: "${GRAPH_GBWT_URL:=s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gbwt}"
: "${GRAPH_GBZ_URL:=""}"
# Name to use for graph when downloaded
: "${GRAPH_NAME:=hprc-v1.0-mc-grch38}"
# Sample to simulate from
: "${SAMPLE_NAME:=HG00741}"
# Technology name to use in output filenames
: "${TECH_NAME:=hifi}"
# FASTQ to use as a template, or "/dev/null"
: "${SAMPLE_FASTQ:=/public/groups/vg/sjhwang/data/reads/real_HiFi/tmp/HiFi_reads_100k_real.fq}"
: "${SAMPLE_FASTQ:=/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/real/hifi/HiFi_reads_100k.fq}"
# HMM model to use instead of a FASTQ, or "/dev/null"
: "${PBSIM_HMM:=/dev/null}"
# This needs to be the pbsim2 command, which isn't assumed to be in $PATH
: "${PBSIM:=/public/groups/vg/sjhwang/tools/bin/pbsim}"
# This needs to be the pbsim2 binary, which might not be in $PATH.
# It can be installed with
# git clone https://github.com/yukiteruono/pbsim2.git
# cd pbsim2
# git checkout eeb5a19420534a0f672c81db2670117e62a9ee38
# autoupdate
# automake --add-missing
# autoreconf
# ./configure --prefix=$HOME/.local && make
# The binary will be in src/pbsim
: "${PBSIM:=pbsim}"
# Parameters to use with pbsim for simulating reads for each contig. Parameters are space-separated and internal spaces must be escaped.
: "${PBSIM_PARAMS:=--depth 1 --accuracy-min 0.00 --length-min 10000 --difference-ratio 6:50:54}"
# This needs to be a command line which can execute Stephen's script that adds qualities from a FASTQ back into a SAM that is missing them.
# Arguments are space-separated and internal spaces must be escaped.
# This script is at https://gist.github.com/adamnovak/45ae4f500a8ec63ce12ace4ca77afc21
: "${ADD_QUALITIES:=python3 /public/groups/vg/sjhwang/vg_scripts/bin/readers/sam_reader.py}"
: "${PBSIM_PARAMS:=--depth 4 --accuracy-min 0.00 --length-min 10000 --difference-ratio 6:50:54}"
# Directory to save results in
: "${OUT_DIR:=./reads/sim/${TECH_NAME}/${SAMPLE_NAME}}"
# Number of MAFs to convert at once
Expand All @@ -49,33 +55,48 @@ fi
# Make sure scratch directory exists
mkdir -p "${WORK_DIR}"

# Fetch graph
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.xg" ]] ; then
# This comparison require Bash 3 or later. See <https://stackoverflow.com/a/2172365>
if [[ ${GRAPH_XG_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_XG_URL}" "${WORK_DIR}/${GRAPH_NAME}.xg.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.xg.tmp" "${WORK_DIR}/${GRAPH_NAME}.xg"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_XG_URL}")" "${WORK_DIR}/${GRAPH_NAME}.xg"
if [[ -z "${GRAPH_GBZ_URL}" ]] ; then

# Fetch graph
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.xg" ]] ; then
# This comparison require Bash 3 or later. See <https://stackoverflow.com/a/2172365>
if [[ ${GRAPH_XG_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_XG_URL}" "${WORK_DIR}/${GRAPH_NAME}.xg.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.xg.tmp" "${WORK_DIR}/${GRAPH_NAME}.xg"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_XG_URL}")" "${WORK_DIR}/${GRAPH_NAME}.xg"
fi
fi
fi
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbwt" ]] ; then
if [[ ${GRAPH_GBWT_URL} =~ ^s3:.* ]]; then
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbwt" ]] ; then
if [[ ${GRAPH_GBWT_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_GBWT_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_GBWT_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
fi
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Make it one file
time vg gbwt -x "${WORK_DIR}/${GRAPH_NAME}.xg" "${WORK_DIR}/${GRAPH_NAME}.gbwt" --gbz-format -g "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi

elif [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Fetch the GBZ
if [[ ${GRAPH_GBZ_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_GBWT_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
aws s3 cp "${GRAPH_GBZ_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_GBWT_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
ln -s "$(realpath "${GRAPH_GBZ_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Make it one file
time vg gbwt -x "${WORK_DIR}/${GRAPH_NAME}.xg" "${WORK_DIR}/${GRAPH_NAME}.gbwt" --gbz-format -g "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}-${SAMPLE_NAME}-as-ref.gbz" ]] ; then
Expand Down Expand Up @@ -150,7 +171,7 @@ function do_job() {
mv "${SAM_NAME}.tmp" "${SAM_NAME}"
fi
set -o pipefail
${ADD_QUALITIES} -s "${SAM_NAME}" -f "${FASTQ_NAME}" | sed "s/ref/${CONTIG_NAME}/g" | samtools view -b - > "${RENAMED_BAM_NAME}.tmp"
python3 "$(dirname -- "${BASH_SOURCE[0]}")/reinsert_qualities.py" -s "${SAM_NAME}" -f "${FASTQ_NAME}" | sed "s/ref/${CONTIG_NAME}/g" | samtools view -b - > "${RENAMED_BAM_NAME}.tmp"
set +o pipefail
mv "${RENAMED_BAM_NAME}.tmp" "${RENAMED_BAM_NAME}"
else
Expand Down Expand Up @@ -207,14 +228,14 @@ fi
# Work out howe many reads there are
TOTAL_READS="$(vg stats -a "${WORK_DIR}/${SAMPLE_NAME}-reads/${SAMPLE_NAME}-sim-${TECH_NAME}.gam" | grep "^Total alignments:" | cut -f2 -d':' | tr -d ' ')"

if [[ "${TOTAL_READS}" -lt 10500 ]] ; then
echo "Only ${TOTAL_READS} reads were simulated. Cannot subset to 10000 reads with buffer!"
if [[ "${TOTAL_READS}" -lt 1000500 ]] ; then
echo "Only ${TOTAL_READS} reads were simulated. Cannot subset to 1000000 reads with buffer!"
exit 1
fi
echo "Simulated ${TOTAL_READS} reads overall"

SUBSAMPLE_SEED=1
for READ_COUNT in 100 1000 10000 ; do
for READ_COUNT in 100 1000 10000 100000 1000000 ; do
# Subset to manageable sizes (always)
# Get the fraction of reads to keep, overestimated, with no leading 0, to paste onto subsample seed.
FRACTION="$(echo "(${READ_COUNT} + 500)/${TOTAL_READS}" | bc -l | sed 's/^[0-9]*//g')"
Expand Down
39 changes: 39 additions & 0 deletions scripts/mark_secondaries.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#!/usr/bin/python3
# mark_secondaries.py: Mark all but the first alignment with a given name as secondary
"""
Mark duplicate alignments for a given read name as secondary. Useful for GraphAligner output which does not mark its secondaries. Assumes that the first alignment is the primary alignment, ignoring score.

vg view -a unmarked.gam mark_secondaries.py | vg view -JaG - > marked.gam

"""
import sys
import json


def filter_json_gam(infile):
"""
process gam json made with vg view -a my.gam
"""

seen_names = set()

for line in infile:
gam = json.loads(line)

if gam['name'] in seen_names:
gam['is_secondary'] = True
else:
gam['is_secondary'] = False
seen_names.add(gam['name'])

print(json.dumps(gam))

def main():
"""
Main entry point for the program.
"""
filter_json_gam(sys.stdin)

if __name__ == "__main__" :
main()

2 changes: 1 addition & 1 deletion scripts/plot-qq.R
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ print(x$ci)

# Now plot the points as different sizes, but the error bar line ranges as a consistent size
dat.plot <- ggplot(x, aes(1-mapprob+1e-7, 1-observed+1e-7, color=aligner, size=N, weight=N, label=round(mapq,2))) +
scale_color_manual(values=colors, guide=guide_legend(title=NULL, ncol=1)) +
scale_color_manual(values=colors, guide=guide_legend(title=NULL, ncol=2)) +
scale_y_log10("measured error", limits=c(1e-7,2), breaks=c(1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0)) +
scale_x_log10("error estimate", limits=c(1e-7,2), breaks=c(1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0)) +
scale_size_continuous("number", guide=guide_legend(title=NULL, ncol=4)) +
Expand Down
Loading
Loading