Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long Read Giraffe #4323

Open
wants to merge 1,242 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
1242 commits
Select commit Hold shift + click to select a range
ef2a138
Adopt HiFi parameters that look competitive with minimap2 on the 10k …
adamnovak Jun 26, 2024
1b52afe
Stop applying a score scale *and* a score window to HiFi reads
adamnovak Jun 27, 2024
ac748fa
Add debugging code to find and dump problems for adjacent indels
adamnovak Jun 28, 2024
c380167
Report offending alignment better
adamnovak Jun 28, 2024
74ed7bf
Remove old check function and add test
adamnovak Jun 28, 2024
5c91a96
Check for validity around simplify
adamnovak Jun 28, 2024
5d15bed
Isolate problem in a small unit test
adamnovak Jun 28, 2024
0555273
Stop using generic mapping cutter and just move insertions
adamnovak Jun 28, 2024
3277498
Quiet debugging
adamnovak Jun 28, 2024
d8803f0
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Jul 1, 2024
22c6e69
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 1, 2024
3a1cd38
Fix Mac build
adamnovak Jul 1, 2024
accbe34
Fix compatibility with Mac sed by using a nonempty backup extension
adamnovak Jul 1, 2024
88c3ff6
Add in offset even when the alignment starts past the end of the anch…
adamnovak Jul 1, 2024
a6dfc43
Add --max-graph-scale/-g option to surject and make giving up scale w…
adamnovak Jul 3, 2024
01ac6ee
Check zipcode tree distances in snarls properly
xchang1 Jul 9, 2024
dd468e8
Turn off random tests
xchang1 Jul 9, 2024
0d6723d
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Jul 9, 2024
fcdff75
Update to current libbdsg
adamnovak Jul 9, 2024
8d0878a
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 9, 2024
b0ece9b
Make crashes know what subcommand we tried to run
adamnovak Jul 9, 2024
4d149ed
Prefer to link vg's own built htslib dylib over the system one on Mac
adamnovak Jul 9, 2024
6d7f4b6
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Jul 9, 2024
a3b296e
Use libvgio that doesn't multithread decompression and make sure we a…
adamnovak Jul 10, 2024
93c3f7a
Use payload less for old short read code path
xchang1 Jul 10, 2024
783e452
Use the distance index a bit less
xchang1 Jul 10, 2024
425c4cf
Stop making new decoders
xchang1 Jul 10, 2024
6e8f889
Get parent depth from zipcodes but idk if its any faster
xchang1 Jul 10, 2024
7fc2f1f
Save chain component
xchang1 Jul 10, 2024
4620f16
Make docs build not have untracked files laying around in submodules
adamnovak Jul 10, 2024
da73e20
Add code reference
adamnovak Jul 10, 2024
89e74ac
Make crash code actually include stringstream
adamnovak Jul 10, 2024
45d4987
Cache more stuff in the seed from the zipcodes
xchang1 Jul 11, 2024
d8c6e18
Try getting all values at once but it doesnt work so Im going to try …
xchang1 Jul 12, 2024
0244761
Get all cache values from zipcodes once
xchang1 Jul 13, 2024
729c322
Dont use distance index for getting is_root_snarl
xchang1 Jul 13, 2024
33b7e87
Add the zipcode to the clustering problems and use to get parents
xchang1 Jul 13, 2024
412dcec
Acually use zipcode to get handle
xchang1 Jul 13, 2024
2bdcb1a
Stop expecting the parent of a snarl to be a root
xchang1 Jul 13, 2024
ea7a0ed
Use zipcodes for getting chain parent
xchang1 Jul 13, 2024
6a693ad
Use zipcodes for distances
xchang1 Jul 14, 2024
536c7d7
Add distance check
xchang1 Jul 14, 2024
7fa624d
Don't copy seed to cluster
xchang1 Jul 15, 2024
83d1ffa
Fix is_root_snarl
xchang1 Jul 15, 2024
f79ad09
Reserve space in some vectors
xchang1 Jul 15, 2024
276d60b
Take out unused map
xchang1 Jul 15, 2024
5e4fea4
Take out another unused hash set
xchang1 Jul 15, 2024
acff6f6
Reserve more and fix indenting
xchang1 Jul 15, 2024
310beb7
Reserve memory for zipcode
xchang1 Jul 16, 2024
73477be
Reserve memory for decoders
xchang1 Jul 16, 2024
f5d0c4b
Use zipcode for snarl length
xchang1 Jul 16, 2024
a2dc51f
Reserve more memory
xchang1 Jul 16, 2024
c912afb
Add cluster checking
xchang1 Jul 17, 2024
ee5b6b6
Find minimizer hit count by walking through minimziers ordered by rea…
xchang1 Jul 17, 2024
a7bb77c
Revert "Find minimizer hit count by walking through minimziers ordere…
xchang1 Jul 17, 2024
02a3369
Add hacky way of dealing with multicomponent chains
xchang1 Jul 18, 2024
a410321
Revert "Add hacky way of dealing with multicomponent chains"
xchang1 Jul 19, 2024
3b536dd
Merge commit '094b956' into lr-giraffe
adamnovak Jul 19, 2024
381001c
Merge commit '755a634' into lr-giraffe
adamnovak Jul 19, 2024
11fc884
Merge commit 'dfac868' into lr-giraffe
adamnovak Jul 19, 2024
adc6476
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Jul 19, 2024
e9aac7e
Merge commit 'b55be13' into lr-giraffe
adamnovak Jul 19, 2024
e551487
Merge remote-tracking branch 'origin/unlimited-anchors' into lr-giraffe
adamnovak Jul 19, 2024
eb1e302
Make a string identifier for a snarl tree node
xchang1 Jul 19, 2024
b177d07
Fix zipcode identifiers
xchang1 Jul 21, 2024
636abe8
Add chain component to zipcodes
xchang1 Jul 22, 2024
a4da073
Use zipcodes to get chain component for nodes
xchang1 Jul 22, 2024
6e287f5
Add get_chain_component to zipcodes
xchang1 Jul 22, 2024
6565acb
Get chain component from zipcodes for payload
xchang1 Jul 22, 2024
c476d12
Use zipcode chain component
xchang1 Jul 22, 2024
9e2153f
Add failing unit test
xchang1 Jul 22, 2024
2c49a23
Get chain component for irregular snarls
xchang1 Jul 22, 2024
ce9b0ad
Fix debug code
xchang1 Jul 22, 2024
da5f891
Add chain component count to chains zipcodes
xchang1 Jul 23, 2024
3b4855e
Add unit test for looping zipcodes and fix bugs
xchang1 Jul 23, 2024
d135aab
Use zipcodes for chain component in clustering
xchang1 Jul 23, 2024
ddef07f
Use regular snarls to skip finding distances
xchang1 Jul 24, 2024
9761748
Add external connectivity to zipcodes
xchang1 Jul 24, 2024
74d4122
Take out end_in net_handle_t
xchang1 Jul 24, 2024
804c44d
Fix payload with new values
xchang1 Jul 24, 2024
18b4e7e
Add external connectivity for root nodes
xchang1 Jul 24, 2024
56fbd52
Use zipcode for external connectivity
xchang1 Jul 24, 2024
f2bd83a
Don't use distance index for ordering children in chains
xchang1 Jul 24, 2024
07de5de
Fix orientation for simple snarls since we took them out
xchang1 Jul 24, 2024
5203413
Start taking out net_handle_t's from the payload
xchang1 Jul 24, 2024
461de64
Take out grandparent handle
xchang1 Jul 24, 2024
8655513
Add net identifier strings but don't actually use them
xchang1 Jul 25, 2024
5170df0
Use net identifiers as keys for all the lookups
xchang1 Jul 25, 2024
e541b48
Mostly take out finding net handles until they're needed
xchang1 Jul 25, 2024
985ea32
Fix identifiers but its not working
xchang1 Jul 25, 2024
9b36204
Fix parent
xchang1 Jul 26, 2024
d23653a
Fix getting the node identifier
xchang1 Jul 26, 2024
6e09743
Fix getting handle to trivial chain
xchang1 Jul 26, 2024
e433831
Include chain component in identifier
xchang1 Jul 26, 2024
fd9a049
Fix getting net handle for child of root snarl
xchang1 Jul 26, 2024
27f47cf
Fix clustering the root snarl
xchang1 Jul 27, 2024
7e74771
Get chain sorting values for snarls earlier and sort properly for two…
xchang1 Jul 29, 2024
a4eadd3
Get parent from children
xchang1 Jul 29, 2024
bc642aa
Take out distance index when not used
xchang1 Jul 29, 2024
1a32937
Take the distance index out of more functions that don't use it
xchang1 Jul 29, 2024
2a77f66
Use distance index for chain values less
xchang1 Jul 29, 2024
d2fc0a3
turn off debug
xchang1 Jul 29, 2024
82f714e
Get identifier at the same time as payload
xchang1 Jul 30, 2024
0adfea0
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jul 30, 2024
aa87f36
Undo using identifier strings as keys and fix some bugs
xchang1 Jul 31, 2024
b9a2010
Put zipcode and decoder together
xchang1 Jul 31, 2024
a18edec
Fix unit tests
xchang1 Jul 31, 2024
831f231
Fix reserving decoder length
xchang1 Jul 31, 2024
2923cde
Add an int vector that uses a minimal bit width for storing stuff
xchang1 Jul 31, 2024
2bbd62b
Merge branch 'read-quietly' into lr-giraffe
adamnovak Jul 31, 2024
595cafb
Use new int vectors for zipcodes but it doesn't work yet
xchang1 Aug 1, 2024
c5385a3
Merge remote-tracking branch 'origin/low-cplx-context' into lr-giraffe
adamnovak Aug 1, 2024
de6c76f
Fix zipcodes
xchang1 Aug 2, 2024
7aa1fe7
Revert using minint vectors
xchang1 Aug 3, 2024
39ed6c8
Make decoder use fewer bits
xchang1 Aug 3, 2024
116dc01
Only store zipcodes in a separate file
xchang1 Aug 3, 2024
eace413
Serialize zipcode and decoder
xchang1 Aug 3, 2024
a9dffbe
Revert "Serialize zipcode and decoder"
xchang1 Aug 3, 2024
05480d0
Revert "Only store zipcodes in a separate file"
xchang1 Aug 3, 2024
0207c82
Undo putting zipcode and decoder together
xchang1 Aug 3, 2024
69d48f0
Fix typo
xchang1 Aug 3, 2024
e12b23a
Use distance index less for chain values
xchang1 Aug 3, 2024
f4f10f3
Put decoder back with zipcode
xchang1 Aug 5, 2024
d72d232
Serialize decoder
xchang1 Aug 5, 2024
c4a2e48
Actually serialize the decoder
xchang1 Aug 5, 2024
7fd62d2
Put decoder into zipcode payload
xchang1 Aug 5, 2024
020cbb4
Actually serialize the decoder
xchang1 Aug 5, 2024
06a6b04
Add an unpacked zipcode but it doesn't compile yet
xchang1 Aug 6, 2024
d5a9e4d
Get everything to compile
xchang1 Aug 6, 2024
dfde735
Add unit tests and fix bug for unpacked zip codes
xchang1 Aug 6, 2024
1ecc8ae
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Aug 7, 2024
7cc9f4c
Use the unpacked zipcode for clustering instead of the old payload
xchang1 Aug 7, 2024
2c955f9
Use the length of the last chain component as the length of a multico…
xchang1 Aug 8, 2024
cf6b5be
Use unpacked zipcode in SnarlTreeNodeProblem instead of zipcode
xchang1 Aug 8, 2024
e5fe8b6
Reserve memory for unpacked zipcode
xchang1 Aug 9, 2024
8ccb110
Reserve memory
xchang1 Aug 9, 2024
a73b80e
Take out chain component
xchang1 Aug 9, 2024
4286171
Fix bug getting best distanecs
xchang1 Aug 9, 2024
434a861
Make vg filter see empty list annotations as not there
adamnovak Aug 9, 2024
007df92
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Aug 9, 2024
dbc77d7
Merge remote-tracking branch 'origin/left-align-better' into lr-giraffe
adamnovak Aug 9, 2024
5beb1be
Get the length of a cyclic chain
xchang1 Aug 10, 2024
a0cbb23
Turn off debug
xchang1 Aug 10, 2024
d7e2553
Get values from unpacked zipcode for children of problems
xchang1 Aug 10, 2024
98a5518
Take out unused chain component ints
xchang1 Aug 11, 2024
f5ad687
Reserve memory when getting zipcodes from payload
xchang1 Aug 12, 2024
d02f4f7
Remove unused ints and reserve memory for unpacked zipcode
xchang1 Aug 12, 2024
2762b05
Reserve less memory because it made it slower
xchang1 Aug 12, 2024
f5e5638
Undo unpacking zipcode to get back to 020cbb
xchang1 Aug 13, 2024
c172816
Fix bug getting minimum distances
xchang1 Aug 13, 2024
aff0cc6
Get chains last component length and get chain length from zipcode
xchang1 Aug 13, 2024
7ed8f6e
Make a decoded code type (but not for roots) and use it for building …
xchang1 Aug 13, 2024
a600bbc
Add getters and setters for unpacked codes
xchang1 Aug 13, 2024
4b994b6
Move build info headers from phony targets to top-level shell invocat…
adamnovak Aug 13, 2024
cc418dc
Merge branch 'update-version-fast' into lr-giraffe
adamnovak Aug 13, 2024
aaf3cb9
Remove extra tabs
adamnovak Aug 13, 2024
3cbd7be
Add functions for interpreting one level of the zipcode
xchang1 Aug 13, 2024
72025a0
Add unit tests for unpacked codes
xchang1 Aug 13, 2024
e00dd49
Merge remote-tracking branch 'origin/left-align-better' into lr-giraffe
adamnovak Aug 13, 2024
ae6b91a
Cut direct, broken static flag to version header dependency
adamnovak Aug 13, 2024
096000a
Use chain_minimum_length in zipcode
xchang1 Aug 14, 2024
0356878
Make unacking const
xchang1 Aug 14, 2024
dca72e7
Used unpacked zipcode for getting chain values
xchang1 Aug 14, 2024
01e7b4d
Use unpacked zipcode for getting snarl values and fix but I missed fi…
xchang1 Aug 14, 2024
62f6c38
Get parent with distance index if its faster
xchang1 Aug 14, 2024
b0c0234
Make a map from connected component number to net handle
xchang1 Aug 14, 2024
a4410a9
Make map from node id to net handle
xchang1 Aug 14, 2024
45cba14
Revert "Make map from node id to net handle"
xchang1 Aug 14, 2024
17120fb
Reserve memory for children
xchang1 Aug 14, 2024
b0f1f70
Revert "Reserve memory for children"
xchang1 Aug 14, 2024
083f2b3
Set up Docker build and CI to get the version build right
adamnovak Aug 14, 2024
1248b0f
Remove wayward tabs
adamnovak Aug 14, 2024
08218c6
Report errors better for vg inject failures
adamnovak Aug 14, 2024
35ae926
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Aug 15, 2024
0e7d66f
Merge branch 'update-version-fast' into lr-giraffe
adamnovak Aug 16, 2024
370f8fa
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Aug 19, 2024
748169d
Implement consistent orientation around WFA connections
adamnovak Aug 19, 2024
4e871f9
Implement consistent orientation for align_sequence_between with a wr…
adamnovak Aug 19, 2024
3a6022f
Fix build errors in implementation
adamnovak Aug 19, 2024
eef142d
Add test for non-WFA orientation fix
adamnovak Aug 19, 2024
b99200a
Add testing for connect_consistently
adamnovak Aug 19, 2024
f36ebdd
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Aug 21, 2024
ff6fa27
Use bitfields for child_info_t and add chain component
xchang1 Aug 22, 2024
247aaef
Check chain component when making zipcode trees
xchang1 Aug 22, 2024
cb1f600
Add unit test for multicomponent chain
xchang1 Aug 22, 2024
292bdd4
Merge remote-tracking branch 'upstream/master' into HEAD
xchang1 Aug 22, 2024
bd8ded6
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
adamnovak Aug 22, 2024
ad62378
Add handling for softclips off the edge of contigs in inject
adamnovak Aug 29, 2024
3bf557e
Fetch out edit
adamnovak Aug 29, 2024
d4b77fb
Add missing continue
adamnovak Aug 29, 2024
7ecf940
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Sep 3, 2024
fab3f5e
Fix empty top-level snarl zip trees
xchang1 Sep 3, 2024
f0e8389
Turn off debug
xchang1 Sep 3, 2024
a7ba479
Remove any empty snarl
xchang1 Sep 3, 2024
1b5387f
Check if previous_index is 0
xchang1 Sep 4, 2024
1bef104
Check orientation of seeds
xchang1 Sep 9, 2024
a325238
Add stuff to go around a looping chain but I haven't tested it yet
xchang1 Sep 9, 2024
20fa2d2
Add unit test for looping chains and fix sorting nodes in multicompon…
xchang1 Sep 10, 2024
dafb61b
Don't count nodes that loop
xchang1 Sep 10, 2024
beec239
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Sep 10, 2024
cd1ffce
Take out looping chains in zipcode trees
xchang1 Sep 12, 2024
92628c9
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Sep 12, 2024
cffccc9
Use more bits for distance index indices
xchang1 Sep 16, 2024
2b9ac09
Try bailing on banded global alignment between seeds if it's going to…
xchang1 Sep 18, 2024
3d118f3
Add dp limit to be the same as the dozeu limit
xchang1 Sep 18, 2024
c1f7cf5
Get cell count estimate right
xchang1 Sep 18, 2024
61df098
Get sequence properly
xchang1 Sep 18, 2024
068d88c
Merge remote-tracking branch 'origin/master' into lr-giraffe
adamnovak Sep 18, 2024
c667c6a
Revert trying to stop banded global aligner when the problem is too big
xchang1 Sep 18, 2024
0850aa2
Revert "Try bailing on banded global alignment between seeds if it's …
xchang1 Sep 18, 2024
b2b2ea9
Apply --max-dp-cells to BandedGlobalAligner and set to 100M for R10 a…
adamnovak Sep 18, 2024
42f2915
Merge remote-tracking branch 'upstream/lr-giraffe' into lr-giraffe
xchang1 Sep 22, 2024
5041ff4
Lower default max-dp-cells
xchang1 Sep 22, 2024
062ba7a
Actually use fragment-min-score
xchang1 Sep 24, 2024
839e934
Bump max-dp-cells back up
xchang1 Sep 24, 2024
4aee9f3
Increase max-dp-cells. Works for r10 reads on non-d9 hprc graph but I…
xchang1 Oct 18, 2024
b05d3e3
Break off most of main() from the x86-64-supporting preflight code
adamnovak Oct 18, 2024
8459d7b
Partially revert "fix build on Ubuntu 24.04.1 / GCC 13.2.0"
adamnovak Oct 18, 2024
527f4a4
Make padded window size parity match anchor parity for orientation in…
adamnovak Oct 21, 2024
ef939ee
Remove extra debugging
adamnovak Oct 21, 2024
09d2e7e
Change short tail pruning to only require being one tail and not both
adamnovak Oct 21, 2024
1ff1009
Add a test for surjection orientation independence
adamnovak Oct 21, 2024
c25f9ed
Add a script to cut GAM records to node ranges
adamnovak Oct 24, 2024
1221a0c
Add simple local slide check to anchor pruning
adamnovak Oct 24, 2024
04e267c
Increase search length to include gap length when finding the dagifie…
xchang1 Oct 25, 2024
1863c9e
Add the max_gap_length earlier to match what the other calls to align…
xchang1 Oct 25, 2024
de2f21c
Merge pull request #4425 from vgteam/lr-giraffe-missinsert
xchang1 Oct 25, 2024
af6ce11
Add --max-slide option to surject to control duplicate anchor sequenc…
adamnovak Oct 25, 2024
092568c
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Oct 25, 2024
afb393f
Add timing to surject and std:: to move
adamnovak Oct 25, 2024
ace49a2
Turn off debug output
xchang1 Oct 26, 2024
821623a
Merge pull request #4427 from xchang1/lr-giraffe-missinsert
xchang1 Oct 26, 2024
adeed1c
Switch to sort_shuffling_ties for sorting minimizers
xchang1 Oct 28, 2024
d70b6bd
Keep picking seeds until the agglomerations cover most of the read
xchang1 Oct 28, 2024
69fc8d4
Add a target read coverage for minimizers
xchang1 Oct 29, 2024
08257c0
Only keep seeds if there is no coverage in their agglomeratoin
xchang1 Oct 29, 2024
b3578c9
Remember the coverage of reads taken before the minimizer count cutoff
xchang1 Oct 29, 2024
998ca50
Remember to count the bases of seeds
xchang1 Oct 29, 2024
f3836d5
Use a constant flank size instead of agglomeration
xchang1 Oct 29, 2024
73ef8e8
Fix adding coverage
xchang1 Oct 29, 2024
ff0e63d
Fix counting overlapping bases
xchang1 Oct 30, 2024
44e1dc2
Take out target coverage and increase flank
xchang1 Oct 30, 2024
345163b
Make minimizer flank coverage a command line option
xchang1 Oct 30, 2024
d0e7490
Make limiting the max_middle_gap an option
xchang1 Oct 30, 2024
30a8319
Add new defaults
xchang1 Oct 31, 2024
329e434
Make sorting minimizers check key
xchang1 Nov 1, 2024
9f569ed
Limit the number of hits that a minimizer can have when adding it bey…
xchang1 Nov 4, 2024
e667000
Merge pull request #4429 from xchang1/lr-giraffe-seedsort
xchang1 Nov 4, 2024
8e0373a
Only keep extra minimizers if they're as good as the ones we kept
xchang1 Nov 5, 2024
a14a8d5
Merge remote-tracking branch 'upstream/master' into HEAD
adamnovak Nov 8, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion deps/dozeu
Submodule dozeu updated 2 files
+87 −8 dozeu.h
+65 −58 example.c
4 changes: 4 additions & 0 deletions doc/publish-docs.sh
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,10 @@ COMMIT_AUTHOR_EMAIL="[email protected]"
# We expect GITLAB_SECRET_FILE_DOCS_SSH_KEY to come in from the environment,
# specifying the private deploy key we will use to get at the docs repo.

# Make sure no submodules have untracked files from caching
# See <https://gist.github.com/nicktoumpelis/11214362#file-repo-rinse-sh-L2>
git submodule foreach --recursive git clean -xfd

# Find all the submodules that Doxygen wants to look at and make sure we have
# those.
cat Doxyfile | grep "^INPUT *=" | cut -f2 -d'=' | tr ' ' '\n' | grep "^ *deps" | sed 's_ *\(deps/[^/]*\).*_\1_' | sort | uniq | xargs -n 1 git submodule update --init --recursive
Expand Down
60 changes: 52 additions & 8 deletions scripts/giraffe-facts.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,8 +98,10 @@ def parse_args(args):
parser = argparse.ArgumentParser(description=__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter)

parser.add_argument("--input", type=argparse.FileType('r'), default=sys.stdin,
help="line-oriented JSON GAM to process")
parser.add_argument("input", type=str,
help="GAM to process")
parser.add_argument("--vg", type=str, default="vg",
help="vg binary to use")
parser.add_argument("outdir",
help="directory to place output in")

Expand Down Expand Up @@ -286,11 +288,44 @@ def add_in_stats(destination, addend):

def read_line_oriented_json(lines):
"""
For each line in the given stream, yield it as a parsed JSON object.
For each line in the given iterable of lines (such as a stream), yield it as a parsed JSON object.
"""

for line in lines:
yield json.loads(line)
line = line.strip()
if len(line) > 0:
yield json.loads(line)


def read_read_views(vg, filename):
"""
Given a vg binary and a filename, iterate over subsets of the parsed read dicts for each read in the file.

The subsets will have the annotation and time_used fields.
"""

# Extract just the annotations and times of reads as JSON, with a # header
# We don't know all the annotation field names in advance so we have to dump them all.
filter_process = subprocess.Popen([vg, "filter", "--tsv-out", "annotation;time_used", filename], stdout=subprocess.PIPE)

lines = iter(filter_process.stdout)
# Drop header line
next(lines)

for line in lines:
# Parse the TSV and reconstruct a view of the full read dict.
line = line.decode('utf-8')
line = line.strip()
if len(line) == 0:
continue
parts = line.split("\t")
assert len(parts) == 2
read = {"annotation": json.loads(parts[0]), "time_used": float(parts[1])}

yield read

return_code = filter_process.wait()
assert return_code == 0

class Table(object):
"""
Expand Down Expand Up @@ -916,11 +951,20 @@ def main(args):
# Count all the reads
read_count = 0

# Record mapping parameters from at least one read
# Record mapping parameters from special magic GAM chunk, if any, or a read
params = None

for read in read_line_oriented_json(options.input):


# Get the params from a magic chunk.
# TODO: This is a whole pass through a possibly big file!
params_json = subprocess.check_output([options.vg, "view", "--extract-tag", "PARAMS_JSON", "--first", options.input]).decode('utf-8')
lines = params_json.split("\n")
for parsed_params in read_line_oriented_json(lines):
if params is None:
params = parsed_params

for read in read_read_views(options.vg, options.input):
# For the data we need on each read

if params is None:
# Go get the mapping parameters
params = sniff_params(read)
Expand Down
2 changes: 1 addition & 1 deletion scripts/giraffe-wrangler.sh
Original file line number Diff line number Diff line change
Expand Up @@ -195,7 +195,7 @@ if [[ ! -z "${SIM_GAM}" ]] ; then

# Compute loss stages
# Let giraffe facts errors out
vg view -aj "${WORK}/mapped.gam" | scripts/giraffe-facts.py "${WORK}/facts" >"${WORK}/facts.txt"
scripts/giraffe-facts.py "${WORK}/mapped.gam" "${WORK}/facts" >"${WORK}/facts.txt"
fi

if [[ ! -z "${REAL_FASTQ}" ]] ; then
Expand Down
93 changes: 57 additions & 36 deletions scripts/make_pbsim_reads.sh
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
#!/usr/bin/env bash
# make_pbsim_reads.sh: script to simulate reads with pbsim2.
# Mostly theoretical; records commands that would have worked better than what was actually run
# Intended to run on UCSC Courtyard/Plaza systems
# Intended to run on UCSC behind-the-firewall systems
# You may also need to CFLAGS=-fPIC pip3 install --user bioconvert

set -ex
Expand All @@ -10,27 +9,34 @@ set -ex
# You can set these in the environment to override them and I don't have to write a CLI option parser.
# See https://stackoverflow.com/a/28085062

# Graph to simulate from. Can be S3 URLs or local file paths.
# Graph to simulate from. Can be S3 URLs or local file paths. If GRAPH_GBZ_URL
# is set, GRAPH_XG_URL and GRAPH_GBWT_URL are not used.
: "${GRAPH_XG_URL:=s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.xg}"
: "${GRAPH_GBWT_URL:=s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gbwt}"
: "${GRAPH_GBZ_URL:=""}"
# Name to use for graph when downloaded
: "${GRAPH_NAME:=hprc-v1.0-mc-grch38}"
# Sample to simulate from
: "${SAMPLE_NAME:=HG00741}"
# Technology name to use in output filenames
: "${TECH_NAME:=hifi}"
# FASTQ to use as a template, or "/dev/null"
: "${SAMPLE_FASTQ:=/public/groups/vg/sjhwang/data/reads/real_HiFi/tmp/HiFi_reads_100k_real.fq}"
: "${SAMPLE_FASTQ:=/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/real/hifi/HiFi_reads_100k.fq}"
# HMM model to use instead of a FASTQ, or "/dev/null"
: "${PBSIM_HMM:=/dev/null}"
# This needs to be the pbsim2 command, which isn't assumed to be in $PATH
: "${PBSIM:=/public/groups/vg/sjhwang/tools/bin/pbsim}"
# This needs to be the pbsim2 binary, which might not be in $PATH.
# It can be installed with
# git clone https://github.com/yukiteruono/pbsim2.git
# cd pbsim2
# git checkout eeb5a19420534a0f672c81db2670117e62a9ee38
# autoupdate
# automake --add-missing
# autoreconf
# ./configure --prefix=$HOME/.local && make
# The binary will be in src/pbsim
: "${PBSIM:=pbsim}"
# Parameters to use with pbsim for simulating reads for each contig. Parameters are space-separated and internal spaces must be escaped.
: "${PBSIM_PARAMS:=--depth 1 --accuracy-min 0.00 --length-min 10000 --difference-ratio 6:50:54}"
# This needs to be a command line which can execute Stephen's script that adds qualities from a FASTQ back into a SAM that is missing them.
# Arguments are space-separated and internal spaces must be escaped.
# This script is at https://gist.github.com/adamnovak/45ae4f500a8ec63ce12ace4ca77afc21
: "${ADD_QUALITIES:=python3 /public/groups/vg/sjhwang/vg_scripts/bin/readers/sam_reader.py}"
: "${PBSIM_PARAMS:=--depth 4 --accuracy-min 0.00 --length-min 10000 --difference-ratio 6:50:54}"
# Directory to save results in
: "${OUT_DIR:=./reads/sim/${TECH_NAME}/${SAMPLE_NAME}}"
# Number of MAFs to convert at once
Expand All @@ -49,33 +55,48 @@ fi
# Make sure scratch directory exists
mkdir -p "${WORK_DIR}"

# Fetch graph
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.xg" ]] ; then
# This comparison require Bash 3 or later. See <https://stackoverflow.com/a/2172365>
if [[ ${GRAPH_XG_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_XG_URL}" "${WORK_DIR}/${GRAPH_NAME}.xg.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.xg.tmp" "${WORK_DIR}/${GRAPH_NAME}.xg"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_XG_URL}")" "${WORK_DIR}/${GRAPH_NAME}.xg"
if [[ -z "${GRAPH_GBZ_URL}" ]] ; then

# Fetch graph
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.xg" ]] ; then
# This comparison require Bash 3 or later. See <https://stackoverflow.com/a/2172365>
if [[ ${GRAPH_XG_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_XG_URL}" "${WORK_DIR}/${GRAPH_NAME}.xg.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.xg.tmp" "${WORK_DIR}/${GRAPH_NAME}.xg"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_XG_URL}")" "${WORK_DIR}/${GRAPH_NAME}.xg"
fi
fi
fi
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbwt" ]] ; then
if [[ ${GRAPH_GBWT_URL} =~ ^s3:.* ]]; then
if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbwt" ]] ; then
if [[ ${GRAPH_GBWT_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_GBWT_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_GBWT_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
fi
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Make it one file
time vg gbwt -x "${WORK_DIR}/${GRAPH_NAME}.xg" "${WORK_DIR}/${GRAPH_NAME}.gbwt" --gbz-format -g "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi

elif [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Fetch the GBZ
if [[ ${GRAPH_GBZ_URL} =~ ^s3:.* ]]; then
# Download from S3
aws s3 cp "${GRAPH_GBWT_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbwt.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
aws s3 cp "${GRAPH_GBZ_URL}" "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
else
# Use local symlink
ln -s "$(realpath "${GRAPH_GBWT_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbwt"
ln -s "$(realpath "${GRAPH_GBZ_URL}")" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}.gbz" ]] ; then
# Make it one file
time vg gbwt -x "${WORK_DIR}/${GRAPH_NAME}.xg" "${WORK_DIR}/${GRAPH_NAME}.gbwt" --gbz-format -g "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp"
mv "${WORK_DIR}/${GRAPH_NAME}.gbz.tmp" "${WORK_DIR}/${GRAPH_NAME}.gbz"
fi

if [[ ! -e "${WORK_DIR}/${GRAPH_NAME}-${SAMPLE_NAME}-as-ref.gbz" ]] ; then
Expand Down Expand Up @@ -150,7 +171,7 @@ function do_job() {
mv "${SAM_NAME}.tmp" "${SAM_NAME}"
fi
set -o pipefail
${ADD_QUALITIES} -s "${SAM_NAME}" -f "${FASTQ_NAME}" | sed "s/ref/${CONTIG_NAME}/g" | samtools view -b - > "${RENAMED_BAM_NAME}.tmp"
python3 "$(dirname -- "${BASH_SOURCE[0]}")/reinsert_qualities.py" -s "${SAM_NAME}" -f "${FASTQ_NAME}" | sed "s/ref/${CONTIG_NAME}/g" | samtools view -b - > "${RENAMED_BAM_NAME}.tmp"
set +o pipefail
mv "${RENAMED_BAM_NAME}.tmp" "${RENAMED_BAM_NAME}"
else
Expand Down Expand Up @@ -207,14 +228,14 @@ fi
# Work out howe many reads there are
TOTAL_READS="$(vg stats -a "${WORK_DIR}/${SAMPLE_NAME}-reads/${SAMPLE_NAME}-sim-${TECH_NAME}.gam" | grep "^Total alignments:" | cut -f2 -d':' | tr -d ' ')"

if [[ "${TOTAL_READS}" -lt 10500 ]] ; then
echo "Only ${TOTAL_READS} reads were simulated. Cannot subset to 10000 reads with buffer!"
if [[ "${TOTAL_READS}" -lt 1000500 ]] ; then
echo "Only ${TOTAL_READS} reads were simulated. Cannot subset to 1000000 reads with buffer!"
exit 1
fi
echo "Simulated ${TOTAL_READS} reads overall"

SUBSAMPLE_SEED=1
for READ_COUNT in 100 1000 10000 ; do
for READ_COUNT in 100 1000 10000 100000 1000000 ; do
# Subset to manageable sizes (always)
# Get the fraction of reads to keep, overestimated, with no leading 0, to paste onto subsample seed.
FRACTION="$(echo "(${READ_COUNT} + 500)/${TOTAL_READS}" | bc -l | sed 's/^[0-9]*//g')"
Expand Down
39 changes: 39 additions & 0 deletions scripts/mark_secondaries.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#!/usr/bin/python3
# mark_secondaries.py: Mark all but the first alignment with a given name as secondary
"""
Mark duplicate alignments for a given read name as secondary. Useful for GraphAligner output which does not mark its secondaries. Assumes that the first alignment is the primary alignment, ignoring score.

vg view -a unmarked.gam mark_secondaries.py | vg view -JaG - > marked.gam

"""
import sys
import json


def filter_json_gam(infile):
"""
process gam json made with vg view -a my.gam
"""

seen_names = set()

for line in infile:
gam = json.loads(line)

if gam['name'] in seen_names:
gam['is_secondary'] = True
else:
gam['is_secondary'] = False
seen_names.add(gam['name'])

print(json.dumps(gam))

def main():
"""
Main entry point for the program.
"""
filter_json_gam(sys.stdin)

if __name__ == "__main__" :
main()

2 changes: 1 addition & 1 deletion scripts/plot-qq.R
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ print(x$ci)

# Now plot the points as different sizes, but the error bar line ranges as a consistent size
dat.plot <- ggplot(x, aes(1-mapprob+1e-7, 1-observed+1e-7, color=aligner, size=N, weight=N, label=round(mapq,2))) +
scale_color_manual(values=colors, guide=guide_legend(title=NULL, ncol=1)) +
scale_color_manual(values=colors, guide=guide_legend(title=NULL, ncol=2)) +
scale_y_log10("measured error", limits=c(1e-7,2), breaks=c(1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0)) +
scale_x_log10("error estimate", limits=c(1e-7,2), breaks=c(1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0)) +
scale_size_continuous("number", guide=guide_legend(title=NULL, ncol=4)) +
Expand Down
Loading
Loading