Timeline #2

Closed
16 of 38 tasks
thejmazz opened this issue Jun 24, 2016 · 3 comments

@thejmazz
Member

thejmazz commented Jun 24, 2016

A tentative plan for the way forward. I think once the core API is stable, time would be best spent reimplementing real world workflows that can be most improved with Streams. The primary concerns are

  • task orchestration
  • integration with SGE

After that work can begin on DSL, Docker, admin panel app, nbind etc.

If you think too little time is allotted to any of these plans, or that some should be swapped or triaged for others, don't hesitate to let me know. I'd like us all to agree on realistic plans for these next 8 weeks that are exciting and fully satisfy the original overarching goal.

Week 5

  • formalize task orchestration API
    • blocking vs streaming vs async (see Task Architecture #1)
    • join
    • parallel
    • forking
    • transform stream as task
  • tests on core API with basic examples, CI, and coverage
  • pipe tasks through each other instead of one task with shellPipe taking an array

Summary

  • refactored original prototype
    • task object now a stream that can wrap streams/non-streams
    • This makes task easier to use with existing stream modules
  • chatted with Matthias and Max about API, internal code
    • happy with API design so far, keep it going forward
    • change new File to a vanilla JS object {file: 'foo'} - easier for new devs, no peer dependencies - ab782f
      • favors the ability of other devs to work with it over the shorter/cleaner "new" syntax from the consumer's perspective - the DSL will be clean
    • don't emit custom events after the stream's "end" or "finish"; use those events instead, and leave an object on the stream "to be retrieved" (rather than emitting data with a custom event)
  • chatted with Davide
    • does big social data analytics type stuff
    • interested in using waterwheel, contributing
    • walked him through clone and install
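
As a rough illustration of the "use 'finish' and leave an object on the stream" point above, here is a minimal sketch using only Node's built-in stream module. The lineCounter helper and the output property are hypothetical names chosen for this example, not waterwheel's actual API.

```javascript
const { Writable } = require('stream');

// Hypothetical task: counts newlines written to it. Instead of emitting a
// custom event, it stores its result on the stream object so that consumers
// can read it once the standard 'finish' event has fired.
function lineCounter() {
  let count = 0;
  const task = new Writable({
    write(chunk, _enc, cb) {
      count += chunk.toString().split('\n').length - 1;
      cb();
    },
    final(cb) {
      task.output = { lines: count }; // hypothetical property name
      cb();
    }
  });
  return task;
}

const counter = lineCounter();
counter.on('finish', () => console.log(counter.output)); // logs { lines: 3 }
counter.write('a\nb\n');
counter.end('c\n');
```

Because only standard events are involved, anything that already understands Node streams can wait on this task without knowing about task-specific events.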

For this week, continue improving waterwheel and its examples with real world workflows. We already have a basic genomic one; I'd like to try out an RNASeq pipeline, or whatever you guys suggest. - stick to improving the genomic workflow with sound reentrancy

Task orchestration in the core codebase is largely resolved: parallel/forking/join become easy when a task returns a regular, compliant stream (i.e. no more custom events). Forking is not done yet, but because a task is now a stream, it can be done with existing modules - e.g. multi-write-stream.
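
To make the forking idea concrete, here is a minimal sketch that relies only on Node's built-in stream module (not multi-write-stream, and not waterwheel's real API). The task and collect helpers are hypothetical; the point is just that once a task is an ordinary stream, one source can be piped into several branches.

```javascript
const { Readable, Transform, Writable } = require('stream');

// Hypothetical helper: wrap a plain function as a Transform stream "task".
function task(fn) {
  return new Transform({
    objectMode: true,
    transform(chunk, _enc, cb) {
      try {
        cb(null, fn(chunk));
      } catch (err) {
        cb(err);
      }
    }
  });
}

// Hypothetical helper: a sink that collects every chunk into an array.
function collect(into) {
  return new Writable({
    objectMode: true,
    write(chunk, _enc, cb) {
      into.push(chunk);
      cb();
    }
  });
}

const upper = [];
const lengths = [];
const source = Readable.from(['acgt', 'ttga']);

// Fork: the same readable feeds two independent branches.
source.pipe(task(s => s.toUpperCase())).pipe(collect(upper));
source.pipe(task(s => s.length)).pipe(collect(lengths));
```

Each branch sees every chunk from the source, and each branch is itself a stream, so it stays composable with further tasks.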

Week 6

  • Unit tests for task orchestration (#18)
    • check resolution scenarios (from previous task vs fs)
    • simple joins
    • simple parallels
    • joins and parallels should return valid task streams -> further composable
    • simple forking
    • transform task
    • stream search results, filter down to IDs, run pipeline on each ID
  • file validation (File Validation #22)
    • existence
    • not null check enabled by default
    • pass in custom validator function(s)
  • reentrancy (Test Reentrancy #27)
    • file timestamp or checksum
    • force rerun of all/specific task
    • --resume option
    • tee stream to file
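
The checksum-based reentrancy item in the list above could look roughly like the sketch below. It assumes the checksum of each output was recorded on the previous run; checksum and canSkip are hypothetical helpers, and the force flag only mirrors the "force rerun" idea, not an existing option.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: SHA-256 hex digest of a file's contents.
function checksum(file) {
  return crypto.createHash('sha256').update(fs.readFileSync(file)).digest('hex');
}

// Hypothetical helper: a task can be skipped when its output exists, is
// non-empty, and still matches the checksum recorded on the last run.
function canSkip(output, recorded, force = false) {
  if (force || !fs.existsSync(output)) return false;
  if (fs.statSync(output).size === 0) return false;
  return checksum(output) === recorded;
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'reentrancy-'));
const out = path.join(dir, 'reads.txt');
fs.writeFileSync(out, 'ACGT\n');
const recorded = checksum(out);

console.log(canSkip(out, recorded)); // true: output unchanged since last run
fs.writeFileSync(out, 'ACGTT\n');
console.log(canSkip(out, recorded)); // false: output was modified
```

A timestamp comparison would be cheaper than hashing large files, at the cost of rerunning tasks whose outputs were merely touched.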

Pushed down:

  • integrate with clustering systems like SGE - try to solve this bionode-sra issue
  • run tasks in their own folders
  • implement new real world workflows from papers

Week summary:

  • more unit tests on task orchestration, still need more for complex scenarios
  • basic reentrancy using existence of file, non null file, custom validator on file
  • pass in custom validators as an array, functions that take file path, return true/false
  • working simple variant calling example
  • came across an open question: how to let the user provide validators that take more than one file (e.g. reads_1 and reads_2)
  • came across this problem: Resolve input from all tasks in join (#35)
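
The validator scheme summarized above (an array of functions that take a file path and return true/false, with existence and non-empty checks by default) might be sketched as follows. The validate function and the looksLikeFasta validator are hypothetical and only illustrate the shape; they are not taken from the waterwheel codebase.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Default checks: the file exists and is not empty.
const exists = file => fs.existsSync(file);
const nonEmpty = file => fs.statSync(file).size > 0;

// Hypothetical custom validator: a FASTA file should start with '>'.
const looksLikeFasta = file => fs.readFileSync(file, 'utf8').startsWith('>');

// Run the defaults first, then any custom validators. every() stops at the
// first failure, so later checks never run on a missing file.
function validate(file, validators = []) {
  return [exists, nonEmpty, ...validators].every(check => check(file));
}

const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'validators-'));
const fasta = path.join(tmp, 'seq.fa');
fs.writeFileSync(fasta, '>seq1\nACGT\n');

console.log(validate(fasta, [looksLikeFasta]));
console.log(validate(path.join(tmp, 'missing.fa')));
```

Validators with this signature only ever see one path, which is why the multi-file case (e.g. reads_1 and reads_2) mentioned above needs a different design.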

Week 7

  • play with "pass files from other tasks" problem
  • integrate with Docker - specify a container for each Task
  • formalize YAML/hackfile based DSL

Week Summary:

  • this week didn't feel very productive, but
  • took time to think about how to restructure codebase with consideration of the "pass files from other tasks" problem
  • set up a gitbook, partially documenting the restructure approach: https://thejmazz.gitbooks.io/bionode-waterwheel/content/
  • tasks have a hash for params, input, and output, and series of tasks are arranged hierarchically using these
  • the pipeline at any point in time is a very stateful entity, so the config and tasks of the pipeline are now managed in a redux store. This lets me describe every change to a live task (e.g. resolving output, running output validators, finding matching input from previous tasks) with an action; each action results in a reducer being called that returns a new state. Reducers are small and more easily testable since they are pure functions. The giant codebase of task is now gradually moving into many smaller reducers with a smaller scope
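
To show why reducers are easy to test, here is a minimal sketch of a task reducer written as a plain function, without importing redux itself. The action types and the state shape are assumptions made for this example and do not reflect waterwheel's real actions.

```javascript
// Hypothetical action types for two steps in a task's lifecycle.
const RESOLVE_OUTPUT = 'task/RESOLVE_OUTPUT';
const VALIDATE_OUTPUT = 'task/VALIDATE_OUTPUT';

// Hypothetical initial state: the params/input/output hashes plus two flags.
const initialTask = { params: {}, input: {}, output: {}, resolved: false, valid: null };

// A reducer is a pure function: (state, action) -> new state.
// It never mutates the state it receives.
function taskReducer(state = initialTask, action) {
  switch (action.type) {
    case RESOLVE_OUTPUT:
      return { ...state, output: { ...state.output, ...action.output }, resolved: true };
    case VALIDATE_OUTPUT:
      return { ...state, valid: action.valid };
    default:
      return state;
  }
}

// Replaying a list of actions reproduces the exact state.
const actions = [
  { type: RESOLVE_OUTPUT, output: { vcf: '/data/variants.vcf' } },
  { type: VALIDATE_OUTPUT, valid: true }
];
const finalState = actions.reduce(taskReducer, initialTask);
console.log(finalState);
```

Since the state after any sequence of actions is fully determined by those actions, a snapshot of the state or the action log is enough to reproduce a problem.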

Week 8

  • implement new real world workflows from papers

Week summary:

  • refactor fully completed - pipeline state managed by redux, actions are dispatched for each step in task lifecycle --> bug reports can be submitted with a snapshot of the exact state
    • big functions --> small, testable, pure functions
  • updated the simple vcf example to work with the refactored codebase
  • began implementation of "hierarchical output dump"
    • each task has a "trajectory" which is an array of keys of the output dump
    • task will match input patterns to absolute paths in the output dump, going through each "trajection" in the trajectory
    • keys of the DAG of the output dump are JSON.stringify(params) of each task
    • works somewhat but needs improvement (WIP)
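
A simplified sketch of the matching step described above follows. It flattens the output dump into a plain map keyed by JSON.stringify(params) instead of a DAG, checks the keys in trajectory order, and returns the first set of matches; all of these are assumptions, as are the names globToRegExp and resolveInput and the example paths.

```javascript
// Hypothetical output dump: JSON.stringify(params) of each task maps to the
// absolute paths that task produced.
const readsKey = JSON.stringify({ species: 'human' });
const alignKey = JSON.stringify({ species: 'human', aligner: 'bwa' });
const outputDump = {
  [readsKey]: ['/data/human/reads_1.fastq', '/data/human/reads_2.fastq'],
  [alignKey]: ['/data/human/aln.bam']
};

// Hypothetical helper: turn a simple '*' glob into a RegExp anchored at the
// end of the path.
function globToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  return new RegExp(`(^|/)${escaped}$`);
}

// Go through each key in the trajectory and return the matches from the
// first key that has any.
function resolveInput(pattern, trajectory, dump) {
  const re = globToRegExp(pattern);
  for (const key of trajectory) {
    const matches = (dump[key] || []).filter(p => re.test(p));
    if (matches.length > 0) return matches;
  }
  return [];
}

const trajectory = [alignKey, readsKey];
console.log(resolveInput('*.bam', trajectory, outputDump));
console.log(resolveInput('*_1.fastq', trajectory, outputDump));
```

Keying by the stringified params keeps outputs from different parameter sets apart, so a task only picks up files produced along its own trajectory.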

Week 9

  • implement new real world workflows from papers

Week 10

  • implement new real world workflows from papers

Week 11

  • Project website
  • Complete documentation, examples, use cases, etc

Week 12

  • Final cleanup of website, docs, testing, examples

Extras/Pushed out

  • prototype a simple pipeline with nbind - an in browser functional Waterwheel pipeline will be a great way to introduce and teach the module
  • web/electron admin panel app - view tasks, edit tasks, see progress, see logs in realtime

@bmpvieira
Member

Good, but I think that:

  • "implement methods from papers with great use cases for streaming"
  • "example workflow that exemplifies mixing node transform/filter streams and data analysis"

need to be much higher (like week 6) since:

  • these are what actually showcase the benefits of the project compared to other existing solutions
  • they convince people to use bionode
  • they allow feedback from users
  • they help drive development towards tasks with maximal benefit

Otherwise, it might be very easy to get distracted implementing features that sound nice but might not have a clear and immediate benefit to people. Also, the web/electron GUI is an ambitious project in itself, and so we shouldn't dedicate time to it until we have a solid and useful CLI solution, which probably won't happen before the end of GSoC since we only have 8 weeks left.

I believe @yannickwurm shares a similar view on this.

@thejmazz
Member Author

Edited the timeline. If you guys can pick out a few papers that have good methods that we can reproduce with waterwheel, that would be great. Having a workflow goal to implement really helps determine a goal and discover edge/use cases. I'm thinking something that takes a stream of SRA accessions (and can then distribute those emitted values over SGE/cluster). I'd like to play with RNAseq a bit too, maybe use Kallisto.

@thejmazz thejmazz changed the title Revised Timeline Timeline Jul 4, 2016
@bmpvieira bmpvieira self-assigned this Apr 12, 2017
@bmpvieira
Member

Some of this can be recycled for next GSoC, but I'm closing it since the 2016 one is over.
Here's a summary of what was achieved and what's next:
https://github.com/bionode/gsoc16/blob/master/README.md

@bmpvieira bmpvieira assigned bmpvieira and thejmazz and unassigned bmpvieira Apr 12, 2017