change up readme
shashi authored Aug 13, 2020
1 parent e3850ff commit 72d2c33
Showing 1 changed file with 91 additions and 4 deletions.
95 changes: 91 additions & 4 deletions README.md
@@ -3,14 +3,101 @@
[![Build Status](https://travis-ci.org/shashi/FileTrees.jl.svg?branch=master)](https://travis-ci.org/shashi/FileTrees.jl) [![Build status](https://ci.appveyor.com/api/projects/status/6sei8e7et721usx6?svg=true)](https://ci.appveyor.com/project/shashi/filetrees-jl)
[![Coverage Status](https://coveralls.io/repos/github/shashi/FileTrees.jl/badge.svg?branch=master)](https://coveralls.io/github/shashi/FileTrees.jl?branch=master)

Easy everyday parallelism with a file tree abstraction.

**Note:** this package is a work in progress; the API is undocumented and still in flux. Talk to me or Julian about using or contributing.

FileTrees is a set of tools to lazy-load, process and write file trees. Built-in parallelism allows you to max out compute on any machine.
## With FileTrees you can

There are no restrictions on what files you can read and write: as long as you have functions that work with one file, you can use the same functions to work with a directory of files.
- Read a directory structure as a Julia data structure, (lazy-)load the files, apply map and reduce operations on the data while not exceeding available memory if possible. ([docs](http://shashi.biz/FileTrees.jl/values/))
- Filter data by file name using familiar Unix syntax ([docs](http://shashi.biz/FileTrees.jl/patterns/))
- Make up a file tree in memory, create some data to go with each file (in parallel), write the tree to disk (in parallel). (See example below)
- Virtually `mv` and `cp` files within trees, merge and diff trees, apply different functions to different subtrees. ([docs](http://shashi.biz/FileTrees.jl/tree-manipulation/))

Lazy directory operations let you freely restructure file trees so that computations are convenient to set up. Tree manipulation functions help with this. Files in a FileTree can have any value attached to them; values can be combined by merging trees or subtrees, and written to disk.

## Example

Here is an example of using FileTrees to create 3025 images which together form a big 16500x16500 image of a Mandelbrot set. (I tried my best to make them all contiguous; it's almost right, but I'm still figuring out those parameters.)

Then we load it back and compute a histogram of the HSV values across all the images in parallel using OnlineStats.jl.
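
The image count and overall size quoted above follow from the parameter grid and the 300x300 tile size used in the example below. A quick check in plain Julia (the variable names here are only for illustration):

```julia
xs = -1:0.037:1      # the same range the example uses for both x and y
n = length(xs)       # grid points per axis
tiles = n^2          # one image per (x, y) pair
tile_px = 300        # each tile is 300x300 pixels

println((n, tiles, n * tile_px))   # prints (55, 3025, 16500)
```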

```julia
using Distributed # provides @everywhere
@everywhere using Images, FileTrees, FileIO

tree = maketree("mandel"=>[]) # an empty file tree
params = [(x, y) for x=-1:0.037:1, y=-1:0.037:1]
for i = 1:size(params,1)
    for j = 1:size(params,2)
        tree = touch(tree, "$i/$j.png"; value=params[i, j])
    end
end

# map over the values to create an image at each node.
# 300x300 tile per image.
# (`mandelbrot` is a user-defined function that is not shown here;
# it must be defined on all processes.)
t1 = FileTrees.mapvalues(tree) do params
    mandelbrot(50, params..., 300) # zoom level, moveX, moveY, size
end

# save it
@time FileTrees.save(t1) do file
    FileIO.save(path(file), file.value)
end
```
This takes about 150 seconds on a 12 core machine when Julia is started with 10 processes with 4 threads each. (Oversubscribing this much gives good performance in this case.)
That is, Julia was started like this:
```
export JULIA_NUM_THREADS=4
julia -p 10
```
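
The `mandelbrot` function called in the example above is not defined in this README. As a rough idea of what such a helper could look like, here is a minimal sketch in plain Julia. The argument meanings (zoom, moveX, moveY, size) are taken from the comment in the example; the body, the `maxiter` keyword and the way pixels are mapped to the complex plane are assumptions made for illustration, not the author's implementation.

```julia
# Hypothetical stand-in for the `mandelbrot` helper used in the example.
# Returns an npx-by-npx matrix of values in [0, 1]: the fraction of
# `maxiter` iterations each point survived before escaping.
function mandelbrot(zoom, moveX, moveY, npx; maxiter=100)
    tile = Matrix{Float64}(undef, npx, npx)
    scale = 0.5 * zoom * npx   # the tile spans roughly 2/zoom units per axis
    for col in 1:npx, row in 1:npx
        # Map the pixel to a point in the complex plane, centred on (moveX, moveY).
        x = moveX + (col - (npx + 1) / 2) / scale
        y = moveY + (row - (npx + 1) / 2) / scale
        c = complex(x, y)
        z = zero(c)
        n = 0
        # Standard escape-time iteration: z -> z^2 + c until |z| > 2.
        while n < maxiter && abs2(z) <= 4
            z = z * z + c
            n += 1
        end
        tile[row, col] = n / maxiter
    end
    return tile
end

tile = mandelbrot(50, 0.0, 0.0, 8)
println(size(tile))
```

This sketch returns plain numbers rather than colors. To save a tile as a PNG and later read HSV values from it, the numbers would still need to be converted to a color type from Images (for instance with `Gray.(tile)` or a color map of your choice).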

Then load it back in a new session:

```julia
using Distributed
@everywhere using FileTrees, FileIO, Images, .Threads, OnlineStats, Distributed

t = FileTree("mandel")

# Lazy-load each image and compute its histogram
t1 = FileTrees.load(t; lazy=true) do f
    h = Hist(0:0.05:1)
    img = FileIO.load(path(f))
    println("pid ", myid(), ", threadid ", threadid(), ": ", path(f))
    fit!(h, map(x->x.v, HSV.(img)))
end

# combine them all into one histogram using `merge` method on OnlineStats

@time h = reducevalues(merge, t1) |> exec # exec computes a lazy value
```
Plot the Histogram:

```julia
┌ ┐
0.0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100034205
0.05 ┤ 302199
0.1 ┤ 666776
0.15 ┤ 378473
0.2 ┤ 864297
0.25 ┤ 1053490
0.3 ┤ 602937
0.35 ┤ 667619
0.4 ┤ 1573476
0.45 ┤ 949928
0.5 ┤■ 2370727
0.55 ┤ 1518383
0.6 ┤■ 3946507
0.65 ┤■■ 6114414
0.7 ┤■ 4404784
0.75 ┤■■ 5920436
0.8 ┤■■■■■■ 20165086
0.85 ┤■■■■■■ 19384068
0.9 ┤■■■■■■■■■■■■■■■■■■■■■■ 77515666
0.95 ┤■■■■■■■ 23816529
└ ┘

```
This takes about 100 seconds.
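
The reduction step works because histograms over the same bins can be combined pairwise: merging per-file histograms gives the same counts as a single pass over all the data. The sketch below shows that idea in plain Julia, with no OnlineStats or FileTrees involved. `bincounts` is a hypothetical helper written for this illustration, and the random vectors stand in for per-image values.

```julia
# Count how many values fall into each bin defined by `edges`.
# Hypothetical helper, not part of OnlineStats or FileTrees.
function bincounts(values, edges)
    counts = zeros(Int, length(edges) - 1)
    for v in values
        # Index of the last edge <= v, clamped so out-of-range values
        # land in the first or last bin.
        i = clamp(searchsortedlast(edges, v), 1, length(counts))
        counts[i] += 1
    end
    return counts
end

edges = 0:0.05:1
files = [rand(100) for _ in 1:4]   # stand-ins for the values of 4 images

# "Map": one histogram per file. "Reduce": merge by element-wise addition.
per_file = [bincounts(f, edges) for f in files]
merged = reduce(+, per_file)

# Merging partial histograms matches a single pass over all the data.
@assert merged == bincounts(vcat(files...), edges)
```

Because the merge is associative, the partial histograms can be combined in any order, which is what lets the work be spread across processes and threads.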

At any point in time the whole computation holds at most 40 files in memory, because there are 40 computing elements (4 threads x 10 processes). The scheduler also takes care of freeing any memory that it knows will not be used after the result is computed. This means you can work on data that, as a whole, will not fit in memory.
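
To see how many computing elements a given session has, you can multiply the worker count by the thread count. This is a small sketch using only the Distributed standard library; it assumes every worker was started with the same number of threads as the current process, which is the case when `JULIA_NUM_THREADS` is exported before running `julia -p`.

```julia
using Distributed

# Assumption: all workers have the same thread count as this process.
slots = nworkers() * Threads.nthreads()
println("computing elements: ", slots)
```

With the settings used above (`JULIA_NUM_THREADS=4`, `julia -p 10`) this gives 4 x 10 = 40.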

<a href="https://shashi.github.io/FileTrees.jl">See the docs &rarr;</a>
