This document explains some problems that you might face when using Haskell lists and why using streams is better in those cases.
Haskell is fundamentally different from other languages. In most languages, a variable name refers to some concrete data in memory; in Haskell, a variable may refer to concrete data or it could refer to a computation or expression which when evaluated would produce the concrete data that the variable is referring to. When the value of the variable is needed, the computation is evaluated and replaced by the result.
When a Haskell program starts, the entire program is an unevaluated expression which is evaluated on-demand as needed. The top level expression is forced to produce data when a result is demanded from it for IO. Parts of the expression are reduced as needed, the reduction process can be controlled by the programmer.
When we assign a value to a name, the name refers to the unevaluated expression and not what it would produce when evaluated. For example:
>>> infiniteList = [1..]
The name infiniteList
refers to an expression which when evaluated
would generate the list. infiniteList
is not an actual physical list
but just an expression to generate the list when needed
. Because we
are not storing the actual list in memory, we can represent an infinite
list without requiring infinite amount of memory.
Let's try to get the length of this list:
>>> len = length infiniteList
Like before, the name len
refers to an expression that when evaluated
would give us the length of the list. Till now Haskell has been saying
that I duly noted how to do what you want to do but I will do it only
when you need it. Let's ask it to do that now:
>>> len
...hangs forever
Now, Haskell has no choice but to start evaluating the expression but we know it can never finish because the list is infinite.
All expressions in Haskell are unevaluated by default. Haskell evaluates the expressions only when it is absolutely required e.g. when we have to print something, or write it to a file or to network.
Let us illustrate expression evaluation in Haskell with an example. Write
this little program in a file example.hs
:
main =
let largeList = [1..1000000::Int]
n = length largeList
in print n
The expression largeList
is never evaluated to generate the entire
list at once. It is evaluated on-demand one element at a time by the
length
function. The length
function is in turn evaluated by the
print
function when it evaluates its argument n
. The print
function is evaluated by the top level program because it wants to print
n
to the console.
We can verify that the whole list is not stored in memory by looking at the memory usage of this program. Let's compile and run it:
$ ghc -O example.hs
$ ./example +RTS -s
1000000
51,480 bytes allocated in the heap
24 bytes copied during GC
44,328 bytes maximum residency (1 sample(s))
largeList
is an expression representing a million element list, but
the program takes only 44K maximum memory. The length
function
evaluates largeList
one element at a time to count them, as soon as
an element is counted it can be garbage collected as there are no more
references to it in the entire program.
Let's try this program instead:
main = do
let largeList = [1..1000000::Int]
x1 = last largeList
x2 = head largeList
in print (x1, x2)
$ ghc -O example.hs
$ ./example +RTS -s
(1000000,1)
64,051,552 bytes allocated in the heap
400 bytes copied during GC
26,233,360 bytes maximum residency (4 sample(s))
This program is showing 26MB maximum residency instead of 44K
earlier. Why is it so? To evaluate the first element of the tuple, x1
,
we need to evaluate the entire list to get to the last element. However,
the head of the list is referenced by the element x2
as well which
is yet to be evaluated and used. Therefore, we need to retain the head
until x2
is printed and freed. But head of the list is referencing the
next element in the list which in turn is referring to the next element
and so on, therefore, the entire list is retained in memory between the
evaluation of x1
and x2
.
If we change the print statement to (x2, x1)
then we do not need to
retain the entire list between the two evaluations.
When memory is unexpectedly retained in a program it is known as a "space-leak" in the Haskell parlance.
Even if there is no explicit dependency in the program forcing
evaluation of an expression, the programmer can force the evaluation of
an expression using seq
:
main = do
let largeList = [1..1000000::Int]
x1 = last largeList
x2 = head largeList
in print (x1 `seq` x2)
Similar to the example in previous section, this example also retains
the entire list in memory because seq
forces x1
to be evaluated
before x2
and then returns x2
, the evaluated list is held until x2
is
returned.
Memory retention can be avoided by changing the way we construct the expression to process the data. We can use Streamly to avoid memory retention in the above program.
Let's first write it using streamly and reproduce the same problem:
import Data.Functor.Identity (runIdentity)
import qualified Streamly.Data.Stream as Stream
import qualified Streamly.Data.Fold as Fold
main = do
let largeList = Stream.fromList [1..1000000::Int]
x1 = runIdentity $ Stream.fold Fold.latest largeList
x2 = runIdentity $ Stream.fold Fold.one largeList
in print (x1, x2)
This program has the same behavior as the list based program, and for the same reasons.
$ ghc -O2 example.hs
$ ./example +RTS -s
(Just 1000000,Just 1)
64,052,520 bytes allocated in the heap
400 bytes copied during GC
26,233,360 bytes maximum residency (4 sample(s))
However, we can compose the two folds using the teeWith
combinator
such that we distribute a each element of the list to both the folds
simultaneously, thus avoiding the retainment of the entire list:
import Data.Functor.Identity (runIdentity)
import qualified Streamly.Data.Stream as Stream
import qualified Streamly.Data.Fold as Fold
main = do
let largeList = Stream.fromList [1..1000000::Int]
x = Fold.teeWith (,) Fold.latest Fold.one
in print $ runIdentity $ Stream.fold x largeList
$ ghc -O2 example.hs
$ ./example +RTS -s
(Just 1000000,Just 1)
64,052,736 bytes allocated in the heap
280 bytes copied during GC
44,328 bytes maximum residency (2 sample(s))
This program uses constant small memory footprint (44K) irrespective of the order of the elements in the tuple. When an element is generated by the list it is supplied to both the folds before generating the next element, therefore, it can be immediately freed.
Streamly covers all the functionality of standard Haskell lists but in addition provides powerful composition tools for better efficiency.