Advantages of Being Lazy by Owein Reese

Lazy data structures have many advantages over classically eager ones. Strictly lazy structures are not widely used in imperative code because they compose poorly with exception handling and mutable state. Despite these difficulties, they are a compelling tool for solving a large class of problems.

This post will highlight two of those benefits: a smaller memory footprint and a lower overall computational cost. It will begin by defining a common, basic data structure, a Tree. After implementing two methods on that Tree, map and exists, it will use them to solve a problem. Finally, using a lazy version of the tree, the differences in cost and benefit will be shown.

A Tree

Instead of using a tree as a container for other objects, as in many algorithms and data structures courses, this tree will express a hierarchy of values. Let us suppose that all hierarchical trees have the following interface:
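The original listing is not shown here, so the following is a minimal sketch of that interface in Scala, with signatures reconstructed from the surrounding discussion:

    trait Tree[A] {
      // Transform every value in the hierarchy, preserving its shape.
      def map[B](f: A => B): Tree[B]

      // True if any value in the hierarchy satisfies the predicate.
      def exists(p: A => Boolean): Boolean
    }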

which exposes just two methods: map and exists.

We can implement an arbitrarily sized hierarchical tree by using a List to let each level vary in the number of children it has:
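A sketch of such an eager implementation, assuming the Tree interface above:

    case class EagerTree[A](value: A, nodes: List[EagerTree[A]]) extends Tree[A] {
      // Applies f to this value and, recursively, to every child: N calls for N nodes.
      def map[B](f: A => B): EagerTree[B] =
        EagerTree(f(value), nodes.map(_.map(f)))

      // Short-circuits on the first value satisfying p.
      def exists(p: A => Boolean): Boolean =
        p(value) || nodes.exists(_.exists(p))
    }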

Here, each action performed is evaluated immediately upon invocation.

A Problem Statement

Suppose that we want to represent an extended family whose fortune has swelled but whose net worth is tied up in a complex set of independent and interwoven investment vehicles. Each member of the family is entitled to a certain number of votes based upon some ownership of shares. The board of one of the companies has called for a vote, and a large outside investor is attempting to influence the outcome.

Being an efficient and careful activist, this investor would first like to determine whether a single family member has enough voting power to swing the majority. If such a member exists, the investor will spend all of his energy lobbying that person. If not, the investor will divide his time based on the number of votes:
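A sketch of how that strategy might be written against the Tree interface; Member, votingPower and the one-half threshold are assumptions made for illustration:

    final case class Member(name: String, shares: Long)

    // Stand-in for an expensive calculation of a member's voting weight.
    def votingPower(m: Member, totalShares: Long): Double =
      m.shares.toDouble / totalShares

    def lobbyingPlan(family: Tree[Member], totalShares: Long): String = {
      val weights = family.map(m => votingPower(m, totalShares))
      if (weights.exists(_ > 0.5)) "lobby the majority holder alone"
      else "divide time in proportion to " + weights
    }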

Modeled like this with the eager tree defined above, the worst case requires evaluating every single family member's voting rights twice: once for the expensive calculation and once for the predicate condition. Even the best case with eager evaluation requires every family member to be evaluated at least once, N expensive calculations in all, though the predicate is applied just once.

To improve upon these bounds and model the outside investor more accurately as an efficient actor, we would need another implementation. We could rewrite the function to use some form of short-circuiting accumulation, but that route leads to increased implementation complexity and decreased maintainability. Or we could use a lazy data structure and keep the function exactly as written above.

A Lazy Tree

The relationship between an eagerly evaluated tree and a lazy tree is much the same as that between map and exists on an eager data structure. Exists is a short-circuiting function, applying its predicate only as many times as needed until a positive result is found, while map will call its function N times for N objects. On a lazy data structure, even map is delayed until the first time a result is requested.

We can define a lazy tree as follows:
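The original listing is not reproduced here; below is a minimal sketch in which a Thunk is represented as a plain () => A function:

    case class LazyTree[A](value: () => A, nodes: Stream[LazyTree[A]]) extends Tree[A] {
      // Forces the value Thunk once and caches the result.
      lazy val head: A = value()

      // Defers f inside a new Thunk; nothing expensive runs yet.
      def map[B](f: A => B): LazyTree[B] =
        LazyTree(() => f(head), nodes.map(_.map(f)))

      // Forces values one at a time, stopping at the first match.
      def exists(p: A => Boolean): Boolean =
        p(head) || nodes.exists(_.exists(p))
    }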

wherein the call to head is preferable to repeated invocations of value if it is to be requested multiple times, since head forces the Thunk once and caches the result. Notably, the definition of LazyTree is nearly identical to EagerTree save for the use of a Stream instead of a List for nodes and of Thunks for member values.

While cosmetically small, the difference is programmatically a powerful distinction. The actual calculation of value is delayed until it is explicitly requested, and so is any calculation using nodes. To understand why, we need only study the implementation of Stream's cons cell and its map method, copied with some liberty below:
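This is a simplified rendition rather than the library source, but it preserves the essential shape: the tail is passed by name and cached, so it behaves as a memoized Thunk.

    sealed trait MiniStream[+A] {
      def map[B](f: A => B): MiniStream[B] = this match {
        case MiniNil => MiniNil
        case cons: MiniCons[A] =>
          // One new cell now; the rest of the mapping hides in the tail Thunk.
          new MiniCons(f(cons.head), cons.tail.map(f))
      }
    }

    case object MiniNil extends MiniStream[Nothing]

    final class MiniCons[+A](val head: A, tl: => MiniStream[A]) extends MiniStream[A] {
      lazy val tail: MiniStream[A] = tl // forced at most once, then cached
    }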

So defined, calling map on a Stream of N items creates a Stream of one element whose tail is wrapped in a Thunk. Iterating over each element creates yet another single cons cell with a delayed realization of its tail; hence the term lazy evaluation.

Like the eager tree, the worst case for the problem statement is the evaluation of each family member twice. The best case, on the other hand, is the evaluation of a single family member, once. Moreover, in the best case the memory footprint shrinks to just two Thunks (one for the value and one for the nodes) instead of an entire tree of weighted votes, saving space as well as time.

Conclusion

Efficiency means doing no more than the minimal work required. The lazy tree clearly accomplishes this by putting off as much as possible until explicitly asked. As shown in this post, a direct, easy-to-understand algorithm was made both time and space efficient by switching from an eager data structure to a lazy one, without sacrificing readability.

The key to the improvement was the potential need to retain intermediate values for some, but not all, branches of logic. In situations like that, lazy structures shine. However, it would be remiss not to caution you, the reader: lazy structures are not always ideal. As with anything, there are trade-offs. Lazy structures can induce hard-to-understand performance profiles and, in multi-threaded applications coded with persistence in mind, lock-based overheads (look into how Scala's lazy val is implemented).

That said, knowing about lazy structures and the situations where they help arms you with another tool in the toolbox. The next post will cover another use case for lazy structures, one that leads to patterns that would be impossible to write with eager structures.

Type Soup by Owein Reese

Type soup is an expression I once heard a coworker say that has stuck with me over the course of several years. Said coworker had originally worked in dynamically typed languages like R, Julia and Python but had switched to Scala for large scale numerical analysis. The phrase was uttered in reaction to seeing simple code sprout so many type parameters that it was no longer simple to reason about.

The difficulty of watching agreeable code get cluttered with type parameters, even when they are used to good effect, never goes away. To counter this, a good rule of thumb is to allow no more than one of the following in any method signature or object constructor: a phantom type, a free type parameter within an implicit, or a compound type. That is to say, you should choose exclusively one of them rather than one of each in any signature.

This is a bit more restrictive than Twitter's style guide but better than typical coding standards, which don't attempt to set any boundaries. Types should be there to guide, not confuse. For instance, which of the following is easier to understand?
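The original snippets are not reproduced here, so the following pair is a reconstruction in the same spirit; KVStore and the second signature are inventions for illustration:

    // Two type parameters, both named after the domain they model.
    trait KVStore[Key, Value] {
      def get(key: Key): Option[Value]
      def put(key: Key, value: Value): KVStore[Key, Value]
    }

    // Type soup: every concept generalized away at once.
    trait Muddle {
      def apply[F[_], A, B <: A, Ev[_, _]](fa: F[B])(implicit ev: Ev[F[B], A]): F[A]
    }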

Without me telling you, I bet you could deduce that the Key and Value type parameters relate to some sort of key-value store. I have no doubt in my mind that you can't tell me what the other bit of code is doing without either knowing the context or seeing the implementation.

This is what type soup has come to mean to me: an utterly abstract bit of code straining to tie together several different underlying representations, or even concepts, by generalizing over the type signatures of methods and/or objects. Please don't do this. You are the only one who will ever benefit from your "clever" code.