Speeding Up Compilation Time with scalac-profiling

Monday 4 June 2018

Jorge Vicente Cantero

Notice (2024): Since this blog post was published, a documentation site for scalac-profiling was created at https://scalacenter.github.io/scalac-profiling/. Some of the details in this post may now be out of date, but much of the information is still relevant.

Introduction (2018)

In this blog article, I’d like to introduce you to a new tool that we’ve developed at the Scala Center called scalac-profiling, which aims to help you better understand what is slowing down compilation in your project.

It turns out that the use of Scala’s implicits as well as macros can greatly increase compilation times, depending on how they are used and how your codebase is organized. Codebases that make use of libraries like Shapeless, or which rely on typeclass derivation, may be particularly prone to these slow-downs. As of Scala 2.12, typeclass derivation is based on implicitly-triggered macro expansions.

The goal of this blog post is to help you understand when these things are happening in your code, so you can remove code that triggers unnecessary implicit searches or macro expansions.

In this blog post, I walk you through how to reduce these compile times with scalac-profiling. scalac-profiling is a new Scala Center compiler plugin to complement my recent work on the compiler statistics infrastructure merged in 2.12.5. I use the plugin to speed up the compile times of a Bloop module by 8x.

The analysis and optimizations presented here can be applied to any other Scala project that makes heavy use of typeclass deriving code, implicits, and/or macros.

After reading the blog post, you should understand:

  • How to use scalac-profiling to analyze the impact of implicits and macros on compile times.
  • Why typeclass deriving or Shapeless-based code is prone to slow compilation times if not used with care.
  • How implicit search and macros interact in unexpected ways that can hurt developer productivity and how you can optimize their interaction.

The most important take-away from this guide is that you should not take slow Scala compile times for granted. It’s worth investigating why slow compiles happen, as it is often possible to fix our projects to compile faster!

Pointers for reading the article

If you want to apply the same technique in your Scala projects or you have no previous knowledge of the reasons why automatic typeclass derivation is slow, reading the whole blog post is highly recommended.

If you’re already familiar with the source of inefficiencies or you don’t have much time, jump directly to the detective work that digs into the profiling data.

This is a long blog post. Put on your profiling hat and let’s get our hands dirty!

The codebase

Bloop is a build-tool-agnostic compilation server with a focus on developer productivity that I developed at the Scala Center together with Martin Duhem. It is a compilation server that integrates with sbt and gives you ~20-25% faster compilation times than sbt.

At around 10,000 lines of Scala code, it has three main submodules:

  1. jsonConfig: the module that defines the JSON schema of the configuration files.
  2. backend: the module that defines the compiler-specific data structures and integrations.
  3. frontend: the high level code that defines the internal task engine and nailgun integration.

The first two modules are lightweight and fast to compile. With a hot compiler, their batch compilations take 2 or 3 seconds. However, compiling frontend is more than 20x slower. This slowness is surprising given that frontend is only about 6000 LOC, so in theory it should compile in ~4 seconds.

frontend depends on case-app, a command-line parsing library for Scala that uses Shapeless. This library is excellent, but slows our compilation down to 30 seconds.

Waiting 30 seconds for a change to take effect (even under incremental compilation) is a no-go. It may not seem like much, but this kind of wait kills productivity and gets me out of the zone. That affects my decision-making process a great deal.

In the past, I’ve also noticed that a slow workflow discourages me from adding complete test suites (the more tests I add, the longer I have to wait for compiles) or experimenting in the code. That has made my experience as a Scala developer less pleasant.

That deserves putting some time aside to find out how we can make Bloop compile faster.

The setup and workflow

To profile Bloop compilation times, we use Bloop itself to make sure that we preserve hot compilers across all runs. You should be able to replicate the results with sbt too, but make sure that every time you reload the build you warm up the compiler at least 10 times.

To set up Bloop as a user, follow the installation instructions on our website. You only need to install Bloop and start the server. You then clone the Bloop codebase and generate the configuration files.

git clone https://github.com/scalacenter/bloop
cd bloop
git checkout v1.0.0-M10
git submodule update --init
sbt bloopInstall

After that, run bloop projects in the base directory to list the projects in your build.

Compiling the codebase

All we’ll do in the next sections is compile the codebase several times and see how the compilation times behave after applying our changes.

I recommend cleaning and compiling frontend sequentially at least 10 times to get a stable hot compiler. Every change we make to the codebase from now on will require a full compile (running clean before compile) to get stable implicit search and performance results. This methodology will simplify reading and interpreting the results without having to account for what incremental compilation is doing.

Warm up the compiler

for i in {1..10}; do
  echo "Warming up the compiler; iteration $i"
  bloop clean frontend
  bloop compile frontend
done

The profiling toolkit and theory

The first step in analyzing your compilation times is to set your intuitions aside. We’re going to look at the raw compiler data with fresh eyes and see where that leads us.

If you try to validate previously-formed assumptions, it’s likely you’ll be misled by the data. I’ve been there, so don’t fall into the same trap.

Profiling compilation times requires dedicated tools. There isn’t much we can get from profilers like YourKit or Java Flight Recorder because they show the result of the inefficiencies, not the cause.

There are cases when knowing the hot methods, inspecting the heap or studying GC statistics is useful. I’ve used this data in the past to find and fix inefficiencies in the compiler. However, this guide is only concerned with the misuse of language features, and so we need to take a higher-level profiling approach.

Compiler statistics

The compiler has built-in support for statistics from 2.12.5 on. This work resulted from a Scala Center Advisory Board proposal about compiler profiling. Morgan Stanley, the author of the document, proposed that the Scala Center develop tools to help diagnose compilation bottlenecks.

I was interested in this topic and so the proposal was assigned to me. My work on the compiler revolved around fixing the broken implementation of statistics in 2.11.x, creating better profiling tools and reducing the instrumentation overhead in the statistics engine.

Compiler statistics have both timers and counters that record data about expensive compiler operations like subtype checks, finding members, implicit searches, macro expansion, class file loading, et cetera. This data is the perfect starting point to have a high-level idea of what’s going on.

Setting statistics up

Enable compiler statistics by adding the -Ystatistics compiler flag to the project you want to benchmark.

Note that you need to use Scala 2.12.5 or above. I highly recommend using the latest version. At the time of this writing, that’s 2.12.6.

Add the compiler flag to the options field inside the .bloop/frontend.json configuration file. When you save, Bloop will automatically pick up your changes and add the compiler option without needing a reload.

If you use sbt, add scalacOptions in Compile += "-Ystatistics" to your project settings. If you want to profile tests, scope it to Test instead of Compile.
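
For instance, a minimal build.sbt fragment could look like the following sketch (the project name here is hypothetical):

lazy val frontend = project
  .settings(
    // Print compiler statistics for main sources...
    scalacOptions in Compile += "-Ystatistics",
    // ...and for test sources, if you profile tests too
    scalacOptions in Test += "-Ystatistics"
  )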

Run bloop compile frontend -w --reporter scalac (we use the default scalac reporter for clarity) and have a look at the data. The output of the compilation will be similar to this log. Check the end of it. You should see a report of compilation time spent per phase.

*** Cumulative timers for phases
#total compile time      : 1 spans, ()32545.975ms
  parser                 : 1 spans, ()65.017ms (0.2%)
  namer                  : 1 spans, ()42.827ms (0.1%)
  packageobjects         : 1 spans, ()0.187ms (0.0%)
  typer                  : 1 spans, ()27432.596ms (84.3%)
  patmat                 : 1 spans, ()1169.028ms (3.6%)
  superaccessors         : 1 spans, ()36.02ms (0.1%)
  extmethods             : 1 spans, ()3.548ms (0.0%)
  pickler                : 1 spans, ()9.449ms (0.0%)
  xsbt-api               : 1 spans, ()159.278ms (0.5%)
  xsbt-dependency        : 1 spans, ()94.846ms (0.3%)
  refchecks              : 1 spans, ()627.633ms (1.9%)
  uncurry                : 1 spans, ()408.305ms (1.3%)
  fields                 : 1 spans, ()414.151ms (1.3%)
  tailcalls              : 1 spans, ()38.455ms (0.1%)
  specialize             : 1 spans, ()184.562ms (0.6%)
  explicitouter          : 1 spans, ()80.488ms (0.2%)
  erasure                : 1 spans, ()624.472ms (1.9%)
  posterasure            : 1 spans, ()63.249ms (0.2%)
  lambdalift             : 1 spans, ()125.944ms (0.4%)
  constructors           : 1 spans, ()47.109ms (0.1%)
  flatten                : 1 spans, ()46.527ms (0.1%)
  mixin                  : 1 spans, ()59.808ms (0.2%)
  cleanup                : 1 spans, ()42.336ms (0.1%)
  delambdafy             : 1 spans, ()47.771ms (0.1%)
  jvm                    : 1 spans, ()714.008ms (2.2%)
  xsbt-analyzer          : 1 spans, ()5.175ms (0.0%)

The report suggests that about 84.3% of the compilation time is spent on typer. This is an unusually high value. Typechecking a normal project is expected to take around 50-70% of the whole compilation time.

If you have a higher number than the average, it most likely means you’re pushing the typechecker hard in some unexpected way, and you should keep exploring.

Walking into the lion’s den

Now that the data signals a bottleneck in typer, let’s keep our statistics log short and enable -Ystatistics:typer. That will report only statistics produced during typing.

We then run compilation again. The logs contain information about timers and counters of several places in the typechecker. These timers and counters help you know how you’re stressing the compiler.

If the compilation of your program requires an unusual amount of subtype checks, time spent in <:< will be high. There are no normal values for subtype checks (the time spent here depends on many factors), but an abnormal value would be anything above 15% of the whole typechecking time.

The first thing we notice when studying the logs is that typechecking frontend takes 28 seconds. We also see some unusual values for the following counters:

#class symbols             : 1842246
#typechecked identifiers   : 134734
#typechecked selections    : 225020
#typechecked applications  : 82421

The Scala compiler creates almost two million class symbols (!) and typechecks 134,734 identifiers, 225,020 selections and 82,421 applications. Those are pretty high values. That raises the question: why are we creating so many classes?

Next, we check time spent in common typechecking operations:

time spent in lubs         : 67 spans, ()63.194ms (0.2%) aggregate, 16.29ms (0.1%) specific
time spent in <:<          : 1548620 spans, ()1791.068ms (6.5%) aggregate, 1583.94ms (5.8%) specific
time spent in findmember   : 873498 spans, ()638.792ms (2.3%) aggregate, 592.663ms (2.2%) specific
time spent in findmembers  : 0 spans, ()0.0ms (0.0%) aggregate, 0.0ms (0.0%) specific
time spent in asSeenFrom   : 2541823 spans, ()1299.199ms (4.7%) aggregate, 1238.814ms (4.5%) specific

time spent in lubs should be high whenever you use lots of pattern matching or if expressions, and the compiler needs to lub them (find the common type of a sequence of types, also called the “least upper bound”). Eugene Yokota explains it well in this well-aged blog post.
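
As a tiny example of the kind of code that triggers lub computations:

// Without an expected return type, scalac must lub the two branches:
def firstChar(s: String) =
  if (s.isEmpty) None      // None.type
  else Some(s.charAt(0))   // Some[Char]
// The lub of None.type and Some[Char] is Option[Char]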

time spent in findmember and its sister time spent in findmembers will show up high in the profiles whenever you have deep class hierarchies and lots of overridden methods.

time spent in asSeenFrom is high whenever your code makes heavy use of dependent types, type projections or, more generally, abstract types.
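
For instance, here is a made-up example of the kind of code that exercises asSeenFrom:

// Path-dependent types must be rewritten "as seen from" their prefix:
trait Graph {
  type Node
  def roots: List[Node]
}

// To type g.roots.head, scalac rewrites List[Node] (as declared in
// Graph) to List[g.Node] via asSeenFrom.
def firstRoot(g: Graph): g.Node = g.roots.head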

In the case of frontend, the durations of all these operations are reasonable, which hints that the inefficiency lies elsewhere.

In most cases, these timers are unlikely to be high when typechecking your program. If they are, try to figure out why and file a ticket in the Scala compiler’s issue tracker so that either I or the Scala team can look into it.

The troublemaker

Most projects that suffer from slow compilation times abuse or misuse either macros (for example, inefficient macro implementations that do a lot of typecheck/untypecheck), implicit searches (for example, misplaced implicit instances that take too long to find) or a combination of both.

It’s difficult to miss how long macro expansion and implicit searches take in the compilation of frontend, and how the values seem to be highly correlated.

time spent implicits   : 33609 spans, ()26808.491ms (97.7%)
  successful in scope  : 346 spans, ()71.931ms (0.3%)
  failed in scope      : 33263 spans, ()3195.452ms (11.6%)
  successful of type   : 18286 spans, ()26730.255ms (97.4%)
  failed of type       : 14977 spans, ()17370.235ms (63.3%)
  assembling parts     : 18647 spans, ()374.562ms (1.4%)
  matchesPT            : 136322 spans, ()505.763ms (1.8%)
time spent macroExpand : 44445 spans, ()26451.132ms (96.4%)

This is a red flag. The compiler triggers around 44,500 macro expansions (!) and almost all of that macro expansion time happens inside implicit searches. We have our troublemaker.

An initial exploration of the data

How do we know which implicit searches are the most expensive? What are the macro expansions that dominate the compile time?

The data we get from -Ystatistics doesn’t help us answer these questions, even though they are fundamental to our analysis. As users, we treat macros as black boxes (mere building blocks of our library or application) and now we need to unravel them.

A profiling plugin for scalac

To answer the previous questions, we’re going to use scalac-profiling, a compiler plugin that exposes more profiling data to Scala developers.

I wrote the plugin with three goals in mind:

  • Expose a common file format that encapsulates all the compilation profiling data, called profiledb.
  • Use visual tools to ease analysis of the data (e.g. flamegraphs).
  • Allow third parties to develop tooling to integrate this data in IDEs and editors. There is a rough vscode prototype working.

The compiler plugin hooks into several parts of the compiler to extract information related to implicit search and macro expansion. This data will prove instrumental to understand the interaction between both features.

Install scalac-profiling by fetching the 1.0.0 release.

> $ coursier fetch --intransitive ch.epfl.scala:scalac-profiling_2.12:1.0.0
https://repo1.maven.org/maven2/ch/epfl/scala/scalac-profiling_2.12/1.0.0/scalac-profiling_2.12-1.0.0.jar
  100.0% [##########] 4.1 MiB (2.1 MiB / s)
/home/jvican/.coursier/cache/v1/https/repo1.maven.org/maven2/ch/epfl/scala/scalac-profiling_2.12/1.0.0/scalac-profiling_2.12-1.0.0.jar

Then open frontend’s bloop configuration file and add the following compiler options to the options field. Note that -Xplugin contains the $PATH_TO_PLUGIN_JAR variable, which you must replace with the artifact path resolved by coursier. Replace $BLOOP_CODEBASE_DIRECTORY with the base directory of the cloned bloop repository.

  "-Ystatistics",
  "-Ycache-plugin-class-loader:last-modified",
  "-Xplugin:$PATH_TO_PLUGIN_JAR",
  "-P:scalac-profiling:no-profiledb",
  "-P:scalac-profiling:show-profiles",
  "-P:scalac-profiling:sourceroot:$BLOOP_CODEBASE_DIRECTORY"

The flags -Ycache-plugin-class-loader and -Xplugin set up the compiler plugin: the latter loads the plugin jar, while the former caches the plugin’s classloader so that it’s not recreated on every compile.

The flag -P:scalac-profiling:no-profiledb disables the generation of profiledbs and -P:scalac-profiling:sourceroot tells the plugin the base directory of the project. The profiledb is only required when we process the data with other tools, so by disabling it we keep the overhead of the plugin to the bare minimum.

The flag -P:scalac-profiling:show-profiles displays the following data in the compilation logs:

  • Implicit searches by position. Useful to know how many implicit searches were triggered per position.
  • Implicit searches by type. Essential data to know how many implicit searches were performed for a given type and how much time they took.
  • Repeated macro expansions. An optimistic counter that tells us how many of the macros returned the same stringified AST nodes and could therefore be cached across all use-sites (in the macro implementation).
  • Macro data in total, per file and per call-site. The macro data contains how many invocations of a macro were performed, how many AST nodes were synthesized by the macro and how long it took to perform all the macro expansions.

The profiling logs will be large, so make sure your terminal’s buffer is large enough to browse through them.

When you’ve added all the compile options to the configuration file and saved it, the next compilation will output a log similar to this one. This is the profiling data we’re going to dig into.

The first visual

First thing you notice from the data: compilation time has gone up. Don’t worry, you haven’t done anything wrong.

#total compile time   : 1 spans, ()46169.708ms

Compiler plugins add overhead so the increased compilation time is expected. In particular, the cost of scalac-profiling is high since it instruments key parts of typer via compiler hooks. Remember that the cost will disappear as soon as you remove the plugin from the bloop configuration file.

The first thing we need is to get a visual of the implicit searches. To do that, we’re going to create an implicit search flamegraph. Grep for the line “Writing graph to” in the logs to find the .flamegraph file containing the data.


To generate a flamegraph, clone scalacenter/scalac-profiling, cd into FlameGraph, git submodule update --init (note that this command will require you to have ssh authentication configured with your GitHub account) and run the following script in the repository:

./flamegraph.pl \
    --hash --countname="μs" \
    --color=scala-compilation \
    $PATH_TO_FLAMEGRAPH_DATA > bloop-profile-initial.svg

You can then visualize it with $BROWSER bloop-profile-initial.svg.


After we’re all set up, we’ll get an SVG file that looks like this:

Initial flamegraph of implicit search in `frontend`

(The flamegraph in the blog post is a PNG image. You can check the SVG by opening the image in a new tab. SVG images let you hover over every stack, search through the stack entries and check the compilation times of every box.)

We finally have a visual of all the implicit searches our program is doing, and what their dependencies look like. But before we dig further into what the graph represents, let’s take a slight detour and learn about common implicit and macro usage patterns and the contexts in which they are used.

This background information will help us read the flamegraph.

Typeclass derivation for the win

Typeclass derivation is a process that synthesizes typeclasses from other types. The process can be manual (you define an Encoder for every node of your GADT) or automatic (the Encoder derivation happens at compile time, i.e. the compiler generates the code for you).

There are two families of compile-time typeclass derivation, sketched in the example after this list:

  • Automatic: to derive a typeclass for a given type T, the compiler will materialize any typeclass type T needs if it’s not in scope.
  • Semi-automatic: to derive a typeclass for a given type T, all the types T depends on have to have a derived typeclass in scope.
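
To make the difference concrete, here is a sketch with a dummy Encoder typeclass; deriveEncoder is a stand-in for a macro-backed derivation entrypoint, not a real API:

trait Encoder[T]
def deriveEncoder[T]: Encoder[T] = new Encoder[T] {}

case class Address(street: String)
case class Person(name: String, address: Address)

// Semi-automatic: every intermediate instance is defined once and reused.
implicit val addressEncoder: Encoder[Address] = deriveEncoder[Address]
implicit val personEncoder: Encoder[Person] = deriveEncoder[Person]

// Automatic: a bare implicitly[Encoder[Person]] would also materialize
// Encoder[Address] behind the scenes, and redo that work at every use-site.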

Typeclass derivation is popular in the Scala community. A few libraries (for example, scalatest) define their own macros to synthesize type classes. The most common approach, though, is to use Shapeless to guide the type derivation on the library side, which removes the need for extra macros.

Shapeless is a generic programming library that defines some basic building blocks (macros) to enable typelevel computations. These computations are driven by implicit search and happen at compilation time. Shapeless is a popular library for automatic typeclass derivation because it can find out the generic representation of any sealed trait/case class in your program. So when the implicit search needs an instance that doesn’t exist in scope, macros materialize it.

The compilation of frontend does automatic typeclass derivation via case-app, which depends on Shapeless. case-app derives a caseapp.core.Parser for a GADT defining the commands and parameters that your command line interface accepts. This derivation relies on the Lazy, Strict, Tagged and LabelledGeneric macros, as well as other Shapeless data structures like Coproduct and HList.

These are normal dependencies of any library that uses Shapeless to guide typeclass derivation.

The cost of implicit macros

Automatic and semi-automatic typeclass derivation use macro definitions defined as implicit to guide the typeclass derivation at compile-time. For example, every time you derive a typeclass, say Encoder, for a type, you derive it inductively for every element of its generic representation (an HList or Coproduct).
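
To make the induction concrete, here is a sketch using the dummy Encoder from the previous section and Shapeless’s HList building blocks:

import shapeless.{::, HList, HNil}

// Base case: the empty HList.
implicit val hnilEncoder: Encoder[HNil] = new Encoder[HNil] {}

// Inductive step: deriving Encoder[H :: T] demands Encoder[H] and
// Encoder[T], so the search recurses element by element down to HNil.
implicit def hconsEncoder[H, T <: HList](
    implicit head: Encoder[H],
    tail: Encoder[T]
): Encoder[H :: T] = new Encoder[H :: T] {}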

But how can macro definitions be implicit and what are the consequences of that?

import scala.language.experimental.macros
import scala.reflect.macros.blackbox

implicit def foo[T](p: T): Foo[T] = macro fooImpl[T]

// Undefined macro implementation for simplicity
def fooImpl[T: ctx.WeakTypeTag]
  (ctx: blackbox.Context)(p: ctx.Tree): ctx.Tree = ???

The code above defines an implicit def that synthesizes a Foo[T] for a given type T. The code generation only depends on p and the type T, so there are no functional dependencies. It is a dummy blackbox macro.

When macros like foo are defined in a library and they are eligible for an implicit search of type T, the compiler goes through the list of all candidates based on the priority of implicit search and gets the first non-ambiguous match. If the match is a macro like foo, the macro is expanded and the code inlined at the call-site.

This algorithm is correct but problematic for macros. The compiler will always expand macros that are eligible for the implicit search, even if the resulting trees are thrown away.

On top of that, if several macros are candidates for an implicit search in the same implicit scope, all of them will be expanded because the compiler needs to check for ambiguity of implicit instances.

The efficiency of this process worsens when whitebox macros are used. For the sake of this blog post, let’s think of a whitebox macro as a blackbox macro that can redefine the type of its enclosing definition.

Whitebox macros are powerful and that makes them more expensive than blackbox macros: they are typechecked three times by the Scala macro engine.
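
To see the difference in code, here is a sketch (with the implementations elided):

import scala.language.experimental.macros
import scala.reflect.macros.{blackbox, whitebox}

trait Foo[T] { type Out }

// Blackbox: call-sites only ever see the declared type Foo[T].
implicit def blackFoo[T]: Foo[T] = macro blackImpl[T]
def blackImpl[T: c.WeakTypeTag](c: blackbox.Context): c.Tree = ???

// Whitebox: the expansion may refine the declared type, e.g. to
// Foo[T] { type Out = String }, and the call-site is typechecked
// against that more precise type.
implicit def whiteFoo[T]: Foo[T] = macro whiteImpl[T]
def whiteImpl[T: c.WeakTypeTag](c: whitebox.Context): c.Tree = ???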

All kinds of macros eligible for implicit search pose a threat to compile times and so they need to be used with care.

The world of Shapeless

Shapeless defines 28 whitebox macros. The most common whitebox Shapeless macros are Generic, Lazy, Nat, Default and polymorphic function values Poly. These are heavyweight macros that are common in many Scala projects.

The main problem with these macros is that they are used heavily in automatic typeclass derivation. In that context, the compiler commonly repeats the materialization of implicit instances. This is the main source of inefficiencies.

Travis Brown explains these inefficiencies well, at a higher level, in this talk about Generic Derivation at Scalaworld.

Once a macro is triggered because an implicit doesn’t exist in the scope of the call-site, the implicit search needs to materialize all the functional dependencies (all the implicits that are required for another implicit to be eligible) together with the implicits in scope.

All these materialized instances cannot be shared across different implicit searches since they don’t exist at the call-site. Once the macro is triggered, the code is expanded and typechecked, and nothing can be reused because the macro code doesn’t have access to previous expansions.

Quick derivation example

sealed trait Base
case class Foo(xs: List[String]) extends Base
case class Bar(xs: List[String], i: Int) extends Base

implicitly[Encoder[Foo]]
implicitly[Encoder[Bar]]

For example, the code above illustrates how a hypothetical Encoder typeclass would need to materialize an Encoder[List[String]] twice, since that type appears in both definitions. The second call cannot detect that the first one synthesizes Encoder[List[String]] because, for all intents and purposes, there isn’t an implicit instance in scope.

The same happens in nested implicit searches, and it worsens as every step of the inductive process uses and requires more functional dependencies generated by macros and nested types.

Remember how frontend was expanding 44,500 macros and how intense typechecking was? Well, now we have a faint idea why.

The quest for optimization

We now have a clearer idea of what we’re after. caseapp.core.Parser is just a standard typeclass that is automatically materialized by Shapeless recursively.

The profiling data reveals that the majority of the time is spent in macro expansion and implicit searches, so it’s likely we’ll see some of the repeated materialized instances we discussed in the previous section.

Let’s look at the profiling data again.

“Implicit searches by position” and “Macro data per file” tell us that almost all of the work happens in a file called CliParsers.


object CliParsers {
  // Stubs to simplify reading the code
  implicit val inputStreamRead: ArgParser[InputStream] = ???
  implicit val printStreamRead: ArgParser[PrintStream] = ???
  implicit val pathParser: ArgParser[Path] = ???

  implicit val completionFormatRead: ArgParser[Format] = ???
  implicit val propParser: ArgParser[PrettyProperties] = ???

  import caseapp.core.{Messages, Parser}
  val BaseMessages: Messages[DefaultBaseCommand] =
    Messages[DefaultBaseCommand]
  val OptionsParser: Parser[CliOptions] =
    Parser.apply[CliOptions]

  import caseapp.core.{CommandsMessages, CommandParser}
  val CommandsMessages: CommandsMessages[RawCommand] =
    CommandsMessages[RawCommand]
  val CommandsParser: CommandParser[RawCommand] =
    CommandParser.apply[RawCommand]
}

The file defines specific parsers for data structures that frontend defines, creates an instance of caseapp.core.Messages, a parser for cli options, and then two instances of CommandsMessages and CommandsParser.

The two lines that define CommandsMessages and CommandsParser are the ones that dominate the compilation time.

CommandsParser is creating a parser for all the commands defined in the RawCommand GADT, which you can find in this source file. RawCommand has eight subclasses (commands) that specify the inputs required for every CLI invocation. Each of these commands defines a @Recurse field that allows Shapeless to reuse the parser for CliOptions. CliOptions in turn requires the parser of CommonOptions in the same fashion. Materializing this parser takes almost 14 seconds.

CommandsMessages does a similar thing but instead of materializing a parser it materializes a class with a map of all the commands and information about the parameters (fields) it takes. The materialization of Messages takes around 13 seconds.

Let’s come back to the flamegraph. We’re now ready to continue our exploration.

Reading the implicit search flamegraph

Initial flamegraph of implicit search in `frontend`

The flamegraph has three colors. Every color has a meaning.

  1. Green: a successful implicit search whose result didn’t come from a macro (the normal case).
  2. Blue (aqua): a successful implicit search whose result came from a macro.
  3. Red: a failed implicit search that triggered at least one macro.

Every implicit search in the graph has some metadata at the end of the title. Depending on the color, we can find:

  1. Implicit search id: a number to identify an implicit search and inspect its result tree via -P:scalac-profiling:print-search-result:$SEARCH_ID.
  2. The number of macro expansions triggered by an implicit search. This number only covers the direct macro expansions (not the transitive ones).
  3. If the result tree comes from a macro, the macro location that expanded it.

Example:

shapeless.Strict[caseapp.core.Parser[bloop.cli.Commands.Run]] (id 12121) (expanded macros 3) (tree from `shapeless.LazyMacrosRef.mkStrictImpl`)  (417,117 μs, 3.28%)

Every stack entry also carries timing information. The unit of time is microseconds, so one million μs is one second. We use microseconds because flamegraphs cannot display decimal values and we want to lose as little time precision as possible.

Beware that an implicit search may not appear in the flamegraph even if it’s performed by scalac. Some implicit searches are so fast that they take less than 1 μs, and flamegraphs do not show entries whose value rounds down to 0.

We’re not going to use all of this information in the blog post, but it may come in handy when you do your own research. Check the rest of the supported compiler plugin flags in the code.

After this short intro, let’s delve into the data. The first thing that struck me is how similar all the towers of implicits look (both in shape and duration). If we hover over the chunks, the repetition becomes obvious; most bite-sized chunks materialize either Parser[CommonOptions] or Parser[CliOptions], depending on the height of the implicit branch we look at.

This makes sense. After all, we’re not caching these implicits in the call sites. Let’s cache them before the materialization of CommandsMessages and CommandsParser.

implicit val coParser: Parser.Aux[CommonOptions, _] =
  Parser.generic
implicit val cliParser: Parser.Aux[CliOptions, _] =
  Parser.generic

The code above calls the materialization entrypoints from caseapp.Parser directly. Type inference and implicit search will figure out the type parameters that Parser.generic needs from the return type we specify explicitly in the cached implicits.


A word of caution when caching implicits: make sure the rhs of the implicit definition doesn’t depend on the implicit you’re caching.

It is common that scalac detects coParser as the candidate for the implicit search on its rhs. This creates a recursive call that causes a null pointer exception at runtime. We can reproduce the issue if we redefine coParser as implicitly[Parser[CommonOptions]] or the[Parser[CommonOptions]].
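
Sketched in code, the dangerous definition looks like this:

// Dangerous: scalac picks coParser itself as the best candidate for
// the search triggered by `implicitly`, producing a self-referential
// val that compiles fine but is null when first dereferenced at runtime.
implicit val coParser: Parser[CommonOptions] =
  implicitly[Parser[CommonOptions]] // don't do this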


Great! Well, let’s check the compile time and flamegraphs now.

#total compile time  : 1 spans, ()19060.196ms
  typer              : 1 spans, ()13625.005ms (71.5%)

Flamegraph after cached implicits

The compile time is roughly 2.5x faster than the instrumented baseline. Not bad for a two-line change. The duration of implicit search accounts for 13 seconds, roughly 95% of typer.

The flamegraph has slimmed down and no longer contains the successful implicit searches for Parser[CommonOptions] and Parser[CliOptions]. However, there seem to be quite a few failed implicit searches that trigger unnecessary macro expansions, which are afterwards discarded because their type doesn’t match the expected type of the implicit search.

caseapp.core.Parser[bloop.cli.CliOptions]{type D = HD} (expanded macros 0)   (278,828 μs, 2.19%)
caseapp.core.Parser[bloop.cli.CommonOptions]{type D = HD} (expanded macros 0)   (189,414 μs, 1.49%)

It looks like the implicit search doesn’t immediately reuse our cached parsers for CommonOptions and CliOptions; it first tries to pass in an explicit refinement type D that fails the search. The error seems to happen when finding an implicit for HListParser (which takes type parameters inferred from its other functional dependencies).

Let’s further debug this by adding -Xlog-implicits to the scalac options of the bloop configuration file.

This is a good moment to try to minimize the problem. -Xlog-implicits will log a lot of failed searches and we want to be able to see through the noise. I minimized the issue here; doing implicitly[Parser[CliOptions]] also reproduces it.

Among all the logs, this is the one that attracts my attention the most.

/data/rw/code/scala/loop/frontend/src/main/scala/bloop/cli/CliParsers.scala:48:37: shapeless.this.Generic.materialize is not a valid implicit value for shapeless.Generic.Aux[bloop.cli.CommonOptions,V] because:
type parameters weren't correctly instantiated outside of the implicit tree: inferred type arguments [String :: java.io.PrintStream :: java.io.InputStream :: java.io.PrintStream :: java.io.PrintStream :: java.io.PrintStream :: bloop.cli.CommonOptions.PrettyProperties :: Int :: shapeless.HNil,Nothing] do not conform to method materializeCoproduct's type parameter bounds [V <: shapeless.Coproduct,R <: shapeless.Coproduct]
    caseapp.core.CommandParser.apply[Commands.RawCommand]
                                    ^

The compiler infers R to be Nothing, which of course cannot be a Coproduct, but that doesn’t prevent the macro in materializeCoproduct from expanding and eating up some of our compile time. After all, the implicit search needs to know the exact return type of the macro expansion.

Generic is required by case-app via LabelledGeneric, which is required by HListParser. However, why is materializeCoproduct eligible in this context if all we want is to derive parsers for all the products of Command (e.g. all the subclasses that extend the Command GADT)?

It seems this is bringing us to uncharted territory. We now need to investigate what the Generic macro is doing in the Shapeless codebase.

A tour through Shapeless’s Generic

Generic is a macro that derives the generic representation of a given type. A case class Foo(i: Int, s: String) is a product that aggregates the Int and String types, whereas a sealed trait Bar is a coproduct: a value of Bar is exactly one of its subclasses.

The source code of Generic has two implicit candidates that materialize the instance depending on whether the candidate type is a Product or a Coproduct: materializeProduct and materializeCoproduct.

The problem of incorrectly instantiated type arguments we saw before seems specific to the way the compiler carries out the implicit search. Fixing it most likely requires changes to the implicit search algorithm, as a similar Scala compiler issue did. I tried porting these changes to 2.12.x and using -Xsource:2.13, but the failed macro expansions didn’t go away.

So we need to find a way to fix this in userspace if we want to make the logs disappear. The root of the issue is that both materializeProduct and materializeCoproduct are candidates of the implicit search and both are tried. The compiler considers both eligible even though materializeCoproduct should be discarded. As this isn’t the case, the compiler then forces the expansion of all candidates to check for ambiguous implicits in the same scope.

Let’s try a trick. Let’s move the definition of materializeCoproduct to a trait of low-priority implicits that the Generic companion extends. This way, materializeProduct (the most common materializer) will always be the first one to be tried.

Only if that search fails will the implicit search try materializeCoproduct in the lower-priority scope, i.e. the superclasses of the Generic companion object.
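
A minimal sketch of the pattern, with the macro signatures reduced to stubs:

trait Generic[T] { type Repr }

trait LowPriorityGeneric {
  // Lower priority: inherited members are only tried if nothing
  // in the companion object itself matches.
  implicit def materializeCoproduct[T, R]: Generic[T] { type Repr = R } = ???
}

object Generic extends LowPriorityGeneric {
  // Companion members take precedence over inherited ones, so the
  // common product case is always expanded first.
  implicit def materializeProduct[T, R]: Generic[T] { type Repr = R } = ???
}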

After making the change in the Shapeless codebase, we run coreJVM/package in the Shapeless build and replace the shapeless 2.3.3 jar in the classpath with the one we just created with package. We also remove -Xlog-implicits and compile.

#total compile time  : 1 spans, ()16869.585ms
  typer              : 1 spans, ()13011.067ms (77.1%)
#implicit searches          : 13515
  #plausibly compatible     : 15415 (114.1%)
  #matching                 : 15415 (114.1%)
  #typed                    : 15381 (113.8%)
  #found                    : 8082 (59.8%)
  #implicit improves tests  : 3673 (27.2%)
  #implicit improves cached : 2614 (19.3%)
  #implicit inscope hits    : 348 (2.6%)
  #implicit oftype hits     : 7400 (54.8%)
  from macros               : 12851 (95.1%)
time spent in implicits   : 13515 spans, ()12409.099ms (95.4%)
  successful in scope     : 348 spans, ()78.224ms (0.6%)
  failed in scope         : 13167 spans, ()1363.491ms (10.5%)
  successful of type      : 7400 spans, ()12287.668ms (94.4%)
  failed of type          : 5767 spans, ()8322.049ms (64.0%)
  assembling parts        : 8033 spans, ()237.456ms (1.8%)
  matchesPT               : 79854 spans, ()566.231ms (4.4%)
time spent in macroExpand : 17175 spans, ()11974.695ms (92.0%)

Implicit flamegraph after shapeless change

The change had a mild positive effect: we gained two seconds. It removed the log we saw before and some of the failed implicit searches from the flamegraph, but most of the other ones still persist.

What is really going on? Our modification fixed the unnecessary expansion for Generic, but there seems to be a more fundamental issue at play.

The Strict-Lazy macro doesn’t like the Aux pattern

It took me a while to find out what was happening, and I couldn’t come up with a fix in the compiler (where I think the real issue is, though that’s still to be determined). However, I did come up with a fix on the library side.


After looking at the new output of -Xlog-implicits, I realized that the compiler must be finding a mismatch between the refinement types inferred prior to the search and the ones materialized after the expansion.

The Aux pattern, a common technique used when declaring typeclasses, relies heavily on refinement types, and all the failed implicit searches seem to be related in some way or another to the Aux pattern of either HListParser or Strict.
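
As a refresher, here is the Aux pattern in a nutshell (the names echo case-app’s Parser, but the definitions are simplified):

trait Parser[T] {
  type D // the type of the default values for T
}

object Parser {
  // Aux exposes the type member D as a type parameter, so that
  // implicit search can constrain and infer it.
  type Aux[T, D0] = Parser[T] { type D = D0 }
}

// Callers can now demand (or propagate) a specific refinement:
def withDefaults[T, D](implicit p: Parser.Aux[T, D]): Parser.Aux[T, D] = p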

We can have a look at the definition of Parser again, which requires the materialization of HListParser. I intuited that the Strict macro may be doing something weird under the hood and causing the type mismatch.

I expanded and pretty-printed some of the implicit logs I found:

shapeless.Strict[
  caseapp.core.HListParser.Aux[
    shapeless.labelled.FieldType[Symbol with shapeless.tag.Tagged[String("workingDirectory")],String] :: java.io.PrintStream with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("out")],java.io.PrintStream] :: java.io.InputStream with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("in")],java.io.InputStream] :: java.io.PrintStream with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("err")],java.io.PrintStream] :: java.io.PrintStream with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("ngout")],java.io.PrintStream] :: java.io.PrintStream with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("ngerr")],java.io.PrintStream] :: shapeless.HNil,
    Option[String] :: Option[java.io.PrintStream] :: Option[java.io.InputStream] :: Option[java.io.PrintStream] :: Option[java.io.PrintStream] :: Option[java.io.PrintStream] :: shapeless.HNil,
    List[caseapp.Name] :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: shapeless.HNil,
    Option[caseapp.ValueDescription] :: None.type :: None.type :: None.type :: None.type :: None.type :: shapeless.HNil,
    Option[caseapp.HelpMessage] :: None.type :: None.type :: None.type :: None.type :: None.type :: shapeless.HNil,
    Option[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: shapeless.HNil,
    None.type :: None.type :: None.type :: None.type :: None.type :: None.type :: shapeless.HNil,
    Option[String] :: this.P
  ]
]

does not match expected type

shapeless.Strict[
  caseapp.core.HListParser.Aux[
    shapeless.labelled.FieldType[Symbol @@ String("workingDirectory"),String] :: shapeless.labelled.FieldType[Symbol @@ String("out"),java.io.PrintStream] :: shapeless.labelled.FieldType[Symbol @@ String("in"),java.io.InputStream] :: shapeless.labelled.FieldType[Symbol @@ String("err"),java.io.PrintStream] :: shapeless.labelled.FieldType[Symbol @@ String("ngout"),java.io.PrintStream] :: shapeless.labelled.FieldType[Symbol @@ String("ngerr"),java.io.PrintStream] :: shapeless.ops.hlist.ZipWithKeys.hnilZipWithKeys.Out,
    Option[String] :: Option[java.io.PrintStream] :: Option[java.io.InputStream] :: Option[java.io.PrintStream] :: Option[java.io.PrintStream] :: Option[java.io.PrintStream] :: shapeless.HNil,
    scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: scala.collection.immutable.Nil.type :: shapeless.HNil,
    None.type :: None.type :: None.type :: None.type :: None.type :: None.type :: shapeless.HNil,
    None.type :: None.type :: None.type :: None.type :: None.type :: None.type :: shapeless.HNil,
    Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: Some[caseapp.Hidden] :: shapeless.HNil,
    None.type :: None.type :: None.type :: None.type :: None.type :: None.type :: shapeless.HNil,
    HD
  ]
]

We can also inspect the code generated by the macro expansions using -P:scalac-profiling:print-search-result:_ and -Ymacro-debug-lite/-Ymacro-debug-verbose (which dump all macro-related logs). That extra inspection gave me some hints.

The issue seems to be in the refinement of HListParser. In the previous log the last type parameter of HListParser.Aux (the refinement type) was HD, an abstract type used here, and the returned refinement type from the macro was Option[String] :: this.P.

We can try to debug and expand all type parameters, see what we get and continue the exploration from there. But whenever we find such a mysterious open-ended error, it’s difficult to pinpoint what the real problem and fix should be.

I had a gut feeling that the Strict macro was interacting weirdly with the Aux pattern and followed that hint. The way to test this hypothesis is to go to the definition of Parser and remove the Strict wrapping HListParser, run ++2.12.6 coreJVM/package in the case-app sbt build, and replace the case-app core classpath entry we’re using in frontend.json with the new jar. Afterwards we compile.

This change may cause errors, since the use of Strict and Lazy disables the implicit divergence checks of the compiler, which can give false positives when working with Shapeless data structures (this is the short story).

/data/rw/code/scala/loop/frontend/src/main/scala/bloop/Bloop.scala:22:22:could not find implicit value for parameter parser: caseapp.Parser[bloop.cli.CliOptions]
object Bloop extends CaseApp[CliOptions] {
                     ^
/data/rw/code/scala/loop/frontend/src/main/scala/bloop/Bloop.scala:22:22:not enough arguments for constructor CaseApp: (implicit parser: caseapp.Parser[bloop.cli.CliOptions], implicit messages: caseapp.core.Messages[bloop.cli.CliOptions]) caseapp.CaseApp[bloop.cli.CliOptions].
Unspecified value parameters parser, messages.
object Bloop extends CaseApp[CliOptions] {
                     ^

The error can be fixed by importing coParser and cliParser from CliParsers in the Bloop.scala source file. But doing so would change our baseline (because we’d be caching Parser[CliOptions] in another call-site than our initial CliParsers). So let’s remove the new case-app classpath entry, compile with the old case-app, and then compile again with the changed version.

#total compile time  : 1 spans, ()15972.609ms
  typer              : 1 spans, ()11360.512ms (71.1%)

New flamegraph baseline

The new caching only shaves around 600ms off compile times. Now let’s compile with our new case-app.

#total compile time  : 1 spans, ()7432.332ms
  typer              : 1 spans, ()5074.836ms (68.3%)

Flamegraph after case-app change

Bingo! Most of the time-consuming failed implicit searches are gone and the compilation time has halved. Our hypothesis is confirmed: the Strict macro is doing something suspicious.

We could try to find out what that is, but that would require us to investigate how the Strict macro works and spot why it doesn’t behave correctly.

We’re short of time, so our best call is to file a ticket and let others more experienced with the codebase have a look at it. If we’re lucky, someone will fix this issue upstream soon and we’ll benefit from this speed up when we upgrade.

After discussing this issue with the author of Shapeless, Miles Sabin, we both agreed that the Strict/Lazy macro is not handling refinement types correctly and that this performance penalty is a bug. It will most likely be fixed in a version of Shapeless after 2.3.3 for all its users. Some of these performance implications will also go away with Scala 2.13, which adds by-name implicits to the compiler.
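
Roughly, a by-name implicit plays the role of shapeless.Lazy without the macro; here is a Scala 2.13 sketch reusing the dummy Encoder from earlier:

sealed trait Tree
case class Node(left: Tree, right: Tree) extends Tree
case object Leaf extends Tree

// Scala 2.13: the by-name implicit parameter breaks the cycle when
// deriving an instance for a recursive GADT, with no Lazy needed.
implicit def treeEncoder(implicit rec: => Encoder[Tree]): Encoder[Tree] =
  new Encoder[Tree] {}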

Deduplicating more expansions

There are still too many repeated towers of implicits in our flamegraph. CommandsParser and CommandsMessages are deriving Parsers for every type in our Command GADT twice. Let’s cache those too.

implicit val autocompleteParser: Parser[Commands.Autocomplete] = Parser.generic
implicit val aboutParser: Parser[Commands.About] = Parser.generic
implicit val bspParser: Parser[Commands.Bsp] = Parser.generic
implicit val cleanParser: Parser[Commands.Clean] = Parser.generic
implicit val compileParser: Parser[Commands.Compile] = Parser.generic
implicit val configureParser: Parser[Commands.Configure] = Parser.generic
implicit val helpParser: Parser[Commands.Help] = Parser.generic
implicit val projectsParser: Parser[Commands.Projects] = Parser.generic
implicit val runParser: Parser[Commands.Run] = Parser.generic
implicit val testParser: Parser[Commands.Test] = Parser.generic

#total compile time  : 1 spans, ()10154.603ms
  typer              : 1 spans, ()7925.156ms (78.0%)

Flamegraph after more cached parsers

We’re heading in the right direction, but there doesn’t seem to be any straightforward way of decreasing the compilation time further.

The flamegraph may not make it obvious how many repeated expansions happen in every branch, so let’s have a look at the data emitted by -P:scalac-profiling:show-profiles.

The “Macro expansions by type” and “Implicit searches by type” sections tell us how many repeated macros and implicit searches we have per type.

For example, let’s look at the most important entries from the “Implicit searches by type” section.

  "caseapp.util.Implicit[caseapp.core.Default[String] :: shapeless.HNil]" -> 20,
  "caseapp.core.Default[String] :: shapeless.HNil" -> 20,
  "caseapp.util.Implicit[Some[caseapp.core.Default[String]] :+: None.type :+: shapeless.CNil]" -> 20,
  "caseapp.core.Default[String]" -> 20,
  "Some[caseapp.core.Default[String]] :+: None.type :+: shapeless.CNil" -> 20,
  "caseapp.util.Implicit[Option[caseapp.core.Default[String]]]" -> 20,
  "caseapp.core.ArgParser[String]" -> 20,
  "caseapp.util.Implicit[Some[caseapp.core.Default[String]]]" -> 20,
  "Option[caseapp.core.Default[String]]" -> 20,
  "shapeless.HNil" -> 21,
  "Some[caseapp.core.Default[Boolean]] :+: None.type :+: shapeless.CNil" -> 35,
  "caseapp.core.Default[Boolean]" -> 35,
  "caseapp.util.Implicit[Some[caseapp.core.Default[Boolean]] :+: None.type :+: shapeless.CNil]" -> 35,
  "caseapp.util.Implicit[caseapp.core.Default[Boolean]]" -> 35,
  "shapeless.Strict[caseapp.core.ArgParser[Boolean]]" -> 35,
  "caseapp.core.Default[Boolean] :: shapeless.HNil" -> 35,
  "caseapp.util.Implicit[Some[caseapp.core.Default[Boolean]]]" -> 35,
  "caseapp.util.Implicit[Option[caseapp.core.Default[Boolean]]]" -> 35,
  "Some[caseapp.core.Default[Boolean]]" -> 35,
  "Option[caseapp.core.Default[Boolean]]" -> 35,
  "caseapp.util.Implicit[caseapp.core.Default[Boolean] :: shapeless.HNil]" -> 35,
  "caseapp.util.Implicit[Option[caseapp.core.Default[java.io.PrintStream]]]" -> 56,
  "caseapp.util.Implicit[Some[caseapp.core.Default[java.io.PrintStream]] :+: None.type :+: shapeless.CNil]" -> 56,
  "Some[caseapp.core.Default[java.io.PrintStream]] :+: None.type :+: shapeless.CNil" -> 56,
  "shapeless.Strict[caseapp.core.ArgParser[java.io.PrintStream]]" -> 56,
  "Option[caseapp.core.Default[java.io.PrintStream]]" -> 56,
  "caseapp.core.Default[java.io.PrintStream]" -> 56,
  "caseapp.util.Implicit[caseapp.core.Default[java.io.PrintStream]]" -> 56,
  "Some[caseapp.core.Default[java.io.PrintStream]]" -> 56,
  "caseapp.util.Implicit[caseapp.core.Default[java.io.PrintStream] :: shapeless.HNil]" -> 56,
  "caseapp.core.Default[java.io.PrintStream] :: shapeless.HNil" -> 56,
  "caseapp.util.Implicit[Some[caseapp.core.Default[java.io.PrintStream]]]" -> 56,
  "None.type" -> 153,
  "caseapp.util.Implicit[None.type]" -> 153,
  "caseapp.util.Implicit[None.type :+: shapeless.CNil]" -> 153,
  "caseapp.util.Implicit[shapeless.HNil]" -> 185

Let’s cache some more implicits from here, especially the ones that show up as most expensive.

import shapeless.{HNil, CNil, :+:, ::, Coproduct}
implicit val implicitHNil: Implicit[HNil] = Implicit.hnil
implicit val implicitNone: Implicit[None.type] = Implicit.instance(None)
implicit val implicitNoneCnil: Implicit[None.type :+: CNil] =
  Implicit.instance(Coproduct(None))

implicit val implicitOptionDefaultString: Implicit[Option[Default[String]]] =
  Implicit.instance(Some(caseapp.core.Defaults.string))

implicit val implicitOptionDefaultInt: Implicit[Option[Default[Int]]] =
  Implicit.instance(Some(caseapp.core.Defaults.int))

implicit val implicitOptionDefaultBoolean: Implicit[Option[Default[Boolean]]] =
  Implicit.instance(Some(caseapp.core.Defaults.boolean))

implicit val implicitDefaultBoolean: Implicit[Default[Boolean]] =
  Implicit.instance(caseapp.core.Defaults.boolean)

implicit val implicitOptionDefaultOptionPath: Implicit[Option[Default[Option[Path]]]] =
  Implicit.instance(None)

implicit val implicitOptionDefaultPrintStream: Implicit[Option[Default[PrintStream]]] =
  Implicit.instance(Some(Default.instance[PrintStream](System.out)))

implicit val implicitOptionDefaultInputStream: Implicit[Option[Default[InputStream]]] =
  Implicit.instance(Some(Default.instance[InputStream](System.in)))

implicit val labelledGenericCommonOptions: LabelledGeneric.Aux[CommonOptions, _] =
  LabelledGeneric.materializeProduct
implicit val labelledGenericCliOptions: LabelledGeneric.Aux[CliOptions, _] =
  LabelledGeneric.materializeProduct

And now let’s check the compilation time.

#total compile time  : 1 spans, ()7285.771ms
  typer              : 1 spans, ()5435.895ms (74.6%)

Flamegraph after all cached implicits

Great, that reduced compile times by 3 more seconds. You can continue the same strategy over and over. This is where we stop; we have already cached the most expensive implicits, so other additions won’t have such a big impact.

You get the general idea of the process: read the profiles and optimize according to what the data shows and your understanding of the codebase.

We have only one assignment left: removing all those failed implicit searches from our flamegraph. We saw that wrapping the implicit in Strict was problematic; can we do something about it on our end instead of waiting for a fix upstream?

The answer is yes. Strict or Lazy are only required when:

  1. We have recursive GADTs.
  2. We use an automatic typeclass derivation scheme that increases the number of type parameters to be determined by implicit search and thus “diverge” the search.

Good, we don’t have a recursive GADT (and it’s unlikely you will in a CLI application). But we do have an automatic typeclass derivation process that meets the previous criteria. We experienced the implicit search failure before when we removed the Strict typeclass from case-app and Bloop.scala failed to compile.

Such automatic typeclass derivation, though, doesn’t diverge once we’ve cached the implicits! The divergence only happens when we derive typeclasses for types transitively (and the transitive typeclasses don’t exist). For example, if the parser for CliOptions isn’t present in scope and we derive a Parser[Compile], which has CliOptions as a parameter type, the search will fail.

So we can remove the uses of Strict from case-app. Once we cache these intermediary derivations, the divergence won’t happen.

Aside from the change proposed in the previous section, we also remove the appearance of Strict in HListParser.hconsRecursive. These changes can be found in this case-app diff.

It’s worth noting explicitly what we’re doing here: we’re trading ergonomics for compile times. Whenever we add a parameter that doesn’t have a cached Parser in CliParsers, the implicit search will fail with a “could not find implicit value” error.

This is a judgement call. Personally, I prefer faster compile times over ergonomics, even more so when the affected part of the code (the CLI) rarely changes, as is the case here. Let’s try out the new change!

#total compile time  : 1 spans, ()4511.197ms
  typer              : 1 spans, ()2887.031ms (64.0%)

Flamegraph after caching + case-app changes

Great! We now have a compile time under 5 seconds for an application that still uses a powerful derivation mechanism, is easy to maintain and is over 6000 LOC.

In the flamegraph, we observe that we have removed the most expensive failed implicit searches, while some negligible searches remain. We can live with those.

The duration of the typechecker is back to normal levels: 64%, a reasonable value for the codebase we’re working on. We now need to remove the instrumentation overhead to see the final speedup we get.

Getting the final results

Let’s remove the scalac-profiling plugin and all its flags from frontend.json. Run compilation two or three times to get stable results.

*** Cumulative timers for phases
#total compile time           : 1 spans, ()4098.49ms
  parser                      : 1 spans, ()18.775ms (0.5%)
  namer                       : 1 spans, ()12.408ms (0.3%)
  packageobjects              : 1 spans, ()0.074ms (0.0%)
  typer                       : 1 spans, ()2612.532ms (63.7%)
  patmat                      : 1 spans, ()286.802ms (7.0%)
  superaccessors              : 1 spans, ()12.026ms (0.3%)
  extmethods                  : 1 spans, ()3.201ms (0.1%)
  pickler                     : 1 spans, ()6.389ms (0.2%)
  xsbt-api                    : 1 spans, ()75.191ms (1.8%)
  xsbt-dependency             : 1 spans, ()54.559ms (1.3%)
  refchecks                   : 1 spans, ()122.441ms (3.0%)
  uncurry                     : 1 spans, ()130.194ms (3.2%)
  fields                      : 1 spans, ()57.397ms (1.4%)
  tailcalls                   : 1 spans, ()13.512ms (0.3%)
  specialize                  : 1 spans, ()105.903ms (2.6%)
  explicitouter               : 1 spans, ()26.837ms (0.7%)
  erasure                     : 1 spans, ()193.214ms (4.7%)
  posterasure                 : 1 spans, ()16.83ms (0.4%)
  lambdalift                  : 1 spans, ()41.906ms (1.0%)
  constructors                : 1 spans, ()11.108ms (0.3%)
  flatten                     : 1 spans, ()14.379ms (0.4%)
  mixin                       : 1 spans, ()15.936ms (0.4%)
  cleanup                     : 1 spans, ()11.516ms (0.3%)
  delambdafy                  : 1 spans, ()26.534ms (0.6%)
  jvm                         : 1 spans, ()224.115ms (5.5%)
  xsbt-analyzer               : 1 spans, ()1.376ms (0.0%)
Done compiling.

It is safe to say it out loud now: we have reduced compilation time from 32.5 seconds to 4 seconds. That’s an 8x reduction in our compile time.

A great result, considering that we’ve only modified around 30 lines of code in Bloop.

Conclusion

Shapeless is a great library that enables use cases that were previously too difficult for the majority of Scala developers. These use cases save a lot of boilerplate.

Shapeless has relieved these users of learning macros and getting familiar with the internals of the compiler to do both basic generic and advanced typelevel programming in Scala.

However, the techniques used in Shapeless can cause slow compilation times and may give the impression that the Scala compiler is terribly slow. These techniques are not specific to Shapeless and can appear in other libraries that use a lot of implicits and macros.

In all these use cases, the slowness is most likely caused by an unintentional misuse of the APIs provided by these frameworks. In this guide, we have tried to identify what those issues are and how we can get the best out of Shapeless and the compiler without compromising our productivity.

We have learned that automatic typeclass derivation, while powerful and user-friendly, is likely to materialize implicits for the same type many times over.

We have used a new Scala Center tool (scalac-profiling) to profile implicits and macros to reduce the compile times of Bloop by 8x.

Finally, we have become a little more familiar with the way automatic typeclass derivation interacts with macro expansion and implicit search. It is generally agreed that we need to find a better way to bake typeclass derivation into the language to alleviate some of the pitfalls described here.

There’s some activity in this area. Adriaan opened a ticket about it several months ago, and Miles is porting the heavy machinery from Shapeless into the compiler proper (like by-name implicits). I applaud these efforts.

I believe we still need to find solutions to some of the fundamental problems of implicit search and macros. In particular, we should be more aggressive in caching macro-generated trees and bake into the compiler all the knowledge required to invalidate that caching depending on the kind of macro and call-site.

There’s a bright future ahead of us and we are working hard to get there.

In the meantime, this blog post aims to provide all the data you need to alleviate the compile times of code that leverages automatic typeclass derivation. I hope it helps make your team more productive with Scala.