
Leveraging sbt remote caching on a big modular monolith

Sébastien Boulet
Teads Engineering
10 min read · Mar 15, 2024


Constructing a massive modular monolith in Scala comes with its own set of intriguing challenges. In this article, we review how sbt remote caching emerges as a game-changer, enhancing the development experience and making for a more efficient workflow.

Context: a modular monolith

More than 5 years ago, we decided to rationalize the architecture of our primary database access. The majority of our reference data is stored across a few MySQL clusters. Although there was an existing API in place, it contained little or no business logic, and many services bypassed it by interacting directly with the databases. This pattern, known as a “shared database”, led to several issues:

  • business logic was dispersed across clients rather than centralized.
  • evolving the database schema was hard, if not impossible.
  • dependencies between services were not explicit.

In response to these challenges, we introduced a new API to abstract the databases and consolidate the business logic. Services are required to pass through this API to reach the database.

We build our API within a mono-repository and package it as a monolith. The monolith comprises several business domains, each divided into an API module and an implementation module. Domains communicate through their APIs, so the implementation modules remain loosely coupled. A core team implements the API framework, which provides building blocks for the domains.

The API modules expose gRPC (Protocol Buffers) endpoints transpiled to Scala, while the implementation modules are also implemented in Scala. Like any Scala project at Teads, the API is built using sbt.

sbt is a build tool for Scala (and Java) and is the most widely used tool in the Scala ecosystem. It can be easily extended and supports parallel task execution.
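
As an illustration, here is a hypothetical sketch of what such a module layout can look like in a build.sbt (the domain names are made up, not our actual modules):

// Each domain exposes an api project and keeps its implementation private;
// domains depend on each other only through their APIs.
lazy val apiFramework = project.in(file("framework"))

lazy val campaignApi = project.in(file("campaign/api"))
  .dependsOn(apiFramework)

lazy val advertiserApi = project.in(file("advertiser/api"))
  .dependsOn(apiFramework)

lazy val campaignImpl = project.in(file("campaign/impl"))
  .dependsOn(campaignApi, advertiserApi) // cross-domain access goes through the API

lazy val advertiserImpl = project.in(file("advertiser/impl"))
  .dependsOn(advertiserApi)

// the monolith aggregates every module into a single build
lazy val monolith = project.in(file("."))
  .aggregate(apiFramework, campaignApi, campaignImpl, advertiserApi, advertiserImpl)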

A very active repository

The API repository is continuously updated:

  • several deployments per day
  • more than one thousand pull requests per year
  • hundreds of contributors

As of today, the API has become rather big:

  • more than half a million lines of code
  • about 10000 automated tests

For several years, optimizing build performance has been one of the primary challenges in this repository.

To decrease the duration of both Continuous Integration (CI) jobs and local builds, we leverage sbt remote caching. We are going to explore how sbt remote caching works and how we have integrated it into our build to avoid recompilation. But that's not all: we have extended this caching mechanism to also cache Scalafix processing and to target the automated tests to execute.

What’s sbt remote caching

In 2020, sbt 1.4.0 introduced a new experimental feature: remote caching (cached compilation).

The idea is for a team of developers and/or a continuous integration (CI) system to share build outputs. If the build is repeatable, the output from one machine can be reused by another machine, which can make the build significantly faster.

Based on Zinc, the incremental compiler

Remote caching relies on Zinc, the Scala incremental compiler. Zinc is a major, yet not widely known, tool of the Scala ecosystem. It enhances scalac by compiling only what has changed since the previous compilation.

When you change a source file, Zinc analyzes the structure of your code and recompiles only the source files affected by your change. The result should be identical to the output of a clean compile.

Zinc stores its analysis on the disk (/target/scala-2.13/zinc/inc_compile_2.13.zip) to track changes since the previous analysis.

CLI

sbt remote caching exposes two main sbt tasks to the end user:

  • pushRemoteCache to push the remote cache artifact to a repository
  • pullRemoteCache to pull the remote cache artifact from the repository

and a setting key:

  • pushRemoteCacheTo to specify the repository

To see the remote caching in action:

  • first, run sbt compile followed by sbt pushRemoteCache on one machine
  • second, run sbt pullRemoteCache followed by sbt compile on another machine

You can experiment on your machine by using a local repository:

ThisBuild / pushRemoteCacheTo := Some(MavenCache("local-cache", file("/tmp/remote-cache")))

The remote repository

The remote cache repository is a shared key-value cache. The values are the remote cache artifacts. A key identifies each artifact.

Building and publishing a remote cache artifact to the repository works similarly to publishing a library:

  • first, sbt builds the remote cache artifact: a jar file containing the compiled code (*.class) and the incremental compiler (Zinc) analysis. Hence, when the artifact is fetched, Zinc can resume its work as if it had compiled the code locally.
  • second, sbt creates the key for storing (and later retrieving) the artifact in the cache. The key is a library coordinate where the project version (0.0.0) and the remote cache id (5e94c5faaaef95ea) compose the revision:
groupID % artifactID % revision % configuration
"tv.teads" % "api-commons" % "0.0.0-5e94c5faaaef95ea" "cached-compile"
  • last, sbt pushes the artifact to the remote repository at the given coordinates (the key).

The remote cache id

Different source files produce different build outputs, and therefore distinct remote cache artifacts. sbt generates an identifier (remoteCacheId) for each artifact. This identifier is a combined hash of, among other things, the sources (*.scala) and the library dependencies. It means that any change in the sources or any library update will yield a different artifact.

Our remote cache setup

Push from CI, Pull from anywhere

For our API, we use Nexus as the remote cache storage. Nexus is a repository manager that allows us to access and distribute artifacts (libraries) across the company. sbt remote caching relies on the Maven publishing and resolution mechanism, which is why we reused our already-in-place Nexus.

Our continuous integration (Jenkins) checks out the repository source code from GitHub, then compiles the code, and finally pushes remote cache artifacts to Nexus. These remote cache artifacts are then automatically pulled by subsequent builds on Jenkins or by engineers on their machines.
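
A minimal sketch of this setup in the build (the repository URL and credentials file below are hypothetical; depending on the sbt version, the repository may also need to be listed in remoteCacheResolvers for pulling):

// publish remote cache artifacts to Nexus
ThisBuild / pushRemoteCacheTo := Some(
  "nexus-remote-cache" at "https://nexus.example.com/repository/sbt-remote-cache/"
)
// resolver used by pullRemoteCache to fetch the artifacts
ThisBuild / remoteCacheResolvers := Seq(
  "nexus-remote-cache" at "https://nexus.example.com/repository/sbt-remote-cache/"
)
// credentials for publishing to Nexus (path is illustrative)
ThisBuild / credentials += Credentials(Path.userHome / ".sbt" / ".credentials")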

Pulling the remote cache automatically

In our build, pullRemoteCache is automatically called before compilation. It's a very neat addition to the developer experience: we don't have to call pullRemoteCache explicitly, and we don't need to understand how it works; it just happens automatically.

To do that, we just added dependencies between the tasks consuming the class directory (where the remote cache is extracted) and pullRemoteCache.

compileIncremental := compileIncremental.dependsOn(pullRemoteCache).value, 
products := products.dependsOn(pullRemoteCache).value,
copyResources := copyResources.dependsOn(pullRemoteCache).value,

We are aware that this wiring could potentially break after an sbt upgrade, but it has been in place for over 2 years and has proven to be effective.

Refining the remote cache id

To compute the remote cache id, sbt takes into account only a subset of the parameters affecting the compilation. As a result, even with a cache hit, Zinc might still need to compile.

We have modified the computation to build an exact remote cache id by also hashing the remote cache ids of internal dependencies. As a consequence, more remote cache artifacts are published, and in case of a cache hit, Zinc has nothing to do (except checking that it has nothing to do).

The topology of our API makes this improvement very effective. Indeed, the majority of contributions occur in a single domain, and often only in the implementation project of the domain.
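
A simplified sketch of the idea (this is not our exact code, and the scoping of remoteCacheId may differ between sbt versions):

import java.security.MessageDigest

// illustrative hash helper for combining ids
def sha256Hex(s: String): String =
  MessageDigest
    .getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
    .take(16)

Compile / remoteCacheId := {
  // sbt's default id, a hash of the sources and library dependencies
  val ownId = (Compile / remoteCacheId).value
  // the ids of the internal projects this project depends on
  val upstreamIds = (Compile / remoteCacheId)
    .all(ScopeFilter(inDependencies(ThisProject, includeRoot = false)))
    .value
  sha256Hex((ownId +: upstreamIds.sorted).mkString(":"))
}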

Having more precise cache keys leads to more cache misses. An improvement would be to fall back on a less precise key to avoid compiling from scratch in case of a cache miss. sbt provides a task named remoteCacheIdCandidates that could fit this purpose.

Refining the remote cache id has been particularly challenging. It was difficult to identify the appropriate parameters for hashing. Troubleshooting a cache miss on a specific project or understanding why certain classes are recompiled despite a cache hit has also been tough. Additionally, the remote cache provided by sbt is experimental, and we have encountered or reported several bugs (sbt#6027) or limitations (sbt#6286, sbt#6298, sbt#6312).

Extending the remote cache

Avoiding compilation is a substantial improvement. However, other tasks are time-consuming in our build. We have provided a setting key allowing other tasks to contribute to the remote cache:

val cacheGenerators = settingKey[Seq[Task[Seq[File]]]](
  "List of tasks that generate cache files to include in remote cache artifacts."
)

This setting, cacheGenerators, lists tasks producing cache files that are automatically included in a specific folder of the remote cache artifact. When sbt pulls the remote cache artifact, these cache files are extracted and moved back to their expected location. The mechanism also transforms the absolute paths present in these files into relative paths (and back again) so that the cache files are machine-independent.
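
A hedged sketch of that path rewriting, assuming text-based cache files (our real implementation is more involved):

// before packaging: replace the absolute build root by a placeholder
// after pulling: resolve the placeholder against the local build root
val BuildRootPlaceholder = "${BUILD_ROOT}"

def relativizePaths(cacheFile: File, buildRoot: File): Unit =
  IO.write(cacheFile, IO.read(cacheFile).replace(buildRoot.getAbsolutePath, BuildRootPlaceholder))

def resolvePaths(cacheFile: File, buildRoot: File): Unit =
  IO.write(cacheFile, IO.read(cacheFile).replace(BuildRootPlaceholder, buildRoot.getAbsolutePath))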

This mechanism allows us to cache two expensive tasks: scalafix and test.

Including the scalafix cache in the remote cache

Scalafix is a “Refactoring and linting tool for Scala” that we use in our build. scalafix runs multiple rules (built-in and custom) on our source code. A fresh run of scalafix is long. Fortunately, the sbt scalafix plugin comes with an incremental cache: the next execution analyzes only the modified files.

We have included the scalafix incremental cache in the remote cache artifact. First, we contributed upstream to make the scalafix cache machine-independent, and then we enriched the remote cache artifact using the mechanism described above:

cacheGenerators += Def.task {
  val scalafixCacheDir = (scalafix / streams).value.cacheDirectory
  val scalafixCacheFiles =
    (scalafixCacheDir ** ("*" -- "out" -- "err")).filter(_.isFile).get()
  scalafixCacheFiles
}.dependsOn(scalafixCheckAll).taskValue

The cache generator task depends on scalafixCheckAll to ensure that scalafix has been called and to avoid race conditions.

Running only the tests of the impacted projects

Running tests is by far the most expensive task of our build. We take advantage of the remote cache to run only the tests of the impacted projects on the CI. To do that, we have refined the remote cache id to also hash every parameter that can affect the test results, for example the project resources (unlike compilation, test execution can be affected by a resource change).
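
Following the same pattern as the earlier sketch (again illustrative, not our exact code):

import java.security.MessageDigest

def sha256HexOf(bytes: Array[Byte]): String =
  MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

Test / remoteCacheId := {
  val ownId = (Test / remoteCacheId).value
  // hash the content of the resource files, sorted for a stable result
  val resourceHashes = ((Compile / resources).value ++ (Test / resources).value)
    .filter(_.isFile)
    .sortBy(_.getAbsolutePath)
    .map(f => sha256HexOf(IO.readBytes(f)))
  sha256HexOf((ownId +: resourceHashes).mkString(":").getBytes("UTF-8")).take(16)
}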

So when there is a cache hit in a project, its tests are skipped.

However, skipping tests requires extra caution. As a safety net, we have decided to use testQuick to skip tests. As with scalafix, we have enriched the remote cache artifact with the testQuick cache file (target/streams/test/test/_global/streams/succeeded_tests):

cacheGenerators += Def
  .task {
    val s = (test / streams).value
    val succeededFile = Defaults.succeededFile(s.cacheDirectory)
    val cacheFiles = if (succeededFile.isFile) Seq(succeededFile) else Seq.empty
    cacheFiles
  }
  .dependsOn(testQuick.toTask(""))
  .taskValue

And the CI build now calls testQuick where it previously called test.

The succeeded_tests file records the last successful run time of each test. When there is a cache hit, this file is extracted from the remote cache, so testQuick only reruns the tests whose files get recompiled after that (it should not be the case, but it's a safety net).

For a while, only pull request builds were using this feature. But after several months, we were confident enough to activate it on the main branch build too. It's been 18 months, and we haven't encountered any issues.

Measuring the impact

In case of a full cache hit, the sbt build takes about 3 minutes 30 seconds. The duration is still a few minutes because not all sbt tasks are cached; for instance, the transpilation of Protocol Buffers to Scala case classes is not. This scenario occurs regularly on the main branch when the merged pull request has already been rebased.

On the other hand, when there is a full cache miss, for example when Scala Steward opens a pull request to bump a library, everything is rebuilt and all the tests are executed. That takes up to 45 minutes. Therefore, a fully cached build saves more than 92% of the build time (1 - 3.5/45 ≈ 0.92).

In real life, what engineers experience often falls somewhere between these two extremes. Build times can vary significantly based on the changes they make.

Another benefit of caching is saving resources: remote caching relies on I/O and gives back CPU and memory to the machine. It reduces the load on our CI (Jenkins) and enhances the experience of engineers on their machines.

What’s next?

Though sbt remote caching has allowed us to maintain acceptable build times, they continue to increase. We’re exploring other potential improvements that could help:

  • improving the observability of our build: currently, we lack visibility, making it difficult to measure improvements and prevent regressions effectively. It would be great to integrate metrics or traces into our observability stack.
  • falling back on a less precise remote cache artifact in case of a cache miss: building from an almost-good state is faster than building from scratch
  • adding aggressive timeouts to fail fast when we cannot retrieve remote cache artifacts promptly, for example when an engineer has a poor internet connection. This was previously managed via the configuration of the internal HTTP client (OkHttp), which was deprecated in sbt 1.7.0, and we have not yet found a replacement.
  • loading only the required domain in sbt on an engineer's machine. Most of the time, engineers work on a single domain, and loading the entire project in sbt (and in IntelliJ) is resource-intensive. We have begun working on a feature that we refer to as “partial build”.
  • splitting the monolith? Despite our efforts, aggregating all the modules into a single artifact takes time. Breaking down the monolith into multiple artifacts and building only those impacted by a change could cut build time. Of course, this comes with its own set of challenges.
  • the upcoming sbt 2.0 release (no date yet) will bring some potential improvements in remote caching. For instance, there will be a mechanism enabling tasks to participate in the cache system, which means that more tasks are likely to be remotely cached. Also, we could make our remote cache setup simpler by removing our custom cacheGenerators.
  • migrating to Bazel is another option. Bazel comes with very neat features out of the box: remote caching like sbt, and remote execution to distribute the build across several workers. Other companies using Scala, like Databricks, have already adopted it. However, migrating our sbt build to Bazel would require a significant effort, which we recently estimated at 6 months. Indeed, while Bazel rules for Scala are available, we quickly faced a bug, highlighting the complexity of our build and the challenges associated with migration. Additionally, to leverage remote execution, we would have to set up a build farm.

Source code

You can find the source code for our remote cache setup in this GitHub gist. It’s not self-contained and it’s tailored to our specific requirements. As such, it cannot be directly integrated into other repositories. Nevertheless, it’s worth sharing.

Big thanks to Brice Jaglin for teaching me sbt and assisting with setting up remote caching. I’d also like to thank Rémi Saissy, Daniel Connelly, and Benjamin Davy for their insightful feedback on the article.
