Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

9 replies

Thu, 2011-10-06, 06:27

daniel

Joined: 2008-08-20,

In order to determine which files need to be selectively recompiled following a modification, SBT needs to compute a directed graph of dependencies between files. In theory, this graph gives SBT the ability to actually distribute the compilation process to remote agents in the case of a multi-file recompile. This would work by separating the dependency graph into connected, separable sub-graphs such that each graph has a file with changes (thus requiring the recompilation of the whole graph). As these are separable graphs, they represent file sets that may be compiled entirely independently, and thus are eligible for simultaneous compilation across a distributed cluster.

Note that in the case of a single file save (the ~compile case), there will be at most one such graph, and thus the compilation would have to be run on a single machine. However, in the case where multiple files have been changed, or a clean compile where the dependency information has been preserved from prior analysis, this could theoretically result in some appreciable gains in compilation times.
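To make the partitioning step concrete, here is a rough sketch in Scala. It is not how SBT works internally; the `deps` map (file to the files it depends on), the object name, and both function names are made up for illustration. It also takes the simplification above at face value: a sub-graph containing any changed file is recompiled as a whole.

```scala
object SeparableSubgraphs {
  type File = String

  /** Groups files into weakly connected components of the dependency graph.
    * `deps` maps a file to the files it depends on (hypothetical input shape).
    */
  def components(deps: Map[File, Set[File]]): List[Set[File]] = {
    val nodes: Set[File] = deps.keySet ++ deps.values.flatten
    val edges: List[(File, File)] =
      deps.toList.flatMap { case (from, tos) => tos.toList.map(to => (from, to)) }

    // Ignore edge direction: two files belong together if either depends on the other.
    val neighbours: Map[File, Set[File]] =
      edges
        .flatMap { case (a, b) => List(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (k, vs) => k -> vs.map(_._2).toSet }

    // Collects every file reachable from the frontier.
    @annotation.tailrec
    def flood(frontier: List[File], seen: Set[File]): Set[File] = frontier match {
      case Nil => seen
      case f :: rest =>
        val next = neighbours.getOrElse(f, Set.empty[File]) -- seen
        flood(next.toList ::: rest, seen ++ next)
    }

    val (found, _) =
      nodes.toList.sorted.foldLeft((List.empty[Set[File]], Set.empty[File])) {
        case ((acc, visited), n) if visited(n) => (acc, visited)
        case ((acc, visited), n) =>
          val comp = flood(List(n), Set(n))
          (comp :: acc, visited ++ comp)
      }
    found.reverse
  }

  /** Keeps only the components containing at least one changed file.
    * Each returned set could, in principle, go to a different machine.
    */
  def compileUnits(deps: Map[File, Set[File]], changed: Set[File]): List[Set[File]] =
    components(deps).filter(c => c.exists(changed))
}
```

With `A.scala` depending on `B.scala` and `C.scala` depending on `D.scala`, changing `B.scala` and `D.scala` yields two independent units, while changing only `B.scala` yields one, which is the single-machine case described above.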

At least that's the theory. I'm not sure how often this multi-file case actually arises. You can also make the argument that, due to natural coding patterns, when it arises all affected files are likely to be in the same connected graph, and thus no distribution is possible. Finally, I don't know enough about SBT's actual dependency analysis to make any real judgements about this, so the whole thing may be rubbish.

Just a thought.

Daniel

Thu, 2011-10-06, 07:07

ichoran

Joined: 2009-08-14,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

Interesting!

But I'd think if one were trying to speed compilation, it would be _much_ better to maintain a compiled AST with a diff feature to update the AST and dependencies explicitly labeled, rather than redoing the full compile. If you make a tiny tweak in a core utility object, the dependency graph is going to say that everything depends, but the AST will probably say that almost nothing cares (or only a few lines do).

--Rex

Thu, 2011-10-06, 08:27

Ismael Juma 2

Joined: 2011-01-22,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

On Thu, Oct 6, 2011 at 6:59 AM, Rex Kerr wrote:
> If you make a tiny tweak in a core utility object, the dependency graph is
> going to say that everything depends

Depends on the tweak actually. For example, if you change something
that doesn't modify the API of the class, SBT >= 0.10 will only
compile the class and nothing else.
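
A toy illustration of the idea in Scala, not SBT's actual implementation: `Api`, `toRecompile`, and the `dependents` map are hypothetical names, and a real API representation would be far richer than a set of signature strings.

```scala
object ApiChangeCheck {
  /** Hypothetical, heavily simplified stand-in for a class's public API:
    * the signatures of its public members, without their bodies.
    */
  final case class Api(signatures: Set[String])

  /** Files to recompile after editing `changed`.
    * `dependents` maps a file to the files that depend on it directly.
    */
  def toRecompile(
      changed: String,
      oldApi: Api,
      newApi: Api,
      dependents: Map[String, Set[String]]
  ): Set[String] =
    if (oldApi == newApi) Set(changed) // body-only edit: no dependent is affected
    else Set(changed) ++ dependents.getOrElse(changed, Set.empty[String])
}
```

In this sketch only direct dependents are added when the API changes; whether to go further would depend on whether their own APIs change in turn.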

Best,
Ismael

Thu, 2011-10-06, 08:37

Ismael Juma 2

Joined: 2011-01-22,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

On Thu, Oct 6, 2011 at 8:18 AM, Ismael Juma wrote:
> Depends on the tweak actually. For example, if you change something
> that doesn't modify the API of the class, SBT >= 0.10 will only
> compile the class and nothing else.

I should say "API" as it takes into account Scala's code generation
strategy to ensure that things like traits are handled correctly.

Best,
Ismael

Thu, 2011-10-06, 13:57

Joshua.Suereth

Joined: 2008-09-02,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

You're not the only one who has wanted this. My idea goes a bit further than this.
Not only could you build in the cloud, but you could cache built objects between developers in a company so only one would have to compile a sub-graph if you made a change in it. Combined with making the 'cloud SBT' aware of your VCS and repository you could ensure that only local changes need be sent into the cloud to reduce network overhead. You still have network overhead of *downloading* the built binaries back from the cloud, and you have some 'join' tasks where these files need to be passed around a network.
It's not a simple task, but I think a lot of us have this dream. It's also something that will require work.
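
A minimal sketch of the caching part in Scala, assuming a sub-graph can be identified by a hash of its source contents. Everything here (`SharedBuildCache`, `key`, `Cache`, `getOrCompile`) is invented for illustration, and the in-memory map stands in for whatever shared store a company would actually use. A real key would also have to cover compiler version, options, and the outputs of upstream sub-graphs.

```scala
import java.security.MessageDigest
import scala.collection.mutable

object SharedBuildCache {
  /** Hex-encoded SHA-256 over file names and contents, in a stable order. */
  def key(sources: Map[String, String]): String = {
    val md = MessageDigest.getInstance("SHA-256")
    sources.toList.sortBy(_._1).foreach { case (name, content) =>
      md.update(name.getBytes("UTF-8"))
      md.update(0.toByte) // separator so name/content boundaries stay unambiguous
      md.update(content.getBytes("UTF-8"))
      md.update(0.toByte)
    }
    md.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  /** In-memory stand-in for a cache shared between developers. */
  final class Cache {
    private val store = mutable.Map.empty[String, Array[Byte]]
    var compilations = 0 // counts cache misses, i.e. real compiles

    /** Returns the cached binary for these sources, compiling only on a miss. */
    def getOrCompile(sources: Map[String, String])(
        compile: Map[String, String] => Array[Byte]
    ): Array[Byte] =
      store.getOrElseUpdate(key(sources), { compilations += 1; compile(sources) })
  }
}
```

Two developers asking for the same unchanged sub-graph would hit the same key, so only the first request pays for the compile; any edit to a file in the sub-graph changes the key and forces a rebuild.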

Thu, 2011-10-06, 14:07

Razvan Cojocaru 3

Joined: 2010-07-28,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

What do we call this? The "object torrent"?

Thanks,
Razvan

Thu, 2011-10-06, 14:27

Ismael Juma 2

Joined: 2011-01-22,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

On Thu, Oct 6, 2011 at 1:56 PM, Josh Suereth wrote:
> It's not a simple task, but I think a lot of us have this dream. It's also
> something that will require work.

Google (as you know) has done a lot of work on this:

http://google-engtools.blogspot.com/2011/06/build-in-cloud-accessing-source-code.html
http://google-engtools.blogspot.com/2011/06/testing-at-speed-and-scale-of-google.html
http://google-engtools.blogspot.com/2011/09/build-in-cloud-distributing-build-steps.html

Best,
Ismael

Thu, 2011-10-06, 14:37

daniel

Joined: 2008-08-20,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

Wow, that's even better! I was thinking that the people who would benefit most from this would be large teams working on the same codebase. Foursquare, for example, would probably save man-weeks per year if they could do this for just their NY office.

Daniel

Thu, 2011-10-06, 14:47

Joshua.Suereth

Joined: 2008-09-02,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

Yep. However, compiling Scala code on their build system was still slower than compiling locally in the general case, partly because of how adding new functionality works. I would still have used SBT for most of my Scala projects, assuming the Java/DLLs were pre-compiled.
I do think we should experiment with distribution, because the right model could be outstanding. Rapture also has a distributed build-y process.

On Thu, Oct 6, 2011 at 9:18 AM, Ismael Juma <ismael@juma.me.uk> wrote:

On Thu, Oct 6, 2011 at 1:56 PM, Josh Suereth <joshua.suereth@gmail.com> wrote:
> It's not a simple task, but I think a lot of us have this dream. It's also
> something that will require work.

Google (as you know) has done a lot of work on this:

http://google-engtools.blogspot.com/2011/06/build-in-cloud-accessing-source-code.html
http://google-engtools.blogspot.com/2011/06/testing-at-speed-and-scale-of-google.html
http://google-engtools.blogspot.com/2011/09/build-in-cloud-distributing-build-steps.html

Best,
Ismael

Thu, 2011-10-06, 17:07

Grey

Joined: 2009-01-03,

Re: Random, Pie-in-the-Sky SBT Idea: Distributed Compilation

While there are always exceptions, it seems any project in the X,000s of source files and/or classes range is really two or more projects pleading to become separately compiled libraries with small surface dependencies defined by a few interfaces/APIs. All of our project components/libraries are in the low 1,000s, if not 100s, of classes/source files, which SBT + scalac handles with (relative) alacrity.
For us anyway, scalac/project compilation speed has never been a huge issue, nor a pain point deemed to be a "drag" on the project development effort.