This page contains a lot of boring pseudo-proofs. Prepare to be smacked-down with some fierce logic and hand-waving. If you disagree with me you can throw down the gauntlet and we can engage in an intellectual mind-fight at the speed of lasers. As with all mind-fights, it is assumed your gauntlet will be a standard Gauntlet of Reasoning +2 or higher.

In most of these proofs, "you" can be you, who is a person taking the place of a generic build system like tup or make. Your adversary is the developer, who makes changes to things and generally makes your life miserable. Your nemesis is Clyde, who is better than you at everything you do.

Why do you need to provide a list of changed files up front?

Suppose you have a set of files that are already built and up-to-date. Now someone goes and changes a file, but doesn't tell you. How do you find out what changed? Some options are:

  1. Stat every file and compare timestamps
  2. Use MD5 or some other hashing algorithm to compare hashes

Make and most of its derivatives use 1): they compare the timestamps on (for example) a .o file and its .c file to see if the .c file needs to be recompiled. Method 2) is used by some other build systems, since timestamps can sometimes be misleading. It's a little slower since you have to read the file contents, but it's more accurate. In either case, imagine you have a million files spread out all over the place, and one of them is changed. Do you want to use either of these two methods to figure out which one changed? And do that *every time* a change is made? No, that would be silly. Both of these options are linear-time algorithms. They may work fine at first, but once your project gets large you have to start taking shortcuts.
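
Just to make the cost concrete, here's a minimal sketch of method 1) in C. It walks the whole tree and stats every single file, no matter how few of them actually changed. The directory and the saved last-build time are made up for the demo:

    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <time.h>

    static time_t last_build;

    /* called once per file in the tree - this is where all the time goes */
    static int check(const char *path, const struct stat *sb, int type,
                     struct FTW *ftw)
    {
        (void)ftw;
        if (type == FTW_F && sb->st_mtime > last_build)
            printf("changed: %s\n", path);
        return 0;
    }

    int main(void)
    {
        /* pretend the last build finished an hour ago */
        last_build = time(NULL) - 3600;
        /* nftw stats every file under "." - O(n) no matter what changed */
        return nftw(".", check, 32, 0);
    }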

Since both of those options suck, let's change our assumption. Instead of assuming we were handed a set of files with one unknown file changed, what if we were handed a set of files, along with a list of changes? Now we might have a million files, along with a note that says "I changed foo.c". You don't have to go trolling through the filesystem looking for changes - you already know what changed. Isn't it fun to skip questions by getting the answer up front?

Of course, someone has to actually generate that list. There are a few ways one could go about this:

  1. The developer could generate it
  2. Programs that change files could generate it (ie: your editor)
  3. A filesystem monitor can watch files for modifications
  4. The filesystem could provide the list of modifications (reading its journal, or something)

Option 1 is listed to make it look like there are many choices, and is obviously dumb. Option 2 is a possibility, but there are many programs that can change the state of the filesystem (vi, touch, rm, git, patch, blah blah blah...). It's kinda unreasonable to have to change them all to support the build system. Option 3 is what tup uses. It's kinda annoying because it has to be running to be useful. Also it has to go through the filesystem once to build up a list of watches. It's a bit like running make in that it stats every file, but that happens only once per reboot. Option 4 would be super-awesome, but I have no idea if that's feasible - I haven't really looked into filesystems. It seems like it could be easy to have a list of changes in the filesystem somewhere, and I could have a pointer for "tup" somewhere in that list, no? Then I could just read the list of changes from the "tup" pointer to the end of the list of changes, and then update my pointer. Pipe-dreamish and infeasible, perhaps, but I think it would be neat.

And lo, the monitor program was born. It watches the filesystem and flags files that were modified, created, or deleted.
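
If you're curious what that looks like, here's a minimal sketch using Linux's inotify API (which is what the monitor uses on Linux). It watches a single directory; the real monitor walks the tree once and adds a watch per directory, plus a lot more bookkeeping:

    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t len, off;
        int fd = inotify_init();

        if (fd < 0)
            return 1;
        /* flag files that are modified, created, or deleted */
        inotify_add_watch(fd, ".", IN_MODIFY | IN_CREATE | IN_DELETE);
        for (;;) {
            len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                return 1;
            /* the kernel hands us the change list - no stats required */
            for (off = 0; off < len; ) {
                struct inotify_event *ev =
                    (struct inotify_event *)(buf + off);
                if (ev->len)
                    printf("changed: %s\n", ev->name);
                off += sizeof(*ev) + ev->len;
            }
        }
    }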

Why do the arrows go up?

Now we have a bunch of files in a filesystem, and separately we have a list of file changes. Let's assume for the time being that the list just contains files that were modified (so we added some stuff to a few .c and .h files). The files are already built, and you have all the dependency information. We will now give the same set of files and the same modification list to you and your nemesis, Clyde. However, the dependency information you each get -- although identical in content -- is stored differently. You get dependencies as you would see them in Makefiles and in dependency files generated by the likes of 'gcc -MD' or 'makedepend'. It is very easy for you to take an object file, like foo.o, and find out what files it depends on (foo.c, foo.h). Clyde, your nemesis who cut you off in traffic that one time and didn't even think twice about it, gets the same information stored the other way around. Clyde can take a file, like foo.h, and find out what files depend on it (foo.o, bar.o). Let us now construct each of your algorithms.

Your information might look like this:

Makefile:
    hello_world: foo.o bar.o
foo.d:
    foo.o: foo.c foo.h
bar.d:
    bar.o: bar.c foo.h

[Graph: the same information drawn as a DAG, with the arrows pointing down from each target to the files it depends on]

Now you take this information and the list of file changes. Suppose the list just contains "foo.h". What do you do? You have to look through the Makefile for foo.h, and foo.d for foo.h, and bar.d for foo.h. And if we were compiling a million other .c files you'd have to look through all of their dependency files too. In fact, there's no way you can avoid looking through the complete set of Makefiles and dependency files (the entire DAG) because of the way the information is stored. That's what we in the biz call an O(n) algorithm.
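
In code form, your side of the contest looks something like this minimal sketch: open every dependency file in the project and search it for the changed file. The list of .d files is hard-coded for illustration - pretend it has a million entries:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* every dependency file in the project... times a million */
        const char *depfiles[] = { "foo.d", "bar.d" };
        char line[4096];
        int i;

        /* O(n): every single file gets opened and read */
        for (i = 0; i < 2; i++) {
            FILE *f = fopen(depfiles[i], "r");
            if (!f)
                continue;
            while (fgets(line, sizeof(line), f)) {
                if (strstr(line, "foo.h"))
                    printf("%s mentions foo.h\n", depfiles[i]);
            }
            fclose(f);
        }
        return 0;
    }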

Meanwhile Clyde is over here with this information:

foo.c:
    foo.o
foo.h:
    foo.o bar.o
bar.c:
    bar.o
foo.o:
    hello_world
bar.o:
    hello_world
hello_world:
    (nothing)

[Graph: the same DAG, with the arrows pointing up from each file to the files that depend on it]

Clyde looks at the list of file changes, and sees foo.h. So he goes to the foo.h column and sees foo.o and bar.o. Then he goes to foo.o and sees hello_world. In hello_world there is nothing, so he continues at bar.o and sees hello_world, which is already visited. Nothing else to look at. Now imagine we gave Clyde a million extra nodes, but this part of the DAG is still intact (ie: the new files are unrelated). Does Clyde have to look at those files at all? No, they won't be listed in the foo.h or *.o or hello_world columns. But if one of them did use foo.h, that object would be listed there and Clyde would read it in all cool-like. Meanwhile you'd have to search through a million dependency files to find out which one has foo.h in it.
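
Here's a minimal sketch of Clyde's walk, with the reverse-dependency table from above hard-coded as arrays. The point is that nothing outside the reachable part of the DAG ever gets touched:

    #include <stdio.h>
    #include <string.h>

    static const char *names[] = {
        "foo.c", "foo.h", "bar.c", "foo.o", "bar.o", "hello_world"
    };
    /* rdeps[i] holds the indexes of the nodes that depend on node i
     * (the arrows go up); -1 marks the end of each list */
    static const int rdeps[6][3] = {
        { 3, -1, -1 },  /* foo.c is used by foo.o */
        { 3,  4, -1 },  /* foo.h is used by foo.o and bar.o */
        { 4, -1, -1 },  /* bar.c is used by bar.o */
        { 5, -1, -1 },  /* foo.o is used by hello_world */
        { 5, -1, -1 },  /* bar.o is used by hello_world */
        { -1, -1, -1 }, /* nothing depends on hello_world */
    };
    static int visited[6];

    /* flag everything reachable upward from a changed file */
    static void visit(int n)
    {
        int i;

        if (visited[n])
            return;
        visited[n] = 1;
        printf("visiting: %s\n", names[n]);
        for (i = 0; i < 3 && rdeps[n][i] >= 0; i++)
            visit(rdeps[n][i]);
    }

    int main(void)
    {
        int i;

        /* the change list says foo.h was modified */
        for (i = 0; i < 6; i++)
            if (strcmp(names[i], "foo.h") == 0)
                visit(i);
        return 0;
    }

Give this a million unrelated nodes and the output doesn't change, because the walk never looks at them.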

So Clyde is able to read in just the part of the DAG he needs, and still has time to go and steal your girlfriend. Meanwhile you're stuck reading through every dependency file in the system. Way to blow it, dude.

This proves why the arrows must go up. As a corollary, if your build system reads dependencies generated from gcc -MD (or any variant thereof), or makedepend, or in any case makes it difficult to find out which object files depend on a header file, then the arrows go down. You're doing it wrong. Your builds are linearly slow. The bad guy wins, and everyone goes home and says "man, that movie sucked".

Why is everything stored in a database?

You know how Clyde is able to go through his dependencies super-fast? Well, he has to be able to take a file, like "foo.h", and find the column that has the next list of files to look at. Ideally you'd be able to do that really quickly, with a hash table or some such. Rather than build all that crap myself, I decided to use SQLite. Since there are a few well-placed indexes on the tables, you can pretty much find any file's list of dependent files in O(log n) time. You might think you could do better by using a hash table, but you'd be wrong. If you'd like to prove me wrong, go ahead and write your own O(1) database. Then I can accept defeat and start using it. Just keep in mind that O(log n) is really good. I read on Wikipedia this one time that the number of atoms in the observable universe is about 10^80. That's roughly 2^266, and log2(2^266) = 266. So if you were building a project that had a file for every atom in the universe, doing a lookup would take about 266 * x time, where x is the time it takes to go from one branch of the tree to the next. Really, you could consider 266 * x to be a constant upper bound for all feasible input sizes. So for any project you will ever work on in the known universe, you can find a file's set of dependencies in (266 * a constant) time, which as we all know is O(1). See, I told you there'd be hand-waving.
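
To make that concrete, here's a minimal sketch of the kind of lookup the database makes fast. The table and column names are made up for illustration - tup's real schema differs - but the idea is the same: an index on the "from" side of each link turns "who depends on this file?" into a tree descent instead of a full scan. (Compile with -lsqlite3.)

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        sqlite3_open(":memory:", &db);
        sqlite3_exec(db,
            "CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT);"
            "CREATE TABLE link (from_id INTEGER, to_id INTEGER);"
            /* the well-placed index */
            "CREATE INDEX link_from ON link(from_id);"
            "INSERT INTO node VALUES (1,'foo.h'), (2,'foo.o'), (3,'bar.o');"
            "INSERT INTO link VALUES (1,2), (1,3);",
            0, 0, 0);

        /* O(log n) index lookup: who depends on foo.h? */
        sqlite3_prepare_v2(db,
            "SELECT n.name FROM link l, node n"
            " WHERE l.from_id = (SELECT id FROM node WHERE name='foo.h')"
            " AND n.id = l.to_id;",
            -1, &stmt, 0);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("depends on foo.h: %s\n",
                   (const char *)sqlite3_column_text(stmt, 0));
        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }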

Anyway, I'm not too solid on database construction or performance maximization, so if you have any suggestions in that regard I'd be happy to hear them. The schema is pretty simple, but there are a bunch of weird queries.

Why are commands in the DAG?

Tup's DAG is slightly more complex than what is shown in the previous section. It actually includes the commands that were executed, so it can handle commands with multiple outputs. Consider these partial DAGs:

[Partial DAG 1: foo.o is built from foo.c, and bar.o is built from bar.c]
[Partial DAG 2: parse.c and parse.h are both built from parse.y]

In the first case, we'd have to run two commands (gcc on foo.c and gcc on bar.c). In the second case, we'd have to run bison once on parse.y. (I think that's how it works - I've never really used bison, except that one time in CS class. This is evidenced by my completely lackluster and substandard parser code in tup.) What make does is get to a node and ask, "what do I run to update this file?" The problem is that on the DAG on the right it would ask that for both parse.c and parse.h, come up with the same answer, and run it twice. There are ways to work around that in make, but they're kinda silly. See if you can read through the entire Multiple Outputs page on gnu.org without saying "man, that's silly".

Let's defy the immutable laws of build systems once again and put more than just files in the DAG. Here, we'll include the commands that update nodes. Each command has incoming links that represent the files it will read, and outgoing links that represent the files it will write.

[DAG with Commands 1: foo.c feeds a gcc command, which outputs foo.o; bar.c feeds a second gcc command, which outputs bar.o]
[DAG with Commands 2: parse.y feeds a single bison command, which outputs both parse.c and parse.h]

Now it is very easy to read the DAG and run the appropriate commands. You can pretty much just walk the DAG and be all like:

  1. Is this a file? If so, I don't care (these are just used for getting the ordering of commands).
  2. Is this a command? If so run it.

Since we only loaded the part of the DAG we need, we pretty much have to run every command we come across. The only way you wouldn't is if you checked the output files' signatures before and after running a command, and then you could short-circuit the DAG. Maybe I'll do that someday.
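
Here's a minimal sketch of that walk, with a hypothetical, already-ordered slice of the bison DAG hard-coded (the bison command line is made up, and running it assumes parse.y exists and bison is installed). Files get skipped, commands get run:

    #include <stdio.h>
    #include <stdlib.h>

    enum type { TUP_FILE, TUP_COMMAND };

    struct node {
        enum type type;
        const char *name; /* a file name, or a command string to run */
    };

    int main(void)
    {
        /* the loaded part of the DAG, already in dependency order */
        struct node dag[] = {
            { TUP_FILE,    "parse.y" },
            { TUP_COMMAND, "bison -d -o parse.c parse.y" },
            { TUP_FILE,    "parse.c" },
            { TUP_FILE,    "parse.h" },
        };
        size_t i;

        for (i = 0; i < sizeof(dag) / sizeof(dag[0]); i++) {
            if (dag[i].type == TUP_FILE)
                continue; /* 1. a file? don't care */
            printf("running: %s\n", dag[i].name);
            if (system(dag[i].name) != 0) /* 2. a command? run it */
                return 1;
        }
        return 0;
    }

Note that the bison command is a single node even though it has two outputs, so it runs exactly once - no silly workarounds required.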