Algorithms Matter: Incrementally Naming Files

A few days ago, I was trying to split a data set into various files according to their type, but it was terribly slow. I saved an hour by taking a step back and thinking about my algorithms. Because algorithms matter. But not terribly much.

Let's face it, most programming is about debugging. Only rarely do we have to employ any glitzy algorithms, especially when the language already has a decent standard library with sorting algorithms and the like. But occasionally, a good algorithm can make the difference between night and day.

I had nearly a gigabyte of data sitting around, waiting to be categorized and stored into smaller files. Each record should have its own file. So I had a piece of code that worked like this:

def make_filename(record_type, i):
  return "{}.{}.txt".format(record_type, i)

i = 1
filename = make_filename(record_type, i)
while file_exists(filename):
  i += 1
  filename = make_filename(record_type, i)
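
The file_exists helper isn't shown here; think of it as a thin wrapper around the standard library, roughly:

import os.path

def file_exists(filename):
  # Assumed implementation: delegate to the standard existence check.
  return os.path.exists(filename)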

I tested the script on a fragment of the data set, and it worked all right. Then I let it loose on a larger chunk of data, and waited. And waited some more. At some point I grew too impatient and killed the process.

What had happened? For each new file of a type, the script had to loop through all existing files of that type. This means I had accidentally created a quadratic algorithm: the n-th new file needs n existence checks, so n files need roughly n^2/2 checks in total. That doesn't sound too bad for a dozen files, but I had one type with nearly 20000 records, and doing a stat() for each of those checks is terribly slow.
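
A quick back-of-the-envelope calculation shows why this blows up:

n = 20000                  # records of the largest type
checks = n * (n + 1) // 2  # 1 + 2 + ... + n existence checks
print(checks)              # 200010000, i.e. about 2 * 10^8 stat() calls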

So of course the trivial solution would be to keep a counter in my script for each type so that I only have to go through the existing files once, but this is terribly fragile. Whenever I have the same information represented in two places, one of those will invariably come out of sync. In this case, the file system should be the single source of truth regarding the existence of files.
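
For illustration, the counter-based variant would look roughly like this, with per-type counters duplicating what the file system already knows:

counters = {}  # record_type -> next free index; would have to be seeded by scanning each type once

def next_filename(record_type):
  # Fast, but the counter and the directory contents can drift apart,
  # e.g. after a crash, a manual cleanup, or a second running instance.
  i = counters.get(record_type, 1)
  counters[record_type] = i + 1
  return make_filename(record_type, i)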

So how can I find the next free number in this list of files without knowing the length beforehand? I was reminded of the binary search algorithm. Given two bounds, it finds a specific element in \log n comparisons by iteratively halving the search space and continuing the search within the new lower and upper bounds.
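
As a quick refresher, Python's standard library even ships binary search over a sorted list in the bisect module:

from bisect import bisect_left

primes = [2, 3, 5, 7, 11, 13]
print(bisect_left(primes, 7))   # 3: index of 7, found in O(log n) comparisons
print(bisect_left(primes, 17))  # 6: where 17 would have to be inserted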

Now I only need an upper bound: some index whose file name doesn't exist. Instead of looking at each file, I could have used a larger increment, like only checking every 100th file. However, this does not fundamentally change my existing algorithm. Constant factors are irrelevant for this analysis, so I'd still be left with a slow, unfavourably scaling O(n^2) algorithm.

Instead, the skip size has to grow with the size of my data set. In particular, I can skip over as many elements as I have already seen while searching for the end, which means the skipped amount doubles at each step. Starting at 1, the probes go to indices 1, 2, 4, 8, and so on until a file doesn't exist, so for my roughly 20000 records that is only about 16 checks. With this, I can find an upper bound in \log n steps.

With a lower bound of the last file I know to exist and an upper bound of some file I know doesn't exist, I can use an ordinary binary search to find the exact boundary. Again, this binary search only takes O(\log n). The combination of the two O(\log n) phases is still O(\log n) per file, so the total complexity for all elements is O(n \log n), “quasi-linear”. This should be noticeably faster than the quadratic algorithm at this scale.

def expsearch(start, is_inside):
  """
  Find the length of a random-access collection.
  Runs in Θ(log n) where n is the actual length of the collection.

  start: int
    a natural number (including zero) from where the search is started.

  is_inside: int → bool
    a callback to return true while the given index is inside the
    collection.
  """

  # Find an upper bound on the next free index by repeated doubling.
  hi = lo = start
  while is_inside(hi):
    lo = hi
    hi = 2*hi if hi > 0 else 1

  # Narrow down the exact boundary with a binary search in [lo, hi].
  # In the end, hi is the first index outside the collection, i.e. the next free index.
  while lo < hi:
    mid = (lo + hi) // 2
    if is_inside(mid):
      # too low
      lo = mid + 1
    else:
      # too high
      hi = mid

  return hi

i = expsearch(1, lambda i: file_exists(make_filename(record_type, i)))
filename = make_filename(record_type, i)
write_data_to(filename)

The only problem with this approach is that it has a race condition: after we test that the filename doesn't exist, but before we actually create the file, another process could create a file with that very name. This could be solved with a lockfile, but I opted for the simpler solution of just not running the program in parallel.
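
If I ever needed parallel runs, a minimal sketch of a race-free alternative (which I didn't actually use) would be to let the file system arbitrate by creating the file in exclusive mode and retrying on a collision:

def create_next_file(record_type):
  # Sketch: open with "x" (exclusive creation), so the existence check and the
  # creation happen atomically in the file system; retry if another process
  # grabbed the index in the meantime.
  i = expsearch(1, lambda i: file_exists(make_filename(record_type, i)))
  while True:
    try:
      return open(make_filename(record_type, i), "x")
    except FileExistsError:
      i += 1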

Also, this assumes that testing for file existence is a constant-time operation, which doesn't hold for some file systems, especially with as many directory entries as in this case.

As it turns out, it was actually fast enough now. After a moment of waiting (but not an hour of waiting), all the files had been processed and I could start with my real analysis. Taking a step back and thinking about your algorithms can help in cases like this.

While algorithmic problems are not terribly common, I think this ability to see an avoidable problem, remember solutions to similar problems, and come up with an algorithm of your own is an important quality of a good programmer. A solid computer science education still matters. But it's not necessarily essential. A developer gets paid for solving problems, but only very few problems are solved by algorithms (as compared to mindful debugging, meditating over the documentation, or just knowing whom to ask). For these reasons, it seems like a bad idea when algorithms dominate interview questions.

In an interview setting, algorithm questions are sometimes used as a proxy for job performance. But since most programming doesn't touch deeply on algorithms, they make a fairly bad proxy. In “We Hire the Best, Just Like Everyone Else”, Jeff Atwood mentions audition projects as a better interview mechanism: a candidate is invited to contract for a small project, and the candidate and the company can then get to know each other through real work. Perhaps that is a better approach (though quite flawed in other ways).