The Performance Cost of Integer Overflow Checking

How much overhead should we expect from enabling integer overflow checks? Using a compiler flag or built-in intrinsics, we should be able to do the check with a conditional branch on the overflow flag that add and sub set. Code that looks like

add     %esi, %edi

should turn into something like

add     %esi, %edi
jo      <handle_overflow>

Assuming that branch is always correctly predicted (which should be the case for most code), the costs of the branch are the cost of executing that correctly predicted not-taken branch, the pollution the branch causes in the branch history table, and the cost of decoding the branch (on x86, jo and jno don’t fuse with add or sub, which means that on the fast path, the branch will take up one of the 4 opcodes that can come from the decoded instruction cache per cycle). That’s probably less than a 2x penalty per add or sub in the worst case, on front-end limited code (which might happen in a tightly optimized loop, but should be rare in general), plus some nebulous penalty from branch history pollution, which is really difficult to measure in microbenchmarks. Overall, we can use 2x as a pessimistic guess for the total penalty.
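For concreteness, here’s a minimal sketch of the same check from C, using gcc/clang’s __builtin_add_overflow; the overflow handling below (print and return 0) is an illustrative assumption, not a recommendation:

#include <stdio.h>

/* Checked add via gcc/clang's __builtin_add_overflow, which on
 * x86 compiles down to the add + jo pattern shown above. */
int checked_add(int a, int b) {
    int result;
    if (__builtin_add_overflow(a, b, &result)) {
        /* the jo target: handle the overflow however we like */
        fprintf(stderr, "overflow in %d + %d\n", a, b);
        return 0;
    }
    return result;
}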

2x sounds like a lot, but how much time do applications spend adding and subtracting? If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 40% load/store ops, 10% branches, and 50% other operations. Of the 50% “other” operations, maybe 30% of those are integer add/sub ops, i.e., 15% of all ops. If we guesstimate that load/store ops are 10x as expensive as add/sub ops, and other ops are as expensive as add/sub, then the baseline cost is 40*10 + 10 + 50 = 460 units, and a 2x penalty on add/sub adds 15 units, for about a (40*10 + 10 + 50 + 15) / (40*10 + 10 + 50) ≈ 3% penalty.

A Quick Tutorial on Implementing and Debugging Malloc, Free, Calloc, and Realloc

Let’s write a malloc and see how it works with existing programs!

This tutorial is going to assume that you know what pointers are, and that you know enough C to know that *ptr dereferences a pointer, ptr->foo means (*ptr).foo, that malloc is used to dynamically allocate space, and that you’re familiar with the concept of a linked list. If you decide to work through this tutorial without really knowing C, please let me know what parts could use more exposition. If you want to look at all of this code at once, it’s available here.

Preliminaries aside, malloc’s function signature is

void *malloc(size_t size);

It takes as input a number of bytes and returns a pointer to a block of memory of that size.

There are a number of ways we can implement this. We’re going to arbitrarily choose to use the sbrk syscall. The OS reserves stack and heap space for processes and sbrk lets us manipulate the heap. sbrk(0) returns a pointer to the current top of the heap. sbrk(foo) increments the heap size by foo and returns a pointer to the previous top of the heap.
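As a warm-up, here’s the simplest malloc we can build on top of sbrk — a bump allocator that never reuses memory (a deliberately naive sketch, not the final allocator):

#include <unistd.h>

/* Naive bump-allocator malloc: grow the heap by size bytes and
 * return a pointer to the old top. Nothing is ever freed or
 * reused, so this leaks everything -- a starting point only. */
void *malloc(size_t size) {
    void *p = sbrk(size);
    if (p == (void *) -1) {
        return NULL; /* sbrk failed; we're out of memory */
    }
    return p;
}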

Markets Don’t Eliminate Discrimination

I literally can’t remember the last time I saw a public discussion of discrimination in tech where someone didn’t assert that discrimination is impossible because of the market. Here’s a quote from Marc Andreessen that sums up a common view1.

Let’s launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that’s the case. … No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They’re like lying on the beach gasping because they can’t get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.

Marc Andreessen’s point is that the market is too competitive for discrimination to exist. But VC-funded startups aren’t the first companies in the world to face a competitive hiring market. Consider the market for PhD economists from, say, 1958 to 1987. Alan Greenspan had this to say about how that market looked to his firm, Townsend-Greenspan.

What Do Linux Developers Say in Commit Messages?

I was curious what different people worked on in Linux, so I tried grabbing data from the current git repository to see if I could pull that out of commit message data. This doesn’t include history from before they switched to git, so it only goes back to 2005, but that’s still a decent chunk of history.

Here’s a list of the most commonly used words in commit messages for the four most frequent committers, ordered by number of commits.

Given That We Aren’t Going to Put Any Effort Into Testing, What’s the Best Way to Test?

If I had to guess, I’d say I probably work around hundreds of bugs in an average week, and thousands in a bad week. It’s not unusual for me to run into a hundred new bugs in a single week. But I often get skepticism when I mention that I run into multiple new (to me) bugs per day, and that this is inevitable if we don’t change how we write tests. Well, here’s a log of one week of bugs. After a brief description of the bugs, I’ll talk about what we can do to improve the situation. The obvious answer is to spend more effort on testing, but everyone already knows we should do that and no one does it. That doesn’t mean it’s hopeless, though.

One week of bugs

Julia

Unicode sequence causes match/ismatch to blow up with a bounds error.

Unicode sequence causes using a string as a hash index to blow up with a bounds error.

Exception randomly not caught by catch. This sucks because putting things in a try/catch was the workaround for the two bugs above. I’ve seen other variants of this before; it’s possible this shouldn’t count as a new bug because it might be the same root cause as some bug I’ve already seen.

Speeding Up Octopress by 15x

I’ve seen all these studies that show how a 100ms improvement in page load time has a significant effect on page views, conversion rate, etc., but I’d never actually tried to optimize my site. This blog is a static Octopress site, hosted on Github Pages. Static sites are supposed to be fast, and Github Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?

Not having done this before, I didn’t know what to do. But in a great talk on how the internet works, Dan Espeset suggested trying webpagetest; let’s give it a shot.

What Are the Odds the Build Actually Works?

I’ve noticed that builds are broken and tests fail a lot more often on open source projects than on “work” projects. I wasn’t sure how much of that was my perception vs. reality, so I grabbed the Travis CI data for a few popular categories on GitHub1.

The Empirical Evidence That Types Affect Productivity and Correctness

There are some pretty strong statements about types floating around out there; for the most part, those claims haven’t matched my personal experience. The claims range from the oft-repeated phrase that when you get the types to line up, everything just works, to “not relying on type safety is unethical (if you have an SLA)”1. There are probably plenty of strong claims about dynamic languages that I’d disagree with if I heard them, but I’m not in the right communities to hear the more outlandish claims about dynamically typed languages. Either way, it’s rare to see people cite actual evidence.

This is a review of the empirical evidence.

Why Intel Added the CLWB and PCOMMIT Instructions

The latest version of the Intel manual has a couple of new instructions for non-volatile storage, like SSDs. What’s that about?

Before we look at the instructions in detail, let’s take a look at the issues that exist with super fast NVRAM. One problem is that next generation storage technologies will be fast enough that syscall and other OS overhead can be more expensive than the actual cost of the disk access1. Another is the impedance mismatch between the x86 memory hierarchy and persistent memory. In both cases, it’s basically an Amdahl’s law problem, where one component has improved so much that other components have to improve to keep up.
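As a rough sketch of where these instructions fit, here’s the kind of flush-then-fence sequence involved. This assumes a compiler that exposes the _mm_clwb intrinsic (e.g., gcc with -mclwb); the PCOMMIT step is left as a comment since it’s the part we’ll look at in detail below:

#include <immintrin.h>

/* Write the cache line holding p back toward memory without
 * evicting it, then fence so the writeback is ordered before
 * later stores. Assumes CPU and compiler support for CLWB. */
void flush_line(void *p) {
    _mm_clwb(p);   /* CLWB: writeback, line stays cached */
    _mm_sfence();  /* order the writeback */
    /* In this scheme, a PCOMMIT (plus another fence) would then
     * make the write durable in the persistence domain. */
}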

Cache Eviction: When Are Randomized Algorithms Better Than LRU?

Once upon a time, my computer architecture professor mentioned that using a random eviction policy for caches really isn’t so bad. It can be surprising that random eviction isn’t bad – if your cache fills up and you have to get rid of something, choosing the least recently used (LRU) entry is an obvious choice, since you’re more likely to use something if you’ve used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it’s going to cause a miss every time if the loop doesn’t fit. A random eviction policy degrades gracefully as the loop gets too big.

In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?
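Here’s a minimal sketch of that “2-random” policy for a single 8-way set; the structure and field names are illustrative assumptions, not code from the simulations below:

#include <stdlib.h>

/* One way of an 8-way set; last_access is a logical timestamp
 * that the access path would bump on every hit. */
typedef struct {
    unsigned long tag;
    unsigned long last_access;
} way_t;

/* 2-random eviction: pick two distinct random ways and evict
 * whichever was used less recently -- LRU between two choices. */
int pick_victim_2random(way_t set[8]) {
    int a = rand() % 8;
    int b = rand() % 8;
    while (b == a)
        b = rand() % 8; /* ensure two distinct candidates */
    return (set[a].last_access < set[b].last_access) ? a : b;
}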

Here are the relative miss rates we get for SPEC CPU1 with a Sandy Bridge-like cache (8-way associative; 64KB, 256KB, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.

Policy     L1 (64KB)   L2 (256KB)   L3 (2MB)
2-random   0.91        0.93         0.95
FIFO       0.96        0.97         1.02
LRU        0.90        0.90         0.97
random     1.00        1.00         1.00