Randomize HN

You ever notice that there's this funny threshold for getting to the front page on sites like HN? The exact threshold varies depending on how much traffic there is, but, for articles that aren't wildly popular, there's this moment when the article is at N-1 votes. There is, perhaps, a 60% chance that the vote will come and the article will get pushed to the front page, where it will receive a slew of votes. There is, maybe, a 40% chance it will never get the vote that pushes it to the front page, causing it to languish in obscurity forever.

It's non-optimal that an article that will receive 50 votes in expectation has a 60% chance of getting 100+ votes, and a 40% chance of getting 2 votes. Ideally, each article would always get its expected number of votes and stay on the front page for the expected number of time, giving readers exposure to the article in proportion to its popularity. Instead, by random happenstance, plenty of interesting content never makes it the front page, and as a result, the content that does make it gets a higher than optimal level of exposure.

You also see the same problem, with the sign bit flipped, on low traffic sites that push things to the front page the moment they're posted, like lobste.rs and the smaller sub-reddits: they displace links that most people would be interested in by putting links that almost no one cares about on the front page just so that the few things people do care about get enough exposure to be upvoted. On reddit, users "fix" this problem by heavily downvoting most submissions, pushing them off the front page, resulting in a problem that's fundamentally the same as the problem HN has.

Instead of implementing some simple and easy to optimize, sites pile on ad hoc rules. Reddit implemented the rising page, but it fails to solve the problem. On low-traffic subreddits, like r/programming the threshold is so high that it's almost always empty. On high-traffic sub-reddits, anything that's upvoted enough to make it to the rising page is already wildly successful, and whether or not an article becomes successful is heavily dependent on whether or not the first couple voters happen to be people who upvote the post instead of downvoting it, i.e., the problem of getting onto the rising page is no different than the problem of getting to the top normally.

HN tries to solve the problem by manually penalizing certain domains and keywords. That doesn't solve the problem for the 95% of posts that aren't penalized. For posts that don't make it to the front page, the obvious workaround is to delete and re-submit your post if it doesn't make the front page the first time around, but that's now a ban worthy offense. Of course, people are working around that, and HN has a workaround for the workaround, and so on. It's endless. That's the problem with "simple" ad hoc solutions.

There's an easy fix, but it's counter-intuitive. By adding a small amount of random noise to the rank of an article, we can smooth out the discontinuity between making it onto the front page and languishing in obscurity. The math is simple, but the intuition is even simpler1. Imagine a vastly oversimplified model where, for each article, every reader upvotes with a fixed probability and the front page gets many more eyeballs than the new page. The result follows. If you like, you can work through the exercise with a more realistic model, but the result is the same2.

Adding noise to smooth out a discontinuity is a common trick when you can settle for an approximate result. I recently employed it to work around the classic floating point problem, where adding a tiny number to a large number results in no change, which is problem when adding many small numbers to some large numbers3. For a simple example of applying this, consider keeping a reduced precision counter that uses loglog(n) bits to store the value. Let countVal(x) = 2^x and inc(x) = if (rand(2^x) == 0) x++4. Like understanding when to apply Taylor series, this is a simple trick that people are often impressed by if they haven't seen it before. You're welcome5.

  1. Another way to look at it is that it's A/B testing for upvotes (though, to be pedantic it's actually closer to multi-armed bandit). Another is that the distribution of people reading the front page and the new page aren't the same, and randomizing the front page prevents the clique that reads the new page from having undue influence. [return]
  2. If you want to do the exercise yourself, pg once said the formula for HN is: (votes - 1) / (time + 2)^1.5. It's possible the power of the denominator has been tweaked, but as long as it's greater than 1.0, you'll a reasonable result. [return]
  3. Kahan summation wasn't sufficient, for the fundamental same reason it won't work for the simplified example I gave above. [return]
  4. Assume we use a rand function that returns a non-negative integer between 0 and n-1, inclusive. With x = 0, we start counting from 1, as God intended. inc(0) will definitely increment, so we'll increment and correctly count to countVal(1) = 2^1 = 2. Next, we'll increment with probability ½; we'll have to increment twice in expectation to increase x. That works out perfectly because countVal(2) = 2^2 = 4, so we want to increment twice before increasing x. Then we'll increment with probability ¼, and so on and so forth. [return]
  5. See Mitzenmacher for a good introduction to randomized algorithms that also has an explanation of all the math you need to know. If you already apply Chernoff bounds in your sleep, and want something more in-depth, Motwani & Raghavan is awesome. [return]
← How to discourage open source contributions Writing safe Verilog →
Archive Support this site (patreon) About Twitter RSS