hardware bugs vs. software bugs, it's not too surprising that a lot more effort goes into hardware testing. But here's a case where there's almost no extra effort. You've already got analytics measuring the conversion rate through all sorts of user funnels. The only new idea here is that clicking on an ad or making a purchase isn't the only type of conversion you should measure. Following directions at an intersection is a conversion, not deleting an image immediately after pasting it is a conversion, and using a modal dialogue box after opening it up is a conversion.
Of course, whether it's ad click conversion rates or cache hit rates, blindly optimizing a single number will get you into a local optima that will hurt you in the long run, and setting thresholds for conversion rates that should send you an alert is nontrivial. There's a combinatorially large space of user actions, so it takes judicious use of machine learning to figure out reasonable thresholds. That's going to cost time and effort. But think of all the effort you put into optimizing clicks. You probably figured out, years ago, that replacing boring text with giant pancake buttons gives you 3x the clickthrough rate; you're now down to optimizing 1% here and 2% there. That's great, and it's a sign that you've captured all the low hanging fruit. But what do you think the future clickthrough rate is when a user encounters a show-stopping bug that prevents any forward progress on a modal dialogue box?
If this sounds like an awful lot of work, find a known bug that you've fixed, and grep your logs data for users who ran into that bug. Take your time; I'll wait. Alienating those users by providing a profoundly broken product is doing a lot more to your clickthrough rate than having a hard to find checkout button, and the exact same process that led you to that gigantic checkout button can solve your other problem, too. Everyone knows that adding 200ms of load time can cause 20% of users to close the window. What do you think the effect of exposing them to a bug that takes 5,000ms of user interaction to fix is?
If that's worth fixing, pull out scalding, dremel, cascalog, or whatever your favorite data processing tool is. Start looking for user actions that don't make sense. Start looking for bugs.
Thanks to Pablo Torres for catching a typo in this post
This is worse than it sounds. In addition to producing a busy icon forever in the doc, it disconnects that session from the server, which is another thing that could be detected: it's awfully suspicious if a certain user action is always followed by a disconnection.
Moreover, both of these failure modes could have been found with fuzzing, since they should never happen. Bugs are hard enough to find that defense in depth is the only reasonable solution.
[return]