Stress duty

Mon, Apr 12, 2004

Here in Windows (and in other groups around the company) we have a system that is generally called stress.  Different groups have different implementations, but the idea is always the same.  You develop a test suite that hammers your system and then have everyone install it and run it overnight.  The idea is that this way we can catch those hard-to-repro issues that otherwise only our customers would see.

When you get one of these errors, what do you do with it?  Generally there is a team of people who have the thankless job of coming in early, finding all the machines that have broken into the debugger, logging the failures in a database and having the system send mail to various people to investigate.  The actual debugging is done via the NT command line debugger (cdb) redirected over the network via the remote.exe utility.  It is up to this team to figure out which issues have already been caught and fixed (but perhaps aren't in the build yet) and which issues are new and need to be investigated.  These guys cover a lot of code, so they generally don't have the knowledge to figure out which developer should be looking at any particular problem.  That falls to the rotating list of people who are on "stress duty."  The shell team calls it "dev of the day."  We've learned by hard experience that if you send these types of things to a big mailing list, they either get ignored or the same sorry glutton for punishment picks up all of the issues.
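
To make the triage step concrete, here is a rough sketch of the idea in Python.  This is not how the real stress system works; the failure "signature," the bug IDs, and the names (Failure, triage, on_duty) are all made up for illustration.  It just shows the two decisions described above: splitting failures into already-known versus new, and picking one person from a rotation instead of mailing a big list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    machine: str    # machine that broke into the debugger
    signature: str  # hypothetical key, e.g. "module!function" of the fault


def triage(failures, known_bugs):
    """Split failures into already-known issues and new ones.

    known_bugs maps a failure signature to an existing bug ID.
    Returns (known, new): known maps bug ID -> machines,
    new maps signature -> machines.
    """
    known, new = {}, {}
    for f in failures:
        bug_id = known_bugs.get(f.signature)
        if bug_id is not None:
            known.setdefault(bug_id, []).append(f.machine)
        else:
            new.setdefault(f.signature, []).append(f.machine)
    return known, new


def on_duty(rotation, week_index):
    """Pick the single person responsible this week from a fixed rotation."""
    return rotation[week_index % len(rotation)]
```

New signatures would go to whoever on_duty returns for the week, and that person either files the bug or reassigns it to someone who knows the code.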

I happen to be on stress duty this week.  My job is to handle, in as expedient a fashion as possible, any incoming issues.  I have to do my best to exercise my debugging skills and try to figure out what is going on.  BTW, Raymond Chen is one of the undisputed masters of the cdb kung-fu.  If I can figure out what is going on myself, I can file a bug on the issue and let the person whose machine I'm debugging reclaim their machine.  If the problem is in code that I don't know well, it is my job to reassign the issue to someone who does know the code well.

Beyond handling incoming stress breaks in the morning, I also have to be responsive to all sorts of other issues that might get dropped on the floor.  This includes other random crashes people might have during the day, along with helping to resolve any build breaks.  (We try super hard to make sure that build breaks don't happen, but with something the size of Windows they still sometimes happen.)

Do other companies have systems like this set up?  I'd be interested to hear what works for you.

Managed/Unmanaged lines

Mon, Apr 12, 2004

Over on Channel9, there was an interesting question about the history of how my team in Avalon decided to split our component between managed and unmanaged code.