75: sixth edition -- leaves Bell Labs, basis for BSD 1.x
79: seventh edition -- one of the best
82: System III
84: 4.2 BSD
89: SVR4 -- unification of Xenix, BSD, System V
NT development begins
History of NT
Team forms 11/88
Six guys from DEC
One guy from MS
Built from the ground up
Advanced PC OS
Designed for desktop & server
Secure, scalable, SMP design
All new code
Schedule: 18 months (only missed our date by 3 years)
History of NT, cont.
Initial effort targeted at Intel i860 code-named N10, hence the name NT which doubled as N-Ten and New Technology
Most dev done on an i860 simulator running on OS/2 1.2
Microsoft built a single board i860 computer code-named Dazzle, including the supporting chipset; ran full kernel, memory management, etc. on the machine
Compiler came from Metaware with weekly UUCP updates sent to my Sun-4/200
MS wrote a PE/COFF linker and a graphical cross debugger
Design longevity
OS code has a long lifetime
You have to base your OS on solid design principles
You have to set goals; not everything can be at the top of the list
You have to design for evolution in hardware, usage patterns, etc.
Only way to succeed is to base your design on a solid architectural foundation
Development environments never get enough attention
Goal setting
First job was to establish high level goals
Portability: ability to target more than one processor, avoid assembler, abstract away machine dependencies. Purposely started the i386 port very late to avoid falling into a typical Microsoft x86 centric design
Reliability: nothing should be able to crash the OS. Anything that crashes the OS is a bug. Very radical thinking inside MS considering Win16 was co-operative multi-tasking in a single address space, and OS/2 had similar attributes with respect to memory isolation
Extensibility: ability to extend OS over time
Compatibility: with DOS, OS/2, POSIX, or other popular runtimes; this is the foundation work that allowed us to invent Windows NT two years into NT OS/2 development
Performance: all of the above are more important than raw speed!
NT OS/2 design workbook
Design of executive captured in functional specs
Written by engineers, for engineers
Every functional interface was defined and reviewed
Small teams can do this efficiently
Making this process scale is an almost impossible challenge
Senior developers are inundated with spec reviews and the value of their feedback becomes meaningless
You have to spread review duties broadly and everyone must share the culture
Developing a culture
To scale a dev team, you need to establish a culture
Common way of evaluating designs, making tradeoffs, etc.
Common way of developing code and reacting to problems (build breaks, critical bugs, etc.)
Common way of establishing ownership of problems
Goal setting can be the foundation for the culture
Keeping culture alive as a team grows is a huge challenge
The NT culture
Portability, reliability, security, and extensibility ingrained as the team's top priorities
Every decision was made in the context of these design goals
Everyone owns all the code, so whenever something is busted anyone has a right and a duty to fix it
Works in small groups (< 150 people) where people cover for each other
Fails miserably in large groups
Sloppiness is not tolerated
Great idea, but very difficult to nurture as group grows
Abuse and intimidation get way out of control; you can't keep calling people stupid and expect them to listen
A successful culture has to accept that mistakes will happen
NT 3.1 vs. Windows 2000
Dev teams
Source control
Process management
Serialized development
Defects
Development team
NT 3.1
Starts small (6), slowly grows to 200 people
NT culture was commonly understood by all
Windows 2000
Mass assimilation of other teams into the NT team
NT 4.0 had 800 developers, Windows 2000 had 1400
Original NT culture practiced by the old timers in the group, but keeping the culture alive was difficult due to growth, physical separation, etc.
Diluted culture leads to conflict
Accountability: I don't "own" the code that is busted, see Mark!
Reliability vs. new features
64-bit portability vs. new features
Source control system (NT 3.1)
Internally developed, maintained by a non-NT tools team
No branch capability, but not needed for small team
10-12 well isolated source "projects", 6M LOC
Informal project separation worked well
Minimal obscure source level dependencies
Small hard drive could easily hold entire source tree
Developer could easily stay in sync with changes made to the system
Source control system (Windows 2000)
Windows team takes ownership of source control system, which is on life support
Branch capability sorely needed, tree copies used as substitutes, so merging is a nightmare
180 source "projects", 29M LOC
No project separation, reaching "up and over" was very common as developers tried to minimize what they had to carry on their machines to get their jobs done
Full source base required about 50 GB of disk space
To keep a machine in sync was a huge chore (1 week to set up, 2 hours per day to sync)
Process management (NT 3.1)
Safe sync period in effect for 4 hours each day; all other times, the rule is check-in when ready
Build lab syncs during morning safe sync period, which starts a complete build
Build breaks are corrected manually during the build process (1-2 breaks were normal)
Complete build time is 5 hours on 486/50
Build is boot tested with some very minimal testing before release to stress testing
Defects corrected with incremental build fixes
4pm, stress testing on ~100 machines begins
Process management (Windows 2000)
Developers not allowed to change source tree without explicit email/written permission
Build lab manually approves each check-in using a combination of email, web, and a bug tracking database
Build lab approves about 100 changes each day and manually issues the appropriate sync and build commands
Build breaks are corrected manually; when they occur, all further build processing is halted
A developer that mistypes a build instruction can stop the build lab, which stops over 5000 people
Complete build time is 8 hours on 4-way PIII Xeon 550 with 50 GB disk and 512k cache
Build is boot tested and assuming we get a boot, extensive baseline testing begins
Testing is a mostly manual, semi-automated process
Defects occurring in the boot or test phase must be corrected before the build is "released" for stress testing
4pm, stress testing on ~1000 machines begins
Team size
Product  | Devs | Testers
NT 3.1   |  200 |  140
NT 3.5   |  300 |  230
NT 3.51  |  450 |  325
NT 4.0   |  800 |  700
Win2k    | 1400 | 1700
Serialized Development
The model from NT 3.1 to Windows 2000
All developers on team check in to a single main line branch
Master build lab syncs to main branch and builds releases from that branch
Checked in defect affects everyone waiting for results
Defect rates and serialization
Compile time or run time bugs that occur in a dev's office only affect that dev
Once a defect is checked in, the number of people affected by the defect increases
Best devs are going to check in a runtime or compile time mistake at least twice a year
Best devs will be able to cope with a checked in compile time or run time break very quickly (20 minutes end-to-end)
As the code base gets larger, and as the team gets larger, these numbers typically double
Defect rates data
With serialized development
Good, small, teams operate efficiently
Even the absolute best large teams are always broken and always serialized
Product  | Team size | Defects/dev-yr | Fix time/defect | Defects/day | Total fix time
NT 3.1   |  200      | 2              | 20m             |  1          | 20m
NT 3.5   |  300      | 2              | 25m             |  1.6        | 41m
NT 3.51  |  450      | 2              | 30m             |  2.5        | 1.2h
NT 4.0   |  800      | 3              | 35m             |  6.6        | 3.8h
Win2k    | 1400      | 4              | 40m             | 15.3        | 10.2h
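The last two columns of the table appear to follow from the first three. A minimal sketch of that arithmetic, assuming defects are spread evenly over a 365-day year (the slides do not state the divisor, so that constant is an assumption) and using the hypothetical helper names `defects_per_day` and `total_fix_minutes`:

```python
# Rows copied from the table above:
# (product, team size, defects per dev-year, fix minutes per defect)
ROWS = [
    ("NT 3.1", 200, 2, 20),
    ("NT 3.5", 300, 2, 25),
    ("NT 3.51", 450, 2, 30),
    ("NT 4.0", 800, 3, 35),
    ("Win2k", 1400, 4, 40),
]

DAYS_PER_YEAR = 365  # assumption: the slides do not say which divisor was used


def defects_per_day(team_size, defects_per_dev_year):
    """Checked-in defects the whole team produces per calendar day."""
    return team_size * defects_per_dev_year / DAYS_PER_YEAR


def total_fix_minutes(team_size, defects_per_dev_year, fix_minutes):
    """Minutes per day that a single serialized main line spends broken."""
    return defects_per_day(team_size, defects_per_dev_year) * fix_minutes


for product, team, rate, fix in ROWS:
    per_day = defects_per_day(team, rate)
    broken = total_fix_minutes(team, rate, fix)
    print(f"{product:8} {per_day:5.1f} defects/day  {broken / 60:5.1f} h broken/day")
```

Under this assumption the computed values land within rounding of the table. For Win2k, 1400 x 4 / 365 is about 15.3 defects per day, and at 40 minutes each that is about 10.2 hours of breakage per day on the one shared branch.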
Dev environment summary
NT 3.1
Fast and loose; lots of fun & energy
Few barriers to getting work done
Defects serialized parts of the process, but didn't stop the whole machine; minimal downtime
Windows 2000
Source control system bursting at the seams
Excessive process management serialized the entire dev process; 1 defect stops 1400 devs, 5000 team members
Resources required to build a complete instance of NT were excessive, giving few developers a way to be successful
Focused fixes
Source control
Source code restructuring
Make the large team work like a set of small teams
Windows is already organized into reasonable sized dev teams
Goal is to allow these teams to work as a team when contributing source code changes rather than as a group of individuals that happen to work for the same VP
Parallel development, team level independence
Automated builds
Source control system
New system identified 3/99 (SourceDepot)
Native branch support
Scalable high speed client-server architecture
New machine setup 3 hours vs. 1 week
Normal sync 5 minutes vs. 2 hours
Transition to SourceDepot done on live Win2k code base
Hand built SLM -> SourceDepot migration system allowed us to keep in sync with the old system while transitioning to SourceDepot without changing the code layout.
Source code restructuring
16 depots, one covering each major area of source code
Organization is focused on:
Minimizing cross project dependencies to reduce defect rate
Sizing projects to compile in a reasonable amount of time
To build a project, all you need is the code for that project and the public/root project
Cross project sharing is explicit
New tree layout
The new tree layout features
Root project houses public
15 additional projects hang off the root
No nested projects
All projects build independently
Cross project dependencies resolved via public, public/internal using checked in interfaces
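One way to picture "cross project sharing is explicit" is a check that flags any reference from one project into another project's private tree. The sketch below is illustrative only: the function name `cross_project_reaches`, the project names `base` and `shell`, and the example paths are all hypothetical, and it assumes shared interfaces live under a top-level `public` directory as the layout above describes.

```python
from pathlib import PurePosixPath

# Assumption: shared, checked-in interfaces live under a top-level "public" directory.
SHARED = {"public"}


def cross_project_reaches(project, referenced_paths):
    """Return the referenced paths that reach into another project's tree.

    project:          top-level directory name of the project being built
    referenced_paths: tree-relative paths (for example header includes) it uses
    """
    violations = []
    for path in referenced_paths:
        parts = PurePosixPath(path).parts
        if not parts:
            continue  # ignore empty entries
        top = parts[0]
        if top != project and top not in SHARED:
            violations.append(path)
    return violations


# Hypothetical references made while building a project named "base".
refs = [
    "base/ntos/ke/thread.c",         # inside the project itself
    "public/internal/base/inc/x.h",  # explicit shared interface
    "shell/inc/private.h",           # reaches "up and over" into another project
]
print(cross_project_reaches("base", refs))
```

A check along these lines would reject the "up and over" reaches that were common in the old tree, leaving public as the only path between projects.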
Team level independence
Each team determines its own check-in policy, enabling rapid, frequent check-ins
Teams are isolated from mistakes by other teams
When errors occur, only the team causing the error is affected
A build, boot, or test break only affects a small subset of the product group
Each team has their own view of the source tree, their own mini build lab, and builds an entire installable build
Any developer with adequate resources can easily duplicate a mini build lab
Build and release a completely installable Windows system
Teams integrate their changes into the "main" trunk one at a time, so there is a high degree of accountability when something goes wrong in "main"
Build breaks will happen, but they are easily localized to the branch level, not the main product codeline
Teams are isolated from mistakes made by other teams
When errors occur, they affect smaller teams
A build, boot, or test break only affects a small subset of the Windows development team
Each team has their own view of the source tree and their own mini build lab
Each team's lab is enlisted in all projects and builds all projects
Each team needs resources able to build an NT system
Each team's build lab builds, tests, and mini-bvt's a complete standalone system
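The accountability that comes from integrating into "main" one team at a time can be sketched as a loop that verifies the trunk after each team's merge. This is a simplified model, not the actual Windows integration tooling: the names `integrate_one_at_a_time` and `verify` are hypothetical, and rejecting a failing integration outright (rather than fixing forward) is an assumption made to keep the example short.

```python
def integrate_one_at_a_time(main, team_changes, verify):
    """Merge each team's changes into main in turn, verifying after every merge.

    main:         list of changes already in the main trunk
    team_changes: list of (team_name, changes) pairs, in integration order
    verify:       callable returning True if the given trunk builds and boots
    Returns (new_main, rejected_teams).
    """
    rejected = []
    for team, changes in team_changes:
        candidate = main + changes
        if verify(candidate):
            main = candidate       # integration accepted
        else:
            rejected.append(team)  # the break is attributable to exactly one team
    return main, rejected


def verify(trunk):
    # Stand-in for a real build/boot/test pass: a change named "bad" breaks the trunk.
    return "bad" not in trunk


main, rejected = integrate_one_at_a_time(
    ["base"],
    [("kernel", ["k1"]), ("shell", ["bad"]), ("net", ["n1"])],
    verify,
)
print(main, rejected)
```

Because only one team's changes are in flight at each verification step, a failure in "main" points at a single team, while the other teams keep working in their own branches.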
Automated builds
Build lab runs 100% hands off
10am and 10pm full sync and full build
Build failures are auto detected and mailed to the team
Successful builds are automatically released with automatic notification to the team
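The hands-off cycle described above (full sync, full build, then either release or a failure mail) can be sketched as one function. Everything here is a stand-in: `run_build_cycle` and its four callable parameters are hypothetical names, the failure log text is made up for the demonstration, and the 10am/10pm scheduling is assumed to be handled by an external scheduler that is not shown.

```python
def run_build_cycle(sync, build, release, notify):
    """Run one hands-off cycle: full sync, full build, then release or report.

    All four arguments are callables supplied by the caller; `build` returns
    (ok, log). Nothing here is tied to a real source control or mail system.
    """
    sync()
    ok, log = build()
    if ok:
        release()
        notify("Build released")
    else:
        notify("Build FAILED:\n" + log)
    return ok


# Demonstration with stand-in steps that only record what happened.
events = []
ok = run_build_cycle(
    sync=lambda: events.append("sync"),
    build=lambda: (False, "compile error (example)"),
    release=lambda: events.append("release"),
    notify=events.append,
)
print(ok, events)
```

Passing the steps in as callables keeps the control flow testable without a real build machine. In this demonstration the build fails, so the release step never runs and the only notification is the failure message.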