How Facebook Ships Code « FrameThink – Frameworks for Thinking People
I’m fascinated by the way Facebook operates. It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried). These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.
- by default all code commits get packaged into weekly releases (tuesdays)
- with extra effort, changes can go out same day
- tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
- engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
- ops team runs code releases by gradually rolling code out
- facebook has around 60,000 servers
- there are 9 levels for rolling out new code
- [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
- the smallest level is only 6 servers
- e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and make sure that they are behaving correctly before rolling forward to the next level.
- if a release is causing any issues (e.g., throwing errors, etc.) then push is halted. the engineer who committed the offending changeset is paged to fix the problem. and then the release starts over again at level 1.
- so a release may go thru levels repeatedly: 1-2-3-fix. back to 1. 1-2-3-4-5-fix. back to 1. 1-2-3-4-5-6-7-8-9.
- ops team is really well-trained, well-respected, and very business-aware. their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior. E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
- during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention. not responding to ops team results in public shaming.
- once code has rolled out to level 9 and is stable, then done with weekly push.
- if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
- getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired. ”it’s a very high performance culture”. people that aren’t productive or aren’t super talented really stick out. Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”. this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
It’ll be super interesting to see how Facebook’s development culture evolves over time — and especially to see if the culture can continue scaling as the company grows into the thousands-of-employees.
What do you think? Would “developer-driven culture” work at your company?