Talk Notes: John Allspaw and Paul Hammond: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr": Velocity 2009
2009 Velocity Conference: 6/22-24, 2009, Santa Clara, CA
I'm re-watching John Allspaw's (@allspaw) seminal 2009 presentation, "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr." This talk is widely credited with showing the world what #devops could achieve: Flickr was routinely deploying features into production at a rate scarcely imaginable for typical IT organizations, which were doing quarterly or annual updates.
This SlideShare presentation and the Blip.tv video can be found on John's page here.
Presented with Paul Hammond (@ph), who was his VP Engineering counterpart at Flickr/Yahoo.
This is an awesome talk -- it was even better than I remembered it being. John and Paul discuss the incredible Dev and Ops challenges running one of the largest Internet sites, and how they created the breakthroughs. Nice job, guys!
Talk notes:
- Funny how the Ops stereotype is oft the same as Infosec: "no, no, no"; "who wants to work w/that person?"
- "Ops job is not to keep the site up; it's to enable the business; requires ability to enable business change"
- "Problem: change is the root cause of most outages; Ops paranoia is warranted"
- "Options: discourage change for stability (crotchety) OR allow change to happen as often as needed (smart)"
"You need Ops people who think like Dev; and Dev people who think like Ops"
Automated infrastructure
- "manually managing more than a dozen servers makes Dev job almost impossible"
- "Enablers: OS imaging and role & config management; all this enables cloud (e.g., EC2)"
- Version control:
- "Flickr source code was in CVS, but Ops stuff was in Perforce; 1 repository critical"
- "One repository where all dev/ops changes reside, you can quickly see what change to mitigate issue"
- One-Step Build:
- showing screenshot of build/stage button: click, SVN checkout, compiles all templates...
- "...copies everything to staging server for testing, automatically. No manual running commands, undocumented steps"
- "Obviates issue of Dev/QA/Production config drift, undocumented steps, etc."
- "After that comes One-Step Deploy; showing Flickr deployment screen; deploy log is poor man's change control"
- "Viewing deploy log may show other deploy in progress, so deployment can be aborted/delayed"
- "Pressing 'I'm feeling lucky' will deploy code; no manual steps that can go wrong; continuous deployment/integration"
- "Deploy log: we know who, when and what; deploy timestamp goes on top of all monitoring tools"
- "You can't deploy 10 times/day if you're crashing 10 times/day. That's not agile, that's retarded" (haha)
- "You can use capistrano, makefiles, RPM; we use Hudson to generate packages for ops"
- "We can now make each deploy smaller and less risky, with more frequent changes; aids in faster recovery"
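To make the deploy log idea concrete for myself, here is a minimal sketch of a log that records who, when and what, refuses to start a deploy while another is in progress, and exposes the last deploy time so it can be overlaid on monitoring graphs. This is not Flickr's tool: `DeployLog`, `DeployEntry` and all method names are hypothetical, and a real log would be persisted somewhere both Dev and Ops can see it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class DeployEntry:
    who: str                              # person who pressed the deploy button
    what: str                             # e.g. a revision identifier
    started_at: datetime
    finished_at: Optional[datetime] = None


class DeployLog:
    """Hypothetical in-memory deploy log (assumption: one shared instance)."""

    def __init__(self) -> None:
        self.entries: List[DeployEntry] = []

    def in_progress(self) -> Optional[DeployEntry]:
        # A deploy with no finish time is still running.
        for entry in self.entries:
            if entry.finished_at is None:
                return entry
        return None

    def start(self, who: str, what: str) -> DeployEntry:
        # Seeing another deploy in the log is the cue to abort or delay.
        running = self.in_progress()
        if running is not None:
            raise RuntimeError(f"deploy by {running.who} is still in progress")
        entry = DeployEntry(who, what, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def finish(self, entry: DeployEntry) -> None:
        entry.finished_at = datetime.now(timezone.utc)

    def last_deploy_time(self) -> Optional[datetime]:
        # The timestamp a monitoring tool could draw on every graph.
        finished = [e.finished_at for e in self.entries if e.finished_at is not None]
        return max(finished) if finished else None
```

Usage would be `entry = log.start("alice", "r100")`, run the deploy, then `log.finish(entry)`; a second `start` in between raises instead of letting two deploys collide.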
- Feature flags
- "aka branching; lots of branching comes from 'desktop software' lifecycle artifacts"
- "For online services, there's only one version that matters: Production; we always ship trunk"
- "We don't do all dev work in trunk; but by always shipping trunk, you always know which code/env is running"
- "Instead of branching code, we enable all new features in code with configurable settings; enables private betas"
- "Allows private betas on production servers with production traffic; we have great staging environments, but..."
- "...you may not notice new diffs betw QA & Production; allows bucket testing (eg, enable for 5% of users/traffic)"
- "Obviates need for taking servers in/out of rotation, different code bases in production, etc. Do it in code"
- "Allows dark launches, silently turning on new features, but not making it visible: gives ops experience w/o risk"
- "For Ops, it takes away all the fear and suspense, because Ops gains experience before it goes live"
- "Eg, new Flickr homepage had new features that created massive new db load; for weeks, db was being queried, but"
- "...data thrown away. Dark launch period gave Dev/Ops time to prepare, improve, so launch was flawless"
- (Brilliant dark launch techniques being discussed here -- I can think of so many times I would have used this!)
- "We currently have several hundred feature enable/disable flags; we can always turn things off; if db cluster..."
- "...starts having problem, we can disable features to lessen database loads; we don't rollback, we fix forward"
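The three flag uses above (on/off switch, bucket testing on a percentage of users, and dark launch) can be sketched in a few lines. This is only my illustration, not Flickr's code: the flag names, the `FLAGS` layout, and the helpers `bucket`, `is_enabled`, `is_dark` and `render_homepage` are all hypothetical, and hashing the user id into 100 buckets is just one common way to pick a stable 5%.

```python
import hashlib

# Hypothetical flag configuration; in practice this would live in config, not code.
FLAGS = {
    "new_homepage": {"enabled": True, "percent": 5, "dark": False},
    "new_activity_query": {"enabled": True, "percent": 100, "dark": True},
}


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, flags=FLAGS) -> bool:
    """True if the flag is on and the user falls inside the rollout percentage."""
    cfg = flags.get(flag)
    if cfg is None or not cfg["enabled"]:
        return False
    return bucket(flag, user_id) < cfg["percent"]


def is_dark(flag: str, flags=FLAGS) -> bool:
    """True if the feature should run but stay invisible to users."""
    cfg = flags.get(flag)
    return bool(cfg and cfg["enabled"] and cfg["dark"])


def render_homepage(user_id: str, load_activity, flags=FLAGS) -> list:
    """Build the page; a dark-launched query runs but its result is discarded."""
    page = ["classic homepage"]
    if is_enabled("new_activity_query", user_id, flags):
        activity = load_activity(user_id)   # real production load on the backend
        if not is_dark("new_activity_query", flags):
            page.append(f"activity: {activity}")
    return page
```

Setting `"enabled": False` is the "turn it off" switch for when a db cluster starts struggling; flipping `"dark"` to `False` is the actual launch.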
- Shared Metrics:
- "We gather tons of operational metrics: Dev watch these metrics as obsessively as Ops"
- "Each Dev person will have some tab open to Ops metrics (e.g., monitoring for 37 cluster ganglia install)"
- "We show application level metrics, combined with CPU load, network stats; app metrics give context to it all"
- Showing graph of, for previous minute, how long each image operation took (after you uploaded kitten pic)
- Showing graph of queue size for some image processing step
- "John's team makes it easy for us to create graphs: just create file w/{key,val} pair & it shows up in ganglia"
- "We create adaptive feedback loops: if database is overloaded or queue size too lg, app will throttle back"
- Describing multi-month process of Yahoo! shutting down photos site and migrating to Flickr; enormous async queues
- "It takes a lot of time to take years of all your photos into Flickr; petabytes of image data, tons of metadata"
- "We knew how much storage was coming online; the unknown: how many people would click 'Migrate to Flickr'"
- "Predicting when we'd run out of storage space was a huge challenge."
- "We put last deploy time on every monitoring tool" (showing example of impact of 'small image optimization')
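Two of the ideas in this section are easy to sketch: dropping a key/value pair into a file so it can be graphed, and an adaptive feedback loop that throttles the app as a queue grows. The talk did not show code for either, so everything here is an assumption: `write_metric`, `throttle_factor`, the one-file-per-metric layout, and the soft/hard limits are hypothetical, and I am not claiming this is how their Ganglia setup ingested the files.

```python
import os
import tempfile


def write_metric(directory: str, key: str, value: float) -> str:
    """Write 'key value' to <directory>/<key> for a (hypothetical) collector to poll.

    The file is written under a temporary name and renamed into place so a
    reader never sees a half-written metric.
    """
    path = os.path.join(directory, key)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{key} {value}\n")
    os.replace(tmp_path, path)
    return path


def throttle_factor(queue_size: int, soft_limit: int, hard_limit: int) -> float:
    """Fraction of normal work the app should accept given current queue size.

    1.0 at or below soft_limit, 0.0 at or above hard_limit, linear in between.
    """
    if hard_limit <= soft_limit:
        raise ValueError("hard_limit must be greater than soft_limit")
    if queue_size <= soft_limit:
        return 1.0
    if queue_size >= hard_limit:
        return 0.0
    return (hard_limit - queue_size) / (hard_limit - soft_limit)
```

The app would call `throttle_factor` with the same queue size it publishes via `write_metric`, so the number driving the throttle is the number both Dev and Ops see on the graph.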
- IRC and IM robots:
- "We use IRC everywhere, lots of balls in the air, Dev & Ops on it; we squirt events into IRC"
- "We put build & deploy logs, critical alerts into IRC; and then shove it into search engine"
- "Now we can ask 'has this happened before?' and 'what did we do about it?'"
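The "shove it into a search engine" step is what turns a chat channel into institutional memory. As a rough sketch of the idea (not whatever search engine they actually used), here is a tiny inverted index over event lines; `EventIndex`, `tokenize` and the event tuple layout are hypothetical.

```python
import re
from collections import defaultdict
from typing import List, Tuple

Event = Tuple[str, str, str]  # (timestamp, source, message)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class EventIndex:
    """Hypothetical in-memory index of build, deploy and alert events."""

    def __init__(self) -> None:
        self.events: List[Event] = []
        self.index = defaultdict(set)   # token -> set of event ids

    def add(self, timestamp: str, source: str, message: str) -> int:
        event_id = len(self.events)
        self.events.append((timestamp, source, message))
        for token in tokenize(message):
            self.index[token].add(event_id)
        return event_id

    def search(self, query: str) -> List[Event]:
        """Return events containing every query token, oldest first."""
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.index.get(t, set()) for t in tokens))
        return [self.events[i] for i in sorted(ids)]
```

Because deploys and alerts land in the same index, a search for an alert's text also surfaces the deploy messages logged around it, which is most of the answer to "what did we do about it?"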
- Respect.
- "Most important culture element at Flickr is respect, avoiding Dev/Ops stereotypes."
- "Respect different people's responsibilities: John will get hauled in front of mgmt when site goes down"
- "I'm going to get hauled in front of mgmt when we don't ship features on time or enough of them"
- "Saying 'no' is another way of saying 'I don't care about your problems'"
- "Memcache is a marvelous example of what can be created when Dev/Ops work together"
- "Dev hiding things from Ops is a bad idea: there's prob a good reason why Ops is afraid"
- "Dev: ask Ops abt: what metrics will change & how? what are the risks? what are signs that something went wrong?"
- "Dev: ask what are the contingencies? how can Ops recover and help site keep running?"
- "Dev should come up with answers to all of these before going to Ops"
- Trust:
- "Imagine Dev person who says deploy this & if something goes wrong, set this to zero and blame me."
- "That's obviously a Dev guy who cares about the site, and doesn't want to wake up my team unnecessarily"
- "Dev needs to bring in Ops when it comes to features; Dev needs to bring in Ops when it comes to upgrading tools"
- "It sounds obvious, but all too often, I've seen where this working relationship doesn't exist: those are cowboys"
- "To encourage this, we create shared runbooks & escalation plans: how will new features be supported?"
- "Provide knobs/levers: provide monitoring for features, enable Ops to change things (eg, # of threads)"
- "Controversial: give Dev access to production systems: playing phone tag over shell commands is dumb"
- "Dev should have shell into production systems that are read-only: let them see the system, logs, etc."
- "Non-root accounts are low risk. Solving problems without it is too difficult"
- Healthy attitudes around failures:
- "Airline pilots spend days each month in simulators, training for emergencies; they develop procedures"
- "If you have heart attack, do you want treatment from EMT who deals with it once/year, or once/week?" Practice.
- "Fire drills: During Flickr outages, junior engrs observe & practice diagnoses. after site up, check answers"
- Showing fingerpointyness slide: showing "mean time to innocence" principles
- "Flickr culture: we figure out stuff, fix it; and often have multiple people blaming themselves!"
- Avoiding blame
- "Developers: remember that other people will be woken up when your code breaks"
- "Saying sorry next day helps"
- "Saying sorry makes people feel better about it and shows respect for Ops"
- "Ask what you'd do if someone weren't there in middle of the night picking up your slack? what would you chg?"