Talk Notes: Roundtable On Post-Mortems (Mandi Walls, John Allspaw, Mark Imbriaco, Matt Hackett, Teresa Dietrich): 2011 Velocity Conference
2011 Velocity Conference: 6/14-16, Santa Clara, CA
I'm reviewing some of the awesome talks I missed while I was at the conference. Videos are available at the O'Reilly website.
And I'm using this opportunity to use a kickass iPad app written by @_flynn to do simultaneous notetaking and tweeting. Cool!
This is the Post-Mortem Roundtable, chaired by Mandi Walls, Admeld (@lnxchk). The other panel members are:
- Mark Imbriaco, Director Prod Ops, Heroku (@markimbriaco: ex-37signals, gave amazing talk on Heroku architecture and ops, and also the attempts to run Heroku infrastructure on top of Heroku (!!))
- Matt Hackett, VP Engr, Tumblr (@mhkt: we serve 21MM blogs, growing 20% per month, 7.5B page views/month, high growth)
- Teresa Dietrich, Director Tech Ops, WebMD (@teresadg: ex-AOL, "lots of time spent on outage calls")
- John Allspaw, VP Ops, Etsy (@allspaw: "I'm here for balance". Both @lnxchk and @allspaw on prog committee. Kickass panel!)
@lnxchk: "Q: what's required for crises?"
- @markimbriaco: "Our target is 20m max betw updates. If we miss, we'll say 'more in 4h'"
- @markimbriaco: "We assign someone to update, usually in support. Also we have role of incident commander."
- @allspaw: "We have status blog and Twitter feed as well. We track blip -> degradation -> outage, which escalates"
- @allspaw: "...as outage grows, can trigger Dev or Ops; for severe issues, community team, who are good at words."
- @teresadg: "at WebMD, less structured policies, but we do notify upon service down and publish restoration times."
- @teresadg: "Important is internal notification: 2K employees, countless doctors affected. Notifying/transparency critical"
- @teresadg: "Getting Ops to be transparent was challenging; but CTO demanded visibility, best info on restoration time, etc"
- @mhkt: "We're small, but Ops is always 1st stop. When any risk of large impact, we page 24x7 community support who's on call"
- @mhkt: "During our 24h Tumblr outage, I wish we had Twitter updates. Our lack of transparency was criticized widely"
- @mhkt: "We don't believe our outage desc should be technical: 'MySQL failed' not 'incorrect setting, cluster failed'"
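To make the update cadence and the blip -> degradation -> outage ladder above concrete, here's a minimal Python sketch. Big caveats: the panel only described the ideas. The names (`Severity`, `NOTIFY`, `escalate`, `next_update_due`) and the who-gets-paged-at-which-level mapping are my own assumptions, and it mixes Heroku's 20-minute target with Etsy's escalation levels purely for illustration.

```python
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Escalation ladder as described on the panel: blip -> degradation -> outage."""
    BLIP = 1
    DEGRADATION = 2
    OUTAGE = 3


# Assumed mapping (not from the panel): which teams get pulled in per level.
NOTIFY = {
    Severity.BLIP: ["ops"],
    Severity.DEGRADATION: ["ops", "dev"],
    Severity.OUTAGE: ["ops", "dev", "community"],
}

# Target gap between public status updates (the "20m max" mentioned above).
UPDATE_INTERVAL = timedelta(minutes=20)


def escalate(current):
    """Move one step up the ladder, staying at OUTAGE once there."""
    return Severity(min(current + 1, Severity.OUTAGE))


def next_update_due(last_update, promised_gap=None):
    """Return when the next status update is owed.

    If the last update explicitly promised a longer gap (e.g. 'more in 4h'),
    pass it as promised_gap; otherwise the default interval applies.
    """
    gap = UPDATE_INTERVAL if promised_gap is None else promised_gap
    return last_update + gap
```

Usage would be something like `next_update_due(datetime.now())` after each post to the status blog, with `escalate(level)` called whenever the incident gets worse.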
@lnxchk: "Q: @allspaw showed IRC log used during outages: instant documentation, free timestamps: do y'all use IRC?"
- @markimbriaco: "We use Campfire, but new prob: we use Skype, we lose our instant record; maybe need to echo notes into Campfire"
- @teresadg: "We use Microsoft Lync. Don't laugh. It works. Auto-populates, phone/video chat, messaging window, draw whiteboard."
- @teresadg: "It really works. If you have licenses try it. Goes way beyond Communicator."
- @mhkt: "We use Hipchat. It's like Campfire, lots of clients; records chats, all company Notices email: Ops/Dev/CEO/Community"
- @mhkt: "This is the highest level record of outage that I refer back to all the time"
@lnxchk: "Q: How do you put knowledge into institutional knowledge to prevent future screwups?"
- @markimbriaco and @mhkt both say "We use Wiki, and we suck at it" (haha)
- @allspaw: "Yahoo! did this very well, which I miss. We use Wiki, but start/end/detect times go into Google spreadsheet"
- @allspaw: "But all associated media (Skitch screenshots, IRC logs) goes into Pastebin or Gist, and goes into Wiki"
- @allspaw: "Though I hate Wikis, everyone knows how to use it, it's available."
- @teresadg: "We formed SRE team: jumps in during stuck releases, outages; SRE assigned to outage; will do data gathering"
- @teresadg: "Post-outage, they'll dive deep, pull all logs, analyze; what were builds on servers, what changed, time to.."
- @teresadg: "..detect, fix; then asks 'what do we need to change'; yields request to dev for more monitoring, config chg"
- @teresadg: "Or procedure change or documentation change: all those recs driven by SRE, instead of ignored after crisis"
@lnxchk: "Q: how do you keep post-mortems from becoming too emotionally charged, people screaming on desks, etc.?"
- @mhkt: "Since Dev chg often causes issue, they'll drive it, pulling in Ops when necc. Will tell story, no blame, study..."
- @teresadg: "Allow time for sleep before data gathering/discussion to prev sysadmin from throttling dev who caused failure"
- @markimbriaco: "Campfire/IRC enables easy data gathering; run post-mortem as chrono review, passed out ahead of time"
- @markimbriaco: "..this 'world according to mark' helps start conversation; real world example during Amazon EBS failure:"
- @markimbriaco: "...after 67h outage, no point in further discussion; would have just caused PTSD. sometimes not necc"
- @allspaw: "+1 for 'no rigid process'; I know I won't get true story and details necc to improve until people feel safe"
- @allspaw: "State it: I'm not going to fire you, dock your pay, or bench you; my boss is CTO and supports my policy"
- @allspaw: "If you have standing room only for your post mortems, you know you're doing something right; self-improvement"
- @allspaw: "When Dev pushes their own code, chgs happening all time; we want this to reflect Etsy, we value authorship"
- @allspaw: "Dev have pride of authorship and confidence and name attached to commits; fingerpointing"
- @allspaw: "Ideal: here's what I did, here's what I thought would happen, here's what went wrong; then people offer it up" (Nice!)
- @teresadg: "I've been doing this for 15 yrs; too many people fear getting fired; I've seen really stupid stuff..."
- @teresadg: "...like tools that if you type in wrong window, sends to all routers; firing only happens when there's malice"
- @teresadg: "How did we know malice? We could see in logs him testing it, knowing full well the effects of script"
- @teresadg: "Fear of making mistake, coming into work and thinking 'it'll be anti-Joe day' is real. Safety is needed"
- @markimbriaco: "I'm constantly worried about issues between Dev and Ops; want Dev to be able to say here's what happened"
- @mhkt: "Desire to create institutional knowledge and learning; but also has catharsis needs: lv room feeling better"
- @mhkt: "How to decide in-person or email? Know I'm doing it right when ppl say 'can we do in-person post-mortem?'"
@lnxchk: "Frameworks from nuke, chem industry, like Five Whys: Which methodologies do you use? Too cold? Useful?"
- @allspaw: (after no one else says "yes"): "After using various methods, some from high risk industries, like for nuc power..."
- @allspaw: "..during early days: structured, mathematical. eg. Fault/event tree analysis vs. risk mgmt; 5 Whys came about.."
- @allspaw: "...because of (Taiichi Ohno I think, John) method of asking why on plant floor. Opp of rigorous fishbone diagram"
- @allspaw: "For web, growth has been so fast: we choose efficiency vs. rigor; not worth 40h mtgs for couple slow web pages"
- @allspaw: "I think this decision makes sense; not like 'oh, we amputated left leg instead of right'" (or reactor meltdown)
- (FWIW, @kevinbehr's fave root cause analysis is Apollo Method)
@lnxchk: "What's worst thing you've ever seen happen in root cause meeting?"
- @allspaw: "I've seen some RCA that's extremely finger-pointy; previous company: defense mechanism up before meeting!"
- @allspaw: "'...I don't know why I should be there, but I'll go, b/c educational'" <-- shields/defense up upon invite!
- @markimbriaco: "As young sysadmin at bank, tons of VPs: no one asked me question, even though it was a tech issue!"
- @teresadg: "During malicious event, lots of staff got computers/disks confiscated due to data hiding; mgmt didn't say why"
- @teresadg: "Maybe sysadmin didn't intend to cause as much damage, but he hid tracks; caused cascading problems"
- @mhkt: "Bad post-mortems focus on how we fixed instead of process: eg: 'saw this fault, and we fixed it'"
- @teresadg: "we like time to detect, notification, respond, troubleshoot, repair. Run those #s every time; Find outliers"
- @teresadg: "Things are always going to go wrong; that's why Ops people will always have jobs" (Nice!)
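@teresadg's "run those #s every time; find outliers" advice is easy to sketch in code. This is a minimal Python sketch, not anything WebMD actually runs: the record fields (`started`, `detected`, `notified`, `responded`, `diagnosed`, `repaired`), the helper names, and the "more than 3x the median" outlier rule are all my own assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical phase definitions: (phase name, start field, end field).
PHASES = [
    ("detect", "started", "detected"),
    ("notify", "detected", "notified"),
    ("respond", "notified", "responded"),
    ("troubleshoot", "responded", "diagnosed"),
    ("repair", "diagnosed", "repaired"),
]


def make_incident(inc_id, start, minutes):
    """Build an incident record from cumulative minute offsets after start."""
    keys = ["detected", "notified", "responded", "diagnosed", "repaired"]
    record = {"id": inc_id, "started": start}
    for key, offset in zip(keys, minutes):
        record[key] = start + timedelta(minutes=offset)
    return record


def phase_minutes(incident):
    """Return the minutes spent in each phase of one incident."""
    result = {}
    for name, start_key, end_key in PHASES:
        delta = incident[end_key] - incident[start_key]
        result[name] = delta.total_seconds() / 60
    return result


def find_outliers(incidents, factor=3.0):
    """Flag (incident id, phase) pairs where a phase took more than
    factor times the median duration of that phase across all incidents."""
    durations = {inc["id"]: phase_minutes(inc) for inc in incidents}
    flagged = []
    for name, _, _ in PHASES:
        med = median(d[name] for d in durations.values())
        for inc_id, d in durations.items():
            if med > 0 and d[name] > factor * med:
                flagged.append((inc_id, name))
    return flagged
```

Feed it one record per outage and `find_outliers(incidents)` gives you the phases worth bringing up in the post-mortem. The median is used rather than the mean so that one horrible outage doesn't hide itself by dragging the baseline up.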
End of talk! Great job @lnxchk, @teresadg, @mhkt, @markimbriaco, @allspaw! Will publish link when I find it tomorrow!
BTW, I love that O'Reilly makes videos avail to everyone. Awesome conference. Will go again next yr!