Talk Notes: Roundtable On Post-Mortems (Mandi Walls, John Allspaw, Mark Imbriaco, Matt Hackett, Teresa Dietrich): 2011 Velocity Conference

Friday, December 30, 2011 at 12:29PM

2011 Velocity Conference: 6/14-16, Santa Clara, CA

I'm reviewing some of the awesome talks I missed while I was at the conference. Videos are available at the O'Reilly website.

And I'm using this opportunity to use a kickass iPad app written by @_flynn to do simultaneous notetaking and tweeting. Cool!

This is the Post-Mortem Roundtable, chaired by Mandi Walls, Admeld (@lnxchk). The other panel members are:

Mark Imbriaco, Director Prod Ops, Heroku (@markimbriaco: ex-37signals, gave amazing talk on Heroku architecture and ops, and also the attempts to run Heroku infrastructure on top of Heroku (!!))
Matt Hackett, VP Engr, Tumblr (@mhkt: we serve 21MM blogs, growing 20% per month, 7.5B page views/month, high growth)
Teresa Dietrich, Director Tech Ops, WebMD (@teresadg: ex-AOL, "lots of time spent on outage calls")
John Allspaw, VP Ops, Etsy (@allspaw: "I'm here for balance". Both @lnxchk and @allspaw on prog committee. Kickass panel!)
@lnxchk: "Q: what's required for crises?" "Mark: our target is 20m max betw updates. If we miss, we'll say 'more in 4h'"
- @markimbriaco: "We assign someone to update, usually in support. Also we have role of incident commander."
- @allspaw: "We have status blog and Twitter feed as well. We track blip -> degradation -> outage, which escalates"
- @allspaw: "...as outage grows, can trigger Dev or Ops; for severe issues, community team, who are good at words."
- @teresadg: "at WebMD, less structured policies, but we do notify upon service down and publish restoration times."
- @teresadg: "Important is internal notification: 2K employees, countless doctors affected. Notifying/transparency critical"
- @teresadg: "Getting Ops to be transparent was challenging; but CTO demanded visibility, best info on restoration time, etc"
- @mhkt: "We're small, but Ops is always 1st stop. When any risk of large impact, we page 24x7 community support who's on call"
- @mhkt: "During our 24h Tumblr outage, I wish we had Twitter updates. Our lack of transparency was criticized widely"
- @mhkt: "We don't believe our outage desc should be technical: 'MySQL failed', not 'incorrect setting, cluster failed'"
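
To make the blip -> degradation -> outage escalation and the 20-minute update target from the answers above a bit more concrete, here is a minimal sketch. The team names, the NOTIFY mapping, and the Incident class are my own illustrative assumptions, not anything the panelists showed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    BLIP = 1
    DEGRADATION = 2
    OUTAGE = 3


# Hypothetical mapping of who gets pulled in at each level (assumption).
NOTIFY = {
    Severity.BLIP: ["ops"],
    Severity.DEGRADATION: ["ops", "dev"],
    Severity.OUTAGE: ["ops", "dev", "community"],
}

# Target from the talk: at most 20 minutes between status updates.
UPDATE_INTERVAL = timedelta(minutes=20)


@dataclass
class Incident:
    severity: Severity
    last_update: datetime

    def escalate(self) -> Severity:
        """Move one step up the ladder; an outage stays an outage."""
        if self.severity < Severity.OUTAGE:
            self.severity = Severity(self.severity + 1)
        return self.severity

    def teams_to_notify(self) -> list:
        return NOTIFY[self.severity]

    def update_overdue(self, now: datetime) -> bool:
        """True once more than 20 minutes have passed since the last update."""
        return now - self.last_update > UPDATE_INTERVAL
```

Whoever holds the "update" role could poll update_overdue() and post either a real update or an explicit "more in 4h" notice, as Mark described.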
@lnxchk: "Q: @allspaw showed IRC log used during outages: instant documentation, free timestamps: do y'all use IRC?"
- @markimbriaco: "We use Campfire, but new prob: when we use Skype, we lose our instant record; maybe need to echo notes into Campfire"
- @teresadg: "We use Microsoft Lync. Don't laugh. It works. Auto-populates, phone/video chat, messaging window, draw whiteboard."
- @teresadg: "It really works. If you have licenses try it. Goes way beyond Communicator."
- @mhkt: "We use HipChat. It's like Campfire, lots of clients; records chats, all company Notices email: Ops/Dev/CEO/Community"
- @mhkt: "This is the highest level record of outage that I refer back to all the time"
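
The "instant documentation, free timestamps" point above is easy to picture in code. Below is a rough sketch that merges timestamped chat transcripts into a single chronological record. The log line format ("[YYYY-MM-DD HH:MM:SS] nick: message") and the function names are assumptions for illustration; real IRC, Campfire, or HipChat exports will look different.

```python
import re
from datetime import datetime

# Assumed line format: "[2011-06-14 12:03:10] nick: message"
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.*)$")


def parse_log(text):
    """Return (timestamp, nick, message) tuples for lines that match the format."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2), match.group(3)))
    return events


def merge_timelines(*logs):
    """Combine several transcripts into one list sorted by timestamp."""
    merged = [event for log in logs for event in parse_log(log)]
    return sorted(merged, key=lambda event: event[0])
```

Notes taken during a voice call could be typed in the same format and merged in afterwards, which is roughly the "echo notes into Campfire" idea.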
@lnxchk: "Q: How do you put knowledge into institutional knowledge to prevent future screwups?"
- @markimbriaco and @mhkt both say "We use Wiki, and we suck at it" (haha)
- @allspaw: "Yahoo! did this very well, which I miss. We use Wiki, but start/end/detect times go into Google spreadsheet"
- @allspaw: "But all associated media (Skitch screenshots, IRC logs) goes into Pastebin or Gist, and goes into Wiki"
- @allspaw: "Though I hate Wikis, everyone knows how to use it, it's available."
- @teresadg: "We formed SRE team: jumps in during stuck releases, outages; SRE assigned to outage; will do data gathering"
- @teresadg: "Post-outage, they'll dive deep, pull all logs, analyze; what were builds on servers, what changed, time to.."
- @teresadg: "..detect, fix; then asks 'what do we need to change'; yields request to dev for more monitoring, config chg"
- @teresadg: "Or procedure change or documentation change: all those reccs driven by SRE, instead of ignored after crisis"
@lnxchk: "Q: how do you keep post-mortems from becoming too emotionally charged, people screaming on desks, etc.?"
- @mhkt: "Since Dev chg often causes issue, they'll drive it, pulling in Ops when necc. Will tell story, no blame, study..."
- @teresadg: "Allow time for sleep before data gathering/discussion to prev sysadmin from throttling dev who caused failure"
- @markimbriaco: "Campfire/IRC enables easy data gathering; run post-mortem as chrono review, passed out ahead of time"
- @markimbriaco: "..this 'world according to mark' helps start conversation; real world example during Amazon EBS failure:"
- @markimbriaco: "...after 67h outage, no point in further discussion; would have just caused PTSD. sometimes not necc"
- @allspaw: "+1 for 'no rigid process'; I know I won't get true story and details necc to improve until people feel safe"
- @allspaw: "State it: I'm not going to fire you, dock pay, or get benched; My boss is CTO and supports my policy"
- @allspaw: "If you have standing room only for your post mortems, you know you're doing something right; self-improvement"
- @allspaw: "When Dev pushes their own code, chgs happening all time; we want this to reflect Etsy, we value authorship"
- @allspaw: "Dev have pride of authorship and confidence and name attached to commits; fingerpointing"
- @allspaw: "Ideal: here's what I did, here's what I thought would happen, here's what went wrong; then people offer it up" (Nice!)
- @teresadg: "I've been doing this for 15 yrs; too many people fear getting fired; I've seen really stupid stuff..."
- @teresadg: "...like tools that if you type in wrong window, sends to all routers; firing only happens when there's malice"
- @teresadg: "How did we know malice? We could see in logs him testing it, knowing full well the effects of script"
- @teresadg: "Fear of making mistake, coming into work and thinking 'it'll be anti-Joe day' is real. Safety is needed"
- @markimbriaco: "I'm constantly worried about issues between Dev and Ops; want Dev to be able to say here's what happened"
- @mhkt: "Desire to create institutional knowledge and learning; but also has catharsis needs: lv room feeling better"
- @mhkt: "How to decide in-person or email? Know I'm doing it right when ppl say 'can we do in-person post-mortem?'"
@lnxchk: "Frameworks from nuke, chem industry, like Five Whys: Which methodologies do you use? Too cold? Useful?"
- @allspaw: (after no one else says "yes"): "After using various methods, some from high risk industries, like for nuc power..."
- @allspaw: "..during early days: structured, mathematical. eg. Fault/event tree analysis vs. risk mgmt; 5 Whys came about.."
- @allspaw: "...because of (Taiichi Ohno I think, John) method of asking why on plant floor. Opp of rigorous fishbone diagram"
- @allspaw: "For web, growth has been so fast: we choose efficiency vs. rigor; not worth 40h mtgs for couple slow web pages"
- @allspaw: "I think this decision makes sense; not like 'oh, we amputated left leg instead of right'" (or reactor meltdown)
- (FWIW, @kevinbehr's fave root cause analysis is Apollo Method)
@lnxchk: "What's worst thing you've ever seen happen in root cause meeting?"
- @allspaw: "I've seen some RCA that's extremely finger-pointy; previous company: defense mechanism up before meeting!"
- @allspaw: "'...I don't know why I should be there, but I'll go, b/c educational'" <-- shields/defense up upon invite!
- @markimbriaco: "As young sysadmin at bank, tons of VPs: no one asked me question, even though it was a tech issue!"
- @teresadg: "During malicious event, lots of staff got computers/disks confiscated due to data hiding; mgmt didn't say why"
- @teresadg: "Maybe sysadmin didn't intend to cause as much damage, but he hid tracks; caused cascading problems"
- @mhkt: "Bad post-mortems focus on how we fixed instead of process: 'eg: saw this fault, and we fixed it'"
- @teresadg: "we like time to detect, notification, respond, troubleshoot, repair. Run those #s every time; Find outliers"
- @teresadg: "Things are always going to go wrong; that's why Ops people will always have jobs" (Nice!)
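
The "run those #s every time; find outliers" idea above can be sketched in a few lines. This assumes each incident is recorded as a dict of timestamps, one for the start and one for the end of each phase; the field names and the "more than twice the median" outlier rule are my own assumptions, not WebMD's actual method.

```python
from datetime import datetime
from statistics import median

# Phases as listed in the talk, in order.
PHASES = ["detect", "notify", "respond", "troubleshoot", "repair"]


def phase_minutes(stamps):
    """Minutes spent in each phase.

    stamps: dict with a 'start' timestamp plus one timestamp per phase,
    marking the moment that phase was completed.
    """
    durations = {}
    previous = stamps["start"]
    for phase in PHASES:
        durations[phase] = (stamps[phase] - previous).total_seconds() / 60
        previous = stamps[phase]
    return durations


def outliers(incidents, phase, factor=2.0):
    """Names of incidents whose phase took more than factor x the median."""
    minutes = {name: phase_minutes(s)[phase] for name, s in incidents.items()}
    typical = median(minutes.values())
    return [name for name, value in minutes.items() if value > factor * typical]
```

Calling outliers(incidents, "detect") would then point at the outages where detection was unusually slow, which is where a request for more monitoring would likely come from.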

End of talk! Great job @lnxchk, @teresadg, @mhkt, @markimbriaco, @allspaw! Will publish link when I find it tomorrow!

BTW, I love that O'Reilly makes videos avail to everyone. Awesome conference. Will go again next yr!

Gene Kim
