About Gene Kim

I've been researching high-performing technology organizations since 1999. I'm a multiple award-winning CTO, founder of Tripwire, and co-author of The DevOps Handbook, The Phoenix Project, and Visible Ops. I'm a DevOps researcher, Theory of Constraints Jonah, a certified IS auditor, and a rabid UX fan.

I am passionate about IT operations, security and compliance, and how IT organizations successfully transform from "good to great."

Friday, January 20, 2012

My First Three Weeks With The BSides Board

I’m writing this blog post to explain briefly why I chose to accept the BSides board position, what my goals are, and provide a brief status report.

Why I joined the BSides board

Over the years, I’ve come to respect the work of everyone in the BSides community.  I’m amazed and continually reminded of how many people BSides has positively influenced and the vibrant community they’ve created.  I’ve been to three events, and during my tenure at Tripwire, we became one of the first global sponsors.  I’ve always loved the people who congregate there, and I’m grateful for how it's reconnected me with old colleagues and friends.  I proudly consider myself a part of the BSides community.

In mid-December, I was asked by Mike Dahn and Jack Daniel to join them on the BSides board.  I first admitted to them that I’ve primarily been a beneficiary of everyone’s hard work, and that there are countless people who have contributed far more than me.  But after talking with them, I told them that it was a privilege to be asked and that I would be happy to serve for a one-year term and help in any way I could.  

My goals

My goal is to help ensure that BSides succeeds in its mission: to continue to help more information security practitioners achieve their fullest potential, both now and in the future.

Clearly there have been some growing pains. To paraphrase Bill Brenner, this is really an opportunity to "make a better BSides." Our goal as a board is to help BSides grow and become more effective, accountable, and transparent.

A brief status report

Mike, Jack and I started out with nearly daily phone calls, which are now weekly. The top issues we're working on are the following:

  • Create a timeline to complete all the filings necessary for BSides to officially become a 501(c)(3) nonprofit corporation
  • Create a timeline to retain an outside bookkeeper and release audited financials, going all the way back to the first events, to show that all account balances and values are exactly as they should be, and that all the money went where it was supposed to
  • Create a communication calendar so that we regularly release information on what we’ve promised and how we’re doing on those promises, in order to earn back any lost trust with the community

On the 501(c)(3) front, the team continues to move towards the official filing. BSides remains a California public benefit S corporation. As such, there is no official board, but we've started to organize and adopt all the structures and processes required for when we have official 501(c)(3) status.

Part of this is getting regular financial reports released that are audited by an independent third party. The team has spent two weeks interviewing firms to take over the daily bookkeeping operations, as well as a CPA firm that can attest to the accuracy of the financials. I'm particularly pleased that as soon as BSides completes the transition to a 501(c)(3), the CPA firm that opines on the financials of the widely revered Electronic Frontier Foundation (EFF) will do the same for BSides.

I've studied the SecurityErrata post, and based on my analysis, I believe that any financial reporting errors found will be small. My biggest concern is that volunteers sometimes paid event suppliers out of their own pockets due to BSides cash flow issues -- these transactions may not have been recorded or repaid properly. Of course, we will fix any issues we find.

And finally, on the communications front, this will be the first of many communications you’ll see from the team to make you aware of what we’re focused on, and how we’re doing on the commitments we’ve made.

Some last thoughts

Mike and Jack have been terrific to work with, and I’m confident that we’ll have more positive information to share throughout January and February.  From there, the focus of the board will be to discuss the structure that will best serve the BSides mission and community.  

I want to thank the many people who took the time to give me advice, provide recommendations on trusted bookkeepers and accountants, and much more. I particularly want to acknowledge Branden Williams, Brian Costello, Matt Hixson, Todd Butson, Bob McCarthy, and countless others for their help, for which I'm very grateful.

Wednesday, January 4, 2012

Talk Notes: John Allspaw and Paul Hammond: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr": Velocity 2009

2009 Velocity Conference: 6/22-24, 2009, Santa Clara, CA

I'm re-watching John Allspaw's (@allspaw) seminal 2009 presentation, "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr." This talk is widely credited with showing the world what #devops could achieve: Flickr was routinely deploying features into production at a rate scarcely imaginable for typical IT organizations doing quarterly or annual updates.

This SlideShare presentation and the Blip.tv video can be found on John's page here.

Presented with Paul Hammond (@ph), who was his VP Engineering counterpart at Flickr/Yahoo.

This is an awesome talk -- it was even better than I remembered it being. John and Paul discuss the incredible Dev and Ops challenges of running one of the largest Internet sites, and how they created the breakthroughs. Nice job, guys!

Talk notes:

  • Funny how the Ops stereotype is oft the same as Infosec's: "no, no, no"; "who wants to work w/that person?"
  • "Ops job is not to keep the site up; it's to enable the business; requires ability to enable business change"
  • "Problem: change is the root cause of most outages; Ops paranoia is warranted"
  • "Options: discourage change for stability (crotchety) OR allow change to happen as often as needed (smart)"
  • "You need Ops people who think like Dev; and Dev people who think like Ops"

  • Automated infrastructure

    • "manually managing more than a dozen servers makes Dev job almost impossible"
    • "Enablers: OS imaging and role & config management; all this enables cloud (e.g., EC2)"
  • Version control:
    • "Flickr source code was in CVS, but Ops stuff was in Perforce; 1 repository critical"
    • "One repository where all dev/ops changes reside, you can quickly see what change to mitigate issue"
  • One-Step Build:
    • showing screenshot of build/stage button: click, SVN checkout, compiles all templates...
    • "...copies everything to staging server for testing, automatically. No manual running commands, undocumented steps"
    • "Obviates issue of Dev/QA/Production config drift, undocumented steps, etc."
    • "After that comes One-Step Deploy; showing Flickr deployment screen; deploy log is poor man change control"
    • "Viewing deploy log may show other deploy in progress, so deployment can be aborted/delayed"
    • "Press 'I'm feeling lucky' will deploy code; no manual steps that can go wrong; continuous deployment/integration"
    • "Deploy log: we know who, when and what; deploy timestamp goes on top of all monitoring tools"
    • "You can't deploy 10 times/day if you're crashing 10 times/day. That's no agile, that's retarded" (haha)
    • "You can use capistrano, makefiles, RPM; we use Hudson to generate packages for ops"
    • "We can now make each deploy smaller, less risky, and more frequent changes; aids in faster recovery"
  • Feature flags
    • "aka branching; lots of branching come from'desktop software' lifecycle artifacts"
    • "For online services, there's only one version that matters: Production; we always ship trunk"
    • "We don't do all dev work in trunk; but by always shipping trunk, you always know which code/env is running"
    • "Instead of branching code, we enable all new features in code with configurable settings; enables private betas"
    • "Allows private betas on production servers with production traffic; we have great staging environments, but..."
    • "...you may not notice new diffs betw QA & Production; allows bucket testing (eg, enable for 5% of users/traffic)
    • "Obviates need for taking servers in/out of rotation, different code bases in production, etc. Do it in code"
    • "Allows dark launches, silently turning on new features, but not making it visible: gives ops experience w/o risk"
    • "For Ops, it takes away all the fear and suspense, because Ops gains experience before it goes live"
    • "Eg, new Flickr homepage had new features that created massive new db load; for weeks, db was being queried, but"
    • "...data thrown away. Dark launch period gave Dev/Ops time to prepare, improve, so launch was flawless"
    • (Brilliant dark launch techniques being discussed here -- I can think of so many times I would have used this! A feature-flag/dark-launch sketch follows below.)
    • "We currently have several hundred of feature enable/disable flags; we can always turn things off; if db cluster
    • "...starts having problem, we can disable features to lessen database loads; we don't rollback, we fix forward"
  • Shared Metrics:
    • "We gather tons of operational metrics: Dev watch these metrics as obsessively as Ops"
    • "Each Dev person will have some tab open to Ops metrics (e.g., monitoring for 37 cluster ganglia install)"
    • "We show application level metrics, combined with CPU load, network stats; app metrics give context to it all"
    • Showing graph of, for previous minute, how long each image operation took (after you uploaded kitten pic)
    • Showing graph of queue size for some image processing step
    • "John's team makes it easy for us to create graphs: just create file w/{key,val} pair & it shows up in ganglia"
    • "We create adaptive feedback loops: if database is overloaded or queue size too lg, app will throttle back"
    • Describing multi-month process of Yahoo! shutting down photos site and migrating to Flickr; enormous async queues
    • "It takes a lot of time to take years of all your photos into Flickr; petabytes of image data, tons of metadata"
    • "We know how much storage was coming online: unknown: how many people who click 'Migrate to Flickr'"
    • "Predicting when we'd run out of storage space was a huge challenge."
    • "We put last deploy time on every monitoring tool" (showing example of impact of 'small image optimization')
  • IRC and IM robots:
    • "We use IRC everywhere, lots of balls in the air, Dev & Ops on it; we squirt events into IRC"
    • "We put build & deploy logs, critical alerts into IRC; and then shove it into search engine"
    • "Now we can ask "has this happened before?'" and "what did we do about it?"
  • Respect.
    • "Most important culture element at Flickr is respect, avoiding Dev/Ops stereotypes"
    • "Respect different people's responsibilities: John will get hauled in front of mgmt when site goes down"
    • "I'm going to get hauled in front of mgmt when we don't ship features on time or enough of them"
    • "Saying 'no' is another way of saying 'I don't care about your problems'"
    • "Memcache is a marvelous example of what can be created when Dev/Ops work together"
    • "Dev hiding things from Ops is a bad idea: there's prob a good reason why Ops is afraid"
    • "Dev: ask Ops abt: what metrics will change & how? what are the risks? what are signs that something went wrong?"
    • "Dev: ask what are the contingencies? how can Ops recover and help site keep running?"
    • "Dev should come up with answers to all of these before going to Ops"
  • Trust:
    • "Imagine Dev person who says deploy this & if something goes wrong, set this to zero and blame me."
    • "That's obviously a Dev guy who cares about the site, and doesn't want to wake up my team unnecessarily"
    • "Dev needs to bring in Ops when it comes to features; Dev needs to bring in Ops when it comes to upgrading tools"
    • "It sounds obvious, but all too often, I've seen where this working relationship doesn't exist: those are cowboys"
    • "To encourage this, we create shared runbooks & escalation plans: how will new features be supported?"
    • "Provide knobs/levers: provide monitoring for features, enable Ops to change things (eg, # of threads)"
  • "Controversial: give Dev access to production systems: playing phone tag over shell commands is dumb"
    • "Dev should have shell into production systems that are read-only: let them see the system, logs, etc."
    • "Non-root accounts are low risk. Solving problems without it is too difficult"
  • Healthy attitudes around failures:
    • "Airline pilots days each month in simulators, training for emergencies; they develop procedures"
    • "If you have heart attack, do you want treatment from EMT who deals with it once/year, or once/weekly?" Practice.
    • "Fire drills: During Flickr outages, junior engrs observe & practice diagnoses. after site up, check answers"
    • Showing fingerpointyness slide: "mean time to innocence" principles
    • "Flickr culture: we figure out stuff, fix it; and often have multiple people blaming themselves!"
  • Avoiding blame:
    • "Developers: remember that other people will be woken up when your code breaks" "Saying sorry next day helps"
    • "Saying sorry makes people feel better about it; not saying it shows a lack of respect for Ops"
    • "Ask what you'd do if someone weren't there in middle of the night picking up your slack? what would you chg?"
Monday, January 2, 2012

Talk Notes: Artur Bergman on SSDs in the Data Center: 2011 Velocity Conference

2011 Velocity Conference: 6/14-16, Santa Clara, CA

Keynote talk given by Artur Bergman, VP Engineering and Operations, Wikia/Fastly, @crucially

Video available on YouTube, courtesy of O'Reilly here

Holy cow. I had heard lots of stuff about Artur's talk, but it was only 4 minutes long. Only 4 min?!? More, @crucially! :)

  • "If you're not using SSDs in your data center, you're wasting your life. Ex: My Mac w/SSD boots in 12sec"
  • "Go buy an SSD for your laptop for $500. You'll thank me. For 2.5 yrs, I've claimed it's cheaper: $/GB/IOPS"
  • "With SSDs, joins work, screw nosql, screw sharding: write code for SSDs & use random IO"
  • "For 8MM files: 9m for fsck, 12m to backup: 4GB/sec random RW latency. Everything becomes easy"
  • "Root cause of everything hard (e.g., slow joins, can't shard, slow disk) == not using SSDs, dummy"
  • "Low power: 1 watt vs. 15 watt; easy, compared to using low power shitty Atom CPUs" (haha)
  • "Ignore PCI Express cards and SLC. Just get $1000 for 600G. It's a Ferrari compared to your bike"
Friday, December 30, 2011

Talk Notes: Roundtable On Post-Mortems (Mandi Walls, John Allspaw, Mark Imbriaco, Matt Hackett, Teresa Dietrich): 2011 Velocity Conference

2011 Velocity Conference: 6/14-16, Santa Clara, CA

I'm reviewing some of the awesome talks I missed while I was at the conference. Videos are available at the O'Reilly website.

And I'm using this opportunity to use a kickass iPad app written by @_flynn to do simultaneous notetaking and tweeting. Cool!

This is the Post-Mortem Roundtable, chaired by Mandi Walls, Admeld (@lnxchk). The other panel members are:

  • Mark Imbriaco, Director Prod Ops, Heroku (@markimbriaco: ex-37 Signals, gave amazing talk on Heroku architecture and ops, and also the attempts to run Heroku infrastructure on top of Heroku (!!))
  • Matt Hackett, VP Engr, Tumblr (@mhkt: we serve 21MM blogs, growing 20% per month, 7.5B page views/month, high growth)
  • Teresa Dietrich, Director Tech Ops, WebMD (@teresadg: ex-AOL, "lots of time spent on outage calls")
  • John Allspaw, VP Ops, Etsy: (@allspaw: "I'm here for balance". Both @lnxchk and @allspaw on prog committee. Kickass panel!)

  • @lnxchk: "Q: what's required for crises?" "Mark: our target is 20m max betw updates. If we miss, we'll say 'more in 4h'"

    • @markimbriaco: "We assign someone to update, usually in support. Also we have role of incident commander."
    • @allspaw: "We have status blog and Twitter feed as well. We track blip -> degradation -> outage, which escalates"
    • @allspaw: "...as outage grows, can trigger Dev or Ops; for severe issues, community team, who are good at words."
    • @teresadg: "at WebMD, less structured policies, but we do notify upon service down and publish restoration times."
    • @teresadg: "Important is internal notifcation: 2K employees, countless doctors affected. Notifying/transparceny critical"
    • @teresadg: "Getting Ops to be transparent was challenging; but CTO demanded visibility, best info on restoration time, etc"
    • @mhkt: "We're small, but Ops is always 1st stop. When any risk of large impact, we page 24x7 community support whos oncall"
    • @mhkt: "During our 24h Tumblr outage, I wish we had Twitter updates. Our lack of transparency was criticized widely"
    • @mhkt: "We don't believe our outage desc should be technical: 'MySQL failed" not "incorrect setting, cluster failed""
  • @lnxchk: "Q: @allspaw showed IRC log used during outages: instant documentation, free timestamps: do y'all use IRC?"

    • @markimbriaco: "We use Campfire, but new prob: we use skype, we lose our instant record; maybe need to echo notes into Campfire"
    • @teresadg: "We use Microsoft Lync. Don't laugh. It works. Auto-populates, phone/video chat, messaging window, draw whiteboard."
    • @teresadg: "It really works. If you have licenses try it. Goes way behind Communicator."
    • @mhkt: "We use Hipchat. It's like Campfire, lots of clients; records chats, all company Notices email: Ops/Dev/CEO/Community"
    • @mhkt: "This is the highest level record of outage that I refer back to all the time"
  • @lnxchk: "Q: How do you put knowledge into institutional knowledge to prevent future screwups?"

    • @markimbriaco and @mhkt both say "We use Wiki, and we suck at it" (haha)
    • @allspaw: "Yahoo! did this very well, which I miss. We use Wiki, but for start/end/detect times goes into Google spreadsheet"
    • @allspaw: "Yahoo! did this very well, which I miss. We use Wiki, but start/end/detect times goes into Google spreadsheet"
    • @allspaw: "But all associated media (Skitch screenshots, IRC logs) goes into Pastebin or Gist, and goes into Wiki"
    • @allspaw: "Though I hate Wikis, everyone knows how to use it, it's available."
    • @teresadg: "We formed SRE team: jumps in during stuck releases, outages; SRE assigned to outage; will do data gathering"
    • @teresadg: "Post-outage, they'll dive deep, pull all logs, analyze; what were builds on servers, what changed, time to.."
    • @teresadg: "..detect, fix; then asks 'what do we need to change'; yields request to dev for more monitoring, config chg"
    • @teresadg: "Or procedure change or documentation change: all those reccs driven by SRE, instead of ignored after crisis"
  • @lnxchk: "Q: how do you keep post-mortems from becoming too emotionally charged, people screaming on desks, etc.?"

    • @mhkt: "Since Dev chg often causes issue, they'll drive it, pulling in Ops when necc. Will tell story, no blame, study..."
    • @teresadg: "Allow time for sleep before data gathering/discussion to prev sysadmin from throttling dev who caused failure"
    • @markimbriaco: "Campfire/IRC enables easy data gathering; run post-mortem as chrono review, passed out ahead of time"
    • @markimbriaco: "..this 'world according to mark' helps start conversation; real world example during Amazon EBS failure:"
    • @markimbriaco: "...after 67h outage, no point in further discussion; would have just caused PTSD. sometimes not necc"
    • @allspaw: "+1 for 'no rigid process'; I know I won't get true story and details necc to improve until people feel safe"
    • @allspaw: "State it: I'm not going to fire you, dock pay, or get benched; My boss is CTO and supports my policy"
    • @allspaw: "If you have standing room only for your post mortems, you know you're doing something right; self-imrovement"
    • @allspaw: "When Dev pushes their own code, chgs happening all time; we want this to reflect Etsy, we value authorship"
    • @allspaw: "Dev have pride of authorship and confidence and name attached to commits; fingerpointing
    • @allspaw: "Ideal: here's what i did, here's what I thought would happen, here's what went wrong; then people offer it up" (Nice!)
    • @teresadg: "I've been doing this for 15 yrs; too many people fear getting fired; I've seen really stupid stuff..."
    • @teresadg: "...like tools that if you type in wrong window, sends to all routers; firing only happens when there's malice"
    • @teresadg: "How did we know malice? We could see in logs him testing it, knowing full well the effects of script"
    • @teresadg: "Fear of making mistake, coming into work and thinking 'it'll be anit-Joe day' is real. Safety is needed"
    • @markimbriaco: "I'm constantly worried about issues between Dev and Ops; want Dev to be able to say here's what happened"
    • @mhkt: "Desire to create institutional knowledge and learning; but also has catharsis needs: lv room feeling better"
    • @mhkt: "How to decide in-person or email? Know I'm doing it right when ppl say 'can we do in-person post-mortem?"
  • @lnxchk: "Frameworks from nuke, chem industry, like Five Why: Which methodologies do you use? Too cold? Useful?"

    • @allspaw: (after no one else says "yes"): "After using various methods, some from high risk industries, like for nuc power..."
    • @allspaw: "...during early days: structured, mathematical. eg. Fault/event tree analysis vs. risk mgmt; 5 Whys came about..."
    • @allspaw: "...because of (Taiichi Ohno I think, John) method of asking why on plant floor. Opp of rigorous fishbone diagram"
    • @allspaw: "For web, growth has been so fast: we choose efficiency vs. rigor; not worth 40h mtgs for couple slow web pages"
    • @allspaw: "I think this decision makes sense; not like 'oh, we amputated left leg instead of right'" (or reactor meltdown)
    • (FWIW, @kevinbehr's fave root cause analysis is Apollo Method)
  • @lnxchk: "What's worst thing you've ever seen happen in root cause meeting?"

    • @allspaw: "I've seen some RCA that's extremely finger-pointy; previous company: defense mechanism up before meeting!"
    • @allspaw: "'...I don't know why I should be there, but I'll go, b/c educational'" <-- shields/defense up upon invite!
    • @markimbriaco: "As young sysadmin at bank, tons of VPs: no one asked me question, even though it was a tech issue!"
    • @teresadg: "During malicious event, lots of staff got computers/disks confiscated due to data hiding; mgmt didn't say why"
    • @teresadg: "Maybe sysadmin didn't intend to cause as much damage, but he hid tracks; caused cascading problems"
    • @mhkt: "Bad post-mortems focus on how we fixed instead of process: "eg: saw this fault, and we fixed it"
    • @teresadg: "we like time to detect, notification, respond, troubleshoot, repair. Run those #s everytime; Find outliers"
    • @teresadg: "Things are always going to go wrong; that's why Ops people will always have jobs" (Nice!)

End of talk! Great job @lnxchk, @teresadg, @mhkt, @markimbriaco, @allspaw! Will publish link when I find it tomorrow!

BTW, I love that O'Reilly makes videos avail to everyone. Awesome conference. Will go again next yr!

Thursday, June 16, 2011

My 2011 Velocity Presentation: "Creating the Dev/Test/PM/Ops Supertribe: From Visible Ops To DevOps"

I spent two awesome days at the amazing 2011 Velocity Conference.  The presentations were fantastic, the practitioners in attendance had amazing kung fu, and I had a chance to reconnect with some of my favorite people in the industry.  Not to mention making new friends.

I can't say enough good things about this conference.  If you want to know what the bleeding edge of Web Operations and Development execution looks like, this is a conference you can't afford to miss.  Whether you're Ops, Dev or Infosec.

I had the privilege of presenting on Wednesday.  The talk title was "Creating the Dev/Test/PM/Ops Supertribe: From Visible Ops To DevOps." It summarized some of my key learnings since co-authoring the Visible Ops Handbook and studying high performing IT operations organizations, and why I'm so excited about the DevOps movement.

Jesse Robbins (@jesserobbins), John Willis (@botchagalupe) and Patrick Debois (@patrickdebois) convinced me to talk about some of the projects I'm working on now, which I describe in the presentation as well.

Slideshare is below, followed by the talk abstract.

 

I'm going to share my top lessons of how great IT organizations simultaneously deliver stellar service levels and fast flow of new features into production. It requires creating a "super-tribe," where development, test, IT operations and information security genuinely work together to solve business objectives, as opposed to throwing each other under the bus.

I will describe what successful transformations look like, and how they were achieved from a Dev and Ops perspective. It will draw upon my 11-year study of high performing IT organizations, as well as work I've done since 2008 to help some of the largest Internet companies increase feature flow and production stability.

Lastly, I will share materials from two book projects I am currently working on: “When IT Fails: The Novel” and a prescriptive DevOps guide. I am seeking fellow travelers who want to capture and codify the best known methods, recipes and case studies of how to implement successful DevOps-style transformations.