Talk notes: DevOps Requires Visibility (#DevOps Day)
DevOps Day, Santa Clara, CA
June 25, 2010
I'm here to participate on a panel called "DevOps Outside of Web Operations."
DevOps Requires Visibility
Javier Bowles (Appscio)
Gareth Soltero (SpringSource/VMware)
Jyoti Bansal (AppDynamics)
Matt Ray (Zenoss)
Eishay Smith (KaChing)
Moderator: Damon Edwards (DTO Solutions)
Holy cow. All these companies are hiring. That's very interesting anecdotal evidence that successful companies embrace the values of DevOps.
- Question: "DevOps requires situational awareness, very much like the military."
- Eishay: "We do 92 deployments per day. We can stop the line at any time and roll back to a known good state. When we see a breach, we add a new layer of controls."
- Matt: "Situational awareness requires instrumenting bare metal, etc. Monitor everything."
- Jyoti: "Three layers: business metrics, ..."
- Javier: "You can have all the visibility in the world, but you have to keep the audience/consumer in mind. You can end up with a lot of misaligned interests. You can have complete visibility, but if you can't prove that a certain area ISN'T THE PROBLEM, then what good have you done?" (haha)
- Gareth: "challenge: how do you represent all this data?"
- Question: "Data everywhere, but not lots of knowledge. We sit up here and talk about tying up to the business metrics, but as an industry, we seem to not do it. New vendors are coming, but why is it different this time, and what causes the disconnect?"
- Javier: "Web Ops really is the business, so they have a better chance at making good metrics than the traditional IT ops organizations. Culture of instrumenting and manageability was shared equally between Dev and Ops. Dev would be reluctant to deploy code without a certain amount of telemetry. That's Dev's job, not the product manager's job: he/she doesn't know how. Only Dev does."
- Jyoti: "late binding of business metrics is sometimes useful."
- Matt: "Ops will often challenge Dev on answering what metrics need to be exposed: so we can tell when we're slow, both at hardware and app level. These people know that they can't deploy without Ops buying in."
- Eishay: "Easy. Hire best engineers. Then give them root passwords. Ops then becomes the team who automates things for Dev. Engineers know best what data needs to go where. Enable engineers to do what they want, and don't get in the way. If you can't trust the engineers, then fire them." (Holy cow.)
- Question: "what are some of the best practices for creating a culture of visibility?"
- Eishay: "Culture of quality. When a test fails, there are blinking lights everywhere. Penguin shouts 'the build is broken.' Everything revolves around this, knowing we can't release something if it's broken. That's when it's ingrained into the culture."
- Matt: "As open source project, you want to foster sharing. 'Here's how I did it, here's how you can do it.' That's the culture we need to build."
- Jyoti: "Info sharing is a key cultural attribute, as opposed to info hoarding. That inhibits visibility. At Netflix, what impresses me is visibility at top-level. In lobby, you see user-satisfaction scores, where the bottlenecks are, and shared throughout the organization. Avoids the blame game."
- Javier: "Money. Pay people well. When hiring for ops, QA, and engineering, engineering often hoards all the money, because dev is expensive. The truth is, that's a recipe for disaster, because you have dev rockstars that rule the roost. Ideally, incentives and compensation are structured so that site outages and release performance hit people's pocketbooks. One exec should own both dev and ops goals."
- Question: "okay panel, you guys make it sound easy. But at my shop, we're going through a KPI measure exercise that is futile. Bonus exercise is just a racket. We'll massage the data in the end to show we got five-nines. We have one guy who was at Netflix who sees where we need to go, but they had a full-time person dedicated to it. Monitoring is a racket, but without customizing event handlers, it's a joke."
- Question: "Metrics are often misguided. Who cares about memory usage? Focus on business metrics like transactions."
- Question: "Hey, Eishay, how do you get away with developers with root when you're dealing with a financial transactions platform?"
- Eishay: "Lots of reporting. We made the reporting system easy to use." (But how do you assert on the integrity of the system and data?)
- Javier: "there are creative ways to achieve compliance objectives."
- Question: "What do you think of one exec owning dev, QA, and ops?"
- Jyoti: "look at effectiveness measures like MTTR"
- Javier: "look for execs who have had this type of responsibility before. As a community, we need to create that skillset and experience set."
- Tip: "1) Best time to intro metrics is at feature launch time, so you know whether the just-launched widget is wanted, and how much CPU is required, to ease the CapEx budgeting process. 2) We got a site-down issue last week. Why didn't we see that coming? Sounds like we need a graph for that."
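Jyoti's suggestion above to look at effectiveness measures like MTTR (mean time to recovery) is simple to put into practice once incidents are logged with a detection time and a resolution time. Here is a minimal sketch of my own, not something shown on the panel. The function name, the (detected, resolved) tuple format, and the sample incidents are all hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average outage duration as a timedelta.

    `incidents` is a list of (detected, resolved) datetime pairs.
    This format is an assumption made for illustration.
    Returns None when there are no incidents to average.
    """
    if not incidents:
        return None
    # Sum the duration of every incident, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: three outages lasting 30, 90, and 30 minutes.
incidents = [
    (datetime(2010, 6, 1, 9, 0), datetime(2010, 6, 1, 9, 30)),
    (datetime(2010, 6, 8, 14, 0), datetime(2010, 6, 8, 15, 30)),
    (datetime(2010, 6, 20, 2, 15), datetime(2010, 6, 20, 2, 45)),
]

print(mean_time_to_recovery(incidents))  # 0:50:00
```

Tracking this number per release or per month is one way to give a single exec who owns dev, QA, and ops a shared measure to watch.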