Talk notes: Infrastructure As Code (#DevOps Day)
DevOps Day, Santa Clara, CA
June 25, 2010
I'm here to participate on a panel called "DevOps Outside of Web Operations."
Infrastructure As Code
Adam Jacob (Opscode) (@adamhjk)
Luke Kanies (Puppet Labs) (@puppetmasterd)
Erik Troan (rPath) (@OpineIT)
Theo Schlossnagle (OmniTI) (@postwait)
Moderator: Patrick Debois (Jedi.be)
Introductions:
- Theo: manages thousands of systems
- Adam: CTO of Opscode, the company behind Chef.
- Luke: founder of Puppet Labs, talking about infrastructure as code for years.
- Erik: founder of rPath. Millions of sysadmins out there don't use version control, unfortunately. Sponsored work on RPM, RHN, etc.
- Question: "why DevOps now? We did scripts 10 years ago. What's new now?"
- Erik: "before, you had hundreds of systems. Now you have thousands."
- Luke: "Now you can get 1000s of machines in a day. 10 years ago, they'd laugh at you."
- Adam: "We're better than 10 years ago, like anything you do with practice. 'doit 5' "
- Theo: "Can't do peering provisions and complex tasks with 'doit 5'. The cloud took so much away from me that all that's left is complex tasks. Sort of awesome. What's exciting is that so many tools are standardized across orgs."
- Luke: "Stealing tools from the developer side. Until recently, sysadmins didn't have a culture of replacing tools all the time. Just thinking about switching from CVS to Subversion. That's slow. People who know how to iterate need to be copied. We should do what we can to make our own lives better and develop better tools."
"Dev has 6 decades of experience. Ops has one decade. We can't take that long. Puppet is declarative. Allows unit tests."
- Adam: "Unit testing is hard, because you need to test observable state and functionality."
- Theo: "When nodes boot, you cannot rely on unit testing. It's not as useful as you think it is."
- Adam: "link failed testing into Nagios"
- Luke: "If everything is statically declared, unit testing is not useful. If you have lots of complex code, you need to rely on unit testing, to enable some type of validation of functionality. Someday you will fat-finger it, and replace your kernel with the word 'drop.'"
- Poll:
- Many people who write production code are also the same people who deploy code and keep it running
- Panel continues...
- "DevOps is all about allowing engineering discipline into IT ops. What is missing is moving IT ops into Dev paradigm. By the time you talk to Dev, they've been up for too long and you can't understand them. Why can you achieve operability into IT ops, and move IT ops values like accountability into Dev, and ability to put out fires immediately."
- Question: "Two different conclusions from DevOps mission: Puppet/Chef vs. reuse of complex infrastructure patterns"
- Adam: "sometimes it's infrastructure, sometimes it's code and scripts. Orchestration is brought up a lot, but often it's easy just to write 100 lines of code."
- Theo: "Orchestration is Puppet/Chef. I wish I had a Chef recipe to run clustered 100 MySQL. Largely the configuration and recipes are proprietary. Deployment is reusable. Configuration of '5 hour recovery time' is not encodable in scripts. The sharing that everyone wants is not what people expect."
- Luke: "If that's true, then I quit. That would be a failure of our movement. We're all writing infrastructure code, but in equivalent of assembly code. After 5 years at Puppet, we think we see a path there, but it's still a ways away."
- Adam: "Monocultures die. There are things that you do that are special and unique."
- Luke: "You are not a unique and beautiful snowflake."
- Adam: "Yes, you are!"
- Theo: "At Velocity conference, Amazon, Yahoo, etc. all did complex things differently. But they're not going to use the same recipe for peering. Like, I don't care about your Apache config."
- Luke: "Apache has configs, and yes you have to separate. Bad models include IP addresses, hostnames, and things with assumptions. You can't pull out all assumptions, but you can pull out a bunch. So you should be left with something that other people can use. And when you make mistakes, and when you find a late assumption, you can share a patch and re-use. And there is a growing body of people generating reusable code. At that point, we'll be over the hump."
- Adam: "But there's a layer above us. There's not a world where the software is so smart that we don't need an engineering brain."
- Erik: "I agree with Luke. It's all a rising tide. We used to write shell scripts. Now Chef/Puppet. Next, class sharing."
- Theo: "How many people just downloaded your business? You can't have a recipe for everything."
- Luke: "Everything is a unique and beautiful snowman. But the snowflakes are standardized. 80% of all infrastructure is the same as the person down the street."
- Question: "sudo for everyone?"
- Theo: "PCI DSS loves that. No way." (Haha, this is great.) "75% of people can read credit card numbers. There are two classes of apps: those that matter, and those that don't. First class has rules and regulations. Using Chef/Puppet is a nightmare. Can't use sudo, must use pfexec on Solaris. Is that a reasonable design pattern?"
- Adam: "I don't want continuous deployment to heart/lung machines. You need audits."
- Question:
- Luke: "Monitoring is like integration test of infrastructure."
- "If it's not monitored, it's not in production."
- "Even in audit world, you want to be able to test the system."
- Theo: "Deployment tools only tackle a small part of the time required. You save 15 min in a 15 hour deployment process, including change management, production waivers."
- Adam: "How many people have to log in, because NetApp guy can't get there fast enough? People lie to you!"
- Theo: "Sun DTrace is there because of really complex problems, like CPU issues that cause stock market crashes. I want to run it on Linux, F5, because no matter how careful I am, something always goes wrong. When things go wrong, there is always a software engineer looking at the problem. Shit goes wrong all the time."
- Adam: "I love tools. I like them better because it makes me better as an engineer. But I don't like tools that just say, 'I will just run things.' That never works."
- Question: "I want to go back to 16 hours to roll out something new. We tend to forget our own responsibility of why we have to have these rituals. It's because we suck." (people applaud) "I am a lover of puppet and agile systems administration, but what we miss, we'll win back a big chunk of that 15 hours. My lab in our university is exempt from the 16 hour ritual, because they never screwed up. If you do it right, you don't have to jump through all those hoops."
- Theo: "Social networking client: fail fast, fail often. But other clients, 'Oops. We screwed up. Don't come in tomorrow, because we lost data and now we are in violation of regulations.' When an outage costs $500K/minute, you have to have different policies. Puppet vs Chef issue: you are informed by the community that uses you. Ruby on Rails is a great example."
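
Side note, not from the panel: one way to picture Luke's point about pulling IP addresses and hostnames out of shared configuration is the small sketch below. It is not Puppet or Chef code. It is plain Python using the standard library's string.Template, and every name in it (VHOST_TEMPLATE, REQUIRED_KEYS, render_vhost, example_site) and every value is hypothetical, made up for illustration. The template is the part you could share; the dictionary is the part that stays local.

```python
from string import Template

# Reusable part: no IP addresses or hostnames baked in.
# (Hypothetical template, loosely modeled on an Apache virtual host.)
VHOST_TEMPLATE = Template(
    "<VirtualHost $listen_ip:$port>\n"
    "    ServerName $server_name\n"
    "    DocumentRoot $doc_root\n"
    "</VirtualHost>\n"
)

# The assumptions the template makes, stated explicitly.
REQUIRED_KEYS = ("listen_ip", "port", "server_name", "doc_root")


def render_vhost(site):
    """Render the shared template with one site's local settings."""
    missing = [key for key in REQUIRED_KEYS if key not in site]
    if missing:
        # Fail loudly instead of silently shipping a half-filled config.
        raise ValueError("missing site settings: " + ", ".join(missing))
    return VHOST_TEMPLATE.substitute(site)


# Site-specific part: illustrative values only (documentation IP range).
example_site = {
    "listen_ip": "192.0.2.10",
    "port": 80,
    "server_name": "www.example.com",
    "doc_root": "/var/www/example",
}

if __name__ == "__main__":
    print(render_vhost(example_site))
```

Listing the required settings up front is one way to surface a late assumption: a site that lacks one of them gets an error naming it, rather than a broken config.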