RealGeneKim Blog - Home page of RealGeneKim (Gene Kim): Tripwire founder and CTO, Visible Ops co-author, and more...

Talk notes: Infrastructure As Code (#DevOps Day)

Friday, June 25, 2010 at 10:42AM

DevOps Day, Santa Clara, CA
June 25, 2010

I'm here to participate on a panel called "DevOps Outside of Web Operations."

Infrastructure As Code

Adam Jacob (Opscode) (@adamhjk)
Luke Kanies (Puppet Labs) (@puppetmasterd)
Erik Troan (rPath) (@OpineIT)
Theo Schlossnagle (OmniTI) (@postwait)

Moderator: Patrick Debois (Jedi.be)

Introductions
- Theo: manages thousands of systems
- Adam: CTO of Opscode, and Chef.
- Luke: founder of Puppet Labs, talking about infrastructure as code for years.
- Erik: founder of rPath. There are millions of sysadmins out there who don't use version control, unfortunately. Sponsored work on RPM, RHN, etc.
Question: "why DevOps now? We did scripts 10 years ago. What's new now?"
- Erik: "before, you had hundreds of systems. Now you have thousands."
- Luke: "Now you can get 1000s of machines in a day. 10 years ago, they'd laugh at you."
- Adam: "We're better than 10 years ago, like anything you do with practice. 'doit 5' "
- Theo: "Can't do peering provisions and complex tasks with 'doit 5'. The cloud took so much away from me that all that's left is complex tasks. Sort of awesome. What's exciting is that so many tools are standardized across orgs."
- Luke: "Stealing tools from developer side. Sysadmins didn't until recently have culture of replacing tools all the time. Just thinking about switching from CVS to Subversion. That's slow. People who know how to iterate need to be copied. We should do what we can to make our own lives better and develop better tools."
  "Dev has 6 decades of experience. Ops has one decade. We can't take that long. Puppet is declarative. Allows unit tests."
- Adam: "Unit testing is hard, because you need to test observable state and functionality."
- Theo: "When nodes boot, you cannot rely on unit testing. It's not as useful as you think it is."
- Adam: "link failed testing into Nagios"
- Luke: "If everything is statically declared, unit testing is not useful. If you have lots of complex code, you need to rely on unit testing, to enable some type of validation of functionality. Someday you will fat-finger it, and replace your kernel with the word 'drop.'"
Poll:
- Many people who write production code are also the same people who deploy code and keep it running
Panel continues...
- "DevOps is all about allowing engineering discipline into IT ops. What is missing is moving IT ops into Dev paradigm. By the time you talk to Dev, they've been up for too long and you can't understand them. Why can you achieve operability into IT ops, and move IT ops values like accountability into Dev, and ability to put out fires immediately."
Question: "Two different conclusions from DevOps mission: Puppet/Chef vs. reuse of complex infrastructure patterns"
- Adam: "sometimes it's infrastructure, sometimes it's code and scripts. Orchestration is brought up a lot, but often it's easy just to write 100 lines of code."
- Theo: "Orchestration is puppet/chef. I wish I had chef recipe to run clustered 100 MySQL. Largely the configuration and recipes are proprietary. Deployment is reusable. Configuration of '5 hour recovery time' is not encodable in scripts. Sharing that everyone wants is not what people expect."
- Luke: "If that's true, then I quit. That would be a failure of our movement. We're all writing infrastructure code, but in equivalent of assembly code. After 5 years at Puppet, we think we see a path there, but it's still a ways away."
- Adam: "Monocultures die. There are things that you do that are special and unique."
- Luke: "You are not a unique and beautiful snowflake."
- Adam: "Yes, you are!"
- Theo: "At Velocity conference, Amazon, Yahoo, etc. all did complex things differently. But they're not going to use the same recipe for peering. Like, I don't care about your Apache config."
- Luke: "Apache has configs, and yes you have to separate. Bad models include IP addresses, hostnames, and things with assumptions. You can't pull out all assumptions, but you can pull out a bunch. So you should be left with something that other people can use. And when you make mistakes, and when you found a late assumption, you can share a patch and re-use. And there is a growing body of people generating reusable code. At that point, we'll be over the hump."
- Adam: "But there's a layer above us. There's not a world where the software is so smart that we don't need an engineering brain."
- Erik: "I agree with Luke. It's all a rising tide. We used to write shell scripts. Now chef/puppet. Next, class sharing."
- Theo: "How many people just downloaded your business? You can't have a recipe for everything."
- Luke: "Everything is a unique and beautiful snowman. But the snowflakes are standardized. 80% of all infrastructure is the same as the person down the street."
Question: "sudo for everyone?"
- Theo: "PCI DSS loves that. No way." (Haha, this is great.) "75% of people can read credit card numbers. There are two classes of apps: those that matter, and those that don't. First class has rules and regulations. Using chef/puppet is a nightmare. Can't use sudo, must use pfexec on Solaris. Is that a reasonable design pattern?"
- Adam: "I don't want continuous deployment to heart/lung machines. You need audits."
Question:
- Luke: "Monitoring is like integration test of infrastructure."
- "If it's not monitored, it's not in production."
- "Even in audit world, you want to be able to test the system."
- Theo: "Deployment tools only tackle a small part of the time required. You save 15 min in a 15 hour deployment process, including change management, production waivers."
- Adam: "How many people have to log in, because NetApp guy can't get there fast enough? People lie to you!"
- Theo: "Sun dtrace is there because of really complex problems, like CPU issues that cause stock market crashes. I want to run it on Linux, F5, because no matter how careful I am, something always goes wrong. When things go wrong, there is always a software engineer looking at the problem. Shit goes wrong all the time."
- Adam: "I love tools. I like them better because it makes me better as an engineer. But I don't like tools that just say, 'I will just run things.' That never works."
Question: "I want to go back to 16 hours to roll out something new. We tend to forget our own responsibility of why we have to have these rituals. It's because we suck." (people applaud) "I am a lover of puppet and agile systems administration, but what we miss, we'll win back a big chunk of that 15 hours. My lab in our university is exempt from the 16 hour ritual, because they never screwed up. If you do it right, you don't have to jump through all those hoops."

Theo: "Social networking client: fail fast, fail often. But other clients, 'Oops. We screwed up. Don't come in tomorrow, because we lost data and now we are in violation with regulations.' When $500K/minute outage costs, you have to have different policies. Puppet vs Chef issue: you are informed by the community that uses you. Ruby on Rails is a great example."

Gene Kim | 5 Comments | 71 References | DevOps, talks

Talk notes: Changing Culture To Enable DevOps (DevOps Day)

Friday, June 25, 2010 at 10:04AM

DevOps Day, Santa Clara, CA
June 25, 2010

I'm here to participate on a panel called "DevOps Outside of Web Operations."

Keynote:

Excellent video of DevOps background by Patrick Debois. It showed DevOps problem statements and outlined some of the principles through Charlie Chaplin movie clips. Well done.

Changing Culture To Enable DevOps

John Allspaw, Etsy
Israel Gat (Agile Executive)
Lee Thompson (DTO Solutions)
Lloyd Taylor (NetElder Associates)

Moderator: Andrew Shafer (Cloudscaling)

Introductions
- John Allspaw: Etsy. Formerly at Flickr. Has seen what is possible.
- Lloyd Taylor: now at investor firm. VP Tech Ops: Formerly Google and LinkedIn.
- Israel Gat (Agile Executive): now consultant. formerly with MOM. Now has Agile bug.
- Lee Thompson (DTO Solutions): has experience in financial vertical.
Question: lots of separation between the Dev and Ops tribes, with different goals and measurements. Please comment. (everything below is a quote)
- Israel: "governance is so complex. blah, blah..."
- Lee: "in beginning, no one represents non-functional requirements, so things grow into a big mess. Agile Startup now brings this up. Not doing this early will lead to problems."
- John: "the story often begins, 'look at how bad things are.' Shipping features late costs money. But so do outages. In successful organizations, developers will often set Nagios thresholds, and configuration settings are in the same place as the source code."
- Lloyd: "grounding everything in control theory. The goal of a business is to make money, so everything should subordinate and support that goal. When personalities cause problems, especially trust issues between engineering and development, things tend to get worse. Things will reach a level of stability, but at a very low level of performance. That's why I think the tribe issue is so important. If Engineering and Ops hang out with each other and have lots of cross-socialization, it's hard to demonize people you hang out with. When you spend time together, you defeat the stove-piping. Both ship products, but one is much more efficient, and probably makes more money."
  - A great book on this is "Great Boss, Dead Boss," which is all about tribes. The author is a VP at a solar panel manufacturer, and is a Theory of Constraints Jonah (like me).
Question: Conway's law: the quality of the product will be a reflection of the culture. Please reflect.
- Lee: "Look at cross-incentives. Ops is measured on site uptime. Dev is measured on new features. This is the core conflict. What's the best way to keep the site up? No changes. Just putting faces to each organization makes it more difficult to say 'that moron.' Google has a very good process, which has a chokepoint called 'machine allocation,' which is where they predict which services have the most economic potential."
- Israel: "Is culture in IT similar to that in revenue recognition? Try to assess what culture exists in Development and in IT operations. If they're different, that's interesting."
Question from audience: "Many people moving QA out of Dev. Thoughts?"
- John: "QA exists, even if it doesn't exist organizationally."
- Lloyd: "Facebook doesn't have QA group."
  - My new buddy Chris Westin tells me that in the Velocity Conference talk yesterday, "Developers do their own pushes. So obviously they're doing testing."
- Lee: "1) where is the pain: If developers suffer when deployments fail, you must ensure that those who cause the issue also bear the pain. Pain driven development. 2) you have to look at how difficult it is to make a mistake: if APIs are poorly documented, then it leads to fragility. These are part of non-functional requirements. 3) Facebook does this well: how fast do you detect screwups?"
Question: "Global community of practice: many people are working on this problem, so how can we leverage each other?"
- John: "30% of you write Internet-facing code? That's the development side. Then the other community is IT ops."
Question: "Why DevOps now?"
- Lloyd: "More simple apps like Rails. And with cloud, you can't do things the old way. That's why I'm trying to create DevOps Tool Chain."
Question: "What happens when you just don't have time to spend with your own kids, let alone learn about Mary's kids (your peers in the other silo?)"
- John: "Ask where is the pain?"
- Israel: "Scrum attendance is a good indicator to tell when the other group is too busy to even show up to the daily scrum meetings." (Yes, this is very effective.)
Question: "How about in global distributed orgs, when you never see pictures of the kids of Toronto staff?"
- John: "That's hard."
- Lloyd: "Lesson after 30 years in DoD is that you have to be physically present. Maybe Facebook generation will change that."
Question: "As developer, how do I get IT ops to trust us? They're busy."

Mobilizing The PCI Resistance, Part V: The GAIT Vision For Solving The SOX-404 IT Scoping Problem

Friday, June 18, 2010 at 11:32AM

This is Part 5 of the "Mobilizing PCI Resistance" series.

Okay, enough on the problem. Let's talk about the solution....

What We Wanted GAIT To Achieve

So, what was our vision in January 2006?

GAIT vision.jpg

Enable auditors and management to appropriately identify and link assertions to IT activities and processes, and then appropriately scope relevant IT controls work

What we wanted to provide was a way for auditors and management to precisely scope what in IT mattered for the achievement of SOX-404 objectives. Or put more precisely, to link internal control objectives for financial reporting to specific IT functionality.

And then only audit those things. Instead of carpet-bombing/auditing everything in IT.

Provide a common context for management and auditors to support and test management’s assessment that the necessary IT controls exist and are effective

Initial target is internal control objectives for financial reporting, but should extend to operating effectiveness and complying with laws and regulations (as defined by COSO)

What we were suggesting here is that "SOX-404 is only the beginning. The same principles could be applied to the other COSO objectives: security, compliance with laws/regulations/contractual obligations."

(Look, it's the PCI DSS!!!)

And Stopping The Madness Of "See, This Audit Deficiency Didn't Really Matter!"

GAIT 9 firm chart 3.jpg

Lastly, shown above is what is known as "Chart 3" of the "A Framework For Evaluating IT Control Deficiencies" document, authored by the nine CPA firms that did SOX-404 audits or advisory work, as well as Dr. William F. Messier, Jr.

Basically you would have to dig out this chart for every IT deficiency to try to wiggle out of a material weakness. You would go through this decision tree to decide whether the deficiency would result in a material weakness, a significant deficiency or just a deficiency.

Just so at the end you could say, "See? I told you so! That audit finding isn't really important."

Trouble is, to arrive at that decision took man-weeks of work. Why was the test performed in the first place?

Our observation is that if you were spending lots of time going through Chart 3 for all your IT findings, only to find that they wouldn't result in a material weakness, it was a scoping error. So, GAIT would enable you to do this thinking up front, during scoping, so that we would only perform those tests whose failure could result in an undetected material weakness.

In my next post, I intend to write about the constituencies and politics of getting GAIT approved by all the parties:

internal auditors
IT management
security/compliance management
professional organizations: IIA, ISACA, FASB
enforcement organizations: PCAOB

I'll talk about how we assembled the constituencies, what was in it for them, and how I learned to use one of the most valuable tools in my career.

And then I'll start talking about the GAIT Principles, and how we're extending them to apply to PCI DSS.

(Many were fellow committee members with me at the Institute of Internal Auditors. In the next post, I'll describe why we had assembled the specific players in the room: SEC publicly held companies, their audit engagement partners from the Big Four, as well as their respective national practice leaders, the Institute of Internal Auditors, and the PCAOB.)

Mobilizing The PCI Resistance, Part IV: When Bottom-Up SOX-404 Audits Go Bad. Really Bad.

Friday, June 18, 2010 at 11:13AM

This is Part 4 of the "Mobilizing PCI Resistance" series.

PCI Shock and dismay.jpg

In this article, I will share a cautionary tale of how the problems discussed in the previous articles result in such horrible outcomes. More specifically, how the inability to scope correctly the IT portions of SOX-404 led to tons of firefighting by information security and auditors and wasted effort, all at the expense of more important things that they should have been focusing on.

Like keeping the organization secure. As opposed to trying to achieve compliance with a misguided and mis-scoped audit. Which is at the heart of the PCI problem.

How Bottom-Up SOX-404 Auditing Happens: A Cautionary Tale

visops security.jpg

How do these audit horror stories happen? How do intelligent people in management and audit end up auditing things that don't matter?

We studied this problem when we wrote the "Visible Ops Security: Achieving Common Security and IT Operations Objectives in 4 Practical Steps" book.

This is an excerpt from the book, in a chapter called "A SOX-404 Cautionary Tale":

When external auditors started testing against SOX-404 in the first year, IT findings represented the largest category of findings, totaling more than the combined findings in the revenue, procure-to-pay, and tax categories. It’s estimated that as much as $3 billion was spent in the first year of SOX-404 to fix IT controls to remediate these findings. Ultimately, most of these findings were found not to be direct risks to accurate financial reports and did not result in a material weakness. This is because they followed a bottom up versus a top-down, risk-based approach.

Consider the following scenario: The SOX-404 team asks for an information security review of a WebSphere server that runs the materials management systems. The review shows that it’s a custom WebSphere application running on a cluster of servers that is connected to a clustered Oracle database. We then locate the firewall and determine the segment it’s on.

An information security review of the materials management system uncovers:

Numerous ghost accounts

A lack of password aging policies

Critical vulnerabilities in the Java code, including cross-site scripting issues in the HTML

Vulnerabilities in the Oracle database configuration

Firewall rules that are suspect and need further investigation

Our task list keeps expanding and the internal auditors are showing up next week. We decide to focus on the operating system level, and our suspicions prove to be correct: The operating system is not running at the latest patch levels. We add this to our list of corrective actions that need to be taken right away, and start talking with the owners of the operating system, database, and application, and even the firewall team.

When the internal audit team comes in, we are candid and transparent about all the issues. Management is informed about the risks, and soon 50+ people are working on all these issues, dropping other high-priority projects to get these issues fixed in time. After all, the argument is made, these issues should be fixed eventually because they do represent risk.

But there just isn’t enough time. The external auditors come in and find all of these issues. They start preparing a management letter stating that the integrity of the IT general control process (ITGC) environment cannot be substantiated.

As a result, more high-level meetings take place, and the financial people start to argue that the ITGC issues really can’t lead to an undetected financial reporting error. They pull out the “nine firm document” and use something called “Chart Three” to make the case. Then management and the CPA firms argue back and forth about the linkage, and management starts bringing in all the business experts to show that a failure in the ITGCs for this system could not result in inaccurate financial reports.

Finally, the owner of the materials management business process determines that even if the application, database, operating system, and firewall were compromised by a person trying to perpetrate fraud, the attempt would be caught by a daily financial reconciliation between the materials management inventory report and another report from the ERP system.

Given this new evidence, everyone agrees that reliance is actually placed on the daily financial reconciliation, which would catch both fraud and errors. Furthermore, they agree that reliance is not placed on the IT system and the supporting ITGCs. So, the IT systems are out of scope, and no further IT testing is required.

Everyone is relieved. As the information security practitioners, however, we struggle with this unsettling question about why we went through all this trouble if our efforts were not required to substantiate the accuracy of financial statements. Furthermore, we wonder if all the “good hygiene habits” are actually important and can be justified.

To be clear, it’s not that the downstream manual financial reconciliation control is the best control possible. The point is that if the scoping of IT controls were done correctly in the first place, the only control weaknesses that we would have tested and found would be those that truly jeopardized accurate financial reporting. Instead, we found control weaknesses on systems that were out of scope, and then kept digging needlessly.

Next up, I'll discuss the GAIT vision that was realized in February 2007, when the GAIT guidance was finally published.

Gene Kim | 4 Comments | 70 References | GAIT, PCI, financial reporting, security/compliance

Mobilizing The PCI Resistance, Part III: Quantifying The Huge SOX-404 Problem...

Wednesday, June 16, 2010 at 11:33AM

Previously, I wrote in Part I about "Upset about the subjectivity and ambiguity in the PCI DSS compliance standards? My #BSides submission on the answer...", and in Part II, I wrote about the problems that management and auditors faced in 2005 and 2006 for the IT portions of SOX-404.

In Part III of this series, I will continue walking through the January 2006 GAIT summit slides, and show you the objective evidence that there was a real problem that needed to be solved, and our vision of what the solution was.

Jan 2006 GAIT discussion.jpg

The Damage Of Bottom-Up Auditing

Actually, let me rewind a bit. I didn't realize it at the time, but in 2005, I heard a great presentation by Patrick Gunderman, then a Senior Manager in the KPMG audit practice, that hinted at the magnitude and scale of the SOX-404 IT audit problem. He showed a slide that blew me away.

KPMG Gunderman.jpg

gunderman IT findings 1.jpg

In the slide above, KPMG found that "The estimated percentage of deficiencies identified show IT controls accounting for the most (34 percent), followed distantly by revenue (13 percent), procure to pay (10 percent), and fixed assets (10 percent)."

What this means is that auditors were spending time digging around IT infrastructure, and finding lots of deficiencies. Then for each one, management would either have to remediate, or argue with the auditors that it wasn't worth fixing, because an IT control failure would not result in an undetected material error. Now, if the Enron and WorldCom failures had been caused by rogue DBAs, then maybe this level of scrutiny would have been warranted. But something definitely doesn't seem right...

It’s estimated that as much as $3 billion was spent in the first year of SOX-404 to fix IT controls to remediate these findings. Ultimately, most of these findings were found not to be direct risks to accurate financial reports and did not result in a material weakness. This is because they followed a bottom up versus a top-down, risk-based approach.

At the January 2006 GAIT Summit, we had publicly traded companies present how this problem was affecting them and their need for a better way. Universally, they talked about the huge IT audit effort and fees associated with SOX-404 that was totally disproportionate to the risk.

These companies included (in no particular order), Goldman Sachs, Marathon Oil, Microsoft, Hewlett Packard, Chevron Phillips Chemical, Business Objects and so forth.

One of the most compelling data points was presented by Fawn Weaver at Intel.

fawn weaver intel IT audit effort.jpg

This slide shows how 50% of the SOX-404 compliance effort was IT-related, which was generating almost 80% of the findings. Yet none of those findings represented a real risk of an undetected material error. (So again, why was all that work performed? It shouldn't have been.)

In my next post, I will write about how bottom-up auditing happens and our vision behind GAIT. Next, I will write about the politics of GAIT, and how we assembled the constituencies, what was in it for them, and how I learned to use one of the most valuable tools in my career.

All of this helps (at least, in my mind) inform the PCI problem statement, as well as the strategy of how we can solve it.
