Live from Red Hat Summit 2014: How Netflix Uses DevOps for Reliability and Developer Velocity

Session Summary: Netflix designs their systems and deployment processes to help the service survive both catastrophic events like zone and regional outages and less catastrophic events like network latency and random instance death. This system has previously been described as “dream devops”. In our data centers we had monolithic systems and centralized operations. When we moved to the cloud we fully embraced the distributed services and the devops model. Now, with experience, we’ve uncovered real-world challenges with the devops model and, as a result, have embraced more effective hybrid approaches. More specifically, how do we reconcile local agility and ownership with the achievement of system-wide objectives, such as the overall quality and reliability of large scale distributed environment? Topics will include our software lifecycle from code check-in to automated machine image baking to deployment, monitoring and alerting, and how Netflix uses self-service tools to enable our developers to maintain maximum code velocity.

This session was given by Jeremy Edberg, the self-proclaimed Information Cowboy from Netflix. :) He was very informative and humorous, which I really enjoy in a presenter.

Netflix hires responsible adults and gives them a great deal of freedom and responsibility. Developers deploy when they want and also manage their own capacity and autoscaling. The developers own their product from beginning to end. The mantra is, “If the customer isn’t happy, the developer shouldn’t be happy,” which should be everyone’s thought process. Here are my notes.

Automation: Essentially everything is automated, including code deployment. They have tools to manage all of the systems.

Build for Three: This is done for reliability. Architectures are built around the assumption that the loss of any VM will not affect service. To me, this is the Pets vs. Cattle analogy.
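One way to read “Build for Three” is as a capacity check: spread instances over three zones and confirm that losing any one of them still leaves enough to serve traffic. The sketch below is only my illustration of that idea, not something shown in the session; the function name, zone names and numbers are all made up.

```python
def survives_single_zone_loss(instances_per_zone, required_instances):
    """Return True if losing any one zone still leaves enough instances.

    instances_per_zone: dict mapping a (hypothetical) zone name to its
    instance count. required_instances: how many instances are needed
    to carry the load.
    """
    if not instances_per_zone:
        # With no zones at all, only a requirement of zero can be met.
        return required_instances <= 0
    total = sum(instances_per_zone.values())
    # Remove each zone in turn and check what is left.
    return all(
        total - count >= required_instances
        for count in instances_per_zone.values()
    )


# Hypothetical deployments: an even spread survives, a lopsided one does not.
balanced = {"zone-a": 4, "zone-b": 4, "zone-c": 4}
lopsided = {"zone-a": 8, "zone-b": 2, "zone-c": 2}
print(survives_single_zone_loss(balanced, 8))
print(survives_single_zone_loss(lopsided, 8))
```

With eight instances required, the even spread keeps eight after any single zone is lost, while the lopsided one drops to four if its largest zone goes away.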

We used to live in a world where nothing breaks, but when you design for mass scale, that is just not true anymore. Speed at scale breaks everything.

Cloud Native (called Microservices): Tens of thousands of instances are created and removed daily.

Service-Oriented Architecture: Highly aligned and loosely coupled. Each dev team builds a service, and the teams work together to figure out what each service will provide. They collaborate through APIs that the service owners provide. This enables the following: easier auto-scaling, easier capacity planning, identifying problematic code, narrowing the effects of a change, and more efficient local caching. The API has its own little runtime engine in it.

Browse, Play, Watch: Each is driven by a different set of APIs. They still deploy fully baked images out to the imaging servers. They consistently balance the reliability vs. $$ factor.

He described the Monkey Theory, which I thought was pretty interesting. It concerns people’s way of thinking in IT communities. I think it is very relevant today for so many things, so let me share:

Start with a cage containing five monkeys. Inside the cage, hang a banana on a string and place a set of stairs under it. Before long, a monkey will go to the stairs and start to climb towards the banana.

As soon as he touches the stairs, spray all of the other monkeys with cold water. After a while, another monkey makes an attempt with the same result – all the other monkeys are sprayed with cold water. Pretty soon, when another monkey tries to climb the stairs, the other monkeys will try to prevent it.

Now, put away the cold water. Remove one monkey from the cage and replace it with a new one. The new monkey sees the banana and wants to climb the stairs. To his surprise and horror, all of the other monkeys attack him. After another attempt and attack, he knows that if he tries to climb the stairs, he will be assaulted.

Next, remove another of the original five monkeys and replace it with a new one. The newcomer goes to the stairs and is attacked. The previous newcomer takes part in the punishment with enthusiasm!

Likewise, replace a third original monkey with a new one, then a fourth, and then the fifth. Every time the newest monkey takes to the stairs, he is attacked. Most of the monkeys that are beating him have no idea why they were not permitted to climb the stairs or why they are participating in the beating of the newest monkey.

After replacing all the original monkeys, none of the remaining monkeys have ever been sprayed with cold water. Nevertheless, no monkey ever again approaches the stairs to try for the banana.

Why not?

Because as far as they know that’s the way it’s always been done around here.

  • Netflix built a global PaaS on top of AWS that supports all regions and zones. They use AWS because it does the heavy lifting.
  • Shared state should be stored in a shared service, using technology and tools like EVCache, Cassandra (availability over consistency, writes over reads, we know Java, open source + support, easy scalability, fast negative lookups, distributed with no SPoF), Asgard, JBoss, and Python.
  • Netflix has moved the granularity up from the instance to the cluster. (see Pets vs. Cattle)
  • He described their deployment, app image baking, and the Linux base AMI, which uses CentOS or Ubuntu.
  • Continuous Integration: Each check-in results in a new machine image. The build runs automatically on each new check-in and includes running tests and canaries.
  • These deployment practices allow a lead time of minutes.
  • Finding Things: Discovery (Eureka), Entrypoints (Edda), NIWS (Ribbon).
  • Keeping it All Straight: Configuration (Archaius), Base (Karyon), NIWS (Ribbon), logging (Blitz4j)
  • Storing It: Cassandra (Priam, Astyanax), EVCache (Ephemeral Volatile Cache)
  • https://github.com/Netflix/recipes-rss – Open Source Platform
  • Netflix has now built a ‘Simian Army’ that can think for itself and bring new ideas and new ways of doing things into the fold.
  • Best Practices – Decision Making (Time of Day/Week, Risk to Netflix), Incident Reviews (Preventative behavior), Policies (Descriptive, Flexible, Evolve Quickly, Actions), Circuit Breakers (Hystrix) [Be liberal in what you accept, strict in what you send], For Data (multiple copies, snapshots, & Replication)
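The circuit breaker idea mentioned under Best Practices can be sketched in a few lines. This is a generic illustration I wrote for these notes, not Hystrix’s actual API (Hystrix is a Java library), and the class name, thresholds and fallback are all assumptions: after a number of consecutive failures the breaker “opens” and fails fast with a fallback, then lets a trial call through once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    # Hypothetical defaults; real values depend on the service being called.
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: do not touch the failing dependency at all.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let this one call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def broken_service():
    raise RuntimeError("recommendations service is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(3):
    # The third call never reaches broken_service; the breaker is open.
    print(breaker.call(broken_service, fallback=lambda: "cached list"))
```

The point of the fallback is that one slow or broken service degrades a single feature instead of dragging down everything that calls it.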

Overall I found this session quite interesting, and the Monkey Theory matches what I have seen in my own professional career. If people are pushing wheelbarrows with square wheels and you come up to them and say, “Try this round wheel,” for the most part they will tell you that they are too busy, or that things are fine just the way they are. I certainly hope these notes from the session help someone.

 

 

