Notes on ‘team responsibilities in cloud-native operations’ (Pete Mounce)

Summary:  Pete Mounce (@petemounce) from Just Eat gave a compelling talk at the London Continuous Delivery meetup group on ‘team responsibilities in cloud-native operations’. I found the talk hugely engaging, with loads of detail applicable to many organisations. Here are my notes from the meetup.

I captured my notes as slides:

Update: the video of Pete’s talk is here on Vimeo:


There were several specific points made by Pete that were interesting for me:

  1. We have 12 small teams that are isolated from each other

Each team develops, delivers and supports subsystems that are independent of those of other teams. Behind this setup is an awareness of the negative effects of allowing too many people to make modifications to a codebase – ‘rot’ and disengagement often set in if people see that another team or person has changed some code and made it worse, especially if the code change was a ‘bit of a bodge’ to meet a deadline.

  2. We spent significant time & effort on the delivery mechanism and monitoring

In my experience, most organisations currently underestimate the amount of time and effort they need to spend on evolving and operating the delivery mechanism and auxiliary systems such as logging and monitoring. If we want our software changes to be rapid, repeatable, and reliable (safe), then it’s essential to spend a non-trivial part of the product or programme budget on our delivery and operations infrastructure and tooling. Given that Just Eat are doing really well, it’s not surprising that they have invested properly in deployment and monitoring systems.


  3. Now we make decisions using metrics from Production. We waste less time in meetings – it’s easier to make decisions based on data, and there are fewer arguments.

Making decisions based on metrics from Production is worlds away from where many organisations are today. The extra angle I liked from Pete was that metrics make decisions easier, with fewer arguments and less time wasted in meetings discussing possibilities. The organisation becomes more nimble in decision-making.
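To make that concrete, here is a minimal Python sketch of a data-backed decision, assuming a hypothetical metrics endpoint; the URL, metric name, and threshold are all illustrative, not Just Eat’s actual setup:

```python
# A minimal sketch: let a threshold against Production data, not a meeting,
# decide whether a release promotion proceeds. The endpoint is assumed to
# return JSON like {"value": 0.004}.
import json
import urllib.request

METRICS_URL = "http://metrics.internal/api/v1/query?metric=checkout.error_rate"  # hypothetical

def error_rate() -> float:
    """Fetch the current Production error rate (0.0-1.0) from the metrics store."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        return float(json.load(resp)["value"])

def safe_to_promote(threshold: float = 0.01) -> bool:
    """Promote only if the Production error rate is under the agreed threshold."""
    return error_rate() < threshold

if __name__ == "__main__":
    print("Promote release:", safe_to_promote())
```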

  4. We experienced a step change when we introduced centralised logging. Centralised logging (ELK) helps us to *explore* data and metrics retrospectively and learn. We created more specific dashboards that help to reduce ‘mean time to react’ and ‘mean time to repair’.

I am a massive fan of log aggregation, especially when used to bridge the Dev / Ops divide. Pete described how introducing centralised logging (using NxLog and ELK) actually helped to change the culture within the organisation, as developers became more aware of and interested in high-quality logging, and operations people began to trust developers more due to better logging. The point that centralised logging helps to reduce time-to-repair is a key justification for spending the money and effort.
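The enabler for this kind of exploration is structured log events that the pipeline can index without bespoke parsing. Here is a minimal Python sketch (my illustration, not Just Eat’s code) of emitting one JSON document per log line for a shipper such as NxLog or Logstash to forward into Elasticsearch:

```python
# A minimal sketch of JSON-structured logging, assuming a log shipper tails
# stdout (or a file) and forwards each line to Elasticsearch. The field
# names are illustrative, not a real schema.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order accepted")  # one JSON document per line, ready for Kibana dashboards
```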

  5. Ops think of Alerts, Devs think of Tests. We changed our terminology so that alerts are called ‘continuously running tests in Production’ so Devs are happier to write alerts.

Terminology is important; finding a shared vocabulary that drives positive behaviour was a real win for Just Eat. And just getting buy-in to always-running tests in Production is a major enabler.
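To illustrate the reframing, here is a minimal Python sketch of an alert written in the shape of a test; the helpers, queue, and threshold are hypothetical:

```python
# A minimal sketch of an alert as a 'continuously running test in Production':
# a plain assertion-style check, scheduled on a loop instead of in a test run.
import time

def check_queue_depth() -> int:
    """Return the current order-queue depth; stubbed here for illustration."""
    return 0  # in reality, query the queue or a metrics endpoint

def page_on_call(message: str) -> None:
    """Notify whoever is on call; stubbed here for illustration."""
    print("ALERT:", message)

def test_queue_is_draining() -> None:
    """Written like a test, but run forever against Production."""
    depth = check_queue_depth()
    if depth > 1000:
        page_on_call(f"order queue depth is {depth}, expected < 1000")

while True:
    test_queue_is_draining()
    time.sleep(60)  # re-run the 'test' every minute
```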

  6. Don’t use the debugger – use logging instead.

Yes, yes, yes! I gave a talk on exactly this subject with my colleague Rob Thatcher at the excellent /dev/winter conference in January 2015: ‘Ditch the debugger and use logging instead’.

It was great to hear Pete enthuse about the transforming power of log aggregation :)
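As a small illustration of the habit, here is a Python sketch that logs state at the decision points you would otherwise inspect with breakpoints; the apply_discount helper is hypothetical:

```python
# A minimal sketch of 'logging instead of the debugger': record state at
# branch points so behaviour can be reconstructed later from aggregated logs.
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("basket")

def apply_discount(total: float, voucher: str) -> float:
    """Hypothetical discount lookup, stubbed for illustration."""
    return total * 0.9

def price_basket(items: list[float], voucher: str | None) -> float:
    total = sum(items)
    log.debug("priced basket: items=%d subtotal=%.2f voucher=%r", len(items), total, voucher)
    if voucher:
        total = apply_discount(total, voucher)
        log.debug("applied voucher %r, new total=%.2f", voucher, total)
    return total

price_basket([8.50, 3.20], voucher="WELCOME10")
```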

  7. Each team is aligned to a business KPI – ‘Goal Team’ e.g. ‘Make a leading mobile experience’, ‘Encourage new customer conversion’, ‘Increase engagement with restaurant partners’

Aligning teams to a high level, business-relevant metric and allowing the team the autonomy and responsibility to meet that goal is a bold step. Team members are also members of a cross-cutting team that acts as a quality gatekeeper for each subsystem, so the drive from the ‘Goal Team’ is tempered by the checks from the code quality team. It sounds like a really interesting approach, and I am looking forward to hearing more from Pete in, say, 12 months’ time.

  8. We establish trust relationships rather than relying on permission enforcement.

Also known as ‘treat people as decent human beings rather than army recruits’. It makes Just Eat sound like a great place to work.

  9. We’re insulated against spikes by running fake load. We simply switch off fake load to gain an extra 50% capacity.

This is a superb idea: deliberately inject extra load on the Production system continuously (making sure that our system copes), so that if we notice a sudden surge in real traffic, we can just switch off the fake load to gain an extra 50% capacity without provisioning any more tin. It feels analogous to marathon & ultra runners who train with weights strapped to their arms and legs so that when they run the actual race, they feel they have extra strength.
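A minimal Python sketch of the mechanism, assuming a hypothetical kill-switch flag file and target endpoint, might look like this:

```python
# A minimal sketch of the fake-load idea: a load generator gated by a shared
# 'switch'. Turning the switch off instantly frees the headroom the synthetic
# traffic was occupying. The flag file and URL are hypothetical.
import os
import time
import urllib.request

FAKE_LOAD_ENABLED = "/etc/fake_load_enabled"   # hypothetical kill-switch flag file
TARGET = "http://orders.internal/healthz"      # hypothetical endpoint to exercise

def fake_load_enabled() -> bool:
    return os.path.exists(FAKE_LOAD_ENABLED)

while True:
    if fake_load_enabled():
        try:
            with urllib.request.urlopen(TARGET, timeout=2) as resp:
                resp.read()
        except OSError:
            pass  # fake traffic must never page anyone
    time.sleep(0.1)  # ~10 requests/second of synthetic traffic per worker
```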

  10. Encourage people to think of logs as event streams rather than lines in a text file

Again, yes! I’ll be speaking on this subject at Operability.io Conference in London on 24-25 September 2015: ‘Un-broken logging, the foundation of operability’.
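As a taste of the event-stream mindset, here is a minimal Python sketch that consumes JSON log events from stdin (e.g. piped from a shipper) and folds them into a live metric; the ‘order_placed’ event name is illustrative:

```python
# A minimal sketch of treating logs as an event stream rather than lines in a
# text file: consume structured events and aggregate them on the fly.
import json
import sys

orders_per_restaurant: dict[str, int] = {}

for line in sys.stdin:  # in Production, the stream never 'ends'
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-event noise in the stream
    if event.get("event") == "order_placed":
        rid = event.get("restaurant_id", "unknown")
        orders_per_restaurant[rid] = orders_per_restaurant.get(rid, 0) + 1
```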

  11. We do not use industry terminology like Agile, Kanban, Scrum, but instead just describe what we do. “We try and ship software and try to continuously improve” – that’s it.

Rather than getting hung up on whether they are ‘doing Agile’, the Just Eat teams focus on getting things done well. Exemplary stuff.

A big thanks to Pete Mounce for sharing his experience, to GameSys for hosting the meetup, and to the attendees for a brilliant 45-minute Q&A session.

Edited, 2015-07-19: correct the ‘fake load’ volumes from erroneous 15-20% to the actual 50% load used.


Filed under: Agile and Lean, Continuous Delivery, DevOps, Monitoring, Operability, Software, Teams Tagged: AWS, cloud, culture, learning, logging, LondonCD, monitoring, operability, reliability
