Answered: Your platform engineering FAQs
Your platform engineering FAQs answered by Ten10’s Head of Cloud & DevOps Practice
As experts in DevOps, Cloud, and Platform Engineering, we get a lot of questions from businesses that want to either adopt platform engineering for the first time or improve their approach and workflows.
Ten10’s Head of Cloud & DevOps Practice, Matt Smith, joins today’s instalment of Ten Minutes with Ten10 to answer your platform engineering FAQs, including:
- What role does automation play in platform engineering?
- What are Internal Developer Platforms?
- How do you approach security concerns in platform engineering?
- How can IT managers effectively communicate the value of platform engineering?
Click to listen below or read the episode’s transcript. We hope you find Matt’s insights and expertise from his extensive career useful.
Can you explain the role of automation in platform engineering and its impact on business operations?
With automation in business, the whole point of it is to get consistency, because then you can do things predictably. When we started with things like DevOps, which eventually evolved into platform engineering, the goal was to make the release of software more scientific, so that we can see what’s happening and which changes affect it. So when you get to platform engineering, we’re not just talking about software releases anymore; we’re talking about releasing products or services and managing the release of those products and services.
A lot of platform engineering also comes down to: how do you make sure you’re compliant with your security standards? How do you make sure you’re compliant with your organisation’s financial standards? That automation gives you the reassurance that you’re following the right process.
It’s always a blend, because you don’t want too many restrictions on the choice of tooling or technology, or on how you solve a problem. At the end of the day, your engineers are capable of playing with those things, but you need some governance around it to make sure they’re able to do it in a way that’s not going to be too disruptive. And if they do need to push the boat out, that’s a trigger for them to come and have a conversation. So you have enough automation and compliance around it to give you the reassurance that nothing crazy is going to happen: you’re not going to spend a million pounds on Amazon one month because someone spun up 2,000 instances. But if someone legitimately has a need for that, then you know you have a way of dealing with it. So I think it’s about putting in those loosely-defined boundaries that give you the reassurance people are roughly doing the right things in the right way and aren’t going to do anything too crazy. But at the same time, like I said, you don’t want to take away their ability to innovate, because then you just don’t get the right kind of solutions.
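To make that concrete, here’s a minimal sketch of the kind of guardrail Matt describes, assuming AWS as the cloud and the boto3 library; the region, instance limit, and alerting approach are purely illustrative placeholders, not a prescribed implementation.

```python
# Hypothetical guardrail sketch: flag unusual EC2 instance counts rather than
# blocking engineers outright. Assumes boto3 and AWS credentials are configured.
import boto3

INSTANCE_LIMIT = 200  # illustrative "loosely defined boundary", agreed with the team


def count_running_instances(region: str = "eu-west-2") -> int:
    """Count running EC2 instances in one region."""
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    total = 0
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            total += len(reservation["Instances"])
    return total


if __name__ == "__main__":
    running = count_running_instances()
    if running > INSTANCE_LIMIT:
        # Crossing the boundary triggers a conversation, not a hard stop.
        print(f"ALERT: {running} instances running, over the agreed limit of {INSTANCE_LIMIT}")
    else:
        print(f"{running} instances running, within the agreed boundary")
```

In practice a check like this would run on a schedule and feed an alerting channel, so crossing the boundary starts a conversation rather than stopping anyone’s work.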
How important is scalability to platform engineering? And what strategies can be used to ensure it is successful?
A lot of businesses think they have a scalability issue and the reality is they just don’t. If we look at a major website (let’s say bbc.co.uk), there are millions and millions of users hitting that website on a daily basis, and the backend services will be constantly churning away doing different things. So yes, that sort of website will have some genuine scalability issues.
Most organisations aren’t like that. Let’s take a bank, for example. They assume they have a scalability issue because maybe they have 50 million customers who all use this core banking platform, and therefore it’s really big and important. But if you do the maths, the number of transactions that the core banking platform in the largest bank in the UK handles is somewhere sub-1,000 requests a second. For most computers, that’s fine. Python will happily deal with tens of thousands of requests per second if it’s architected correctly. They don’t actually have a performance scaling issue. What they have is a consistency issue.
We’ve talked about it before: consistency is king. You want to make sure that as you’re scaling, you’re scaling in a consistent, repeatable way. So you’re not just upping the size of the instances in the cloud, for example. That’s one way to scale, sure, but you want to scale horizontally, because that way you can meet demand more easily.
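As one illustration of scaling horizontally rather than just upping instance sizes, here’s a hedged sketch using an AWS EC2 Auto Scaling target-tracking policy via boto3; the group name, policy name, and CPU target are assumptions made up for the example.

```python
# Illustrative sketch: a target-tracking policy so the group adds instances
# (scales out) under load instead of resizing one ever-bigger instance.
# Assumes an existing Auto Scaling group called "web-asg" and boto3 credentials.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-2")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",           # hypothetical group name
    PolicyName="keep-cpu-around-50-percent",  # hypothetical policy name
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # Add or remove instances to hold average CPU near 50%.
        "TargetValue": 50.0,
    },
)
```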
Having the right processes around how you scale, making sure you scale in the right way, and making sure you re-architect your application so it does scale in those ways is far more important. And I think that’s just good practice. If you’re following these good practices, yes, it’s a little bit harder work to get your product launched because you’re not just throwing it out there, but it pays dividends later. If you are releasing a service, it’s a good idea to invest quite a lot of time in building up those practices and improving them.
What I’ve done in the past is have one engineer, for example, just ‘getting the job done’ – getting the product released – and then have multiple engineers come up behind that person to do the long-term good work. That’s resulted in very highly-scalable, highly-mature platforms, because you’re doing the hard work but you’re also releasing value to the customers quickly.
What are Internal Developer Platforms and what role do they play in platform engineering?
Here’s a bit of an anecdote: I used to work for a company called Alfresco Software, who make enterprise content management platforms. Traditional Java-type applications, you know, gigabytes and gigabytes worth of code base. Thousands and thousands of tests. They used to release once a quarter, and what they wanted was to be able to release features quicker and sooner. By the time you go through your build pipeline, you find your tests run for hours and hours and hours. And if you’ve got 100 developers releasing 30 features every two weeks, the integration on that is just hell. And if you only have one environment, you’ve got 30 features, plus all the interactions between those 30 features, all at the same time.
One of the things we did there was build an IDP. We ended up building a platform that allowed each developer to spin up a like-for-like copy of production to build and write their own code against. That meant that as they pushed into Git, they built a like-production environment on which they could run all the tests. So before they merge it into the codebase, they, as a developer, have good confidence that they’ve done the right thing. The next bit after that is that it then goes to testing. What it means for the testing team is that rather than having one environment with 30 changes, they have 30 environments that are exactly the same as production, each with an individual change. They can then test every single feature in isolation to make sure that feature is good, and then they can cherry-pick the individual features they want to integrate. They may choose all 30 and integrate them at once and see how it goes. They may choose not to do that. But the important thing here is they have the flexibility to choose.
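The sketch below shows the general idea of per-branch, like-production environments, not Alfresco’s actual tooling: a small Python wrapper drives Terraform so each branch gets its own isolated workspace. It assumes the Terraform CLI is on the PATH and a config that accepts a hypothetical `env_name` variable.

```python
# Illustrative sketch only (not the actual Alfresco implementation): give each
# branch its own like-production environment by driving Terraform from Python.
# Assumes `terraform` is installed and the config accepts an `env_name` variable.
import subprocess


def build_environment(branch: str) -> None:
    env_name = branch.replace("/", "-").lower()
    # One workspace per branch keeps each environment's state isolated.
    subprocess.run(["terraform", "workspace", "new", env_name], check=False)
    subprocess.run(["terraform", "workspace", "select", env_name], check=True)
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var=env_name={env_name}"],
        check=True,
    )


def destroy_environment(branch: str) -> None:
    env_name = branch.replace("/", "-").lower()
    subprocess.run(["terraform", "workspace", "select", env_name], check=True)
    subprocess.run(
        ["terraform", "destroy", "-auto-approve", f"-var=env_name={env_name}"],
        check=True,
    )


if __name__ == "__main__":
    # e.g. called from CI when a developer pushes a feature branch
    build_environment("feature/new-search")
```

Hooked into CI on push, something like this gives each developer a throwaway copy of production to test against before merging.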
What an internal developer platform does for you is give you the ability to choose how you want to do this work, rather than just throwing it over the wall and bundling it up later into a big mess. You can test it individually.
How do you approach security concerns in platform engineering? What are some best practices to adopt?
I think a lot of this comes down to just good practice. One of the advantages of platform engineering is that you almost have this compliance-as-code mentality, which means you’re adding in the security as you go.
For example, let’s say you are doing PCI DSS, or ISO, or some other security standard. One of those requirements is that you need to know who has access to the operating systems when they log in, and you need to make sure things are logged correctly. One of the great things about platform engineering, cloud, and DevOps-type workloads is that you can just automate that: you can take all the users out. When someone asks “Who has access to your operating system?” you can go to your infrastructure as code, you can go to your configuration management, and say “No one has access” or “Only these people have access.”
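As a rough illustration of answering that “who has access?” question from code rather than from memory, here’s a hedged sketch using AWS IAM and boto3; IAM is just an assumed example system, and the report format is made up.

```python
# Hedged sketch: report which IAM users currently hold active access keys,
# so "who has access?" becomes a query rather than a guess.
# Assumes boto3 and AWS credentials are configured.
import boto3

iam = boto3.client("iam")


def list_users_with_active_keys() -> list[str]:
    """Return IAM user names that hold at least one active access key."""
    users_with_access = []
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
            if any(key["Status"] == "Active" for key in keys):
                users_with_access.append(user["UserName"])
    return users_with_access


if __name__ == "__main__":
    users = list_users_with_active_keys()
    print("Users with active access keys:", ", ".join(users) if users else "none")
```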
From a compliance point of view, you have this immutability around who can and can’t do things, which massively helps from an audit point of view. In terms of managing that on a daily basis, you’re setting up your logging so you can see that coming in all the time. You also have, and this is what I like doing, the option of rebuilding your environments consistently (and I mean from scratch, not just running config over the top). That also gives you the guarantee that if anyone did break into your systems, you’ve just deleted it and removed it. So your attack window, the amount of risk you’re carrying, is only as long as it takes you to do a release of code. It’s just one of the reasons why I always prefer to completely rebuild the infrastructure: it gives me confidence that no security concerns are being missed. Every time I release production, production is a new production (with the exception of data, because I don’t think anyone wants to lose their customer data on every release). For the rest of the platform, I think building that compliance into the configuration management is quite important.
One of the other elements I think is important as well is: let’s say you’re in a live incident or dealing with a security-type issue. With a lot of platform engineering, there’s much more emphasis on observability, metrics, and information. You’re able to have a platform where you can easily see “here’s a red alert” because someone logged in and you know that no one should be logging in. You can filter it, you can dig in and say “Okay, well, what’s going on here?” Or maybe there’s a high volume of traffic hitting one of your endpoints. That observability naturally comes with platform engineering.
I think the main challenge a lot of people have is that they don’t quite appreciate the scale and breadth of platform engineering. And it is a journey. It’s not a six-month ‘one and done’ kind of situation; you build it up in layers. There are certainly tools out there that take away a lot of those steps, but they also have limitations. It’s one of those things where you have to build all of those tools in. Being able to access the logs of the application in multiple different ways is quite important as well. With things like ELK, what you do is take a log, chunk it up into little bits, send it into a database, and it indexes it. What that means is you can search on it, and then you can use other tools to say “Okay, well, if I see these patterns, trigger these alerts.” So you can do some interesting stuff, but that’s not always good enough. Sometimes you still need to access the raw logs and see what’s going on. So your platform engineering approach becomes quite encompassing: it basically includes, as much as possible, everything that you would do to run an operating system.
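Here’s a hedged sketch of that ELK-style pattern: once logs are indexed, a pattern can be turned into an alert. It uses Elasticsearch’s standard `_search` HTTP API via the `requests` library; the endpoint URL, index name, search phrase, and threshold are all illustrative assumptions.

```python
# Hedged sketch: count "failed login" log entries in the last 15 minutes and
# raise an alert if there are too many. Endpoint, index, and threshold are
# hypothetical placeholders for whatever your own ELK stack uses.
import requests

ES_URL = "http://localhost:9200"  # hypothetical Elasticsearch endpoint
INDEX = "app-logs-*"              # hypothetical index pattern
FAILED_LOGIN_THRESHOLD = 10

query = {
    "query": {
        "bool": {
            "must": [{"match_phrase": {"message": "failed login"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
        }
    },
    "size": 0,  # we only need the hit count, not the documents themselves
}

response = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
response.raise_for_status()
hits = response.json()["hits"]["total"]["value"]

if hits > FAILED_LOGIN_THRESHOLD:
    print(f"ALERT: {hits} failed logins in the last 15 minutes")
else:
    print(f"{hits} failed logins in the last 15 minutes, within normal range")
```

As Matt notes, searches like this are useful but not always sufficient, which is why access to the raw logs still matters.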
How can IT managers effectively communicate the value of platform engineering to business leaders who may not be as technically savvy?
I always like to talk about disaster recovery (DR) and business continuity (BC) in this sense. One of the cool things about platform engineering, cloud, and DevOps-type workloads is that, as long as you’re doing the automation, you’ve got this immutability. You either are something or you aren’t. There’s no ambiguity around it.
With platform engineering, if you implement it fully, you have the ability to say that your release process for pushing code out is the same process you would use for your disaster recovery and your business continuity. The reason is that when you go into a DR or BC-type situation, you are essentially building a new version of your environment and syncing data. A lot of the time, organisations get into challenges because they can’t remember how they built it, or someone jumped on and made a change to an operating system and that change didn’t quite make it across. But again, if you’re consistently deleting your environments and rebuilding them from scratch, you have very high confidence that the version of the infrastructure that was deployed is exactly the same version you could deploy somewhere else. By going down that platform engineering route and ensuring it always comes out in the same way, you completely de-risk all of those compliance issues.
Obviously, you have to deal with data and make sure the data is transferred correctly. But you know, worst-case scenario: let’s say you’re on Amazon and US East 1 goes down, [you can] just build it in one of the EU regions. They have the same features and the same functionality. It’s a data centre that’s driven by code. As long as the API calls work, you’re gonna get the exact same environment in a different location. So I would always go down that compliance route, because I think it helps hammer home to a non-IT person something that would otherwise be a lot of effort to do correctly. They say “You have to check things”, but actually you don’t, because every time you do a release, you’re checking your DR and your BC. You don’t need to do that anymore.
It’s similar with things like security patching. Traditionally, people would log into systems, update patches, and you’d have this whole archaic way of going “Did I deploy the patches? Did I make sure I applied them? How am I guaranteeing I’m applying the patches?” If you only apply patches when you build an environment, and you rebuilt your environment last week or with every release, then it’s only been a week since you applied patches. The challenge comes when you have the capability to release multiple times a week and then you don’t release for three months. But if you’ve gone to the effort of being able to release your platform every single day, you’d hope you’re releasing at least once every couple of weeks. In which case, again, you no longer have to prove you’re compliant with these security practices, because every time your environment is stood up, you’re applying patches. You’re applying patches on every release, and you did a release a week ago, so that’s the last time you applied patches. So I think there are a lot of ways and means around it.
I remember when things like Heartbleed came out years ago and we had a few hundred servers that needed to be patched. Okay, so we just push a change into configuration management and that’s it. Rebuild the environments – job done. The environments that couldn’t be rebuilt, that’s fine: you just push the change out anyway and they automatically patch themselves. There are a lot of ways of taking that pain away. I think we picked it up at eight o’clock in the morning and by nine o’clock all our systems were patched, because it was a one-line change and you could just push it in. So that utility around repeatability is quite important.
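For illustration only (the real fix would normally go through configuration management, as Matt describes), here’s a rough Python sketch of the “one small change, pushed everywhere” idea; the host names, SSH access, and package command are hypothetical assumptions.

```python
# Hedged sketch: push an OpenSSL upgrade to a small inventory over SSH.
# Hosts and the exact package command are illustrative; distro package names vary,
# and a real setup would drive this through configuration management instead.
import subprocess

HOSTS = ["web-01.example.internal", "web-02.example.internal"]  # hypothetical inventory

PATCH_COMMAND = "sudo apt-get update -q && sudo apt-get install -y --only-upgrade openssl"

for host in HOSTS:
    print(f"Patching {host}...")
    # Assumes passwordless SSH and sudo are already set up for the automation user.
    subprocess.run(["ssh", host, PATCH_COMMAND], check=True)
    print(f"{host} patched")
```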
How do you balance the need for innovation with maintaining stability and reliability in your platforms?
This is a fun one. What it typically comes down to is resourcing. A lot of organisations don’t put priorities around actually doing the innovation; they have a lot of priorities around the maintenance, because that’s what they’re currently doing. So one of the things I’ve done in the past is split my team. Everyone has the same skill set, but they’re split into three buckets:
- Business as usual: Their job is just to keep your platform running and deal with the ad hoc requests that come in, the unplanned things. That becomes a capacity bucket: if I have two people there, we’re only ever going to do two people’s worth of work.
- Mid-term projects: These are the things the company or the business wants to do over the next quarter or two. It might be releasing a new product or something like that. We’d embed engineers into the teams and they would focus on that, and that is helping the business choose what it wants to do over a certain period of time.
- What we’re doing in a year: If we’re looking at maybe building out a data centre, how are we going to build that? Are we going to build our own private cloud, or are we going to do something else? They do a lot of the innovation.
And one of the key things here is that everyone has the same level of experience and the same skill set. So as we hit outages, we would just ask: “Are we responding quickly enough from a BAU point of view? If the platform goes down [or] it’s down for five minutes, is that okay? Is it not okay? Do we need to do more work there or not?” And it very much confines where you have that conversation.
If the business says “We’ve got another 10 projects we need to do” and I only have five people: “Great, we can absolutely do that. Would you like to sacrifice the day-to-day support of your platform by taking people from your business-as-usual team and putting them into the project team?” Typically, most people are sensible enough to say no. And then you go “Okay, that’s great. Would you like to sacrifice your strategic stuff for the whole business to do this short-term thing?” And again, if you’re talking to a CTO, typically the answer is no, because the strategy is still quite important. So then you force this other conversation: “Well, what do you want to do then? Because I can’t magic up more resources. You don’t want to sacrifice your long-term vision, you don’t want to sacrifice your customers, and we just have this pool of (let’s say 10) engineers. Should we hire some more? Should we get some temporary resource?” And you’re bucketing the conversation into terms that make sense.
Nine times out of ten, I’d have that conversation with the CTO and it’d be along the lines of: “Actually, that project is going to be delayed by two weeks anyway, so you can start working on this project instead.”
One of the great ones was unplanned work, you know, when someone who hasn’t followed the process just comes up and says “We need to do this, it’s really important.” You say: “Great. We will do our best efforts in our BAU team. When they’re not fixing the platform, they can pick stuff up, it’s just gonna take forever. If that’s not good enough for you, then go and escalate. Go and talk to someone, because I’m guaranteeing the CTO is not going to sacrifice our ability to support our platform for whatever the current business priorities are.”
I liked having those buckets because it made it really easy to have an actual conversation. So many organisations just say yes to work. You want to say yes because you want to be helpful, but you want to say “Yes, but this is the cost of doing that work.” So you either need more resourcing or you’re going to stop doing something.
I remember one conversation along the lines of: “You know, you’ve got the people to do this.” And I said “If I had no projects to do, I wouldn’t have any people. I don’t have a job in my own right; my job is supporting you. If you don’t exist, my team shrinks down. So if you need more work done, then I need more people.” And I think it’s something people need to get into their heads: IT isn’t just this bottomless pit of delivering work with no real resourcing. We’re people trying to do a job, but our job is to support the business. If the business has no requirements, we don’t have jobs. So we can happily scale our stuff down to maybe one or two people, so that the BAU team is all you need to maintain what you’ve got. The other 50%, 60% or 70% of the team is doing the innovative stuff that the business wants, like the upcoming projects in the future.
I think it’s important that if you’re starting a new project, you account for the new resourcing and don’t just throw it over the wall. Like I said, so many organisations won’t do that prioritisation. They’ll just keep throwing work over. I can think of a very large UK manufacturer that does that, and they wonder why they’re not hitting their deliverables. It’s because they just won’t prioritise work. They won’t say ‘no’ as a leadership team. That’s a big thing: just say ‘no’.