In the late summer of 2023 our Director of Technology, Rob Lazzurs, flew out to San Francisco to speak at HashiCorp’s flagship cloud conference.
The following is the transcript export from the video.
Good afternoon, everyone. Hope you enjoyed lunch. In the beginning, there was this little airline, a couple of guys in a field in Dublin, and they got a plane. This was really early on; in fact, I’m pretty sure these guys were selling tickets before the Wright Brothers had launched a plane. They were innovators, doing this before the second world war, and dreamers. They wanted to get people off the island. The Irish love to travel and see the world. They tried boats, but that didn’t work out, so they thought planes would be the thing. As they started selling tickets, they realized selling tickets is hard, scheduling planes is hard, so they thought they’d get a computer. That means they’ve got two problems now: a computer and a ticketing problem. They got one of the very first computers on the island of Ireland, an IBM Mainframe, in 1964. That computer still runs the airline today, which is impressive, a lot of value delivered from one computer.
Now, about a decade on, in the 1970s oil crisis, things aren't going well. These innovators decided they were going to rent out some mainframe capacity to other airlines. They were doing SaaS before any of us had thought about cloud, or before I was born; early innovators. Fast forward another decade, and they go from one mainframe to two. They want resiliency, so they decide to put one outside the airport and one inside. Initially the application ran on just one mainframe, and because mainframes couldn't yet do failover themselves, they wrote the failover into the application. They did instant failover, realizing it was important that planes keep taking off and they keep selling tickets.
They then started putting in normal Intel-based computers, building business apps, doing the standard thing in data centers. Being innovative, they decided around 2012-2013 to get their website into the cloud; they had been one of the first to sell tickets on their website. They took their data center model with them: one inside the airport, one outside. They picked Amazon and started building for the website, using AWS as just another data center, but without auto-scaling.
Then typical corporate things started happening. Outsourcing seemed like a good idea, so they siloed their networking off into one company. They handed their AWS production account, but not the dev account, to a separate company. The AWS accounts went to the incumbent telco, who dealt with Direct Connect into their data centers, but they were another separate silo. The on-call support wasn't run by the people with the AWS production account but by yet another company. On the development side, there was the mainframe team, off to the side, doing great work. But the rest of the development silos, the people writing Java code and microservices, were split in two: those in the data center and those in AWS. So, when I proposed this talk, I started writing it on May the 4th, with Star Wars playing in the background, A New Hope. This is where I come in.
Hi, I'm Rob Lazzurs from a company called Amach. We specialize in getting airlines into the cloud. I've been playing with computers for money for 24 years. I started in ISPs, building early clouds; I did that because they had the biggest, coolest computers and the biggest, coolest internet connections, and I just wanted to play with big, cool toys. Then I went into the UK government. As a small aside to my talk: if you get the chance in your country to work for the government and do digital work, I can't recommend it enough. Do some public service. My favorite story from working in UK government: I was working for the prison service, and we put laptops into a couple of prisons, giving prisoners content like Khan Academy and Wikipedia for kids. I solved the security problems of getting a CI/CD pipeline into prisons. The cool thing about that government work was what came back from the user research. A user said that thanks to the 'Digital Hub,' he learned to read. If you get a chance, do public service.
For the last four years, I've been in aviation, which has been quite an interesting time. I left the UK at the end of 2019 and started with this airline because of the mainframe. This one computer had been there since 1964 and was still running the airline. I wanted to see and touch this computer that had been doing all of this for all that time. It wasn't easy to touch, but I wanted to. The cool thing about them was all those silos I talked about: they hated them and wanted rid of them. They knew they had made a mistake. They are really interesting people and wanted to get rid of the silos but didn't know how. So, in I come. There was essentially no infrastructure as code. The AWS production account was run by a separate company, so there was some Terraform code, but only in the dev account. Imagine developing a service and deploying it in your dev environment with some code, but then, when going to production, you hand that code over to a company that ClickOps it for you. That's what was happening at this airline.
It's a sad situation, but an easy place to improve upon. We started looking at the landscape in late 2019. Pulumi and AWS CDK were cool things, but we made a choice based not on technology but on people. The people we had knew Terraform. We could have gone with AWS CDK, but they didn't know it. We tried AWS CDK for a few isolated things to see if it would work for them; it didn't, so we went with Terraform. I think that was the right choice. Events like this are valuable, and the pandemic deprived us of all of them. Getting together, listening, and sharing experiences is important because usually we're not going to talk to each other; we're from different companies. I listened to a talk today from someone who does medical records. When am I, in an airline, going to speak to them? I'm not. But at a previous HashiDays in London, I'd heard from Eler, who had done some really cool things with a modules monorepo. I loved their pattern and deployed it in government before I got to the airline. It worked really well there. When starting at this airline, I decided to go down the same path. Terragrunt, a wrapper around Terraform, makes it really easy to handle multiple environments and deploy a module into Dev, Test, and Prod repeatedly. I found that it worked well in my previous place, so I decided to do the same thing. Of course, when starting anything fresh, always look again at best practice. Treat it like you know nothing. Of course, I have 20-something years of experience, but if you just go along with that, you're never going to learn anything.

When deploying this, I looked at the best practice for Terragrunt as it was at that time. I looked at how they laid out the environment trio and copied that. This was the first set of commits I'd made in the repo, called the infrastructure environment repo, the IDER. At the top level, there's a directory for each of the legacy accounts we inherited from the silos and previous structures. The next level down is environments, and under that there are environment separators like UAT one and two. Then comes the deployment. The first module I wrote was for the REST API Gateway for the mobile app; that was the initial structure of the layout. It has evolved significantly since. My colleagues here wouldn't even recognize this structure; it's from about the end of 2019.
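To make the shape of that layout concrete, here is a minimal sketch of an environment repo in that style. The directory names, module source, and inputs are illustrative, not the airline's actual code:

```hcl
# Illustrative layout (not the real names):
#
#   legacy-account-one/                         # accounts inherited from the old silos
#   environments/
#     dev/rest-api-gateway/terragrunt.hcl
#     test/rest-api-gateway/terragrunt.hcl
#     prod/rest-api-gateway/terragrunt.hcl
#   terragrunt.hcl                              # root config shared by every deployment

# environments/dev/rest-api-gateway/terragrunt.hcl
include {
  path = find_in_parent_folders()   # pull in the root remote-state and provider config
}

# The module itself lives in the separate modules monorepo; the environment repo
# only says which version to deploy here and with which inputs.
terraform {
  source = "git::https://example.com/terraform-modules.git//rest-api-gateway?ref=v0.1.0"
}

inputs = {
  environment = "dev"
}
```

The same `terragrunt.hcl` shape is then repeated under test and prod, which is what makes deploying one module into all three environments so cheap.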
We use Atlantis now. When bootstrapping that initial version, me sitting there at my workstation running Terraform, Terragrunt is fine. No one needs to see my plans or applies; Terragrunt is managing the state file. But as soon as you've got a second, third, fourth person, you need transparency. If I wake up at 4 in the morning and something's down because of a deploy someone else did, I want to be able to see that. Atlantis was great for following the same Terragrunt ops model, and very easy for users to use. We've onboarded hundreds of people onto Atlantis and I don't have to write any docs, which is good because I'm terrible at writing documentation. With Atlantis, you don't have to: the front page is the first thing you see when you open it up, and the top of the page is step one on how to use it. You open a pull request against your code repo, in our case this infrastructure environment repo. When you open the pull request, it runs terraform plan, or terragrunt plan in our case. In your pull request, next to the code change, you see what it's going to do: you see your code change in the diff, and in the comment section you get to see exactly what that diff is going to do. There's an example of the plan. It shows you the plan, the code, what it's going to do, and it's self-documenting in the comments, telling you what you need to do next. To apply it, comment on the pull request with atlantis apply. If you want to run the plan again, comment atlantis plan again, or if you want to discard the plan, click the link.
The reason I liked Atlantis was that it was one of the two main options for us at the time: Jenkins or this. I liked that I could isolate it. We started with three accounts; today we've got 70. This thing has full control over all of them, full control over the airline. If someone compromises it, planes don't take off, holidays are cancelled, business trips are cancelled, people don't get to HashiConf. Having a nice, isolated, separate thing that can do all that, and that isn't Jenkins, is kind of cool.
Step three is someone approves your PR, a very important piece in our change process. This gave us the control to enforce policy, allowing us to take 500 developers, an entire organization, from no IaC into IaC, and do it well. We could have a small team, not necessarily writing the code, but always approving the changes. Step four, after the approval, you run atlantis apply. Step five, your apply works perfectly every time, never a failure. We all know that's not true. One of the things I like about the Atlantis model is that, for the people making the changes, this is running against your branch, in your PR. Your main branch isn't polluted with something that didn't work. If the plan or apply fails, which it does, you then get to plan again, maybe make changes.
You make more code changes and work on your PR until it works; only then do you merge it into your main branch. The great thing about this, from a disaster recovery perspective, is that your main branch, in our case the main branch of our infrastructure environment repo, is part of our DR process. If Amazon lost all of our instances, lost an entire availability zone, we could go to the top level of our environment repo, run a terragrunt apply-all, and have our environment, the entire company, back up within a number of hours. So, going back to the IaC code: the last thing I want to do, being super lazy after my 24 years, getting lazier every single day, is write code. I don't want to write Go code, I don't want to write Python code. When it comes to IaC, I love having a co-pilot. I enforced this at the airline and said: you're going to use public modules. There are great public modules out there; someone in the audience has written some of the best, I think, and we use them extensively. We use your modules extensively; the Terraform AWS modules collection is at the top of our list.
We have two recommendations: one, if you're going to do something in Terraform and AWS, go to terraform-aws-modules and use that module. If that doesn't fit for some reason, if there isn't a module to do what you want, look at another company called Cloud Posse. They are a competing consultancy, but I still recommend their work; it's excellent. Go to Cloud Posse; that's our documented policy. Only then, if you can't use a module, if you can't use someone else's code to do what you need, because it's probably already been written, do you consider writing your own module, writing your own code. This made things a lot easier from day one and got us moving much faster. This is the power of community.
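As a flavour of what "use the community module first" looks like in practice, here is a minimal, illustrative call to one of the terraform-aws-modules collection (the VPC module); the names and CIDR ranges are placeholders, not the airline's network design:

```hcl
# Reach for terraform-aws-modules before writing anything yourself.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "example-prod"
  cidr = "10.0.0.0/16"

  # Spread subnets across three AZs (values are illustrative).
  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}
```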
As you saw with the repo, there was an initial service. We had senior support to do what we wanted to do: kill the silos, get rid of the companies in between, not necessarily remove the companies but certainly remove the barriers between the companies, remove the barriers of the silos. We had support to do that, but there's a difference between having support and earning trust in a large organization. An airline relies on support contracts and on-call contracts that, in theory, keep its business up and running. We had to prove competency with the internal team. We had to show that this internal team, this small team of innovators, because everyone at this airline is one, could really run production, could support it, that it would stay up, and that we wouldn't stop the airplanes taking off. So, we did this with two initial services: one was the API Gateway I showed you, the next was something called the Payment Hub, and it is exactly what it sounds like. It's what all the payments go through, a tokenization service. That was the first real service that this airline put into production with Terraform code, with Atlantis, with Terragrunt, with the modules that Anton wrote, and it worked. The airline was taking payments, the service launched, it autoscaled, and it ran across three AZs, which was a first.
We even used a really cool tool called AutoSpotting to be able to use spot instances with it. It worked; we proved competency. So then we had that legacy account setup, and as you all know, if you're doing anything real in AWS, you shouldn't just go and create three accounts that have no association between them and call them things. You should create an AWS Landing Zone. So one of my colleagues, Will Oldwin, a fantastic chap who knows far more about AWS than I do, created our Landing Zone. While I'm off playing around with my Terragrunt and my Atlantis, he's off to the side creating the Landing Zone.
There was a problem with this, though. This is four years ago, remember: the Terraform AWS provider isn't quite where it is today, and we've got all these systems in the data centers that we don't have control over. So we had no authentication service we could use for AWS; there was no single sign-on for the company, it just didn't exist. So we decided to use what was called at the time AWS SSO, now IAM Identity Center, which I hate the name of, but there was no Terraform support for it. So it was the one thing where, sadly, we had to do the thing we hated: we had to ClickOps the centralized authentication model for this Landing Zone. That was painful. One of the things I really like today, four years later, is that the provider is moving much faster, and all of these things are now in the provider. Day one, we're playing with this, we're creating our Landing Zone, doing the first services as part of the production competency exercise. We knew we had to take over on-call from some random company that wasn't really working with us. We had to do alerting, so that's why Opsgenie is on the screen here. Opsgenie has a Terraform provider; that instantly solidified the approach. Not only could we control AWS with Terraform, Terragrunt, and Atlantis, we could also control Opsgenie. We decided to move away from Nexus to Artifactory, mainly because it was cheaper per user, but it also had a Terraform provider. Later there was Azure and GitHub, all controlled by Terraform, all controlled by one mono repo: one mono repo for modules, one mono repo for environment definitions.
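To illustrate what "everything with a Terraform provider" can look like in one place, here is a hedged sketch of a required_providers block pulling in the AWS, Opsgenie, Artifactory, GitHub, and Azure providers, plus an Opsgenie team defined as code; the versions, names, and team resource are illustrative, not the airline's configuration:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
    opsgenie = {
      source  = "opsgenie/opsgenie"
      version = ">= 0.6"
    }
    artifactory = {
      source  = "jfrog/artifactory"
      version = ">= 10.0"
    }
    github = {
      source  = "integrations/github"
      version = ">= 6.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.0"
    }
  }
}

# Example: an Opsgenie team defined as code, next to the infrastructure it pages for.
resource "opsgenie_team" "platform" {
  name        = "platform-team"   # hypothetical team name
  description = "Owns the AWS landing zone and shared tooling"
}
```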
As I said, I started at the end of 2019. It's hard to talk about an airline without mentioning the pandemic. I moved from a super safe government job to a new country, working with an airline, and then suddenly the head of the Irish government announces no one's going anywhere; planes are not taking people to other countries. I'm thinking, "Good move, Rob." My friends in government were working on pandemic stuff, so they were safe. However, one of the things I learned in government: crisis-tunity. Never let a good crisis go to waste. There's always something to be done. During pandemic times at the airline, I said, "You're never going to have another chance like this." Your planes are on the ground; airlines are a 24-hour business. There's always something up there, people working 24 hours a day, making sure things land successfully. Thankfully, computers are not involved in landing.
So, we're in crisis-tunity. We've got this pandemic, the planes are on the ground. Why don't we migrate? Why don't we get all the stuff in those data centers out into the cloud? This is our golden opportunity. Remember, at the start of the talk, I mentioned that 1964 mainframe; there's a cloud for that too. So, we were able to migrate everything to the cloud. It was accepted as a good idea, and the airline is part of a larger group, so the whole group of airlines started doing the same thing. Planes are down, let's shut off the computers and move them.
About that mainframe, there is a cloud for it, IBM Z Cloud. When is a cloud not a cloud? My definition is simple: if you can’t use Terraform to manage it, it’s not a cloud. I emailed the Z Cloud team, asking for their API to spin up LPARs, because I wanted to do Docker containers on my mainframe. They said if I wanted a new LPAR, I needed to email a guy in South Africa. For about five seconds, I thought about writing a Terraform provider for that. Needless to say, I did not. However, the data center migration had to go on.
Skipping the mainframe for a bit. The mainframe team didn't track their bugs in Jira; Jira was newfangled, not interesting to them. They had a Bugzilla instance, and the other team that used it, the GCC, the people that keep the planes in the air, used Bugzilla to talk to the mainframe team and register bugs. We wanted to prove competency by finding the simplest business app that was important and would get some attention, but where, if we messed it up, planes would still take off. We found this Bugzilla instance on a server, version 4.4, installed in 2013, never updated. We wanted to move it, get it away from that network and the VPN, because that's a separate silo: they don't like talking to people, they don't want to make changes. We took it into AWS, put an authentication gateway in front of it, and added AWS backups and AWS monitoring. This was the shiniest Bugzilla instance around, certainly the shiniest 4.4. It became our golden example of how to do an app, a single business app, not one of our Java microservices, not something that's part of the website. You've got a business app? Follow this example. It was one of the great things we were able to do, using community modules to host Bugzilla, and it worked. It was resilient, could run in any of the three AZs. It was the perfect example, the gold standard, for everything else we did.
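A sketch of that single-business-app pattern built from community modules; the module inputs, names, AMI, and subnet IDs here are placeholders, not the actual Bugzilla deployment:

```hcl
# Security group for the app, only reachable from inside, behind the auth gateway.
module "app_sg" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "~> 5.0"

  name   = "bugzilla-app"
  vpc_id = "vpc-0123456789abcdef0"        # placeholder VPC

  ingress_cidr_blocks = ["10.0.0.0/16"]   # internal traffic only
  ingress_rules       = ["https-443-tcp"]
  egress_rules        = ["all-all"]
}

# Auto Scaling group running the app from a pre-baked AMI.
module "app_asg" {
  source  = "terraform-aws-modules/autoscaling/aws"
  version = "~> 7.0"

  name = "bugzilla-app"

  min_size         = 1
  max_size         = 2
  desired_capacity = 1

  image_id        = "ami-00000000000000000"   # placeholder application AMI
  instance_type   = "t3.small"
  security_groups = [module.app_sg.security_group_id]

  # Spread across the three AZs so losing one doesn't take the app down.
  vpc_zone_identifier = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
}
```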
Now, about the network. The Direct Connects were provided by one company and managed by another, and no one wanted to talk to us. If you're doing a data center migration, getting people into the cloud, make sure you can control the network. If you can't get the data out, you're not going to be able to migrate to the cloud. If your pipes are too small and you start copying data, you can take the airline down, which happened, because the networking silo had no monitoring on network capacity. So, pay attention to your network.
Back to developers. This is the worst name I've ever had for any project in my life, but what it is is an internal module collection we made for developers to use. On one side, we've got a team run by Callum, who did the data center migration. For the services developers need, we created a module collection for them: simple auto-scaling, blue-green deployments of whatever you want into AWS. As a developer, you don't have to care about Terraform; feed in some variables, and boom, you get a deployment of your app, all auto-scaled, backed up, monitored, with graphs, all the things you expect. For that, they need Jenkins.
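A sketch of what consuming such an internal module collection could look like from the developer's side. The module path and every variable name here are hypothetical; the point is that a developer supplies a few app-level inputs and the module provides the auto-scaling, blue-green deploys, backups, and monitoring:

```hcl
module "my_service" {
  # Hypothetical internal module, not a real published source.
  source = "git::https://example.com/internal-modules.git//app-deployment?ref=v2.3.0"

  service_name    = "booking-search"                          # hypothetical service
  container_image = "registry.example.com/booking-search:1.4.2"

  min_instances = 2
  max_instances = 10

  health_check_path = "/healthz"
  alert_team        = "booking-squad"   # assumed to be routed to Opsgenie by the module
}
```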
When I started at the airline, they had four separate Jenkins instances, run by different teams, with no configuration as code, no monitoring, no backups. One of the things we did to get the organization into containerization was running Atlantis, Jenkins, and Artifactory in containers. We wanted to dogfood the container experience and teach internal teams how to use containers. Doing that with the public app first isn't the right place; it's going to hit your users, the real people who want to get to Disneyland. Doing it with internal tools first lets the internal people, the internal innovators, scale up on how to do all of that.
What we needed to do was get Jenkins in and start moving towards continuous delivery for these services. We've got Atlantis and you can deploy new versions, but it's PRs, a bit manual, not happening automatically. We decided to get more GitOps about it. Here is an example of one of our Terragrunt definitions. It's a cut-down example, but we decided to keep the whole Jenkins configuration GitOps. We started inserting our hints into these files, so all those variables prefixed with ci_ build Jenkins jobs for deploying updates. You go from this terragrunt.hcl, a cut-down version of the one deploying Keycloak, to a ci.yaml, and then, with some Python magic, that builds into the Jenkins jobs. So, from having a Terragrunt definition for deploying code, with a few extra lines in your file, you now have Jenkins jobs for deploying updated versions of your apps, without users having to build jobs themselves.
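A hedged reconstruction of that pattern, since only a cut-down version was shown on the slide; the module source, input names, and ci_ keys below are illustrative:

```hcl
include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://example.com/terraform-modules.git//keycloak?ref=v1.8.0"
}

inputs = {
  environment = "prod"

  # Ordinary module inputs...
  keycloak_version = "21.1"

  # ...plus CI hints: these do nothing in Terraform itself, but a small Python
  # generator reads them out of the terragrunt.hcl, emits a ci.yaml, and from
  # that builds the Jenkins job that bumps the version and opens the PR.
  ci_deploy_job    = true
  ci_version_input = "keycloak_version"
}
```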
Moving on, the last topic I want to discuss is a GitHub migration. They were on an old Bitbucket server and wanted to switch to GitHub. With the power of Terraform, we managed this for 1,200 repos and 400 users with one file. It’s a bit big at a megabyte, or 30,000 lines, but it defines all the repos. This control over GitHub, similar to our control over production infrastructure, ensures branching policies are in place.
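A minimal sketch of what that looks like with the GitHub Terraform provider; the repository names and the protection rule are illustrative, and the real setup drives 1,200 repos from one much larger definition:

```hcl
# Illustrative set of repositories; the real file defines ~1,200 of them.
variable "repositories" {
  type    = set(string)
  default = ["booking-api", "payment-hub", "infrastructure-modules"]
}

resource "github_repository" "repo" {
  for_each   = var.repositories
  name       = each.value
  visibility = "private"
}

# Branch protection applied uniformly, so every repo gets the same policy.
resource "github_branch_protection" "main" {
  for_each      = github_repository.repo
  repository_id = each.value.node_id
  pattern       = "main"

  required_pull_request_reviews {
    required_approving_review_count = 1
  }
}
```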
The most important tool we deployed is Infracost. With it, every developer knows the cost of their deployments to the airline. It's not their money, but spending too much eventually leads to requests to refactor to reduce costs, which is tedious. Infracost is highly recommended.
Next, we’re looking at Vault and Consul. Currently, we use AWS Parameter Store for configuration, which has limitations, and secrets management is manual. Everything else is automatically deployed, but we manually put secrets into AWS.
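For reference, the Parameter Store pattern today looks roughly like this: Terraform reads configuration from SSM, while the secret value itself is still put there by hand (the parameter name is illustrative):

```hcl
# Read a manually-created SecureString parameter at plan time.
data "aws_ssm_parameter" "db_password" {
  name            = "/payment-hub/prod/db_password"   # placed in SSM by hand today
  with_decryption = true
}
```

Moving this into Vault would let the secret be generated and rotated rather than pasted in, which is the gap the manual process leaves.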
Lessons learned: enforce rules. For bootstrapping an organization like an airline, you need a starting point and a training method. We used Atlantis and pull request approvals to train everyone on best practices in writing Terraform code and deployments.
Adapt to changes. The airline changed significantly after 2019, requiring us to adapt our code and systems. Despite Bugzilla being used by almost no one, its Terraform code became the golden example of deployments.
Never say no in an organization. You’re there to achieve goals. You can’t refuse tasks because of technological implications. Always answer ‘yes, here’s how we’ll do it.’ If you want to enforce a particular coding style or cloud management, don’t just refuse requests. Instead, show how it can be done within the framework.
Lastly, sometimes break the rules. Set and enforce rules, make tools like Atlantis follow them, but also have a ‘break glass’ option. Sometimes, in emergencies like a 2 AM airline shutdown, you’ll need to bypass the rules.
That’s all, thank you for listening.