I’m currently on a short sabbatical from KPMG UK, which gives me plenty of time to get back to tech blogging. Today, from Airlie Beach, Australia (gateway to the Whitsundays), I wrote a short, opinionated essay on my three golden rules for delivering cloud-based technology.
Now, in this case, “delivering technology” can range from ad hoc scripts that pull some metadata from a cloud environment, to new features or even completely new applications.
Regardless of the size of the solution, the same three principles apply: maintainability, security and reliability. They form the basis of all my design thinking.
Simply put, maintainability can be described as how easy it is to keep a tool working properly.
Aim for “no-ops” – I really like the term “no-ops”, where a solution lives its life with little or no human intervention. This nirvana often requires some forward thinking and investment. Done right, it lets your team focus on new features without worrying about keeping old solutions running. Avoid manual intervention as much as possible and automate like your life depends on it!
Complexity is the enemy of maintainability – I always try to keep the solution as simple as possible. The old adage “don’t reinvent the wheel” applies very well here. Use a PaaS/SaaS service if your needs allow, and choose a well-maintained open source library over starting from scratch.
Low barriers to entry – Junior and new team members should be able to contribute to the product seamlessly. This can be achieved with solid documentation, contribution guidelines, and backlog items labelled “good first issue” that new engineers can pick up to start learning. A solution that only one person can maintain is a single point of failure, and it will definitely affect your team’s velocity.
Everything as code! – The best documentation by far is clean code. Make sure all infrastructure is written as code, using a tool such as Terraform or Bicep. This means engineers can easily refer to the topology in a language they understand. Other good candidates include machine images (Packer/Ansible), policies (YAML/JSON), Kubernetes (Helm) and, of course, the source code itself! Ideally the solution should be immutable, i.e. easy to recreate from scratch. If it can’t be, identify the manual steps and raise some backlog items!
Almost every day, my LinkedIn and Medium feeds are flooded with news of more companies falling victim to cloud-based data breaches, usually through social engineering or accidental misconfiguration. Whatever the solution, it pays to keep things secure in the cloud!
Invest in guardrails early – All major cloud providers have a wealth of built-in policies to protect your organization from serious cloud misconfigurations. Many of them are available out of the box, so configure them as a baseline as soon as possible. As your organization matures, it’s worth building a way to deploy these cloud policies as code, so you can easily keep up with new standards.
Security is everyone’s responsibility – Annual e-learning on privileged access is not enough (if your organization even has this!). The threat landscape is always changing, and cybercriminals are getting smarter by the day. Security should be built into each engineer’s goals. Encourage cloud security certifications, pair your threat intelligence function with your engineers, and read regular threat reports, such as those published by the NCSC. Learn from others’ weaknesses and address any gaps of your own.
Watch out for highly privileged accounts – The principle of least privilege is bread and butter for anyone working in cybersecurity, but I’ve seen some questionable configurations in my career. Only assign the permissions the tool/solution requires. If you do need high-level permissions, see if you can pair them with mitigating controls such as conditional access policies, which restrict when credentials can be used, for example only from trusted IP ranges or devices. Microsoft has a cool preview feature in this space.
Keep an eye on secrets – If the solution relies on shared service accounts, make sure to rotate the keys regularly, especially when engineers leave, as they could easily hang on to those keys. A better option is credential-less access, using AWS IAM roles or Azure Managed Identities. Lastly, implement strong secret scanning in your SCM toolset: access keys accidentally committed to a git repo can cause havoc in the wrong hands.
Secure enough – It is worth mentioning that security features can introduce additional cost and complexity. It’s worth having a quick, standardized way to risk assess your technology and apply a reasonable level of control. Don’t over-secure, or you’ll sacrifice maintainability and, in some cases, reliability.
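To make the secret scanning point above a little more concrete, here is a minimal Python sketch of the idea. It is not a replacement for a proper scanner in your SCM toolset: the `SECRET_PATTERNS` table and the `scan_text` helper are hypothetical names I made up for illustration, and the two patterns (the commonly cited shape of an AWS access key ID, and a PEM private key header) are only a tiny sample of what a real rule set would cover.

```python
import re

# Illustrative rules only (an assumption for this sketch); real scanners
# ship far larger and more carefully tuned rule sets.
SECRET_PATTERNS = {
    # Commonly cited shape of an AWS access key ID: a 4-letter prefix
    # followed by 16 upper-case letters or digits.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # Header line of a PEM-encoded private key.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_text(text):
    """Return (line_number, rule_name) pairs for every suspected secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

In practice you would run something like this (or, better, an established scanner) as a pre-commit hook or pipeline step, and fail the build when `scan_text` returns any findings.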
Reliability is about how likely your solution is to fail, disrupting your users’ days and your on-call engineers’ nights.
Design around entropy – Entropy is a scientific measure of uncertainty. Entropy is at its highest when you are making changes to a system, so make sure you have a solid suite of tests that tells you when a change will break the system or cause a regression. Combine this with a deployment pipeline so you can easily roll back any bad changes.
Monitor and react to key symptoms – Unfortunately, most systems have unreliable components, so make sure you can monitor key failure indicators. Ideally, pair these with automated runbooks that treat the symptoms before they cause disruption.
Choose reliable components – This may be obvious, but some components and services are more reliable than others. Check the documentation and ensure the stated availability meets your business needs.
Alert on vital signs – At the end of the day, it’s better for you to catch system failures than for your end users to find them. Make sure you can spot and alert on critical system failures, for example by using health checks or detecting critical job failures. Tie these to the alerting mechanism of your choice and make sure engineers know how to react, so you can maximize uptime.
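The monitoring, runbook and alerting points above fit together quite naturally, so here is a rough Python sketch of the flow: probe a health check, try an automated runbook if it fails, and only alert a human if the symptom persists. Everything here is an assumption for illustration. `HealthCheck`, `run_checks` and the `alert` callback are hypothetical names, and in a real system the probe would call your service’s health endpoint and the alert would go to whatever paging tool you use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealthCheck:
    name: str
    probe: Callable[[], bool]                      # returns True when healthy
    runbook: Optional[Callable[[], None]] = None   # optional automated fix


def _is_healthy(check):
    """Treat a probe that raises as unhealthy rather than crashing the loop."""
    try:
        return bool(check.probe())
    except Exception:
        return False


def run_checks(checks, alert):
    """Probe each check; on failure try its runbook, then alert if still failing.

    Returns the names of the checks that ended up alerting.
    """
    alerted = []
    for check in checks:
        if _is_healthy(check):
            continue
        if check.runbook is not None:
            check.runbook()            # attempt the automated remediation
            if _is_healthy(check):
                continue               # symptom cured, no need to page anyone
        alert(f"{check.name} is unhealthy")
        alerted.append(check.name)
    return alerted
```

The design choice worth noting is the re-probe after the runbook: engineers are only woken up for problems the automation could not fix on its own.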
Business criticality and data sensitivity will naturally determine how much you invest in each area. In my experience, systems become more critical and more sensitive over time, so consider these areas from day one and iterate as you go.
That’s a wrap, folks! I hope you enjoyed reading this article. As with everything, this isn’t an exhaustive list, but these rules have served me well in making cloud design decisions over the years, regardless of the size of the solution.