Freddie Mercury, the irrepressible frontman of rock band Queen, famously asked, “Who wants to live forever?” In our digital, connected world, we have become obsessed with longevity because our systems have limited life spans. We speak in terms of availability, redundancy, and downtime because they reflect the understanding that hardware components are designed with a service life in mind and expected to fail when they exceed the specification – they are most decidedly NOT built to live forever. This paradigm has given birth to a whole new ancillary approach to building products and services around the notion of unexpected failure and the associated disruption.
Platform designers strive to provide redundancy options at the lowest level, in the hardware components, while infrastructure designers do their part by planning and designing solutions that use clusters of machines and resources organized in a highly available topology. Overall, the solution space continues to evolve with newer design styles and architectures that are deployed in different ways. They are all intended, however, to handle the ubiquitous case of failure with near-zero, if not zero, downtime for the application.
Cloud computing has turned the paradigm on its head by adding middleware and software application failure to the mix, which restates the problem. The thinking now is that all components are equally likely to fail; that encompasses hardware, software, containers and apps, and each provides a counterbalancing effect on the others. System architects and designers should factor that in as part of the engineering process.
An example of the current changes to the failure paradigm is the now-common class of compute resources that lets users bid for the resources required to complete a task. Amazon Web Services has led the way, offering a deeply discounted EC2 usage option called Spot Instances. Spot Instances are like traditional EC2 instances but are additionally characterized by a preemption model driven by the cost and availability of the cloud provider's resources. Essentially, the guiding principle is that the user is responsible for identifying the unit of work (task) to be done, and the provider supplies a corresponding platform on which to perform it. Additionally, the offering includes no contract for running the task to completion. The best the infrastructure offers is a guarantee of failure (in this case, an instance termination notification).
The service offering is a win-win for both the provider and the user. The provider stands to gain because the model eliminates excess capacity by improving usage efficiency, which adds to the top line. The user enjoys flexibility and lower operational costs. The user follows a predefined process, often initiated with no human intervention, bidding for the right-sized resource needed to run tasks. If done correctly, this can have a significant impact on the bottom line. Anecdotally, if the task can be defined so that it runs to completion within the termination notification window, you suddenly find yourself freeing up budget to do more of the same at a cost that is lower by an order of magnitude.
This model of compute resource availability adds to the complexity of the overall design because it requires architects and designers to plan for data integrity and consistency, continuous data protection, and session management. The one thing that tips the balance is security, since the scale of operation mandates complete removal of manual configuration and control. The security model adds complexity and is unusual because it tests the true limits of tolerance for faults and unexpected interruptions.
The solution is relatively simple: it requires integration with the cloud provider's notification system so that, upon receiving notice of imminent termination, the application can drain gracefully by persisting data, state and configuration, and then migrate elsewhere. This offers a viable way to execute workloads using excess capacity wherever it is available without breaking the bank. The model is an order of magnitude better than one of unexpected failure.
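To make the drain-on-notice idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function names, the JSON shape of the notice, and the injected `fetch_notice`, `process`, and `checkpoint` callables are all hypothetical. On AWS, for example, a Spot interruption notice is published through the instance metadata service shortly before the instance is reclaimed; a real `fetch_notice` would query whatever mechanism the provider exposes.

```python
import json
from typing import Callable, List, Optional


def parse_notice(body: Optional[str]) -> Optional[dict]:
    """Return the notice as a dict, or None when no termination is pending.

    Assumption: the provider returns an empty response until a termination
    is scheduled, then a small JSON document containing an "action" field.
    """
    if not body:
        return None
    notice = json.loads(body)
    return notice if "action" in notice else None


def run_until_notice(
    work_items: List,
    fetch_notice: Callable[[], Optional[str]],  # hypothetical provider poll
    process: Callable,                          # hypothetical unit of work
    checkpoint: Callable[[dict], None],         # hypothetical durable save
) -> List:
    """Process items one by one, checking for a notice between items.

    On a notice (or when work runs out) the current state is persisted,
    and any unfinished items are returned so they can migrate elsewhere.
    """
    pending = list(work_items)
    done = []
    while pending:
        if parse_notice(fetch_notice()) is not None:
            break  # termination is imminent: stop taking new work
        done.append(process(pending.pop(0)))
    # Persist both results and remaining work before the instance goes away.
    checkpoint({"done": done, "pending": pending})
    return pending
```

The sketch only works if each unit of work is short relative to the notification window, since the notice is checked between items rather than during them.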
Even when shown business use cases that overcome these challenges, users continue to be hesitant. This reluctance to adopt the new models likely stems from a lack of visibility and difficulties with risk assessment, which lead to a lack of confidence in securing company secrets and private data. The model blurs the lines of conventional perimeter defenses such as firewalls and access control. All explorations regress through a similar set of questions: Why? What if? How? It’s the desire to know more that drives innovation and eliminates the fear factor.
Visibility has always been a cornerstone of security: what can’t be seen cannot be secured. That does not mean a threat doesn’t exist simply because it can’t be seen; assuming so is the surest path to a false sense of security.
To address security issues for cloud environments, users need a black box that automatically turns on, recording all activities across all cloud components (accounts, users, apps, containers, processes, files, machines) in addition to the network layer. The instant replay of the recording feeds the unstructured machine learning engine, which hums rhythmically, albeit silently, to monitor, detect, and alert on issues.
Let’s consider some of the commonly seen attack surfaces that require continuous monitoring:
Data exfiltration: This attack surface has been managed effectively using traditional tools of isolation and separation that limit unauthorized users' access to company secrets. However, in a multi-tenant environment like the cloud, that control is weakened by the need for speed in adopting microservices and by dynamism that blurs the lines between computation nodes and data access. Continuous and rapid discovery of risky actions is needed, as well as a modern response protocol for remediation.
Resource abuse: In on-premises data centers, users have built-in safeguards for required resources such as CPU, memory and block I/O. When migrating to the cloud, one must consider using guardrails in all instances that continuously monitor all behavior and classify impactful changes in near real time, while recognizing that not every change is anomalous.
Administration of secrets: Admission control, access control and permissions all require some kind of password or key (session or persistent), one that auditors hyperfocus on as they create a paper trail for compliance certification. The scalable models of the cloud pose a challenge not only in securely distributing such secrets but also in ensuring the secrecy of the secrets themselves. Secrets management requires consolidating secret operations across the full life cycle (creation, access, revocation, rotation and auditing) under a common head as part of the continuous security model.
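As an illustration of what consolidating the secret life cycle "under a common head" might look like, the following Python sketch routes creation, access, rotation and revocation through one object that also writes the audit trail. Everything here is hypothetical, including the `SecretStore` class and its method names: it is an in-memory toy intended only to show the shape of the idea, and a real deployment would rely on a dedicated secrets manager with encrypted storage and access policies.

```python
import secrets
import time


class SecretStore:
    """Hypothetical in-memory store; illustration only, not for production."""

    def __init__(self):
        self._values = {}     # secret name -> current value
        self.audit_log = []   # (timestamp, operation, name, actor)

    def _record(self, op, name, actor):
        # Every operation, including failed access, leaves an audit entry.
        self.audit_log.append((time.time(), op, name, actor))

    def create(self, name, actor):
        self._values[name] = secrets.token_urlsafe(32)
        self._record("create", name, actor)

    def access(self, name, actor):
        self._record("access", name, actor)
        if name not in self._values:
            raise KeyError(f"secret {name!r} is missing or revoked")
        return self._values[name]

    def rotate(self, name, actor):
        if name not in self._values:
            raise KeyError(f"secret {name!r} is missing or revoked")
        self._values[name] = secrets.token_urlsafe(32)  # replace old value
        self._record("rotate", name, actor)

    def revoke(self, name, actor):
        self._values.pop(name, None)
        self._record("revoke", name, actor)


# Usage: one path for every operation, so the audit trail is complete.
store = SecretStore()
store.create("db-password", actor="deploy-pipeline")
value = store.access("db-password", actor="billing-service")
store.rotate("db-password", actor="deploy-pipeline")
store.revoke("db-password", actor="security-team")
```

Because no caller touches a secret except through the store, the audit log doubles as the paper trail that compliance reviews ask for.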
In the true spirit of serving the larger cause of securing applications, machines are employed to theorize about and test any and all events. The notification models ought to be trained to seek user intervention only when suspicious activity, an attack or a breach is identified, based on the processing, learning and continuous refinement of high-quality signals at every layer of the cloud ecosystem, all in near real time. That removes the fear of adoption and makes the cloud a truly enterprise-grade, scalable infrastructure without uncontrolled operating costs.
Rakesh Sachdeva is a founding engineer at Lacework, responsible for engineering and architecting the collection of observations required to deliver effective security to cloud environments, whether they are based on a traditional architecture or make use of VMs or containers.