PayGo is an integrated utility payment solution provider that manages the largest energy company prepay programs in the United States. The company currently runs four production environments in AWS, with another coming online soon, on SQL Server 2017 Standard Edition and Windows Server 2012 R2.
The company plans to migrate to Windows Server 2019 after testing is completed.
The Tech Decision
“Our backend SQL Servers hold terabytes of data that must be available 24×7,” says Chad Gates, senior director of infrastructure and security, PayGo.
With AWS’s lack of shared storage, PayGo was forced to use SQL Server’s transaction logging and log shipping to protect the data. Although requiring manual intervention, this approach was acceptable for disaster recovery (DR) purposes.
But it could not provide the rapid, automatic failover capability needed to ensure high availability (HA) for the company’s mission-critical applications. “We had another option, but we believed there were more cost-effective solutions,” says Gates.
“We could use the Always On Availability Groups feature in SQL Server Enterprise Edition, but that would cost us hundreds of thousands of dollars that could be spent on other mission critical initiatives. We felt there must be a better solution, so we started looking for other options.”
In its search for a capable and cost-effective HA solution, PayGo established four criteria:
- seamless integration with Windows Server Failover Clustering (WSFC)
- high disk throughput performance to satisfy demanding recovery point and time objectives
- ease of implementation and dependable ongoing operation
- responsive technical support from the vendor
The SIOS DataKeeper Solution
Receiving a recommendation to look at SIOS, Gates says, “SIOS DataKeeper Cluster Edition overcame the problem caused by the lack of shared storage. Its use of a mirrored drive looks like shared storage to the WSFC.
It was exactly what we wanted.” SIOS DataKeeper also met PayGo’s other three criteria better than any other solution considered.
PayGo first installed SIOS DataKeeper SANless Clustering software in its own private cloud, and later migrated the configuration to AWS.
“Because SIOS DataKeeper supports private, public and hybrid cloud environments, we migrated the entire configuration, including all application software and data, easily and without any issues,” Gates says.
PayGo currently has two SQL Server nodes in each of its four SANless HA clusters. To provide protection against localized failures, the servers are deployed in separate Availability Zones. And to ensure high transactional throughput performance, each server has two network interfaces with one dedicated to SIOS data replication.
The SANless clusters employ synchronous data replication through the sub-millisecond (ms) latency connectivity AWS delivers between Availability Zones.
The Impact on PayGo
SIOS DataKeeper met and exceeded PayGo’s high expectations for a high availability solution, including ease of installation and operation, and responsive support.
“We have been using SIOS DataKeeper for several years now, and it has proven to be the most rock-solid piece of software we have,” Gates says.
Given its proven operation, including during actual failures, the IT team has minimized the ongoing testing needed for its production SANless clusters. The clusters are now tested only after changes are made to any of the hardware or software, which are scheduled on a monthly basis. The test itself consists of a simple failover and failback.
PayGo also upgrades only one node at a time in each cluster to simplify roll-back, if needed. With SIOS DataKeeper performing so well, the only reason PayGo now has for upgrading to SQL Server Enterprise Edition would be outgrowing the Standard Edition’s database size limitation.
The IT team at PayGo is currently considering adding DR protection to the HA clusters by deploying a third node in a separate AWS region. The distance involved in this case (between datacenters in Virginia and Ohio) results in a latency of 12-13 ms.
While that requires asynchronous replication to ensure high throughput performance in the active node, the combined HA/DR solution would recover much more quickly than is possible with log shipping.
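To see why the longer link points toward asynchronous replication, a rough back-of-the-envelope sketch helps. It assumes a deliberately simplified model in which a single writer issues one write at a time and, under synchronous replication, waits out the full replication latency before each write completes. The function name and the 0.5 ms figure (a stand-in for "sub-millisecond") are illustrative assumptions, and the 12-13 ms figure is treated here as the added wait per write. Real workloads batch and parallelize writes, so actual throughput would differ.

```python
def max_serial_writes_per_sec(replication_latency_ms: float) -> float:
    """Upper bound on writes/sec for ONE serialized writer that must wait
    for the remote acknowledgement before each write completes.

    Simplified model: ignores local disk time, batching, and concurrency.
    """
    return 1000.0 / replication_latency_ms


# 0.5 ms is an assumed stand-in for "sub-millisecond" latency between
# Availability Zones; 12.5 ms is the midpoint of the 12-13 ms quoted
# for the Virginia-Ohio link.
scenarios = [
    ("between Availability Zones (assumed 0.5 ms)", 0.5),
    ("Virginia to Ohio (12.5 ms)", 12.5),
]

for label, latency_ms in scenarios:
    rate = max_serial_writes_per_sec(latency_ms)
    print(f"{label}: about {rate:.0f} serialized writes/s")
```

Under these assumptions the ceiling drops from about 2,000 to about 80 serialized writes per second when every write waits on the cross-region link. Asynchronous replication avoids that wait on the active node, at the cost of the remote copy lagging slightly behind.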