Wednesday, November 11, 2020

Case Study - Global Trading Platform

 

BACKGROUND

Global Trading Platform is a Service utilized by firms that require their own trading platform with multi-regional market access and advanced trading capabilities.

Key features:
-       Hosted deployment delivered out of a series of globally distributed platform hubs
-       Facilities co-located with major exchanges for direct low-latency access to crossing engines and market data feeds
-       Managed connectivity between the Client and the Provider datacenters
-       Real-time audit streams available for consumption by client risk and downstream systems
-       Report and data extract suite enabling clients to meet their compliance obligations
-       Front end application deployed on customer devices

OBJECTIVE

To establish the feasibility of hosting the Service in a public cloud.

Perceived benefits:
-       Reduction in the overall cost of ownership - with a corresponding improvement in the service profit margin
-       Faster time to market through the use of an elastic hardware pool and established DevOps tools
-       Improved integration with Client systems hosted in the cloud

TECHNICAL OVERVIEW

-       The Global Trading Platform Service is delivered out of a series of regional hubs. A hub comprises a pair of datacenters. Hubs are interconnected by a resilient internal network backbone (GWAN)
-       Each hub features extensive cybersecurity controls (firewalls, switches, DDoS prevention, etc.). Each customer platform instance is run in a segregated and fully controlled environment
-       Global Trading Platform is architected to operate with a Primary Server and a corresponding Hot StandBy Server. All transactions processed by the Primary are synchronously replicated onto the StandBy. In a failover scenario the StandBy seamlessly assumes the role of the Primary
-       Production systems are run on dedicated bare-metal hardware running Linux. UAT and Development systems are run on guest VMs. Hardware purchasing, mounting, configuration, OS management and maintenance are the responsibility of the Provider
-       The Platform consumes live market data as well as instrument static data from the internal Market Data Source
-       Market connectivity is the responsibility of the Provider and is achieved through the use of co-location (or proximity location when co-lo isn't available) facilities
-       Physical connectivity between the Platform Service and the Client DCs is managed by a third party connectivity provider. Configuration and provision of routing devices is the responsibility of the Provider. The double-triangle network architecture is recommended for resilience
-       The Platform front end application is remote-installed onto on-premise client PCs. Encrypted communication between the client application and the server is achieved over dedicated comms lines. User authentication is performed by a dedicated platform module
-       Capability exists for front end access via the public internet. All communication is encrypted and access is authenticated on the Platform side
-       Platform upgrading and patching is performed by the Provider during market downtime

Platform Interfaces
-       Below is the list of interfaces included in a typical deployment. Capability exists to support additional custom interfaces


ENTER CLOUD

The three attributes of the Service that a Client perceives as the most valuable are Performance, Reliability and Security. A change to the underlying infrastructure needs, at minimum, to have no detrimental effect on any of these metrics. An improvement in service is most desirable.

A full architectural re-design of the application itself is kept out of the analysis scope at this time. The current approach includes replacement of hardware, DC space and surrounding networking with those provided by Amazon Web Services (AWS).

Cloud security

AWS Shared Responsibility Model defines AWS as being responsible for security of the cloud and the AWS customer as being responsible for security in the cloud.

Security of the cloud implies the security of AWS facilities, hardware, network and virtualization infrastructure. AWS offer a set of compliance reports via the AWS Artifact service. For example, SSAE 18 (SOC 1 and SOC 2) documentation is available through Artifact.

Security in the cloud implies the security of the OS, applications, data in transit and at rest, credentials, etc. 
AWS offer a great deal of granularity in regard to user access to services. Console, CLI and API availability is controlled via access keys and optionally MFA; X.509/SSL/TLS data encryption is supported; Active Directory services are available. Individual permissions (policies) for access to data and resources can be set on the user and the group levels via AWS Identity and Access Management (IAM). Encryption keys necessary for security of data in transit and at rest can either be imported or obtained from AWS Key Management Service (KMS) and optionally stored using the AWS hardware security module (HSM) offering. Data store services provide additional at-rest encryption options.
Network level security is controlled via AWS Virtual Private Cloud (VPC) configuration by use of Security Groups and Network Access Control Lists (NACLs). Network address translation is facilitated by NAT gateways. Numerous firewall and DDoS protection options exist - AWS WAF, AWS Shield, AWS Firewall Manager, AWS GuardDuty and more.
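As an illustration, group-based IAM permissioning of the kind described above could be scripted via boto3 along the following lines; the group, policy and user names are hypothetical and the policy scope is an example only, not a recommended Platform permission set.

    import json
    import boto3

    iam = boto3.client("iam")

    # Hypothetical group for Platform development staff
    iam.create_group(GroupName="platform-dev")

    # Narrow example policy: read-only EC2 describe calls
    policy = iam.create_policy(
        PolicyName="platform-ec2-read-only",
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": ["ec2:Describe*"],
                "Resource": "*",
            }],
        }),
    )
    iam.attach_group_policy(GroupName="platform-dev",
                            PolicyArn=policy["Policy"]["Arn"])

    # Users are added to the group rather than granted permissions individually
    iam.create_user(UserName="dev.user1")
    iam.add_user_to_group(GroupName="platform-dev", UserName="dev.user1")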

AWS offer a wide array of security features and options that appear suitable for hosting the Global Trading Platform Service.

Direct Connect and Production environment

For the purposes of the analysis an assumption is made that compliance requirements and technicalities of hosting sensitive client data in the cloud are fully satisfied. 

AWS infrastructure is operated out of data centers globally - currently spanning 77 Availability Zones within 24 geographic regions. The exact DC addressing is not advertised, but AWS Direct Connect service offers an extensive list of connection locations - close to 100 and growing. AWS customers that do not have equipment at these locations can work with AWS Partner Network vendors to establish network connectivity between AWS Direct Connect locations and their data centers.

AWS provide SLAs of 99.99% target uptime on Compute and up to 99.99% on Direct Connect services, with appropriate service credits applicable, which is sufficient to meet the Trading Platform Service production-level requirements. Compute resources are segregated by Availability Zones (AZ) within a region. Further granularity is achieved by provision of Local Zones - a type of AWS infrastructure aimed at bringing a subset of AWS services (such as EC2 and VPC) used for latency-sensitive workflows closer to the user. A set of unofficial latency stats gathered using ping provides insight into intra-Local AZ ballpark latency. Distributing a Primary / StandBy pair across adjacent Local AZs within a region could replicate the current multi-DC hosting approach. However, the AWS definition of a Local AZ offers no guarantee that there is no overlap in facilities across zones - i.e. that multiple Local AZs aren't hosted in the same DC. In addition, no inter-AZ latency SLA is available.

A theoretical possibility of losing an entire AZ along with all regional Primary and StandBy pairs with no possibility to failover still exists. As an example, a monthly Compute service (ex: EC2) uptime of 95.1% corresponds to roughly 35 hours of outage over a 30-day month (720 h x 4.9%), yet attracts only a 30% AWS service credit. A further exploratory conversation with AWS would be required to finalize the approach to production-level hosting.

Exchange co-location is key to achieving network speed required for latency sensitive trading systems. While there is sufficient overlap in Direct Connect and exchange facility locations, gaps do exist. For example, London Metal Exchange (LME) is based in Interxion DC in Greater London, UK while AWS Direct Connect is available in Digital Realty Docklands.

Intercontinental Exchange (ICE) matching engine is hosted in 350 East Cermak in Chicago, IL, USA with co-lo space available on the 2nd floor of the building. Equinix Chicago CH1, CH2, & CH4 are in the same building on floors 5, 6 and 8. A partnership exists between AWS and Equinix for direct delivery of 1, 2, 5, and 10 Gbps AWS Direct Connect Hosted Connections.
Chicago Mercantile Exchange (CME) offer co-location services from their data center in Aurora, IL, 35 miles out of Chicago. AWS do not have a Direct Connect option in Aurora. Thus a firm willing to tolerate proximity to ICE but requiring a CME co-lo would have to consider a non-AWS hosting option.

A potential single point of failure in the form of an AZ outage is a significant concern. In addition, a direct conversation with AWS and extensive stress and latency testing would be required before hosting a production system in AWS cloud is considered. A client's trading profile and latency tolerance would be the contributing factors.

A reasonable way to progress is to explore cloud hosting of development, QA and continuous integration systems.

Compute Services and Non-latency sensitive systems

With the Trading Platform front end application installed on a PC locally, it is the server side infrastructure that requires migration. 

The machine resources most utilized by the Trading Platform are primarily CPU and RAM and secondarily Disk. Large amounts of static data are loaded into memory on start up and updated in real time. The results of transaction processing are written into memory as well. Data from memory is saved to disk when appropriate. 

To estimate the optimal machine spec, tests should be performed to determine the footprint of a non-production instance of the Platform. A development instance should include instrument static data for a limited number of markets most often traded by the customer base. A UAT system should include all markets traded by the customer whose Platform instance is being tested. The disk-based database size should also be estimated, with the anticipated transaction volume and the data retention period being the size-defining factors.
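As a back-of-envelope illustration of the database sizing factors mentioned above, a rough calculation might look as follows; every figure is a placeholder assumption, not a measured Platform number.

    # Placeholder assumptions - to be replaced with measured Platform figures
    avg_transactions_per_day = 200_000   # anticipated transactional average
    avg_bytes_per_transaction = 2_048    # serialized transaction record size
    retention_days = 90                  # data retention period

    db_bytes = avg_transactions_per_day * avg_bytes_per_transaction * retention_days
    print(f"Estimated disk-based database size: {db_bytes / 1024**3:.1f} GiB")  # ~34.3 GiB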

The AWS Elastic Compute Cloud (EC2) service provides server processor and memory resources and allows for configuration of security and network access. EC2 instances are essentially virtual machines with pre-configured computing power and OS. EC2 uses Xen, KVM and the AWS-own Nitro hypervisors for different instance types.

A number of EC2 instance types with different compute, memory, and storage capabilities are available. The type determines the hardware of the host and the cost of an instance:
-       General purpose - provides a balance of compute, memory, and networking resources
-       Storage optimized - designed for workloads with high sequential reads and writes on large data sets
-       Accelerated computing - uses hardware accelerators (e.g. GPUs) to provide high processing capability
-       Memory optimized instances - designed to deliver fast performance on large in-memory data sets

General purpose instances offer the combination of resources and price most suitable for non-production workflows. A number of General purpose instance families exist (A, T, M) - each with an emphasis on different capabilities such as the ability to burst CPU usage or scale out horizontally. The M family is the most suitable for workloads that have consistent behavior.

M5 (the fifth generation of M) contains four instance sub-families with different combinations of processor and instance storage: M5, M5d, M5a, M5ad. M5 and M5d run on Intel Xeon Platinum 8000 processors; M5a and M5ad run on AMD EPYC 7000 processors. The options are comparable in terms of speed, yet AMD comes at a lower price.
EC2 offers two storage options - Instance Store and Elastic Block Storage (EBS). Instance Store stands for disk physically attached to the host and EBS stands for a network-attached SAN or NAS. Instance Store-based M5d and M5ad offer lower latency at a higher cost compared to EBS-based M5 and M5a. Amazon Linux and Red Hat Linux are among the OS options supported on M5, which makes it suitable for Trading Platform hosting.

AWS are offering the M6g type that features their own ARM-based Graviton2 processor, advertised to deliver up to 40% better price/performance over the current-generation M5. This is an option worth exploring - especially in development environments. M5 can be used for initial proof-of-concept.
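For an initial proof-of-concept, launching a single M5 instance via boto3 could look roughly like the sketch below. The AMI, subnet and security group IDs are placeholders that would come out of the VPC build-out described later in this document.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",          # custom Platform AMI (placeholder ID)
        InstanceType="m5.xlarge",                 # or "m5a.xlarge" for the AMD variant
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",      # private subnet with no internet access
        SecurityGroupIds=["sg-0123456789abcdef0"],
        BlockDeviceMappings=[{
            "DeviceName": "/dev/xvda",
            "Ebs": {"VolumeSize": 200, "VolumeType": "gp2"},   # EBS root volume, GiB
        }],
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "platform-dev-poc"}],
        }],
    )
    print(response["Instances"][0]["InstanceId"])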

AWS offer the following EC2 pricing models:
-       On-Demand – requires no commitment, can start/stop an instance at any time and pay an hourly rate
-       Reserved – requires commitment to usage for a given time period in return for a discount
-       Spot – bid for spare EC2 capacity and keep the instance until the spot price exceeds the submitted bid

Reserved pricing would be most suitable for continuous development work environments. QA / UAT systems that get utilized over a limited period of time, such as customer pre-go-live or release testing, would run on On-Demand pricing.

Depending on the Platform preference for Intel or AMD, m5.xlarge and m5a.xlarge would be the optimal choices. Comparative pricing (USD) for On-Demand instances is below.



The relationship between resources and pricing is linear, as evident from above. This indicates that it makes practical sense to tailor a machine to host exactly one instance of the Platform application. On-Demand EC2s do not incur compute charges when not powered up. Co-hosting multiple platform instances on a single host would require the entire host to be up and chargeable when only a single platform instance is required to be running.
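The pay-only-while-running behavior can be exercised with a couple of boto3 calls; the instance ID below is a placeholder for a hypothetical development environment. Note that the EBS root volume (and its data) remains and continues to be billed at storage rates while the instance is stopped.

    import boto3

    ec2 = boto3.client("ec2")
    instance_ids = ["i-0123456789abcdef0"]   # placeholder dev-instance ID

    # Stop outside working hours - compute charges cease
    ec2.stop_instances(InstanceIds=instance_ids)

    # Re-start when the environment is needed again
    ec2.start_instances(InstanceIds=instance_ids)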

Further fine-tuning decisions can be made by monitoring CPU, RAM and network utilization past the initial stages of the project.

Connectivity revisited

AWS support VPN and Direct Connect as the options for access to the services.

VPN enables encrypted communication between AWS and on-premise locations over the public internet. VPN connections incur hourly as well as data transfer charges. The advantage is the ease and speed of set up. Whereas a dedicated line install involves collaboration with a third-party provider, a VPN can be set up directly via the AWS Console, Command Line Interface or API. The obvious disadvantages are the variable connection speed and the non-private nature of the public internet.
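A site-to-site VPN of this kind could be scripted roughly as follows; the gateway IP, ASN and VPC ID are placeholders for the Provider-side device and the Platform VPC.

    import boto3

    ec2 = boto3.client("ec2")

    # Customer gateway representing the Provider-side VPN device
    cgw = ec2.create_customer_gateway(BgpAsn=65000,
                                      PublicIp="203.0.113.10",
                                      Type="ipsec.1")

    # Virtual private gateway attached to the Platform VPC
    vgw = ec2.create_vpn_gateway(Type="ipsec.1")
    ec2.attach_vpn_gateway(VpcId="vpc-0123456789abcdef0",
                           VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"])

    # Site-to-site VPN connection between the two
    vpn = ec2.create_vpn_connection(
        CustomerGatewayId=cgw["CustomerGateway"]["CustomerGatewayId"],
        VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"],
        Type="ipsec.1",
        Options={"StaticRoutesOnly": True},
    )
    print(vpn["VpnConnection"]["VpnConnectionId"])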

AWS Direct Connect implies private connectivity between AWS and on-premise. Two flavors of Direct Connect are available: 
-       Hosted - sized 50Mbps up to 10Gbps and shared by multiple AWS customers; available from specific AWS Direct Connect Partners only
-       Dedicated - 1Gbps or 10Gbps physical Ethernet ports dedicated to a single customer
The charges are for port hours and outbound data transfer. No minimum charge exists. Sample port charges are $0.30/hour for up to 1G transfers and $2.25/hour for up to 10G. Data transfer fees vary per connection point. For example, Equinix CH2 in Chicago is at $0.0200/GB.
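Using the sample rates above, a back-of-envelope monthly estimate for a 1G Dedicated port terminating at Equinix CH2 might look as follows; the outbound transfer volume is an assumption for a UAT/Dev workload.

    port_hours_per_month = 730        # 1G port kept up for the full month
    port_rate_1g = 0.30               # USD per port-hour, up to 1G
    outbound_gb_per_month = 500       # assumed outbound data transfer
    transfer_rate_ch2 = 0.02          # USD per GB at Equinix CH2

    monthly_cost = (port_hours_per_month * port_rate_1g
                    + outbound_gb_per_month * transfer_rate_ch2)
    print(f"~${monthly_cost:.2f}/month")   # ~$229.00 with these assumptions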

If sensitive customer data is present on the system, it is likely that only a Dedicated option would be appropriate. Otherwise, a connection can be established over a Hosted solution.

The Trading Platform data transferred to and from the server (and thus affecting the overall bandwidth required) includes:
-       front end application messaging - transactions and query responses; this is the primary bandwidth consumer - a factor of the total user count
-       market test environment or simulator connectivity - unlikely to be bandwidth-intensive
-       real-time market data - consumed from the Provider-hosted on-premise source; can be quite large in size depending on market volatility and the number of exchanges traded
-       real-time data streams - FIX-protocol communication, fraction of the above
-       API - can vary in size depending on the nature of API interaction - transactional vs. query or stream subscription; unlikely to be high on a constant basis in UAT or Dev outside of periods of stress testing

Direct Connect remains the optimal option for connectivity between the Platform Services Provider and AWS. VPN is suitable for a proof-of-concept exercise.

Virtual private cloud

The VPC configuration recommendations include:
-       Multiple Availability Zone (AZ) deployment - Primary server co-located or proximity-located to in-scope trading venues; StandBy hosted in the same AWS Region yet in a different Availability Zone
-       Private subnets with no Internet access to be used to host EC2 instances
-       Security Groups and NACLs configured to allow encrypted bidirectional traffic from specific Provider DC IPs and ports only (IPv4 and IPv6 supported). This can be extended to include IP ranges post-proof-of-concept stage. Primary and StandBy can be placed in distinct Security Groups to further enhance access level granularity. The Security Groups should be aware of each other to allow for Primary-to-StandBy transaction replication. Rules to allow interface communication should similarly be limited to specific protocols and ports (ex: sFTP/22 for transfer of report files). A minimal configuration sketch follows this list
-       AWS NAT Gateway service employed to enable remote user internet access to cloud-hosted Platform service. At minimum a single gateway per Availability Zone should exist. Each gateway should reside in a public subnet. Security Group rules should be configured to allow EC2/NAT communication
-       Elastic IP addresses should be allocated and attached to the NAT Gateways. These  addresses should be shared with the on-premise IT as the fixed source/destination public IPs for cloud-hosted Platform traffic
-        Dedicated IAM users (non-account-root) to be created for accessing the cloud. IAM User Roles should be configured to control user access to AWS resources - EC2, S3, EBS for data and application image storage. Care needs to be taken not to over-assign permissions
-        Amazon CloudWatch should be used to monitor the VPC components and VPN connections. Aside from monitoring the health of AWS resources, CloudWatch can be used to collect and parse the Platform logs and alert application support if an action is required
-        Flow logs can be used to capture information about IP traffic going to and from VPC network interfaces. Logs can be sent to CloudWatch for monitoring and alerting
-        AWS Premium Support plans include access to the Trusted Advisor tool that can be used to provision resources following AWS best practices. A free option with limited features is available
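As a minimal illustration of two of the recommendations above - a locked-down Security Group and a per-AZ NAT Gateway with an Elastic IP - a boto3 sketch might look as follows. The VPC ID, subnet ID and Provider DC address are placeholders.

    import boto3

    ec2 = boto3.client("ec2")
    vpc_id = "vpc-0123456789abcdef0"          # placeholder Platform VPC

    # Security group for the Primary server; ingress limited to a specific Provider DC address
    sg = ec2.create_security_group(GroupName="platform-primary",
                                   Description="Platform Primary server",
                                   VpcId=vpc_id)
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,                   # ex: sFTP for report file transfer
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "198.51.100.25/32",   # placeholder Provider DC IP
                          "Description": "Provider DC report interface"}],
        }],
    )

    # NAT gateway in a public subnet with an Elastic IP - one per Availability Zone
    eip = ec2.allocate_address(Domain="vpc")
    ec2.create_nat_gateway(SubnetId="subnet-0fedcba9876543210",   # placeholder public subnet
                           AllocationId=eip["AllocationId"])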

VPC charges are hourly per access endpoint and per GB of data transferred. For example, data transferred over AWS PrivateLink (between VPCs, AWS services, and on-premises) is $0.01 per GB. The endpoint charge is $0.01 per hour. NAT Gateway charges are similarly per hour and per GB of bandwidth - $0.045 each. CloudWatch charges are per metric and for associated log storage. The service offers a free tier that should be sufficient for proof-of-concept purposes. Fees vary per region.

Provider DC and Office-side firewall rule changes should be made to expose only specific IPs/ports (or ranges) pertaining to Platform front end instances and shared on-premise services. 

VM import and storage

AWS support the concept of an Amazon Machine Image (AMI) - a template containing the information required to launch an EC2 instance. AWS offer an extensive selection of ready-made AMIs - both free and chargeable. A custom AMI can be created off an imported VM image. The VM Import/Export tool, available via API and CLI, can be used to migrate in from VMware, MS Hyper-V, Citrix Xen and MS Azure images. Export is supported only for EC2s launched from images originally imported, not for EC2s launched from AWS AMIs.

A custom VM image is first uploaded onto Amazon Simple Storage Service (S3) - AWS internet storage service. The Import tool is then used to migrate it into an AMI and launch an EC2 instance. 
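A rough boto3 sketch of this upload-and-import flow is below; the bucket name, object key and image file are hypothetical.

    import boto3

    s3 = boto3.client("s3")
    ec2 = boto3.client("ec2")

    bucket = "platform-vm-import"             # placeholder bucket name

    # 1. Upload the exported VM image (e.g. a VMware VMDK) to S3
    s3.upload_file("platform-uat.vmdk", bucket, "images/platform-uat.vmdk")

    # 2. Start the import; the resulting AMI can then be used to launch EC2 instances
    task = ec2.import_image(
        Description="Platform UAT image",
        DiskContainers=[{
            "Format": "VMDK",
            "UserBucket": {"S3Bucket": bucket, "S3Key": "images/platform-uat.vmdk"},
        }],
    )
    print(task["ImportTaskId"])
    # Progress can be polled with ec2.describe_import_image_tasks(ImportTaskIds=[...])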

Alternatively, a VM can be imported as a disk snapshot. Similarly, a snapshot is uploaded onto S3. It is then used to create an Amazon Elastic Block Store (EBS) snapshot and subsequently an EBS volume that is attached to an EC2 where the software is launched. An AMI is created and used to launch subsequent volumes.

AMI images are categorized as Instance Store-backed and EBS-backed. As mentioned previously, Instance Store stands for disk physically attached to the host and EBS stands for a network-attached SAN or NAS. The AMI category defines the root volume created from an image and subsequent instance launch mechanism. Instance Store-backed AMIs are loaded from S3 onto Instance Store backed EC2s. EBS-backed AMIs are extracted as snapshots onto EBS volumes. The primary difference is data persistence and speed: data stored on Instance Store volumes is not persistent through instance stops, terminations, or hardware failures; EBS volumes do preserve their data through instance stops and terminations. However, use of EBS over network implies additional latency.

In scenarios where a run requires no durable data, a Platform instance can be loaded and initiated from an image stored in S3. This would require the image to contain at minimum the data set required for the services to come up. The start time is likely to be longer compared to launching from an EBS root volume, but the cost of S3 storage is significantly lower than that of EBS.

Using EBS as the root volume allows for data persistence and the flexibility of being able to stop an EC2 and subsequently re-launch it to operate on the data left intact. This implies savings over a continuously running EC2, as the root EBS volume is charged separately from the EC2 uptime. In addition, Platform launch time is likely to be significantly shortened when a readily available runtime data set is used.

General Purpose SSD (gp2) EBS Volumes are charged by AWS at $0.10 per GB-month of provisioned storage. EBS Snapshots, a point in time copy of block data, are at $0.05 per GB-month of data stored. S3 offers options that vary in retrieval speed and data availability. Extraction to EC2 is free of charge. S3 Standard used for general-purpose storage of frequently accessed data is charged at $0.023 per GB for the first 50 TB / Month.

Both S3 and EBS can be pursued as storage and launch options depending on the Platform use case.

Containers

Docker, an open source container management platform, should be considered as a possible tool option for containerization and deployment of the Platform server software onto EC2 instances. Docker is more often used in microservice environments, but a monolithic application can nevertheless benefit from the technology.

The preferred way to model this would be to create distinct layers for platform components: 
-        OS with all dependencies replicating the on-premise environment
-        platform core
-        regional or client-specific configuration
The layering capability of Docker would prove useful when rolling out platform core patches or client customization updates, helping to avoid rebuilding the entire stack. Primary and StandBy can be run as a Swarm - a pair of VMs running the Docker engine configured to join together in a cluster. Since networking devices are virtualized within a Docker container, port conflicts with Swarm nodes competing for resource access should not be an issue. By default Docker does not persist data between container runs. This can be worked around by creating volumes - Docker-managed file system mounts used to preserve data generated by running containers. A private Docker registry can be set up to manage software deployment.
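As an illustration of the volume-based persistence workaround, a minimal sketch using the Docker SDK for Python follows; the registry, image name and mount point are placeholders.

    import docker

    client = docker.from_env()

    # Named volume holding data that should survive container restarts
    client.volumes.create(name="platform-data")

    # Run the (hypothetical) Platform core image with the volume mounted
    client.containers.run(
        "registry.example.com/platform-core:latest",
        detach=True,
        name="platform-primary",
        volumes={"platform-data": {"bind": "/var/lib/platform", "mode": "rw"}},
    )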

The much discussed Docker vulnerabilities would require detailed assessment and security approval: reliance on the Linux kernel and the possibility of a single container taking monopoly over resources; potential container breakout that could give the attacker root access to the targeted host (runC); poisoned images and compromised secrets, etc. For any serious production-level use Docker Enterprise should be considered in favor of the Community Edition.

Amazon Elastic Container Service (ECS) is a scalable container management offering by AWS. It supports two container launch types:
-        Fargate - runs containerized applications serverlessly, without the need to provision and manage the backend infrastructure
-        EC2 launch type – allows for EC2 resource flexibility, elasticity, etc. Control and cluster management is achieved via the Container Agent that comes as part of the ECS-optimized AMI or can be installed manually. ECS launch cost is defined by the EC2 pricing model used - On-Demand, Spot, Reserved, Dedicated.

The overall benefit of using Docker or ECS vs. an install directly onto EC2 Linux would need to be determined based on: 
-        container technology security review conclusions
-        overhead required for making the Platform container-friendly
The approach of tailoring EC2 resources to provide the computing power required by a single Trading Platform instance might prove to be viable without the use of containers.

Performance analysis

A Trading Platform instance launch on AWS should be followed by a thorough performance review. Emphasis should be placed on the following:
-        CPU utilization and memory footprint fluctuations in response to market peak simulations
-        database IOPS
-        partial and full application failover on the network, hardware and application levels; data persistence across Primary/StandBy; front end user experience effect of reconnecting to the StandBy once it assumes the role of the Primary, etc.
-        overall electronic order latency measured from the electronic receipt of a message into the Trading Platform to onward transmission out to market from the market-access component
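For the last measurement above, a simplistic probe could timestamp a message on electronic receipt and again on onward transmission to market; the handler below is a placeholder for the Platform's actual processing and market-access path.

    import time

    def measure_order_latency(order, process_and_route):
        t_in = time.perf_counter_ns()     # electronic receipt into the Trading Platform
        process_and_route(order)          # Platform processing + market-access hand-off
        t_out = time.perf_counter_ns()    # onward transmission out to market
        return (t_out - t_in) / 1_000     # latency in microseconds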

CONCLUSION

Amazon Web Services do offer the tools necessary for hosting the Trading Platform in a cloud environment. To reach a definitive conclusion, a proof-of-concept exercise should be undertaken. The in-cloud Platform performance statistics should be compared to the equivalent metrics gathered in the on-premise environment. In addition, an all-in cost analysis of on-premise hosting should be compared to the aggregate cost of running in the cloud over a similar period of time. Additional InfoSec compliance evaluation, market access latency analysis and a Platform Client sign-off would be required to host a production instance in the cloud.

Additional corporate security recommendations

An implementation of a dedicated security framework is necessary to ensure control of access to cloud environments, definition of standards of use and adherence to these standards. From making sure that IAM user permissions are not over-assigned to ensuring the appropriate client or department is properly billed for AWS resource use, clear guidelines need to be made available.

The highlights should include:
-        mandatory training for staff requiring cloud access
-        services access control through dedicated IAM user creation
-        development of AWS compatible well-documented tools, scripts and libraries
-        definition of minimum infrastructure security standards for use of specific resources such as VPC, databases, DDoS prevention mechanisms, etc.
-        provision of ready-to-use locked-down infrastructure-as-code templates
-        regulations around storage, distribution, revocation, rotation for custom access keys
-        automated tools for security monitoring of CI/CD environments - a centralized on-premise or AWS-provided code-compliance checking mechanism
-        over-billing prevention for resources no longer in use - such as automated removal of unused UAT EC2 environments
-        incident playbooks
-        billing alarms (a minimal sketch follows this list)
-        definition of responsibilities shared between the Trading Platform Provider and a Client - data provision, in-cloud connectivity, shared resource use and billing
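As an example of the billing alarm item above, a minimal CloudWatch sketch follows; the threshold and SNS topic ARN are placeholders. Billing metrics are published in the us-east-1 region.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="platform-dev-monthly-spend",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                      # evaluate every 6 hours
        EvaluationPeriods=1,
        Threshold=5000.0,                  # alert once estimated charges exceed $5,000
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
    )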

The availability of a clearly defined, easily accessible and centrally enforced set of rules is imperative to the success of cloud adoption. Strong adherence to these rules is mandatory for the continuous compliance required by Trading Platform environments.