London Stock Exchange Group (LSEG) has 30 PB of Tick History-PCAP data, ultra-high-quality global market data derived from raw exchange data and timestamped to the nanosecond, with an additional 60 TB generated every day. LSEG sought to migrate this data off Wasabi cloud storage to optimize storage costs by archiving older datasets and to give their customers easier access to market data. They needed an automated migration strategy and were seeking best practices for large multicloud data migrations.
LSEG engaged AWS and DataArt, an AWS Partner, who orchestrated this large-scale data migration with AWS DataSync, a service that moves data between on-premises storage, edge locations, other cloud providers, and AWS Storage services. The key objective of the migration solution was to accelerate the data transfer while emphasizing efficiency and cost in the solution design, development, and process.
In this post, we review how LSEG automated the migration of their 30 PB of ultra-high-quality global market data to Amazon Simple Storage Service (Amazon S3) using AWS DataSync in less than three months. LSEG transferred active datasets to Amazon S3 Intelligent-Tiering to achieve cost savings based on how their customers would access the data going forward. Source data that was older than 180 days and not accessed by customers was transferred directly to Amazon S3 Glacier Deep Archive. Moving data to Amazon S3 has resulted in an 80% reduction in data storage costs for LSEG. In addition, with the data on AWS, LSEG's customers can directly access and consume it for exploratory analytics through AWS Data Exchange, giving them more options in how they work with the data.
Solution overview
In this section, we review the automated migration solution and services that constitute the overall design outlined in Figure 1. The migration process included the collection of detailed information from the source Wasabi buckets, generation of DataSync tasks, data transfer, and data validation. DataSync was the core component for the data transfer that was orchestrated with a combination of AWS Step Functions, AWS Lambda, and Amazon DynamoDB. Monitoring of the data migration was performed using integrated logging with Amazon CloudWatch.
Figure 1: LSEG data migration architecture with DataSync and AWS services
The new AWS environment configuration began with networking within an Amazon Virtual Private Cloud (Amazon VPC) configured with five public subnets across five Availability Zones (AZs). DataSync agents were deployed across AZs as m5.2xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances and activated with a VPC private endpoint. The instance type was chosen based on the number of objects planned for each DataSync task. The optimal number of agents was defined based on data transfer performance obtained from an initial proof of concept and planned infrastructure costs. The scale out of the Amazon EC2 fleet per AZ was configured using two factors:
Supported data and networking throughput – A theoretical limit
Throughput measurements obtained through a proof of concept (POC) – A practical limit
Once the POC was complete, we developed and implemented an automated scale out approach that included deploying DataSync agents and configuring tasks for the data transfer process. This automated approach enabled us to scale data transfer throughput as required based on business requirements and cutover deadlines.
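To give a sense of what the agent activation step can look like, the following boto3 sketch registers an EC2-based agent with DataSync through a VPC endpoint. The activation key, VPC endpoint, subnet, security group, and naming values are placeholders for illustration, not details from the LSEG environment.

```python
import boto3

datasync = boto3.client("datasync", region_name="eu-west-1")  # placeholder Region

# Register an already-deployed EC2-based DataSync agent using a VPC endpoint.
# All identifiers below are placeholders.
response = datasync.create_agent(
    ActivationKey="EXAMPLE-ACTIVATION-KEY",  # retrieved from the agent instance
    AgentName="pcap-migration-agent-az1-01",
    VpcEndpointId="vpce-0123456789abcdef0",  # DataSync interface endpoint
    SubnetArns=["arn:aws:ec2:eu-west-1:111122223333:subnet/subnet-0abc1234"],
    SecurityGroupArns=["arn:aws:ec2:eu-west-1:111122223333:security-group/sg-0abc1234"],
)
print("Agent ARN:", response["AgentArn"])
```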
Automating the migration workflow
To isolate the migration workload, the migration infrastructure was deployed in a separate account from the environment hosting the destination production S3 buckets. We developed the solution so that the infrastructure could be deployed with the AWS Cloud Development Kit (AWS CDK) or HashiCorp Terraform. Multiple DataSync tasks were pre-generated and created for each migration session, each specifying a storage class of either S3 Intelligent-Tiering or S3 Glacier Deep Archive, chosen from the source data's last modified time and access patterns. LSEG understood these access patterns from metrics collected on historical customer usage.
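The destination storage class for a DataSync transfer is set on the S3 location. A minimal boto3 sketch of the two location types follows; the bucket names, prefixes, and IAM role are hypothetical.

```python
import boto3

datasync = boto3.client("datasync")

# Destination location that writes directly to S3 Glacier Deep Archive
# (for source data older than 180 days). Names and ARNs are placeholders.
archive_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-tick-history-archive",
    Subdirectory="/pcap/2019/",
    S3StorageClass="DEEP_ARCHIVE",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3AccessRole"},
)

# Destination location for active datasets using S3 Intelligent-Tiering.
active_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-tick-history-active",
    Subdirectory="/pcap/2024/",
    S3StorageClass="INTELLIGENT_TIERING",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3AccessRole"},
)
```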
The migration process was divided into smaller phases outlined as a migration session. Each migration session consisted of multiple DataSync tasks run in parallel. This phased approach allowed for focused monitoring while rerunning DataSync tasks as required. Monitoring was implemented for Amazon EC2 DataSync agents, data throughput rates, and DataSync task status.
The migration session consisted of four components:
Collect metadata from source buckets.
Generate DataSync tasks.
Migrate data.
Validate data post-migration.
Orchestrating the migration
Metadata was collected for the source Wasabi buckets, including object counts, sizes, and checksums. Wasabi generates bucket ETags differently from Amazon S3, so the source ETags could not be used for application data validation. To meet the application's requirement for matching ETags after the data migration, we created custom Python scripts that generated metadata on the source system using the same mechanism as the target system. This process was accelerated by processing objects in parallel within the Python scripts, which simplified the application-level validation performed after the migration.
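LSEG's scripts are not published with this post, but reproducing S3-style multipart ETags on the source side generally follows the pattern sketched below. It assumes the part size used for the upload is known and fixed; a different part size produces a different ETag.

```python
import hashlib

def s3_style_etag(path: str, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute an ETag in the format S3 uses for multipart uploads: the MD5 of
    the concatenated per-part MD5 digests, suffixed with the part count.
    The part_size must match the part size used when the object was uploaded."""
    part_md5s = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            part_md5s.append(hashlib.md5(chunk))

    if not part_md5s:
        return hashlib.md5(b"").hexdigest()
    if len(part_md5s) == 1:
        # Single-part objects carry a plain MD5 ETag in S3.
        return part_md5s[0].hexdigest()

    combined = hashlib.md5(b"".join(m.digest() for m in part_md5s))
    return f"{combined.hexdigest()}-{len(part_md5s)}"
```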
Based on the collected metadata and migration session definitions, we generated a list of DataSync task configurations. The task configurations were stored in DynamoDB. Task configurations were generated with the following quotas:
Number of objects and prefixes defined in the task: less than 20 million.
Task filter length: less than 102,400 characters.
Data to be transferred: 1 TB.
Calculations based on measured performance indicated that 1 TB would be transferred in 60 minutes with a single DataSync task and agent.
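A simplified sketch of how task generation against these quotas might look is shown below. The DynamoDB table name, attribute names, and filter format are illustrative assumptions rather than the exact LSEG implementation.

```python
import boto3

MAX_OBJECTS = 20_000_000      # objects and prefixes per DataSync task
MAX_FILTER_CHARS = 102_400    # task filter length quota
MAX_BYTES = 1 * 1024**4       # ~1 TB per task, per the measured throughput target

# Hypothetical table holding the migration plan.
table = boto3.resource("dynamodb").Table("migration-task-configs")

def generate_task_configs(session_id, objects):
    """objects: iterable of dicts with 'key' and 'size' collected from the source bucket."""
    batch, batch_bytes, filter_len, task_no = [], 0, 0, 0
    for obj in objects:
        key_len = len(obj["key"]) + 1  # +1 for the '|' separator in a DataSync filter
        over_quota = (len(batch) >= MAX_OBJECTS
                      or batch_bytes + obj["size"] > MAX_BYTES
                      or filter_len + key_len > MAX_FILTER_CHARS)
        if over_quota and batch:
            _persist(session_id, task_no, batch, batch_bytes)
            batch, batch_bytes, filter_len = [], 0, 0
            task_no += 1
        batch.append(obj["key"])
        batch_bytes += obj["size"]
        filter_len += key_len
    if batch:
        _persist(session_id, task_no, batch, batch_bytes)

def _persist(session_id, task_no, keys, total_bytes):
    # Store one DataSync task configuration per item in the migration plan.
    table.put_item(Item={
        "session_id": session_id,
        "task_id": f"task-{task_no:05d}",
        "include_filter": "|".join(keys),
        "total_bytes": total_bytes,
        "status": "PENDING",
    })
```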
Step Functions orchestrated the DataSync task runs, drawing available agents from an Amazon Simple Queue Service (Amazon SQS) queue. The top-level step function validated runtime conditions, read the parameters governing the migration session, and started the task workflow. It also implemented a global error trap to invalidate the session and stop the infrastructure in case of an unexpected job failure. A separate step function handled session orchestration, including a retry mechanism for unpredicted errors such as unexpected API responses.
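The orchestration details are specific to LSEG's environment, but a Lambda function that claims an agent from an SQS-backed pool and starts a DataSync task execution could look roughly like the following sketch. The queue URL, event fields, and the way agents map to tasks are assumptions made for illustration.

```python
import json
import boto3

sqs = boto3.client("sqs")
datasync = boto3.client("datasync")

# Hypothetical queue holding ARNs of agents that are currently idle.
AGENT_POOL_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/111122223333/agent-pool"

def handler(event, context):
    """Invoked by the session step function: claim a free agent from the pool
    and start the next pending DataSync task execution."""
    messages = sqs.receive_message(QueueUrl=AGENT_POOL_QUEUE_URL, MaxNumberOfMessages=1)
    if "Messages" not in messages:
        return {"status": "NO_AGENT_AVAILABLE"}

    agent_message = messages["Messages"][0]
    agent_arn = json.loads(agent_message["Body"])["agent_arn"]

    # Start the execution for the pre-generated task, scoped by its include filter.
    execution = datasync.start_task_execution(
        TaskArn=event["task_arn"],
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": event["include_filter"]}],
    )

    # Remove the agent from the pool until EventBridge reports task completion.
    sqs.delete_message(QueueUrl=AGENT_POOL_QUEUE_URL,
                       ReceiptHandle=agent_message["ReceiptHandle"])

    return {"status": "STARTED",
            "task_execution_arn": execution["TaskExecutionArn"],
            "agent_arn": agent_arn}
```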
The migration orchestration backend was implemented with AWS Lambda functions to reduce management overhead and follow a serverless approach. To simplify and unify the code, common functionality such as unified logging, JSON manipulation, and DynamoDB data handling was separated into a Lambda layer shared across the Lambda functions. Amazon EventBridge was used to process completed tasks. The data layer of the solution consisted of three components:
AWS Secrets Manager to store sensitive information such as access and secret keys for the source environment.
AWS Systems Manager Parameter Store to store parameters used by the migration sessions.
DynamoDB to store the migration plan, which consisted of migration session logs, source and target bucket configurations, and migration tasks for DataSync.
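A shared Lambda layer helper for reading this configuration might look like the following sketch; the secret and parameter names are placeholders.

```python
import json
import boto3

_secrets = boto3.client("secretsmanager")
_ssm = boto3.client("ssm")

def get_source_credentials(secret_id: str = "wasabi/source-credentials") -> dict:
    """Fetch the source object storage access and secret keys from Secrets Manager.
    The secret name is a placeholder."""
    response = _secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

def get_session_parameter(name: str) -> str:
    """Read a migration session parameter (for example, the active session ID or
    a concurrency limit) from Systems Manager Parameter Store."""
    response = _ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```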
During a transfer, DataSync always verifies the integrity of your data. However, the uniqueness of the LSEG dataset, along with automated deletion of data from the source, required a final comprehensive validation of data integrity at the application level. Therefore, an additional data validation process was developed to confirm that the entire data migration was complete. Each object was checked for accuracy before its migration was marked as a success. To optimize the time and cost-efficiency of post-migration validation, we implemented a combined approach using a custom Python script along with Amazon Athena for additional data processing.
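The exact validation logic is specific to the dataset, but an Athena-based check could be as simple as the sketch below, which assumes the source and destination metadata manifests have been cataloged as Athena tables with hypothetical names.

```python
import boto3

athena = boto3.client("athena")

# Find any source object that has no matching key and ETag in the destination.
# Table, database, and output locations are placeholders.
QUERY = """
SELECT s.object_key
FROM source_manifest s
LEFT JOIN destination_manifest d
       ON s.object_key = d.object_key AND s.etag = d.etag
WHERE d.object_key IS NULL
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "pcap_migration"},
    ResultConfiguration={"OutputLocation": "s3://example-validation-results/athena/"},
)
print("Validation query started:", execution["QueryExecutionId"])
```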
Monitoring and alerting
During the migration process, it was important to track the overall status of each component of the migration. The AWS services used in this architecture were configured to log events to CloudWatch. CloudWatch logs were used for both proactive and reactive analysis, providing the level of monitoring required to assess and optimize usage across the multiple services. CloudWatch dashboards were implemented to monitor a migration session using selected key metrics.
One main goal of monitoring was to capture the maximum network throughput generated by the DataSync agents running on EC2 instances. This information helped us calculate the number of EC2 instances to deploy in each AZ while balancing costs and data transfer performance against the completion date. DataSync agents were deployed as EC2 instances using Terraform automation. Based on real-time monitoring metrics, we adjusted the number of DataSync agents within each subnet to keep the migration efficient. Additionally, we implemented alerts for specific conditions, such as Amazon EC2 and DataSync agent availability, to maintain an active pool of available agents for a given migration session.
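As one example of such an alert, the following boto3 sketch creates a CloudWatch alarm on a failed EC2 status check for an agent instance; the instance ID and SNS topic are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when an agent's EC2 instance fails its status checks for three minutes.
# The instance ID and SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="datasync-agent-az1-01-status-check",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:111122223333:migration-alerts"],
)
```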
Key findings and considerations
If you are planning a large-scale migration similar to this use case, there are a few considerations worth noting before starting the migration process.
Migration strategy
It is important to plan and decide on a migration strategy. The right strategy depends on what the customer wants to achieve when balancing a faster migration timeline against cost optimization. For LSEG's migration, the goal was to migrate the data as fast as possible, so we accelerated the migration by using more DataSync agents across multiple AZs.
Another key consideration for an effective migration is how to implement failure handling for the automated migration infrastructure. The automation developed for the LSEG migration divided the effort into smaller phases and was implemented with a repeatable approach, simplifying retries when necessary. DataSync tasks that required a rerun were restarted automatically based on feedback from the implemented observability dashboard. It is important to establish the migration Key Performance Indicators (KPIs) ahead of time, so that the observability built into the migration can track progress against those KPIs and the scale of the migration tasks can be adjusted as needed.
Scaling and AWS DataSync considerations
With a highly automated workflow, we were mindful of how many API operations the scripting would call throughout the stack. API throttling limits the number of API requests a user can make in a certain period, so applications should implement retry handling when they are throttled. For example, an application might need to retry a request after an appropriate sleep interval, ideally one that grows with an exponential backoff. We used this retry mechanism throughout the orchestration to avoid API throttling, and migration sessions were scheduled outside of peak usage times.
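In Python, this can be as simple as enabling the SDK's adaptive retry mode and wrapping any remaining calls in a jittered exponential backoff, as in the following sketch (the retryable error codes shown are examples, not an exhaustive list).

```python
import random
import time

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Let the SDK retry throttled calls with client-side rate limiting enabled.
datasync = boto3.client(
    "datasync",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

def call_with_backoff(fn, *args, max_attempts=8, **kwargs):
    """Wrapper for calls that still need retries: sleep for an exponentially
    growing, jittered interval whenever the API reports throttling."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            retryable = code in ("ThrottlingException", "TooManyRequestsException")
            if not retryable or attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())
```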
Given that DataSync was a key component of the overall architecture, consider the following criteria when sizing and deploying DataSync agents:
Number of vCPUs per account
Amazon EC2 instance type network throughput
Number of DataSync tasks
DataSync quotas
Number of Availability Zones
Availability of EC2 instance types in your AWS Region
Some Amazon S3 storage classes have a minimum billable object size or metadata considerations. For example, S3 Glacier Deep Archive needs 40 KB of additional metadata for each archived object. When a DataSync S3 location is configured to write to S3 Glacier Deep Archive, objects smaller than 40 KB are stored in the S3 Standard storage class. However, the source dataset contained objects smaller than 40 KB, and there was a requirement to store those objects in S3 Glacier Deep Archive. For this scenario, an S3 Lifecycle policy was configured on the destination S3 bucket to transition data to S3 Glacier Deep Archive after 0 days.
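A lifecycle rule of this kind can be applied with a few lines of boto3, as in the sketch below; the bucket name is a placeholder.

```python
import boto3

s3 = boto3.client("s3")

# Transition all objects in the destination bucket (name is a placeholder) to
# S3 Glacier Deep Archive as soon as the lifecycle rule is evaluated (0 days),
# catching the sub-40 KB objects that DataSync wrote to S3 Standard.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-tick-history-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-deep-archive-immediately",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)
```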
Lastly, depending on the type of data you are migrating, you may need a specific data validation strategy. As in this use case, when working with custom applications there may be additional requirements such as ETag validation. In this solution, we generated metadata for both the source and target systems because the LSEG application required matching metadata, even though the source and destination generate it in different formats.
Conclusion
With DataArt consulting, LSEG used AWS DataSync for the data migration, taking advantage of its built-in features and its integration with other AWS services such as AWS Step Functions, AWS Lambda, and Amazon DynamoDB. The planning of this 30 PB migration started with the customer requirements and was implemented with a scale-out approach using multiple DataSync agents and tasks. The strategy took into account project timelines and associated costs, and the migration was completed in less than three months using a phased approach that orchestrated and validated DataSync tasks throughout the process until final cutover.
LSEG’s migration of the Tick History-PCAP global market data to Amazon S3 has resulted in an 80% reduction in storage costs and transformed the way LSEG customers access the data. With data stored in Amazon S3, customers can now run exploratory analysis without copying data to their own environments, which has lowered their storage costs as well. LSEG Tick History-PCAP has also achieved higher availability and resiliency of the data as a result of migrating to AWS. The migrated data has been published through AWS Data Exchange, which simplifies entitlements management without custom solutions and has freed up LSEG’s teams to focus on more business-generating applications.
“The partnership with AWS continues to help our customers access lossless, high-quality historical data. With over 30 PB of data now migrated, thanks to both teams at AWS and LSEG, leveraging sophisticated tools and processes, we can continue to add new global multi-assets venues as customer demand continues to grow for LSEG PCAP data.”
Rob Lane, Global Head of Business Execution, Low Latency, LSEG