7 min read

Tier-1 Commerce Payment Systems Migration

Migrating payment processing systems to the cloud while maintaining PCI compliance and zero downtime

There is a particular kind of stress that comes with migrating payment systems. When a commerce platform processes millions of transactions per day and a single minute of downtime translates to six figures in lost revenue, the margin for error is effectively zero. I have been leading one of these migrations for the past several months, and the experience has shaped how I think about infrastructure work in ways that no other project has.

The system in question was a legacy payment processing pipeline running on bare-metal servers in a co-located data center. It handled authorization, capture, settlement, and refund workflows for multiple lines of business at a major entertainment company. The technology stack was a mix of Java services, a commercial payment gateway, and a database layer that had accumulated years of customization.

The mandate was clear: move this to AWS without losing a single transaction.

PCI Compliance in the Cloud

The first challenge was not technical. It was regulatory. PCI DSS (Payment Card Industry Data Security Standard) compliance is non-negotiable for any system that touches cardholder data. Our existing PCI environment was audited and certified on physical infrastructure with known network boundaries, controlled physical access, and dedicated hardware security modules.

Moving to the cloud meant re-establishing all of those controls in a virtual environment. AWS provides PCI-compliant infrastructure, but that only covers the physical and hypervisor layers. Everything from the operating system up is your responsibility. This is the shared responsibility model in its most consequential form.

We designed the cloud environment with PCI scope minimization as the primary architectural goal. The fewer systems that touch cardholder data, the smaller the audit surface. We achieved this through several patterns:

Network segmentation. The payment processing services lived in isolated VPCs with no internet gateway. All traffic flowed through AWS PrivateLink or VPC peering connections, and every network path was explicitly defined in security groups and NACLs. No default-allow rules existed anywhere in the payment VPC.

Tokenization at the edge. We pushed tokenization as close to the customer as possible. By the time a payment request reached our backend services, the actual card number had been replaced with a token. This meant that the vast majority of our microservices never saw cardholder data and were out of PCI scope entirely.
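The exchange at the heart of edge tokenization can be sketched as follows. This is an illustrative sketch, not the production implementation: the class and method names are hypothetical, and the in-memory map stands in for what would be a dedicated, PCI-scoped vault service.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical edge tokenizer: swaps the card number (PAN) for an opaque
// token before the request reaches any backend service.
public class EdgeTokenizer {
    // In production this vault would be a separate, PCI-scoped service;
    // an in-memory map here just illustrates the exchange.
    private final Map<String, String> vault = new ConcurrentHashMap<>();

    public String tokenize(String pan) {
        String token = "tok_" + UUID.randomUUID();
        vault.put(token, pan);
        return token;
    }

    // Only the small, PCI-scoped detokenization service would call this.
    public String detokenize(String token) {
        return vault.get(token);
    }
}
```

Because the token carries no information about the underlying card number, any service that handles only tokens falls outside PCI scope.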

Encryption everywhere. TLS 1.2 for data in transit, AES-256 for data at rest. AWS KMS managed the encryption keys with automatic rotation. The HSM (Hardware Security Module) requirement was met with AWS CloudHSM, which provides dedicated hardware in AWS's data centers that you control exclusively.
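For the at-rest side, the envelope pattern looks roughly like this in Java's standard crypto API. This is a minimal sketch: in the real system the 256-bit data key came from KMS rather than being generated locally, and the class and method names here are illustrative.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// AES-256-GCM sketch of the at-rest encryption pattern. A locally
// generated key stands in for a KMS-issued data key.
public class AtRestCrypto {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static SecretKey newDataKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        return kg.generateKey();
    }

    // Returns the 12-byte IV prepended to the ciphertext.
    public static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        RANDOM.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(plaintext);
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return out;
    }

    public static byte[] decrypt(SecretKey key, byte[] blob) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, blob, 0, 12));
        byte[] ct = new byte[blob.length - 12];
        System.arraycopy(blob, 12, ct, 0, ct.length);
        return c.doFinal(ct);
    }
}
```

GCM gives authenticated encryption, so tampered ciphertext fails to decrypt rather than producing silently corrupted data.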

Logging and monitoring. PCI requires comprehensive audit logging. Every API call, every database query, every network connection in the payment VPC was logged to a dedicated CloudWatch log group with a 13-month retention policy. CloudTrail captured all AWS API activity. We built alerting on anomalous patterns: unusual transaction volumes, unexpected source IPs, failed authentication attempts.

The Migration Strategy

We rejected the big-bang approach immediately. Cutting over a payment system all at once is a recipe for catastrophic failure. Instead, we designed a phased migration using a traffic-splitting pattern.

The architecture looked like this:

                      [Load Balancer]
                             |
                      [Router Service]
                      /              \
            [Legacy Pipeline]    [Cloud Pipeline]
                      \              /
                  [Reconciliation Service]

The router service was a lightweight proxy that could direct individual transactions to either the legacy or cloud pipeline based on configurable rules. We started by sending 1% of traffic to the cloud pipeline, with every transaction also processed through the legacy system for comparison.
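The routing decision itself can be sketched in a few lines. This is an illustrative simplification (class and method names are hypothetical): hashing the transaction ID rather than rolling a random number makes the choice deterministic, so retries of the same transaction always land on the same pipeline.

```java
// Sketch of the router's traffic-split decision.
public class PipelineRouter {
    public enum Pipeline { LEGACY, CLOUD }

    private volatile int cloudPercent; // 0..100, updated from config

    public PipelineRouter(int cloudPercent) {
        this.cloudPercent = cloudPercent;
    }

    public void setCloudPercent(int p) {
        this.cloudPercent = p;
    }

    public Pipeline route(String transactionId) {
        // Deterministic bucket in [0, 100) derived from the transaction ID.
        int bucket = Math.floorMod(transactionId.hashCode(), 100);
        return bucket < cloudPercent ? Pipeline.CLOUD : Pipeline.LEGACY;
    }
}
```

Ramping traffic is then just a configuration change to `cloudPercent`, with no redeployment of the router.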

The reconciliation service compared the results. If the cloud pipeline produced a different authorization response than the legacy system for the same transaction, we flagged it for investigation. This dual-processing approach let us validate correctness with real production traffic without risking customer impact.
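The core comparison is straightforward; the value is in running it against every transaction. A sketch, with illustrative field names (the real result object carried many more fields):

```java
import java.util.Objects;

// Sketch of the reconciliation check: two pipelines processed the same
// transaction, and any divergence gets flagged for investigation.
public class Reconciler {
    public record AuthResult(String transactionId, String responseCode,
                             long amountMinorUnits, String currency) {}

    public static boolean matches(AuthResult legacy, AuthResult cloud) {
        return Objects.equals(legacy.transactionId(), cloud.transactionId())
            && Objects.equals(legacy.responseCode(), cloud.responseCode())
            && legacy.amountMinorUnits() == cloud.amountMinorUnits()
            && Objects.equals(legacy.currency(), cloud.currency());
    }
}
```

Comparing amounts in minor units (cents) rather than floating-point dollars avoids a whole class of rounding false positives.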

The Cutover Phases

Phase 1: Shadow mode (4 weeks). All production traffic was processed by the legacy system and simultaneously mirrored to the cloud pipeline. Results were compared, but the cloud responses were discarded. This phase uncovered dozens of edge cases: currency conversion rounding differences, timezone handling in settlement windows, and retry logic that behaved differently under cloud network latency.

Phase 2: Canary (3 weeks). We routed 1% of live traffic to the cloud pipeline as the primary processor, with automatic fallback to legacy if the cloud response exceeded latency thresholds or returned an error. The fallback was transparent to the customer; they never knew a retry happened.

Phase 3: Gradual ramp (6 weeks). 1% became 5%, then 10%, then 25%, then 50%. Each increase came after a stabilization period of at least one week with clean metrics. We held at 50% for two weeks to observe settlement and refund workflows that have longer time horizons.

Phase 4: Full cutover. We routed 100% to the cloud pipeline with the legacy system on hot standby. After 30 days of clean operation, we decommissioned the legacy pipeline.
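The fallback behavior from Phase 2 onward can be sketched like this. The structure is illustrative (the `Supplier`-based pipeline interfaces and threshold are assumptions, not the production code): try the cloud pipeline first, and fall back to legacy on error or a slow response.

```java
import java.util.function.Supplier;

// Sketch of the canary fallback: cloud is primary, legacy is the safety net.
public class CanaryProcessor {
    private final long latencyThresholdMs;

    public CanaryProcessor(long latencyThresholdMs) {
        this.latencyThresholdMs = latencyThresholdMs;
    }

    public String process(Supplier<String> cloudPipeline,
                          Supplier<String> legacyPipeline) {
        long start = System.nanoTime();
        try {
            String response = cloudPipeline.get();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs <= latencyThresholdMs) {
                return response;
            }
            // Too slow: fall through to legacy, invisible to the customer.
        } catch (RuntimeException e) {
            // Cloud error: fall back to legacy.
        }
        return legacyPipeline.get();
    }
}
```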

The entire process took roughly four months.

Zero-Downtime Deployment Patterns

Once the payment system lived in AWS, we needed deployment practices that matched the zero-downtime requirement. We adopted several patterns.

Blue-green deployments. Two identical environments existed at all times. Deployments went to the inactive environment, were validated with synthetic transactions, and then traffic was shifted via weighted target groups on the ALB. If anything went wrong, shifting traffic back took seconds.

Database migrations as separate events. Schema changes were deployed independently from application code, always backward-compatible, and always additive. We never dropped a column or renamed a table in the same release that changed the application code referencing it. This meant any application version could run against any adjacent schema version.

Circuit breakers on external dependencies. The payment gateway, fraud detection service, and tax calculation service were all external. Each integration point had a circuit breaker (implemented with Hystrix, which was still the standard at the time) that would trip after a configurable number of failures, preventing cascade failures from propagating through the system.
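The production integrations used Hystrix, but the pattern itself is small enough to sketch by hand. This hand-rolled version (all names illustrative, not the Hystrix API) shows the essence: after N consecutive failures the breaker opens and calls fail fast until a cooldown elapses.

```java
// Minimal circuit breaker: open after a threshold of consecutive failures,
// then allow a trial request once the cooldown has passed (half-open).
public class CircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMs;
    private int consecutiveFailures = 0;
    private long openedAtMs = 0;

    public CircuitBreaker(int failureThreshold, long cooldownMs) {
        this.failureThreshold = failureThreshold;
        this.cooldownMs = cooldownMs;
    }

    public synchronized boolean allowRequest() {
        if (consecutiveFailures < failureThreshold) {
            return true; // closed: normal operation
        }
        // Open: only allow a trial request after the cooldown elapses.
        return System.currentTimeMillis() - openedAtMs >= cooldownMs;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0; // close the breaker
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures == failureThreshold) {
            openedAtMs = System.currentTimeMillis();
        }
    }
}
```

The fail-fast behavior matters as much as the fallback: a tripped breaker stops request threads from piling up waiting on a dead dependency.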

Monitoring a Payment System

Standard application monitoring is insufficient for payment systems. We built a real-time dashboard that tracked metrics specific to the commerce domain:

  • Authorization success rate by payment method, card brand, and issuing bank
  • Average authorization latency with p50, p95, and p99 breakdowns
  • Settlement batch completeness, comparing expected versus actual settlement counts
  • Refund processing time from initiation to confirmation
  • Fraud detection false positive rate, because blocking legitimate transactions is as costly as allowing fraudulent ones

We set alerting thresholds tight. A 0.5% drop in authorization success rate triggered a page. In payment processing, small percentage changes represent large numbers of affected customers.
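The authorization-success-rate check reduces to a simple comparison against a baseline. A sketch, with illustrative names and the 0.5-percentage-point threshold from above:

```java
// Sketch of the paging rule: page when the current window's authorization
// success rate drops 0.5 percentage points or more below the baseline.
public class AuthRateAlert {
    private static final double DROP_THRESHOLD = 0.005; // 0.5 percentage points

    public static boolean shouldPage(double baselineRate, long approved, long attempted) {
        if (attempted == 0) {
            return false; // no traffic in the window; a separate alert covers that
        }
        double currentRate = (double) approved / attempted;
        return baselineRate - currentRate >= DROP_THRESHOLD;
    }
}
```

At millions of transactions per day, a 0.5-point drop means thousands of customers failing to pay, which is why the threshold pages rather than merely warns.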

Lessons Learned

PCI compliance is an architecture decision, not an afterthought. If you design your system without PCI scope minimization from the start, retrofitting it is enormously expensive. Tokenization, network segmentation, and encryption should be foundational, not bolted on.

Dual-processing is worth the engineering cost. Running both systems in parallel for months was expensive in compute and engineering time. It was worth every dollar. The edge cases we caught in shadow mode would have been production incidents during a direct cutover.

Payment systems expose every assumption in your infrastructure. Eventual consistency, clock skew, network partitions, DNS propagation delays: payment processing surfaces all of these because the tolerance for error is zero. If your infrastructure has latent issues, a payment migration will find them.

Communicate constantly. We held daily standups with stakeholders from engineering, finance, compliance, and the business. Payment system changes affect revenue directly, and everyone from the CFO to the fraud analysts needs to understand what is happening and when.

This migration was the most technically demanding project I have worked on. It was also the most rewarding. Moving a tier-1 payment system to the cloud without dropping a single transaction is the kind of result that builds organizational confidence in cloud infrastructure. After this, the next migration conversation was not "should we move to the cloud?" but "when can we start?"
