
Lambda, RDS, Oracle on EC2: Hybrid Data Tier Patterns

Designing data tier architectures that bridge serverless Lambda functions with traditional Oracle databases on AWS

At a major entertainment company, our data tier is not a clean, modern architecture. It is a hybrid landscape where AWS Lambda functions coexist with Oracle databases running on EC2, RDS instances serve read-heavy workloads, and DynamoDB tables handle session state. This is not a failure of architecture. It is the reality of operating at enterprise scale, where migration is incremental and legacy systems carry business logic that took years to develop.

I have spent the past several months designing patterns that let these components work together reliably. The challenges are more nuanced than "just move everything to DynamoDB," and the solutions are worth documenting.

The Landscape

Our data tier has four primary components:

  • Oracle 12c on EC2: The system of record for core business data. Runs on r4.4xlarge instances in a Data Guard configuration for high availability. Contains roughly 2TB of data with complex PL/SQL business logic embedded in stored procedures.
  • Amazon RDS (PostgreSQL): Serves read-heavy workloads through read replicas. Used by newer microservices that do not need Oracle-specific features.
  • DynamoDB: Handles session state, feature flags, and high-throughput key-value access patterns.
  • AWS Lambda: Processes events from SQS queues, API Gateway requests, and S3 triggers. Currently about 40 Lambda functions in production.

The challenge is making Lambda functions interact with Oracle on EC2 without introducing reliability issues, performance bottlenecks, or security gaps.

Lambda to Oracle: The Connection Problem

Lambda functions connecting directly to Oracle (or any relational database) face a fundamental architectural tension. Relational databases maintain stateful connections. Each connection consumes memory on the database server, and Oracle licenses often impose connection limits. Lambda functions are ephemeral, stateless, and can scale to hundreds or thousands of concurrent executions in seconds.

If you have 500 concurrent Lambda invocations, each opening a database connection, you will exhaust your Oracle connection pool and bring down every application that depends on that database. I have seen this happen. It is not hypothetical.

Pattern 1: RDS Proxy and Connection Pooling

For RDS PostgreSQL workloads, Amazon RDS Proxy (which AWS previewed earlier this year) addresses this problem directly. The proxy maintains a persistent connection pool to the database and multiplexes Lambda connections through it, reducing the connection overhead by an order of magnitude.

For Oracle on EC2, no managed proxy exists. We built our own connection pooling layer: a small ECS service that plays the role PgBouncer plays for PostgreSQL, implemented as a custom connection broker built on Oracle's Universal Connection Pool (UCP) and exposed through a thin REST API.

Lambda Function
    |
    v
API Gateway (internal)
    |
    v
ECS Service (Oracle Connection Broker)
    |
    v [persistent connection pool]
Oracle on EC2 (Data Guard)

The connection broker maintains a pool of 50 persistent connections to Oracle and serves requests from Lambda functions. This caps the number of Oracle connections regardless of Lambda concurrency, and the connection broker handles queueing when demand exceeds pool capacity.
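The broker's pooling semantics can be sketched as follows. This is a minimal illustration in Python, not the production service (which is a Java component built on Oracle UCP); the names `MAX_POOL_SIZE`, `ACQUIRE_TIMEOUT_S`, and the `connect`/`work` callables are hypothetical stand-ins.

```python
import queue

MAX_POOL_SIZE = 50       # hard cap on Oracle sessions, regardless of Lambda concurrency
ACQUIRE_TIMEOUT_S = 2.0  # how long a request waits in the queue when the pool is busy

class ConnectionBroker:
    def __init__(self, connect):
        # Pre-open a fixed pool of persistent connections at startup.
        self._pool = queue.Queue(maxsize=MAX_POOL_SIZE)
        for _ in range(MAX_POOL_SIZE):
            self._pool.put(connect())

    def execute(self, work):
        # Borrow a connection; block (queue the request) when demand
        # exceeds pool capacity, and fail fast after the timeout.
        try:
            conn = self._pool.get(timeout=ACQUIRE_TIMEOUT_S)
        except queue.Empty:
            raise RuntimeError("pool exhausted: request timed out in queue")
        try:
            return work(conn)
        finally:
            self._pool.put(conn)  # always return the connection to the pool
```

The key property is the hard cap: no matter how many Lambda invocations arrive, Oracle never sees more than 50 sessions from this path.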

Pattern 2: Asynchronous Data Access via SQS

For workloads that do not require synchronous responses, we use SQS as a buffer between Lambda and Oracle. The Lambda function writes a message to an SQS queue, and a consumer service (running on ECS) processes the queue and writes to Oracle at a controlled rate.

This pattern is particularly effective for write operations where the caller does not need immediate confirmation of database persistence. Event logging, audit trails, and analytics writes all fit this pattern well.

Lambda Function
    |
    v
SQS Queue (with DLQ)
    |
    v
ECS Consumer Service [rate-limited]
    |
    v
Oracle on EC2

The SQS dead letter queue catches messages that fail after multiple retry attempts, giving us a recovery path without losing data.
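The consumer's rate limiting can be sketched with a token bucket. This is an illustrative simplification; the real ECS service also handles SQS batch receives, visibility timeouts, and DLQ redrive, and the rate and burst values here are placeholders.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s      # sustained writes/second allowed against Oracle
        self.capacity = burst       # short bursts above the sustained rate
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self):
        # Refill tokens based on elapsed time, then try to spend one.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller backs off; the message stays on the queue

def drain(messages, bucket, write_to_oracle):
    # Process queued messages no faster than the bucket allows.
    written = []
    for msg in messages:
        while not bucket.try_acquire():
            time.sleep(0.01)  # back off instead of hammering Oracle
        written.append(write_to_oracle(msg))
    return written
```

Because the consumer, not the producers, sets the write rate, a burst of Lambda activity translates into queue depth rather than Oracle load.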

Pattern 3: DynamoDB as a Read Cache

For read-heavy access patterns, we cache frequently accessed Oracle data in DynamoDB. A change data capture process reads the Oracle redo logs and propagates changes to DynamoDB tables with a typical lag of two to five seconds.

Lambda functions read from DynamoDB for most requests, falling back to Oracle only when the data is not cached or when strong consistency is required. This reduces Oracle load by roughly 70% for the workloads where we have implemented it.

The trade-off is eventual consistency. For our use cases (content metadata, configuration data, user preferences), a few seconds of staleness is acceptable. For financial transactions or inventory management, it would not be.
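The read path described above amounts to a cache-aside lookup with a consistency escape hatch. A minimal sketch, where `get_from_dynamo` and `query_oracle` are hypothetical stand-ins for the real data access calls:

```python
def read_item(key, *, require_strong_consistency=False,
              get_from_dynamo=None, query_oracle=None):
    # Strongly consistent reads bypass the cache and go straight to Oracle.
    if require_strong_consistency:
        return query_oracle(key)
    cached = get_from_dynamo(key)
    if cached is not None:
        return cached         # served from the DynamoDB cache
    return query_oracle(key)  # cache miss: fall back to the source of record
```

Callers that can tolerate a few seconds of staleness take the default path; the small set that cannot opt into the Oracle read explicitly.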

Oracle High Availability on EC2

Running Oracle on EC2 rather than RDS means we own the high availability configuration. We use Oracle Data Guard with the following topology:

  • Primary instance: r4.4xlarge in us-east-1a
  • Standby instance: r4.4xlarge in us-east-1b
  • Observer: t3.medium in us-east-1c (for Fast-Start Failover)

Data Guard replicates redo logs from primary to standby synchronously (for zero data loss) within the same region. The observer monitors both instances and triggers automatic failover if the primary becomes unreachable.

Key operational considerations:

Storage. We use EBS io1 volumes (provisioned IOPS) for the Oracle data files, provisioned at 50 IOPS per GB of storage (the io1 maximum ratio), which provides consistent performance for our mixed read/write workload. GP2 volumes were insufficient; we hit the IOPS ceiling during peak batch processing windows.

Backup. RMAN backups to S3 via the Oracle S3 integration module. Full backups nightly, incremental every four hours, archived redo logs shipped to S3 continuously. We maintain a 30-day recovery window.

Patching. Oracle Critical Patch Updates (CPUs) are applied quarterly during maintenance windows. We apply to the standby first, fail over, then patch the old primary. This gives us a near-zero-downtime patching process, with only a brief (under 30 seconds) failover interruption.

IAM and Network Security

The data tier security model uses multiple layers:

VPC segmentation. Oracle instances run in private subnets with no internet access. Lambda functions that need Oracle access are configured with VPC networking, placing them in the same private subnets. This adds cold start latency to Lambda (VPC-attached Lambdas take 5 to 10 seconds for cold starts), but the security requirement is non-negotiable.

Security groups. Oracle security groups allow inbound connections only from the connection broker ECS service security group, on port 1521. No direct access from Lambda functions to Oracle; all traffic routes through the connection broker.

IAM roles. Lambda functions assume execution roles with least-privilege policies. A function that reads from DynamoDB gets dynamodb:GetItem and dynamodb:Query on the specific table ARN. Nothing more.
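A least-privilege policy for such a function looks roughly like the following. The table name and account ID are placeholders, not our real resource names:

```python
import json

# Policy document for a read-only Lambda execution role; the ARN below
# is a hypothetical example, not an actual resource.
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/SessionState"
    }]
}

print(json.dumps(READ_ONLY_POLICY, indent=2))
```

No wildcard actions, no wildcard resources: the function can read one table and do nothing else.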

Secrets management. Oracle credentials are stored in AWS Secrets Manager with automatic rotation every 30 days. Lambda functions and ECS services retrieve credentials at runtime; no credentials are embedded in code or environment variables.

Monitoring the Hybrid Tier

Monitoring a hybrid data tier requires correlating metrics across multiple services. We built a unified dashboard that combines:

  • Oracle AWR metrics (wait events, SQL execution plans, tablespace utilization) published to CloudWatch via a custom metric agent
  • RDS Performance Insights data for PostgreSQL query analysis
  • DynamoDB consumed capacity metrics and throttling events
  • Lambda duration, error rate, and concurrent execution metrics
  • SQS queue depth and message age

The most useful alert we implemented: if the SQS queue depth exceeds 1000 messages or the oldest message exceeds 5 minutes, the on-call engineer is paged. This catches Oracle connection issues before they cascade into user-visible failures.
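The alert condition itself is simple enough to express as a predicate. In production this is a CloudWatch alarm on the queue's `ApproximateNumberOfMessagesVisible` and `ApproximateAgeOfOldestMessage` metrics; the sketch below just shows the thresholds:

```python
MAX_QUEUE_DEPTH = 1000            # messages visible on the queue
MAX_OLDEST_MESSAGE_AGE_S = 5 * 60 # age of the oldest unprocessed message

def should_page(queue_depth, oldest_message_age_s):
    # Either condition indicates the consumer is falling behind,
    # usually because Oracle writes have slowed or stalled.
    return (queue_depth > MAX_QUEUE_DEPTH
            or oldest_message_age_s > MAX_OLDEST_MESSAGE_AGE_S)
```

Queue depth alone misses a slow trickle of stuck messages, which is why the oldest-message age check matters.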

Lessons Learned

Do not fight the hybrid. The instinct is to migrate everything to a single modern database. In practice, Oracle databases with decades of embedded business logic cannot be migrated quickly or safely. Designing clean integration patterns between old and new is more pragmatic than attempting a wholesale migration.

Connection management is the critical path. Every reliability incident we had in the first three months traced back to connection pool exhaustion. The connection broker pattern eliminated this class of failure.

Eventual consistency is usually fine. Engineers tend to default to strong consistency because it is simpler to reason about. But for most read workloads, data that is two seconds stale is indistinguishable from data that is current. DynamoDB caching reduced Oracle load dramatically with no user-visible impact.

Monitor the boundaries. The failure points in a hybrid architecture are at the integration boundaries: where Lambda meets Oracle, where DynamoDB cache diverges from Oracle source, where SQS queue depth indicates a downstream bottleneck. Instrument these boundaries aggressively.

This is not a clean architecture. It is a pragmatic one, shaped by real constraints: existing Oracle investments, licensing costs, compliance requirements, and the simple fact that you cannot rewrite two decades of PL/SQL over a weekend. The patterns we have developed are not elegant in the abstract, but they work reliably in production, and that is what matters.
