Storage Gateways, EFS, and AWS Batch: Solving Enterprise Storage Challenges
How we designed storage architectures for media workflows using EFS, AWS Batch, and Storage Gateway
Cloud compute is relatively straightforward. You need a server, you launch an instance, you scale horizontally when demand increases. Storage is a different beast entirely. Every application has different durability requirements, access patterns, throughput needs, and consistency models. At a major entertainment company, where the data includes massive media files, rendering pipelines, and content delivery workflows, storage architecture is where the real engineering lives.
Over the past few months, I have been deep in three AWS storage services that are critical to our media and batch processing workloads: EFS, AWS Batch, and Storage Gateway. Here is what I have learned about using them together in production.
The Media Processing Problem
Our content teams work with enormous files. A single uncompressed video asset can be tens of gigabytes. The processing pipeline involves ingestion, transcoding, quality checks, metadata extraction, and delivery to downstream systems. Historically, this pipeline ran on on-premises hardware with shared NFS mounts providing the common filesystem that every stage of the pipeline could read from and write to.
Moving this to AWS presented a fundamental question: how do you provide shared filesystem access to a fleet of processing nodes that scale dynamically?
EFS as the Shared Filesystem
Amazon Elastic File System was the natural answer. EFS provides a managed NFS filesystem that can be mounted by multiple EC2 instances simultaneously, scales automatically, and supports the POSIX semantics that our processing tools expect.
The architecture:
 Ingestion    Transcoding       QC       Delivery
   Nodes         Nodes         Nodes       Nodes
      \            |             |           /
       \           |             |          /
        -------- EFS File System ---------
                /content-pipeline/
                    /ingest/
                    /transcode/
                    /output/
Each processing stage reads from and writes to well-defined directories on the EFS filesystem. A transcoding node picks up a file from /ingest/, processes it, and writes the result to /transcode/. The QC nodes watch /transcode/ for new files, validate them, and move completed files to /output/.
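The hand-off between stages can be sketched as a simple directory sweep. This is an illustrative sketch, not our production code: the mount path /content and the validate_asset() check are stand-ins for the real QC tooling (ffprobe checks, checksums, and so on).

```python
# Sketch of the QC stage's directory hand-off, assuming the EFS filesystem
# is mounted at /content; validate_asset() is a hypothetical placeholder
# for real QC tooling.
import shutil
from pathlib import Path

def validate_asset(path: Path) -> bool:
    """Hypothetical QC check: reject empty or non-regular files."""
    return path.is_file() and path.stat().st_size > 0

def sweep(transcode_dir: Path, output_dir: Path) -> int:
    """Move every validated file from transcode_dir to output_dir; return count."""
    moved = 0
    for asset in sorted(transcode_dir.iterdir()):
        if validate_asset(asset):
            shutil.move(str(asset), str(output_dir / asset.name))
            moved += 1
    return moved

# In production this runs continuously (polling or inotify) against
# /content/transcode and /content/output.
```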
Performance Tuning
EFS performance required careful tuning. The default general purpose performance mode was insufficient for our video processing workloads. We switched to Max I/O performance mode, which provides higher aggregate throughput at the cost of slightly higher latency per operation. For large sequential reads and writes, which is exactly what video processing involves, this trade-off is favorable.
We also learned the hard way about EFS burst credits. EFS provides a baseline throughput proportional to the filesystem size, with the ability to burst to higher throughput for limited periods. A small filesystem that handles periodic large workloads will exhaust its burst credits and drop to baseline throughput, which can be painfully slow.
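The burst-credit math above can be sketched with back-of-envelope arithmetic. The figures here (roughly 50 MiB/s of baseline throughput per TiB stored) reflect EFS's published bursting model, but check the current documentation before relying on exact numbers.

```python
# Back-of-envelope EFS burst-credit arithmetic, assuming the published
# bursting figure of ~50 MiB/s baseline per TiB stored (illustrative;
# verify against current EFS documentation).
def baseline_mibps(storage_gib: float) -> float:
    """Baseline throughput scales with filesystem size: ~50 MiB/s per TiB."""
    return storage_gib * 50 / 1024

def burst_minutes(credit_gib: float, burst_mibps: float, storage_gib: float) -> float:
    """How long a burst above baseline lasts before credits are exhausted."""
    drain_mibps = burst_mibps - baseline_mibps(storage_gib)  # net credit drain rate
    return (credit_gib * 1024) / drain_mibps / 60

# A 100 GiB filesystem has only ~4.9 MiB/s of baseline throughput, so a
# sustained 100 MiB/s transcoding burst drains credits ~20x faster than
# they accrue -- exactly the trap described above.
```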
The solution was provisioned throughput mode, where we specify the throughput we need regardless of filesystem size. This costs more but provides predictable performance, which matters when processing pipelines have SLAs.
resource "aws_efs_file_system" "content_pipeline" {
  performance_mode                = "maxIO"
  throughput_mode                 = "provisioned"
  provisioned_throughput_in_mibps = 256

  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"
  }

  tags = {
    Name = "content-pipeline"
  }
}
The lifecycle policy is worth noting. EFS Infrequent Access storage class costs significantly less than standard storage. Files that have not been accessed in 30 days automatically transition to IA, which saves money on the large archive of processed media that rarely gets re-accessed but needs to remain available.
AWS Batch for Processing Orchestration
The processing nodes themselves run as AWS Batch jobs. Batch handles the compute provisioning, job scheduling, and retry logic that we would otherwise have to build ourselves.
Our Batch setup consists of three compute environments:
- On-demand: For time-sensitive, high-priority jobs that need guaranteed capacity.
- Spot: For bulk processing where cost optimization matters more than completion time.
- GPU: For transcoding workloads that benefit from hardware acceleration.
{
  "computeEnvironmentName": "media-processing-spot",
  "type": "MANAGED",
  "computeResources": {
    "type": "SPOT",
    "bidPercentage": 60,
    "minvCpus": 0,
    "maxvCpus": 1024,
    "desiredvCpus": 0,
    "instanceTypes": ["c5.2xlarge", "c5.4xlarge", "m5.2xlarge", "m5.4xlarge"],
    "subnets": ["subnet-abc123", "subnet-def456"],
    "securityGroupIds": ["sg-batch123"]
  }
}
Setting minvCpus to 0 means the environment scales to zero when idle, eliminating costs during off-hours. Setting maxvCpus to 1024 allows substantial burst capacity for large batch submissions.
The Batch job definitions mount the EFS filesystem, so every job container has access to the shared content pipeline:
{
  "containerProperties": {
    "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/media-processor:latest",
    "vcpus": 4,
    "memory": 16384,
    "mountPoints": [
      {
        "containerPath": "/content",
        "sourceVolume": "efs-content"
      }
    ],
    "volumes": [
      {
        "name": "efs-content",
        "efsVolumeConfiguration": {
          "fileSystemId": "fs-abc123"
        }
      }
    ]
  }
}
Job Dependencies and Workflows
AWS Batch supports job dependencies, which allows us to model the processing pipeline as a directed acyclic graph. The transcoding job depends on the ingestion job. The QC job depends on the transcoding job. The delivery job depends on the QC job.
import boto3

batch = boto3.client('batch')

# Submit the ingestion job first...
ingest_job = batch.submit_job(
    jobName='ingest-asset-12345',
    jobQueue='high-priority',
    jobDefinition='media-ingest:3'
)

# ...then chain the transcode job so it runs only after ingestion succeeds.
transcode_job = batch.submit_job(
    jobName='transcode-asset-12345',
    jobQueue='spot-processing',
    jobDefinition='media-transcode:5',
    dependsOn=[{'jobId': ingest_job['jobId']}]
)
If any stage fails, downstream jobs do not run, and the retry policy at each stage handles transient failures automatically.
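A per-stage retry policy can be attached when registering the job definition. This is a hedged sketch: the names ("media-transcode", the ECR image URI) are carried over from the examples above, while retryStrategy and evaluateOnExit are the real Batch API fields.

```python
# Sketch of a per-stage retry policy for the transcode job definition.
# Names are illustrative (carried over from the article's examples);
# retryStrategy / evaluateOnExit are real AWS Batch API fields.
RETRY_STRATEGY = {
    "attempts": 3,  # total attempts, including the first run
    "evaluateOnExit": [
        # Retry when the host goes away (e.g. a Spot reclaim)...
        {"onStatusReason": "Host EC2*", "action": "RETRY"},
        # ...but fail fast on everything else, such as application errors.
        {"onReason": "*", "action": "EXIT"},
    ],
}

def register_transcode_definition(batch_client):
    """Register the transcode job definition; pass boto3.client('batch')."""
    return batch_client.register_job_definition(
        jobDefinitionName="media-transcode",
        type="container",
        containerProperties={
            "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/media-processor:latest",
            "vcpus": 4,
            "memory": 16384,
        },
        retryStrategy=RETRY_STRATEGY,
    )
```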
Storage Gateway for Hybrid Workflows
Not all of our content originates in the cloud. Some teams still work with on-premises editing suites and local storage. AWS Storage Gateway bridges this gap by exposing S3 storage as an NFS or SMB file share accessible from on-premises systems.
We deployed Storage Gateway in file gateway mode. On-premises systems write to what looks like a standard NFS mount, but the data is automatically synced to an S3 bucket. A Lambda function triggers on S3 object creation events to initiate the cloud-based processing pipeline.
On-Premises                             AWS
Edit Suite       S3 Bucket         Lambda         AWS Batch
    |                |                |               |
    |   NFS write    |                |               |
    | .............> |                |               |
    |                |    S3 Event    |               |
    |                | .............> |               |
    |                |                |  Submit Job   |
    |                |                | ............> |
The cache on the Storage Gateway appliance provides low-latency access to recently written files, while the full dataset lives durably in S3. This gives on-premises users the performance they expect while ensuring that all data is available for cloud processing.
Cache Sizing
Getting the cache size right is important. Too small, and frequently accessed files evict from cache, causing reads to pull from S3 over the WAN. Too large, and you are paying for local storage you do not need. We sized the cache to hold roughly two days of active working data, based on access pattern analysis. CloudWatch metrics on cache hit rates validated our sizing over time.
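The cache-hit validation mentioned above can be sketched against CloudWatch. Storage Gateway does publish a CacheHitPercent metric in the AWS/StorageGateway namespace; the gateway ID here is illustrative, and the client is injected so the sketch stays self-contained.

```python
# Sketch of validating Storage Gateway cache sizing via CloudWatch.
# The gateway ID is illustrative; CacheHitPercent is a real metric in
# the AWS/StorageGateway namespace.
from datetime import datetime, timedelta, timezone

def avg_cache_hit_percent(cw_client, gateway_id: str, days: int = 7):
    """Average CacheHitPercent over a lookback window.

    Pass boto3.client('cloudwatch') as cw_client.
    """
    end = datetime.now(timezone.utc)
    resp = cw_client.get_metric_statistics(
        Namespace="AWS/StorageGateway",
        MetricName="CacheHitPercent",
        Dimensions=[{"Name": "GatewayId", "Value": gateway_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,            # one datapoint per hour
        Statistics=["Average"],
    )
    points = [p["Average"] for p in resp["Datapoints"]]
    return sum(points) / len(points) if points else None
```

A sustained drop in this average is the signal that the cache is undersized for the current working set.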
Lambda as the Glue
Throughout these workflows, Lambda functions serve as the orchestration glue. They respond to events, trigger Batch jobs, update metadata in DynamoDB, send notifications to SNS topics, and handle error conditions. Each function is small, focused, and stateless.
A typical pattern:
- S3 event triggers Lambda on file upload.
- Lambda validates the file, extracts metadata, records it in DynamoDB.
- Lambda submits an AWS Batch job for processing.
- Batch job completion triggers a CloudWatch Event.
- Another Lambda function picks up the completion event, updates DynamoDB, and notifies downstream systems.
This event-driven approach eliminates polling, reduces coupling between components, and makes the system observable at every stage.
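The first three steps of that pattern can be sketched as a single handler. This is a minimal, hedged sketch: the queue and job definition names are carried over from the earlier examples, and the DynamoDB and SNS steps are elided to keep it short.

```python
# Sketch of the glue Lambda for steps 1-3 above. Queue/definition names
# come from the article's earlier examples; metadata and notification
# steps are elided. The client is injected to keep the sketch testable.
def handler(event, context, batch_client=None):
    """S3 ObjectCreated handler that submits one Batch job per new object."""
    if batch_client is None:  # real invocations build the client lazily
        import boto3
        batch_client = boto3.client("batch")
    submitted = []
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        resp = batch_client.submit_job(
            jobName="ingest-" + key.replace("/", "-"),
            jobQueue="high-priority",
            jobDefinition="media-ingest:3",
            parameters={"inputKey": key},  # passed through to the container
        )
        submitted.append(resp["jobId"])
    return {"submitted": submitted}
```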
Key Takeaways
EFS performance is not free. You need to understand performance modes, throughput modes, and burst credits before committing to EFS for performance-sensitive workloads. Provisioned throughput is expensive but necessary for predictable pipeline performance.
AWS Batch simplifies operations significantly. Building our own job scheduling and compute scaling system would have been months of engineering. Batch provides it out of the box with Spot integration for cost optimization.
Storage Gateway is genuinely useful for hybrid architectures. It is not the most exciting service in the AWS catalog, but for organizations with significant on-premises workflows, it provides a practical bridge that does not require rearchitecting everything at once.
Event-driven architectures shine for media workflows. The combination of S3 events, Lambda, and Batch creates a processing pipeline that scales from zero to massive throughput without any infrastructure to manage during idle periods.
Storage architecture at enterprise scale is a discipline unto itself. These services are tools, and like any tools, their value depends on understanding their strengths, limitations, and the specific problem you are applying them to.