
VPC Lattice in Production

How we exposed AWS services to on-premises networks with overlapping CIDRs


👋 Hi, I’m @hpfpv ☁️ I’m a Cloud Infrastructure Architect | 8x AWS Certified 🚀 I build secure, scalable, and automated solutions on AWS using Terraform, CloudFormation, and CI/CD 📚 Always exploring hybrid cloud, serverless, and AI-driven architectures

Hey guys,

I just renewed my AWS Advanced Networking Specialty, and let me tell you… this one really humbled me.

Between all the new services, the subtle networking tricks, and the fact that I hadn’t done real deep network hands-on in a little while, I definitely had a few cold sweats during the exam.

And to make it worse, I waited 10 full days for the results - easily the most stressful part.

But hey, it’s done. I passed. 🙌🏾

And now that the pressure is over, I wanted to share something interesting.

During a recent project, and also while preparing for the exam, I rediscovered a service that I think doesn’t get the love it deserves: Amazon VPC Lattice.

It ended up helping us solve a very tricky service-exposure problem in a complex hybrid, multi-hop network architecture with overlapping CIDRs.

If you’ve ever dealt with that kind of setup, you already know: this is where traditional patterns start showing their limits.

So let’s talk about it.

Context and challenge

Let me describe what we were working with.

The client operates a multi-site enterprise network. Multiple physical locations, each with its own network equipment, all interconnected through an internal enterprise backbone.

Here's the complexity: every site uses the same internal IP ranges, with overlapping CIDRs across the board. To enable cross-site communication, each site exposes services using a pool of routable IPs combined with NAT.

One site, Site A, has a hybrid connection to AWS via Direct Connect. In AWS, we have multiple VPCs (Dev, Test, Prod, and a shared endpoints VPC) all connected through a Transit Gateway.

Site A has full connectivity to all its services - both on-premises and in AWS. Other sites can consume Site A's on-prem services through the routable IP + NAT setup. Everything works as expected.

The new objective:

Enable other sites on the enterprise network to consume services running in AWS, via Site A's Direct Connect connection, while minimizing impact on the existing landing zone infrastructure.

The challenges we needed to address:

  1. AWS private IP ranges are not routable from the enterprise network

  2. Risk of IP overlap between AWS and the enterprise network

  3. Need for bidirectional connectivity with integrated DNS resolution

The traditional approaches - chaining NAT devices, allocating more routable IPs (not even an option in our case), or setting up multiple ALBs with custom DNS - all had significant drawbacks in terms of complexity or operational overhead.

We needed a solution that could abstract the service exposure, handle routing elegantly, and integrate smoothly with the existing architecture.

That's where VPC Lattice came in.

What is VPC Lattice?

Amazon VPC Lattice is a managed application networking service that fundamentally changes how we think about service-to-service communication.

At its core, VPC Lattice abstracts away the traditional networking complexity. Instead of dealing with IP addresses, routing tables, and load balancers, you work with service networks - logical application layer constructs that handle connectivity, security, and observability for you.

Here's what makes it powerful:

Protocol flexibility: It supports HTTP, HTTPS, gRPC, TLS, and TCP. Whether you're running REST APIs, microservices, or legacy TCP applications, Lattice has you covered.

Compute agnostic: Works with ECS, EKS, EC2, Lambda - basically any AWS compute service. Also works with on-prem resources via IP address. You're not locked into a specific deployment pattern.

Security built-in: Native integration with AWS IAM for authentication and authorization. You can implement zero-trust principles without building custom proxy layers.

Hybrid-ready: And this is the key part for our use case - VPC Lattice can connect services across VPCs, accounts, and even between AWS and on-premises environments.

Read more about VPC Lattice here.
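To make these constructs concrete, here's a minimal plain-Python sketch of how they relate (this models the concepts only, not the AWS API; all names are made up):

```python
from dataclasses import dataclass, field

# Illustrative model of the core VPC Lattice constructs (not the AWS API).
@dataclass
class TargetGroup:
    name: str
    protocol: str  # HTTP, HTTPS, gRPC, TLS, or TCP
    targets: list = field(default_factory=list)  # ECS tasks, Lambda, EC2, on-prem IPs

@dataclass
class LatticeService:
    name: str  # each service gets its own DNS name
    target_group: TargetGroup

@dataclass
class ServiceNetwork:
    name: str
    services: list = field(default_factory=list)
    associated_vpcs: list = field(default_factory=list)

    def associate_vpc(self, vpc_id: str) -> None:
        # Association replaces route-table plumbing: an associated VPC can
        # reach every service in the network, regardless of where it runs.
        self.associated_vpcs.append(vpc_id)

    def add_service(self, svc: LatticeService) -> None:
        self.services.append(svc)

# One service network fronts services living in different VPCs and accounts.
sn = ServiceNetwork("enterprise-sn")
sn.add_service(LatticeService("service1", TargetGroup("tg-service1", "HTTPS", ["ecs:task-a"])))
sn.associate_vpc("vpc-prod")
```

The service network is the unit you expose and secure; services and their targets stay wherever they already run.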

VPC Lattice handles four main connectivity patterns:

  • Connecting applications within a single AWS Region

  • Enabling hybrid connectivity between AWS and on-premises systems

  • Managing internet-based application access to AWS services

  • Facilitating cross-region application communication

For our scenario, that second pattern - hybrid connectivity - was exactly what we needed.

Solution overview

Here's how we used VPC Lattice to solve the problem.

Instead of exposing individual AWS services through traditional NAT or load balancers, we created a VPC Lattice service network that acts as a controlled gateway layer.

At a high level, the architecture works as follows.

The flow

  1. Services in AWS (ECS tasks, Lambda functions, EKS workloads, RDS databases) are registered as targets in VPC Lattice

  2. Each service gets a Lattice service endpoint with its own DNS name

  3. VPCs are associated with the Lattice service network - this creates network connectivity without requiring complex routing

  4. The Lattice service network is exposed to Site A through a single, stable endpoint

  5. Site A publishes this endpoint to the enterprise network using one routable IP (remember - we had limited routable IPs available)

  6. Other sites (Site B, Site C) can now consume AWS services through standard DNS resolution and HTTP/HTTPS calls

DNS resolution for on-premises consumers required a custom DNS configuration, since remote sites can't directly resolve Lattice endpoints (resolving one would return the non-routable IP of the Lattice endpoint):

  1. A remote site wants to call an AWS service: service1.sitea.aws.cloud

  2. The enterprise DNS returns the routable IP of Site A's NAT device - not an AWS IP

  3. Traffic hits the NAT at Site A, which rewrites the source IP (from routable to non-routable internal IP)

  4. The NAT forwards the request to the Lattice endpoint located in the internal perimeter VPC (reachable via Direct Connect and Transit Gateway)

  5. Lattice inspects the host header and path, then routes the request to the appropriate target service in the correct VPC

  6. The response flows back through the same path: Lattice → TGW → DX → NAT → Enterprise Network → Remote Site
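The resolution-and-routing chain above can be sketched as a small simulation (all IPs, hostnames, and targets are made up, and the NAT and Lattice behavior is heavily simplified):

```python
SITE_A_ROUTABLE_IP = "198.51.100.10"  # the single routable IP published at Site A's edge

# Enterprise DNS: every exposed AWS service name resolves to Site A's NAT,
# never to a (non-routable) Lattice endpoint IP.
enterprise_dns = {
    "service1.sitea.aws.cloud": SITE_A_ROUTABLE_IP,
    "service2.sitea.aws.cloud": SITE_A_ROUTABLE_IP,
}

# Lattice-side routing: the Host header (and optionally the path)
# selects the target service in the right VPC.
lattice_routes = {
    "service1.sitea.aws.cloud": ("vpc-prod", "ecs-service-1"),
    "service2.sitea.aws.cloud": ("vpc-test", "lambda-service-2"),
}

def resolve(hostname: str) -> str:
    """Steps 1-2: enterprise DNS returns the routable NAT IP."""
    return enterprise_dns[hostname]

def route_request(host_header: str):
    """Steps 3-5: the NAT forwards to the Lattice endpoint; Lattice
    inspects the Host header and dispatches to the matching target."""
    return lattice_routes[host_header]

ip = resolve("service1.sitea.aws.cloud")
target = route_request("service1.sitea.aws.cloud")
```

The key point: every service name resolves to the same routable IP, and the per-service fan-out happens at Layer 7 inside Lattice.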

What this gives us

Single point of exposure: Only one routable IP needed at the Site A edge. All AWS services are accessed through the Lattice service network.

No IP overlap issues: VPC Lattice operates at Layer 7 (application layer). Services are identified by DNS names, not IP addresses. This completely bypasses the overlapping CIDR problem.

Integrated service discovery: Each service registered in Lattice gets a custom DNS entry in a private hosted zone shared across VPCs (check out this post for more). For on-premises access, we used custom DNS that points to Site A's NAT, which then forwards to Lattice. This gave us consistent service names across both environments.

Fine-grained access control: We can use IAM policies and Lattice auth policies to control exactly which services are accessible from on-premises, down to specific API paths if needed.

Simplified routing: The Transit Gateway only needs to know how to reach the Lattice service network association. All the service-level routing is handled by Lattice itself.

Scalability: Adding new services is trivial - register them in Lattice, update DNS, done. No firewall rules, no NAT configuration, no routing table updates.
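That day-2 story can be sketched in a few lines (illustrative data structures and made-up names; the real steps are a Lattice registration plus a DNS record):

```python
# Exposing a new AWS service touches exactly two things: a Lattice
# registration and a DNS record. No NAT rules, no firewall changes,
# no routing-table updates.
SITE_A_NAT_IP = "198.51.100.10"  # made-up routable IP at Site A's edge

enterprise_dns = {}   # service name -> routable IP published to the enterprise
lattice_routes = {}   # service name -> (VPC, target) handled inside Lattice

def expose_service(name, vpc, target):
    lattice_routes[name] = (vpc, target)  # "register it in Lattice"
    enterprise_dns[name] = SITE_A_NAT_IP  # "update DNS" - still one routable IP

expose_service("service3.sitea.aws.cloud", "vpc-dev", "eks-service-3")
expose_service("service4.sitea.aws.cloud", "vpc-prod", "ecs-service-4")
```

However many services you add, the set of published routable IPs never grows past one.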

Why this worked

The key insight here is that VPC Lattice abstracts the network layer complexity into application-layer service connectivity. Traditional approaches would have required:

  • Multiple routable IPs (one per exposed service) - we didn't have them

  • Complex NAT chains to handle IP translation - operational nightmare

  • Custom DNS infrastructure to route requests - more moving parts

  • Manual routing table management across VPCs - doesn't scale

VPC Lattice eliminated all of that.

We went from "how do we expose these 100+ services without running out of routable IPs and creating a routing mess" to "register services in Lattice, associate them with the service network, publish one endpoint."

The on-premises teams at Site B and Site C just got a list of DNS names exposing the services. From their perspective, they're making standard HTTPS calls. They don't know (or care) about the underlying AWS networking complexity.

And from our perspective, we have centralized visibility, IAM-based security, and the ability to add or remove services without touching the underlying network infrastructure.

Alternative approaches

Before we settled on VPC Lattice, we evaluated the standard approach: combining Network Load Balancers (NLB) and Application Load Balancers (ALB).

The traditional flow would be:

DNS resolution → NAT at Site A → NLB in perimeter VPC → ALB → Target service

This requires configuring NAT per environment, creating target groups and listener rules per application at the ALB level, and managing routing across multiple layers.

Why we didn't go this route:

While the NLB+ALB approach offers very good performance (sub-millisecond latency) at a lower cost, it came with deal-breakers.

Assuming the NLB and ALB live in the perimeter VPC, here's where the architecture breaks down:

  • The NLB gives you a static IP for NAT mapping (maybe one per environment if traffic segregation is required)

  • The ALB provides listener rules based on host headers or paths - great for routing

  • But ALB target groups only support EC2 instances, Lambda functions (in the same VPC/account), or IP addresses

For services running in different VPCs or accounts - which is our exact scenario - you can't just point the ALB at an ECS service or an EKS pod directly. You need static IPs.

This forces application teams in other accounts/VPCs to deploy additional NLBs in their own environments just to get a static IP that the central ALB can target. But most modern services run on containers (EKS/ECS), which typically use ALBs with dynamic, non-static IPs.

So you end up with this mess:

  • Central perimeter: NLB → ALB

  • Each application account/VPC: Another NLB just for IP stability → ALB (for the actual service) → ECS/EKS targets

That's two layers of load balancers per service, duplicated infrastructure across accounts, and significant operational overhead just to work around ALB's targeting limitations.

Other challenges:

  • Fragmented security: Policies scattered across NAT, multiple NLBs, multiple ALBs, and target security groups

  • Manual observability: No unified view. You're correlating logs across multiple load balancers, VPCs and accounts

  • High operational complexity: Every new service requires coordinating across network teams (NAT rules), central platform teams (perimeter ALB), and application teams (their own NLB)

VPC Lattice eliminated the operational overhead, centralized security and observability management, and gave us the flexibility to scale.

Key recommendations

If you're considering VPC Lattice for hybrid connectivity:

  • Evaluate your architecture complexity. If you have services across multiple VPCs and accounts, Lattice eliminates the need for application teams to deploy extra infrastructure just to get static IPs for routing.

  • Check protocol requirements. Lattice supports HTTP/HTTPS/gRPC and TCP. FTP and other legacy protocols need alternative paths.

  • Enable observability from day one. CloudWatch Logs and access logging are Lattice's biggest advantages - don't waste them.

  • Document DNS patterns clearly. The custom DNS + NAT flow for on-premises isn't intuitive. Clear docs save support tickets.

  • Weigh cost vs. operational complexity. Lattice costs more but simplifies operations dramatically. Calculate both direct costs and operational overhead.

  • Use IAM policies for service-level security. Don't rely solely on security groups. Lattice's auth policies enable fine-grained, identity-based access control.
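For that last point, a Lattice auth policy is an IAM-style resource policy attached to a service or service network. Here's a sketch of its general shape - the account ID and resource scope are illustrative, and you should check the current Lattice documentation for the exact condition keys your use case needs:

```python
import json

# Sketch of a VPC Lattice auth policy. "vpc-lattice-svcs:Invoke" is the
# invoke action; the principal and resource below are hypothetical.
auth_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Hypothetical account allowed to call the service
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "vpc-lattice-svcs:Invoke",
            "Resource": "*",
        }
    ],
}

policy_json = json.dumps(auth_policy, indent=2)
```

With IAM auth enabled on the service, a policy like this is attached via the PutAuthPolicy API, and callers must sign their requests with SigV4.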