An Introduction to Envoy AI Gateway
Organizations deploying AI applications face a fundamental challenge: no single model serves all needs. Developers may choose Claude for long-context analysis, OpenAI for reasoning tasks, and Llama for cost-sensitive workloads. The problem is that each model provider uses different APIs. Without centralized control, teams can't easily switch providers, get visibility into utilization, or enforce quotas.
Envoy AI Gateway (EAIG) is an open source project that solves this challenge by providing a single, scalable OpenAI-compatible endpoint that routes to any provider. It gives Platform teams cost controls and observability, while developers never touch provider-specific SDKs.
Built on top of Envoy Gateway, EAIG is specifically designed for handling LLM traffic. It acts as a centralized access point for managing and controlling access to various AI models within an organization.
When using EAIG, your applications call a single OpenAI-compatible endpoint. The gateway acts as a proxy between the developer using the model and the model provider. This abstraction lets you switch from Bedrock Claude to Bedrock Llama to self-hosted models to OpenAI, all without touching application code.
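As a sketch of what this looks like from the client side (assuming the gateway is reachable at $GATEWAY_URL, as in the walkthrough later in this post), switching providers is just a matter of changing the model field in an otherwise identical OpenAI-style request:

# Call Claude on Bedrock through the gateway
curl -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic.claude-3-sonnet-20240229-v1:0",
    "messages": [{"role": "user", "content": "Hello"}]
  }' $GATEWAY_URL/v1/chat/completions

# Call a different model behind the same gateway; only the "model" value changes
curl -H "Content-Type: application/json" \
  -d '{
    "model": "openai.gpt-oss-120b-1:0",
    "messages": [{"role": "user", "content": "Hello"}]
  }' $GATEWAY_URL/v1/chat/completions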
Besides using model providers, you can also self-host LLMs in your Kubernetes cluster. Self-hosting gives you more control over model deployment, data privacy, and infrastructure costs.
EAIG key features:
- Rate limiting based on tokens
- Automatic fallback routing to secondary model providers
- AI-specific observability [1]
Recap of Envoy Gateway Fundamentals
If you're already familiar with Envoy Gateway, you can skip this section.
EAIG builds on top of the standard Kubernetes Gateway API and Envoy Gateway extensions, so it helps to be familiar with the underlying Envoy Gateway primitives:
- GatewayClass - Defines which controller manages the Gateway. EAIG uses the same GatewayClass as Envoy Gateway.
- Gateway - The entry point for traffic. A Gateway resource defines listeners (HTTP/HTTPS ports). When you create a Gateway, Envoy Gateway deploys the actual Envoy proxy pods and a corresponding Kubernetes Service (typically a LoadBalancer). This is comparable to a Network Load Balancer (although technically you'd still need to attach an NLB to an Envoy Gateway to accept traffic from outside the Kubernetes cluster).
- HTTPRoute - Rules for routing HTTP traffic based on hostnames, paths, or headers. Conceptually, this is similar to Ingress rules or ALB listener rules.
- Backend - A Kubernetes Service or an external endpoint.
- BackendTrafficPolicy - Configures connection behavior like timeouts, retries, and rate limiting for a Gateway or HTTPRoute.
- ClientTrafficPolicy - Configures how the Envoy proxy server behaves with downstream clients.
- EnvoyExtensionPolicy - A way to extend Envoy's traffic processing capabilities.
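To make these primitives concrete, here's a minimal sketch of a Gateway with an HTTPRoute attached. The names, path, and backend Service are illustrative and have nothing to do with EAIG itself:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: eg
  listeners:
    # Listen for plain HTTP on port 80
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  # Attach the route to the Gateway above
  parentRefs:
    - name: example-gateway
  rules:
    # Send /api traffic to a regular Kubernetes Service
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: example-service
          port: 8080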
Concepts
EAIG introduces the following CRDs:
- AIGatewayRoute - Defines unified API and routing rules for AI traffic
- AIServiceBackend - Represents individual AI service backends like Bedrock
- BackendSecurityPolicy - Configures authentication for backend access
- BackendTLSPolicy - Defines TLS parameters for backend connections
Here's the high-level request flow [2]:
- The request comes into Envoy AI Gateway.
- An authorization filter checks whether the user or account is authorized to access the model.
- An AI service backend is identified by matching request headers such as the model name.
- The request is translated into the API schema of the AI service backend.
- The AI service authentication policy is applied based on the AI service backend:
  - AWS requests are signed and credentials are injected for AWS Bedrock authentication.
- A rate-limiting filter is applied for request-based usage tracking.
- Envoy routes the request to the selected AI service backend.
- Upon receiving the response from the AI service, token usage is extracted from the usage fields of the chat completion response and deducted from the token budget.
  - The rate limit is enforced on subsequent requests.
- The response is sent back to the client.
AIGatewayRoute
The AIGatewayRoute routes LLM traffic to one or more supported AI providers (including self-hosted models). It tells the gateway which models are available and how to reach them.
When you create an AIGatewayRoute, EAIG generates an HTTPRoute and HTTPRouteFilter (for URL rewriting).
Besides routing traffic, an AIGatewayRoute is also used to:
- Specify the input API schema for client requests
- Manage request/response transformations between different API schemas
- Track LLM token usage
EAIG can track input and output tokens, allowing administrators to define quotas and implement rate limiting. Here's an example of an AIGatewayRoute that exposes two models from Bedrock (Claude and GPT-OSS) while tracking token usage:
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  parentRefs:
    - name: eaig-bedrock
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: anthropic.claude-3-sonnet-20240229-v1:0
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: openai.gpt-oss-120b-1:0
      backendRefs:
        - name: eaig-bedrock
  # The following metadata keys are used to store the costs from the LLM request.
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
    # This configures the token limit based on the CEL expression.
    # For a demonstration purpose, the CEL expression returns 100000000 only when the input token is 3,
    # otherwise it returns 0 (no token usage).
    - metadataKey: llm_cel_calculated_token
      type: CEL
      cel: "input_tokens == uint(3) ? 100000000 : 0"
The llmRequestCosts section tells the gateway to extract token counts from each response and store them as metadata. Later, your BackendTrafficPolicy uses these values to enforce rate limits. For example: don't allow more than 10,000 input tokens per hour per user.
External Processor
ExtProc (the External Processing server) runs as a sidecar container in the Envoy proxy Pod. When a request hits the gateway, the gateway forwards it to ExtProc. ExtProc manipulates the request headers and body and sends them back to the gateway. The gateway then sends the modified request to the upstream service (like Bedrock).
ExtProc performs three functions:
- Schema translation - Converts OpenAI-format requests into the backend's API schema and translates backend responses back into the OpenAI format.
- Credential injection - Retrieves backend credentials from BackendSecurityPolicy and injects them into outgoing requests. For AWS Bedrock, this means adding AWS SigV4 signature headers.
- Token counting - After receiving the response from Bedrock, ExtProc extracts token usage from the response.
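If you want to see the sidecar for yourself after deploying the gateway (covered in the walkthrough below), you can list the containers in the Envoy proxy Pods. This is only an illustrative check; substitute the namespace where your proxy Pods actually run (envoy-gateway-system, or the Gateway's namespace when the GatewayNamespace deploy type is used):

# Print each Pod's name followed by its container names; the ExtProc sidecar appears next to the envoy container
kubectl get pods -n envoy-gateway-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'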
BackendTrafficPolicy
A BackendTrafficPolicy is used to rate-limit users for a particular AIGatewayRoute. Unlike traditional rate limiting where each request costs "1", token-based limits track the number of input tokens, output tokens, or a weighted combination of both.[3]
When Envoy AI Gateway receives a request, the ExtProc server extracts token counts from the LLM response and stores them in Envoy's dynamic metadata. The BackendTrafficPolicy then deducts these token counts from the budget.
For example, you can limit any user to 1000 input tokens per hour. The policy uses header values to determine rate limit budgets, so you can also limit by team, application, or any custom identifier your clients send.
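As a minimal sketch of that example (the policy name is illustrative; the full policy used in the walkthrough below covers all token types), a per-user budget of 1000 input tokens per hour could look like this:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: input-token-limit
  namespace: default
spec:
  targetRefs:
    - name: eaig-bedrock
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                # Each unique x-user-id value gets its own budget
                - name: x-user-id
                  type: Distinct
          limit:
            # Interpreted as "input tokens per hour" because the cost is taken from token metadata
            requests: 1000
            unit: Hour
          cost:
            request:
              from: Number
              # Don't consume budget on the request path; only check it
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_input_token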
AIServiceBackend
AIServiceBackend represents a single AI service backend (like Bedrock) that an AIGatewayRoute routes to and that handles traffic with a specific API schema. It:
- defines the output API schema the backend expects
- references a Kubernetes Service or an Envoy Gateway Backend
- is targeted by a BackendSecurityPolicy for authentication
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  schema:
    name: AWSBedrock
  backendRef:
    name: eaig-bedrock
    kind: Backend
    group: gateway.envoyproxy.io
An AIServiceBackend references a Backend that hosts the LLM, which can be a managed service like Bedrock or a Kubernetes Service.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  endpoints:
    - fqdn:
        hostname: bedrock-runtime.us-west-2.amazonaws.com
        port: 443
BackendSecurityPolicy
BackendSecurityPolicy configures authentication methods for backend access. In the case of Amazon Bedrock, it handles AWS SigV4 signing.
Currently, you must reference a Kubernetes Secret containing AWS API credentials. For production environments, using IAM roles (Pod Identity on EKS) is more secure than static credentials. However, Envoy AI Gateway currently requires a Secret reference even when using IAM roles. You can create a Secret with placeholder values that will be ignored in favor of the Pod's IAM role.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  targetRefs:
    - group: aigateway.envoyproxy.io
      kind: AIServiceBackend
      name: eaig-bedrock
  type: AWSCredentials
  awsCredentials:
    region: us-west-2
    credentialsFile:
      secretRef:
        name: eaig-bedrock
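The secretRef above points to a Secret like the one below (the walkthrough later in this post includes the same Secret). When relying on Pod Identity, the value is a deliberate placeholder that the gateway ignores in favor of the Pod's IAM role:

apiVersion: v1
kind: Secret
metadata:
  name: eaig-bedrock
  namespace: default
type: Opaque
stringData:
  # Placeholder; ignored when the Envoy proxy Pod authenticates via an IAM role
  credentials: |
    dummy-secret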
Deploying EAIG on an Amazon EKS Cluster
This walkthrough deploys Envoy AI Gateway on Amazon EKS to expose two LLMs from Amazon Bedrock. It is configured with token-based rate limiting enforced per user. Each user is identified by the x-user-id header and receives independent token budgets. The gateway handles AWS authentication via IAM roles. It also tracks token usage through Prometheus metrics.
Before proceeding, ensure Envoy Gateway is deployed in your cluster. If you don't have it installed, deploy it with Helm:
helm install eg oci://docker.io/envoyproxy/gateway-helm \
--version v0.0.0-latest \
--set config.envoyGateway.provider.kubernetes.deploy.type=GatewayNamespace \
-n envoy-gateway-system \
--create-namespace
Deploy Envoy AI Gateway using Helm:
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--set "controller.image.tag=536e1f3bf8e674669825fc41c8b2f06f3c803235" \
--create-namespace
The Gateway controller image is pinned to docker.io/envoyproxy/ai-gateway-controller:536e1f3bf8e674669825fc41c8b2f06f3c803235 because the latest tag was broken at the time of writing.
After installing Envoy AI Gateway, apply the AI Gateway-specific configuration to Envoy Gateway and restart the deployment:
kubectl apply -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-config/redis.yaml
kubectl apply -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-config/config.yaml
kubectl apply -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-config/rbac.yaml
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway
Create an IAM policy that allows access to the Amazon Bedrock InvokeModel and ListFoundationModels APIs:
aws iam create-policy \
--policy-name EnvoyAIGatewayBedrockAccessPolicy \
--policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:ListFoundationModels"
],
"Resource": "*"
}
]
}'
Create an IAM Role:
AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
cat >trust-relationship.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
"Effect": "Allow",
"Principal": {
"Service": "pods.eks.amazonaws.com"
},
"Action": [
"sts:AssumeRole",
"sts:TagSession"
]
}
]
}
EOF
aws iam create-role \
--role-name EnvoyAIGatewayBedrockAccessRole \
--assume-role-policy-document file://trust-relationship.json
aws iam attach-role-policy --role-name EnvoyAIGatewayBedrockAccessRole \
--policy-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:policy/EnvoyAIGatewayBedrockAccessPolicy
Create a Pod Identity Mapping:
CLUSTER_NAME=Socrates
BEDROCK_ROLE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:role/EnvoyAIGatewayBedrockAccessRole
aws eks create-pod-identity-association \
--cluster-name $CLUSTER_NAME \
--namespace envoy-gateway-system \
--service-account ai-gateway \
--role-arn $BEDROCK_ROLE_ARN
Before deploying the Gateway, ensure that you've enabled the following models in Bedrock:
- openai.gpt-oss-120b-1:0
- anthropic.claude-3-sonnet-20240229-v1:0
Create Envoy AI Gateway resources:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eaig-bedrock
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: envoy-ai-gateway
    namespace: envoy-gateway-system
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: envoy-ai-gateway
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        container:
          resources: {}
      envoyServiceAccount:
        name: ai-gateway
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  gatewayClassName: eaig-bedrock
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: client-buffer-limit
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: eaig-bedrock
  connection:
    bufferLimit: 50Mi
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  parentRefs:
    - name: eaig-bedrock
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: anthropic.claude-3-sonnet-20240229-v1:0
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: openai.gpt-oss-120b-1:0
      backendRefs:
        - name: eaig-bedrock
  # The following metadata keys are used to store the costs from the LLM request.
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
    # This configures the token limit based on the CEL expression.
    # For a demonstration purpose, the CEL expression returns 100000000 only when the input token is 3,
    # otherwise it returns 0 (no token usage).
    - metadataKey: llm_cel_calculated_token
      type: CEL
      cel: "input_tokens == uint(3) ? 100000000 : 0"
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  targetRefs:
    - group: aigateway.envoyproxy.io
      kind: AIServiceBackend
      name: eaig-bedrock
  type: AWSCredentials
  awsCredentials:
    region: us-west-2
    credentialsFile:
      secretRef:
        name: eaig-bedrock
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  schema:
    name: AWSBedrock
  backendRef:
    name: eaig-bedrock
    kind: Backend
    group: gateway.envoyproxy.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  endpoints:
    - fqdn:
        hostname: bedrock-runtime.us-west-2.amazonaws.com
        port: 443
---
apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: eaig-bedrock
  namespace: default
spec:
  targetRefs:
    - group: 'gateway.envoyproxy.io'
      kind: Backend
      name: eaig-bedrock
  validation:
    wellKnownCACertificates: "System"
    hostname: bedrock-runtime.us-west-2.amazonaws.com
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: eaig-bedrock-ratelimit-policy
  namespace: default
spec:
  # Applies the rate limit policy to the gateway.
  targetRefs:
    - name: eaig-bedrock
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        # This configures the input token limit, and it has a different budget than others,
        # so it will be rate limited separately.
        - clientSelectors:
            - headers:
                # Have the rate limit budget be per unique "x-user-id" header value.
                - name: x-user-id
                  type: Distinct
          limit:
            # Configures the number of "tokens" allowed per hour, per user.
            requests: 10
            unit: Hour
          cost:
            request:
              from: Number
              # Setting the request cost to zero allows to only check the rate limit budget,
              # and not consume the budget on the request path.
              number: 0
            response:
              from: Metadata
              metadata:
                # This is the fixed namespace for the metadata used by AI Gateway.
                namespace: io.envoy.ai_gateway
                # Limit on the input token.
                key: llm_input_token
        # Repeat the same configuration for a different token type.
        # This configures the output token limit, and it has a different budget than others,
        # so it will be rate limited separately.
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 10
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_output_token
        # Repeat the same configuration for a different token type.
        # This configures the total token limit, and it has a different budget than others,
        # so it will be rate limited separately.
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 10
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
        # Repeat the same configuration for a different token type.
        # This configures the token limit based on the CEL expression.
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 10
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_cel_calculated_token
---
apiVersion: v1
kind: Secret
metadata:
  name: eaig-bedrock
  namespace: default
type: Opaque
stringData:
  # Replace this with your AWS credentials.
  credentials: |
    dummy-secret
---
Testing
After deploying Envoy AI Gateway, we can call LLMs using the OpenAI API format. For simplicity, we'll use curl to invoke the model.
Get the Gateway's URL:
# Get the IP of the Gateway Service
kubectl get gateway/eaig-bedrock \
  -o jsonpath='{.status.addresses[0].value}'
Since the Gateway Service is of type ClusterIP, exec into a Pod running in your cluster and use curl to send requests from there. Remove the x-user-id header if you don't want to be rate limited:
export GATEWAY_URL=<IP of Gateway Service>
curl -H "Content-Type: application/json" \
-H "x-user-id: my-user-123" \
-d '{
"model": "anthropic.claude-3-sonnet-20240229-v1:0",
"messages": [
{
"role": "user",
"content": "Who was Aristotle? Tell it to me in less than 160 characters"
}
]
}' $GATEWAY_URL/v1/chat/completions
Script to generate requests of varying sizes:
#!/bin/bash
export GATEWAY_URL=k8s-envoygat-envoydef-149225385c-0cf2a2e68ca526f5.elb.us-west-2.amazonaws.com
while true; do
CHAR_COUNT=$((160 + RANDOM % 1068))
echo "=== Request at $(date) - Requesting $CHAR_COUNT characters ==="
curl -H "Content-Type: application/json" \
-d "{
\"model\": \"anthropic.claude-3-sonnet-20240229-v1:0\",
\"messages\": [
{
\"role\": \"user\",
\"content\": \"Who was Aristotle? Tell it to me in less than $CHAR_COUNT characters\"
}
]
}" \
"$GATEWAY_URL/v1/chat/completions" \
-v
echo -e "\n--- Waiting 20 seconds ---\n"
sleep 20
done
Observability
Envoy AI Gateway exposes GenAI-specific metrics following the OpenTelemetry GenAI semantic conventions.
These metrics can be put into three main buckets:
- Token usage - Number of tokens processed
- Request duration - Generative AI server request duration, such as time to last byte or time to last output token
- Target info - Metadata about the telemetry SDK
Here's a Grafana dashboard visualizing EAIG metrics.
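If you scrape these metrics with Prometheus, you can also query them directly. As a sketch, assuming the histogram is exported as gen_ai_server_request_duration_seconds (the same series the Prometheus adapter configuration later in this post relies on), the average request duration over the last five minutes is:

# Average GenAI request duration (seconds) over the last 5 minutes
sum(rate(gen_ai_server_request_duration_seconds_sum[5m]))
/
sum(rate(gen_ai_server_request_duration_seconds_count[5m]))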
Autoscaling Envoy AI Gateway
As your AI usage grows, a single Envoy proxy Pod may become a bottleneck, leading to increased latency, slower response times, or even request failures under high load. To handle this, you can use HPA to automatically scale the number of Envoy proxy pods based on resource utilization or custom metrics.
Under the hood, Envoy AI Gateway (EAIG) builds on Envoy Gateway, which manages the Envoy proxy deployment. The EnvoyProxy CRD supports HPA configuration directly through the envoyHpa field in its spec. This allows you to define scaling rules without manually creating an HPA resource. When configured, Envoy Gateway will generate and manage the HPA.
Below is a configuration that scales Envoy proxy Pods based on a custom metric. Note that you'd need metrics-server and Prometheus adapter installed in your cluster. You'll also need to update Prometheus Adapter to expose the metric.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: envoy-ai-gateway
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        container:
          resources: {}
      envoyServiceAccount:
        name: ai-gateway
      envoyService:
        type: ClusterIP
      envoyHpa:
        minReplicas: 1
        maxReplicas: 5
        metrics:
          - type: Pods
            metric:
              name: server_request_duration_seconds_avg
            target:
              type: AverageValue
              averageValue: "10"
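Because Envoy Gateway creates and manages the HPA on your behalf, you can confirm it was generated with a quick kubectl check (adjust the namespace to wherever your Envoy proxy Pods run):

kubectl get hpa -n envoy-gateway-system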
Here's a sample Prometheus adapter query that calculates and exposes average server request latency:
- seriesQuery: 'gen_ai_server_request_duration_seconds_sum{kubernetes_namespace!="",kubernetes_pod_name!=""}'
  resources:
    overrides:
      kubernetes_namespace: {resource: "namespace"}
      kubernetes_pod_name: {resource: "pod"}
  name:
    matches: "gen_ai_server_request_duration_seconds_sum"
    as: "server_request_duration_seconds_avg"
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>) / sum(gen_ai_server_request_duration_seconds_count{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
It is best to scale Envoy on either CPU metrics or listener-level metrics. Check the Envoy documentation to find the appropriate metric for your usage pattern.
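As a sketch of the CPU-based alternative, only the envoyHpa section of the EnvoyProxy resource above changes; the 80% utilization target is an illustrative value, not a project recommendation:

envoyHpa:
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale on average CPU utilization across the Envoy proxy Pods
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80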
Performance Impact
Any proxy adds latency: the time spent translating and routing requests between client and backend. For AI workloads where model inference already takes seconds, understanding this overhead helps you evaluate whether the gateway's benefits (unified API, rate limiting, observability) justify the additional latency.
I ran an unscientific benchmark comparing direct Bedrock calls against requests through Envoy AI Gateway. Both used the same prompt: "Explain Nietzsche's philosophy in about 1024 characters." The test ran 10 iterations of each approach using the AWS CLI for Bedrock and curl for the gateway.
Here was my test setup:
- Model: Claude 3 Sonnet (anthropic.claude-3-sonnet-20240229-v1:0)
- Region: us-west-2
- Gateway deployment: a single Envoy proxy Pod on EKS. The Pod ran on a dedicated c6a.large EC2 instance with no CPU/memory limits.
- Client: a Pod running in the same EKS cluster
Results
| Run | Bedrock Direct (ms) | Through EAIG (ms) | Overhead (ms) |
|---|---|---|---|
| 1 | 5,969 | 7,150 | 1,181 |
| 2 | 6,327 | 7,498 | 1,171 |
| 3 | 4,778 | 9,167 | 2,389 |
| 4 | 5,487 | 7,000 | 1,513 |
| 5 | 6,315 | 8,211 | 1,896 |
| 6 | 5,306 | 7,498 | 2,192 |
| 7 | 4,475 | 7,245 | 2,770 |
| 8 | 5,790 | 7,704 | 1,914 |
| 9 | 4,947 | 6,388 | 1,441 |
| 10 | 5,772 | 6,570 | 798 |
| Avg | 5,517 | 7,443 | 1,727 |
The gateway added an average of 1.7 seconds of overhead per request, representing a 31% increase over direct Bedrock calls. Overhead varied between 798 milliseconds and 2.7 seconds across the 10 test runs, with most requests falling in the 1-3 second range.
The 1.7-second overhead becomes negligible for long-running requests exceeding 10 seconds, where it represents less than 20% of total latency. Batch processing workloads and human-in-the-loop applications, where users already expect multi-second response times, absorb this delay without impact. However, the overhead matters for real-time applications requiring sub-second responses, high-throughput scenarios where milliseconds compound across thousands of requests, and cost-sensitive workloads where every additional second translates to more tokens consumed.
So that's the tradeoff of using Envoy AI Gateway: you have to tolerate a couple of seconds of additional latency per request. Maybe this will improve in future versions.
Another test I'd like to perform is understanding how gateway latency scales under concurrent load. In theory, Envoy's latency should get progressively worse as the number of clients increases. While autoscaling mitigates this by distributing load across multiple Envoy replicas, the gateway will never match the performance of direct Bedrock calls under high concurrency. There's inherent overhead that compounds with scale.
Unfortunately, I wasn't able to perform this test. My account hit Bedrock's account-level rate limits before reaching meaningful gateway saturation. A proper load test would require either higher Bedrock quotas or synthetic backends that don't impose rate limits, allowing measurement of pure gateway throughput independent of upstream capacity.
Closing Remarks
Envoy AI Gateway addresses a real challenge in enterprise AI deployments: managing multiple model providers through a unified interface. While its overhead may limit its use in real-time applications, the gateway provides valuable benefits for most AI workloads where response times already span multiple seconds.
The token-based rate limiting, provider abstraction, and observability features make it particularly useful for platform teams managing AI infrastructure at scale.