October 11, 2024

How and Why to Build an AI Gateway

By David Schuler

As more Generative AI applications are released into production, organizations face several challenges, including complying with AI safety policies, understanding user behavior, and ensuring applications are reliable and performant.

AI Gateways aim to solve these challenges and more. An AI Gateway acts as a central access point for AI in your organization, seamlessly integrating multiple model providers behind a single interface.

In this blog, we will discuss the benefits and challenges of implementing an AI Gateway and how to decide whether the architecture is right for your organization.

Benefits of AI Gateways

Safety

Centralized Guardrails

Safety should be at the top of anyone’s mind when developing an AI application. There are several ways to implement safety checks, but the onus is typically on the business team developing the application.

Integrating guardrails into the AI Gateway ensures that all applications are adhering to your organization’s safety policies. These can be implemented in a number of ways:

  • Sensitive data detection – PII or PCI detection can be baked into the gateway to ensure that no sensitive data is leaked to the model

  • Preventing hallucinations – many organizations use Retrieval-Augmented Generation (RAG) with vector embeddings to help control hallucinations; while retrieval mechanisms and vector databases normally sit in the application layer, an AI gateway can give developers access to a wider selection of vector embedding models

  • LLM-based safety – LLMs can be used to evaluate prompts and reject any that are potentially unsafe

Guardrails can be applied as both pre-processing and post-processing steps to ensure the safety of the model’s inputs and outputs.
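
To make this concrete, here is a minimal sketch of pre- and post-processing guardrails wrapping a model call. The regex patterns and the `call_model` callable are simplified placeholders; production systems typically delegate detection to dedicated services.

```python
import re

# Simplified PII patterns for illustration; a production gateway would
# typically call a dedicated detection service instead of regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pre_process(prompt: str) -> str:
    """Reject prompts containing sensitive data before they reach the model."""
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            raise ValueError(f"Prompt rejected: possible {name} detected")
    return prompt

def post_process(response: str) -> str:
    """Redact sensitive data in the model's output before returning it."""
    for pattern in PII_PATTERNS.values():
        response = pattern.sub("[REDACTED]", response)
    return response

def guarded_completion(prompt: str, call_model) -> str:
    """Wrap any model call with pre- and post-processing guardrails."""
    safe_prompt = pre_process(prompt)        # pre-processing guardrail
    raw_response = call_model(safe_prompt)   # provider call behind the gateway
    return post_process(raw_response)        # post-processing guardrail
```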

Moderation

Moderation is the active process of reviewing and managing AI-generated content to ensure it meets certain standards and guidelines. Moderation can include:

  • Rule-based filtering – rules can be configured to filter out any requests containing specific keywords or topics; a gateway should provide simple interfaces for writing and deploying these rules (see the sketch after this list)

  • Feedback and evaluation – an AI gateway should provide interfaces for capturing and recording feedback for review and continuous improvement

  • Audit – running all AI prompts and responses through a centralized gateway makes it easy to track and audit AI usage across an organization
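
Here is a minimal sketch of the rule-based filtering mentioned above. The rule schema and example rules are hypothetical; a real gateway would expose its own configuration interface.

```python
from dataclasses import dataclass

@dataclass
class ModerationRule:
    name: str
    blocked_keywords: list[str]

# Hypothetical rules a moderator might configure through the gateway.
RULES = [
    ModerationRule("no_financial_advice", ["stock tip", "guaranteed return"]),
    ModerationRule("no_credentials", ["password", "api key"]),
]

def moderate(prompt: str) -> list[str]:
    """Return the names of any rules the prompt violates."""
    lowered = prompt.lower()
    return [
        rule.name
        for rule in RULES
        if any(keyword in lowered for keyword in rule.blocked_keywords)
    ]

violations = moderate("Give me a stock tip with a guaranteed return")
if violations:
    print(f"Request blocked by rules: {violations}")  # -> ['no_financial_advice']
```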

A well-implemented AI gateway should give moderators and stewards all the tools needed to keep an organization safe. 

Approval Checkpoint

Many organizations have implemented review processes for AI projects to ensure they are being built responsibly. AI Gateways fit nicely into this workflow since teams must request access before using the gateway. If the gateway serves as the primary access point for AI in your organization, this ensures that all applications using AI have been reviewed properly.

An example of this workflow: a team submits its AI use case to a review board; once the project is approved, the team is granted gateway credentials scoped to the models its application needs.

It’s worth noting that granting access to the gateway can also be done earlier to enable experimentation for users who don’t have a fully-fledged product plan yet.

Simplification

Unified API Interface

AI Gateways provide a single interface for every model provider behind the gateway. This means developers can switch between calling an Anthropic Claude model and an OpenAI GPT model by changing a single parameter.

This flexibility enables developers to quickly evaluate the tradeoffs of accuracy, latency, and cost by pointing their application at different models behind the gateway.
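
Many gateways implement this by exposing an OpenAI-compatible endpoint, so the standard client works against any model behind it. The sketch below assumes that pattern; the base URL, key, and model identifiers are placeholders for your own deployment.

```python
from openai import OpenAI

# Point the standard client at the gateway instead of a provider directly.
# The URL, key, and model names below are placeholders, not real endpoints.
client = OpenAI(
    base_url="https://ai-gateway.internal.example.com/v1",
    api_key="app-scoped-key",
)

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Switching providers is a one-parameter change when the gateway normalizes the API.
print(ask("claude-3-5-sonnet", "Summarize our returns policy."))
print(ask("gpt-4o", "Summarize our returns policy."))
```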

Infrastructure Management

Teams and individuals using an AI Gateway don't need to worry about provisioning models themselves. This can be especially beneficial for individuals wanting quick access to an LLM for experimentation.

Flexibility

Models

Foundation models are still rapidly improving, and cloud providers frequently change which versions of models are supported via their APIs. Surfacing these models through an AI Gateway enables you to deploy them once and make them available to all consuming applications.

Regulation

The regulatory environment for AI is uncertain. An AI Gateway enables you to be nimble and have one primary system to update as regulation evolves. 

The gateway’s flexibility also allows different verticals to implement rules based on their regulatory requirements. For example, healthcare companies can configure rules to ensure they comply with the HIPAA Minimum Necessary rule. 

Observability

Logging

All requests and responses going through the gateway should be captured and logged. Once the data is logged, pipelines can be built to push it into a data warehouse for further analysis. This dataset can then be used for a number of different applications, such as:

  • Improving the accuracy and performance of your AI products

  • Auditing all requests and responses

  • Reporting on AI adoption

  • Monitoring cost of your AI products

Because the gateway serves as the central access point for AI, it simplifies the collection of this data.
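
As an illustration, the gateway might assemble a structured record like the one below for every request. The field names are hypothetical; align them with your warehouse schema.

```python
import json
import time
import uuid

def build_log_record(app_id: str, model: str, prompt: str, response: str,
                     prompt_tokens: int, completion_tokens: int,
                     latency_ms: float) -> dict:
    """Assemble one structured log record per gateway request."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "app_id": app_id,                       # which application called
        "model": model,                         # which model served the request
        "prompt": prompt,                       # consider redacting before storage
        "response": response,
        "prompt_tokens": prompt_tokens,         # inputs for cost and usage reporting
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
    }

record = build_log_record("sales-chatbot", "gpt-4o", "What were Q1 sales?",
                          "Q1 sales were ...", 12, 48, 850.0)
print(json.dumps(record))  # ship to your log pipeline / warehouse from here
```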

Monitoring & Alerting

In addition to traditional API monitoring metrics, AI Gateways should also monitor how consuming applications use the gateway and how the gateway consumes the models. One example is monitoring token consumption by model. If the gateway regularly nears its maximum Tokens per Minute (TPM) quota for a given model, it's important to be alerted so you can evaluate whether additional deployments are needed. Conversely, if specific models receive little traffic, it may be an opportunity to remove them and simplify your deployment.
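
A minimal sketch of such a quota check, assuming per-model token usage is already aggregated over a rolling one-minute window; the quotas and alert threshold are illustrative.

```python
# Hypothetical per-model TPM quotas and an alerting threshold.
TPM_QUOTAS = {"gpt-4o": 300_000, "claude-3-5-sonnet": 200_000}
ALERT_THRESHOLD = 0.85  # alert when usage exceeds 85% of quota

def check_tpm(usage_by_model: dict[str, int]) -> list[str]:
    """Return alert messages for models nearing their TPM quota."""
    alerts = []
    for model, used in usage_by_model.items():
        quota = TPM_QUOTAS.get(model)
        if quota and used / quota >= ALERT_THRESHOLD:
            alerts.append(
                f"{model}: {used:,}/{quota:,} TPM ({used / quota:.0%}) "
                "- consider adding a deployment"
            )
    return alerts

for alert in check_tpm({"gpt-4o": 270_000, "claude-3-5-sonnet": 40_000}):
    print(alert)  # -> gpt-4o: 270,000/300,000 TPM (90%) - consider adding a deployment
```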

Performance

Semantic Caching

Semantic caching can be configured to quickly return results for prompts that are frequently submitted to the AI Gateway. Semantic caching computes a similarity score between the incoming prompt and previous prompts, then returns the cached response if the score is above a configured threshold. Here's an example:

User 1 – “What were the sales for 2024 Q1?”

Later, User 2 prompts – “Give me the sales for Q1 in 2024”

Even though these are not identical prompts, they ask the same question. Semantic caching can quickly return an answer from the cache for User 2, improving performance and reducing cost.
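
A minimal sketch of the cache lookup, assuming prompt embeddings are computed elsewhere (in practice, by an embedding model); the similarity threshold is illustrative and should be tuned per workload.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

SIMILARITY_THRESHOLD = 0.92                # tune per workload
cache: list[tuple[list[float], str]] = []  # (prompt embedding, cached response)

def lookup(prompt_embedding: list[float]) -> str | None:
    """Return a cached response if a previous prompt is similar enough."""
    best_score, best_response = 0.0, None
    for embedding, response in cache:
        score = cosine_similarity(prompt_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

def store(prompt_embedding: list[float], response: str) -> None:
    """Record a new prompt/response pair for future cache hits."""
    cache.append((prompt_embedding, response))
```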

Load Balancing

AI Gateways have the ability to load balance across multiple deployments of a model. Benefits of load balancing include:

  • Improved resilience. If one model deployment is unavailable, the gateway can route to an active deployment.

  • Increased token capacity. Most cloud providers limit the TPM available to you. By creating multiple deployments of the same model and load balancing across them, you increase your system's total TPM.

While basic load balancing has clear benefits (a minimal failover sketch follows this list), AI Gateways can also include more intelligent load balancing, which opens up some interesting possibilities:

  • Latency-based routing. If a model is taking too long to respond, the gateway can route the request to a different model deployment or a different model entirely to fulfill the request.

  • Intent-based routing. You may find that certain models perform better for specific tasks than others. Gateways can aid in intent routing, where you first derive the intent of the user’s prompt using an LLM and then route it to the most appropriate model for that intent.
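
To make the basic case concrete, here is a minimal failover sketch across hypothetical deployments of the same model; `call_deployment` stands in for the real HTTP call to a provider.

```python
import random

# Hypothetical deployments of the same model across regions.
DEPLOYMENTS = {
    "gpt-4o": [
        "https://eastus.example.com/v1",
        "https://westus.example.com/v1",
    ],
}

def call_deployment(url: str, prompt: str) -> str:
    """Placeholder for the actual HTTP request to a model deployment."""
    return f"response from {url}"

def balanced_call(model: str, prompt: str) -> str:
    """Spread load randomly across deployments and fail over on errors."""
    endpoints = random.sample(DEPLOYMENTS[model], k=len(DEPLOYMENTS[model]))
    last_error = None
    for url in endpoints:
        try:
            return call_deployment(url, prompt)
        except ConnectionError as err:
            last_error = err  # deployment down; try the next one
    raise RuntimeError(f"All deployments for {model} failed") from last_error

print(balanced_call("gpt-4o", "Hello"))
```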

Access Control

AI Gateways enable you to set access policies to grant certain users or applications access to specific models. Most applications won’t need access to all models that are surfaced via the gateway, so it’s good practice to restrict access to only necessary models.

In addition to access, you can also set rate limits for tokens and requests by application to ensure that no application uses more than its share of model capacity. These limits can also be set for cost to prevent applications from racking up a large cloud bill.
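
For illustration, per-application policies covering model access, rate limits, and cost caps might be expressed as follows; the schema and numbers are hypothetical rather than any specific product's configuration format.

```python
# Hypothetical per-application policies: allowed models plus usage limits.
APP_POLICIES = {
    "sales-chatbot": {
        "allowed_models": {"gpt-4o-mini"},
        "tokens_per_minute": 50_000,
        "requests_per_minute": 300,
        "monthly_budget_usd": 500,
    },
    "research-assistant": {
        "allowed_models": {"gpt-4o", "claude-3-5-sonnet"},
        "tokens_per_minute": 200_000,
        "requests_per_minute": 100,
        "monthly_budget_usd": 2_000,
    },
}

def authorize(app_id: str, model: str) -> None:
    """Reject requests for models the application is not entitled to use."""
    policy = APP_POLICIES.get(app_id)
    if policy is None or model not in policy["allowed_models"]:
        raise PermissionError(f"{app_id} is not authorized to call {model}")

authorize("sales-chatbot", "gpt-4o-mini")  # passes
authorize("sales-chatbot", "gpt-4o")       # raises PermissionError
```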

Challenges of AI Gateways

Shared Model Capacity

When building an AI Gateway to serve as the central access point for LLMs, one challenge is having many applications working against the same shared pool of models. Understanding each application’s expected token and request usage is critical to ensuring that the gateway has enough capacity to serve its needs. 

As stated above, most gateways will allow you to implement rate limiting for applications, which can be one tool to solve this problem. However, it can be restrictive if any applications have unpredictable traffic patterns.

Another option to mitigate this problem is to provision dedicated model pools for business-critical applications. Since you can grant specific applications access to specific model pools, you can guarantee capacity for those applications even when the shared pool is saturated.

Cost Management & Governance

As stated above, an AI Gateway tracks the cost for each consuming application. While this data is beneficial, the cost is still typically fronted by the AI Gateway team. This raises a common challenge of managing cloud costs for a shared service: if you need to charge the gateway's consumers back for their portion of LLM spend, some integration work will be required to get the data into your FinOps system.

Guardrail Alignment & Implementation

While AI Gateways will support applying safety guardrails to all incoming and outgoing data, it can be challenging to identify which guardrails should be implemented. This needs to be a collaborative exercise between the gateway’s implementation team, security, and business domain experts to gain alignment on what to check for and what to filter out.

Implementing checks such as PII detection can also be challenging. Most gateways will support integration with external tools that can aid in PII / PCI detection to offload that complex logic.

Best Practices of AI Gateway Implementation

Create a Model Strategy

Once you have a gateway built, it can be tempting to place many different models behind it. However, this creates operational overhead and decision complexity for consumers. Instead, create a deliberate strategy for which models you surface. One simple approach is to offer one small, fast model; one medium model; and one large, highly capable model so teams can choose between speed, cost, and accuracy when building their applications.
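
One way to encode such a tiered catalog is to have consumers choose a tier rather than a specific model; the model names below are placeholders for whatever your strategy selects.

```python
# Hypothetical tiered catalog: consumers pick a tier, the gateway maps it to a model.
MODEL_TIERS = {
    "fast": "small-model-v1",       # lowest latency and cost
    "balanced": "medium-model-v1",  # general-purpose default
    "accurate": "large-model-v1",   # highest quality, highest cost
}

def resolve_model(tier: str) -> str:
    """Map a consumer-facing tier to the concrete model behind the gateway."""
    try:
        return MODEL_TIERS[tier]
    except KeyError:
        raise ValueError(
            f"Unknown tier '{tier}'; choose from {sorted(MODEL_TIERS)}"
        ) from None
```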

Set Rate Limits

When consumers start to onboard to the gateway, it's important to set token and request rate limits for their applications so that no single application can consume all of the available model capacity. If an application frequently hits its rate limit, that should trigger a discussion about why: Has demand for the application increased? Have users started regularly submitting longer prompts? From there, you can quickly adjust the application's rate limits as needed.

As with any rate-limited API, it's best practice for applications to implement retries when interacting with the gateway. Applications should assume they will hit their rate limit at some point and account for that in their code.
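
A minimal client-side sketch of retries with exponential backoff and jitter; `RateLimitError` stands in for whatever error your gateway surfaces on HTTP 429.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the error your gateway raises when a rate limit is hit."""

def call_with_retries(make_request, max_attempts: int = 5):
    """Retry rate-limited requests with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Back off exponentially (1s, 2s, 4s, ...) plus jitter to
            # avoid synchronized retries across clients.
            time.sleep(2 ** attempt + random.random())
```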

When Should You Consider an AI Gateway?

While AI Gateways can provide a lot of value to organizations, they are part of a mature AI Platform and should not be implemented by everyone. Here are some considerations when thinking about implementing AI Gateways:

  • Are AI applications or features being built across multiple, disparate areas of your business?

  • Are you required to monitor and audit AI usage across your company?

  • Do you need to ensure adherence to specific safety practices?

If you answered yes to any of these questions, it is probably time to consider implementing an AI Gateway. It should not be an immediate priority if you're just starting your GenAI journey.

Closing

AI Gateways are an important component for organizations building a robust AI Platform. By enabling access to AI through a central AI Gateway, you can expect application performance improvements, increased developer productivity, and stricter adherence to AI safety requirements.

If you have any questions or need further guidance on building or optimizing your AI Gateway, please contact phData. Our team of experts is ready to assist with tailored advice and solutions to help you make the most of your AI initiatives.
