The release of ChatGPT in late 2022 introduced generative artificial intelligence to the general public and triggered a new wave of AI-oriented companies, products, and open-source projects that provide tools and frameworks to enable enterprise AI.
In a 2023 survey, over three-quarters of executives said they believed artificial intelligence would disrupt their business strategy. Now, in 2024, companies that have not yet adopted Gen AI may be feeling the pressure of being left behind.
However, even though new foundational large language models, AI products, and AI startups are popping up daily, enterprise generative AI is still in its infancy, and LLMOps best practices are still being ironed out as more generative AI applications are developed and reach production.
The enterprise AI train is far from leaving the station, which is why now is the perfect time to begin exploring generative AI use cases and upskilling team members.
This blog provides an overview of the technology choices to consider when architecting and developing a generative AI product. It is the first in a four-part series discussing how to bring generative AI into production.
Generative AI Application Frameworks
Which technologies to choose depends on the intended use case of the AI application. An AI application designed only to answer simple queries without domain-specific knowledge can be built with a simple API call to a large language model such as OpenAI's GPT-4; however, the value-add of such an application is questionable.
For enterprises, the value-add of applications built on top of large language models is realized when domain knowledge from internal databases and documents is incorporated to enhance a model’s ability to answer questions, generate content, and perform other intended tasks. This differs from the more basic benefits of incorporating commoditized AI tools like ChatGPT and copilots into workflows: copilots can help organizations keep up with competitors, but incorporating domain knowledge is how to gain a competitive advantage and get ahead.
Common methods for incorporating knowledge include retrieval augmented generation (RAG), fine-tuning, or a combination of the two. Fine-tuning a large language model is less widespread, partly due to the rapid pace at which new language models are released.
A company can invest the time to clean and prepare a dataset and fine-tune a model, only for a new foundational model released a month later to outperform its fine-tuned version. That said, a smaller model fine-tuned on a specific task can perform on par with the latest and greatest foundational model at a much lower cost.
In today’s enterprise environment, most Gen AI applications are built on variations of the RAG-based framework. This approach incorporates relevant data from a data store into prompts, providing large language models with additional context to help answer queries.
Technology Choices for Generative AI Applications
Data Store
Vector databases have emerged as the go-to data store solution in demos and quickstarts for generative AI applications built with RAG. Unlike traditional relational databases that store data in rows and columns, vector databases store high-dimensional vector embeddings that represent unstructured data such as documents, images, and audio files.
In theory, if two vector embeddings are close to one another in vector space, then the underlying data the vectors represent are semantically similar. When a query, in the form of a vector, is sent to the vector database, the database returns the most similar vectors to the query.
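To make that idea concrete, here is a minimal sketch of how similarity between embeddings is commonly computed. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # A score near 1.0 means the embeddings point in nearly the same direction,
    # i.e. the underlying texts are likely semantically similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for a user query and two stored document chunks.
query_vec = np.array([0.12, 0.85, 0.30])
doc_a_vec = np.array([0.10, 0.88, 0.28])  # covers the same topic as the query
doc_b_vec = np.array([0.91, 0.05, 0.40])  # unrelated content

print(cosine_similarity(query_vec, doc_a_vec))  # high score -> returned to the application
print(cosine_similarity(query_vec, doc_b_vec))  # low score -> ranked lower
```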
Pinecone and Weaviate are popular managed vector database platforms that can efficiently scale to handle billions of documents and return relevant embeddings using an approximate nearest neighbor (ANN) algorithm. ANN algorithms are more efficient because they do not need to perform an exhaustive search over an entire index to find the most relevant vectors.
These vector database offerings also support hybrid search, a methodology that narrows the searchable embedding pool by keywords before searching semantically. Chroma is a popular open-source vector database with an ANN algorithm; however, it currently does not support hybrid search.
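As a quick illustration of the developer experience, here is a minimal sketch of indexing and querying a couple of document chunks with Chroma. The collection name, documents, and query are made up, and method details may vary between Chroma versions.

```python
import chromadb

client = chromadb.Client()  # in-memory client; persistent clients are also available
collection = client.create_collection(name="product_docs")

# Chroma embeds the documents with its default embedding function unless
# you supply precomputed embeddings.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "The Model X thermostat supports schedules with up to eight daily set points.",
        "To reset the device, hold the power button for ten seconds.",
    ],
)

# Approximate nearest-neighbor search over the stored embeddings.
results = collection.query(query_texts=["How do I reset my thermostat?"], n_results=1)
print(results["documents"][0][0])
```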
Vector databases may not always be the best data store and context retrieval solution for a RAG application. Depending on the scale and type of data used to provide context, graph databases, search engines, and relational databases all have the potential to be better alternatives.
Graph databases would be an ideal data store solution if your data is natively non-tabular with interconnected relationships. Graph databases store and manage the relationships between data using nodes, edges, and properties. They can efficiently use the managed relationships to query relevant information for additional prompt context.
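To sketch what retrieval from a graph database might look like, the example below queries a hypothetical Neo4j product graph with the official Python driver. The connection details, node labels, and relationship types are all assumptions for illustration.

```python
from neo4j import GraphDatabase

# Hypothetical connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def compatible_accessory_context(product_name: str) -> list[str]:
    # Traverse relationships the graph already maintains to collect context
    # (e.g. accessories compatible with a product) for the prompt.
    cypher = (
        "MATCH (p:Product {name: $name})-[:COMPATIBLE_WITH]->(a:Accessory) "
        "RETURN a.name AS accessory, a.description AS description"
    )
    with driver.session() as session:
        records = session.run(cypher, name=product_name)
        return [f"{r['accessory']}: {r['description']}" for r in records]

print(compatible_accessory_context("Model X Thermostat"))
```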
A traditional search engine would be an ideal data store solution for a use case with very few pieces of information. Take, for example, an AI application that answers questions about a product line with only a few thousand chunks of documentation in total. At that scale, a traditional search engine like Elasticsearch, which returns relevant documents using keyword-based (lexical) search, or a vector store built on the pgvector Postgres extension, which can use an exact nearest neighbor search, has the potential to return more relevant documents than a vector database that relies on an ANN search algorithm. The performance cost of an exact nearest neighbor search is unlikely to significantly impact the latency of a Gen AI application when dealing with indexes of only a few hundred or a few thousand documents.
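At that scale, an exact search can be as simple as the sketch below, which queries the pgvector Postgres extension through psycopg. The connection string, table, and columns are hypothetical, and the query embedding would come from whichever embedding model the application uses.

```python
import psycopg

# Hypothetical query embedding (real embeddings have many more dimensions).
query_embedding = [0.12, 0.85, 0.30]
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

with psycopg.connect("postgresql://user:password@localhost/docs") as conn:
    rows = conn.execute(
        # With no ANN index defined, the <=> cosine-distance operator scans every
        # stored embedding (an exact search), which stays fast at this small scale.
        "SELECT chunk_text FROM doc_chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        (vector_literal,),
    ).fetchall()

print([row[0] for row in rows])
```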
A feature store would be the ideal data store for maintaining and retrieving relevant structured data. While LLMs have unlocked the potential of unstructured data for enterprises, the value of the structured data that enterprises have refined and maintained in relational databases for decades has not diminished; structured data can also provide valuable context to a Gen AI application. Enterprises can leverage both structured and unstructured data for AI applications. For example, traditional structured data such as a user’s demographic information can be provided to an AI application to create a more personalized experience.
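As a sketch of how a feature store could feed structured context into a prompt, the example below pulls a user's attributes from a Feast online store and inlines them. The feature view, feature names, and entity key are hypothetical, and exact response handling may differ by Feast version.

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a hypothetical Feast feature repository

# Fetch structured features for the current user from the online store.
features = store.get_online_features(
    features=["user_profile:age_bracket", "user_profile:preferred_language"],
    entity_rows=[{"user_id": 1234}],
).to_dict()

prompt = (
    "You are a helpful support assistant.\n"
    f"User age bracket: {features['age_bracket'][0]}\n"
    f"Preferred language: {features['preferred_language'][0]}\n"
    "Use this profile to personalize your answer to the user's question."
)
print(prompt)
```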
Our data engineering blog in this series explores the concept of data engineering and data stores for Gen AI applications in more detail.
Foundational Model
In the context of Gen AI, a foundational model is a machine learning model with billions of parameters that has been trained on large, extensive language datasets to understand the complexity of natural language and perform general tasks. Foundational models, such as GPT-4, serve as a base (or foundation) that can be further fine-tuned toward specific domains and tasks for an AI application.
When prototyping an initial proof of concept for a Gen AI application, the simplest approach to accessing a foundational large language model is to use a third-party API from companies such as OpenAI and Anthropic. This option is appealing because it involves minimal upfront infrastructure costs, with the main expense being the usage fees from the LLM API.
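For example, an initial prototype can be as small as the sketch below, which sends a single request to OpenAI's chat completions API. The model name and prompts are placeholders, and the exact client interface can vary across SDK versions.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; pick the model that fits your task and budget
    messages=[
        {"role": "system", "content": "You are a helpful assistant for our support team."},
        {"role": "user", "content": "Summarize the key steps for onboarding a new customer."},
    ],
)
print(response.choices[0].message.content)
```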
Granted, many enterprises have security concerns about sending proprietary data to third parties over the internet. These concerns are why cloud providers have begun offering Model as a Service (MaaS) solutions as a safer alternative, providing companies with access to LLMs inside their own cloud environments. The generative AI solutions from GCP Vertex AI, AWS Bedrock, Azure AI, and Snowflake Cortex all provide access to a variety of industry-leading foundational models. This option also has minimal upfront infrastructure cost and operates on a pay-as-you-go pricing model.
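Calling a model through a MaaS offering looks similar. The sketch below invokes a model on AWS Bedrock with boto3; the model ID is a placeholder, and the request body format is an assumption that differs between model providers.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Anthropic-style request body, assumed for illustration; other model families
# on Bedrock expect different body formats.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Summarize our Q2 sales highlights."}],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```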
If a MaaS offering raises security or long-term cost concerns within an organization, an alternative is to self-host an open-source model. Self-hosting ensures that the organization has complete control over the data fed into the model, but it comes with greater upfront costs and challenges. Depending on the size of the model, self-hosting will require scalable infrastructure and powerful instances with enough GPUs and memory for efficient model inference. The model weights for open-source models can be downloaded from Hugging Face. It’s important to note that a model loaded directly from Hugging Face is not optimized for serving many concurrent users; frameworks such as vLLM or OpenLLM are required to meet deployment and high-throughput requirements in production. Self-hosting an LLM may also lead to more friction down the road when teams want to test out different models, as each model may have different hardware and software requirements for hosting.
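For reference, serving an open-source model with vLLM's offline inference API looks roughly like the sketch below. The model name is a placeholder, and loading it assumes the hosting instance has enough GPU memory for the weights.

```python
from vllm import LLM, SamplingParams

# Placeholder open-source model downloaded from Hugging Face on first use.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Explain retrieval augmented generation in two sentences."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```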
Deciding which foundational model(s) to use in a generative AI application will depend on the type and complexity of tasks the model will be given. Larger models have better overall performance but are costlier to run. Smaller models, though cheaper to run, have the potential to achieve comparable performance on specific tasks. For a production-ready generative AI application, testing and identifying the best model for each task is essential.
Agent
An LLM agent is the underlying logic that chains together the LLM-related steps of a generative AI application. For example, in a RAG solution, the agent would be the code that creates a vector embedding of the user query, queries the data store solution to retrieve relevant documents, combines the original user query and the relevant documents into a prompt that is passed to the LLM, and performs any post-processing on the model’s response.
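In code, that agent can be a short, framework-free function along the lines of the sketch below. Here, embed_text, vector_store, and call_llm are hypothetical stand-ins for whatever embedding model, data store client, and LLM client the application uses.

```python
def answer_query(user_query: str, embed_text, vector_store, call_llm) -> str:
    # 1. Create a vector embedding of the user query.
    query_embedding = embed_text(user_query)

    # 2. Retrieve the most relevant document chunks from the data store.
    documents = vector_store.search(query_embedding, top_k=5)
    context = "\n\n".join(doc.text for doc in documents)

    # 3. Combine the original query and retrieved context into a single prompt.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

    # 4. Call the LLM and post-process its response before returning it.
    raw_answer = call_llm(prompt)
    return raw_answer.strip()
```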
Beyond code-based patterns, agentic reasoning is often extended to more complex workflows that require models to decide which ‘tools’ to use to help answer a query. An example of a tool is an API that can search the internet. To help manage the complexity of Gen AI applications, frameworks that abstract portions of the agent logic from developers have emerged.
LangChain is often the first framework people hear about when exploring the generative AI space. The framework caters to developing generative AI applications that are context-aware or require reasoning through multiple steps. However, due to its code complexity and limited debugging observability, community sentiment has largely deemed LangChain a prototyping framework that is not ready for production. LangSmith, a paid offering from LangChain, does offer production-oriented features, including observability.
LlamaIndex and EmbedChain are frameworks that are more suited to RAG solutions. These two offerings have a more streamlined development process than LangChain. LlamaIndex has a famous “5 lines of code” starter tutorial, and EmbedChain’s quickstart can be completed in just under a minute.
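That LlamaIndex starter looks roughly like the sketch below, which indexes the files in a local data directory and queries them. Module paths and defaults vary across LlamaIndex releases, and the directory and question are placeholders.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # load local files to index
index = VectorStoreIndex.from_documents(documents)     # embed and store the chunks
query_engine = index.as_query_engine()                 # wrap the index in a RAG query engine
print(query_engine.query("What does our refund policy say about digital goods?"))
```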
Haystack is a framework that provides a flexible, composable architecture for developing AI applications. Unlike LangChain, which is designed to fully encompass the logic of an agent, Haystack is designed to provide discrete components that an agent’s logic can be built on top of.
It is an exciting time as the open-source community enhances and builds out these frameworks, but their best practices and feature sets are still being refined. Given the industry’s early stage, and depending on the complexity of a generative AI application, development teams may find that building custom in-house agent logic is a better option than adapting to an agent framework.
Getting Started with Generative AI
Hopefully, this overview of technologies has helped shed light on the building blocks behind generative AI applications. Assuming your company has established its AI strategy and business use case for generative AI, the easiest way to begin prototyping is to build on top of your existing cloud platform.
If your company uses the Snowflake AI Data Cloud, developers can utilize Snowflake Cortex, which has LLM and data store offerings, to build an entire generative AI application on the data cloud. To give another example, if your company is using AWS, developers can use AWS Bedrock’s MaaS offering, AWS Lambda to contain the agent logic, and a retrieval service such as Amazon Kendra to build an end-to-end RAG solution.
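A minimal sketch of that AWS pattern is the Lambda handler below, which retrieves passages from an Amazon Kendra index and passes them as context to a Bedrock model. The index ID, model ID, and request format are placeholders, and error handling is omitted.

```python
import json
import boto3

kendra = boto3.client("kendra")
bedrock = boto3.client("bedrock-runtime")

KENDRA_INDEX_ID = "your-kendra-index-id"              # placeholder
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # placeholder

def lambda_handler(event, context):
    question = event["question"]

    # Retrieve relevant passages from the Kendra index.
    retrieved = kendra.retrieve(IndexId=KENDRA_INDEX_ID, QueryText=question)
    passages = "\n\n".join(item["Content"] for item in retrieved["ResultItems"])

    # Build the prompt and call the foundational model through Bedrock,
    # using the same Anthropic-style body as the earlier MaaS sketch.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {"role": "user", "content": f"Context:\n{passages}\n\nQuestion: {question}"}
        ],
    }
    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    answer = json.loads(response["body"].read())["content"][0]["text"]

    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```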
Conclusion
This blog has only covered the minimum technologies required to build the bare bones of a generative AI application. The next blogs in this series will cover how to build data pipelines, implement a prompt engineering workflow, and build a testing and validation suite, which will evolve a proof-of-concept application into a production-ready application that enterprises can deploy with confidence.
Next up in our Gen AI 101 series is Part 2, which covers Data Engineering.
If your business is interested in an in-depth look into how it can best leverage Gen AI, we strongly recommend signing up for one of our free generative AI workshops. In this 90-minute session, your team can virtually meet with one of our Principal Data and AI Architects to start turning your AI ideas into a plan of action.