This article was co-written by Lawrence Liu & Safwan Islam
While the title ‘Machine Learning Engineer’ may sound more prestigious than ‘Data Engineer’ to some, the reality is that these roles share a significant overlap. Both roles are responsible for implementing data pipelines with best practices; machine learning engineers have the added task of serving the final data product to deployed machine learning models.
Generative AI has unlocked the value of unstructured text-based data. This does not mean we’ve left the world of data engineering behind; unstructured data still requires robust data pipelining. Just as machine learning applications rely on data engineering, so do generative AI applications. Without data engineering, the development of generative AI would not be possible.
In the second part of our AI 101 series, we’ll explore why having a robust data engineering program is vital for Gen AI success.
Where Data Engineering Fits in a Gen AI Application
Out of the box, large language models (LLMs) possess good foundational knowledge. However, little value is added if a generative AI application relies solely on a model’s foundational knowledge without additional context or training/fine-tuning. The retrieval augmented generation (RAG) framework was introduced to address this issue.
RAG is a framework that retrieves additional information from a knowledge base to augment the prompt input provided to the LLM with relevant context. This additional context will enable the LLM to produce the best possible result given the information in the knowledge base. Data engineering comes into play when processing and storing structured and unstructured data into formats that are ready to be used as context for LLM prompts.
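To make that flow concrete, here is a minimal sketch of how retrieved chunks might be combined with a user question into an augmented prompt. It assumes the OpenAI Python client and a hypothetical retrieve_relevant_chunks helper that queries your knowledge base; both are illustrative rather than a prescribed implementation.

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_rag(question: str, retrieve_relevant_chunks) -> str:
    """Minimal RAG sketch: retrieve context, augment the prompt, call the LLM."""
    # retrieve_relevant_chunks is a hypothetical helper that returns a list of
    # text snippets from your knowledge base (e.g., a vector database query).
    chunks = retrieve_relevant_chunks(question, top_k=3)
    context = "\n\n".join(chunks)

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```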
PDFs, internal wiki pages, and Slack threads are examples of unstructured data that contain valuable textual information that can be provided as context for an LLM. That being said, these sources in their native form are neither in a location nor in a format that is easily accessible to a RAG-based generative AI application.
Data engineering is required to process and store these unstructured data sources in a vector database or search engine. With the unstructured data processed and in a data store, a generative AI agent can more easily retrieve the most relevant information as context for the LLM.
Even though LLMs have unlocked the value of unstructured data, structured data remains valuable for generative AI applications, offering relevant contextual information.
For example, let’s say you are building a chatbot for your customers. Creating a central structured data store of customer information, such as customer name, age, and preferences, can enable an LLM to generate more personable responses.
Data Engineering for Structured Data
The concept of processing structured data and storing it in a centralized data store is traditional data engineering; it is not a new concept introduced by generative AI applications. Data may initially reside in its raw form across multiple databases or SaaS applications. A data integration tool like Fivetran can efficiently centralize all relevant data into a single data store. Once the data is in a centralized data store, such as the Snowflake AI Data Cloud, an orchestrator and transformation tool like dbt can transform it into features that provide valuable context to an LLM.
What we’ve described as “a centralized data store with pipelines that transform data into features” is, in essence, a feature store.
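As a rough illustration, the sketch below pulls a customer’s transformed features from a centralized store and formats them as prompt context. The connection details are placeholders and the CUSTOMER_FEATURES table is hypothetical; any warehouse or feature store would work similarly.

```python
import snowflake.connector  # assumes the snowflake-connector-python package


def fetch_customer_context(customer_id: str) -> str:
    """Query transformed customer features and format them as LLM prompt context."""
    conn = snowflake.connector.connect(
        account="your_account",    # placeholder connection details
        user="your_user",
        password="your_password",
        warehouse="your_warehouse",
        database="ANALYTICS",
        schema="FEATURES",
    )
    try:
        cur = conn.cursor()
        # CUSTOMER_FEATURES is a hypothetical dbt-built table of per-customer features.
        cur.execute(
            "SELECT name, age, preferences FROM CUSTOMER_FEATURES WHERE customer_id = %s",
            (customer_id,),
        )
        name, age, preferences = cur.fetchone()
    finally:
        conn.close()

    # Format the structured features as plain-text context for the prompt.
    return f"Customer profile: name={name}, age={age}, preferences={preferences}"
```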
Data Engineering for Unstructured Data
In the context of using a vector database, before unstructured data can be used to provide additional context in an LLM prompt, it must first be pre-processed, chunked, and tokenized.
Pre-Processing
The purpose of pre-processing is to transform our raw unstructured data into a textual format that an embedding model has been designed and trained to tokenize.
The approach to pre-processing varies depending on the original format of the unstructured data. While embedding models typically excel at tokenizing text, they struggle to understand document structures such as headings, lists, sublists, or tables. Therefore, when pre-processing data from web pages, a common step is to remove HTML tags.
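For example, a minimal pre-processing step for web pages might strip HTML tags with BeautifulSoup (a common choice; other parsers work just as well):

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


def html_to_text(raw_html: str) -> str:
    """Strip HTML tags and collapse the page down to plain text for embedding."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop script/style elements whose contents are never useful context.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text() returns the visible text; the separator keeps words from running together.
    return soup.get_text(separator=" ", strip=True)
```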
Through experience, we’ve also learned that embedding models often struggle with specialized industry terms and acronyms, leading to poor mapping of similar critical concepts. One way to address this is to replace acronyms with their fully spelled-out forms.
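A simple way to implement this is a domain-specific acronym map applied before embedding; the entries below are placeholders for whatever terms your industry uses:

```python
import re

# Hypothetical domain glossary: map acronyms to their fully spelled-out forms.
ACRONYM_MAP = {
    "SLA": "service level agreement",
    "PII": "personally identifiable information",
}


def expand_acronyms(text: str) -> str:
    """Replace known acronyms with spelled-out forms so embeddings capture their meaning."""
    for acronym, expansion in ACRONYM_MAP.items():
        # \b word boundaries avoid replacing substrings inside other words.
        text = re.sub(rf"\b{re.escape(acronym)}\b", expansion, text)
    return text
```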
Many agent frameworks and data store solutions come with built-in tools for pre-processing. For example, LangChain provides a plethora of loaders for different document types that help abstract the loading and parsing of raw, unstructured data.
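For instance, a PDF can be loaded with one of LangChain's document loaders; this brief sketch assumes the langchain-community package (plus pypdf) and a hypothetical local file path:

```python
from langchain_community.document_loaders import PyPDFLoader  # assumes langchain-community

# Hypothetical local file; each page becomes a Document with text and metadata.
loader = PyPDFLoader("internal_policy.pdf")
documents = loader.load()

for doc in documents:
    print(doc.metadata.get("page"), doc.page_content[:100])  # peek at the parsed text
```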
Chunking
The concept of chunking is separating the processed data into logical chunks of information. Chunks can be defined as an entire document, individual paragraphs, or individual sentences. When a query is submitted, the most relevant chunks will be retrieved and included as additional context in the prompt to help an LLM formulate an answer.
Common chunking techniques include:
Chunking by Delimiters: This method uses delimiters like "\n\n" to separate paragraphs or "." to separate sentences.
Fixed Character Length Window: Text is divided into chunks based on a fixed number of characters. Common window lengths to experiment with are 128, 256, 512, and 1024 characters.
Chunking with Overlap: To prevent losing valuable context that may be relevant to multiple chunks, the text is chunked using either a delimiter or a fixed character length window with some overlap between chunks, as shown in the sketch below.
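Here is a minimal sketch of fixed character length chunking with overlap in plain Python; the window and overlap sizes are just starting points to tune:

```python
def chunk_text(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-length character windows that overlap by `overlap` characters."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window size")
    chunks = []
    step = window - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window])
    return chunks


# Example: 512-character chunks that share 64 characters with their neighbors.
chunks = chunk_text(open("processed_document.txt").read())
```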
A more advanced chunking technique is semantic chunking. This technique involves calculating the semantic similarity between consecutive sentences and grouping similar sentences into chunks. When the similarity between a sentence and the one before it falls below a set threshold, a new chunk begins.
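A rough sketch of semantic chunking is shown below. It assumes a hypothetical embed_sentences helper (for example, a call to an embedding model) and uses cosine similarity between consecutive sentences to decide chunk boundaries; the 0.7 threshold is arbitrary and should be tuned.

```python
import numpy as np


def semantic_chunks(sentences: list[str], embed_sentences, threshold: float = 0.7) -> list[str]:
    """Group consecutive sentences into chunks, starting a new chunk when similarity drops."""
    if not sentences:
        return []
    # embed_sentences is a hypothetical helper returning one vector per sentence.
    vectors = np.array(embed_sentences(sentences))
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity < threshold:
            chunks.append(" ".join(current))  # similarity dropped: close the current chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```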
There is no one-size-fits-all chunking strategy. Determining the best chunking strategy for your application requires iterative testing.
Smaller chunks will make prompting more cost-effective and may even be necessary if an LLM has a smaller context window. However, a drawback of using smaller chunks is that each chunk may not contain enough information to help answer a user’s question.
Larger chunks increase the likelihood that enough of the necessary information is provided as context to a prompt. One example of a large chunking strategy is to not chunk documents at all. As LLM context windows have grown and models have become better at accurately recalling information from lengthy contexts, providing entire documents as context ensures enough information is available to help the LLM formulate an answer. While this strategy is the most costly and will increase inference latency, it will likely produce the best accuracy.
Tokenizing
Once the unstructured data has been pre-processed and chunked, the last step is to tokenize the data using an embedding model. Tokenizing is a mathematical operation that creates a vector representation of each chunk. These vectors are stored in a vector database, which can then find semantically similar chunks by comparing their vectors.
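Continuing the sketch, the chunks can be embedded in one batched call. Here we assume OpenAI's text-embedding-3-small model, and a plain Python list stands in for a real vector database:

```python
from openai import OpenAI

client = OpenAI()


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Tokenize and embed each chunk, returning one vector per chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,  # the API accepts a list of chunks in one request
    )
    return [item.embedding for item in response.data]


# From the chunking step; in a real application these vectors would be upserted
# into a vector database alongside the chunk text and any metadata.
chunks = ["First processed chunk of text.", "Second processed chunk of text."]
vector_store = list(zip(chunks, embed_chunks(chunks)))
```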
From the perspective of a RAG application, when a user sends a query, that query is tokenized, and its vector is used to query the vector database to retrieve the most relevant chunks based on vector similarity.
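The sketch below does that similarity math by hand in NumPy against the (chunk, vector) pairs from the previous sketch; a vector database would perform this search server-side. The embed argument is the same embedding function used to build the store (e.g., embed_chunks above).

```python
import numpy as np


def retrieve(query: str, vector_store, embed, top_k: int = 3) -> list[str]:
    """Embed the query and return the top_k most similar chunks by cosine similarity."""
    query_vector = np.array(embed([query])[0])
    scored = []
    for chunk, vector in vector_store:
        vector = np.array(vector)
        similarity = float(
            np.dot(query_vector, vector)
            / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
        )
        scored.append((similarity, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```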
Popular embedding models include text-embedding-3-small and text-embedding-3-large from OpenAI. In 2023, the go-to embedding model was text-embedding-ada-002, also from OpenAI. The “best” embedding model is constantly changing, with newer models released as frequently as, if not more frequently than, new foundational large language models.
The MTEB (Massive Text Embedding Benchmark) leaderboard hosted by Hugging Face ranks embedding models by their performance over a set of embedding tasks. For RAG applications, the leaderboard has a retrieval tab that ranks embedding models on retrieval performance. It’s important to note that this leaderboard contains self-reported results, and many of the models have been fine-tuned to benchmark tasks whose performance may not translate to your use case.
Choosing a well-performing embedding model from the MTEB leaderboard requires additional considerations beyond benchmark task scores. First, model licenses can legally impact your ability to use a given model. Second, some embedding models have smaller context sizes, which can lead to silent truncation: only a portion of the provided document is embedded, and no error is raised. Additionally, some models may weight text that appears earlier in a document more heavily than text that appears later (or vice versa), which can skew the vector representation of your documents.
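One way to guard against silent truncation is to count tokens before embedding, for example with the tiktoken library. The limit below roughly matches OpenAI's text-embedding-3 models; adjust it for the model you use.

```python
import tiktoken  # assumes the tiktoken package is installed

MAX_EMBEDDING_TOKENS = 8191  # approximate input limit for OpenAI's text-embedding-3 models


def fits_embedding_context(chunk: str) -> bool:
    """Return True if the chunk fits within the embedding model's context window."""
    encoding = tiktoken.get_encoding("cl100k_base")
    token_count = len(encoding.encode(chunk))
    if token_count > MAX_EMBEDDING_TOKENS:
        print(f"Chunk has {token_count} tokens and would be silently truncated.")
        return False
    return True
```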
Data Security
Data security is a crucial concern in generative AI applications. Developers must be security-aware when populating data stores. Security measures such as row-level security or role-based access control on data stored in a structured feature store ensure that only a user's own information is provided as context to the LLM.
Ideally, data stored in vector databases is strictly limited to documents that end-users have permission to access. For more granular control, managed vector database offerings like Pinecone have document-level security features in the pipeline, though current workarounds include metadata filtering or storing sensitive information in separate indexes.
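As an illustration of the metadata filtering workaround, a Pinecone query can be restricted to documents tagged with groups the requesting user belongs to. The index name, metadata field, and group values here are hypothetical, and the exact filter syntax should be checked against Pinecone's current documentation.

```python
from pinecone import Pinecone  # assumes the pinecone client package

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder credentials
index = pc.Index("knowledge-base")     # hypothetical index name


def secure_query(query_vector: list[float], user_groups: list[str], top_k: int = 5):
    """Retrieve chunks, but only from documents tagged with a group the user belongs to."""
    return index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        # Each stored chunk is assumed to carry an "allowed_groups" metadata list.
        filter={"allowed_groups": {"$in": user_groups}},
    )
```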
Conclusion
The success of any good data product is backed by robust data engineering. A generative AI application is, at its core, still a data application; the difference is that it requires data engineering of both structured and unstructured data to refine a variety of sources into a usable form, ready to provide relevant context.
Next in our Gen AI 101 series is Part 3, which covers Prompt Engineering.
Read more about it!