“Vector Databases are completely different from your cloud data warehouse.” – You might have heard that statement if you are involved in creating vector embeddings for your RAG-based Gen AI applications. The Snowflake AI Data Cloud has added the VECTOR datatype, Vector Embeddings, and Vector Similarity functions, allowing us to use Snowflake as a vector database. However, this also creates the need to build processes that split the large text data from various documents, which is required for RAG applications, into smaller chunks so that they can be embedded and retrieved efficiently as vectors.
In this blog, we will discuss:
What is Text Splitting, and what is its importance in Vector Embedding?
Different Methods of Text Splitting
How do we implement those methods in Snowflake?
What is Text Splitting, and Why is it Important for Vector Embeddings?
Text splitting is the process of breaking down a long document or text into smaller, manageable segments or “chunks” for processing. It is widely used in Natural Language Processing (NLP), where it plays a pivotal role in pre-processing unstructured textual data. Snowflake can efficiently store this unstructured and/or semi-structured data as raw files or in relational tables using the VARIANT data type. However, even after splitting the data into chunks, how do we efficiently retrieve or correlate the information? This is where Vector Embeddings come into the picture.
Vector embedding is a way to transform complex, high-dimensional data, such as raw text or images, into a simplified, lower-dimensional form called a vector, which is a structured numerical representation. For example, a vector embedding of the word cat could be [0.5, -0.4] in a 2D space, depending on the machine learning algorithm used. Below is an example of how textual data of similar categories are grouped in a hypothetical 2D vector space –
As we can see in the above diagram, semantically matching text yields vectors that point in the same general direction. However, in the real world, the embedding algorithms will generate a vector of hundreds of dimensions (as opposed to 2 dimensions in the above diagram) for any given input text.
Once vectorized, this data can be used in RAG-based applications, for vector similarity checks, and in many other NLP applications, such as text summarization, named entity recognition, and sentiment analysis.
Vector Embeddings in Snowflake
In Snowflake, we have a VECTOR datatype to store, encode, and retrieve vector embeddings efficiently. Following is the syntax to declare a VECTOR datatype:
VECTOR( <type>, <dimension> )
Where:
type is the Snowflake data type of the elements, which can be INT or FLOAT.
dimension is the dimension (length) of the vector. This must be a positive integer value with a maximum value of 4096.
You can look at this guide to explore the datatype and loading vector data directly.
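As a quick, hypothetical example (the table name and values below are made up purely for illustration), a VECTOR column can be declared and populated like this –

CREATE OR REPLACE TABLE VECTOR_DEMO (
    ID INT,
    EMBEDDING VECTOR(FLOAT, 3) -- a 3-dimensional FLOAT vector
);

-- Arrays can be cast to the VECTOR datatype on insert
INSERT INTO VECTOR_DEMO
    SELECT 1, [0.5, -0.4, 0.9]::VECTOR(FLOAT, 3);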
For embedding the textual data as a vector, Snowflake Cortex provides functions like EMBED_TEXT_768 and EMBED_TEXT_1024 to generate embeddings, creating vector representations of text in either 768 or 1024 dimensions, respectively. Following are the various embedding models supported by these functions.
| Function | Supported Embedding Models |
|---|---|
| EMBED_TEXT_768 | snowflake-arctic-embed-m-v1.5 |
| | snowflake-arctic-embed-m |
| | e5-base-v2 |
| EMBED_TEXT_1024 | nv-embed-qa-4 (English only) |
| | multilingual-e5-large |
| | voyage-multilingual-2 |
Note: Costs are different for each model.
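As a quick illustration (the sentence below is just a made-up input), generating an embedding is a single function call –

-- Returns a 768-dimensional vector for the given text
SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768(
    'snowflake-arctic-embed-m-v1.5',
    'Snowflake can be used as a vector database.'
) AS EMBEDDING;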
For semantic comparisons, measuring the similarity between vectors is essential. Snowflake Cortex offers three key vector similarity functions for this purpose:
VECTOR_INNER_PRODUCT – calculates the inner product of two vectors, which is useful for understanding directional alignment and intensity.
VECTOR_L2_DISTANCE – measures the Euclidean distance, highlighting the absolute difference between vectors in space.
VECTOR_COSINE_SIMILARITY – evaluates the cosine of the angle between vectors, focusing on how closely aligned they are in direction.
Each function provides a unique perspective on similarity, supporting diverse applications in data analysis and machine learning. For more details, refer to Vector similarity functions.
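As a simple sketch (with two made-up sentences), the three functions can be compared side by side on the same pair of embeddings –

-- Embed two short texts once, then compare them with all three similarity functions
WITH V AS (
    SELECT
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', 'A cat sits on the mat.') AS V1,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', 'A kitten rests on a rug.') AS V2
)
SELECT
    VECTOR_INNER_PRODUCT(V1, V2) AS INNER_PRODUCT,
    VECTOR_L2_DISTANCE(V1, V2) AS L2_DISTANCE,
    VECTOR_COSINE_SIMILARITY(V1, V2) AS COSINE_SIMILARITY
FROM V;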
When and Why You Should Split Text for Vector Embeddings
Below are a few of the reasons why text splitting is essential and how it impacts vector embedding performance:
Memory and Processing Limitations
Many NLP models have token size limits, including those used to generate vector embeddings. For example, common models like BERT have a maximum input size of 512 tokens. Splitting text into smaller segments allows each chunk to fit within these constraints, enabling processing without data loss or truncation.
Improved Embedding Accuracy
When a text is too long, the model may struggle to capture all the necessary details, causing some information to be lost or diluted. Smaller, thematically homogeneous segments lead to more accurate embeddings, with contextually relevant vectors that better represent the content and context of each segment.
Enhanced Search and Retrieval-Augmented Generation
Vector search systems work by matching queries with embeddings in a database. When documents are split into smaller chunks, search systems can find relevant sections more precisely and quickly. This benefits applications like document retrieval, question answering, and Retrieval-Augmented Generation use cases.
What are the Methods for Splitting Text in Snowflake?
A large document or text can be split into smaller chunks in several ways. Listed below are some of the frequently used methods to split text for vector embedding in Snowflake –
Context Overlap Splitting
This is one type of Recursive Text Splitter, which might also be called “Sliding Window Splitting.”
If we think about text splitting, the most basic approach is to split the text into fixed-size chunks. However, this approach has a drawback: a chunk boundary can cut through a word or sentence, so the information at the end of a chunk loses its context, and the embedded vector for that chunk can be semantically misleading.
To overcome this, we can implement a recursive splitting mechanism whereby each chunk always overlaps the previous chunk by N characters. This way, we do not lose the meaning of the last words in any chunk, and adjacent chunks share an overlapping context. This is illustrated in the below snippet, where, using a chunk size of 100 and an overlap size of 10, the first 3 chunks are highlighted in green, orange, and red –
This process is called “Context Overlap” splitting due to the overlapping information between the chunks.
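Conceptually, the sliding window is simple arithmetic: each chunk starts (chunk size minus overlap size) characters after the previous one. The snippet below is only a simplified sketch of that idea in plain SQL, run against a made-up sample sentence; the actual implementation later in this post uses the langchain splitter, which also respects natural separators such as newlines and spaces –

SET chunk_size = 40;
SET overlap = 10;

WITH SAMPLE AS (
    -- A made-up sample text, purely for illustration
    SELECT 'Text splitting breaks a long document into smaller, overlapping chunks so that each chunk can be embedded as a vector.' AS TEXTDATA
)
SELECT SUBSTR(S.TEXTDATA, 1 + G.N * ($chunk_size - $overlap), $chunk_size) AS CHUNK
FROM SAMPLE S
CROSS JOIN (
    -- Generate enough starting offsets to cover the full text
    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS N
    FROM TABLE(GENERATOR(ROWCOUNT => 100))
) G
WHERE 1 + G.N * ($chunk_size - $overlap) <= LENGTH(S.TEXTDATA)
ORDER BY G.N;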
Token-Based Splitting
Token-based splitting involves dividing a large text into smaller chunks based on tokens rather than characters or fixed-length blocks. Tokens represent semantic units like words or subwords. Since LLMs measure input length in tokens, token-based splitting is more language-aware and contextually relevant than character-based splitting.
This type of splitting is particularly useful when working with models that have specific token limits, such as GPT-3 or other language models, as it provides precise control over the input size.
The below diagram illustrates a generic view of how token-based splitting works; tokenization itself is generally handled by tokenizers from models like BERT, GPT, etc.
Semantic Splitting
Semantic splitting involves breaking a body of text into smaller, meaningful segments based on its content, structure, or semantics rather than relying on arbitrary measures like word count or line breaks. The process focuses on dividing the text into semantically similar chunks.
This is achieved by splitting the text into sentences, converting these into vector embeddings, and calculating the vector distance or cosine similarity between consecutive segments. If we use the vector distance (a floating-point number indicating how far apart two vectors are), a predefined threshold, such as 0.5, is then used to determine where splits occur. If the distance between two consecutive segments exceeds this threshold, a split is made, creating separate chunks before and after that point. This process is repeated until the entire text is divided into coherent segments. The below flow diagram illustrates this process.
How to Implement Text Splitting in Snowflake Using SQL and Python UDFs
We will now demonstrate how to implement, in Snowflake, the text-splitting methods explained in the above section.
Initial Setup
Load Input Data:
We use a mixture of 3 different excerpts from 3 publicly available datasets – 1) Relational Database, 2) Mathematical Theory of Communication, and 3) Neural Networks as input data for the demo implementation.
We have loaded this data into an input table in Snowflake named TEXT_INPUT, as a VARCHAR column named TEXTDATA. Below is what the input text data looks like.
Create Table to Store Vector Embedding:
Below is the final table where the vectorized data will be stored. We will store the text, its chunks and their vectors, the type of splitting method, and an audit timestamp.
CREATE TABLE IF NOT EXISTS TEXT_CHUNKS (
TEXTDATA VARCHAR(16777216), -- Large Text To be Split
SPLIT_METHOD VARCHAR(16777216), -- Type of Splitting Method Used
CHUNK VARCHAR(16777216), -- Piece of text
CHUNK_VECTORIZED VECTOR(FLOAT, 768),-- Embedding using the VECTOR data type
LOAD_TIMESTAMP TIMESTAMP_NTZ -- audit timestamp
);
Once the setup is done, we must create the UDFs for text splitting and then run the DML to split the data, vectorize it, and load it into the final table created above.
Context Overlap Splitting
For Context Overlap Splitting, we will create a Snowpark Python UDF named fn_context_overlap_text_chunker
where we will –
Use the RecursiveCharacterTextSplitter class from the langchain Python package to split the text data into chunks with a specific chunk size and overlap size. These values are passed as parameters to the UDF during execution.
Return the chunks as an ARRAY.
Below is what the full DDL of the Snowpark Python UDF looks like –
create or replace function fn_context_overlap_text_chunker(textdata string, chunk_size number , chunk_overlap number)
returns ARRAY
language python
runtime_version = '3.9'
handler = 'context_overlap_text_chunker'
packages = ( 'langchain')
as
$$
from langchain.text_splitter import RecursiveCharacterTextSplitter
def context_overlap_text_chunker(textdata: str, chunk_size=100, chunk_overlap=10):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = chunk_size,        # Adjust this as per your use case
        chunk_overlap = chunk_overlap,  # Number of overlapping characters between chunks, hence "context overlap". Setting this to 0 makes it plain chunk-based splitting.
        length_function = len
    )
    chunks = text_splitter.split_text(textdata)
    return chunks
$$;
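Before wiring the UDF into the pipeline, it can be sanity-checked on an ad hoc string (the text and sizes below are arbitrary) –

-- Split a short sentence into chunks of 15 characters with a 5-character overlap
SELECT fn_context_overlap_text_chunker('Snowflake can act as a vector database.', 15, 5);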
Once this is created, we will run the below SQL query to load the vectorized data into the final table. In this query, we will –
Source the text data, call the UDF with a chunk size of 200 and an overlap of 20, and flatten the output of the UDF from ARRAY to VARCHAR, which gives the desired chunks of the input.
Use the Snowflake Cortex function SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', chunk) to vectorize each chunk generated by the UDF. We are using the snowflake-arctic-embed-m-v1.5 embedding model provided by Snowflake Arctic in this example.
Add the type of splitting method along with an audit timestamp.
Below is what the final query looks like:
INSERT INTO TEXT_CHUNKS
select
TEXTDATA,
'Context Overlap' AS SPLIT_METHOD,
UDF.VALUE::VARCHAR as chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5',chunk) as CHUNK_VECTORIZED,
CURRENT_TIMESTAMP::TIMESTAMP_NTZ as LOAD_TIMESTAMP
from
TEXT_INPUT,
LATERAL FLATTEN ( input => fn_context_overlap_text_chunker(TEXTDATA,200,20)) as UDF;
After running the above command, the chunks of data will get vector-embedded and stored in the table. Below is what the vectorized data looks like –
We have now successfully split the input data into 16 chunks using Context Overlap Splitting, vector-embedded those chunks, and stored them in a Snowflake table using the VECTOR datatype.
Token-Based Splitting
We will use the BERT tokenizer from Hugging Face as an open-source tokenizer for token-based splitting. To do this, we first need to –
Download the BERT tokenizer from the Hugging Face repository to our local system using the below Python code –
from transformers import AutoTokenizer
model_name= 'google-bert/bert-base-uncased'
# Download the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(model_name)
Zip the directory into a file called – google-bert.zip
Upload the zip into a Snowflake-managed stage called TOKENIZER –
Once the upload is completed, we will create another Snowpark Python UDF named fn_token_based_text_chunker, where we will –
Import the zip file from the stage we created above.
Unzip and extract the files into the directory /tmp/google_bert_dir, using a file lock so that parallel Snowflake worker processes do not overwrite each other. This directory contains the necessary files for BERT tokenization.
Use the AutoTokenizer class from the transformers Python package to initialize a tokenizer from the extracted directory and tokenize the text data with a default maximum token size of 512 and an overlap size of 50. These values are parameterized as input arguments to the UDF.
Return the chunks as an ARRAY.
Below is what the full DDL of the Snowpark Python UDF looks like –
create or replace function fn_token_based_text_chunker(textdata string, max_token NUMBER , overlap_size NUMBER)
returns ARRAY
language python
runtime_version = '3.9'
handler = 'bert_text_chunker'
IMPORTS=('@TOKENIZER/google-bert.zip')
packages = ('transformers')
as
$$
from transformers import AutoTokenizer
import fcntl
import os
import sys
import threading
import zipfile

# File lock class for synchronizing write access to /tmp
class FileLock:
    def __enter__(self):
        self._lock = threading.Lock()
        self._lock.acquire()
        self._fd = open('/tmp/lockfile.LOCK', 'w+')
        fcntl.lockf(self._fd, fcntl.LOCK_EX)

    def __exit__(self, type, value, traceback):
        self._fd.close()
        self._lock.release()

# Get the location of the import directory. Snowflake sets the import
# directory location so code can retrieve the location via sys._xoptions.
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

zip_file_path = import_dir + "google-bert.zip"
extracted = '/tmp/google_bert_dir'

# Extract the contents of the ZIP. This is done under the file lock
# to ensure that only one worker process unzips the contents.
with FileLock():
    # Skip extraction if the tokenizer directory is already present
    if not os.path.isdir(extracted + '/google-bert/bert-base-uncased'):
        with zipfile.ZipFile(zip_file_path, 'r') as myzip:
            myzip.extractall(extracted)

def bert_text_chunker(textdata, max_tokens=512, chunk_overlap=50):
    # Initialize the BERT tokenizer from the extracted directory
    model_name = 'google-bert/bert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(extracted + '/' + model_name)

    # Tokenize the text
    tokens = tokenizer.tokenize(textdata)

    # Initialize list to hold chunks
    chunks = []

    # Loop through tokens and create chunks with overlap
    for i in range(0, len(tokens), max_tokens - chunk_overlap):
        # Get the current chunk with overlap
        chunk = tokens[i:i + max_tokens]
        chunks.append(chunk)
        # Stop if we have reached the end of the tokens
        if i + max_tokens >= len(tokens):
            break

    # Detokenize each chunk back to text
    text_chunks = [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]
    return text_chunks
$$;
Once this is created, we will run an INSERT query similar to the one we used earlier for Context Overlap Splitting to load the vectorized data into the final table, only changing the SPLIT_METHOD value to 'BERT Tokenizer Based' and using the fn_token_based_text_chunker Snowpark UDF –
INSERT INTO TEXT_CHUNKS
SELECT
TEXTDATA,
'BERT Tokenizer Based' AS SPLIT_METHOD,
UDF.VALUE::VARCHAR AS chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', chunk) AS CHUNK_VECTORIZED,
CURRENT_TIMESTAMP::TIMESTAMP_NTZ AS LOAD_TIMESTAMP
FROM
TEXT_INPUT,
LATERAL FLATTEN(input => fn_token_based_text_chunker(TEXTDATA, 128, 10)) AS UDF;
Once done, we can see that the input text has been split using the BERT tokenizer and vectorized, as shown below –
And that’s it! We have also successfully implemented token-based splitting in Snowflake!
Semantic Splitting
For Semantic Splitting, we will try a slightly different approach. To implement this, we will use Vector Similarity functions available in Snowflake and a custom function in Python via Snowpark. First, we will create a Sentence Splitter function in Python as below –
create or replace function sentence_splitter(textdata string)
returns ARRAY
language python
runtime_version = '3.9'
handler = 'process'
as
$$
import re
def process(textdata):
    # Split on whitespace that follows sentence-ending punctuation (. ? !)
    return re.split(r'(?<=[.?!])\s+', textdata)
$$;
Once done, we will directly create the INSERT statement to build the chunks of the input text data by –
Splitting the text into sentences using the Python function above and flattening the array into rows.
Combining each sentence with its preceding one (a buffer size of 2). This value needs to be adjusted as per your use case.
Calculating the vectors of the combined sentences using SNOWFLAKE.CORTEX.EMBED_TEXT_1024('multilingual-e5-large', text) and the vector distance of each vector from its preceding one using VECTOR_L2_DISTANCE.
Marking a breakpoint wherever the vector distance exceeds a threshold of 0.45 (this value needs to be adjusted according to your use case).
Aggregating the sentences per breakpoint to create the chunks, vectorizing those chunks using SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', text), and inserting the data into TEXT_CHUNKS using a query similar to the previous two methods.
Below is what the final INSERT query looks like:
INSERT INTO TEXT_CHUNKS
WITH INPUT_DATA AS (
SELECT TEXTDATA AS TEXT
FROM TEXT_INPUT
),
SENTENCE_SPLIT AS (
SELECT
INPUT_DATA.TEXT,
sentence_splitter(INPUT_DATA.TEXT) AS SENTENCES
FROM INPUT_DATA
),
SENTENCE_ARRAY AS (
SELECT
TEXT,
f.VALUE::VARCHAR AS SENTENCE,
f.INDEX + 1 AS SENT_INDEX
FROM SENTENCE_SPLIT,
LATERAL FLATTEN(input => SENTENCES) AS f
),
COMBINED_SENTENCES AS (
SELECT
TEXT,
SENTENCE,
SENT_INDEX,
ARRAY_CONSTRUCT_COMPACT(
LAG(SENTENCE) OVER (PARTITION BY TEXT ORDER BY SENT_INDEX),
SENTENCE
) AS COMBINED_SENTENCES
FROM SENTENCE_ARRAY
),
VECTORIZED_SENTENCES AS (
SELECT *,
SNOWFLAKE.CORTEX.EMBED_TEXT_1024('multilingual-e5-large',
ARRAY_TO_STRING(COMBINED_SENTENCES, '.')) AS COMBINED_SENTENCES_VECTOR
FROM COMBINED_SENTENCES
),
SENTENCES_SCORED AS (
SELECT *,
VECTOR_L2_DISTANCE(
COMBINED_SENTENCES_VECTOR,
LAG(COMBINED_SENTENCES_VECTOR) OVER (PARTITION BY TEXT ORDER BY SENT_INDEX)
) AS VECTOR_DISTANCE,
CASE
WHEN VECTOR_DISTANCE IS NULL THEN 0
WHEN VECTOR_DISTANCE > 0.45 THEN 1
ELSE 0
END AS SPLIT_POINT
FROM VECTORIZED_SENTENCES
),
SENTENCES_GROUPED AS (
SELECT *,
SUM(SPLIT_POINT) OVER (
PARTITION BY TEXT
ORDER BY SENT_INDEX
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS SPLIT_GROUP
FROM SENTENCES_SCORED
),
SEMANTIC_SPLIT AS (
SELECT
TEXT,
SPLIT_GROUP,
LISTAGG(SENTENCE, '. ') WITHIN GROUP (ORDER BY SENT_INDEX) AS SEMANTIC_CHUNK
FROM SENTENCES_GROUPED
GROUP BY TEXT, SPLIT_GROUP
ORDER BY SPLIT_GROUP
)
SELECT
TEXT AS TEXTDATA,
'Semantic' AS SPLIT_METHOD,
SEMANTIC_CHUNK::VARCHAR AS chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', SEMANTIC_CHUNK) AS CHUNK_VECTORIZED,
CURRENT_TIMESTAMP::TIMESTAMP_NTZ AS LOAD_TIMESTAMP
FROM SEMANTIC_SPLIT
ORDER BY SPLIT_GROUP;
Below is what the final vectorized data with Semantic Chunking looks like:
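With chunks from all three methods now stored in TEXT_CHUNKS, the table can already serve semantic search for a RAG workflow. For example, a query like the one below (the question is just a placeholder) embeds the user's question and ranks the stored chunks by cosine similarity –

-- Retrieve the 5 chunks most relevant to a user question
SELECT
    SPLIT_METHOD,
    CHUNK,
    VECTOR_COSINE_SIMILARITY(
        CHUNK_VECTORIZED,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', 'What is a relational database?')
    ) AS SIMILARITY
FROM TEXT_CHUNKS
ORDER BY SIMILARITY DESC
LIMIT 5;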
Best Practices
Preprocess Text Before Splitting
Remove unnecessary whitespace, special characters, or noisy content.
Normalize text (e.g. lowercasing) based on your application.
Test for Trade-offs
For proper embedding quality and efficiency, experiment with different values of –
Chunk Size and Overlap Size for Context Splitting.
Token Size for Token-Based Splitting.
Sentence Buffer Size and Splitting Threshold Value for Semantic Chunking.
Evaluate the Results
Assess the embeddings’ performance on downstream tasks like document retrieval and classification.
Iterate on splitting strategy based on performance metrics.
One widely used way to evaluate vector embedding results is through precision, recall, and F1-score metrics. Below is an example of how this evaluation can be done –
Imagine we have a dataset of customer reviews, and the goal is to classify each review into one of two categories: Positive or Negative. We have created vector embeddings for these reviews, and now we want to assess the quality of the embeddings based on their performance in a sentiment classification task.
Steps:
Train a Classifier: First, we use the vector embeddings as features to train a classification model (e.g., logistic regression, SVM, or a neural network). The input to the classifier is the vector representation (embedding) of a review, and the output is the predicted sentiment label (Positive or Negative).
Evaluate the Predictions in a Confusion Matrix: After the model has been trained, we test it on a separate dataset and get predictions for each review. We compare these predictions to the true labels, i.e., the actual sentiments of the reviews from human evaluation, to form a confusion matrix. Below is an illustration of how it looks for a binary classification task (Positive/Negative):
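| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |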
Calculate Precision, Recall, and F1-Score:
Precision:
Precision answers the question: “Of all the reviews predicted as positive, how many actually are positive?”
Recall:
Recall answers the question: “Of all the reviews that are actually positive, how many did we correctly predict?”
F1-Score:
The F1-score is the harmonic mean of precision and recall. It is useful when you want a single metric that gives an overall sense of the model’s performance, especially when precision and recall are both important for your use case.
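In terms of the confusion-matrix counts, these metrics are computed as follows –
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)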
Interpreting the Results:
High Precision means that the embeddings help the model avoid many false positives (predicting Negative reviews as Positive).
High Recall means the embeddings help the model correctly identify most of the Positive reviews.
F1-Score gives an overall measure of how well the embeddings help the model balance Precision and Recall, ensuring that positive reviews are not missed and that negative reviews are not wrongly tagged as positive.
This is how vector embedding results can be evaluated and tuned according to the needs of your use case.
Closing
This blog has covered how to implement three major text-splitting strategies for vector embeddings in Snowflake. However, there are numerous other splitting techniques depending on the type of dataset. For example, if the data has some inherent structure, like HTML or Markdown pages, it is better to split with that structure in mind. We can use –
from langchain_text_splitters import HTMLHeaderTextSplitter, HTMLSectionSplitter
to implement this, similar to how we did Context Overlap Splitting. Other relevant options exist in the langchain package if the source data contains Python, JavaScript, or other codebases.
So, with the introduction of the VECTOR datatype and similarity functions, along with the wide array of Python packages available via Snowpark, Snowflake has joined the ranks of the vector databases available in the market today. It will be interesting to see how Snowflake holds up against its peers in efficiency, quality, and scalability.
Are you interested in exploring Snowflake as a vector database?
If you have questions about text-splitting strategies, vector embeddings, or how to maximize Snowflake’s capabilities as a vector database, we’re here to help!
FAQs
What are vector indices?
Vector indices are specialized data structures designed to organize, store, and efficiently retrieve high-dimensional vectors (numerical representations of data) in applications like similarity search, recommendation systems, and AI-driven content retrieval. These indices are crucial when dealing with vector embeddings produced by models like Snowflake Arctic, OpenAI, Amazon Titan, or other neural networks for tasks like semantic search, image recognition, or clustering.
What are some of the other popular Vector Databases?
Some popular vector databases currently available are Pinecone, Chroma, FAISS, and Milvus. There are also vector-enabled NoSQL databases, like MongoDB, Neo4j, Cassandra, and Redis. Among SQL databases, PostgreSQL is one of the most popular vector-enabled options, along with Timescale, SingleStore, etc.