November 28, 2024

How to Split Text For Vector Embeddings in Snowflake

By Pratik Datta

“Vector Databases are completely different from your cloud data warehouse.” – You might have heard that statement if you are involved in creating vector embeddings for your RAG-based Gen AI applications. The Snowflake AI Data Cloud has added the VECTOR datatype, Vector Embeddings, and Vector Similarity functions, allowing us to use Snowflake as a vector database. However, this also calls for processes that split the large text data from various documents, which RAG applications require, into smaller chunks so that they can be embedded and retrieved efficiently as vectors.

In this blog, we will discuss:

  • What is Text Splitting, and what is its importance in Vector Embedding?

  • Different Methods of Text Splitting

  • How do we implement those methods in Snowflake?

What is Text Splitting, and Why is it Important for Vector Embeddings?

Text splitting is the process of breaking down a long document or text into smaller, manageable segments, or “chunks,” for processing. This is widely used in Natural Language Processing (NLP), where it plays a pivotal role in pre-processing unstructured textual data. Snowflake can efficiently store this unstructured and/or semi-structured data as raw files or in relational tables as the VARIANT data type. However, even after splitting the data into chunks, how do we efficiently retrieve or correlate information across them? This is where Vector Embeddings come into the picture.

Vector embedding is a way to transform complex, high-dimensional data, like raw text or images, into a simplified, lower-dimensional form called a vector, which is a structured numerical representation. For example, a vector embedding of the word cat might be [0.5, -0.4] in a 2D space, depending on the machine learning algorithm used. Below is an example of how textual data of similar categories is grouped in a hypothetical 2D vector space –

As we can see in the above diagram, semantically similar text yields vectors that point in the same general direction. However, in the real world, embedding algorithms generate vectors with hundreds of dimensions (as opposed to the 2 dimensions in the diagram above) for any given input text.

Once vectorized, this data can be used in RAG-based applications, vector similarity checks, and many other NLP applications, such as text summarization, named entity recognition, sentiment analysis, etc.

Vector Embeddings in Snowflake

In Snowflake, we have a VECTOR datatype to store, encode, and retrieve vector embeddings efficiently. Following is the syntax to declare a VECTOR datatype:

VECTOR( <Type>, <Dimension> )

Where:

  • Type is the Snowflake data type of the elements, which can be INT or FLOAT.

  • Dimension is the dimension (length) of the vector. This must be a positive integer value with a maximum value of 4096.

You can look at this guide to explore the datatype and loading vector data directly.
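As a quick illustration of the syntax (the table and values below are hypothetical examples, not part of the demo that follows), a VECTOR column can be declared and populated by casting an array literal:

-- Hypothetical table with a 3-dimensional FLOAT vector column
CREATE OR REPLACE TABLE WORD_VECTORS (
    WORD VARCHAR,
    WORD_VECTOR VECTOR(FLOAT, 3)
);

-- Vector values can be inserted by casting an array literal to VECTOR
INSERT INTO WORD_VECTORS
    SELECT 'cat', [0.5, -0.4, 0.1]::VECTOR(FLOAT, 3);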

For embedding the textual data as a vector, Snowflake Cortex provides functions like EMBED_TEXT_768 and EMBED_TEXT_1024 to generate embeddings, creating vector representations of text in either 768 or 1024 dimensions, respectively. Following are the embedding models supported by these functions:

  • EMBED_TEXT_768 – snowflake-arctic-embed-m-v1.5, snowflake-arctic-embed-m, e5-base-v2

  • EMBED_TEXT_1024 – nv-embed-qa-4 (English only), multilingual-e5-large, voyage-multilingual-2

Note: Costs are different for each model.
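As a minimal sketch of how these functions are called (the input sentence here is just an example), generating an embedding is a single function call that returns a VECTOR value:

SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768(
    'snowflake-arctic-embed-m-v1.5',
    'A relational database stores data in tables of rows and columns.'
) AS EMBEDDING; -- returns a VECTOR(FLOAT, 768)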

For semantic comparisons, measuring the similarity between vectors is essential. Snowflake Cortex offers three key vector similarity functions for this purpose:

  1. VECTOR_INNER_PRODUCT – calculates the inner product of two vectors, which is useful for understanding directional alignment and intensity.

  2. VECTOR_L2_DISTANCE – measures the Euclidean distance, highlighting the absolute difference between vectors in space.

  3. VECTOR_COSINE_SIMILARITY – evaluates the cosine of the angle between vectors, focusing on how closely aligned they are in direction.

Each function provides a unique perspective on similarity, supporting diverse applications in data analysis and machine learning. For more details, refer to Vector similarity functions.
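A minimal sketch of the three function calls on small literal vectors (the values are arbitrary and only meant to show the call pattern):

SELECT
    VECTOR_INNER_PRODUCT([1.0, 2.0, 3.0]::VECTOR(FLOAT, 3), [1.0, 2.0, 2.0]::VECTOR(FLOAT, 3))     AS INNER_PRODUCT,
    VECTOR_L2_DISTANCE([1.0, 2.0, 3.0]::VECTOR(FLOAT, 3), [1.0, 2.0, 2.0]::VECTOR(FLOAT, 3))       AS L2_DISTANCE,
    VECTOR_COSINE_SIMILARITY([1.0, 2.0, 3.0]::VECTOR(FLOAT, 3), [1.0, 2.0, 2.0]::VECTOR(FLOAT, 3)) AS COSINE_SIMILARITY;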

When and Why You Should Split Text for Vector Embeddings

Below are a few reasons why text splitting is essential and how it impacts vector embedding performance:

  1. Memory and Processing Limitations
    Many NLP models have token size limits, including those used to generate vector embeddings. For example, common models like BERT have a maximum token input size of 512 tokens. Splitting text into smaller segments allows each chunk to fit within these constraints, enabling processing without data loss or truncation.

  2. Improved Embedding Accuracy
    When a text is too long, the model may struggle to capture all necessary details, causing some information to be lost or diluted. Smaller, thematically homogeneous segments will lead to more accurate embeddings with contextually relevant vectors that better represent the content and context of each segment.

  3. Enhanced Search and Retrieval-Augmented Generation:
    Vector search systems work by matching queries with embeddings in a database. When documents are split into smaller chunks, search systems can find relevant sections more precisely and quickly. This benefits applications like document retrieval, question answering, and Retrieval-Augmented Generation use cases.

What are the Methods for Splitting Text in Snowflake?

We can split a large document or text into smaller chunks in several ways. Below are some of the frequently used methods to split text for vector embedding in Snowflake –

Context Overlap Splitting

This is one type of Recursive Text Splitter, which might also be called “Sliding Window Splitting.”

If we think about text splitting, the most basic approach is to split the text into fixed-size chunks. However, this approach can cut the text mid-sentence, so the information at the end of a chunk loses its context and the embedded vector for that chunk can be semantically misleading.

To overcome this, we can implement a recursive splitting mechanism whereby each chunk always overlaps the previous chunk by N characters. In this way, we will not lose the meaning of the last words in any chunk, and we keep an overlapping context between chunks. This is illustrated in the below snippet where, using a chunk size of 100 and an overlap size of 10 (so, roughly, the chunks cover characters 0–99, 90–189, 180–279, and so on), the first 3 chunks are highlighted in green, orange, and red –

This process is called “Context Overlap” splitting due to the overlapping information between the chunks.

Token-Based Splitting

Token-based splitting involves dividing a large text into smaller chunks based on tokens rather than characters or fixed-length blocks. Tokens represent semantic units like words or subwords. Since LLMs measure their input in tokens, token-based splitting is more language-aware and contextually relevant than character-based splitting.

This type of splitting is particularly useful when working with models that have specific token limits, such as GPT-3 or other language models, as it provides precise control over the input size for those models.

The below diagram illustrates a generic view of how token-based splitting works, which is generally handled by tokenizers like BERT, GPT, etc.

Semantic Splitting

Semantic splitting involves breaking a body of text into smaller, meaningful segments based on its content, structure, or semantics rather than relying on arbitrary measures like word count or line breaks. The process focuses on dividing the text into semantically similar chunks. 

This is achieved by splitting the text into sentences, converting these into vector embeddings, and calculating the vector distance or cosine similarity between consecutive segments. If we use the vector distance (a floating-point measure of how far apart the vectors are), a predefined threshold, such as 0.5, is then used to determine where splits occur. If the distance between two consecutive segments exceeds this threshold, a split is made, creating separate chunks before and after that point. This process is repeated until the entire text is divided into coherent segments. The below flow diagram illustrates this process.

How to Implement Text Splitting in Snowflake Using SQL and Python UDFs

We will now demonstrate how to implement the types of Text Splitting we explained in the above section in Snowflake.

Initial Setup

Load Input Data:

We use a mixture of excerpts from 3 publicly available datasets – 1) Relational Database, 2) Mathematical Theory of Communication, and 3) Neural Networks – as input data for the demo implementation.

We have loaded this data into an input table named TEXT_INPUT, as a VARCHAR column named TEXTDATA, in Snowflake. Below is what the input text data looks like.
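The exact dataset excerpts are not reproduced here, but a minimal sketch of how such an input table could be created and loaded (the row below is only a placeholder) looks like this:

CREATE TABLE IF NOT EXISTS TEXT_INPUT (
    TEXTDATA VARCHAR(16777216) -- Large text to be split
);

-- Placeholder row; in the demo, each row holds a full document excerpt
INSERT INTO TEXT_INPUT (TEXTDATA)
    VALUES ('A relational database organizes data into tables of rows and columns ...');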

Create Table to Store Vector Embedding

Below is the final table where the vectorized data will be stored. We will store the text, its chunks and their vectors, the type of splitting method, and an audit timestamp.

CREATE TABLE IF NOT EXISTS TEXT_CHUNKS ( 
    TEXTDATA VARCHAR(16777216),          -- Large text to be split
    SPLIT_METHOD VARCHAR(16777216),      -- Type of splitting method used
    CHUNK VARCHAR(16777216),             -- Piece of text
    CHUNK_VECTORIZED VECTOR(FLOAT, 768), -- Embedding using the VECTOR data type
    LOAD_TIMESTAMP TIMESTAMP_NTZ         -- Audit timestamp
);

Once the setup is done, we must create the UDFs for text splitting and then run the DML to split the data, vectorize it, and load it into the final table created above.

Context Overlap Splitting

For Context Overlap Splitting, we will create a Snowpark Python UDF named fn_context_overlap_text_chunker where we will – 

  • Use the RecursiveCharacterTextSplitter class method from the langchain Python package to split the text data into chunks with a specific chunk size and overlap size. These values are passed as parameters to the UDF during execution. 

  • Return the chunks as an ARRAY.

Below is what the full DDL of the Snowpark Python UDF looks like – 

create or replace function fn_context_overlap_text_chunker(textdata string, chunk_size number, chunk_overlap number)
returns ARRAY
language python
runtime_version = '3.9'
handler = 'context_overlap_text_chunker'
packages = ('langchain')
as
$$
from langchain.text_splitter import RecursiveCharacterTextSplitter

def context_overlap_text_chunker(textdata: str, chunk_size=100, chunk_overlap=10):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # Adjust this as per your use case
        chunk_overlap=chunk_overlap,  # Number of overlapping characters between chunks (hence "contextual overlap"); setting this to 0 gives plain chunk-based splitting
        length_function=len
    )
    chunks = text_splitter.split_text(textdata)
    return chunks
$$;

Once this is created, we will run the below SQL query to load the vectorized data into the final table. In this query, we will –

  • Source the text data, call the UDF with a chunk size of 200 and an overlap of 20, and flatten the UDF's ARRAY output into individual VARCHAR chunks.

  • Use the Snowflake Cortex function SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', chunk) to vectorize each chunk generated by the UDF. We are using the snowflake-arctic-embed-m-v1.5 embedding model provided by Snowflake Arctic in this example.

  • Add the type of splitting method along with an audit timestamp.

Below is what the final query looks like:

INSERT INTO TEXT_CHUNKS
SELECT 
    TEXTDATA,  
    'Context Overlap' AS SPLIT_METHOD,
    UDF.VALUE::VARCHAR AS chunk,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', chunk) AS CHUNK_VECTORIZED,
    CURRENT_TIMESTAMP::TIMESTAMP_NTZ AS LOAD_TIMESTAMP
FROM 
    TEXT_INPUT,
    LATERAL FLATTEN(input => fn_context_overlap_text_chunker(TEXTDATA, 200, 20)) AS UDF;

After running the above command, the chunks of data will get vector-embedded and stored in the table. Below is what the vectorized data looks like –

We have now successfully split the input data into 16 chunks using Context Overlap Splitting, embedded each chunk, and stored the results in a Snowflake table with the VECTOR datatype.
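As a quick sanity check (a hypothetical query, not part of the original demo), the chunk count and chunk lengths for this method can be inspected directly from the table:

SELECT
    COUNT(*)           AS CHUNK_COUNT,       -- expected: 16 for this demo input
    MIN(LENGTH(CHUNK)) AS MIN_CHUNK_LENGTH,
    MAX(LENGTH(CHUNK)) AS MAX_CHUNK_LENGTH
FROM TEXT_CHUNKS
WHERE SPLIT_METHOD = 'Context Overlap';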

Token-Based Splitting

We will use the BERT tokenizer from Hugging Face as an open-source tokenizer for token-based splitting. To do this, we first need to –

  • Download the BERT tokenizer to our local system from the Hugging Face repository using the below Python code – 

from transformers import AutoTokenizer

model_name = 'google-bert/bert-base-uncased'
# Download the BERT tokenizer and save it to a local directory of the same name
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(model_name)
  • Zip the directory into a file called google-bert.zip.

  • Upload the zip into a Snowflake-managed stage called TOKENIZER (a sketch of the stage creation and upload commands is shown below) –
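A minimal sketch of the stage creation and upload, assuming the zip file sits in the current local directory and the PUT command is run from SnowSQL (or another client that supports PUT):

-- Create an internal (Snowflake-managed) stage to hold the tokenizer files
CREATE STAGE IF NOT EXISTS TOKENIZER;

-- Upload the zip without compression so the UDF can read it as-is
PUT file://google-bert.zip @TOKENIZER AUTO_COMPRESS = FALSE;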

Once the upload is completed, we will create another Snowpark Python UDF named fn_token_based_text_chunker, where we will –

  • Import the zip file from the stage we created above.

  • Unzip and extract the files into the directory /tmp/google_bert_dir, using a file lock so that parallel worker processes in Snowflake do not overwrite each other. This directory contains the necessary files for BERT tokenization.

  • Use the AutoTokenizer class from the transformers Python package to initialize a tokenizer from the extracted directory and split the tokenized text into chunks, with a default maximum token size of 512 and an overlap size of 50. These values are parameterized as input arguments to the UDF.

  • Return the chunks as an ARRAY.

Below is what the full DDL of the Snowpark Python UDF looks like – 

create or replace function fn_token_based_text_chunker(textdata string, max_token NUMBER, overlap_size NUMBER)
returns ARRAY
language python
runtime_version = '3.9'
handler = 'bert_text_chunker'
IMPORTS=('@TOKENIZER/google-bert.zip')
packages = ('transformers')
as
$$
from transformers import AutoTokenizer

import fcntl
import os
import sys
import threading
import zipfile

# File lock class for synchronizing write access to /tmp
class FileLock:
   def __enter__(self):
      self._lock = threading.Lock()
      self._lock.acquire()
      self._fd = open('/tmp/lockfile.LOCK', 'w+')
      fcntl.lockf(self._fd, fcntl.LOCK_EX)

   def __exit__(self, type, value, traceback):
      self._fd.close()
      self._lock.release()
      
# Get the location of the import directory. Snowflake sets the import
# directory location so code can retrieve the location via sys._xoptions.
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

zip_file_path = import_dir + "google-bert.zip"
extracted = '/tmp/google_bert_dir'

# Extract the contents of the ZIP. This is done under the file lock
# to ensure that only one worker process unzips the contents.
with FileLock():
   if not os.path.isdir(extracted + '/google-bert'):
      with zipfile.ZipFile(zip_file_path, 'r') as myzip:
         myzip.extractall(extracted)

        
def bert_text_chunker(textdata, max_tokens=512, chunk_overlap=50):
	# Initialize the BERT tokenizer
	model_name= 'google-bert/bert-base-uncased'
	tokenizer = AutoTokenizer.from_pretrained(extracted +'/'+model_name)
	# Tokenize the text
	tokens = tokenizer.tokenize(textdata)
	
	# Initialize list to hold chunks
	chunks = []
	
	# Loop through tokens and create chunks with overlap
	for i in range(0, len(tokens), max_tokens - chunk_overlap):
		# Get the current chunk with overlap
		chunk = tokens[i:i + max_tokens]
		chunks.append(chunk)
		
		# Stop if we have reached the end of the tokens
		if i + max_tokens >= len(tokens):
			break
	
	# Detokenize each chunk back to text
	text_chunks = [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]
	
	return text_chunks

$$;

Once this is created, we will run an INSERT query similar to the one used earlier for Context Overlap Splitting to load the vectorized data into the final table, only changing the SPLIT_METHOD value to 'BERT Tokenizer Based' and using the fn_token_based_text_chunker Snowpark UDF –

INSERT INTO TEXT_CHUNKS
SELECT 
    TEXTDATA,  
    'BERT Tokenizer Based' AS SPLIT_METHOD,
    UDF.VALUE::VARCHAR AS chunk,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', chunk) AS CHUNK_VECTORIZED,
    CURRENT_TIMESTAMP::TIMESTAMP_NTZ AS LOAD_TIMESTAMP
FROM 
    TEXT_INPUT,
    LATERAL FLATTEN(input => fn_token_based_text_chunker(TEXTDATA, 128, 10)) AS UDF;

Once done, we can see that the text has been split using the BERT tokenizer and vectorized, as shown below –

And that’s it! We have also successfully implemented token-based splitting in Snowflake!
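With two methods now loaded into the same table, a quick comparison query (hypothetical, for illustration) shows how differently each method chunks the same input:

SELECT
    SPLIT_METHOD,
    COUNT(*)           AS CHUNK_COUNT,
    AVG(LENGTH(CHUNK)) AS AVG_CHUNK_LENGTH
FROM TEXT_CHUNKS
GROUP BY SPLIT_METHOD;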

Semantic Splitting

For Semantic Splitting, we will try a slightly different approach. To implement this, we will use the Vector Similarity functions available in Snowflake and a custom Python function via Snowpark. First, we will create a sentence splitter function in Python as below –

create or replace function sentence_splitter(textdata string)
returns ARRAY
language python
runtime_version = '3.9'
handler = 'process'
as
$$
import re

def process(textdata):
    # Split on whitespace that follows sentence-ending punctuation (. ? !)
    return re.split(r'(?<=[.?!])\s+', textdata)
$$;

Once done, we will directly write the INSERT statement that creates the chunks of the input text data by –

  • Splitting the text into sentences using the Python function above and flattening the resulting array into rows.

  • Combining each sentence with its preceding one (a buffer size of 2). Adjust this value as per your use case.

  • Calculating the vectors of the combined sentences using SNOWFLAKE.CORTEX.EMBED_TEXT_1024('multilingual-e5-large', text) and the vector distance between consecutive vectors using VECTOR_L2_DISTANCE.

  • If the vector distance exceeds a threshold of 0.45 (this value needs to be adjusted according to your use case), we mark that as a breakpoint.

  • Aggregating the sentences per breakpoint to create the chunks, vectorizing those chunks using SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', text), and inserting the data into TEXT_CHUNKS using a query similar to the previous two methods.

Below is what the final INSERT query looks like:

INSERT INTO TEXT_CHUNKS
WITH INPUT_DATA AS (
    SELECT TEXTDATA AS TEXT 
    FROM TEXT_INPUT
),
SENTENCE_SPLIT AS (
    SELECT 
        INPUT_DATA.TEXT, 
        sentence_splitter(INPUT_DATA.TEXT) AS SENTENCES 
    FROM INPUT_DATA
),
SENTENCE_ARRAY AS (
    SELECT 
        TEXT, 
        f.VALUE::VARCHAR AS SENTENCE, 
        f.INDEX + 1 AS SENT_INDEX
    FROM SENTENCE_SPLIT,
    LATERAL FLATTEN(input => SENTENCES) AS f
),
COMBINED_SENTENCES AS (
    SELECT 
        TEXT, 
        SENTENCE, 
        SENT_INDEX, 
        ARRAY_CONSTRUCT_COMPACT(
            LAG(SENTENCE) OVER (PARTITION BY TEXT ORDER BY SENT_INDEX), 
            SENTENCE
        ) AS COMBINED_SENTENCES
    FROM SENTENCE_ARRAY
),
VECTORIZED_SENTENCES AS (
    SELECT *,
        SNOWFLAKE.CORTEX.EMBED_TEXT_1024('multilingual-e5-large', 
        ARRAY_TO_STRING(COMBINED_SENTENCES, '.')) AS COMBINED_SENTENCES_VECTOR
    FROM COMBINED_SENTENCES
),
SENTENCES_SCORED AS (
    SELECT *, 
        VECTOR_L2_DISTANCE(
            COMBINED_SENTENCES_VECTOR, 
            LAG(COMBINED_SENTENCES_VECTOR) OVER (PARTITION BY TEXT ORDER BY SENT_INDEX)
        ) AS VECTOR_DISTANCE,
        CASE 
            WHEN VECTOR_DISTANCE IS NULL THEN 0 
            WHEN VECTOR_DISTANCE > 0.45 THEN 1 
            ELSE 0 
        END AS SPLIT_POINT
    FROM VECTORIZED_SENTENCES
),
SENTENCES_GROUPED AS (
    SELECT *,
        SUM(SPLIT_POINT) OVER (
            PARTITION BY TEXT 
            ORDER BY SENT_INDEX  
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS SPLIT_GROUP 
    FROM SENTENCES_SCORED
),
SEMANTIC_SPLIT AS (
    SELECT 
        TEXT,
        SPLIT_GROUP, 
        LISTAGG(SENTENCE, '. ') WITHIN GROUP (ORDER BY SENT_INDEX) AS SEMANTIC_CHUNK
    FROM SENTENCES_GROUPED
    GROUP BY TEXT, SPLIT_GROUP
    ORDER BY SPLIT_GROUP
)
SELECT   
    TEXT AS TEXTDATA,  
    'Semantic' AS SPLIT_METHOD,
    SEMANTIC_CHUNK::VARCHAR AS chunk,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', SEMANTIC_CHUNK) AS CHUNK_VECTORIZED,
    CURRENT_TIMESTAMP::TIMESTAMP_NTZ AS LOAD_TIMESTAMP
FROM SEMANTIC_SPLIT
ORDER BY SPLIT_GROUP;

Below is what the final vectorized data with Semantic Chunking looks like:
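With the chunks and their embeddings stored, the table can be queried like any other vector store. Below is a hedged sketch of a semantic search over the semantic chunks; the question text is just an example, and the query text must be embedded with the same model used for the chunks:

SELECT
    CHUNK,
    VECTOR_COSINE_SIMILARITY(
        CHUNK_VECTORIZED,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768(
            'snowflake-arctic-embed-m-v1.5',
            'How does a neural network learn?'
        )
    ) AS SIMILARITY
FROM TEXT_CHUNKS
WHERE SPLIT_METHOD = 'Semantic'
ORDER BY SIMILARITY DESC
LIMIT 3;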

Best Practices

  • Preprocess Text Before Splitting

    • Remove unnecessary whitespace, special characters, or noisy content.

    • Normalize text (e.g. lowercasing) based on your application.

  • Test for Trade-offs
    For proper embedding quality and efficiency, experiment with different values of –

    • Chunk Size and Overlap Size for Context Overlap Splitting.

    • Token Size for Token-Based Splitting.

    • Sentence Buffer Size and Splitting Threshold Value for Semantic Chunking.

  • Evaluate the Results

    • Assess the embeddings’ performance using downstream tasks like document retrieval and classification.

    • Iterate on splitting strategy based on performance metrics.

    • One widely used way to evaluate the vector embedding results is through precision, recall, and F1-score metrics.

Below is an example of how the evaluation of vector embedding can be done using precision, recall, and F1-score metrics-

Imagine we have a dataset of customer reviews, and the goal is to classify each review into one of two categories: Positive or Negative. We have created vector embeddings for these reviews, and now we want to assess the quality of the embeddings based on their performance in a sentiment classification task.

Steps:

Train a Classifier: First, we use our vector embeddings as features to train a classification model (e.g., logistic regression, SVM, or a neural network). The input to the classifier is the vector representation (embedding) of a review, and the output is the predicted sentiment label (Positive or Negative).

Evaluate the Predictions with a Confusion Matrix: After the model has been trained, we test it on a separate data set and get predictions for each review. We compare these predictions to the true labels, i.e., the actual sentiments of the reviews from a human evaluation, to form a confusion matrix. Below is an illustration of how it looks for a binary classification task (Positive/Negative):
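In its standard form, the matrix has the following layout:

                        Predicted Positive       Predicted Negative
Actual Positive         True Positive (TP)       False Negative (FN)
Actual Negative         False Positive (FP)      True Negative (TN)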

Calculate Precision, Recall, and F1-Score:

Precision:

Precision answers the question: “Of all the reviews predicted as positive, how many actually are positive?” It is calculated as Precision = TP / (TP + FP).

Recall:

Recall answers the question: “Of all the reviews that are actually positive, how many did we correctly predict?” It is calculated as Recall = TP / (TP + FN).

F1-Score:

The F1-score is the harmonic mean of precision and recall, calculated as F1 = 2 × (Precision × Recall) / (Precision + Recall). It is useful when you want a single metric that gives an overall sense of the model’s performance, especially when precision and recall are both important for your use case.
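As a minimal, self-contained sketch of these three metrics (the labels below are made up for illustration; in practice the true labels come from human annotation and the predictions from the classifier trained on the review embeddings):

# Toy example: 1 = Positive, 0 = Negative
y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # actual sentiments (human-labeled)
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # sentiments predicted by the classifier

# Counts from the confusion matrix
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)                                  # 4 / 5 = 0.80
recall    = tp / (tp + fn)                                  # 4 / 5 = 0.80
f1        = 2 * precision * recall / (precision + recall)   # 0.80

print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")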

Interpreting the Results:

  • High Precision means that the embeddings help the model avoid many false positives (predicting Negative reviews as Positive).

  • High Recall means the embeddings help the model correctly identify most of the Positive reviews.

  • F1-Score gives an overall measure of how well the embeddings help the model balance Precision and Recall, ensuring the model neither misses Positive reviews nor wrongly tags Negative reviews as Positive.

This is how vector embedding results can be evaluated and tuned according to the needs of your use case.

Closing

This blog has covered how to implement three major text-splitting strategies for Vector Embeddings in Snowflake. However, there are numerous other splitting techniques depending on the dataset type. For example, if the data has some inherent structure, like HTML or Markdown pages, it is better to split with that structure in mind. We can use – 

from langchain_text_splitters import HTMLHeaderTextSplitter, HTMLSectionSplitter

to implement this, similar to how we implemented Context Overlap Splitting. Other relevant options exist in the langchain package if the source data contains Python, JavaScript, or other codebases.

So, with the introduction of the VECTOR datatype and similarity functions, along with the wide array of Python packages available via Snowpark, Snowflake has joined the ranks of the vector databases available in the market today. It will be interesting to see how Snowflake holds up as a vector database compared to its peers in efficiency, quality, and scalability.

Are you interested in exploring Snowflake as a vector database?

If you have questions about text-splitting strategies, vector embeddings, or how to maximize Snowflake’s capabilities as a vector database, we’re here to help!

FAQs

What are vector indices?

Vector indices are specialized data structures designed to organize, store, and efficiently retrieve high-dimensional vectors (numerical representations of data) in applications like similarity search, recommendation systems, and AI-driven content retrieval. These indices are crucial when dealing with vector embeddings produced by models like Snowflake Arctic, OpenAI, Amazon Titan, or other neural networks for tasks like semantic search, image recognition, or clustering.

What are some popular vector databases available today?

Some popular Vector Databases currently available are Pinecone, Chroma, FAISS, and Milvus. There are also vector-enabled NoSQL databases, like MongoDB, Neo4j, Cassandra, Redis, etc. Among the SQL databases, PostgreSQL is one of the popular vector-enabled options, along with Timescale, SingleStore, etc.
