November 13, 2024

Getting Started with Gen AI in Matillion Data Productivity Cloud

By Pratik Datta

Generative AI (Gen AI) represents a transformative advancement in artificial intelligence, enabling the creation of new content, such as text, images, music, and even code, by learning patterns from existing data. In data analytics, Gen AI can synthesize large and complex datasets, enrich existing data, automate report generation, and suggest new ways to approach data-driven challenges. Large Language Models (LLMs) can interpret and create from data, whether through natural language processing (NLP) or synthetic data generation, which can result in enriched data visualization.

Matillion Data Productivity Cloud (DPC) is a fully SaaS ELT and Analytics Engineering platform that lets us use LLMs in an interactive GUI through a drag-drop-and-configure approach, unleashing the full power of LLMs in any relevant data engineering and analytics workload.

In this blog, we will explore the applications of Gen AI in Matillion DPC, how to set up the environment, and how to build a Gen AI workflow in Matillion DPC.

Why Use Matillion Data Productivity Cloud for GenAI?

Below are some reasons why we prefer to use Matillion DPC to implement Gen AI.

  • Matillion DPC provides out-of-the-box components to connect to most of the LLMs currently on the market, allowing us to leverage models from providers like Amazon Bedrock and Snowflake Cortex to categorize data, standardize labeling, cleanse datasets, and support many other use cases.

  • It also has in-built components to vectorize data.

  • It supports Retrieval Augmented Generation (RAG), which is one of the most widely used features of Gen AI.

  • The drag-drop and configure approach requires minimal coding to use the LLMs.

What Are the Applications of Gen AI in Matillion DPC?

Generative AI holds endless potential, but analyzing and refining applications to find the best fit is essential, especially when integrating with a powerful analytical platform like Matillion DPC. Below are some key applications of Generative AI in Matillion DPC –

  • Data Enrichment (Normalization and Standardization): By leveraging the powerful LLMs available in Matillion DPC, we can enrich our existing data by standardizing and normalizing data points according to predefined or general standards. For example, we can standardize different formats of country and state codes (e.g., ISO2, ISO3) into a single, consistent value.

  • Data Parsing and Extraction: We can use the Gen AI components in Matillion DPC for automated document parsing and information extraction (e.g., financial documents, health records), sentiment analysis of reviews and customer feedback, text summarization for large documents, and many other data extraction tasks.

  • Synthetic Data Generation and Data Augmentation: The LLMs available in Matillion DPC can generate synthetic data and augment data in lower environments for software testing, A/B testing, balancing imbalanced datasets, domain-specific text, chatbot training, etc.

Setting Up Gen AI in Matillion DPC

AWS Bedrock and Snowflake Cortex both provide a suite of Gen AI LLMs in a Model-as-a-Service (MaaS) architecture that can be used to extract information from our data, complete text, answer freeform questions, and provide intelligent assistance.

This section will focus on setting up AWS Bedrock and Snowflake Cortex in Matillion DPC.

AWS Bedrock

Prerequisite: AWS has recently applied service quotas to invoking models in Bedrock, so you must ensure that your AWS account has quotas for the particular model you wish to use. In the example below, the account does not have a quota for invoking any of the models currently available via Matillion DPC.

If you don’t, you must follow this guide to request quotas for the models you need. Once the quotas are assigned at the account level, we can start configuring AWS Bedrock in Matillion DPC.

Below are the steps needed to configure AWS Bedrock to be used from Matillion DPC –

  • Log in to your AWS account and navigate to the AWS Bedrock homepage.

  • Click the menu icon on the left-hand side, scroll down to find the Model Access button, and click it. The below page should appear –

  • On this page, we need to enable access to the models we want to use (or may want to use in the future) in our DPC pipelines. To do this, click the Available to request button next to the desired LLMs. For example, we have selected a few Llama and Mistral models, as shown below –

  • Once selected, click Next and then Submit.

  • Once submitted, the updated access status may take some time to appear on the homepage. Refresh the page, and once it shows the Access granted status as below, we have successfully enabled these LLMs in AWS Bedrock.

  • Next, we must ensure that Matillion DPC is authorized to use these Bedrock models.

  • To do this, we need to add an inline policy that grants the IAM user associated with our Matillion DPC instance permission to invoke Bedrock models. Below is the JSON for the inline policy –

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeModel",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}

**If you have not attached an AWS user to your Matillion DPC instance, follow this guide to do so.

In our case, the user’s name is svc_matillion_user. Below is how we add the inline policy to the user in IAM.

  • After adding the inline policy, click Next and assign it a meaningful name, as shown below, then click Create Policy.

  • Once the policy is created, it should appear listed under the IAM user, and we’ll be ready to use AWS Bedrock in Matillion DPC.

Snowflake Cortex

To enable the use of Snowflake Cortex LLM functions from Matillion DPC –

  • Grant the SNOWFLAKE.CORTEX_USER database role to the user/role used by Matillion DPC to connect to your Snowflake account. In the example below, the role used by DPC is SVC_MATILLION_TRAINING_ROLE. The required commands are provided below –

USE ROLE ACCOUNTADMIN;

GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE SVC_MATILLION_TRAINING_ROLE;

If you want to use Snowflake Arctic as the LLM and your Snowflake account is in a region other than AWS US West 2, run the below command from the ACCOUNTADMIN role to enable cross-region inference –

ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_US';

Note: You may need assistance from your Snowflake admins to execute these commands. They can also adjust them to fit your organization’s Role-Based Access Control hierarchy. The main objective is to grant the SNOWFLAKE.CORTEX_USER database role (and, optionally, cross-region inference) to the Snowflake user/role used by Matillion DPC.

Once you are granted the above access, you should be able to call the Snowflake Cortex LLM functions from your Matillion DPC pipelines.
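As a quick smoke test, you can call a Cortex function directly from a Snowflake worksheet before wiring it into a pipeline. Below is a minimal sketch using the simple single-string form of COMPLETE, assuming the role from the earlier example –

USE ROLE SVC_MATILLION_TRAINING_ROLE;

-- Returns the model's response as plain text
SELECT SNOWFLAKE.CORTEX.COMPLETE('snowflake-arctic', 'Reply with the single word OK.') AS smoke_test;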

Important Terms

Tokens

Tokens are character sets, or combinations of words and punctuation, that large language models (LLMs) use to decompose text while processing the input prompt and when returning the response to that prompt. The number of tokens is not determined solely by the size of the prompt text; it also depends on any additional information sent along with it, such as whether it is a user prompt or a system prompt.
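On Snowflake, one way to see how a model tokenizes a piece of text is the SNOWFLAKE.CORTEX.COUNT_TOKENS helper. Below is a small sketch; the prompt text is just an example –

-- Counts the tokens the given model would use for the input text
SELECT SNOWFLAKE.CORTEX.COUNT_TOKENS(
    'snowflake-arctic',
    'Summarize this article on climate change and provide insights for policy implications.'
) AS token_count;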

Context Window

A context window in a Large Language Model (LLM) is the amount of text (in terms of tokens) that the model can process at one time when generating or understanding the input prompt. All LLMs have a limit on the context window, which needs to be considered when feeding the input prompt to the model.

Temperature

A parameter that influences the creativity of the LLM’s output: it determines whether the output is more random and creative or more predictable. 0 means least random or creative, and 1 means most creative.

User Prompt

The User Prompt is the input the end user gives, which contains the actual question, command, or task you want the LLM to complete. It’s often specific and can vary in complexity, ranging from simple queries (e.g., “What is the weather today?”) to complex, multi-part instructions (e.g., “Summarize this article on climate change and provide insights for policy implications.”).

System Prompt

The System Prompt, or System Message, sets the overall context, tone, or rules for the LLM’s response. This prompt is given to the model to establish the LLM’s “personality” and tone of response for any given user prompt.
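To see how these terms fit together, below is a sketch of a Snowflake Cortex COMPLETE call that combines a system prompt, a user prompt, a temperature, and a max-token limit; the prompt strings are illustrative only –

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    [
        {'role': 'system', 'content': 'You are a concise assistant. Answer in one sentence.'},
        {'role': 'user', 'content': 'What is a context window in an LLM?'}
    ],
    {'temperature': 0, 'max_tokens': 100}  -- deterministic output, capped response length
) AS completion_result;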

This blog will help you understand all these and other parameters that help tune an LLM.

Building a Gen AI Workflow in Matillion DPC

The above sections show how to configure AWS Bedrock and Snowflake Cortex in Matillion DPC. We will now build a simple Gen AI workflow in Matillion DPC, implementing a Data Enrichment (Normalization and Standardization) use case via Snowflake Cortex with Snowflake Arctic as the LLM.

Problem Statement

When working with data from public sources like manual surveys, we often encounter scenarios where the data for various dimensions is not standardized. For example, suppose we are building an analytics application on top of survey data collected from various regional sources. Below is the sample source data we have in Snowflake –

As we can see from the data, due to the diverse nature of the sources, the date, rating, and address formats vary extensively, causing failures in the processing pipeline or incorrect analytical results. Writing custom code (e.g., a CASE WHEN construct) to standardize the different formats is not a scalable solution because we would have to continuously update the code as new data arrives to handle new scenarios.

Hence, we need to build a scalable pipeline that will successfully standardize the data points to easily harness the data for analytics.

Solution

Below is the transformation job we have built in Matillion DPC to solve this problem –

We will explain all the components of the pipeline in brief – 

Read Raw Survey Info

First, we read all the fields from our raw data table, SURVEY_INFO_RAW, as shown below –

Add Prompt

We build a prompt field named RAW_PROMPT for the LLM to process this data by concatenating all the data points into a single string, which will be passed to the LLM in the next step. Here is how it looks –
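For reference, the concatenation could look roughly like the sketch below; the raw column names (SURVEY_DATE, DATE_OF_BIRTH, CITY, STATE, COUNTRY, RATING) are assumptions for illustration –

-- Builds one labeled string per record; wrap columns in COALESCE if they can be NULL
SELECT USER_ID,
    'survey_date: ' || SURVEY_DATE || ' | ' ||
    'date_of_birth: ' || DATE_OF_BIRTH || ' | ' ||
    'city: ' || CITY || ' | ' ||
    'state: ' || STATE || ' | ' ||
    'country: ' || COUNTRY || ' | ' ||
    'rating: ' || RATING AS RAW_PROMPT
FROM SURVEY_INFO_RAW;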

Cortex Completions for Standardization

The Gen AI magic starts here. For each record, we now pass the RAW_PROMPT field in the Inputs field to Snowflake Cortex and select snowflake-arctic as the model, as shown below –

Then we specify the system prompt and the user prompt by clicking the gear icon on the right side of the respective fields, as shown below –

** Please remember to design the best possible User and System Prompt based on your use case and data. This guide can help you select the best prompt.

Next, we specify the other parameters required by the LLM, like Temperature (0, because we need a specific, repeatable response) and Max Tokens (left empty, as we don’t want to enforce a limit on the number of tokens for now), as shown below –

We will include all the input columns for now, as we need the USER_ID to persist in the final output. 

This component will call the Snowflake Arctic Model and generate a field called completion_result, which will hold the LLM’s response.
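Under the hood, this is roughly equivalent to one Cortex COMPLETE call per record. Below is a sketch; the system prompt wording and the input source are assumptions based on the curated output fields –

SELECT USER_ID,
    SNOWFLAKE.CORTEX.COMPLETE(
        'snowflake-arctic',
        [
            {'role': 'system', 'content': 'Return only a JSON object with keys survey_date, date_of_birth, city, state_iso2, country_iso3, and rating.'},
            {'role': 'user', 'content': RAW_PROMPT}
        ],
        {'temperature': 0}  -- deterministic, as configured in the component
    ) AS completion_result
FROM SURVEY_INFO_RAW;  -- in the pipeline, this is the output of the Add Prompt step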

Extract Output from Cortex Response

Once Cortex finishes generating the response for each record, we extract the output into a field named OUTPUT, which stores only the actual message, using an expression like the one sketched below –
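A plausible form of that expression, assuming completion_result holds the raw JSON response from COMPLETE (which nests the message text under choices[0].messages) –

-- Pull the message text out of the Cortex response, then parse it as JSON
TRY_PARSE_JSON(
    PARSE_JSON(completion_result):"choices"[0]:"messages"::VARCHAR
) AS OUTPUT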

Curate Columns Required

This is a simple SQL expression to flatten the JSON message into a relational format and select only the standardized output fields from the previous step, as shown below –

SELECT USER_ID,
OUTPUT:"survey_date"::DATE AS Survey_Date,
OUTPUT:"date_of_birth"::DATE AS DATE_OF_BIRTH,
OUTPUT:"city"::VARCHAR AS City,
REPLACE(OUTPUT:"state_iso2"::VARCHAR, 'US-', '') AS State,
OUTPUT:"country_iso3"::VARCHAR AS Country,
OUTPUT:"rating"::DOUBLE AS Rating
FROM $T{Extract Output from Cortex Response}

Update Curated Survey Info

We now have all the output fields, i.e., all the input fields standardized by Snowflake Arctic, and we will write them into our curated table. Below is how the data looks now –

This is the end of the pipeline: we have our desired output and have successfully standardized the raw input data into a consistent format according to our needs using Gen AI.

Comparative Study on AWS Bedrock vs Snowflake Cortex in Matillion DPC

Below are the highlights and major differences between the AWS Bedrock and Snowflake Cortex features that can be leveraged in Matillion DPC –

| Topic | AWS Bedrock | Snowflake Cortex |
| --- | --- | --- |
| Configuration | Relatively complex to configure | Easy to configure |
| Access Point | AWS Console, Boto3 API, or third-party tools like Matillion DPC | Snowsight, Snowflake Functions, or third-party tools like Matillion DPC |
| LLMs Supported via Matillion DPC | Anthropic Claude, Mistral AI, Meta Llama, Amazon Titan, Cohere Command | Snowflake Arctic, Mistral AI, Meta Llama, Reka AI |
| Costing Model | Matillion DPC costs + $x / 1K input tokens + $y / 1K output tokens (x and y vary per LLM model) | Matillion DPC costs + $z / 1M overall tokens + Snowflake warehouse cost (z varies per LLM model) |

** Matillion DPC costs follow a consumption-based pricing model and are charged for the duration the component runs.

Best Practices

  • Use batch inference wherever possible when working with large input data sets. If that’s not feasible, try combining multiple inputs into a single prompt and getting the response from the LLM in one call (see the sketch after this list). The context window of the particular LLM must be considered when deciding on the batch size.

  • Choose the use case and the LLM to be invoked carefully. The most difficult task is picking a proper use case for implementing Gen AI in your applications and the right LLM for it.
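Below is a minimal sketch of the batching approach mentioned above; the batch key, batch size, and table/column names are assumptions, and the combined prompt must stay within the model’s context window –

-- One COMPLETE call per batch of ~50 records instead of one per record
-- (assumes a numeric USER_ID to form batches)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    'Standardize each of the following records. Return one JSON object per line:\n'
        || LISTAGG(RAW_PROMPT, '\n') WITHIN GROUP (ORDER BY USER_ID)
) AS batched_result
FROM SURVEY_INFO_RAW
GROUP BY CEIL(USER_ID / 50);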

Closing

With GenAI, Matillion DPC is enhancing data productivity and redefining what’s possible with data. As the data processing landscape evolves, embracing GenAI positions organizations at the forefront of innovation, enabling them to quickly turn data into actionable intelligence. 

Are you ready to take your data operations to the next level? Dive into the potential of GenAI with Matillion Data Productivity Cloud and see firsthand how it can drive smarter, faster, and more impactful decisions. 

If you want to learn more about how GenAI and Matillion DPC can transform your data strategy, contact phData. Our team of experts is here to provide guidance, answer any questions, and help you unlock the full potential of your data operations.

FAQs

What is the Matillion Hub?

The Hub is an all-in-one platform providing various services to all registered users. New customers begin with a Credit-Based Trial. After registering and signing in, you can access Matillion’s products and tools to manage your Hub account and payment plan.

What is the difference between AI and Gen AI?

AI is a broad field dedicated to creating systems capable of tasks that typically require human intelligence, like decision-making, pattern recognition, problem-solving, and learning; applications include recommendation systems, speech recognition, autonomous driving, and predictive analytics. Gen AI is a specialized branch of AI that focuses on generating new content, such as text, images, audio, and even code. Popular examples of Gen AI include text generation models (like ChatGPT, Claude, etc.), image generation tools, and music composition systems.
