July 2, 2024

How to Set Up a Project in Snowpark Using a Python IDE

By Kaulab Basu

Snowpark, offered by the Snowflake AI Data Cloud, consists of libraries and runtimes that enable secure deployment and processing of non-SQL code written in languages such as Python, Java, and Scala. With Snowpark, you can develop code outside of the Snowflake environment and then deploy it back into Snowflake, without having to manage additional infrastructure.

In this blog, we’ll cover the steps to get started, including:

  1. How to set up an existing Snowpark project on your local system using a Python IDE.

  2. Add a Python UDF to the existing codebase and deploy the function directly in Snowflake.

  3. Validate the function deployment locally and test from Snowflake as well.

  4. Dive deep into the inner workings of the Snowpark Python UDF deployment.

  5. Check in the new code to Github using git workflows and validate the deployment.

Before we dive into the steps, it’s important to unpack why Snowpark is such a big deal.

What Does Snowpark Bring to the Table?

Familiar Client Side Libraries – Snowpark brings in-house, deeply integrated, DataFrame-style programming abstractions and OSS-compatible APIs to the languages data practitioners like to use (Python, Scala, etc.). It also includes the Snowpark ML API for more efficient machine learning (ML) modeling and ML operations.

Flexible Runtime Constructs – Snowpark provides flexible runtime constructs that allow users to bring in and run custom logic. Developers can seamlessly build data pipelines, ML models, and data applications with User-Defined Functions and Stored Procedures.

What Are Snowpark’s Differentiators?

  1. Spark/Pandas DataFrame Style Programming Within Snowflake – You can perform your standard actions/transformations just like you would in Spark (a short sketch follows this list).

  2. Easy to Collaborate – With multiple developers working on the programming side of things, Snowpark efficiently solves what would otherwise be a collaboration headache. You can set up your own environment on your local system and then check in/deploy the code back to Snowflake using Snowpark (more on this later in the article).

  3. Built-In Server-Side Packages – NumPy, Pandas, Keras, etc.

  4. Easy to Deploy – Using SnowCLI, you can easily integrate VS Code (or any editor, for that matter) with Snowflake and use a defined virtual warehouse to run your deployment process (we’ll also cover this later in the article).
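To make the first differentiator concrete, here is a minimal sketch of Snowpark’s DataFrame-style API. The connection details, table name, and column names below are placeholders; the point is that the chained operations are built lazily and executed inside Snowflake.

```python
# Minimal sketch of DataFrame-style programming with Snowpark.
# Connection details, table name, and column names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "HOL_DB", "schema": "ANALYTICS",
}
session = Session.builder.configs(connection_parameters).create()

# Build the transformation lazily, Spark/Pandas style; Snowpark pushes the
# whole pipeline down as SQL and runs it inside Snowflake.
df = (
    session.table("ORDERS")
    .filter(col("ORDER_STATUS") == "SHIPPED")
    .group_by("REGION")
    .agg(avg(col("ORDER_TOTAL")).alias("AVG_ORDER_TOTAL"))
)
df.show()
```

Anyone who has written Spark or Pandas pipelines should find this style immediately familiar.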

How to Set up a Project in Snowpark

Now with the background information out of the way, let’s get started! The first thing you’ll want to ensure is that you have all of the following: 

  • Python (preferably between 3.8 and 3.10, but not greater than 3.10).

  • Miniconda (if you already have Anaconda, please skip).

  • Snowflake account with ACCOUNTADMIN access (to alter the configuration file).

  • Anaconda Terms & Conditions are accepted on Snowflake. See the Getting Started section in Third-Party Packages.

  • GitHub account. 

  • Git Bash installed.

  • VS Code (or any similar IDE to work with).

  • SnowSQL installed on your system.

Note: These instructions are meant to work for Windows, but are very similar for people working on Mac and Linux operating systems.

Getting Started

  1. Start by forking the sfguide-data-engineering-with-snowpark-python quickstart repository to create your own repo on GitHub.

  2. Add Miniconda scripts path to your environment variables system path (similar to C:\Users\kbasu\AppData\Local\miniconda3\Scripts).

  3. Create a root folder and point VS Code to that directory using the command prompt terminal.

  4. Clone your forked repository to the root directory. (git clone <your project repo>).

  5. Move inside sfguide-data-engineering-with-snowpark-python (cd sfguide-data-engineering-with-snowpark-python).

  6. Create conda environment (conda env create -f environment.yml). You should find environment.yml already present inside the root folder.

  7. Activate the conda environment. (conda activate snowflake-demo).

Now you should see the new environment activated!
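As a quick, optional sanity check (assuming the quickstart’s environment.yml installs the snowflake-snowpark-python package, which the later steps rely on), you can confirm the interpreter and package versions from the activated environment:

```python
# Optional sanity check; run with the snowflake-demo environment active.
import sys
from importlib.metadata import version

import snowflake.snowpark  # raises ImportError if the wrong environment is active

print(sys.version_info[:3])                  # expect a 3.8, 3.9, or 3.10 interpreter
print(version("snowflake-snowpark-python"))  # the Snowpark version pinned by environment.yml
```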

At this point, the project directory should contain the quickstart’s folder structure, including the environment.yml used above and the .github/workflows definitions that drive the deployment covered later in this article.

How to Add a New Python UDF and Deploy It to Snowflake

  1. Typically, you’d want to create a separate folder for each function so that it can be deployed as is (with no dependencies on the rest of the codebase). Each function folder contains:

    1. An app.py – This is the main file where you will write the function code. The file name can be different, but for standardization purposes it is advised to use app.py.

    2. An app.toml – This drives the deployment of the code (we’ll walk through each of these files below).

    3. A requirements.txt – As in any other Python project, this is where any Python libraries required for this particular function are listed so they can be found and installed (for this particular UDF, it is left blank because no additional library is required).

    4. A .gitignore – This keeps locally generated deployment artifacts (such as the app.zip created at deploy time) out of the git repository.

  2. Main function – A simple multiplication function, for example:
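Here is a minimal sketch of what this app.py could look like; the multiply handler name and argument types are illustrative, not the exact code from the original post:

```python
# app.py - a minimal sketch of the UDF handler (names and types are illustrative).
def multiply(a: int, b: int) -> int:
    """Return the product of two integers; deployed to Snowflake as a Python UDF."""
    return a * b
```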

  3. TOML file – app.toml tells SnowCLI how to deploy the function; the app.zip it references is created automatically while deploying the code (we’ll come to that shortly). The rest should be self-explanatory.

  4. Requirements file – Leave requirements.txt blank for this UDF.

  5. Git ignore file.

  6. Install any additional libraries (only for this new functionality) if required. We don’t have any in our example, but for reference, you would list the package in the folder’s requirements.txt and install it into the active conda environment (for example, with pip install -r requirements.txt).

  7. Let’s test this function locally first (we want to be confident in the newly added piece of code before deploying it), as sketched below.
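One lightweight way to do this, assuming the multiply handler sketched above lives in app.py, is a plain Python check run from the function’s folder:

```python
# test_local.py - quick local checks before deploying (run from the function's folder).
from app import multiply  # assumes the handler sketched above lives in app.py

assert multiply(3, 4) == 12
assert multiply(-2, 5) == -10
print("Local checks passed.")
```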

  8. Now it’s time to deploy the code to Snowflake. With SnowCLI configured, you deploy the function from its folder (check snow --help in your installed SnowCLI version for the exact function-deployment command).

The deployment should be successful. In the next sections, we will review the deployment’s validation and the inner workings of the entire process.

Validating the Deployment in Snowflake

  1. Existence – The newly created Python UDF should appear in the ANALYTICS schema of the HOL_DB database.

  2. Content – Let’s validate the content. SnowCLI simplifies the process and automatically converts the Python code to a SQL script (more on this in the next section). You can pull the generated definition back out of Snowflake, as sketched below.
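For example, you can fetch the generated definition with GET_DDL. The connection details and the function signature below follow the illustrative multiply UDF and are placeholders:

```python
# Fetch the SQL definition that was generated for the UDF (names are placeholders).
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "HOL_DB", "schema": "ANALYTICS",
}
session = Session.builder.configs(connection_parameters).create()

ddl = session.sql(
    "SELECT GET_DDL('FUNCTION', 'HOL_DB.ANALYTICS.MULTIPLY(NUMBER, NUMBER)')"
).collect()
print(ddl[0][0])  # expect a CREATE OR REPLACE FUNCTION ... LANGUAGE PYTHON ... statement
```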

  3. Run the Function From Snowflake – Let’s test the functionality from Snowflake, for example by calling the UDF from a worksheet: SELECT HOL_DB.ANALYTICS.MULTIPLY(3, 4); (using the illustrative function name from above).

  4. Test Locally – Now that we have tested the functionality from Snowflake, let’s call the deployed function from the local environment as well, as sketched below.
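A minimal sketch, again using the illustrative MULTIPLY name and placeholder connection details, is to call the deployed function through a Snowpark session:

```python
# Call the deployed UDF from your local machine through a Snowpark session.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "HOL_DB", "schema": "ANALYTICS",
}
session = Session.builder.configs(connection_parameters).create()

result = session.sql("SELECT HOL_DB.ANALYTICS.MULTIPLY(3, 4) AS PRODUCT").collect()
print(result)  # expect [Row(PRODUCT=12)]
```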

In our example, the deployment worked just fine. 

How Does the Deployment Work in the Background?

For Snowpark Python UDFs and sprocs in particular, the SnowCLI does all the heavy lifting of deploying the objects to Snowflake. Here is what it does in the background:

  • Dealing with third-party packages

    • For packages that can be accessed directly from our Anaconda environment, it will add them to the packages list in the create function SQL command. 

    • For packages that are not currently available in our Anaconda environment,  it will download the code and include them in the project zip file.

  • Creates a zip file of everything in your project.

  • Copies that project zip file to a Snowflake stage.

  • Creates the Snowflake function object that references the uploaded zip file and the configured handler.
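For intuition, here is a hedged, Snowpark-only sketch that performs the same two background steps (upload the code to a stage, create the function object) programmatically via session.udf.register; SnowCLI does the equivalent for you with the staged zip file and a CREATE FUNCTION statement. The stage and function names are placeholders.

```python
# Snowpark-only illustration of the two background steps SnowCLI performs:
# upload the code to a stage and create the function object. Names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.types import IntegerType

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "HOL_DB", "schema": "ANALYTICS",
}
session = Session.builder.configs(connection_parameters).create()

def multiply(a: int, b: int) -> int:
    return a * b

# is_permanent=True uploads the function's code to the named stage and issues
# the CREATE FUNCTION statement that references it.
session.udf.register(
    func=multiply,
    name="MULTIPLY",
    return_type=IntegerType(),
    input_types=[IntegerType(), IntegerType()],
    is_permanent=True,
    stage_location="@HOL_DB.ANALYTICS.DEPLOYMENT_STAGE",  # placeholder stage
    replace=True,
)
```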

More on SnowCLI can be found here, as it is still being actively developed by Snowflake. For a comprehensive guide to writing Python UDFs, check out this guide.

Next, we will deploy the changes back to git.

Deploying the Changes to the Git Repo

  1. Configure Forked Repo – From the repository, go to Settings > Secrets and variables > Actions > New repository secret (near the top right) and create the connection secrets expected by the quickstart’s GitHub Actions workflows (your Snowflake account, user, password, role, warehouse, database, and schema), entering the appropriate value for each.

  2. Configure SnowSQL Parameters – From the VS Code editor, press Ctrl+P and open ~/.snowsql/config. Add (or update) a connection section with your Snowflake account, user, password, role, warehouse, database, and schema (SnowSQL’s accountname, username, password, rolename, warehousename, dbname, and schemaname options).

  3. Commit and Push Your Changes

    1. You should already see pending changes in the VS Code Source Control section. Enter a suitable commit message and commit.

    2. If not added already, set your git username and email (git config --global user.name "Your Name" and git config --global user.email "you@example.com").

    3. You can commit and then sync separately, or use the combined Commit & Sync option.

    4. Verify the success message and confirm there are no pending changes left in the Source Control section.

  4. Verify the Changes on GitHub

    1. Go to the Actions tab in GitHub and find the latest workflow run.

    2. Click on the latest run and verify the details.

    3. In the latest codebase, verify that the latest changes are present.

Best Practices

Now you may ask: when is the right time to go the Snowpark route? Here are some of our best practices:

  • If you only have a few Python (or other non-SQL) UDFs to write, or already written, you may want to go the Python worksheet route instead.

  • If your data pipeline requirements are quite straightforward (i.e., they don’t involve much conditional logic, many different types of workloads, or changing requirements) and the pipelines can be written in plain SQL without any foreseeable debugging issues, then you may not want to over-engineer; stick to traditional SQL-based Snowflake pipelines.

  • If you have a simple migration requirement (e.g., Hive tables must be migrated to Snowflake), check whether you can use the Snowflake Connector for Spark directly instead of going the Snowpark route.

  • Snowpark will have the greatest impact on the following use cases:

    • You have migration requirements with data cleansing/transformation/standardization logic that may or may not already be written in Spark.

    • You have different types of workloads to handle with varying requirements.

    • You have different developers working on building data pipelines/UDFs/stored procedures in the same environment.

    • You have code written in different languages (Java, Python, etc.), and you want to have a common platform without having to worry about infrastructure considerations.

    • You already have an SQL-based data pipeline, but changing it to meet new requirements would require extensive changes. 

    • You want a Spark-like programming environment.

Closing

Thank you so much for reading! Our hope is that this article helps you get the most out of Snowpark. If you need help, have questions, or want to learn more about how Snowpark & Snowflake can help your business make more informed decisions, the experts at phData can help!

As Snowflake’s 2024 Partner of the Year, phData can confidently help you get the most out of your Snowflake investment. Explore our Snowflake services today, especially our Snowpark MVP program.

FAQs

How is Snowpark different from the Snowflake Python connector?

When you use the Snowflake Python connector, you fetch data from Snowflake and bring it to the compute instance (on whichever public cloud you are using: AWS, Azure, or GCP) where your Python code runs for further processing. In Snowpark, your Python (or Scala/Java) data-processing code, written as a UDF and deployed with SnowCLI, runs on the Snowflake engine itself on a virtual warehouse. You do not need to bring the data out of Snowflake for processing, which makes Snowpark a better choice if you’re concerned about security. In addition, Snowpark offers the following benefits:

  • Support for interacting with data within Snowflake using libraries and patterns purpose built for different languages without compromising on performance or functionality.

  • Support for authoring Snowpark code using local tools such as Jupyter, VS Code, or IntelliJ.

  • Support for pushdown for all operations, including Snowflake UDFs. This means Snowpark pushes down all data transformation and heavy lifting to the Snowflake data cloud, enabling you to efficiently work with data of any size.
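One way to see that pushdown for yourself (a sketch with placeholder connection, table, and column names) is to ask Snowpark for the query it will send to Snowflake:

```python
# Snowpark compiles the whole DataFrame chain into SQL that runs inside Snowflake.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "HOL_DB", "schema": "ANALYTICS",
}
session = Session.builder.configs(connection_parameters).create()

df = (
    session.table("ORDERS")
    .filter(col("ORDER_STATUS") == "SHIPPED")
    .select("REGION", "ORDER_TOTAL")
)
df.explain()  # prints the generated SQL and the plan Snowflake will execute
```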

What are the limitations of Snowpark?

By now, you know that Snowpark requires users and developers to be comfortable with at least one programming language. Here are some additional points:

  • Snowpark is a relatively new technology, and some bugs or performance issues may not yet have been identified.

  • Snowpark is a programming model, and it requires some level of programming expertise to use.

  • Snowpark is not currently available for all Snowflake regions.

  • There are some limitations around writing stored procedures.
