This article was originally written by Andrew Evans and updated for 2025 by Ryan Gooch.
As companies continue to adopt machine learning (ML) in their workflows, the demand for scalable and efficient tools has increased. Snowpark, an innovative technology from the Snowflake Data Cloud, meets this demand by providing users with an ecosystem of tooling that they can leverage to build applications and data pipelines with familiar programming languages, and without having to move data across tools or platforms.
Snowflake’s Snowpark helps data scientists and developers streamline their data workflows and improve the performance of their models while working in a familiar programming language such as Python, Java, or Scala. Snowpark ML is the Python toolkit that lets users preprocess data, perform feature engineering, and train models in Snowflake using the libraries they already know and love.
In this blog post, we will explore the performance benefits of Snowpark ML and how it can help businesses make better use of their data.
Snowpark for MLOps
Snowpark ML provides familiar interfaces for processing data and interacting with ML models in Snowflake. In this section, we’ll examine a few of the major components of the Snowpark ML library and discuss what they offer the user.
Many of the features are 1:1 API mappings from the widely used scikit-learn library and other commonly used data science tools. Here, we’ll focus on three essential parts of the SDK: Modeling, Preprocessing, and the Feature Store.
Snowpark ML Modeling
Snowpark ML Modeling enables model training directly within Snowflake’s ecosystem. Leveraging familiar Python frameworks like scikit-learn, LightGBM, and XGBoost, data scientists can perform feature engineering, preprocessing, and model training without leaving the Snowflake environment. This integration offers significant benefits, including distributed execution for preprocessing functions, accelerated training with distributed hyperparameter optimization, and distributed fitting of models.
The API closely mirrors popular ML libraries, making it intuitive for experienced data scientists while offering enhanced scalability.
Streamlined Model Training with Snowpark's Distributed Power
Snowpark ML Modeling supports a wide range of estimators and transformers, allowing for the creation of sophisticated models that seamlessly integrate with Snowpark ML Operations. By executing training procedures as stored procedures within Snowflake’s virtual warehouses, Snowpark ML Modeling harnesses the full power of Snowflake’s distributed computing capabilities, enabling efficient processing of large datasets and complex models.
Data scientists will find many of their favorite and most valuable metrics in the metrics module, including classification metrics like accuracy, F1 score, precision, recall, log loss, etc., as well as regression metrics such as mean absolute error, mean absolute percentage error, and mean squared error. There are also descriptive statistics metrics like correlation and covariance. All of these metrics support distributed execution, allowing data scientists to easily leverage performant compute clusters to calculate these values in a familiar API.
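As a minimal sketch, the metric functions take a Snowpark DataFrame plus the names of the true and predicted columns, so the computation runs in the warehouse. The DataFrame and column names below (classification_scores_df, regression_scores_df, LABEL, TARGET, PREDICTION, FEATURE_A, FEATURE_B) are hypothetical.
from snowflake.ml.modeling.metrics import accuracy_score, mean_absolute_error, correlation

# Hypothetical Snowpark DataFrames that already hold a ground-truth column and a prediction column
acc = accuracy_score(df=classification_scores_df,
                     y_true_col_names="LABEL",
                     y_pred_col_names="PREDICTION")
mae = mean_absolute_error(df=regression_scores_df,
                          y_true_col_names="TARGET",
                          y_pred_col_names="PREDICTION")

# Descriptive statistics: pairwise correlations computed in the warehouse
corr = correlation(df=regression_scores_df, columns=["FEATURE_A", "FEATURE_B", "TARGET"])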
Snowpark ML Modeling streamlines ML model inference, allowing for seamless deployment and execution of trained models within Snowflake’s ecosystem. Once a model is trained, making predictions on data is as easy as calling the model’s predict method, which creates a temporary user-defined function (UDF) in your Snowflake virtual warehouse. For long-term persistence, models can be stored in the Snowflake Model Registry, facilitating easy discovery and deployment across your organization. Snowpark ML supports various inference scenarios, including partitioned custom models for parallel execution, which can significantly boost performance when dealing with large datasets. The framework also provides a Snowpark Python version of scikit-learn’s Pipeline, enabling complex transformation sequences.
Additionally, Snowpark ML models can be “unwrapped” to their underlying third-party formats, offering flexibility in model manipulation and local execution. This integrated approach to inference eliminates data movement, reduces latency, and leverages Snowflake’s scalable compute resources, resulting in faster and more efficient prediction workflows.
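Here is a minimal sketch of such a pipeline, chaining an encoder and a model so that fit and predict both run inside the warehouse. The DataFrames (train_df, test_df), column names, and the fitted_xgb variable referenced in the final comment are hypothetical.
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import OrdinalEncoder
from snowflake.ml.modeling.xgboost import XGBRegressor

# Hypothetical Snowpark DataFrames and column names
pipe = Pipeline(steps=[
    ("encode", OrdinalEncoder(input_cols=["TEAM"], output_cols=["TEAM_ENC"])),
    ("model", XGBRegressor(input_cols=["TEAM_ENC", "YARDS", "TURNOVERS"],
                           label_cols=["POINT_SPREAD"])),
])

pipe.fit(train_df)                 # training executes inside the warehouse
scored_df = pipe.predict(test_df)  # predict() runs as a temporary UDF

# A fitted Snowpark ML estimator can also be unwrapped to its native object,
# e.g. fitted_xgb.to_xgboost(), for local inspection or experimentation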
Snowpark ML Preprocessing
Snowpark ML Preprocessing offers a set of utilities that data scientists and machine learning engineers will recognize from the scikit-learn library. These tools include scalers for standardizing numerical columns, encoders for handling categorical features, and utilities for binning data into intervals. Since these operations are optimized for distributed computing in Snowflake, the user can achieve massive performance gains without changing their code or leaving their Snowflake estate.
For more information, have a look at the Distributed Preprocessing performance comparison from Snowflake.
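As a quick illustration, here is a minimal sketch of scaling and binning columns with these utilities; the Snowpark DataFrame raw_df and the column names are hypothetical.
from snowflake.ml.modeling.preprocessing import KBinsDiscretizer, StandardScaler

# Hypothetical Snowpark DataFrame (raw_df) with numeric columns YARDS and POINTS
scaler = StandardScaler(input_cols=["YARDS", "POINTS"],
                        output_cols=["YARDS_SCALED", "POINTS_SCALED"])
scaler.fit(raw_df)                    # statistics computed in the warehouse
scaled_df = scaler.transform(raw_df)  # scaling pushed down to Snowflake

# Bin a numeric column into discrete intervals, also executed in the warehouse
binner = KBinsDiscretizer(n_bins=5, encode="ordinal",
                          input_cols=["POINTS"], output_cols=["POINTS_BIN"])
binner.fit(scaled_df)
binned_df = binner.transform(scaled_df)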
Snowpark ML Feature Store
Once features have been computed and preprocessed, the user can take advantage of the Snowflake Feature Store to persist their defined features and have them calculated as data arrives. Features are built with Dynamic Tables, allowing easy lineage and freshness controls, or developers can leverage external processing tools like dbt for calculation. This adds another dimension of utility to the Snowpark ML Modeling suite of tools, giving users pre-computed historical and fresh features with which they can easily train models.
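A minimal sketch of that workflow follows, assuming a hypothetical database USER_DB, feature store schema NFL_FEATURES, warehouse ML_WH, and a Snowpark DataFrame team_stats_df that defines the feature logic.
from snowflake.ml.feature_store import CreationMode, Entity, FeatureStore, FeatureView

# Hypothetical database, schema, and warehouse names
fs = FeatureStore(session=hex_snowpark_session,
                  database="USER_DB",
                  name="NFL_FEATURES",
                  default_warehouse="ML_WH",
                  creation_mode=CreationMode.CREATE_IF_NOT_EXIST)

# An entity declares the join keys that features attach to
team = Entity(name="TEAM", join_keys=["TEAM_ID"])
fs.register_entity(team)

# A feature view wraps the feature-defining DataFrame; with refresh_freq set,
# it is backed by a Dynamic Table and refreshed automatically as data arrives
fv = FeatureView(name="TEAM_STATS", entities=[team],
                 feature_df=team_stats_df, refresh_freq="1 day")
fs.register_feature_view(feature_view=fv, version="1")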
Snowflake Model Registry
The Snowflake Model Registry is a powerful centralized repository for managing and storing trained models, complete with associated metadata and version history. This comprehensive system simplifies model governance and fosters seamless collaboration among team members.Â
By treating models as first-class schema-level objects, the registry ensures easy discoverability and utilization across your organization. The Python-based Snowpark ML library provides classes for creating registries and storing models, supporting multiple versions and default designations. The registry’s versatility shines through its support of many model types, including Snowpark ML Modeling, popular frameworks like scikit-learn and PyTorch, and even custom models via the CustomModel class.Â
Once stored, models can be easily invoked to perform operations such as inference directly within Snowflake’s virtual warehouses, streamlining the entire machine learning workflow from development to deployment.
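Here is a minimal sketch of logging and invoking a model, assuming a hypothetical USER_DB.ML_MODELS schema, a trained Snowpark ML model object named model (such as the XGBoost regressor trained later in this post), and a hypothetical inference_df.
from snowflake.ml.registry import Registry

# Hypothetical database and schema for the registry
reg = Registry(session=hex_snowpark_session,
               database_name="USER_DB",
               schema_name="ML_MODELS")

# Log the trained model as a schema-level object with a version name and optional metadata
mv = reg.log_model(model,
                   model_name="NFL_SPREAD_MODEL",
                   version_name="V1",
                   comment="XGBoost point-spread regressor")

# Anyone with access can later retrieve the version and run inference in the warehouse
model_version = reg.get_model("NFL_SPREAD_MODEL").version("V1")
predictions_df = model_version.run(inference_df, function_name="predict")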
Snowpark ML Inference vs. Local Inference
In this example, we’ll use a preexisting model trained to predict the point spreads of NFL games using data from the Snowflake Marketplace, specifically from ThoughtSpot’s Fantasy Football dataset. To keep things simple, we’ll use a Hex notebook to execute each inference mode.
With ML inference in Snowpark, inference logic can be written natively in Python, deserializing a pre-trained model from a Snowflake stage. Predictions are generated in parallel, distributed by Snowflake’s execution engine, and scaled out rapidly within a warehouse. Bringing the model to the data, as opposed to the traditional approach of bringing the data to the model, has two key benefits.
First, data never needs to leave where it is natively stored, which is both faster and more secure. Second, the performance of the inference logic scales easily and rapidly: as needs change, compute resources can be allocated dynamically in response.
Local Inference
For local inference, we need to download the model stored in a Snowflake stage and deserialize it. Next, we’ll collect the appropriate football table for input and run predictions within our local environment. Once the predictions are finished, we must write those back to a table within Snowflake.
We’ll also record how long each step takes.
from datetime import datetime

import joblib
import pandas

t_start = datetime.now()

# Pull the input data out of Snowflake into a local pandas DataFrame
pandas_df = hex_snowpark_session.sql("""
    SELECT
        …several columns…
    FROM user_db.nfl.nfl_data_large
""").to_pandas()
t_collect = datetime.now()

# Deserialize the pre-trained XGBoost model downloaded from the Snowflake stage
model_xgb = joblib.load("hex/model/model.joblib.gz")
t_deserialize = datetime.now()

# Run predictions locally and attach them to the DataFrame
pandas_df["PREDICTION"] = model_xgb.predict(pandas_df)
t_predict = datetime.now()

# Write the predictions back to a Snowflake table
hex_snowpark_session.sql("USE SCHEMA USER_DB.NFL").collect()
hex_snowpark_session.write_pandas(pandas_df,
                                  table_name="predictions",
                                  database="USER_DB",
                                  schema="NFL",
                                  auto_create_table=True)
t_write = datetime.now()

# Record how long each step took
times = {
    '1. Collect': (t_collect - t_start).total_seconds(),
    '2. Deserialize': (t_deserialize - t_collect).total_seconds(),
    '3. Predict': (t_predict - t_deserialize).total_seconds(),
    '4. Write': (t_write - t_predict).total_seconds(),
    '5. Total': (t_write - t_start).total_seconds()
}
Snowpark
Train
Snowpark allows us to write Python code right where the data is—i.e., in our Snowpark-optimized warehouse—without needing to worry about creating stored procedures or UDFs. We simply run data science code on a Snowflake table, which allows us to train the model in a familiar format. In this example, we split the data into training and validation sets and fit an XGBoost regressor to the training data before storing the model and reporting results.
Note that there are some small differences from the scikit-learn API due to the lazy execution style of Snowpark; we must tell the XGBRegressor constructor which features are inputs to the model and which is our target column.
# Install a recent version of the Snowpark ML library (snowflake-ml-python)
# !pip install --upgrade snowflake-ml-python
import numpy as np
from snowflake.ml.modeling.metrics import mean_squared_error as MSE, r2_score
from snowflake.ml.modeling.xgboost import XGBRegressor

# Load the data as a Snowpark DataFrame (no data leaves Snowflake)
df_in = hex_snowpark_session.table("NFL_DATA")

# Split the data into training and testing sets
X_train_df, X_test_df = df_in.random_split([0.9, 0.1], seed=42)

# features_to_train is a list of column names defined earlier, with the target last.
# Initialize the model, declaring which columns are inputs and which is the label
model = XGBRegressor(
    n_estimators=2000,
    max_depth=10,
    eta=0.001,
    subsample=0.5,
    colsample_bytree=0.8,
    reg_lambda=0.5,
    reg_alpha=0.5,
    gamma=100,
    input_cols=features_to_train[:-1],
    label_cols=features_to_train[-1],
    output_cols=["PREDICTION"],  # name the prediction column for easy reference below
)

# Fit the model inside the warehouse
model.fit(X_train_df)

# Predict on the held-out set; predictions are appended as the PREDICTION column
pred_df = model.predict(X_test_df)

# RMSE and R2, computed with Snowpark ML's distributed metric functions
rmse = np.sqrt(MSE(df=pred_df,
                   y_true_col_names=features_to_train[-1],
                   y_pred_col_names="PREDICTION"))
r2 = r2_score(df=pred_df,
              y_true_col_name=features_to_train[-1],
              y_pred_col_name="PREDICTION")

# Print the results
print(f"R2_Test: {r2}", f"RMSE_Test: {rmse}")
Inference
Inference execution requires as few as two lines of code: getting a reference to our inference data and calling the predict method on our trained model.
df_in = hex_snowpark_session.table("INFERENCE_NFL_DATA")
inference_preds = model.predict(df_in)
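Because inference_preds is a Snowpark DataFrame, the scored rows can also be persisted without ever leaving Snowflake; the table name below is hypothetical.
# Persist the predictions as a table, entirely within Snowflake
inference_preds.write.mode("overwrite").save_as_table("NFL_PREDICTIONS")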
The chart below compares timings for producing 750,000 predictions with our XGBoost model. In the local case, time is spent copying data out, predicting with statically allocated compute resources, and then writing the predictions back to the warehouse.
In the Snowpark case, all the steps are performed within the warehouse and executed far faster. If our data volume grows even larger, we could simply scale out the warehouse it runs on.
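For example, resizing the warehouse is a single statement, and the inference code itself doesn’t change; the warehouse name here is hypothetical.
# Scale the warehouse up for a larger batch of predictions (it can be sized back down afterwards)
hex_snowpark_session.sql("ALTER WAREHOUSE ML_WH SET WAREHOUSE_SIZE = 'LARGE'").collect()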
Conclusion
Snowpark provides all the essential capabilities a data scientist needs with a classic interface backed by the power of Snowflake’s compute. From initial exploratory analysis to putting machine learning in production, Snowpark makes the development process remarkably straightforward. With complete pushdown of preprocessing, training, and inference, data science teams have all the compute they need at their fingertips, at whatever scale their models require.
Want to learn more?
Check out our recent Snowflake ML Objects Cheatsheet for insights into building better predictive ML capabilities within Snowflake!