At phData, we have worked extensively with Snowpark since it became available in the Snowflake Data Cloud. Historically, Snowpark has been based on Scala and Java, and one of those languages had to be used when building data transformations or data applications.
At the 2022 Snowflake Summit, Snowflake announced that Snowpark now supports Python. Lucky for you, phData has had early access to this feature for several months, and we’re happy to share our thoughts and learnings.
In this blog, we’ll explore the new Python feature in Snowpark and why it matters to your business.
Does Snowflake Support Python in Snowpark?
Snowflake now supports Scala, Java, and Python in Snowpark. Not only is there a Python API, but there are also some additional features that are Python-specific. Like the Scala and Java APIs, the Python Snowpark API provides a DataFrame API that allows lazy transformations of the data.
Because those transformations execute inside Snowflake rather than pulling data down to the client, this can be much more efficient than a normal Python client and offers a much cheaper alternative to the JDBC client.
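To make this concrete, here is a minimal sketch of the lazy DataFrame API. The connection_parameters dictionary and the ORDERS table are illustrative assumptions, not part of the library:

```python
# A minimal sketch of the lazy Snowpark Python DataFrame API.
# connection_parameters and the ORDERS table are illustrative assumptions.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs(connection_parameters).create()

# Each call below only builds a query plan; no data moves yet.
shipped_by_region = (
    session.table("ORDERS")
    .filter(col("STATUS") == "SHIPPED")
    .group_by("REGION")
    .count()
)

# Only now does Snowflake execute the query, inside the warehouse.
shipped_by_region.show()
```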
One of the other upsides of the Python Snowpark support is that Python User Defined Functions (UDFs) and Python stored procedures are now supported as well. The full Python support allows everything that was previously done via Scala and Java to be done via Python, and simpler transformations will often require less time and effort than a Scala implementation.
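As a quick sketch (assuming the session from the example above is still open), a Python UDF can be registered with a decorator and then called from SQL or the DataFrame API; the add_one name and logic are purely illustrative:

```python
# Sketch of registering a Python UDF; registration runs against Snowflake,
# so an active Snowpark session is assumed.
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType

@udf(name="add_one", input_types=[IntegerType()], return_type=IntegerType(),
     replace=True)
def add_one(x: int) -> int:
    return x + 1

# Once registered, it can be called like any other function in SQL.
session.sql("SELECT add_one(41)").show()
```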
Managing Packages
As with any Python application, there are always questions about package management. While not all Python packages are supported (TensorFlow, for example), most packages can be enabled via the ORGADMIN role. Packages are all managed via Anaconda, and the available packages can be viewed with this command: select * from information_schema.packages where language = 'python'; Once a package is enabled, it can be used by anyone in the account.
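As an example, here is a hedged sketch of pulling an Anaconda-provided package into a UDF, again assuming an active session; numpy stands in for any package returned by the query above:

```python
# Sketch: using an Anaconda-provided package inside a Python UDF.
# Assumes an open `session`; numpy is just one example of a supported package.
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

session.add_packages("numpy")  # must appear in information_schema.packages

@udf(name="safe_sqrt", input_types=[FloatType()], return_type=FloatType(),
     replace=True)
def safe_sqrt(x: float) -> float:
    import numpy as np  # resolved from the Anaconda channel at runtime
    return float(np.sqrt(abs(x)))
```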
Python Specific Changes
Another Python-specific feature is the UDF batch API. Batch UDFs operate on Pandas DataFrames and process batches of rows instead of one row at a time. This can speed up certain algorithms and makes it easier to use code that already accepts Pandas DataFrames, and it can simplify transformation code that expects a Pandas DataFrame. One thing to keep in mind is that each batch must complete within 60 seconds. If needed, the batch size can be specified manually, but batches should not share state.
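Below is a sketch of a batch UDF under the same session assumption; max_batch_size is the knob for manually bounding batch sizes:

```python
# Sketch of a batch UDF: the function receives a whole Pandas Series per
# batch instead of a single row, and each batch must finish in 60 seconds.
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import IntegerType, PandasSeries

@pandas_udf(name="add_one_batch", input_types=[IntegerType()],
            return_type=IntegerType(), max_batch_size=1000, replace=True)
def add_one_batch(xs: PandasSeries[int]) -> PandasSeries[int]:
    # Plain Pandas arithmetic over the whole batch; no state is shared
    # between batches.
    return xs + 1
```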
Why Implement Python for Snowpark?
As with the Scala and Java APIs, any customer looking to create data transformations in a Snowflake environment can leverage Snowpark. But not every client has staff on hand who are versed in Scala or Java development. Python, on the other hand, is enormously popular and widely used in data engineering, so the odds are quite high that existing staff can already work in Python, reducing the cost of implementation through lower training costs.
Is Anything Else Important?
While Snowpark is more efficient than a regular JDBC client, there are still several performance gotchas to watch for. We have another article covering Best Practices for Snowpark, and those same best practices apply to the Python implementation as well. On top of that, using the batch API can provide more throughput.
Closing
Overall, Snowflake is working to grow the Snowpark ecosystem and has added a widely requested feature to it. This should give data engineers another tool to work quickly and efficiently in the Snowflake ecosystem and deliver more business value.
If you’re interested in learning more about new Snowflake features, our team of experts would love to help answer your toughest questions. As the 2022 Snowflake Partner of the Year, we have a proven record of helping organizations of all sizes succeed with Snowflake, and we’d love to help your organization too.