One of the benefits to Aptitive’s partnership with Snowflake is access to advanced trainings and certifications. Combined with our Snowflake project work,these trainings help us become Snowflake experts and find opportunities to assist our clients in making Snowflake work better for them.The most recent training helped us identify some important tactics for solving one of clients’ biggest concerns: How do I optimize for cost? Here is a list of actions that you should take to make sure you are not overspending on your Snowflake computation or storage.
First, for context, here is an extremely simplified diagram of how Snowflake is functioning in the background:
Since the most expensive part of any Snowflake deployment is compute, we have identified some useful tactics to store data strategically for efficient reads, write supercharged SQL scripts, and balance your performance vs cost.
Although very different than storing data on traditional disk, there are many benefits to loading Snowflake data strategically.
1.Sort on ingestion: Data is automatically partitioned in SF on natural ingestion order.
– Sorting an S3 bucket (using something like syncsort) before bulk load via copy could be way faster than inserting with an order by
2.CSV (Gzipped) is the best format for loading to SF (2-3x faster than Parquet or ORC)
3.Use COPY INTO instead of INSERT because it utilizes the more efficient bulk loading processes.
Take advantage of the native cloud ability to scale, create, and optimize your compute resources.
4.Scale up or out appropriately.
– As seen above, when you run a query, Snowflake will:
+ Find required FDN files
+ Pull files down into SSD VMs (Note: if >160 GB for AWS or >400 GB for Azure, will spill over to remote IO)
+ Performs compute
+ Files will stay on VM until DW is suspended
–1 big query = increase size of Data Warehouse
–Lots of small queries = queries are queuing = increase # of DWs or # of clusters (if enterprise you can enable multi-cluster)
5.Turn your VW on and off for certain workloads
–Turn on for batch, then immediately turn off (no reason to wait for auto-suspend)
–Use Auto resume when it makes sense
6.Control query processing and concurrency with parameters
–Statement queued timeout in seconds
–Statement timeout in seconds
7.Use warehouse monitoring to size and limit cost per workload (not per database → this is a shift from the on-prem mentality)
–If your workload is queuing then add more clusters
–If your workload is slow with no queuing then size up
Often overlooked, organizing your information into a mature data model will allow for high-performance SQL scripting better caching potential. Shameless plug, this is Aptitive’s bread and butter. Please reach out for a free working session to discuss data modeling for your company.
8.Do a data model for analytics
–Star Schema, 3NF, Data vault are optimal for SF
–Snowflake is NOT ideal for OLTP workloads
9.Bake your constraints into design because SF DOES NOT enforce them
–Build queries to check for violations
10.Build a process to alert you of loading issues (use an ETL framework)
Snowflake preserves a massive amount of usage data for analysis. At the very least, it allows you to see which workflows are the most expensive.
11.Use Account Usage Views (eg warehouse_metering_history) for tracking history, performance and cost
12.Don’t use AccountAdmin or Public roles for creating objects or accessing data (only for looking at costs)…create securable objects with the “correct” role and integrate new roles into the existing hierarchy
– Create roles by business functions to track spending by line of business
13.Use Resource Monitors to cut off DWs when you hit predefined credit amount limits
– Create one resource monitor per DW
– Enable notifications
The history profiler is the primary tool to observe poorly written queries and make the appropriate changes.
14.Use history profiler to optimize queries
– Goal is to put the most expensive node in the bottom right hand corner of profiler diagram
– SYSTEM$CLUSTERING_DEPTH shows how effective the partitions are – the smaller the average depth, the better clustered the table is with regards to the specified columns
+ Hot tip: You can add an new automatic reclustering service, but I don’t think it is worth the money right now
15.Analyze Bytes Scanned: Remote VS Cache
– Make your Bytes Scanned column use “Cache” or “Local” memory most of the time, otherwise consider creating a cluster key to scan more efficiently
16.Make the ratio of partitions scanned to partition used as small as possible by pruning
The number one issue driving costs in a Snowflake deployment is poorly written code! Resist the tendency to just increase the power (and therefore the cost) and focus some time on improving your SQL scripts.
17.Drop temporary and transient tables when done using
18.Don’t use “CREATE TABLE AS”, SF hates trunc and reloads for time travel issues. Instead, use “CREATE OR REPLACE”
– Again, Use COPY INTO notINSERT INTO
– Use staging tables to manage transformation of imported data
– Validate the data BEFORE loading into SF target tables
19.Use ANSI Joins because they are better for the optimizer
– Use “JOIN ON a.id = b.id” format
– NOT the “WHERE a.id=b.id”
20.Use “WITH” clauses for windowing instead of temp tables or sub-selects
21.Don’t use ORDER BY. Sorting is very expensive!
– Use integers over strings if you must order
22.Don’t handle duplicate data using DISTINCT or GROUP BY
Finally, set up the Snowflake deployment to work well in your entire data ecosystem.
23.Locate your S3 buckets in the same geographic region
24.Set up the buckets to match how the files are coming across (eg by date or application)
25.Keep files between 60-100 MB to take advantage of parallelism
26.Don’t use materialized views except in specific use cases (eg pre-aggregating)
Snowflake is shifting the paradigm when it comes to data warehousing in the cloud. However, by fundamentally processing data differently than other solutions, Snowflake has a whole new set of challenges for implementation.
If you’re in the Chicago area, Aptitive is excited to help sponsor the new Snowflake User Group in the area. We hope to promote a community for users to come together and discuss everything for performance to new features. Come to the first meetup on February 21st at 4pm in the Vibes Office!
Greg Marsh is a Data Engineer Manager at Aptitive. In his role, Greg facilitates the discovery of business insights from data. From “Big” data like IoT streams or classic relational ERP information, Greg helps companies to unlock the power of their data.