Introduction
AWS Glue and Snowflake are a common pairing for ETL workloads. Glue handles the orchestration and Spark execution, and Snowflake serves as the data warehouse. The usual development cycle for this combination involves deploying Glue jobs to AWS, waiting for Spark to spin up, debugging through CloudWatch, and paying for Snowflake compute credits, all before you know if a query change even works.
LocalStack’s Snowflake emulator runs a local Snowflake-compatible endpoint inside the same container as your emulated AWS services, so a Glue job running locally can connect to Snowflake over JDBC without leaving your machine. No AWS credentials, no Snowflake account, no waiting on cloud infrastructure.
In this tutorial, we’ll build a working Glue ETL pipeline that reads from a Snowflake table using the Spark Snowflake connector. We’ll provision all the AWS resources (S3, IAM, Glue) with Terraform via tflocal, seed the Snowflake table with an init script, and run the Glue job end to end. By the end, you’ll have a local setup you can use to iterate on Glue-Snowflake jobs.
Key Concepts
This tutorial combines three pieces:
- AWS Glue for running Spark-based ETL jobs;
- The Snowflake Spark connector for reading data over JDBC;
- LocalStack’s Snowflake emulator for providing a local Snowflake endpoint.
Here’s a quick overview of each.
AWS Glue and PySpark
AWS Glue is a managed ETL service built on Apache Spark. When you create a Glue job, you write a Python script (PySpark) that Glue runs on a Spark cluster. The Glue runtime provides its own context (GlueContext) on top of Spark’s SparkContext, handling bookmarks, logging, and job lifecycle.
For this tutorial, the Glue job uses the glueetl command type with Glue version 4.0 and Python 3. The job doesn’t transform or write data anywhere. It reads from Snowflake and prints the result. That keeps the focus on the integration plumbing rather than the ETL logic itself.
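For orientation, here is a rough sketch of the lifecycle that a Glue PySpark script typically wraps around its work. It follows the standard Glue job template and is not a copy of the tutorial's script/job.py; the function name run_job is made up for illustration. The awsglue and pyspark modules are only importable inside a Glue runtime, which is why the sketch imports them inside the function.

```python
def run_job(argv):
    """Generic Glue job lifecycle: resolve args, build contexts, init, commit."""
    # These modules exist only inside a Glue runtime (AWS or LocalStack),
    # so they are imported lazily rather than at module load time.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Glue passes --JOB_NAME on the command line of every job run.
    args = getResolvedOptions(argv, ["JOB_NAME"])

    # GlueContext wraps the SparkContext and exposes a SparkSession.
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # ... the job's reads and transformations would go here, using `spark` ...

    # commit() signals that the run finished (and persists bookmarks, if enabled).
    job.commit()
    return spark
```

Inside a job script you would call run_job(sys.argv) at module level, after importing sys.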
Snowflake Spark Connector
The Snowflake Spark connector is a library that lets Spark read from and write to Snowflake. Under the hood, it uses the Snowflake JDBC driver to establish a connection, push queries down to Snowflake, and pull results back as Spark DataFrames. Two JARs are needed:
- snowflake-jdbc: the JDBC driver itself, which handles authentication and SQL execution against Snowflake.
- spark-snowflake: the Spark data source that registers net.snowflake.spark.snowflake as a format, translating Spark read/write operations into JDBC calls.
In a real AWS Glue job, you supply these JARs through the --extra-jars argument so they’re available on the Spark classpath at runtime.
LocalStack Snowflake Emulator
LocalStack’s Snowflake emulator exposes a Snowflake-compatible endpoint at snowflake.localhost.localstack.cloud, which accepts SQL statements over the Snowflake wire protocol. This means you can run DDL, DML, and queries against it using standard Snowflake drivers and tools, including the JDBC driver that the Spark connector relies on.
The emulator starts automatically when you use the localstack/snowflake Docker image. You can seed it with SQL scripts by mounting them into LocalStack’s init hook directory (/etc/localstack/init/ready.d/), and they’ll execute once the emulator is ready.
Prerequisites
Before starting, make sure you have the following installed:
- LocalStack CLI with a valid LOCALSTACK_AUTH_TOKEN (you can get a trial license via the LocalStack signup page)
- Docker
- Terraform and tflocal (the LocalStack Terraform wrapper)
- awslocal (the LocalStack AWS CLI wrapper)
- wget (for downloading Snowflake JARs from Maven Central)
Step 1: Clone the sample repository
Start by cloning the repository that contains the Glue job script, Terraform configuration, and Snowflake init script:
git clone https://github.com/localstack-samples/localstack-snowflake-samples.git
cd localstack-snowflake-samples/glue-snowflake-integration
The project layout looks like this:
glue-snowflake-integration/
├── script/
│   └── job.py          # PySpark Glue ETL script
├── tf/
│   ├── main.tf         # Terraform config for IAM, S3, and Glue
│   └── versions.tf     # Provider version constraint
├── init.sf.sql         # Snowflake table + seed data
├── deploy.sh           # Automated deployment script
└── Makefile
Before starting any services, let’s look at what gets loaded into the Snowflake emulator. The file init.sf.sql creates a table and inserts five rows:
CREATE TABLE src_glue (id int, name varchar, status bool);
INSERT INTO src_glue (id, name, status) VALUES
    (1, 'Alice', TRUE),
    (2, 'Bob', FALSE),
    (3, 'Charlie', TRUE),
    (4, 'David', FALSE),
    (5, 'Eve', TRUE);
This is the table that the Glue job will query. The schema is intentionally simple: an integer ID, a name string, and a boolean status flag. When LocalStack starts, this script runs automatically because we mount it into the init hooks directory.
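Once the emulator from Step 2 is running, one way to confirm the seed data landed is to query the table directly. The sketch below assumes the snowflake-connector-python package is installed, which this tutorial does not otherwise require. The helper names are hypothetical, and the account value "test" is an assumption that follows the emulator's default test credentials.

```python
def emulator_connection_params(database="test", schema="public"):
    """Connection settings for the local emulator (assumed defaults)."""
    return {
        "user": "test",
        "password": "test",
        "account": "test",  # assumption: any placeholder account name
        "database": database,
        "schema": schema,
        "host": "snowflake.localhost.localstack.cloud",
    }


def fetch_rows(query, params=None):
    """Run one query against the emulator and return all result rows."""
    # Imported lazily so the module loads even without the connector installed.
    import snowflake.connector

    conn = snowflake.connector.connect(**(params or emulator_connection_params()))
    try:
        cur = conn.cursor()
        try:
            cur.execute(query)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()
```

Calling fetch_rows("SELECT id, name, status FROM src_glue ORDER BY id") should return the five seeded rows if the init script ran.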
Step 2: Start the LocalStack Snowflake emulator
Set your auth token and start LocalStack for Snowflake. The -v flag mounts the init script so it runs once the Snowflake emulator is ready:
export LOCALSTACK_AUTH_TOKEN=<your-auth-token>
DOCKER_FLAGS='-v ./init.sf.sql:/etc/localstack/init/ready.d/init.sf.sql' \
  localstack start --stack snowflake --detach
Wait for LocalStack to finish initializing:
localstack wait -t 120
Once you see Ready. in the logs (or the wait command returns), the Snowflake emulator is up and the src_glue table has been created and populated. You should see a line confirming the Snowflake extension is available:
INFO --- l.e.patterns.webapp : snowflake extension available at http://snowflake.localhost.localstack.cloud:4566
INFO --- l.pro.snowflake.extension : LocalStack Snowflake version: ...
Step 3: Provision the AWS infrastructure with Terraform
The Terraform configuration in tf/main.tf sets up three things: an IAM role for the Glue job, an S3 bucket for storing the job script and dependencies, and the Glue job definition itself.
3.1: IAM role and policy
The Glue job needs permission to access S3 (for its script and JARs), CloudWatch (for logging), and the Glue service itself. The Terraform config creates a role with a broad policy that covers these:
data "aws_iam_policy_document" "glue_execution_policy_document" {
  statement {
    sid = "GlueExecutionPolicy"
    actions = [
      "s3:*",
      "cloudwatch:*",
      "logs:*",
      "secretsmanager:*",
      "glue:*",
    ]
    resources = ["*"]
  }
}
In a production environment you’d scope these down, but for local development this keeps things simple.
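To make the policy concrete, the sketch below builds roughly the JSON document that the Terraform data source above renders, plus one possible narrower S3 statement of the kind you might use in production. Both function names are hypothetical, and the scoped statement (read-only access to the script/ and jars/ prefixes) is an illustrative assumption, not part of the sample repository.

```python
import json


def glue_execution_policy():
    """Approximate JSON form of the broad policy defined in tf/main.tf."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GlueExecutionPolicy",
                "Effect": "Allow",
                "Action": [
                    "s3:*",
                    "cloudwatch:*",
                    "logs:*",
                    "secretsmanager:*",
                    "glue:*",
                ],
                "Resource": "*",
            }
        ],
    }


def scoped_s3_statement(bucket_name):
    """Illustrative narrowing: read-only access to the job's own assets."""
    return {
        "Sid": "ReadGlueAssets",
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [
            f"arn:aws:s3:::{bucket_name}/script/*",
            f"arn:aws:s3:::{bucket_name}/jars/*",
        ],
    }


if __name__ == "__main__":
    print(json.dumps(glue_execution_policy(), indent=2))
```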
3.2: S3 bucket
A bucket with the prefix glue-assets holds the Glue script and the Snowflake JDBC/Spark connector JARs:
resource "aws_s3_bucket" "glue_assets" {
  bucket_prefix = "glue-assets"
}
3.3: Glue job
The job is configured as a glueetl type job running Glue 4.0 with Python 3. The --extra-jars argument tells Glue where to find the Snowflake connector JARs in S3:
resource "aws_glue_job" "glue_job" {
  name              = "glue-job"
  role_arn          = aws_iam_role.glue_execution_role.arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 2
  command {
    name            = "glueetl"
    python_version  = "3"
    script_location = "s3://${aws_s3_bucket.glue_assets.bucket}/script/job.py"
  }
  default_arguments = {
    "--class"                            = "GlueApp"
    "--enable-continuous-cloudwatch-log" = "true"
    "--extra-jars"                       = "s3://${aws_s3_bucket.glue_assets.bucket}/jars/snowflake-jdbc-3.20.0.jar,s3://${aws_s3_bucket.glue_assets.bucket}/jars/spark-snowflake_2.12-2.5.4-spark_2.4.jar"
  }
}
The --extra-jars paths must match the actual JAR filenames we’ll upload in the next step.
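Because a filename mismatch here only surfaces when the job runs, a small check can help. The sketch below builds a comma-separated --extra-jars value and reports which referenced JARs are missing from a local directory. The helper names and the jars prefix default are assumptions for illustration, not part of the sample repository.

```python
from pathlib import Path


def extra_jars_value(bucket, jar_names, prefix="jars"):
    """Build the comma-separated S3 URI list that --extra-jars expects."""
    return ",".join(f"s3://{bucket}/{prefix}/{name}" for name in jar_names)


def missing_jars(extra_jars, local_dir):
    """Return JAR filenames referenced in extra_jars but absent from local_dir."""
    # The filename is the last path segment of each S3 URI.
    expected = [uri.rsplit("/", 1)[-1] for uri in extra_jars.split(",") if uri]
    present = {path.name for path in Path(local_dir).glob("*.jar")}
    return [name for name in expected if name not in present]
```

For example, missing_jars(value, "jars") returning an empty list means every JAR named in the job definition exists locally and is ready to upload.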
3.4: Initialize and apply
Now initialize and apply the Terraform configuration:
cd tf
tflocal init
tflocal apply --auto-approve
You should see output confirming all five resources were created:
Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
Outputs:
bucket_name = "glue-assets20260401..."
job_name = "glue-job"
Capture the bucket name, since you’ll need it for the next step:
BUCKET_NAME=$(tflocal output -raw bucket_name)
cd ..
Step 4: Download and upload the Snowflake JARs
The Glue job needs two JARs from Maven Central: the Snowflake JDBC driver and the Spark Snowflake connector. Download them into a local jars/ directory:
mkdir -p jars
wget -q "https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.20.0/snowflake-jdbc-3.20.0.jar" \
  -P jars
wget -q "https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.12/2.5.4-spark_2.4/spark-snowflake_2.12-2.5.4-spark_2.4.jar" \
  -P jars
Upload them to the S3 bucket along with the Glue script:
awslocal s3 cp jars s3://$BUCKET_NAME/jars/ --recursive
awslocal s3 cp script/job.py s3://$BUCKET_NAME/script/job.py
Verify everything is in place:
awslocal s3 ls s3://$BUCKET_NAME/ --recursive
You should see three objects: the two JARs under jars/ and the script under script/:
2026-04-01 22:29:17   78671124 jars/snowflake-jdbc-3.20.0.jar
2026-04-01 22:29:17     565278 jars/spark-snowflake_2.12-2.5.4-spark_2.4.jar
2026-04-01 22:29:19       1373 script/job.py
Step 5: Run the Glue job and verify the results
With the infrastructure provisioned and the JARs uploaded, the Glue job is ready to run. We’ll first walk through the job script to understand what it does, then start the job and check the results through CloudWatch Logs.
5.1: Understand the Glue job script
Before running the job, let’s look at the key parts of script/job.py. The script sets up a GlueContext and Spark session, then imports the Snowflake Spark connector into the JVM and enables query pushdown:
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils \
    .enablePushdownSession(
        spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate()
    )
The connection options point to the LocalStack Snowflake emulator. The credentials (test/test) are the default emulator credentials:
sfOptions = {
    "sfURL": "https://snowflake.localhost.localstack.cloud",
    "sfUser": "test",
    "sfPassword": "test",
    "sfDatabase": "test",
    "sfSchema": "public",
    "sfWarehouse": "test",
    "application": "AWSGlue"
}
The job then reads from the src_glue table using spark.read.format(SNOWFLAKE_SOURCE_NAME), which pushes the query down to Snowflake over JDBC. If the read succeeds, it prints the DataFrame and commits the job. The exception handler catches JDBC communication errors that can occur in the emulator environment and logs them gracefully.
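The read step described above can be sketched as a small helper. This is an approximation of what the script does rather than a copy of it: the function name read_snowflake_table is hypothetical, and selecting the table through the connector's dbtable option is an assumption about how the script does it.

```python
# The data source name that the spark-snowflake JAR registers with Spark.
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"


def read_snowflake_table(spark, options, table):
    """Load a Snowflake table as a Spark DataFrame through the connector.

    `spark` is an active SparkSession and `options` is a dict of connector
    settings such as sfURL, sfUser, and sfPassword.
    """
    return (
        spark.read.format(SNOWFLAKE_SOURCE_NAME)
        .options(**options)
        .option("dbtable", table)  # assumption: table chosen via dbtable
        .load()
    )
```

Inside the job, read_snowflake_table(spark, sfOptions, "src_glue").show() would print the seeded rows to the job's log output.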
5.2: Start the Glue job
Start the job run and capture the job run ID:
JOB_RUN_ID=$(awslocal glue start-job-run \
  --job-name glue-job \
  --output text \
  --query 'JobRunId')
echo "Job Run ID: $JOB_RUN_ID"
The first run takes a few minutes because LocalStack needs to download and initialize the Spark runtime inside the container. Subsequent runs are faster since the runtime is cached.
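Since a run can take several minutes, a wait with an upper bound is useful when you script this from Python. The sketch below is one way to do it, with hypothetical helper names. It assumes boto3 is installed, that LocalStack listens on http://localhost:4566, and that the conventional dummy credentials and region are acceptable. It also treats TIMEOUT and ERROR as terminal states alongside SUCCEEDED, FAILED, and STOPPED.

```python
import time

# Glue job run states after which no further progress is expected.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(get_state, interval=10.0, timeout=900.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state or timeout elapses."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job run still {state} after {timeout} seconds")
        sleep(interval)


def glue_state_fetcher(job_name, run_id, endpoint_url="http://localhost:4566"):
    """Return a zero-argument callable that reads the run state via boto3."""
    import boto3  # imported lazily; only needed when talking to LocalStack

    glue = boto3.client(
        "glue",
        endpoint_url=endpoint_url,     # assumption: default LocalStack endpoint
        region_name="us-east-1",       # assumption: any region works locally
        aws_access_key_id="test",      # assumption: dummy local credentials
        aws_secret_access_key="test",
    )

    def get_state():
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        return response["JobRun"]["JobRunState"]

    return get_state
```

A call such as wait_for_job_run(glue_state_fetcher("glue-job", run_id)) returns the final state, or raises TimeoutError if the run never settles.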
Poll the job status until it finishes:
while true; do
  STATUS=$(awslocal glue get-job-run \
    --job-name glue-job \
    --run-id $JOB_RUN_ID \
    --query 'JobRun.JobRunState' \
    --output text)
  echo "Status: $STATUS"
  if [ "$STATUS" = "SUCCEEDED" ] || [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "STOPPED" ]; then
    break
  fi
  sleep 10
done
After a few minutes, you should see:
Status: RUNNING
Status: RUNNING
...
Status: SUCCEEDED
5.3: Verify the job through CloudWatch Logs
Glue jobs write their output to CloudWatch Logs. The log group for Glue v2 jobs is /aws-glue/jobs/logs-v2, and the log stream name matches the job run ID.
List the log streams:
awslocal logs describe-log-streams \
  --log-group-name "/aws-glue/jobs/logs-v2"
Then fetch the log events to see the full Spark execution trace:
awslocal logs get-log-events \
  --log-group-name "/aws-glue/jobs/logs-v2" \
  --log-stream-name "$JOB_RUN_ID" \
  --query 'events[*].message' \
  --output text
In the output, you’ll see the full Spark lifecycle: the SparkContext starting, the Snowflake JDBC session opening against snowflake.localhost.localstack.cloud, and the job shutting down cleanly. The key log lines to look for:
INFO SnowflakeConnectionV1: Initializing new connection
INFO SFSession: Opening session with server: https://snowflake.localhost.localstack.cloud:443/, account: snowflake, user: test, ...
This confirms the Spark Snowflake connector inside the Glue job reached the LocalStack Snowflake emulator over JDBC.
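If you want to automate this check, for example in a CI smoke test, the log messages can be scanned for those marker lines. The sketch below is one way to do that. The function names are hypothetical, and the marker strings are taken from the log excerpt above, so they may differ with other connector versions.

```python
# Substrings copied from the log lines shown in this tutorial.
EXPECTED_MARKERS = (
    "SnowflakeConnectionV1: Initializing new connection",
    "SFSession: Opening session with server",
)

EMULATOR_HOST = "snowflake.localhost.localstack.cloud"


def missing_markers(messages, markers=EXPECTED_MARKERS):
    """Return the markers that appear in none of the log messages."""
    text = "\n".join(messages)
    return [marker for marker in markers if marker not in text]


def reached_emulator(messages, host=EMULATOR_HOST):
    """True if a session-opening log line mentions the emulator host."""
    return any(
        "Opening session with server" in line and host in line
        for line in messages
    )
```

The messages argument is a list of log message strings, such as the message fields returned by the get-log-events call above.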
Conclusion
In this tutorial, we ran a full AWS Glue ETL pipeline that integrates with Snowflake, entirely on a local machine. The setup here is useful for a few things beyond just running the demo. You can iterate on Glue job logic without waiting for AWS Spark cluster provisioning, test schema changes to your Snowflake tables locally before applying them in staging, and validate that your Glue job’s Snowflake connection parameters are correct without burning compute credits.
You can find the full source code for this sample in the localstack-snowflake-samples repository.