Run AWS Glue ETL Jobs with Snowflake Locally Using LocalStack

Learn how to run Glue ETL jobs that connect to Snowflake entirely on your local machine using LocalStack. You'll set up the infrastructure with Terraform, write a PySpark Glue job that reads from a local Snowflake emulator via JDBC, and verify it all through CloudWatch Logs.

Introduction

AWS Glue and Snowflake are a common pairing for ETL workloads. Glue handles the orchestration and Spark execution, and Snowflake serves as the data warehouse. The usual development cycle for this combination involves deploying Glue jobs to AWS, waiting for Spark to spin up, debugging through CloudWatch, and paying for Snowflake compute credits, all before you know if a query change even works.

LocalStack’s Snowflake emulator runs a local Snowflake-compatible endpoint inside the same container as your emulated AWS services, so a Glue job running locally can connect to Snowflake over JDBC without leaving your machine. No AWS credentials, no Snowflake account, no waiting on cloud infrastructure.

In this tutorial, we’ll build a working Glue ETL pipeline that reads from a Snowflake table using the Spark Snowflake connector. We’ll provision all the AWS resources (S3, IAM, Glue) with Terraform via tflocal, seed the Snowflake table with an init script, and run the Glue job end to end. By the end, you’ll have a local setup you can use to iterate on Glue-Snowflake jobs.

Key Concepts

This tutorial combines three pieces:

  1. AWS Glue for running Spark-based ETL jobs;
  2. The Snowflake Spark connector for reading data over JDBC;
  3. LocalStack’s Snowflake emulator for providing a local Snowflake endpoint.

Here’s a quick overview of each.

AWS Glue and PySpark

AWS Glue is a managed ETL service built on Apache Spark. When you create a Glue job, you write a Python script (PySpark) that Glue runs on a Spark cluster. The Glue runtime provides its own context (GlueContext) on top of Spark’s SparkContext, handling bookmarks, logging, and job lifecycle.

For this tutorial, the Glue job uses the glueetl command type with Glue version 4.0 and Python 3. The job doesn’t transform or write data anywhere. It reads from Snowflake and prints the result. That keeps the focus on the integration plumbing rather than the ETL logic itself.

Snowflake Spark Connector

The Snowflake Spark connector is a library that lets Spark read from and write to Snowflake. Under the hood, it uses the Snowflake JDBC driver to establish a connection, push queries down to Snowflake, and pull results back as Spark DataFrames. Two JARs are needed:

  • snowflake-jdbc: the JDBC driver itself, which handles authentication and SQL execution against Snowflake.
  • spark-snowflake: the Spark data source that registers net.snowflake.spark.snowflake as a format, translating Spark read/write operations into JDBC calls.

In a real AWS Glue job, you supply these JARs through the --extra-jars argument so they’re available on the Spark classpath at runtime.
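
As a rough illustration of what that argument value looks like, the sketch below joins S3 URIs into the comma-separated string that --extra-jars takes. The helper name build_extra_jars and the bucket name glue-assets-example are hypothetical; the jars/ prefix and the JAR filenames follow the layout used later in this tutorial.

```python
def build_extra_jars(bucket: str, jar_names: list[str], prefix: str = "jars") -> str:
    """Join one S3 URI per JAR into a single comma-separated string."""
    return ",".join(f"s3://{bucket}/{prefix}/{name}" for name in jar_names)


if __name__ == "__main__":
    # "glue-assets-example" is a placeholder, not a bucket created by this tutorial.
    print(build_extra_jars(
        "glue-assets-example",
        ["snowflake-jdbc-3.20.0.jar", "spark-snowflake_2.12-2.5.4-spark_2.4.jar"],
    ))
```

Generating the value from a list of filenames keeps the job definition and the uploaded files from drifting apart when a connector version changes.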

LocalStack Snowflake Emulator

LocalStack’s Snowflake emulator exposes a Snowflake-compatible endpoint at snowflake.localhost.localstack.cloud, which accepts SQL statements over the Snowflake wire protocol. This means you can run DDL, DML, and queries against it using standard Snowflake drivers and tools, including the JDBC driver that the Spark connector relies on.

The emulator starts automatically when you use the localstack/snowflake Docker image. You can seed it with SQL scripts by mounting them into LocalStack’s init hook directory (/etc/localstack/init/ready.d/), and they’ll execute once the emulator is ready.
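
Because the endpoint speaks to standard Snowflake drivers, you can also query it outside of Glue. The sketch below assumes the snowflake-connector-python package is installed and that the emulator accepts test as the account name; both are assumptions rather than something this tutorial sets up, and the helper names are made up for illustration. The user, password, database, and schema values mirror the ones the Glue job uses later.

```python
EMULATOR_HOST = "snowflake.localhost.localstack.cloud"


def emulator_connection_params(database: str = "test", schema: str = "public") -> dict:
    """Keyword arguments for snowflake.connector.connect() against the emulator."""
    return {
        "user": "test",
        "password": "test",
        "account": "test",  # assumed value; the emulator does not need a real account
        "database": database,
        "schema": schema,
        "host": EMULATOR_HOST,
    }


def fetch_rows(sql: str) -> list:
    """Run one statement against the emulator and return all result rows."""
    # Imported lazily so the parameter helper works without the package installed.
    import snowflake.connector

    conn = snowflake.connector.connect(**emulator_connection_params())
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Calling fetch_rows("SELECT * FROM src_glue") after the init script has run is one way to check the seed data before involving Spark at all.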

Prerequisites

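Before starting, make sure you have the following available:

  • Docker, to run the LocalStack container
  • The LocalStack CLI and a LocalStack auth token
  • Terraform and the tflocal wrapper
  • The AWS CLI and the awslocal wrapper
  • git and wget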

Step 1: Clone the sample repository

Start by cloning the repository that contains the Glue job script, Terraform configuration, and Snowflake init script:

git clone https://github.com/localstack-samples/localstack-snowflake-samples.git
cd localstack-snowflake-samples/glue-snowflake-integration

The project layout looks like this:

glue-snowflake-integration/
├── script/
│   └── job.py          # PySpark Glue ETL script
├── tf/
│   ├── main.tf         # Terraform config for IAM, S3, and Glue
│   └── versions.tf     # Provider version constraint
├── init.sf.sql         # Snowflake table + seed data
├── deploy.sh           # Automated deployment script
└── Makefile

Before starting any services, let’s look at what gets loaded into the Snowflake emulator. The file init.sf.sql creates a table and inserts five rows:

CREATE TABLE src_glue (id int, name varchar, status bool);
INSERT INTO src_glue (id, name, status) VALUES
(1, 'Alice', TRUE),
(2, 'Bob', FALSE),
(3, 'Charlie', TRUE),
(4, 'David', FALSE),
(5, 'Eve', TRUE);

This is the table that the Glue job will query. The schema is intentionally simple: an integer ID, a name string, and a boolean status flag. When LocalStack starts, this script runs automatically because we mount it into the init hooks directory.

Step 2: Start the LocalStack for Snowflake emulator

Set your auth token and start LocalStack for Snowflake. The -v flag mounts the init script so it runs once the Snowflake emulator is ready:

export LOCALSTACK_AUTH_TOKEN=<your-auth-token>
DOCKER_FLAGS='-v ./init.sf.sql:/etc/localstack/init/ready.d/init.sf.sql' \
localstack start --stack snowflake --detach

Wait for LocalStack to finish initializing:

localstack wait -t 120

Once you see Ready. in the logs (or the wait command returns), the Snowflake emulator is up and the src_glue table has been created and populated. The logs should also contain lines confirming that the Snowflake extension is available:

INFO --- l.e.patterns.webapp : snowflake extension available at http://snowflake.localhost.localstack.cloud:4566
INFO --- l.pro.snowflake.extension : LocalStack Snowflake version: ...
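
If you would rather check readiness from a script than read the logs, LocalStack serves a health document over HTTP. The sketch below assumes the default edge port 4566 on localhost and assumes that healthy services are reported with the status available or running; the function names are made up for illustration.

```python
import json
import urllib.request

# Assumes LocalStack is listening on the default edge port.
HEALTH_URL = "http://localhost:4566/_localstack/health"


def unavailable_services(health: dict, required: list[str]) -> list[str]:
    """Return the required services that are not reported as available or running."""
    services = health.get("services", {})
    return [name for name in required if services.get(name) not in ("available", "running")]


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the health document from a running LocalStack instance."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # An empty list means S3, IAM, and Glue are all ready for the next step.
    print(unavailable_services(fetch_health(), ["s3", "iam", "glue"]))
```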

Step 3: Provision the AWS infrastructure with Terraform

The Terraform configuration in tf/main.tf sets up three things: an IAM role for the Glue job, an S3 bucket for storing the job script and dependencies, and the Glue job definition itself.

3.1: IAM role and policy

The Glue job needs permission to access S3 (for its script and JARs), CloudWatch (for logging), and the Glue service itself. The Terraform config creates a role with a broad policy that covers these:

data "aws_iam_policy_document" "glue_execution_policy_document" {
  statement {
    sid = "GlueExecutionPolicy"
    actions = [
      "s3:*",
      "cloudwatch:*",
      "logs:*",
      "secretsmanager:*",
      "glue:*",
    ]
    resources = ["*"]
  }
}

In a production environment you’d scope these down, but for local development this keeps things simple.

3.2: S3 bucket

A bucket with the prefix glue-assets holds the Glue script and the Snowflake JDBC/Spark connector JARs:

resource "aws_s3_bucket" "glue_assets" {
  bucket_prefix = "glue-assets"
}

3.3: Glue job

The job is configured as a glueetl type job running Glue 4.0 with Python 3. The --extra-jars argument tells Glue where to find the Snowflake connector JARs in S3:

resource "aws_glue_job" "glue_job" {
  name              = "glue-job"
  role_arn          = aws_iam_role.glue_execution_role.arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 2

  command {
    name            = "glueetl"
    python_version  = "3"
    script_location = "s3://${aws_s3_bucket.glue_assets.bucket}/script/job.py"
  }

  default_arguments = {
    "--class"                            = "GlueApp"
    "--enable-continuous-cloudwatch-log" = "true"
    "--extra-jars"                       = "s3://${aws_s3_bucket.glue_assets.bucket}/jars/snowflake-jdbc-3.20.0.jar,s3://${aws_s3_bucket.glue_assets.bucket}/jars/spark-snowflake_2.12-2.5.4-spark_2.4.jar"
  }
}

The --extra-jars paths must match the actual JAR filenames we’ll upload in the next step.
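
A mismatch between these paths and the uploaded objects is easy to miss, so it can help to check it mechanically. The sketch below is one possible check, not part of the sample repository: the function name missing_jar_keys is hypothetical, and it expects you to pass in the set of object keys that actually exist in the bucket.

```python
def missing_jar_keys(extra_jars: str, bucket: str, uploaded_keys: set[str]) -> list[str]:
    """Return entries from an --extra-jars value that are absent from uploaded_keys."""
    prefix = f"s3://{bucket}/"
    missing = []
    for uri in extra_jars.split(","):
        uri = uri.strip()
        if not uri.startswith(prefix):
            # A JAR in another bucket cannot be checked against this listing.
            missing.append(uri)
            continue
        key = uri[len(prefix):]
        if key not in uploaded_keys:
            missing.append(key)
    return missing


if __name__ == "__main__":
    # Placeholder bucket and keys, for illustration only.
    jars = "s3://example-bucket/jars/a.jar,s3://example-bucket/jars/b.jar"
    print(missing_jar_keys(jars, "example-bucket", {"jars/a.jar"}))
```

An empty result means every JAR the job definition references is present under the expected key.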

3.4: Initialize and apply

Now initialize and apply the Terraform configuration:

cd tf
tflocal init
tflocal apply --auto-approve

You should see output confirming all five resources were created:

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
Outputs:
bucket_name = "glue-assets20260401..."
job_name = "glue-job"

Capture the bucket name, since you’ll need it for the next step:

BUCKET_NAME=$(tflocal output -raw bucket_name)
cd ..

Step 4: Download and upload the Snowflake JARs

The Glue job needs two JARs from Maven Central: the Snowflake JDBC driver and the Spark Snowflake connector. Download them into a local jars/ directory:

mkdir -p jars
wget -q "https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.20.0/snowflake-jdbc-3.20.0.jar" \
  -P jars
wget -q "https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.12/2.5.4-spark_2.4/spark-snowflake_2.12-2.5.4-spark_2.4.jar" \
  -P jars
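
The two download URLs follow the standard Maven Central layout, so they can be derived from the artifact coordinates instead of being typed by hand. The sketch below shows that derivation; the helper names are hypothetical, and download_jar is an optional stand-in for wget that uses only the Python standard library.

```python
import urllib.request
from pathlib import Path

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the Maven Central URL for a JAR from its coordinates."""
    group_path = group_id.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


def download_jar(url: str, dest_dir: str = "jars") -> Path:
    """Download one JAR into dest_dir and return the local path."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, target)
    return target


if __name__ == "__main__":
    print(maven_jar_url("net.snowflake", "snowflake-jdbc", "3.20.0"))
    print(maven_jar_url("net.snowflake", "spark-snowflake_2.12", "2.5.4-spark_2.4"))
```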

Upload them to the S3 bucket along with the Glue script:

awslocal s3 cp jars s3://$BUCKET_NAME/jars/ --recursive
awslocal s3 cp script/job.py s3://$BUCKET_NAME/script/job.py

Verify everything is in place:

awslocal s3 ls s3://$BUCKET_NAME/ --recursive

You should see three objects: the two JARs under jars/ and the script under script/:

2026-04-01 22:29:17 78671124 jars/snowflake-jdbc-3.20.0.jar
2026-04-01 22:29:17 565278 jars/spark-snowflake_2.12-2.5.4-spark_2.4.jar
2026-04-01 22:29:19 1373 script/job.py

Step 5: Run the Glue job and verify the results

With the infrastructure provisioned and the JARs uploaded, we can now run the Glue job. Before that, let’s walk through the job script to understand what it does, then start the job and check the results through CloudWatch Logs.

5.1: Understand the Glue job script

The key parts of script/job.py are worth a closer look. The script sets up a GlueContext and Spark session, then imports the Snowflake Spark connector into the JVM and enables query pushdown:

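# Excerpt: java_import comes from py4j.java_gateway, and
# SNOWFLAKE_SOURCE_NAME is "net.snowflake.spark.snowflake".
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils \
    .enablePushdownSession(
        spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate()
    )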

The connection options point to the LocalStack Snowflake emulator. The credentials (test/test) are the default emulator credentials:

sfOptions = {
    "sfURL": "https://snowflake.localhost.localstack.cloud",
    "sfUser": "test",
    "sfPassword": "test",
    "sfDatabase": "test",
    "sfSchema": "public",
    "sfWarehouse": "test",
    "application": "AWSGlue",
}

The job then reads from the src_glue table using spark.read.format(SNOWFLAKE_SOURCE_NAME), which pushes the query down to Snowflake over JDBC. If the read succeeds, it prints the DataFrame and commits the job. The exception handler catches JDBC communication errors that can occur in the emulator environment and logs them gracefully.
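
The read step described above might look roughly like the following. This is a sketch written for this walkthrough, not a copy of script/job.py: the function names are hypothetical, the broad except clause stands in for whatever error type the real script catches, and spark is expected to be the session created earlier in the job.

```python
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"


def read_table(spark, options: dict, table: str = "src_glue"):
    """Load a Snowflake table into a DataFrame through the Spark connector."""
    return (
        spark.read.format(SNOWFLAKE_SOURCE_NAME)
        .options(**options)
        .option("dbtable", table)
        .load()
    )


def read_and_report(spark, options: dict, log=print):
    """Read the table and print it; log connector errors instead of raising."""
    try:
        df = read_table(spark, options)
        df.show()
        return df
    except Exception as exc:  # the real script may catch a narrower error type
        log(f"Snowflake read failed: {exc}")
        return None
```

Passing the session and options in as arguments keeps the read logic testable without a running Spark cluster.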

5.2: Start the Glue job

Start the job run and capture the job run ID:

JOB_RUN_ID=$(awslocal glue start-job-run \
  --job-name glue-job \
  --output text \
  --query 'JobRunId')
echo "Job Run ID: $JOB_RUN_ID"

The first run takes a few minutes because LocalStack needs to download and initialize the Spark runtime inside the container. Subsequent runs are faster since the runtime is cached.

Poll the job status until it finishes:

while true; do
  STATUS=$(awslocal glue get-job-run \
    --job-name glue-job \
    --run-id "$JOB_RUN_ID" \
    --query 'JobRun.JobRunState' \
    --output text)
  echo "Status: $STATUS"
  if [ "$STATUS" = "SUCCEEDED" ] || [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "STOPPED" ]; then
    break
  fi
  sleep 10
done

After a few minutes, you should see:

Status: RUNNING
Status: RUNNING
...
Status: SUCCEEDED

5.3: Verify the job through CloudWatch Logs

Glue jobs write their output to CloudWatch Logs. The log group for Glue v2 jobs is /aws-glue/jobs/logs-v2, and the log stream name matches the job run ID.

List the log streams:

awslocal logs describe-log-streams \
  --log-group-name "/aws-glue/jobs/logs-v2"

Then fetch the log events to see the full Spark execution trace:

awslocal logs get-log-events \
  --log-group-name "/aws-glue/jobs/logs-v2" \
  --log-stream-name "$JOB_RUN_ID" \
  --query 'events[*].message' \
  --output text

In the output, you’ll see the full Spark lifecycle: the SparkContext starting, the Snowflake JDBC session opening against snowflake.localhost.localstack.cloud, and the job shutting down cleanly. The key log lines to look for:

INFO SnowflakeConnectionV1: Initializing new connection
INFO SFSession: Opening session with server: https://snowflake.localhost.localstack.cloud:443/,
account: snowflake, user: test, ...

This confirms the Spark Snowflake connector inside the Glue job reached the LocalStack Snowflake emulator over JDBC.
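
To turn that visual check into an automated one, you can scan the log messages for the session line. The sketch below is one way to do it, assuming boto3 is installed and LocalStack listens on http://localhost:4566 with dummy test credentials; the function names are hypothetical, and fetch_messages reads only the first page of events, which may not be enough for a long log stream.

```python
EMULATOR_HOST = "snowflake.localhost.localstack.cloud"
LOG_GROUP = "/aws-glue/jobs/logs-v2"


def reached_emulator(messages, host: str = EMULATOR_HOST) -> bool:
    """Return True if any message shows a Snowflake session opening against host."""
    marker = "Opening session with server"
    return any(marker in message and host in message for message in messages)


def fetch_messages(run_id: str, endpoint_url: str = "http://localhost:4566") -> list:
    """Fetch the first page of log messages for a job run from LocalStack."""
    import boto3  # imported lazily; only needed when talking to LocalStack

    logs = boto3.client(
        "logs",
        endpoint_url=endpoint_url,
        region_name="us-east-1",        # assumed region
        aws_access_key_id="test",       # dummy credentials for local use
        aws_secret_access_key="test",
    )
    response = logs.get_log_events(
        logGroupName=LOG_GROUP, logStreamName=run_id, startFromHead=True
    )
    return [event["message"] for event in response["events"]]
```

A check such as reached_emulator(fetch_messages(run_id)) can then serve as a pass/fail signal in a local test script.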

Conclusion

In this tutorial, we ran a full AWS Glue ETL pipeline that integrates with Snowflake, entirely on a local machine. The setup here is useful for a few things beyond just running the demo. You can iterate on Glue job logic without waiting for AWS Spark cluster provisioning, test schema changes to your Snowflake tables locally before applying them in staging, and validate that your Glue job’s Snowflake connection parameters are correct without burning compute credits.

You can find the full source code for this sample in the localstack-snowflake-samples repository.

About the Author

Harsh Mishra
Engineer at LocalStack

Harsh Mishra is an Engineer at LocalStack and an AWS Community Builder. Harsh has previously worked at HackerRank, Red Hat, and Quansight, and specializes in DevOps, Platform Engineering, and CI/CD pipelines.
