What is StreamSets Transformer for Spark?

Transformer for Spark lets users create low-code or no-code data pipelines that execute natively on Spark. Supported environments include Databricks, EMR, and HDInsight.

What is a StreamSets Transformer for Spark pipeline?

All data pipelines for all of our engines, including Transformer for Spark, are essentially data flows: they take data from a source to a destination, often applying transformations along the way. Data pipelines can be used to power machine learning, advanced analytics, business intelligence, and other key insights.
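As a rough illustration of the "data flow" idea, the sketch below models a pipeline in plain Python as a source, a chain of transformations, and a destination. It uses no StreamSets or Spark API; the names `origin`, `drop_inactive`, `to_upper_name`, and `run_pipeline`, along with the sample records, are hypothetical and exist only for this example.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Transform = Callable[[Iterable[Record]], Iterator[Record]]


def origin() -> Iterator[Record]:
    """Hypothetical source: yields a few in-memory records."""
    yield {"name": "ada", "active": True}
    yield {"name": "bob", "active": False}
    yield {"name": "cy", "active": True}


def drop_inactive(records: Iterable[Record]) -> Iterator[Record]:
    """Transformation: keep only records flagged as active."""
    return (r for r in records if r["active"])


def to_upper_name(records: Iterable[Record]) -> Iterator[Record]:
    """Transformation: upper-case the name field, leaving other fields as-is."""
    return ({**r, "name": r["name"].upper()} for r in records)


def run_pipeline(source: Iterable[Record],
                 transforms: list,
                 destination: list) -> list:
    """Chain each transformation onto the flow, then write to the destination."""
    flow = source
    for transform in transforms:
        flow = transform(flow)
    destination.extend(flow)
    return destination


result = run_pipeline(origin(), [drop_inactive, to_upper_name], [])
print(result)
```

In a visual pipeline designer, each function here would correspond to a stage on the canvas, and the order of the list would correspond to how the stages are connected.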

Is StreamSets an ETL tool?

StreamSets acts as an ETL tool, though it is a complete end-to-end data integration platform. It performs ETL, ELT and data transformations such as joins, aggregates, and unions directly on Apache Spark and Snowflake platforms.
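For readers unfamiliar with the three transformations named above, here is a minimal plain-Python sketch of a union, a join, and an aggregate over small in-memory lists. It is only a conceptual illustration: it does not call StreamSets, Spark, or Snowflake, and the datasets and field names (`orders_eu`, `orders_us`, `customers`, `customer_id`, `amount`) are invented for the example.

```python
from collections import defaultdict

# Hypothetical input datasets.
orders_eu = [{"customer_id": 1, "amount": 20.0},
             {"customer_id": 2, "amount": 5.0}]
orders_us = [{"customer_id": 1, "amount": 10.0}]
customers = [{"customer_id": 1, "name": "Ada"},
             {"customer_id": 2, "name": "Bob"}]

# Union: combine two datasets that share the same fields.
all_orders = orders_eu + orders_us

# Join (inner): attach the customer name to each order by customer_id.
names_by_id = {c["customer_id"]: c["name"] for c in customers}
joined = [
    {**order, "name": names_by_id[order["customer_id"]]}
    for order in all_orders
    if order["customer_id"] in names_by_id
]

# Aggregate: total order amount per customer name.
totals = defaultdict(float)
for row in joined:
    totals[row["name"]] += row["amount"]
totals = dict(totals)

print(totals)
```

On a Spark cluster the same three operations would run over partitioned data rather than Python lists, but the shape of the logic is the same.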

How do I create an ETL pipeline in Spark with StreamSets?

Transformer for Spark, StreamSets’ Spark engine, acts as a Spark client that launches distributed Spark applications. Transformer passes the pipeline definition to Spark, and Spark runs it just as it would any other application, distributing the processing across nodes in the cluster. You can find more information on how to get started in the StreamSets documentation for Transformer.
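The idea of handing a pipeline definition to an engine that spreads the work across workers can be sketched, by analogy only, in plain Python. The example below is not how Transformer or Spark is implemented: threads in one process stand in for cluster nodes, and the `pipeline_definition` format, `apply_stages`, `split`, and `run` are all hypothetical names made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, simplified pipeline definition: an ordered list of stages.
pipeline_definition = {
    "stages": [
        {"op": "multiply", "by": 2},
        {"op": "add", "value": 1},
    ]
}


def apply_stages(partition, definition):
    """Run every stage, in order, on each value in one partition."""
    out = []
    for value in partition:
        for stage in definition["stages"]:
            if stage["op"] == "multiply":
                value *= stage["by"]
            elif stage["op"] == "add":
                value += stage["value"]
        out.append(value)
    return out


def split(data, n):
    """Split data into at most n contiguous partitions."""
    size = max(1, -(-len(data) // n))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run(data, definition, workers=3):
    """Send the same definition to each worker and combine results in order."""
    partitions = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: apply_stages(p, definition), partitions)
        return [value for part in results for value in part]


print(run(list(range(6)), pipeline_definition))
```

The point of the analogy is that the definition is written once and applied identically to every partition. A real Spark cluster additionally handles scheduling, data shuffles, and recovery from node failures, none of which is modeled here.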

How do I install StreamSets Transformer for Spark?

Installation information can be found in the Transformer for Spark documentation.

The StreamSets and webMethods platforms have now been acquired by IBM


StreamSets

Apache Spark Transformer

Configure and manage your ETL pipelines on Spark without hand coding.

Modern ETL Pipelines without the complexity

Turn unlimited data into insights in minutes with StreamSets Transformer for Spark. StreamSets Transformer runs on any Apache Spark environment (Databricks, AWS EMR, Google Cloud Dataproc, and YARN) on premises and across clouds. StreamSets Transformer for Spark is a data pipeline engine designed for any developer or data engineer to build and manage ETL and ML pipelines that execute on Spark.

Create pipelines for performing ETL and machine learning operations using an intent-driven visual design tool
Troubleshoot with unparalleled visibility into the execution of Spark applications
Run on any major Spark distribution and switch platforms without redesign

Keep running, anywhere

Run Apache Spark anywhere now and in the future as your needs evolve.

Operationalize your data transformations

Build and manage ETL and ML pipelines that execute on Spark

Put powerful, native ETL at the fingertips of any data engineer. Use a simple, drag-and-drop UI to create highly instrumented pipelines for performing ETL, stream processing, and machine learning operations. StreamSets Platform helps your team accelerate your data projects. Easily operationalize code and automate critical Spark operations through a central platform.

Run on multiple Spark platforms

Transformer Engines are designed to run on all major Spark distributions for maximum flexibility. You can natively execute on EMR, HDInsight, and Databricks platforms. Run your development and production projects on multiple Spark platforms or support different business unit needs from a single tool without rework.

See What Changed and Respond Easily

Full visibility and unmatched resiliency in your pipelines mean you can stop hunting through log files for errors when change happens. Transformer pipelines are instrumented to provide deep visibility into Spark execution so you can troubleshoot at the pipeline level and at each stage in the pipeline. Transformer offers the enterprise features and productivity of legacy ETL tools, while revealing the full power and flexibility of Apache Spark.

The StreamSets Data Integration Platform

Build smart data pipelines in minutes and deploy across hybrid and multi-cloud platforms from a single login.

Awards and Recognition

Data engineers gain efficiencies with StreamSets

★★★★★ | 8.01.23

"The best feature of StreamSets is its intuitive visual interface, allowing us to effortlessly design, monitor, and manage data pipelines without the need for complex coding. This has significantly reduced our development time and made the process highly accessible to both technical and non-technical team members."

- Mili M., Senior System Analyst | Mid-market (51-1000 employees)

★★★★★ | 8.03.23

"StreamSets has lot of out of box features to use for data pipelines and connect AWS Kinesis, DB or Kafka and send to HDFS & Hive."

- Sanath V. | Enterprise (> 1000 employees)

Frequently asked questions

Can I still run Python code on Spark with StreamSets?

Yes. Transformer includes a PySpark processor for running custom Python code within a pipeline. StreamSets Transformer runs on any Apache Spark environment (Databricks, AWS EMR, Google Cloud Dataproc, and YARN) on premises and across clouds. StreamSets Transformer for Spark is a data pipeline engine designed for any developer or data engineer to build and manage ETL and ML pipelines that execute on Spark.

Research Report

The Business Value of Data Engineering

Explore the pivotal role of data engineering in driving business value and innovation. Dive into our research on trends, challenges, and strategies for 2024.


White paper

The Data Integration Advantage: Building a Foundation for Scalable AI

Discover how modern data integration is key to scaling AI initiatives. Learn strategies for overcoming AI challenges and driving enterprise success.


eBook

Five Principles for Agile Data & Operational Analytics

Master the five data principles essential for powering effective operational analytics. Transform your data strategy for agility and insight.


Are you ready to unlock your data?

Resilient data pipelines help you integrate your data, without giving up control, to power your cloud analytics and digital innovation.
