# Structured Streaming Using Spark and Scala
Apache Spark has revolutionized large-scale data processing, and Structured Streaming is one of its most powerful features for real-time data analytics. Unlike the older DStream API, Structured Streaming offers a high-level, SQL-based approach to streaming data, making it easier and more efficient to process live data streams.
In this tutorial, we’ll explore Structured Streaming with Spark and Scala, including setup, code examples, and best practices.
## What Is Structured Streaming?

Structured Streaming is a stream processing engine built on Spark SQL. It treats real-time data as a continuously growing table and lets developers query it using the familiar DataFrame and Dataset APIs in Scala.
Key features:

- Event-time processing with watermarking.
- Low latency (sub-second in continuous mode); see the trigger sketch after this list.
- Fault tolerance using checkpoints.
- Seamless integration with Kafka, HDFS, S3, JDBC, and more.
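To make the latency point concrete, here is a minimal sketch of the two trigger modes. It assumes a `SparkSession` named `spark` (created in the example below) and uses the built-in `rate` source to generate synthetic rows, so no external system is required:

```scala
import org.apache.spark.sql.streaming.Trigger

// Built-in test source that emits `rowsPerSecond` rows with a timestamp column.
val rateDf = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Default micro-batch engine: kick off a new batch roughly every second.
val microBatch = rateDf.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

// Experimental continuous engine: roughly millisecond latency. The duration
// here is the checkpoint interval, and only map-like queries are supported.
val continuous = rateDf.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```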
## Dependencies

Add the following dependencies to your `build.sbt` file (Spark 3.5.0 is used here; keep the Kafka connector version in sync with your Spark version):

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.5.0"
)
```
## Example: Streaming Word Count from Kafka

The following program subscribes to a Kafka topic, maintains a running word count, and prints the results to the console:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("StructuredStreamingExample")
  .master("local[*]")
  .getOrCreate()

// Needed for .as[String] below (the session's implicit encoders).
import spark.implicits._

spark.sparkContext.setLogLevel("WARN")

// Subscribe to a Kafka topic on a local broker.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .load()

// Kafka delivers keys and values as binary; cast the value to a string.
val messages = kafkaStream.selectExpr("CAST(value AS STRING)").as[String]

// Split each message into words and count occurrences.
val wordCounts = messages
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Print the full, updated counts table to the console after every batch.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```
✅ This program connects to Kafka, reads streaming messages, counts words in real time, and prints the running counts to the console.
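If you don't have a Kafka broker running locally, the built-in socket source (intended for testing only) lets you smoke-test the same pipeline. This sketch assumes something like `nc -lk 9999` is feeding text lines on port 9999, and reuses the `spark` session and implicits from above:

```scala
// Testing-only source: read text lines from a local TCP socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Same word count as the Kafka version (requires `import spark.implicits._`).
val socketCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

socketCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```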
## Best Practices

- Use checkpointing (the `checkpointLocation` option) to ensure fault tolerance; see the sketch after this list.
- Choose the right output mode (`append`, `update`, or `complete`) based on your use case.
- Tune the trigger interval to balance latency and throughput (see the trigger sketch earlier).
- Handle late data with watermarking, as shown below.
- Integrate with sinks like Cassandra, MySQL, HDFS, or cloud storage for persistence; the `foreachBatch` sketch below shows the JDBC case.
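To make the checkpointing, output-mode, and watermarking advice concrete, here is a hedged sketch. The `events` DataFrame is a stand-in for any streaming source with an event-time column `eventTime` and a `word` column; the checkpoint path is a placeholder:

```scala
import org.apache.spark.sql.functions.{col, window}

// Sketch only: `events` stands in for a streaming DataFrame with columns
// `eventTime: Timestamp` and `word: String`.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")            // tolerate 10 min of late data
  .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
  .count()

windowedCounts.writeStream
  .outputMode("update")                                       // emit only rows that changed
  .option("checkpointLocation", "/tmp/checkpoints/wordcount") // enables recovery after failure
  .format("console")
  .start()
```

For sink integration, `foreachBatch` lets you reuse any batch writer on each micro-batch; this sketch targets MySQL through the batch JDBC writer. The URL, table, and credentials are placeholders, and a JDBC driver must be on the classpath:

```scala
import org.apache.spark.sql.DataFrame

// Write each micro-batch with the batch JDBC writer. Note that appending
// through foreachBatch gives at-least-once, not exactly-once, semantics.
def writeToMySql(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/analytics") // placeholder
    .option("dbtable", "word_counts")                       // placeholder
    .option("user", "spark")
    .option("password", "secret")
    .mode("append")
    .save()
}

windowedCounts.writeStream
  .outputMode("update")
  .option("checkpointLocation", "/tmp/checkpoints/jdbc-sink")
  .foreachBatch(writeToMySql _)
  .start()
```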
## Use Cases

- Real-time dashboards for business analytics.
- Fraud detection in financial transactions.
- Monitoring IoT sensor data streams.
- Clickstream analysis for web applications.
## Conclusion

Structured Streaming in Spark with Scala offers a scalable, fault-tolerant, and easy-to-use framework for real-time data processing. By leveraging DataFrame/Dataset APIs and Kafka integration, developers can build production-grade streaming pipelines with minimal effort.