Structured Streaming Using Spark and Scala

8/17/2025

Apache Spark has revolutionized large-scale data processing, and Structured Streaming is one of its most powerful features for real-time data analytics. Unlike the older DStream API, Structured Streaming offers a high-level, SQL-based approach to streaming data, making it easier and more efficient to process live data streams.

In this tutorial, we’ll explore Structured Streaming with Spark and Scala, including setup, code examples, and best practices.


What is Structured Streaming?

Structured Streaming is a stream processing engine built on Spark SQL. It treats real-time data as a continuously growing table and allows developers to query it using familiar DataFrame and Dataset APIs in Scala.

Key features:

  • Event-time processing with watermarking.

  • Low latency: micro-batch mode typically delivers roughly 100 ms end-to-end, and the experimental continuous processing mode can reach about 1 ms.

  • Fault tolerance using checkpoints.

  • Seamless integration with Kafka, HDFS, S3, JDBC, and more.
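
To make the "continuously growing table" model concrete, here is a minimal, self-contained sketch using the built-in rate source, which generates timestamp/value rows so no external system is needed (it assumes an existing SparkSession named spark, as created in Step 3 below):

// The built-in rate source emits (timestamp, value) rows at a configurable rate
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1")
  .load()

// Query the continuously growing table with ordinary DataFrame operations
val evens = rates.filter(rates("value") % 2 === 0)

evens.writeStream
  .format("console")
  .outputMode("append")
  .start()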


Getting Started with Structured Streaming in Scala

Step 1: Set Up Your Spark Project

You need the following dependencies in your build.sbt file:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.5.0"
)
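
Spark 3.5.x is published for Scala 2.12 and 2.13, so the scalaVersion in your build must match one of those lines, for example:

scalaVersion := "2.13.12"   // or a 2.12.x release; must match the Spark artifacts above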

Step 2: Import Required Libraries

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

Step 3: Create Spark Session

val spark = SparkSession.builder()
  .appName("StructuredStreamingExample")
  .master("local[*]")   // local[*] is for local testing; set the master via spark-submit in production
  .getOrCreate()

// Needed for the .as[String] conversion and other Dataset operations below
import spark.implicits._

spark.sparkContext.setLogLevel("WARN")

Step 4: Read Streaming Data from Kafka

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // Kafka broker address
  .option("subscribe", "test-topic")                      // topic(s) to consume
  .load()

// Kafka delivers keys and values as binary; cast the value to a string Dataset
val messages = kafkaStream.selectExpr("CAST(value AS STRING)").as[String]
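
The Kafka source also exposes key, topic, partition, offset, and timestamp columns. If your topic carries JSON rather than plain text, you can parse the value with from_json; a minimal sketch, assuming a hypothetical payload with id and amount fields:

import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// Hypothetical schema for illustration; replace it with your topic's actual payload
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

val parsed = kafkaStream
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")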

Step 5: Transform Data

val wordCounts = messages
  .flatMap(_.split(" "))   // one row per word
  .groupBy("value")        // the Dataset's single string column is named "value"
  .count()
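
If you prefer to stay in the untyped DataFrame API, the same word count can be written with the built-in split and explode functions:

val wordCountsDF = messages
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()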

Step 6: Write Stream to Console

val query = wordCounts.writeStream
  .outputMode("complete")   // print the full, updated counts table after every micro-batch
  .format("console")
  .start()

query.awaitTermination()

✅ This program connects to Kafka, reads streaming messages, counts words in real time, and prints the running totals to the console.
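
To try it out, publish a few lines to test-topic with Kafka's console producer and watch the counts refresh after each micro-batch. In tests or notebooks you may prefer to bound the run rather than block forever; both calls below are part of the StreamingQuery API:

// Wait up to 60 seconds for the query to finish, then stop it gracefully
query.awaitTermination(60000)
query.stop()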


Best Practices for Structured Streaming in Scala

  1. Use checkpointing to ensure fault tolerance (see the sketch after this list).

  2. Choose the right output mode (append, update, or complete) based on your use case.

  3. Tune the trigger interval to balance latency against throughput.

  4. Handle late data with watermarking.

  5. Integrate with sinks like Cassandra, MySQL, HDFS, or cloud storage for persistence.
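
As a sketch of several of these practices together (checkpointing, the update output mode, a processing-time trigger, and a watermark), assume a streaming DataFrame named readings with an event-time column eventTime; the /tmp/checkpoints path is illustrative only:

import org.apache.spark.sql.streaming.Trigger

val windowedCounts = readings
  .withWatermark("eventTime", "10 minutes")            // tolerate data up to 10 minutes late
  .groupBy(window(col("eventTime"), "5 minutes"))      // tumbling 5-minute event-time windows
  .count()

windowedCounts.writeStream
  .outputMode("update")                                // emit only windows whose counts changed
  .option("checkpointLocation", "/tmp/checkpoints")    // illustrative path; use durable storage in production
  .trigger(Trigger.ProcessingTime("30 seconds"))       // micro-batch every 30 seconds
  .format("console")
  .start()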


Real-World Applications

  • Real-time dashboards for business analytics.

  • Fraud detection in financial transactions.

  • Monitoring IoT sensor data streams.

  • Clickstream analysis for web applications.


Conclusion

Structured Streaming in Spark with Scala offers a scalable, fault-tolerant, and easy-to-use framework for real-time data processing. By leveraging DataFrame/Dataset APIs and Kafka integration, developers can build production-grade streaming pipelines with minimal effort.

