# Structured Streaming Using Spark and Scala
Apache Spark has revolutionized large-scale data processing, and Structured Streaming is one of its most powerful features for real-time data analytics. Unlike the older DStream API, Structured Streaming offers a high-level, SQL-based approach to streaming data, making it easier and more efficient to process live data streams.
In this tutorial, we’ll explore Structured Streaming with Spark and Scala, including setup, code examples, and best practices.
## What Is Structured Streaming?

Structured Streaming is a stream processing engine built on Spark SQL. It treats real-time data as a continuously growing table and lets developers query it using the familiar DataFrame and Dataset APIs in Scala.
Key features:

- Event-time processing with watermarking.
- Low latency (sub-second in continuous mode); see the trigger sketch after this list.
- Fault tolerance using checkpoints.
- Seamless integration with Kafka, HDFS, S3, JDBC, and more.
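To make the latency point concrete, here is a minimal sketch of the two trigger modes. It assumes a `SparkSession` named `spark` (created in the example below) and uses the built-in `rate` source to generate synthetic rows, so no external system is required:

```scala
import org.apache.spark.sql.streaming.Trigger

// Built-in test source that emits `rowsPerSecond` rows with a timestamp column.
val rateDf = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Default micro-batch engine: kick off a new batch roughly every second.
val microBatch = rateDf.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

// Experimental continuous engine: roughly millisecond latency. The duration
// here is the checkpoint interval, and only map-like queries are supported.
val continuous = rateDf.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```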
## Dependencies

Add the following dependencies to your `build.sbt` file (Spark 3.5.0 is used here; keep the Kafka connector version in sync with your Spark version):

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.5.0"
)
```
## Example: Streaming Word Count from Kafka

The following program subscribes to a Kafka topic, maintains a running word count, and prints the results to the console:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("StructuredStreamingExample")
  .master("local[*]")
  .getOrCreate()

// Needed for .as[String] below (the session's implicit encoders).
import spark.implicits._

spark.sparkContext.setLogLevel("WARN")

// Subscribe to a Kafka topic on a local broker.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .load()

// Kafka delivers keys and values as binary; cast the value to a string.
val messages = kafkaStream.selectExpr("CAST(value AS STRING)").as[String]

// Split each message into words and count occurrences.
val wordCounts = messages
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Print the full, updated counts table to the console after every batch.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```
✅ This program connects to Kafka, reads streaming messages, counts words in real time, and prints the running counts to the console.
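If you don't have a Kafka broker running locally, the built-in socket source (intended for testing only) lets you smoke-test the same pipeline. This sketch assumes something like `nc -lk 9999` is feeding text lines on port 9999, and reuses the `spark` session and implicits from above:

```scala
// Testing-only source: read text lines from a local TCP socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Same word count as the Kafka version (requires `import spark.implicits._`).
val socketCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

socketCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```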
## Best Practices

- Use checkpointing (the `checkpointLocation` option) to ensure fault tolerance; see the sketch after this list.
- Choose the right output mode (`append`, `update`, or `complete`) based on your use case.
- Tune the trigger interval to balance latency and throughput (see the trigger sketch earlier).
- Handle late data with watermarking, as shown below.
- Integrate with sinks like Cassandra, MySQL, HDFS, or cloud storage for persistence; the `foreachBatch` sketch below shows the JDBC case.
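To make the checkpointing, output-mode, and watermarking advice concrete, here is a hedged sketch. The `events` DataFrame is a stand-in for any streaming source with an event-time column `eventTime` and a `word` column; the checkpoint path is a placeholder:

```scala
import org.apache.spark.sql.functions.{col, window}

// Sketch only: `events` stands in for a streaming DataFrame with columns
// `eventTime: Timestamp` and `word: String`.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")            // tolerate 10 min of late data
  .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
  .count()

windowedCounts.writeStream
  .outputMode("update")                                       // emit only rows that changed
  .option("checkpointLocation", "/tmp/checkpoints/wordcount") // enables recovery after failure
  .format("console")
  .start()
```

For sink integration, `foreachBatch` lets you reuse any batch writer on each micro-batch; this sketch targets MySQL through the batch JDBC writer. The URL, table, and credentials are placeholders, and a JDBC driver must be on the classpath:

```scala
import org.apache.spark.sql.DataFrame

// Write each micro-batch with the batch JDBC writer. Note that appending
// through foreachBatch gives at-least-once, not exactly-once, semantics.
def writeToMySql(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/analytics") // placeholder
    .option("dbtable", "word_counts")                       // placeholder
    .option("user", "spark")
    .option("password", "secret")
    .mode("append")
    .save()
}

windowedCounts.writeStream
  .outputMode("update")
  .option("checkpointLocation", "/tmp/checkpoints/jdbc-sink")
  .foreachBatch(writeToMySql _)
  .start()
```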
## Use Cases

- Real-time dashboards for business analytics.
- Fraud detection in financial transactions.
- Monitoring IoT sensor data streams.
- Clickstream analysis for web applications.
## Conclusion

Structured Streaming in Spark with Scala offers a scalable, fault-tolerant, and easy-to-use framework for real-time data processing. By leveraging DataFrame/Dataset APIs and Kafka integration, developers can build production-grade streaming pipelines with minimal effort.