# What is Spark Streaming? Types of Spark Streaming: DStreams vs Structured Streaming
In the world of Big Data, most real-time applications require processing continuous streams of data rather than static datasets. Apache Spark Streaming is a powerful extension of the Apache Spark ecosystem that enables scalable, high-throughput, and fault-tolerant stream processing of live data streams.
This article explains what Spark Streaming is, its key features, and provides a simple example to help beginners get started.
Spark Streaming is a component of Apache Spark designed for processing live data streams. Unlike batch processing, where a complete, static dataset is processed at once, Spark Streaming continuously processes real-time data from sources such as Kafka, Flume, Kinesis, or TCP sockets.
It divides incoming data into small batches (called micro-batches) and processes them using the Spark engine.
👉 This makes it suitable for applications like fraud detection, real-time dashboards, sentiment analysis, and log monitoring.
Apache Spark provides two main approaches for handling streaming data: DStreams (Discretized Streams) and Structured Streaming. Each type has its own advantages, limitations, and use cases.
DStreams were the original way to work with streaming data in Spark, introduced in Spark 1.x. They work by dividing real-time data into micro-batches, which are processed using Spark’s RDD (Resilient Distributed Dataset) operations.
Key characteristics:
Works on micro-batches.
Built on the RDD API.
Simple and stable for batch-like streaming workloads.
Example: a word count over a TCP socket using the DStream API:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch interval
lines = ssc.socketTextStream("localhost", 9999)  # DStream of text lines from a local socket
words = lines.flatMap(lambda line: line.split(" "))  # split each line into words
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)  # count each word per batch
word_counts.pprint()  # print the counts for each micro-batch
ssc.start()
ssc.awaitTermination()
✅ Pros:
Easy to understand.
Good for traditional batch-style streaming.
❌ Cons:
Higher latency (seconds rather than milliseconds).
Less flexible; no built-in support for event-time processing.
Structured Streaming, introduced in Spark 2.x, is the modern and recommended approach. It treats streaming data as a continuously growing table and uses the Spark SQL engine for processing.
Key characteristics:
Supports both micro-batch and continuous processing modes (a trigger sketch follows the example below).
Works with DataFrames and Datasets.
Provides event-time processing and better fault tolerance.
A SQL-like API makes it easier for data analysts.
Example: the same word count implemented with Structured Streaming:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
# Create DataFrame from socket stream
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Split into words
words = lines.selectExpr("explode(split(value, ' ')) as word")
# Count words
word_counts = words.groupBy("word").count()
# Output results to console
query = word_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
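To make the two processing modes concrete, here is a minimal sketch of the trigger setting that selects between them. These lines are alternatives to the .start() call above, not part of the original example, and the interval values are arbitrary placeholders.
# Micro-batch mode with an explicit interval: run one batch every 5 seconds
query = word_counts.writeStream.outputMode("complete").format("console").trigger(processingTime="5 seconds").start()
# Continuous mode (experimental since Spark 2.3) targets millisecond latency, but it only supports
# map-like queries and sources/sinks such as Kafka, so the socket aggregation above would not run
# this way; the commented call below only illustrates the API on a hypothetical map-only stream:
# query = mapped_events.writeStream.format("console").trigger(continuous="1 second").start()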
✅ Pros:
Low latency (sub-second possible).
Easy integration with SQL, MLlib, and analytics.
Strong support for event-time processing and watermarks (a sketch follows below).
❌ Cons:
Slightly more complex than DStreams.
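To illustrate the event-time and watermark support mentioned above, here is a minimal sketch. It assumes a hypothetical streaming DataFrame named events with an event_time timestamp column and a word column, which the socket example above does not have.
from pyspark.sql.functions import window
# Count words per 5-minute event-time window, tolerating data that arrives up to 10 minutes late
windowed_counts = events.withWatermark("event_time", "10 minutes").groupBy(window("event_time", "5 minutes"), "word").count()
# Records arriving more than 10 minutes behind the watermark are dropped rather than
# reopening old windows, which keeps the query's state bounded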
Whichever API you choose, Spark Streaming offers several key features:
Real-time Processing – Processes data streams in near real-time with minimal latency.
Integration – Works seamlessly with Kafka, Flume, HDFS, Cassandra, and more (a Kafka example is sketched after this list).
Scalability – Built on Spark’s distributed computing model.
Fault Tolerance – Ensures recovery of lost data using lineage and checkpointing.
Unified Engine – You can run batch, interactive, and streaming jobs in the same Spark application.
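As a rough sketch of the Integration and Fault Tolerance points, the following Structured Streaming job reads messages from a Kafka topic and writes them to files with a checkpoint directory, so the query can recover after a failure. The broker address, topic name, and paths are placeholders, and running it requires the spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("KafkaToFilesWithCheckpoint").getOrCreate()
# Subscribe to a Kafka topic (broker address and topic name are placeholders)
events = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "events").load()
# Kafka delivers keys and values as binary, so cast the value to a readable string
messages = events.selectExpr("CAST(value AS STRING) AS message")
# The checkpoint directory stores offsets and state so the query resumes where it left off after a restart
query = messages.writeStream.format("parquet").option("path", "/tmp/stream-output").option("checkpointLocation", "/tmp/stream-checkpoints").start()
query.awaitTermination()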
To put everything together, here is the full, runnable word-count script: a Spark Streaming job that reads text data from a TCP socket and counts words in real time.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with 2-second batch interval
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)
# Connect to localhost:9999 for streaming data
lines = ssc.socketTextStream("localhost", 9999)
# Split lines into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# Print the word counts
word_counts.pprint()
# Start and await termination
ssc.start()
ssc.awaitTermination()
✅ Now, if you run a netcat server using:
nc -lk 9999
and type words, Spark Streaming will count them in real time.
Common real-world use cases for Spark Streaming include:
Fraud Detection in banking transactions.
Real-time Recommendation Engines for e-commerce.
Log Monitoring for servers and applications.
IoT Data Processing for sensor streams.
Social Media Analytics like sentiment tracking on Twitter.
Spark Streaming is one of the most powerful tools for real-time data processing. With its micro-batch model, fault tolerance, and seamless integration with other big data tools, it has become a go-to choice for developers and data engineers.
By learning Spark Streaming, you can build real-time applications that scale efficiently and handle live data with ease.
In summary:
DStreams: the older approach, uses RDDs, micro-batch only, simpler but less powerful.
Structured Streaming: Modern approach, DataFrame/Dataset API, low latency, recommended for new projects.
For most modern use cases, Structured Streaming is the best choice due to its scalability, performance, and ease of integration.