# What is Spark Streaming? Types of Spark Streaming: DStreams vs Structured Streaming
In the world of Big Data, most real-time applications require processing continuous streams of data rather than static datasets. Apache Spark Streaming is a powerful extension of the Apache Spark ecosystem that enables scalable, high-throughput, and fault-tolerant stream processing of live data streams.
This article explains what Spark Streaming is, its key features, and provides a simple example to help beginners get started.
Spark Streaming is a component of Apache Spark designed for processing live data streams. Unlike batch processing, where a complete, static dataset is processed at once, Spark Streaming continuously processes real-time data from sources such as Kafka, Flume, Kinesis, or TCP sockets.
It divides incoming data into small batches (called micro-batches) and processes them using the Spark engine.
👉 This makes it suitable for applications like fraud detection, real-time dashboards, sentiment analysis, and log monitoring.
Apache Spark provides two main approaches for handling streaming data: DStreams (Discretized Streams) and Structured Streaming. Each type has its own advantages, limitations, and use cases.
DStreams were the original way to work with streaming data in Spark, introduced in Spark 1.x. They work by dividing real-time data into micro-batches, which are processed using Spark’s RDD (Resilient Distributed Dataset) operations.
Key characteristics:
Works on micro-batches.
Built on the RDD API.
Simple and stable for batch-like streaming workloads.
Example: a word count over a TCP socket using the DStream API:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch interval
lines = ssc.socketTextStream("localhost", 9999)  # DStream of text lines from a local socket
words = lines.flatMap(lambda line: line.split(" "))  # split each line into words
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)  # count each word per batch
word_counts.pprint()  # print the counts for each micro-batch
ssc.start()
ssc.awaitTermination()
✅ Pros:
Easy to understand.
Good for traditional batch-style streaming.
❌ Cons:
Higher latency (seconds rather than milliseconds).
Less flexible; no built-in support for event-time processing.
Structured Streaming, introduced in Spark 2.x, is the modern and recommended approach. It treats streaming data as a continuously growing table and uses the Spark SQL engine for processing.
Key characteristics:
Supports both micro-batch and continuous processing modes (a trigger sketch follows the example below).
Works with DataFrames and Datasets.
Provides event-time processing and better fault tolerance.
A SQL-like API makes it easier for data analysts.
Example: the same word count implemented with Structured Streaming:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
# Create DataFrame from socket stream
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Split into words
words = lines.selectExpr("explode(split(value, ' ')) as word")
# Count words
word_counts = words.groupBy("word").count()
# Output results to console
query = word_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
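To make the two processing modes concrete, here is a minimal sketch of the trigger setting that selects between them. These lines are alternatives to the .start() call above, not part of the original example, and the interval values are arbitrary placeholders.
# Micro-batch mode with an explicit interval: run one batch every 5 seconds
query = word_counts.writeStream.outputMode("complete").format("console").trigger(processingTime="5 seconds").start()
# Continuous mode (experimental since Spark 2.3) targets millisecond latency, but it only supports
# map-like queries and sources/sinks such as Kafka, so the socket aggregation above would not run
# this way; the commented call below only illustrates the API on a hypothetical map-only stream:
# query = mapped_events.writeStream.format("console").trigger(continuous="1 second").start()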
✅ Pros:
Low latency (sub-second possible).
Easy integration with SQL, MLlib, and analytics.
Strong support for event-time processing and watermarks (a sketch follows below).
❌ Cons:
Slightly more complex than DStreams.
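To illustrate the event-time and watermark support mentioned above, here is a minimal sketch. It assumes a hypothetical streaming DataFrame named events with an event_time timestamp column and a word column, which the socket example above does not have.
from pyspark.sql.functions import window
# Count words per 5-minute event-time window, tolerating data that arrives up to 10 minutes late
windowed_counts = events.withWatermark("event_time", "10 minutes").groupBy(window("event_time", "5 minutes"), "word").count()
# Records arriving more than 10 minutes behind the watermark are dropped rather than
# reopening old windows, which keeps the query's state bounded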
Whichever API you choose, Spark Streaming offers several key features:
Real-time Processing – Processes data streams in near real-time with minimal latency.
Integration – Works seamlessly with Kafka, Flume, HDFS, Cassandra, and more (a Kafka example is sketched after this list).
Scalability – Built on Spark’s distributed computing model.
Fault Tolerance – Ensures recovery of lost data using lineage and checkpointing.
Unified Engine – You can run batch, interactive, and streaming jobs in the same Spark application.
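As a rough sketch of the Integration and Fault Tolerance points, the following Structured Streaming job reads messages from a Kafka topic and writes them to files with a checkpoint directory, so the query can recover after a failure. The broker address, topic name, and paths are placeholders, and running it requires the spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("KafkaToFilesWithCheckpoint").getOrCreate()
# Subscribe to a Kafka topic (broker address and topic name are placeholders)
events = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "events").load()
# Kafka delivers keys and values as binary, so cast the value to a readable string
messages = events.selectExpr("CAST(value AS STRING) AS message")
# The checkpoint directory stores offsets and state so the query resumes where it left off after a restart
query = messages.writeStream.format("parquet").option("path", "/tmp/stream-output").option("checkpointLocation", "/tmp/stream-checkpoints").start()
query.awaitTermination()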
To put everything together, here is the full, runnable word-count script: a Spark Streaming job that reads text data from a TCP socket and counts words in real time.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with 2-second batch interval
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)
# Connect to localhost:9999 for streaming data
lines = ssc.socketTextStream("localhost", 9999)
# Split lines into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# Print the word counts
word_counts.pprint()
# Start and await termination
ssc.start()
ssc.awaitTermination()
✅ Now, if you run a netcat server using:
nc -lk 9999
and type words, Spark Streaming will count them in real time.
Common real-world use cases for Spark Streaming include:
Fraud Detection in banking transactions.
Real-time Recommendation Engines for e-commerce.
Log Monitoring for servers and applications.
IoT Data Processing for sensor streams.
Social Media Analytics like sentiment tracking on Twitter.
Spark Streaming is one of the most powerful tools for real-time data processing. With its micro-batch model, fault tolerance, and seamless integration with other big data tools, it has become a go-to choice for developers and data engineers.
By learning Spark Streaming, you can build real-time applications that scale efficiently and handle live data with ease.
In summary:
DStreams: the older approach, uses RDDs, micro-batch only, simpler but less powerful.
Structured Streaming: Modern approach, DataFrame/Dataset API, low latency, recommended for new projects.
For most modern use cases, Structured Streaming is the best choice due to its scalability, performance, and ease of integration.