What is Spark Streaming? Types of Spark Streaming: DStreams vs Structured Streaming

8/17/2025


In the world of Big Data, many real-time applications must process continuous streams of data rather than static datasets. Apache Spark Streaming is a powerful extension of the Apache Spark ecosystem that enables scalable, high-throughput, fault-tolerant processing of live data streams.

This article explains what Spark Streaming is, compares its two APIs (DStreams and Structured Streaming), and walks through a simple example to help beginners get started.



What is Spark Streaming?

Spark Streaming is a component of Apache Spark designed for processing live data streams. Unlike batch processing, which operates on a fixed dataset, Spark Streaming handles data continuously as it arrives from sources like Kafka, Flume, Kinesis, or TCP sockets.

It divides incoming data into small batches (called micro-batches) and processes them using the Spark engine.

👉 This makes it suitable for applications like fraud detection, real-time dashboards, sentiment analysis, and log monitoring.
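For illustration, here is a minimal sketch of how a streaming source is declared in modern Spark (the broker address localhost:9092 and topic name events are placeholders, and the spark-sql-kafka connector package must be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaSourceSketch").getOrCreate()

# Subscribe to a Kafka topic as a streaming DataFrame
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .load())

# Kafka delivers binary key/value columns; cast the payload to a string
messages = stream.selectExpr("CAST(value AS STRING) AS message")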


Types of Spark Streaming

Apache Spark provides two main approaches for handling streaming data: DStreams (Discretized Streams) and Structured Streaming. Each type has its own advantages, limitations, and use cases.


1. DStream-based Spark Streaming

DStreams were the original way to work with streaming data in Spark, dating back to its earliest releases. They work by dividing real-time data into micro-batches, which are processed using Spark’s RDD (Resilient Distributed Dataset) operations.

Key Features:

  • Works on micro-batches.

  • Built on RDD API.

  • Simple and stable for batch-like streaming workloads.

Example:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# "local[2]" gives at least two threads: one for the receiver, one for processing
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# Read lines from a TCP socket, split into words, and count per batch
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

word_counts.pprint()  # print the first ten counts of each batch
ssc.start()
ssc.awaitTermination()
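Each transformation above compiles down to an RDD operation that runs once per 2-second batch, which is why DStream code reads almost exactly like batch Spark code.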

Pros:

  • Easy to understand.

  • Good for traditional batch-style streaming.

Cons:

  • Higher latency (seconds-level).

  • Less flexible for modern real-time use cases: no built-in event-time semantics or watermark support.


2. Structured Streaming

Structured Streaming, introduced in Spark 2.x, is the modern and recommended approach. It treats streaming data as a continuously growing table and uses the Spark SQL engine for processing.

Key Features:

  • Supports both micro-batch and continuous processing modes.

  • Works with DataFrames and Datasets.

  • Provides event-time processing and better fault tolerance.

  • SQL-like API makes it easier for data analysts.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Create a streaming DataFrame from a socket source
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words
words = lines.selectExpr("explode(split(value, ' ')) as word")

# Count occurrences of each word across the whole stream
word_counts = words.groupBy("word").count()

# Write the running counts to the console
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
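Here outputMode("complete") re-emits the full counts table on every trigger; Structured Streaming also offers append and update modes for sinks that only need new or changed rows.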

Pros:

  • Low latency (sub-second possible).

  • Easy integration with SQL, MLlib, and analytics.

  • Strong support for event-time processing and watermarks (see the sketch after this section).

Cons:

  • A few more concepts to learn up front (output modes, triggers, watermarks) than the basic DStream API.
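To make the event-time point concrete, here is a hedged sketch of a windowed count with a watermark. It leans on the socket source's includeTimestamp option to attach an arrival timestamp; a real pipeline would instead parse an event-time field out of the data itself:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("EventTimeSketch").getOrCreate()

# The socket source can attach a "timestamp" column when includeTimestamp is set
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", True)
         .load())

# Tolerate events up to 10 minutes late, then count lines per 5-minute window
windowed_counts = (lines
                   .withWatermark("timestamp", "10 minutes")
                   .groupBy(window(col("timestamp"), "5 minutes"))
                   .count())

# "update" mode emits only the windows whose counts changed in each trigger
query = (windowed_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()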


Key Features of Spark Streaming

  1. Real-time Processing – Processes data streams in near real-time, with latency ranging from sub-second (Structured Streaming) to a few seconds (DStreams).

  2. Integration – Works seamlessly with Kafka, Flume, HDFS, Cassandra, and more.

  3. Scalability – Built on Spark’s distributed computing model.

  4. Fault Tolerance – Ensures recovery of lost data using lineage and checkpointing (see the checkpointing sketch after this list).

  5. Unified Engine – You can run batch, interactive, and streaming jobs in the same Spark application.
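For point 4, a minimal checkpointing sketch (the path below is a placeholder; word_counts is the aggregation from the Structured Streaming example above): given a durable checkpoint location, a restarted query resumes from its last committed offsets and state.

# Placeholder path; use a durable location (e.g., HDFS or S3) in production
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/wordcount")
         .start())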


🔹 Example: Word Count in Spark Streaming

Here is the DStream word-count example from earlier again, this time broken down step by step: it reads text data from a TCP socket and counts words in real time.

Step 1: Import Libraries

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

Step 2: Create Spark Context and Streaming Context

# Create a local StreamingContext with 2-second batch interval
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)
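Note: the master is "local[2]" because the socket receiver permanently occupies one thread; with "local" or "local[1]" there would be no thread left to process the received batches.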

Step 3: Connect to Data Stream (Socket)

# Connect to localhost:9999 for streaming data
lines = ssc.socketTextStream("localhost", 9999)

Step 4: Transform and Process Data

# Split lines into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

Step 5: Print Output

# Print the first ten word counts of each batch (pprint's default)
word_counts.pprint()

Step 6: Start Streaming

# Start and await termination
ssc.start()
ssc.awaitTermination()

✅ Now, if you run a netcat server using:

nc -lk 9999

and type words, Spark Streaming will count them in real time.
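As a natural extension (a hedged sketch reusing the words DStream from Step 4; the checkpoint path is a placeholder), you can count over a sliding window instead of per batch:

# Stateful window operations need a checkpoint directory; set this before ssc.start()
ssc.checkpoint("/tmp/checkpoints/dstream-wordcount")  # placeholder path

# Count words over the last 30 seconds, recomputed every 10 seconds.
# The inverse function (a - b) lets Spark subtract counts sliding out of the window.
windowed_counts = (words.map(lambda word: (word, 1))
                   .reduceByKeyAndWindow(lambda a, b: a + b,
                                         lambda a, b: a - b,
                                         windowDuration=30,
                                         slideDuration=10))
windowed_counts.pprint()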


🔹 Real-World Use Cases of Spark Streaming

  • Fraud Detection in banking transactions.

  • Real-time Recommendation Engines for e-commerce.

  • Log Monitoring for servers and applications.

  • IoT Data Processing for sensor streams.

  • Social Media Analytics like sentiment tracking on Twitter.


🔹 Conclusion

Spark Streaming is one of the most powerful tools for real-time data processing. With its micro-batch model, fault tolerance, and seamless integration with other big data tools, it has become a go-to choice for developers and data engineers.

By learning Spark Streaming, you can build real-time applications that scale efficiently and handle live data with ease. To recap the two APIs:

  • DStreams: Old approach, uses RDDs, micro-batch only, simpler but less powerful.

  • Structured Streaming: Modern approach, DataFrame/Dataset API, low latency, recommended for new projects.

For most modern use cases, Structured Streaming is the best choice due to its scalability, performance, and ease of integration.