Word Count example in Spark
Apache Spark is a powerful open-source framework for distributed data processing across multiple nodes. Combined with Scala, Spark provides an efficient way to handle large datasets in parallel. This tutorial will guide you through setting up Spark, running a Word Count program in Scala, and understanding its output.
Before you proceed, make sure you have the following installed on your system:
Java (JDK 8 or later)
Scala (a 2.12.x release, matching the build used later in this tutorial)
sbt (to compile and package the program)
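You can confirm each prerequisite from a terminal:
java -version
scala -version
sbt --version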
To install Apache Spark, follow these steps: download a pre-built release from https://spark.apache.org/downloads.html, extract the archive to /usr/local/spark, and add the following lines to your shell profile (for example, ~/.bashrc):
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
Reload your shell profile, then verify the installation:
spark-shell --version
If installed correctly, you should see the version details. You can also deploy your own Spark cluster in standalone mode using the following commands:
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://<master-node-IP>:7077
Once the cluster is up and running, you can monitor it by accessing http://localhost:8080 in your browser.
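To try the cluster out, you can attach an interactive shell to it, using the same master URL you passed when starting the worker:
spark-shell --master spark://<master-node-IP>:7077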
Below is the Scala source code for the Word Count program in Apache Spark:
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Run locally with the application name "Word Count"
    val conf = new SparkConf().setAppName("Word Count").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the input file as an RDD of lines
    val input = sc.textFile("input.txt")

    // Split lines into words, pair each word with 1, and sum the counts per word
    val counts = input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Write the (word, count) pairs to the "output" directory
    counts.saveAsTextFile("output")
    println("Word count completed!")

    // Release resources
    sc.stop()
  }
}
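If you want to sanity-check the pipeline before packaging it, you can paste its core into spark-shell, which provides a ready-made SparkContext as sc. The sample lines below are made up for illustration:
val lines = sc.parallelize(Seq("hello spark", "hello world"))
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)  // prints (hello,2), (spark,1), (world,1) in some order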
The program reads input.txt line by line, splits each line into words with flatMap, maps each word to the pair (word, 1), and sums the counts for each word with reduceByKey.
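To compile the program, package it with sbt. A minimal build.sbt that produces the JAR name used below might look like this (the exact Scala and Spark versions are assumptions; match them to your installation):
name := "spark-wordcount"
version := "1.0"
scalaVersion := "2.12.18"
// "provided": the Spark installation supplies these classes at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"
Running sbt package then produces target/scala-2.12/spark-wordcount_2.12-1.0.jar. Use the following command to run the program: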
spark-submit --class SparkWordCount --master local[2] target/scala-2.12/spark-wordcount_2.12-1.0.jar
If you're using Databricks, the steps are simpler: create a cluster, attach a Scala notebook to it, paste the word count logic into a cell, and run the cell. Databricks notebooks provide a ready-made SparkContext, so skip creating your own.
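A notebook cell might look like the following sketch; the DBFS path is a placeholder for wherever you uploaded the input file:
// Databricks notebooks provide `sc`; do not create a new SparkContext here
val counts = sc.textFile("/FileStore/tables/input.txt")  // placeholder upload path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)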
The output of the program will be saved in the output directory, with one part file per partition. You can view the word count results by running:
cat output/part-00000
Sample output (saveAsTextFile writes each record as a (word, count) tuple):
(spark,3)
(scala,2)
(hello,5)
(world,7)
A few common issues and fixes:
If the job cannot find the input file, ensure input.txt exists in the working directory or provide an absolute path.
If running in a REPL environment, stop the existing SparkContext before creating a new one:
sc.stop()
If the job runs out of memory, increase the executor memory when submitting:
--executor-memory 4G
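For example, when submitting to the standalone cluster started earlier:
spark-submit --class SparkWordCount --master spark://<master-node-IP>:7077 --executor-memory 4G target/scala-2.12/spark-wordcount_2.12-1.0.jar
Note that in local mode everything runs inside the driver process, so --driver-memory is the flag to raise instead.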
Congratulations! You have successfully executed your first Spark application using Scala. This Word Count program serves as a basic example to understand how Apache Spark processes data in a distributed manner. Next, try experimenting with larger datasets and more advanced Spark transformations to expand your skills.
For more Spark tutorials, stay tuned to DeveloperIndian! 🚀