
How to select all elements greater than a given value in a DataFrame in Spark

12/22/2022

#bigdata #spark #scala #filter #python


Updated: 22/12/2022 by Shubham Mishra

df.filter($"age" > 21).show()

The code above filters the rows of a DataFrame, keeping only those whose age column is greater than 21 (or any threshold you choose).
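To make the row-by-row behavior concrete, here is a plain-Python sketch of the same filtering logic, no Spark required; the sample rows are made up purely for illustration:

```python
# Sample rows standing in for a Spark DataFrame (illustrative data only).
rows = [
    {"name": "Michael", "age": 29},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# Equivalent of df.filter($"age" > 21): keep rows whose age exceeds the threshold.
adults = [row for row in rows if row["age"] > 21]
print(adults)
```

Spark applies the same predicate to each row, but in parallel across partitions of the data.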

df.groupBy("age").count().show()

The code above groups the rows by the age column and counts how many rows fall into each distinct age value.
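The group-and-count step can likewise be sketched in plain Python with a Counter; again, the rows are made-up sample data:

```python
from collections import Counter

# Sample rows standing in for a Spark DataFrame (illustrative data only).
rows = [
    {"name": "Michael", "age": 29},
    {"name": "Andy", "age": 30},
    {"name": "Priya", "age": 30},
]

# Equivalent of df.groupBy("age").count(): count rows per distinct age value.
age_counts = Counter(row["age"] for row in rows)
print(age_counts)
```

Spark performs the same tally in a distributed way, shuffling rows with equal keys to the same partition before counting.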

Let's start with a basic definition of Spark and the three core data abstractions it provides:

  1. DataFrame
  2. DataSet
  3. RDD

DataFrame in the Spark framework

A DataFrame is a data structure that organizes data into a table of rows and columns, similar to a spreadsheet, where each column has a predefined data type such as int, string, or boolean.
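As a rough analogy for "rows plus typed columns", here is a minimal plain-Python sketch; the schema, field names, and values are assumptions for illustration, not Spark's API:

```python
# A toy schema mapping column names to Python types, mimicking a DataFrame's
# predefined column types (int, string, boolean).
schema = {"name": str, "age": int, "active": bool}

row = {"name": "Andy", "age": 30, "active": True}

# Check each value against the declared column type, the way a DataFrame
# enforces its schema when rows are loaded.
valid = all(isinstance(row[col], typ) for col, typ in schema.items())
print(valid)
```

In Spark, this schema is either inferred from the data source (as when reading JSON) or declared explicitly.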

Dataset in the Spark framework

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
The Dataset API is available in Scala and Java.

Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName).

The case for R is similar. A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a relational table, backed by Spark's query-optimization techniques.

RDD in the Spark framework

A Resilient Distributed Dataset (RDD) is a distributed memory abstraction that lets a programmer perform in-memory computations on a large cluster. One important advantage of RDDs is fault tolerance: if a failure occurs, the lost data is recovered automatically by recomputing it from its lineage of transformations.
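The recovery idea, replaying a recorded lineage of transformations to rebuild lost results, can be sketched in plain Python; the function and variable names here are illustrative, not Spark's API:

```python
# Toy lineage: a base dataset plus the transformations applied to it.
base_data = [1, 2, 3, 4, 5]
lineage = [
    lambda xs: [x * 2 for x in xs],       # map: double each element
    lambda xs: [x for x in xs if x > 4],  # filter: keep values greater than 4
]

def recompute(data, transformations):
    """Replay the lineage in order to rebuild the result, as Spark does
    for a lost RDD partition after a node failure."""
    for step in transformations:
        data = step(data)
    return data

print(recompute(base_data, lineage))  # [6, 8, 10]
```

Because each transformation is deterministic, Spark never needs to replicate the intermediate data itself, only the recipe for producing it.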

We can see these DataFrame operations in action by reading a JSON file, with an example in Scala:


import org.apache.spark.SparkContext._

import org.apache.spark._

val df = spark.read.json("examples/src/main/resources/developerIndian.json")

df.printSchema()

df.select(“name”).show()

df.select($"name", $"age" + 1).show()

df.filter($"age" > 21).show()

df.groupBy("age").count().show()
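Putting the pieces together, here is a plain-Python sketch of the whole pipeline above (select, filter, group-and-count); the records are made up, since the contents of developerIndian.json are not shown:

```python
from collections import Counter

# Made-up records standing in for the JSON file read by spark.read.json(...).
records = [
    {"name": "Michael", "age": 29},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
    {"name": "Priya", "age": 30},
]

# df.select("name"): project just the name column.
names = [r["name"] for r in records]

# df.select($"name", $"age" + 1): project name alongside age incremented by one.
name_and_next_age = [(r["name"], r["age"] + 1) for r in records]

# df.filter($"age" > 21): keep rows above the threshold.
over_21 = [r for r in records if r["age"] > 21]

# df.groupBy("age").count(): count rows per distinct age.
age_counts = Counter(r["age"] for r in records)

print(names, name_and_next_age, over_21, age_counts)
```

Each line maps one-to-one onto a DataFrame operation in the Scala example, which is why Spark's API often feels like ordinary collection processing scaled out across a cluster.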

Please check the link below for reference:

  1. Quick start tutorial
