Monday, February 6, 2017

Loading a JSON File

Example: loading a multiline JSON file as a DataFrame, flattened and ready to insert into a table


Our JSON file, 1.json:


{
  "user": "gT35Hhhre9m",
  "dates": ["2016-01-29", "2016-01-28"],
  "status": "OK",
  "reason": "some reason",
  "content": [
    { "foo": 123, "bar": "val1" },
    { "foo": 456, "bar": "val2" },
    { "foo": 789, "bar": "val3" },
    { "foo": 124, "bar": "val4" },
    { "foo": 126, "bar": "val5" }
  ]
}

Expected output is:


+-----------+------+-----------+----------+---+----+
|       user|status|     reason|     dates|foo| bar|
+-----------+------+-----------+----------+---+----+
|gT35Hhhre9m|    OK|some reason|2016-01-29|123|val1|
|gT35Hhhre9m|    OK|some reason|2016-01-28|123|val1|
|gT35Hhhre9m|    OK|some reason|2016-01-29|456|val2|
|gT35Hhhre9m|    OK|some reason|2016-01-28|456|val2|
|gT35Hhhre9m|    OK|some reason|2016-01-29|789|val3|
|gT35Hhhre9m|    OK|some reason|2016-01-28|789|val3|
|gT35Hhhre9m|    OK|some reason|2016-01-29|124|val4|
|gT35Hhhre9m|    OK|some reason|2016-01-28|124|val4|
|gT35Hhhre9m|    OK|some reason|2016-01-29|126|val5|
|gT35Hhhre9m|    OK|some reason|2016-01-28|126|val5|
+-----------+------+-----------+----------+---+----+

Code:


import org.apache.spark.sql.functions.explode
// Spark 2.2+: the multiLine option lets the JSON source read a document that spans multiple lines
val df1 = spark.read.option("multiLine", true).json("file:///Leela/Filesparsing/1.json")
// Explode the "content" array and the "dates" array: one row per (content element, date) pair
val df2 = df1.withColumn("contents", explode(df1("content"))).withColumn("dates", explode(df1("dates")))
df2.select("user", "status", "reason", "dates", "contents.foo", "contents.bar").show
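
With the arrays flattened, the DataFrame is ready to be persisted as a table. A minimal sketch, assuming a Hive-enabled SparkSession; the table name flat_json is hypothetical:

// Assumes Hive support is enabled; "flat_json" is a made-up table name for illustration
val flat = df2.select("user", "status", "reason", "dates", "contents.foo", "contents.bar")
flat.write.mode("overwrite").saveAsTable("flat_json")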


Now consider the multiline JSON file below, Babynames.json:
[{
 "Year": "2013",
 "First Name": "DAVID",
 "County": "KINGS",
 "Sex": "M",
 "Count": "272"
}, {
 "Year": "2013",
 "First Name": "JAYDEN",
 "County": "KINGS",
 "Sex": "M",
 "Count": "268"
}, {
 "Year": "2013",
 "First Name": "JAYDEN",
 "County": "QUEENS",
 "Sex": "M",
 "Count": "219"
}, {
 "Year": "2013",
 "First Name": "MOSHE",
 "County": "KINGS",
 "Sex": "M",
 "Count": "219"
}, {
 "Year": "2013",
 "First Name": "ETHAN",
 "County": "QUEENS",
 "Sex": "M",
 "Count": "216"
}]

Note: sqlContext.read.json cannot read the above file because it is a multiline JSON file. The Apache Spark documentation states clearly that "Each line must contain a separate, self-contained valid JSON object." Only a file in that line-delimited form can be read directly:
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)
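
For reference, people.json in the Spark examples directory is line-delimited, with one self-contained JSON object per line:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}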
The file Babynames.json is a multiline JSON file, so the command above fails on it. The reason is that Spark parallelizes processing by splitting the file into partitions, each parsed independently; splitting a multiline JSON document produces fragments that are not valid JSON on their own.

Instead, use wholeTextFiles(JSONFileName), which creates one key-value pair per file, with the file name as the key and the complete file content as the value. This ensures the JSON file is not split and can be parsed successfully:

scala> val jsonRDD = sc.wholeTextFiles("file:///home/cloudera/Desktop/Spark/RSJSON/Babynames.JSON",2).map(x => x._2)
jsonRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at map at <console>:27

scala> val namesJSON = sqlContext.read.json(jsonRDD)
namesJSON: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string]

scala> namesJSON.printSchema
root
 |-- Count: string (nullable = true)
 |-- County: string (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Year: string (nullable = true)

scala> namesJSON.filter(namesJSON("Count") > 250).show()
+-----+------+----------+---+----+
|Count|County|First Name|Sex|Year|
+-----+------+----------+---+----+
|  272| KINGS|     DAVID|  M|2013|
|  268| KINGS|    JAYDEN|  M|2013|
+-----+------+----------+---+----+
scala> namesJSON.select("First Name").show()
+----------+
|First Name|
+----------+
|     DAVID|
|    JAYDEN|
|    JAYDEN|
|     MOSHE|
|     ETHAN|
+----------+
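
On Spark 2.2 and later, the wholeTextFiles workaround is not needed: the multiLine option from the first example reads Babynames.json directly. A quick sketch, assuming the same file path as above:

// Assumes Spark 2.2+ and the same file path used with wholeTextFiles above
val namesDF = spark.read.option("multiLine", true).json("file:///home/cloudera/Desktop/Spark/RSJSON/Babynames.JSON")
namesDF.filter(namesDF("Count") > 250).show()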

