Monday, February 6, 2017

Loading a JSON File

Example: loading a multiline JSON file as a DataFrame, flattened and ready to insert into a table


Our JSON file, 1.json:


{
  "user": "gT35Hhhre9m",
  "dates": ["2016-01-29", "2016-01-28"],
  "status": "OK",
  "reason": "some reason",
  "content": [
    { "foo": 123, "bar": "val1" },
    { "foo": 456, "bar": "val2" },
    { "foo": 789, "bar": "val3" },
    { "foo": 124, "bar": "val4" },
    { "foo": 126, "bar": "val5" }
  ]
}

Expected output is:


+-----------+------+-----------+----------+---+----+
|       user|status|     reason|     dates|foo| bar|
+-----------+------+-----------+----------+---+----+
|gT35Hhhre9m|    OK|some reason|2016-01-29|123|val1|
|gT35Hhhre9m|    OK|some reason|2016-01-28|123|val1|
|gT35Hhhre9m|    OK|some reason|2016-01-29|456|val2|
|gT35Hhhre9m|    OK|some reason|2016-01-28|456|val2|
|gT35Hhhre9m|    OK|some reason|2016-01-29|789|val3|
|gT35Hhhre9m|    OK|some reason|2016-01-28|789|val3|
|gT35Hhhre9m|    OK|some reason|2016-01-29|124|val4|
|gT35Hhhre9m|    OK|some reason|2016-01-28|124|val4|
|gT35Hhhre9m|    OK|some reason|2016-01-29|126|val5|
|gT35Hhhre9m|    OK|some reason|2016-01-28|126|val5|
+-----------+------+-----------+----------+---+----+

Code:


import org.apache.spark.sql.functions.explode
// Spark 2.2+: the multiLine option lets the JSON source read a document that spans multiple lines
val df1 = spark.read.option("multiLine", true).json("file:///Leela/Filesparsing/1.json")
// Explode the "content" array and the "dates" array: one row per (content element, date) pair
val df2 = df1.withColumn("contents", explode(df1("content"))).withColumn("dates", explode(df1("dates")))
df2.select("user", "status", "reason", "dates", "contents.foo", "contents.bar").show
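
With the arrays flattened, the DataFrame is ready to be persisted as a table. A minimal sketch, assuming a Hive-enabled SparkSession; the table name flat_json is hypothetical:

// Assumes Hive support is enabled; "flat_json" is a made-up table name for illustration
val flat = df2.select("user", "status", "reason", "dates", "contents.foo", "contents.bar")
flat.write.mode("overwrite").saveAsTable("flat_json")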


Now consider the multiline JSON file below, Babynames.json:
[{
 "Year": "2013",
 "First Name": "DAVID",
 "County": "KINGS",
 "Sex": "M",
 "Count": "272"
}, {
 "Year": "2013",
 "First Name": "JAYDEN",
 "County": "KINGS",
 "Sex": "M",
 "Count": "268"
}, {
 "Year": "2013",
 "First Name": "JAYDEN",
 "County": "QUEENS",
 "Sex": "M",
 "Count": "219"
}, {
 "Year": "2013",
 "First Name": "MOSHE",
 "County": "KINGS",
 "Sex": "M",
 "Count": "219"
}, {
 "Year": "2013",
 "First Name": "ETHAN",
 "County": "QUEENS",
 "Sex": "M",
 "Count": "216"
}]

Note: sqlContext.read.json cannot read the above file because it is a multiline JSON file. The Apache Spark documentation states clearly that "Each line must contain a separate, self-contained valid JSON object." Only a file in that line-delimited form can be read directly:
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)
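
For reference, people.json in the Spark examples directory is line-delimited, with one self-contained JSON object per line:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}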
The file Babynames.json is a multiline JSON file, so the command above fails on it. The reason is that Spark parallelizes processing by splitting the file into partitions, each parsed independently; splitting a multiline JSON document produces fragments that are not valid JSON on their own.

Instead, use wholeTextFiles(JSONFileName), which creates one key-value pair per file, with the file name as the key and the complete file content as the value. This ensures the JSON file is not split and can be parsed successfully:

scala> val jsonRDD = sc.wholeTextFiles("file:///home/cloudera/Desktop/Spark/RSJSON/Babynames.JSON",2).map(x => x._2)
jsonRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at map at <console>:27

scala> val namesJSON = sqlContext.read.json(jsonRDD)
namesJSON: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string]

scala> namesJSON.printSchema
root
 |-- Count: string (nullable = true)
 |-- County: string (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Year: string (nullable = true)

scala> namesJSON.filter(namesJSON("Count") > 250).show()
+-----+------+----------+---+----+
|Count|County|First Name|Sex|Year|
+-----+------+----------+---+----+
|  272| KINGS|     DAVID|  M|2013|
|  268| KINGS|    JAYDEN|  M|2013|
+-----+------+----------+---+----+
scala> namesJSON.select("First Name").show()
+----------+
|First Name|
+----------+
|     DAVID|
|    JAYDEN|
|    JAYDEN|
|     MOSHE|
|     ETHAN|
+----------+
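
On Spark 2.2 and later, the wholeTextFiles workaround is not needed: the multiLine option from the first example reads Babynames.json directly. A quick sketch, assuming the same file path as above:

// Assumes Spark 2.2+ and the same file path used with wholeTextFiles above
val namesDF = spark.read.option("multiLine", true).json("file:///home/cloudera/Desktop/Spark/RSJSON/Babynames.JSON")
namesDF.filter(namesDF("Count") > 250).show()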

