Loading Multiline JSON File as a DataFrame ready to insert as a Table Example
Our JSON File 1.json:
{
"user": "gT35Hhhre9m",
"dates": ["2016-01-29", "2016-01-28"],
"status": "OK",
"reason": "some reason",
"content": [{
"foo": 123,
"bar": "val1"
}, {
"foo": 456,
"bar": "val2"
}, {
"foo": 789,
"bar": "val3"
}, {
"foo": 124,
"bar": "val4"
}, {
"foo": 126,
"bar": "val5"
}]
}
Expected output is:
-----------+---------+-------------+-----------+-------+-------
user |status | reason | dates | foo | bar |
--------------------------------------------------------------
gT35Hhhre9m| OK | some reason |2016-01-29 | 123 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-28 | 123 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-29 | 456 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-28 | 456 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-29 | 789 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-28 | 789 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-29 | 124 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-28 | 124 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-29 | 126 | val1 |
gT35Hhhre9m| OK | some reason |2016-01-28 | 126 | val1 |
-----------+---------+-------------+-----------+-------+-------
Code:
val df1 = saprk.read.option("multiline",true).json("file:///Leela/Filesparsing/1.json")
val df2 = df1.withColumn("contents",explode(df1("content"))).withColumn("dates",explode(df1("dates")))
df2.select("user","status","reason","dates","contents.foo","contents.bar").show
Let us consider the JSON File as below.
[{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
}, {
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}, {
"Year": "2013",
"First Name": "JAYDEN",
"County": "QUEENS",
"Sex": "M",
"Count": "219"
}, {
"Year": "2013",
"First Name": "MOSHE",
"County": "KINGS",
"Sex": "M",
"Count": "219"
}, {
"Year": "2013",
"First Name": "ETHAN",
"County": "QUEENS",
"Sex": "M",
"Count": "216"
}]
Note: sqlContext.read.json cannot be able to read the above JSON file because it is a multi line JSON file. In apache documentation it is clearly mentioned as "Each line must contain a separate, self-contained valid JSON object." only then, the below one is valid.
val path = "examples/src/main/resources/people.json" val peopleDF = spark.read.json(path)
Need to use wholeTextFiles(JSONFileName) so that a Key-Value pair is created with key as the file name and value as complete file content. This will ensure that the JSON file is not split and can be parsed successfully
scala> val jsonRDD = sc.wholeTextFiles("file:///home/cloudera/Desktop/Spark/RSJSON/Babynames.JSON",2).map(x => x._2)
jsonRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at map at <console>:27
scala> val namesJSON = sqlContext.read.json(jsonRDD)
namesJSON: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string]
scala> namesJSON.printSchema
root
|-- Count: string (nullable = true)
|-- County: string (nullable = true)
|-- First Name: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Year: string (nullable = true)
scala> namesJSON.filter(namesJSON("count") > 250).show()
+-----+------+----------+---+----+
|Count|County|First Name|Sex|Year|
+-----+------+----------+---+----+
| 272| KINGS| DAVID| M|2013|
| 268| KINGS| JAYDEN| M|2013|
+-----+------+----------+---+----+
scala> namesJSON.select("First Name").show()
+----------+
|First Name|
+----------+
| DAVID|
| JAYDEN|
| JAYDEN|
| MOSHE|
| ETHAN|
+----------+
Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.
ReplyDeletehttps://www.emexotechnologies.com/online-courses/big-data-hadoop-training-in-electronic-city/
Nice Post thanks
ReplyDeletealternatives to kissanime
great article,keep sharing more posts with us.
ReplyDeletethank you...
big data online training