I'm trying to create Hive/Impala tables based on Avro files in HDFS. The tool for doing the transformations is Spark.
I can't use spark.read.format("avro") to load the data into a dataframe, because the doc part (the description of each column) is lost that way. I can see the doc by doing:
input = sc.textFile("/path/to/avrofile")
avro_schema = input.first() # not sure what type it is
The problem is that the schema is nested, and I'm not sure how to traverse it to map each field's doc to the corresponding column description in the dataframe. I'd like the doc to end up as the column description of the table. For example, the input schema looks like:
"fields": [
{
"name":"productName",
"type": [
"null",
"string"
],
"doc": "Real name of the product",
"default": null
},
{
"name" : "currentSellers",
"type": [
"null",
{
"type": "record",
"name": "sellers",
"fields":[
{
"name": "location",
"type":[
"null",
{
"type": "record",
"name": "sellerlocation",
"fields": [
{
"name":"locationName",
"type": [
"null",
"string"
],
"doc": "Name of the location",
"default":null
},
{
"name":"locationArea",
"type": [
"null",
"string"
],
"doc": "Area of the location",#The comment needs to be added to table comments
"default":null
.... #These are nested fields
In the final table, for example, one field name would be currentSellers_locationName, with the column description "Name of the location". Could someone please shed some light on how to parse the schema and add the doc to the description? Could you also explain what the bit below, which sits outside the innermost fields, is about? Many thanks. Let me know if I can explain it better.
"name" : "currentSellers",
"type": [
"null",
{
"type": "record",
"name": "sellers",
"fields":[
{
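For reference, in an Avro schema a "type" written as a JSON array is a union, so the snippet above declares currentSellers as an optional field: either null or a nested record named sellers, which carries its own "fields" list. One possible way to traverse such a schema is sketched below in plain Python. It is only a sketch under assumptions: the schema JSON is already available as a string, the outer record name "product" is made up for the example, nested fields are named top-level field + "_" + leaf field (to match currentSellers_locationName), and arrays or maps of records are not handled. The helper name collect_docs is hypothetical.

```python
import json

# Trimmed, valid version of the schema from the question.
# The outer record name "product" is an assumption made for this example.
SCHEMA_JSON = """
{"type": "record", "name": "product", "fields": [
  {"name": "productName", "type": ["null", "string"],
   "doc": "Real name of the product", "default": null},
  {"name": "currentSellers", "type": ["null",
    {"type": "record", "name": "sellers", "fields": [
      {"name": "location", "type": ["null",
        {"type": "record", "name": "sellerlocation", "fields": [
          {"name": "locationName", "type": ["null", "string"],
           "doc": "Name of the location", "default": null},
          {"name": "locationArea", "type": ["null", "string"],
           "doc": "Area of the location", "default": null}
        ]}]}
    ]}]}
]}
"""


def collect_docs(fields, root=None):
    """Map flattened column names to the doc strings of leaf fields.

    Assumed naming rule: a nested leaf becomes <top-level field>_<leaf field>;
    intermediate record fields (such as "location") are dropped from the name.
    """
    docs = {}
    for field in fields:
        ftype = field["type"]
        # A union is a list of branches; wrap a bare type so both look alike.
        branches = ftype if isinstance(ftype, list) else [ftype]
        records = [b for b in branches
                   if isinstance(b, dict) and b.get("type") == "record"]
        if records:
            # Descend into the nested record, keeping the top-level name as prefix.
            top = root or field["name"]
            for record in records:
                docs.update(collect_docs(record["fields"], top))
        elif "doc" in field:
            column = field["name"] if root is None else f"{root}_{field['name']}"
            docs[column] = field["doc"]
    return docs


schema = json.loads(SCHEMA_JSON)
docs = collect_docs(schema["fields"])
for column, doc in docs.items():
    print(column, "->", doc)
```

The resulting dictionary could then be used to attach the descriptions when the table is created, for instance as COMMENT clauses in the table DDL; how exactly that is wired up depends on the Spark and Hive versions in use.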
from How to extract doc from avro data and add it to dataframe