Tuesday, 2 August 2022

How to extract doc from avro data and add it to dataframe

I'm trying to create Hive/Impala tables based on Avro files in HDFS. The tool for doing the transformations is Spark.

I can't use spark.read.format("avro") to load the data into a dataframe, because that way the doc part (the description of each column) is lost. I can see the doc by doing:

 input = sc.textFile("/path/to/avrofile")
 avro_schema = input.first() # not sure what type it is 

The problem is that the schema is nested, and I'm not sure how to traverse it to map each doc to the matching column description in the dataframe. I'd like each doc to become the column description in the table. For example, the input schema looks like:

"fields": [
    {
     "name":"productName",
     "type": [
       "null",
       "string"
      ],
     "doc": "Real name of the product",
     "default": null
    },
    {
     "name" : "currentSellers",
     "type": [
        "null",
        {
         "type": "record",
         "name": "sellers",
         "fields":[
             {
              "name": "location",
              "type":[
                 "null",
                  {
                   "type": "record",
                   "name": "sellerlocation",
                   "fields": [
                      {
                       "name":"locationName",
                       "type": [
                           "null",
                           "string"
                         ],
                       "doc": "Name of the location",
                       "default":null
                       },
                       {
                       "name":"locationArea",
                       "type": [
                           "null",
                           "string"
                         ],
                       "doc": "Area of the location", # This doc needs to be added to the table's column comments
                       "default":null
                         .... #These are nested fields 

In the final table, one field name would be, for example, currentSellers_locationName, with the column description "Name of the location". Could someone please shed some light on how to parse the schema and add the doc to the column description? Could you also explain what the bit below, which sits outside the nested fields, is about? Many thanks. Let me know if I can explain it better.

     "name" : "currentSellers",
     "type": [
        "null",
        {
         "type": "record",
         "name": "sellers",
         "fields":[
             {
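
On the wrapper shown above: in Avro, a "type" given as a list is a union, so ["null", {...}] means the field is nullable and, when present, holds the nested record named "sellers". The doc strings sit on the fields inside that record, which is why the traversal has to look through the union first. A minimal sketch of such a traversal follows, using only the standard json module. The sample schema is a hypothetical, trimmed completion of the excerpt (the top-level record name "product" is made up), and joining every level with an underscore is an assumption: it yields currentSellers_location_locationName rather than currentSellers_locationName, so the naming would need adjusting to match the real table.

```python
import json

# Hypothetical, trimmed completion of the schema excerpt in the question.
sample_schema = """
{
  "type": "record",
  "name": "product",
  "fields": [
    {"name": "productName", "type": ["null", "string"],
     "doc": "Real name of the product", "default": null},
    {"name": "currentSellers", "default": null,
     "type": ["null", {
       "type": "record", "name": "sellers",
       "fields": [
         {"name": "location", "default": null,
          "type": ["null", {
            "type": "record", "name": "sellerlocation",
            "fields": [
              {"name": "locationName", "type": ["null", "string"],
               "doc": "Name of the location", "default": null},
              {"name": "locationArea", "type": ["null", "string"],
               "doc": "Area of the location", "default": null}
            ]}]}
       ]}]}
  ]
}
"""


def unwrap_nullable(avro_type):
    """Return the single non-null branch of a ["null", X] union, else the type unchanged."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return avro_type


def collect_docs(fields, prefix=""):
    """Map flattened column names to the "doc" of each leaf field."""
    docs = {}
    for field in fields:
        name = prefix + field["name"]
        field_type = unwrap_nullable(field["type"])
        if isinstance(field_type, dict) and field_type.get("type") == "record":
            # Nested record: recurse, joining levels with "_" (assumed convention).
            docs.update(collect_docs(field_type["fields"], prefix=name + "_"))
        elif "doc" in field:
            # Leaf field: keep its doc; leaves without a doc are skipped.
            docs[name] = field["doc"]
    return docs


docs = collect_docs(json.loads(sample_schema)["fields"])
for column, description in docs.items():
    print(column, "->", description)
```

The resulting dictionary could then be used when writing the table, for instance to build the column comments in the table definition. Arrays, maps, and unions with more than one non-null branch are not handled here and would need their own cases.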
  


