Hemant Vishwakarma: parse xlsx file having merged cells using python or pyspark

Tuesday, 24 January 2023

parse xlsx file having merged cells using python or pyspark

I want to parse an xlsx file. Some of the cells in the file are merged and working as a header for the underneath values.
But do not know what approach I should select to parse the file.

Shall I parse the file from xlsx to json format and then I should perform the pivoting or transformation of dataset. OR
Shall proceed just by xlsx format and try to read the specific cell values- but I believe this approach will not make the code scalable and dynamic.

I tried to parse the file and tried to convert to json but it did not load the all the records. unfortunately, it is not throwing any exception.


from json import dumps
from xlrd import open_workbook

# load excel file
wb = open_workbook('/dbfs/FileStore/tables/filename.xlsx')

# get sheet by using sheet name
sheet = wb.sheet_by_name('Input Format')

# get total rows
total_rows = sheet.nrows

# get total columns
total_columns = sheet.ncols

# convert each row of sheet name in Dictionary and append to list
lst = []
for i in range(0, total_rows):
    row = {}
    for j in range(0, total_columns):
        if i + 1 < total_rows:
            column_name = sheet.cell(rowx=0, colx=j)
            row_data = sheet.cell_value(rowx=i+1, colx=j)

            row.update(
                {
                    column_name.value: row_data
                }
            )

    if len(row):
        lst.append(row)


# convert into json
json_data = dumps(lst)
print(json_data)

After executing the above code I received following type of output:

  {
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FELIX PARTY.MIX",
    "": 2.9969042460942
  },
  {
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FRISKIES ESTERILIZADOS",
    "": 2.0046260994622
  },

Once the data will be in good shape then spark-databricks should be used for the transformation.
I tried multiple approaches but failed :( Hence seeking help from the community.

For more clarity on the question I have added sample input/output screenshot as following. Input dataset: