Tuesday 9 March 2021

Document Layout Analysis for text extraction

I need to analyze the layout structure of different document types, such as PDF, DOC, DOCX, ODT, etc.

My task: given a document, group the text into blocks and find the correct boundaries of each block.

I ran some tests with Apache Tika. It is a very good extractor, but it often messes up the order of the blocks. Let me explain what I mean by ORDER.

Apache Tika just extracts the text, so if my document has two columns, Tika extracts the entire text of the first column and then the text of the second column. That is usually fine, but sometimes the text in the first column is related to the text in the second, as in a table whose cells are related row by row.

So I must take the position of each block into account. The problems are:

  1. Define the box boundaries, which is hard: I need to work out whether a sentence starts a new block or not.

  2. Define the orientation. For example, in a table the "sentence" should be the row, NOT the column.

So basically I have to deal with the layout structure to correctly identify the block boundaries.
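To make the first problem concrete, here is a minimal sketch of one common heuristic: start a new block whenever the vertical gap between two consecutive lines is large compared with the line height. It assumes that some extractor already provides each text line with its top and bottom coordinates; the `Line` class, the `split_into_blocks` function, the `gap_factor` threshold and the coordinates are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    top: float     # y of the top edge (assumed to grow downward)
    bottom: float  # y of the bottom edge


def split_into_blocks(lines: List[Line], gap_factor: float = 0.6) -> List[List[Line]]:
    """Start a new block when the gap to the previous line exceeds
    gap_factor times the height of the previous line."""
    blocks: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.top):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.top - prev.bottom
            if gap <= gap_factor * (prev.bottom - prev.top):
                blocks[-1].append(line)  # small gap: same block
                continue
        blocks.append([line])  # first line, or large gap: new block
    return blocks


# Hypothetical coordinates: two tightly spaced lines, then a larger gap.
lines = [
    Line("First paragraph, line one", 100, 112),
    Line("first paragraph, line two", 114, 126),
    Line("Second paragraph", 140, 152),
]
blocks = split_into_blocks(lines)
block_texts = [" ".join(l.text for l in b) for b in blocks]
```

A fixed threshold like this is fragile on real documents (mixed font sizes, multi-column pages), which is exactly why the boundary problem is hard.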

Here is a visual example:

[image: a document with a column of years on the left and the related text on the right]

A classical extractor returns:

2019
2018
2017
2016
2015
2014
Oregon Arts Commission Individual Artist Fellowship...

This is wrong (in my case) because the dates are related to the text on the right.
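The reading order I am after could be sketched like this, assuming each text fragment comes with a bounding box: fragments whose vertical extents overlap are treated as one row, and each row is then read left to right. The `Fragment` class, the `group_rows` function, the `min_overlap` threshold, the coordinates, the pairing of 2019 with the fellowship text and the placeholder text for 2018 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    text: str
    x: float       # left edge
    top: float     # y of the top edge (assumed to grow downward)
    bottom: float  # y of the bottom edge


def group_rows(fragments: List[Fragment], min_overlap: float = 0.5) -> List[str]:
    """Group fragments that overlap vertically into rows, then read each row left to right."""
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.top):
        for row in rows:
            ref = row[0]
            overlap = min(ref.bottom, frag.bottom) - max(ref.top, frag.top)
            shortest = min(ref.bottom - ref.top, frag.bottom - frag.top)
            if shortest > 0 and overlap >= min_overlap * shortest:
                row.append(frag)
                break
        else:
            rows.append([frag])  # no existing row overlaps: start a new one
    return [" ".join(f.text for f in sorted(row, key=lambda f: f.x)) for row in rows]


# Fragments listed in column order, as a classical extractor would return them.
fragments = [
    Fragment("2019", 50, 100, 112),
    Fragment("2018", 50, 120, 132),
    Fragment("Oregon Arts Commission Individual Artist Fellowship", 120, 100, 112),
    Fragment("(text next to 2018)", 120, 120, 132),
]
rows = group_rows(fragments)
```

With these hypothetical boxes, each year ends up in the same row as the text next to it instead of in a separate column of dates.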

This task is a preparatory step for further NLP analysis, so it is very important. For example, when I need to recognize the entities (NER) in the text and then identify their relations, working with the correct context is essential.

How can I extract the text from a document and assemble related pieces of text into the same block, taking the layout structure of the document into account?


