Hemant Vishwakarma: Parsing Index page in a PDF text book with Python

Tuesday, 25 September 2018

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as it is with the indentation into a CSV file.

Index page from PDF text book:

I should split the text into class and subclass type hierarchy along with the page numbers. For example in the image, Application server is the class and Apache Tomcat is the subclass in the page number 275

This is the expected output of the CSV:

I have used Tika parser to parse the PDF, but the indentation is not maintained properly (not unique) in the parsed content for splitting the text into class and subclasses.

This is how the parsed text looks like: