I have a 5-page PDF file, and each page contains a table. I need to extract the tables from all pages and save them as a single data frame, all in Python, so I first converted the file to CSV using tabula:
import tabula
tabula.convert_into('input.pdf', 'output.csv', output_format='csv', pages='all')
The main issue with the file output.csv is that there are several extra commas.
Example
Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361
When I load the CSV file into rows and columns, some lines are offset, and the table from each page of the file has its own specific offset. How can I fix this problem?
NB: The data frame should have 6 columns, some with empty fields. I guess the extra commas come from spacing in the PDF file. How can I remove the extra commas from the CSV file, or the extra spacing from the PDF file?
The expected output is one table in which every value sits under its correct header (Id, Name, Age, Score, Rang, Bonus).
I would really appreciate your help.
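One possible approach, sketched under assumptions that hold for the sample above but may not hold for the whole file: every empty field is treated as spurious, the header is repeated on each page, and each data row has either all six values or only four (Id, Name, Rang, Bonus, with Age and Score both missing). The function name `clean_rows` and the `COLUMNS` list are illustrative, not part of tabula. Rows that fit neither pattern are ambiguous, so the sketch raises an error for them instead of guessing.

```python
import csv

# Assumed target layout, taken from the header line in output.csv.
COLUMNS = ["Id", "Name", "Age", "Score", "Rang", "Bonus"]


def clean_rows(lines):
    """Yield 6-field rows from tabula's CSV lines, dropping empty fields."""
    for fields in csv.reader(lines):
        # Keep only the non-empty fields; this removes the extra commas.
        values = [f.strip() for f in fields if f.strip()]
        if not values:
            continue  # completely blank line
        if values == COLUMNS:
            continue  # header repeated at the top of each page
        if len(values) == 6:
            yield values
        elif len(values) == 4:
            # Assumption: Id and Name come first, Rang and Bonus last,
            # and Age and Score are the two missing values.
            yield values[:2] + ["", ""] + values[2:]
        else:
            # E.g. five values: unclear whether Age or Score is missing.
            raise ValueError(f"cannot align row: {fields}")


sample = [
    "Id,Name,Age,,Score,Rang,Bonus",
    "181,ALEX,,,,20,987",
    "184,Julien,,0,175,60,9.095",
    "419,Rym,10,,18,,10,7.196",
    '"",Id,Name,Age,,Score,Rang,Bonus',
    "456,,Joe,,10,13,0,74.917",
]
rows = list(clean_rows(sample))
```

To process the real file, pass an open file object instead of `sample` (for example `open('output.csv', newline='')`), then build the data frame with `pandas.DataFrame(rows, columns=COLUMNS)`. If some rows raise `ValueError`, the column position of the non-empty fields would be needed to decide where the values belong.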