Hemant Vishwakarma: Library to infer field-delimited file information

Thursday, 4 October 2018

Users upload various "unknown" field-separated files (I have zero control over, or even knowledge of, what they will be, other than that they will end in "v"). I would like to know whether there are existing libraries (hopefully in Python) that can infer the following information about an unknown field-separated file:

What line number the header is on.
Whether there is a header or not.
What the separator is.
Whether any rows are skipped after the header.

In the above example, the header would start on line 2, and the data would start on line 4 (the separator here is a tab, but that's not shown in the grid above).

Are there any open-source libraries (ML/AI?) that try to infer file heading information based on the first ~100 lines of data or so? Here's one approach I found through a Google search, but it doesn't specify any software packages: https://www.computer.org/csdl/proceedings/hpcc/2016/4297/00/07828554.pdf.

Update: essentially, I'm looking for a library (in any language) to which I could pass the first ~100 rows of data and which would make an educated guess at (1) what line the header is on; (2) what line the data starts on; and (3) what the delimiter is.
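To make the three guesses concrete, here is a minimal heuristic sketch in Python. It is not an existing library. The function name `guess_layout`, the candidate delimiter list, and the "header cells are non-numeric text" rule are all assumptions made for illustration. Python's standard `csv.Sniffer` already offers `sniff()` for the delimiter and `has_header()` for the first row, but it does not report which line the header is on, so the sketch counts fields per line instead.

```python
import csv
from collections import Counter


def split_line(line, delimiter):
    # Parse one line on its own so that results map back to line numbers.
    return next(csv.reader([line], delimiter=delimiter), [])


def looks_numeric(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False


def guess_layout(lines, candidates=(",", "\t", ";", "|")):
    """Guess delimiter, header line and first data line (1-based).

    Hypothetical heuristic: the right delimiter is the one that gives the
    same field count (greater than 1) on the largest number of lines.
    """
    best = None  # (matching line count, field count, delimiter)
    for delim in candidates:
        widths = [len(split_line(line, delim)) for line in lines]
        common = Counter(w for w in widths if w > 1).most_common(1)
        if common:
            width, score = common[0]
            if best is None or (score, width) > best[:2]:
                best = (score, width, delim)
    if best is None:
        return None  # no candidate splits any line into several fields

    _, width, delim = best
    # Lines with the expected field count; titles and blank lines drop out.
    full_rows = [i for i, line in enumerate(lines, start=1)
                 if len(split_line(line, delim)) == width]
    first = split_line(lines[full_rows[0] - 1], delim)
    later = [split_line(lines[i - 1], delim) for i in full_rows[1:]]

    # Assumed rule: a header is all non-empty text above rows with numbers.
    first_is_text = all(c.strip() and not looks_numeric(c) for c in first)
    later_has_numbers = any(looks_numeric(c) for row in later for c in row)
    if first_is_text and later_has_numbers:
        return {"delimiter": delim, "header_line": full_rows[0],
                "data_line": full_rows[1]}
    return {"delimiter": delim, "header_line": None,
            "data_line": full_rows[0]}


# Made-up sample: title on line 1, header on line 2, blank line 3, data from 4.
sample = ["Quarterly export", "id\tname\tamount", "", "1\tAlice\t10.5", "2\tBob\t7"]
print(guess_layout(sample))
```

A file whose data columns are all text would be reported as having no header, and quoted fields that span several lines are not handled, so this is only a starting point for the kind of inference described above.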

from Library to infer field-delimited file information
