Thursday, 4 October 2018

Library to infer field-delimited file information

I have various "unknown" field-separated files that are being uploaded by users (I have zero control over, or even knowledge of, what they will be, other than that they will end in "v"), and I would like to know whether there are existing libraries (ideally in Python) that infer the following information about an unknown field-separated file:

  • What line number the header is on.
  • Whether there is a header or not.
  • What the separator is.
  • Whether any rows are skipped after the header.
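
For context, Python's standard library already covers two of these points: `csv.Sniffer` can guess the separator and whether the first row looks like a header. It does not report which line the header is on or whether rows are skipped, so it is only a partial fit. A minimal sketch, where the `sniff_sample` helper, the candidate delimiter set, and the sample text are all assumptions made for illustration:

```python
import csv

def sniff_sample(sample: str, candidates: str = ",\t;|"):
    """Guess the delimiter and header presence from a text sample.

    `candidates` is an assumed set of plausible delimiters; adjust as needed.
    """
    sniffer = csv.Sniffer()
    # sniff() raises csv.Error if it cannot settle on a delimiter.
    dialect = sniffer.sniff(sample, delimiters=candidates)
    # has_header() only compares the first row against the rows that follow.
    return dialect.delimiter, sniffer.has_header(sample)

# Hypothetical sample: a header on the first line, tab-separated.
sample = "name\tage\tcity\nalice\t30\tParis\nbob\t25\tRome\ncarol\t41\tOslo\n"
delimiter, has_header = sniff_sample(sample)
```

Both guesses are heuristic and can be wrong on small or irregular samples, so the result is best treated as a suggestion to confirm rather than a final answer.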

[Image: example file shown as a grid]

In the above example, the header would start on line 2, and the data would start on line 4 (the separator here is a tab, but that's not shown in the grid above).
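
To make the desired output concrete, here is a rough sketch of one possible heuristic (not an existing library): treat the most common field count as the table width, take the first line with that width as the header, and take the next line with that width as the start of the data. The function name `locate_header_and_data` and the sample lines are hypothetical, and the naive `split` ignores quoted fields.

```python
from collections import Counter

def locate_header_and_data(lines, delimiter="\t"):
    """Return 1-based (header_line, data_line) using a field-count heuristic."""
    # Field count per line; blank lines count as zero fields.
    counts = [len(line.split(delimiter)) if line.strip() else 0 for line in lines]
    multi = [c for c in counts if c > 1]
    if not multi:
        return None, None
    # Assume the table width is the most common multi-field count.
    width = Counter(multi).most_common(1)[0][0]
    header = next(i for i, c in enumerate(counts) if c == width)
    data = next(
        (i for i in range(header + 1, len(lines)) if counts[i] == width), None
    )
    return header + 1, (data + 1 if data is not None else None)

# Hypothetical file: a title line, a header, a skipped blank row, then data.
lines = [
    "Report generated by export tool",
    "id\tname\tamount",
    "",
    "1\talice\t10.5",
    "2\tbob\t7.25",
]
header_line, data_line = locate_header_and_data(lines)
```

On this sample the heuristic places the header on line 2 and the data on line 4. It cannot tell a header from a first data row when the file has no header at all, which is part of why a dedicated library would be useful.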

Are there any open-source libraries (ML/AI?) that try to infer file heading information based on roughly the first 100 lines of data? Here's one approach I found through a Google search, but it doesn't name any software packages: https://www.computer.org/csdl/proceedings/hpcc/2016/4297/00/07828554.pdf.


Update: essentially, I'm looking for a library (in any language) to which I could pass the first ~100 rows of data and which would make an educated guess at (1) what line the header is on; (2) what line the data starts at; and (3) what the delimiter is.



