Friday, 12 October 2018

How to incorporate metadata into NLTK corpus for efficient processing

I have a folder of txt files and a CSV file with additional data for each document, such as the categories a particular txt document belongs to and the path to the original source file (a PDF). The txt file name is used as the key into the CSV file.
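
For reference, the metadata can be loaded once and indexed by file name. A minimal sketch, assuming a hypothetical "metadata.csv" with hypothetical column names "filename", "category" and "source_pdf":

    import pandas as pd

    # Index the CSV by the txt file name so any fileid gives an O(1)
    # lookup of its category and source-pdf path.
    meta = pd.read_csv("metadata.csv", index_col="filename")

    row = meta.loc["report_001.txt"]   # hypothetical fileid
    print(row["category"], row["source_pdf"])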

I have created a basic NLTK corpus, but I would like to know whether that is the best way to structure my data, given that I want to carry out a range of NLP tasks (such as NER) on the corpus, eventually identify the entities that occur in each category, and be able to link back to the source PDF files so each entity can be seen in context.
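
For concreteness, a basic reader over the folder might look like this sketch (the root path and file pattern are assumptions):

    from nltk.corpus import PlaintextCorpusReader

    # Every .txt file in the folder becomes one corpus document, and
    # each document keeps its file name as its fileid.
    corpus = PlaintextCorpusReader("txt_files/", r".*\.txt")

    print(corpus.fileids())   # the same names that key the CSV
    print(corpus.words(corpus.fileids()[0])[:10])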

Most NLP examples (e.g. finding named entities) go from a corpus to plain Python lists of entities, but doesn't that mean I will lose the association back to the txt file that contained the entities, and with it all the other associated data?
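
To make that concern concrete, the usual example pattern looks roughly like this sketch (continuing from the reader above, and assuming NLTK's stock tokenizer, POS tagger and NE chunker models are downloaded); the result is a flat list that no longer records which fileid each entity came from:

    import nltk

    entities = []
    for fileid in corpus.fileids():
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(corpus.raw(fileid))))
        for node in tree:
            if hasattr(node, "label"):   # named-entity subtrees only
                entities.append(" ".join(tok for tok, tag in node.leaves()))
    # entities is now just a list of strings; the fileid is gone.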

A categorized corpus reader appears to help with keeping the category data (see the sketch below).
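
For example, NLTK's CategorizedPlaintextCorpusReader can take its categories straight from the CSV via a cat_map. A sketch, reusing the hypothetical "category" column and the meta frame from above:

    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # Map each fileid to its list of categories, pulled from the CSV
    # (a file with several categories would get a longer list).
    cat_map = {fileid: [meta.loc[fileid, "category"]] for fileid in meta.index}

    corpus = CategorizedPlaintextCorpusReader("txt_files/", r".*\.txt",
                                              cat_map=cat_map)

    print(corpus.categories())
    print(corpus.fileids(categories=corpus.categories()[0]))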

My question, then: what is the best way to structure and work with my corpus so that I avoid having to roundtrip between:

1. processing the corpus to identify interesting information and outputting it to a list,
2. searching the corpus again to find the files which contain each interesting element from the list, and
3. looking up the CSV (data frame) by file id to get the rest of the metadata?
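
One way to avoid that roundtrip, sketched below under the same assumptions as the earlier snippets, is never to discard the fileid: collect entities per document and join the CSV row on the spot, so every entity record already carries its category and source-pdf path.

    import nltk
    import pandas as pd

    records = []
    for fileid in corpus.fileids():
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(corpus.raw(fileid))))
        for node in tree:
            if hasattr(node, "label"):
                records.append({
                    "entity": " ".join(tok for tok, tag in node.leaves()),
                    "entity_type": node.label(),   # e.g. PERSON, ORGANIZATION
                    "fileid": fileid,
                    "category": meta.loc[fileid, "category"],
                    "source_pdf": meta.loc[fileid, "source_pdf"],
                })

    # A single dataframe now answers "which entities occur in each
    # category?" and links every entity back to its source pdf.
    df = pd.DataFrame(records)
    print(df.groupby(["category", "entity"]).size())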



from How to incorporate metadata into NLTK corpus for efficient processing
