Hemant Vishwakarma: Python: How to annotate back to text the output of a transformers pipeline

I have a trained BERT model that I would like to use to annotate some text.

I am using the transformers pipeline for the NER task as follows:

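from transformers import AutoModelForTokenClassification, BertTokenizer, pipeline

# "<my_model_path>" is a placeholder for the directory of the trained model
model = AutoModelForTokenClassification.from_pretrained("<my_model_path>")
tokenizer = BertTokenizer.from_pretrained("<my_model_path>")

nlp_ner = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
)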

I then obtain the predictions by calling:

text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
result = nlp_ner(text)

The returned result is:

[{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None},
 {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None},
 {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None},
 {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None},
 {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None},
 {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None},
 {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None},
 {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None},
 {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None},
 {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None},
 {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None},
 {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None},
 {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]

The problem I am facing now is that I would like to map the predicted classes back onto the text itself. This looks complicated because the predictions are not indexed by splitting the text on spaces: a single word can be broken into several tokens (for example, "perovskite" comes back as 'per', '##ovsk', '##ite'), punctuation marks become tokens of their own, and the 'start' and 'end' offsets are all None.
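One possible way to recover the missing offsets is to walk through the original text and look up each returned token in order, dropping the "##" continuation prefix first. The sketch below is only an illustration under several assumptions: the helper name `add_offsets` is hypothetical, the tokenizer is assumed to lowercase the text (as the output above suggests), and tokens that cannot be found verbatim (for example `[UNK]`, or text altered by accent stripping) are simply left without offsets. It may also be worth checking whether loading a fast tokenizer (for example `BertTokenizerFast`) makes the pipeline fill in 'start' and 'end' directly, since character offsets generally come from fast tokenizers.

```python
def add_offsets(text, predictions, lowercase=True):
    """Return copies of the predictions with 'start'/'end' character offsets.

    Hypothetical helper: scans `text` left to right and locates each token
    after the end of the previous one.
    """
    # Assumption: lowercasing does not change the string length, which holds
    # for ordinary ASCII text but not for every Unicode character.
    haystack = text.lower() if lowercase else text
    cursor = 0
    aligned = []
    for pred in predictions:
        word = pred["word"]
        # WordPiece marks word continuations with a leading "##".
        piece = word[2:] if word.startswith("##") else word
        start = haystack.find(piece, cursor)
        if start == -1:
            # Token not found verbatim (e.g. [UNK]); leave it unaligned.
            aligned.append({**pred, "start": None, "end": None})
            continue
        end = start + len(piece)
        aligned.append({**pred, "start": start, "end": end})
        cursor = end
    return aligned


if __name__ == "__main__":
    sample_text = "The current one is quite confusing."
    sample_preds = [
        {"entity": "LABEL_1", "word": "quite"},
        {"entity": "LABEL_1", "word": "conf"},
        {"entity": "LABEL_9", "word": "##using"},
    ]
    for tok in add_offsets(sample_text, sample_preds):
        print(tok["entity"], tok["start"], tok["end"],
              repr(sample_text[tok["start"]:tok["end"]]))
```

Because the search always resumes after the previous match, repeated tokens such as the three dots in "..." are assigned to successive positions rather than to the same one.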

Is there a reasonably simple way to annotate the text with these predictions (for example, in doccano JSON format)?
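To make the target format concrete, here is a minimal sketch of how token predictions could be turned into one JSONL line for doccano's sequence-labeling import. It assumes each prediction already carries integer 'start' and 'end' character offsets (which the output above does not), that the record layout is `{"text": ..., "label": [[start, end, label], ...]}` (the key name has differed between doccano versions, so it is a parameter here), and that some labels such as a background class should be skipped. The function name `to_doccano_line` and its parameters are hypothetical.

```python
import json


def to_doccano_line(text, tokens, skip_labels=(), key="label"):
    """Build one JSONL line with [start, end, label] spans (hypothetical helper)."""
    spans = []
    for tok in tokens:
        if tok["entity"] in skip_labels or tok["start"] is None:
            continue
        # Merge a subword piece into the previous span when it has the same
        # label and starts exactly where that span ends.
        if spans and spans[-1][2] == tok["entity"] and spans[-1][1] == tok["start"]:
            spans[-1][1] = tok["end"]
        else:
            spans.append([tok["start"], tok["end"], tok["entity"]])
    return json.dumps({"text": text, key: spans}, ensure_ascii=False)


if __name__ == "__main__":
    example_tokens = [
        {"entity": "LABEL_1", "start": 0, "end": 5},    # "quite"
        {"entity": "LABEL_1", "start": 6, "end": 10},   # "conf"
        {"entity": "LABEL_9", "start": 10, "end": 15},  # "##using"
        {"entity": "LABEL_1", "start": 15, "end": 16},  # "."
    ]
    # Treating LABEL_1 as a background class is an assumption for this example.
    print(to_doccano_line("quite confusing.", example_tokens, skip_labels=("LABEL_1",)))
```

Writing one such line per document to a `.jsonl` file gives a file that can then be checked against the import format of the doccano version in use.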

My goal is to be able to say: for every "LABEL_9" prediction, highlight the corresponding part of the original text with a specific HTML class. Or, even simpler, find the start and end index of every word predicted as "LABEL_9".
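As an illustration of that goal, the sketch below wraps every span of a given label in a `<span>` with a chosen CSS class. It is only a sketch under assumptions: the predictions are assumed to already have integer 'start' and 'end' offsets, the spans for the chosen label are assumed not to overlap, and the names `highlight_label` and the class `"hl"` are hypothetical.

```python
from html import escape


def highlight_label(text, tokens, label, css_class):
    """Wrap every span predicted as `label` in <span class="css_class">."""
    spans = sorted(
        (tok["start"], tok["end"])
        for tok in tokens
        if tok["entity"] == label and tok["start"] is not None
    )
    parts = []
    cursor = 0
    for start, end in spans:
        # Escape untouched text so characters like "<" stay literal in HTML.
        parts.append(escape(text[cursor:start]))
        parts.append(f'<span class="{css_class}">{escape(text[start:end])}</span>')
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


if __name__ == "__main__":
    example_tokens = [
        {"entity": "LABEL_1", "start": 6, "end": 10},   # "conf"
        {"entity": "LABEL_9", "start": 10, "end": 15},  # "##using"
    ]
    print(highlight_label("quite confusing.", example_tokens, "LABEL_9", "hl"))
```

The sorted `(start, end)` pairs are also exactly the "start and end index" list asked for, so the same filter can be reused without the HTML step.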


Hemant Vishwakarma

Tuesday, 14 June 2022
