This page shows how to build a Longformer-based text classifier with Hugging Face Transformers.
import pandas as pd
import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import wandb
import os
# Load the model and tokenizer, and define the maximum length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
                                                            gradient_checkpointing=False,
                                                            attention_window=512)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', model_max_length=1024)  # model_max_length is the kwarg the tokenizer reads; a bare max_length here is stored but not applied
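With the model and tokenizer in place, the rest of the pipeline tokenizes the data and fine-tunes with the Trainer API, which is what the imports above are for. Below is a minimal sketch: the 'train.csv' file, its 'text'/'label' columns, the binary-classification assumption, and all hyperparameters are placeholder assumptions, not values from the original post.

# Hypothetical CSV with 'text' and 'label' columns; substitute your own data.
train_data = datasets.Dataset.from_pandas(pd.read_csv('train.csv'))

def tokenize(batch):
    # Truncate/pad to the 1024-token budget set on the tokenizer above.
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=1024)

train_data = train_data.map(tokenize, batched=True)
train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
split = train_data.train_test_split(test_size=0.1, seed=42)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # average='binary' assumes a two-class problem; adjust for multi-class.
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {'accuracy': accuracy_score(labels, preds),
            'f1': f1, 'precision': precision, 'recall': recall}

training_args = TrainingArguments(
    output_dir='./results',           # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=1,    # long sequences are memory-hungry
    gradient_accumulation_steps=8,
    evaluation_strategy='epoch',
    logging_steps=50,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=split['train'], eval_dataset=split['test'],
                  compute_metrics=compute_metrics)
trainer.train()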
I noticed that the tokenizer is case-sensitive: the words "do" and "Do" get different token IDs, as shown below. I don't want this behavior. I can always lowercase my data before feeding it to the Longformer, but is there a better way to tell the tokenizer to ignore case? (One option using the fast tokenizer's normalizer is sketched after the example below.)
# "Do" (capitalized) encodes to token id 8275:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [0, 8275, 45, 31510, 459, 11, 5, 5185, 9, 44859, 6, 13, 51, 32, 12405, 8, 2119, 7, 6378, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# "do" (lowercase) encodes to a different token id, 5016:
encoded_input3 = tokenizer("do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input3)
{'input_ids': [0, 5016, 45, 31510, 459, 11, 5, 5185, 9, 44859, 6, 13, 51, 32, 12405, 8, 2119, 7, 6378, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
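One option beyond lowercasing the raw strings: fast tokenizers expose their backend tokenizer, whose normalizer can be replaced, so a Lowercase step can be prepended to the pipeline. This is a sketch of that approach, not an officially documented Longformer feature; keep in mind the model was pretrained on cased text, so lowercasing everything may cost some accuracy.

from tokenizers.normalizers import Lowercase, Sequence

# Prepend a Lowercase step so every input is lowercased before BPE;
# keep any existing normalizer in the chain.
existing = tokenizer.backend_tokenizer.normalizer
tokenizer.backend_tokenizer.normalizer = (
    Sequence([Lowercase(), existing]) if existing is not None else Lowercase()
)

# "Do" and "do" now produce identical input_ids (both start with token 5016):
assert tokenizer("Do not meddle")["input_ids"] == tokenizer("do not meddle")["input_ids"]

Calling the tokenizer on pre-lowercased strings is equivalent; the normalizer route just bakes the lowercasing into the tokenizer so nothing upstream can forget it.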
Source: "huggingface longformer case sensitive tokenizer"