Hemant Vishwakarma: Chunked tokenization in huggingface has an arrow error

Friday, 21 January 2022

Chunked tokenization in huggingface has an arrow error

I'm following the code from this video at 1m25s, which shows:

def tokenize_and_chunk(texts):
  return tokenizer(
    texts["text"], truncation=True, max_length=context_length,
    return_overflowing_tokens=True,
  )

tokenized_datasets = raw_datasets.map(
  tokenize_and_chunk, batched=True, remove_columns=["text"]
)

Here's my version of this code, which reproduces the error I get:

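import pandas as pd
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

context_length = 1000

def tokenize_and_chunk(texts):
    # Split each text into chunks of at most context_length tokens.
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True,
    )

# A single row whose text is long enough to overflow into several chunks.
dataset = Dataset.from_pandas(pd.DataFrame([{"id": "123", "text": "Here are many words! "*5000}]))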

Printing the dataset shows it looks fine:

Dataset({
    features: ['id', 'text'],
    num_rows: 1
})

OK, let's run the tokenizer:

toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

0% 0/1 [00:00<?, ?ba/s]

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-69-d1216744e2ab> in <module>
----> 1 toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

ArrowInvalid: Column 1 named id expected length 5 but got length 1
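
A likely reading of this message: with return_overflowing_tokens=True the tokenizer returns one row per chunk (5 here), while the untouched "id" column still has one value per input row (1 here), so the columns no longer line up. The sketch below shows one common way around that, repeating the other columns once per chunk using the "overflow_to_sample_mapping" key that the tokenizer returns alongside the chunks. The names make_tokenize_and_chunk and whitespace_tokenizer are hypothetical, and whitespace_tokenizer is only a stand-in so the example runs without downloading a model; it is not the Hugging Face tokenizer.

```python
def make_tokenize_and_chunk(tokenizer, context_length):
    """Build a batched map function that keeps other columns aligned with the chunks."""
    def tokenize_and_chunk(batch):
        result = tokenizer(
            batch["text"],
            truncation=True,
            max_length=context_length,
            return_overflowing_tokens=True,
        )
        # One entry per output chunk: the index of the input row it came from.
        sample_map = result.pop("overflow_to_sample_mapping")
        # Repeat every other column (here "id") once per chunk.
        for key, values in batch.items():
            if key != "text":
                result[key] = [values[i] for i in sample_map]
        return result
    return tokenize_and_chunk


def whitespace_tokenizer(texts, truncation, max_length, return_overflowing_tokens):
    """Hypothetical stand-in for a real tokenizer: splits on whitespace and chunks."""
    input_ids, sample_map = [], []
    for row, text in enumerate(texts):
        tokens = text.split()
        for start in range(0, len(tokens), max_length):
            input_ids.append(tokens[start:start + max_length])
            sample_map.append(row)
    return {"input_ids": input_ids, "overflow_to_sample_mapping": sample_map}


# One input row that overflows into several chunks, as in the question.
batch = {"id": ["123"], "text": ["Here are many words! " * 5]}
chunked = make_tokenize_and_chunk(whitespace_tokenizer, context_length=8)(batch)

# Every column now has one entry per chunk.
print(len(chunked["input_ids"]), chunked["id"])
```

With the real objects from the question, the same function would be passed as dataset.map(make_tokenize_and_chunk(tokenizer, context_length), batched=True, remove_columns=["text"]). If the "id" column is not needed afterwards, a simpler alternative is to keep the original tokenize_and_chunk and drop every original column with remove_columns=dataset.column_names.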
