Friday, 21 January 2022

Chunked tokenization in huggingface has an arrow error

I'm following the code from this video at 1m25s, which shows:

def tokenize_and_chunk(texts):
  return tokenizer(
    texts["text"], truncation=True, max_length=context_length,
    return_overflowing_tokens=True
  )

tokenized_datasets = raw_datasets.map(
  tokenize_and_chunk, batched=True, remove_columns=["text"]
)

Here's my attempt to run this code, which ends in an error:

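import pandas as pd
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Maximum number of tokens per chunk
context_length = 1000

def tokenize_and_chunk(texts):
    # Split each long text into several chunks of at most context_length tokens
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True,
    )

# A single-row dataset whose text is far longer than context_length
dataset = Dataset.from_pandas(pd.DataFrame([{"id": "123", "text": "Here are many words! "*5000}]))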

Printing dataset shows a fine dataset:

Dataset({
    features: ['id', 'text'],
    num_rows: 1
})

OK, let's run the tokenizer:

toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

 0%
0/1 [00:00<?, ?ba/s]

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-69-d1216744e2ab> in <module>
----> 1 toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

ArrowInvalid: Column 1 named id expected length 5 but got length 1
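
One reading of the error message: with batched=True, every column that comes out of map must have the same number of rows, but chunking turns 1 input row into several output rows while the surviving id column still has 1. The sketch below illustrates that mismatch in plain Python. chunk_batch is a hypothetical stand-in for the tokenizer (it splits on whitespace, not real tokens), and the overflow_to_sample_mapping key mirrors the one that fast Hugging Face tokenizers return alongside overflowing tokens. Treat it as an illustration under those assumptions, not a verified fix.

```python
# Hypothetical stand-in for tokenizer(..., return_overflowing_tokens=True):
# splits each text into chunks of at most max_length whitespace "tokens".
def chunk_batch(batch, max_length):
    input_ids, sample_map = [], []
    for i, text in enumerate(batch["text"]):
        words = text.split()
        for start in range(0, len(words), max_length):
            input_ids.append(words[start:start + max_length])
            sample_map.append(i)  # index of the input row this chunk came from
    return {"input_ids": input_ids, "overflow_to_sample_mapping": sample_map}


def chunk_and_repeat_ids(batch, max_length):
    out = chunk_batch(batch, max_length)
    # Repeat each id once per chunk so every column has the same length.
    out["id"] = [batch["id"][i] for i in out["overflow_to_sample_mapping"]]
    return out


batch = {"id": ["123"], "text": ["Here are many words! " * 5]}
chunks = chunk_batch(batch, max_length=4)
fixed = chunk_and_repeat_ids(batch, max_length=4)

# 1 input row, 5 chunks: the mismatch. After repeating ids, the lengths agree.
print(len(batch["id"]), len(chunks["input_ids"]), len(fixed["id"]))
```

With the real tokenizer, the analogous options would presumably be to drop every original column (remove_columns=dataset.column_names) or to rebuild id from overflow_to_sample_mapping inside tokenize_and_chunk. Neither has been confirmed against this exact setup.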


