I'm following the code from this video at 1m25s, which shows:
def tokenize_and_chunk(texts):
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True
    )

tokenized_datasets = raw_datasets.map(
    tokenize_and_chunk, batched=True, remove_columns=["text"]
)
Here's my version of that code, followed by the error I get when I run it:
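import pandas as pd
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
context_length = 1000

def tokenize_and_chunk(texts):
    # Split each text into chunks of at most context_length tokens
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True,
    )

# One row with a very long text, so it overflows into several chunks
dataset = Dataset.from_pandas(pd.DataFrame([{"id": "123", "text": "Here are many words! "*5000}]))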
Printing the dataset shows that it looks fine:
Dataset({
features: ['id', 'text'],
num_rows: 1
})
OK, let's run the tokenizer:
toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])
0%
0/1 [00:00<?, ?ba/s]
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-69-d1216744e2ab> in <module>
----> 1 toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])
ArrowInvalid: Column 1 named id expected length 5 but got length 1
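One plausible reading of this error: with return_overflowing_tokens=True the tokenizer returns one row per chunk (5 here), while the untouched "id" column still has one row per input (1 here), so the batch cannot be assembled into a single table. The sketch below illustrates that mismatch and one way to repair it by repeating each id once per chunk. It makes assumptions: fake_tokenizer is a hypothetical whitespace-splitting stand-in (not the real t5-base tokenizer) so the example runs without downloading a model, and tokenize_and_chunk_keep_ids is an illustrative name. The "overflow_to_sample_mapping" key mirrors what fast Hugging Face tokenizers return alongside overflowing tokens.

```python
def fake_tokenizer(texts, max_length):
    """Hypothetical stand-in for a fast tokenizer called with
    return_overflowing_tokens=True: it emits one output row per chunk."""
    input_ids, sample_map = [], []
    for i, text in enumerate(texts):
        tokens = text.split()
        for start in range(0, len(tokens), max_length):
            input_ids.append(tokens[start:start + max_length])
            sample_map.append(i)  # index of the input row this chunk came from
    return {"input_ids": input_ids, "overflow_to_sample_mapping": sample_map}


def tokenize_and_chunk_keep_ids(batch, max_length):
    out = fake_tokenizer(batch["text"], max_length)
    sample_map = out.pop("overflow_to_sample_mapping")
    # Repeat each id once per chunk so every column has the same length
    out["id"] = [batch["id"][i] for i in sample_map]
    return out


# 40 whitespace-separated tokens split into chunks of 8 tokens each
batch = {"id": ["123"], "text": ["Here are many words! " * 10]}
chunks = tokenize_and_chunk_keep_ids(batch, max_length=8)
print(len(chunks["input_ids"]), len(chunks["id"]))  # prints: 5 5
```

If the id is not needed afterwards, a simpler option that should also avoid the mismatch is to drop every original column in the map call, for example remove_columns=dataset.column_names instead of remove_columns=["text"].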
from Chunked tokenization in huggingface has an arrow error