I am attempting to tokenize a single column in a TensorFlow Dataset. The approach I've been using works well when there is only a single feature column, for example:
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
df = pd.DataFrame({"text": text,
                   "target": target})

training_dataset = (
    tf.data.Dataset.from_tensor_slices((
        tf.cast(df.text.values, tf.string),
        tf.cast(df.target, tf.int32))))

tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for text, _ in training_dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
However, when there is a set of named feature columns, such as the output of make_csv_dataset, the above methodology fails with ValueError: Attempt to convert a value (OrderedDict([]) to a Tensor.
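For context, make_csv_dataset yields (features, label) pairs where features is an OrderedDict mapping column names to batched values, so the single-tensor unpacking in the first loop no longer applies. A plain-Python sketch of the shape of each element (the values here are illustrative stand-ins for the real tensors):

```python
from collections import OrderedDict

# Each element of a make_csv_dataset pipeline looks roughly like this:
# an OrderedDict of column-name -> batch of values, plus the label batch.
features = OrderedDict([
    ("text", [b"I played it a while", b"This game is a bit hard"]),
    ("gender", [1, 0]),
    ("age", [45, 35]),
])
labels = [0, 1]

# Iterating `for text, _ in dataset` therefore binds `text` to the whole
# OrderedDict, and tf.strings.lower(<OrderedDict>) raises the ValueError above.
text_batch = features["text"]  # pulling out one column by name works
```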
I attempted to reference a specific feature column within the for loop using:
text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
gender = [1, 0]
age = [45, 35]
df = pd.DataFrame({"text": text,
                   "target": target,
                   "gender": gender,
                   "age": age})
df.to_csv('test.csv', index=False)

dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target')

tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
I get the error: Expected binary or unicode string, got array([]). What is the proper way to reference a single feature column so that I can tokenize? Typically you can reference a feature column using the features['column_name'] approach within a .map function, for example:
def new_age_func(features, target):
    age = features['age']
    features['age'] = age / 2
    return features, target

dataset = dataset.map(new_age_func)

for features, target in dataset.take(2):
    print('Features: {}, Target {}'.format(features, target))
I tried combining approaches and generating the vocabulary list via a map function.
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()

def vocab_generator(features, target):
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

dataset = dataset.map(vocab_generator)
but this leads to the error:

AttributeError: in user code:

    <ipython-input-61-374e4c375b58>:10 vocab_generator  *
        tokens = tokenizer.tokenize(text.numpy())

    AttributeError: 'Tensor' object has no attribute 'numpy'
and changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) throws another error: TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>.
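One way around both errors, assuming the batched pipeline above: build the vocabulary in an eager Python loop rather than inside .map (functions passed to .map are traced as graphs, so .numpy() is unavailable there), and tokenize each element of the batch individually, since the tokenizer expects a single string while features['text'] is a batch of shape (2,). Sketched here with a regex tokenizer standing in for tfds.features.text.Tokenizer so the snippet is self-contained:

```python
import re
from collections import Counter

def tokenize(s):
    # Stand-in for tfds.features.text.Tokenizer().tokenize on ONE string.
    return re.findall(r"[A-Za-z0-9]+", s)

vocabulary = Counter()
# In the real pipeline this would be: for features, _ in dataset:
for features, _ in [({"text": [b"I played it a while",
                               b"This game is a bit hard"]}, [0, 1])]:
    # features["text"] is a *batch*, so loop over its elements
    # (with TensorFlow this would be: for raw in features["text"].numpy()).
    for raw in features["text"]:
        tokens = tokenize(raw.decode("utf-8").lower())
        vocabulary.update(tokens)
```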
From: Referencing and tokenizing single feature column in multi-feature TensorFlow Dataset