Hemant Vishwakarma: Error: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte in google colab

Thursday, 2 September 2021

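import PyPDF4
import tensorflow as tf
from google.colab import files

# Upload ITC-1.pdf into the Colab session.
files.upload()

# Extract the text of every page from the third page onward.
fileReader = PyPDF4.PdfFileReader('ITC-1.pdf')
s = ""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()

# Split the extracted text into sentences at each period.
sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index+1:]

# vectorize_layer is assumed to be the TextVectorization layer from the tutorial linked below.
text_ds = tf.data.TextLineDataset('ITC-1.pdf').filter(lambda x: tf.cast(tf.strings.length(x), bool))
vectorize_layer.adapt(text_ds.batch(1024))
inverse_vocab = vectorize_layer.get_vocabulary()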

The last line of the code above raises the error. I read several posts to understand what it means, but none of the suggested solutions worked for me. I cannot use my local machine because I need access to GPUs. Please suggest a workaround. Thanks!

PS: I am following the code here https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb#scrollTo=haJUNjSB60Kh, and the only difference is the way I read the file. If there are better ways to do it, please let me know!
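
One likely cause, offered as a sketch rather than a confirmed diagnosis: the code passes 'ITC-1.pdf' itself to tf.data.TextLineDataset, so the dataset reads the raw PDF bytes instead of the text extracted with PyPDF4, and get_vocabulary() then fails when it tries to decode those bytes as UTF-8. A possible workaround is to write the extracted sentences to a plain UTF-8 text file, one per line, and read that file instead. The helper names split_sentences and write_sentences, the output file name ITC-1.txt, and the sample text are hypothetical and only for illustration; the TensorFlow call is left as a comment so the sketch runs without TensorFlow installed.

```python
import tempfile
from pathlib import Path


def split_sentences(text):
    """Split text on periods and drop empty fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]


def write_sentences(sentences, path):
    """Write one sentence per line as UTF-8 text."""
    # Collapse internal whitespace (including newlines) so that each
    # sentence occupies exactly one line of the output file.
    lines = [" ".join(sentence.split()) for sentence in sentences]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Raw PDF bytes are not text: 0xb0 cannot start a UTF-8 sequence,
# which is the failure reported in the error message.
try:
    b"\xb0".decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Stand-in for the string `s` built from extractText() in the question.
sample_text = "First sentence. Second\nsentence. Third sentence."
sentences = split_sentences(sample_text)

# Hypothetical output file; any writable path in the Colab session works.
txt_path = Path(tempfile.mkdtemp()) / "ITC-1.txt"
write_sentences(sentences, txt_path)

# In Colab, the dataset would then read the text file, not the PDF:
# text_ds = tf.data.TextLineDataset(str(txt_path)).filter(
#     lambda x: tf.cast(tf.strings.length(x), bool))
```

With this change the rest of the tutorial code (adapt and get_vocabulary) would operate on decodable text lines. Whether the extracted text is clean enough for training still depends on how well extractText() handles the PDF.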
