import PyPDF4
from google.colab import files
files.upload()
fileReader = PyPDF4.PdfFileReader('ITC-1.pdf')
s=""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()
sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index+1:]
text_ds = tf.data.TextLineDataset('ITC-1.pdf').filter(lambda x: tf.cast(tf.strings.length(x), bool))
vectorize_layer.adapt(text_ds.batch(1024))
inverse_vocab = vectorize_layer.get_vocabulary()
The last line in the code above raises the error. I read several posts to understand what it means, but none of the solutions worked for me. I cannot use my local machine because I need access to GPUs. Please suggest a workaround. Thanks!
PS: I am following the code here https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb#scrollTo=haJUNjSB60Kh; the only difference is the way I am reading the file. If there are better ways to do it, please let me know!
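One possible workaround, sketched under assumptions: the error most likely comes from pointing tf.data.TextLineDataset at 'ITC-1.pdf' itself, so the dataset reads the raw PDF bytes instead of the text extracted with PyPDF4. Those bytes are not valid UTF-8 (0xb0 on its own is one example), so decoding the vocabulary fails. Writing the extracted sentences to a plain UTF-8 text file and building the dataset from that file may avoid the problem. The helper name `write_sentences`, the output file name `ITC-1.txt`, and the sample sentences are hypothetical and only stand in for the data from the question.

```python
from pathlib import Path


def write_sentences(sentences, path):
    """Write one non-empty sentence per line to a UTF-8 text file (hypothetical helper)."""
    lines = [" ".join(s.split()) for s in sentences]  # collapse newlines and extra spaces
    lines = [line for line in lines if line]          # drop empty sentences
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Stand-in for the `sentences` list built from the PDF text in the question.
sentences = ["Information theory\nstudies coding", "  ", "Entropy measures uncertainty"]
count = write_sentences(sentences, "ITC-1.txt")

# A raw PDF contains bytes such as 0xb0 that are not valid UTF-8 on their own,
# which is why decoding its contents as text fails.
try:
    b"\xb0".decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False

# In the notebook, the dataset would then be built from the text file, e.g.:
# text_ds = tf.data.TextLineDataset("ITC-1.txt").filter(
#     lambda x: tf.cast(tf.strings.length(x), bool))
```

Since `sentences` is already a Python list of strings, tf.data.Dataset.from_tensor_slices(sentences) may be another option that skips the intermediate file; whether it fits depends on how the rest of the tutorial pipeline consumes the dataset.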
from Error: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte in google colab