Tuesday, 16 April 2019

Process Unicode strings in Python

I am using a fastText pre-trained model based on English Wikipedia. It works as expected:

https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_english.ipynb

But when I run the same code with a model for another language (Marathi), I get an error, as shown in this notebook:

https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_marathi.ipynb

The error is related to unicode:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte
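
This message means that Python found a byte that cannot begin a UTF-8 character while decoding the file as text. A minimal sketch that reproduces the same kind of error follows; the byte string is made up for illustration and is not taken from the model file:

```python
# Hypothetical sample: ASCII text with a stray 0x80 byte at index 3.
data = b"abc\x80def"

try:
    data.decode("utf-8")
    message = "decoded without error"
except UnicodeDecodeError as exc:
    # 0x80 is a UTF-8 continuation byte, so it cannot start a character.
    message = str(exc)
    bad_position = exc.start

print(message)
```

A byte like this usually suggests that the file is not plain UTF-8 text at all, for example because it is compressed or stored in a binary format, though I have not confirmed that this is the cause here.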

I tried opening the file in raw binary mode by changing the open call in the load_words_raw function in load.py:

with open(file_path, 'rb') as f:

And now I get a different error:

ValueError: could not convert string to float: b'\x00l\x02'

I have no idea how to handle this.
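
Opening the file with 'rb' only skips the decoding step. The bytes stay the same, so code that expects numbers written as text will still fail when it reaches binary data (for instance, float(b'\x00l\x02') raises the ValueError above). One way to narrow the problem down is to inspect the first bytes of the file before parsing it. The sketch below is only an illustration: sniff_vector_file is a hypothetical helper, not part of load.py or fastText, and it assumes the loader expects a plain UTF-8 text file.

```python
import codecs

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_vector_file(file_path, sample_size=4096):
    """Guess whether a file is gzip-compressed, UTF-8 text, or other binary data.

    Hypothetical helper for illustration; it looks only at the first
    sample_size bytes, so the result is a guess, not a guarantee.
    """
    with open(file_path, "rb") as f:
        sample = f.read(sample_size)

    if sample.startswith(GZIP_MAGIC):
        return "gzip"

    # An incremental decoder tolerates a multi-byte character that is cut
    # off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample)
    except UnicodeDecodeError:
        return "binary"
    return "text"
```

If the result is "gzip", the file probably needs to be decompressed first. If it is "binary", a text-based loader is unlikely to work whichever mode the file is opened in, and a text version of the vectors would be needed instead.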


