Monday 25 October 2021

Increase tika heap size in Python with tika-python

Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)?

I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as follows, the errors go away:

C:\>java -Xmx1G -jar tika-app-2.1.0.jar

The -Xmx1G option sets the maximum heap size to 1 GByte (much larger than the default).

I've seen several answers for other languages, but none specific to Python with tika-python.

I've tried:

import os
os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
from tika import parser as tika_parser

and:

def main():
    global MODEL_LIST
    os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
    start_time = time.time()
    # ... rest of code ...

and from the Windows command line:

C:\<path>\findEm>set TIKA_JAVA_ARGS="-Xmx1G"
C:\<path>\findEm>python3 findEmv1.52.py

All three methods result in the same error, something like:

2021-10-19 14:43:55,782 [MainThread  ] [WARNI]  Tika server returned status: 500

I think the main problem is that the Java tika process is already running when I try to change the maximum heap size. Somehow I need to kill that process, set the maximum heap size, and restart the Java tika server. How?
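To make that hypothesis checkable, here is a minimal sketch using only the standard library. It rests on two assumptions I have not confirmed against my setup: that tika-python reads TIKA_JAVA_ARGS once, when the tika module is first imported, and that it reuses any server already listening on the default Tika server port 9998. The helper names tika_server_is_listening and request_tika_heap are made up for illustration and are not part of tika-python.

```python
import os
import socket
import sys


def tika_server_is_listening(host="localhost", port=9998, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    9998 is the default Tika server port; a True result only means that
    some process is listening there, not that it is definitely Tika.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def request_tika_heap(max_heap="1G"):
    """Set TIKA_JAVA_ARGS, refusing to do so after tika was imported.

    Assumption: tika-python reads this variable at import time, so
    setting it later would have no effect on the Java command line.
    """
    if "tika" in sys.modules:
        raise RuntimeError("tika is already imported; set TIKA_JAVA_ARGS earlier")
    os.environ["TIKA_JAVA_ARGS"] = f"-Xmx{max_heap}"
    return os.environ["TIKA_JAVA_ARGS"]


# Intended order of use (left as comments so the sketch runs without tika):
#   if tika_server_is_listening():
#       print("A server is already on port 9998; stop that Java process first")
#   request_tika_heap("1G")
#   from tika import parser as tika_parser
```

If tika_server_is_listening() returns True before my script has started anything, that would support the idea that an old Java process with the default heap is being reused, and that it has to be stopped before the new setting can take effect.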



