Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)?
I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as follows, the errors go away:
C:\>java -Xmx1G -jar tika-app-2.1.0.jar
The -Xmx1G flag specifies a maximum heap size of 1 GByte (much larger than the default).
I've seen several answers for other languages, but none specific to Python with tika-python.
I've tried:
import os
os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
from tika import parser as tika_parser
and:
def main():
    global MODEL_LIST
    os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
    start_time = time.time()
    ... rest of code ...
and from the Windows command line:
C:\<path>\findEm>set TIKA_JAVA_ARGS="-Xmx1G"
C:\<path>\findEm>python3 findEmv1.52.py
All three methods result in the same error, something like:
2021-10-19 14:43:55,782 [MainThread ] [WARNI] Tika server returned status: 500
I think the main problem is that the Java Tika process is already running when I try to change the maximum heap size. Somehow I need to kill it, set the maximum heap size, and restart the Java Tika server. How?
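To make the suspicion above checkable, here is a minimal sketch that sets TIKA_JAVA_ARGS and then tests whether something is already listening where tika-python would look for its server. It assumes the server sits on localhost port 9998 (believed to be the tika-python default, but worth confirming for your setup). The helper names server_already_running and prepare_heap_setting are hypothetical and not part of tika-python. The sketch does not kill or restart anything; it only reports whether a leftover Java process would make the new heap setting irrelevant.

```python
import os
import socket

# Assumption: tika-python talks to a local Tika server on port 9998.
TIKA_HOST = "localhost"
TIKA_PORT = 9998


def server_already_running(host=TIKA_HOST, port=TIKA_PORT, timeout=1.0):
    """Return True if something is already accepting connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused or timed out: nothing is listening there.
        return False


def prepare_heap_setting(max_heap="1G"):
    """Set TIKA_JAVA_ARGS; return True if no server is running yet.

    A True result means a server started afterwards could pick up the
    new -Xmx value. False means an existing process is still listening.
    """
    os.environ["TIKA_JAVA_ARGS"] = f"-Xmx{max_heap}"
    return not server_already_running()


if __name__ == "__main__":
    if prepare_heap_setting("1G"):
        print("Nothing is listening on port 9998; import tika now.")
    else:
        print("A server is already listening on port 9998; "
              "stop that Java process before retrying.")
```

If the script reports that a server is already listening, ending that Java process (for example from Task Manager on Windows) and re-running with the variable set before `from tika import parser` would test whether the stale server is really the cause.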
from Increase tika heap size in Python with tika-python