Tuesday 26 February 2019

Apache Tika: exclude some HTML tags

I am testing the Apache Tika REST API from Python to parse HTML files. Everything works except for one thing: the contents of <noscript> tags are also parsed as text, so CSS styling content ends up in my extracted text, which is undesirable. The body of <div style="display:none"> elements is extracted as well. Is there a way to blacklist certain HTML tags in the Tika REST API?
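One possible workaround, sketched here under stated assumptions and not as a Tika feature, is to strip the unwanted elements on the Python side before the HTML is sent to the server. The names `HiddenContentStripper`, `strip_hidden`, and `extract_text` are hypothetical helpers, and the URL assumes a Tika server on its default port 9998 with the `/tika` endpoint returning plain text. The filter only matches inline `display:none` styles, drops comments, and relies on closing tags being present inside the hidden region.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the skip depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenContentStripper(HTMLParser):
    """Re-emits HTML while dropping blocked tags and inline display:none elements."""

    def __init__(self, blocked_tags=("noscript",)):
        super().__init__(convert_charrefs=False)
        self.blocked_tags = set(blocked_tags)
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.out = []

    def _is_hidden(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        return tag in self.blocked_tags or bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.skip_depth == 0:
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or self._is_hidden(tag, attrs):
            self.skip_depth += 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> carry no content to skip.
        if self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_hidden(html, blocked_tags=("noscript",)):
    """Return the HTML with blocked and inline-hidden elements removed."""
    parser = HiddenContentStripper(blocked_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def extract_text(html, tika_url="http://localhost:9998/tika"):
    """Send the cleaned HTML to a Tika server (URL is an assumption)."""
    request = urllib.request.Request(
        tika_url,
        data=strip_hidden(html).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/html; charset=utf-8",
                 "Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(page_source)` in place of the direct request would then hand Tika only the visible markup. Elements hidden through external stylesheets or CSS classes are not detected by this approach.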


