I am testing the Apache Tika REST API from Python to parse HTML files. Everything works except one thing: the contents of <noscript>
tags are also parsed as text, so CSS styling content ends up in my extracted text, which is undesirable. The body of <div style="display:none">
elements is extracted as well. Is there a way to blacklist certain HTML tags in the Tika REST API?
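
For illustration, here is a minimal sketch of one possible workaround on the Python side: strip the unwanted elements from the HTML before it is sent to Tika. This is not a Tika feature and does not answer whether Tika itself can blacklist tags. The names `HiddenContentStripper`, `strip_hidden` and `extract_text` are hypothetical, and the server URL assumes a local Tika server on its default port 9998 with the `PUT /tika` endpoint. Only inline `style="display:none"` is detected; rules coming from stylesheets are not.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Matches an inline "display: none" declaration (assumption: inline styles only).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)
# Void elements never get an end tag, so they must not be pushed on the skip stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenContentStripper(HTMLParser):
    """Re-emits HTML while dropping blocked tags and inline-hidden elements."""

    def __init__(self, blocked_tags=("noscript",)):
        super().__init__(convert_charrefs=False)
        self.blocked_tags = set(blocked_tags)
        self.out = []
        self.skip_stack = []  # open tags inside the subtree being dropped

    def _is_hidden(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        return tag in self.blocked_tags or bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if self.skip_stack or self._is_hidden(tag, attrs):
            if tag not in VOID_TAGS:
                self.skip_stack.append(tag)
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_stack and not self._is_hidden(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_stack:
            if tag in self.skip_stack:
                # Pop up to and including the matching open tag.
                while self.skip_stack.pop() != tag:
                    pass
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_stack:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_stack:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_stack:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_stack:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_hidden(html, blocked_tags=("noscript",)):
    """Return html without blocked tags and inline display:none elements."""
    parser = HiddenContentStripper(blocked_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def extract_text(html, url="http://localhost:9998/tika"):
    """Send cleaned HTML to a Tika server (URL is an assumed local default)."""
    request = urllib.request.Request(
        url,
        data=strip_hidden(html).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/html; charset=utf-8",
                 "Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With this approach, `strip_hidden(page_source)` can be inspected on its own before anything is sent to the server, and other tag names can be passed in `blocked_tags` if more elements need to be excluded.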
from Apache Tika exclude some html tags