I am testing the Apache Tika REST API from Python to parse HTML files. Everything works except one thing: the contents of <noscript> tags are also parsed as text, so CSS styling content ends up in my extracted text, which is undesirable. The body of <div style="display:none"> is extracted as well. Is there a way to blacklist certain HTML tags in the Tika REST API?
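One possible workaround, sketched here rather than confirmed as a Tika feature, is to strip the unwanted elements in Python before the HTML is sent to the server. The sketch below uses only the standard library. The names TagStripper, strip_hidden and tika_text are made up for illustration, the server address http://localhost:9998/tika assumes a default local Tika server, and the depth counting assumes reasonably well-balanced markup inside the removed elements.

```python
import html
import re
import urllib.request
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth counting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Hypothetical blacklist: adjust to the tags you want dropped.
SKIP_TAGS = {"noscript", "script", "style"}
HIDDEN_RE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class TagStripper(HTMLParser):
    """Re-emits HTML while dropping blacklisted and display:none elements.

    Comments and the doctype are dropped too, which is harmless for
    plain-text extraction.
    """

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.out.append(self.get_starttag_text())
            return
        style = dict(attrs).get("style") or ""
        if self.skip_depth or tag in SKIP_TAGS or HIDDEN_RE.search(style):
            self.skip_depth += 1  # assumes a matching end tag follows
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> carry no content to hide.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references arrive decoded, so escape them again.
            self.out.append(html.escape(data, quote=False))


def strip_hidden(html_text):
    """Return html_text without blacklisted or display:none elements."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def tika_text(html_text, url="http://localhost:9998/tika"):
    """PUT the cleaned HTML to a Tika server and return the plain text."""
    req = urllib.request.Request(
        url,
        data=strip_hidden(html_text).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/html; charset=utf-8",
                 "Accept": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling strip_hidden(page_source) alone is enough to check what would be removed; tika_text(page_source) additionally needs a running Tika server. Only inline style attributes are examined, so elements hidden through a stylesheet class would still get through.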
from Apache Tika exclude some html tags