Hemant Vishwakarma: Querying HTML Content in Common Crawl Dataset Using Amazon Athena

Monday, 9 October 2023

Querying HTML Content in Common Crawl Dataset Using Amazon Athena

I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query it. My objective is to search within the HTML content of the web pages to identify those that contain specific strings within their tags. In other words, I want to find the websites whose HTML content matches particular criteria.

I am aware that Athena is capable of querying large datasets on S3 using standard SQL. However, I am not sure whether it is feasible to query directly inside the HTML content of the web pages in the Common Crawl dataset, or how to approach it.

Here's a simplified version of what I am looking to achieve:

SELECT *
FROM "common_crawl_dataset"
WHERE html_content LIKE '%specific-string%';
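
To make the intended filter concrete, here is a minimal local sketch in Python of the two kinds of match I have in mind. It does not touch Athena or Common Crawl at all; it assumes the HTML of a page is already available as a string. The function names (html_contains, tags_contain) are hypothetical, and reading "within their tags" as "inside a start tag's name, attribute names, or attribute values" is my own assumption.

```python
from html.parser import HTMLParser


class TagStringFinder(HTMLParser):
    """Records whether `needle` occurs in any start tag's name or attributes."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        parts = [tag]
        for name, value in attrs:
            parts.append(name)
            parts.append(value or "")
        if any(self.needle in part for part in parts):
            self.found = True


def html_contains(html, needle):
    """Local analogue of html_content LIKE '%needle%': a raw, case-sensitive substring match."""
    return needle in html


def tags_contain(html, needle):
    """Stricter check (assumed meaning): needle must appear inside tag markup, not in page text."""
    finder = TagStringFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.found
```

For example, html_contains matches a page whose visible text includes the string, whereas tags_contain only matches when the string appears in something like a class or id attribute. The SQL above corresponds to the first, looser check.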

I'm reaching out to inquire:

1. Is it possible to directly query the HTML content of the web pages in the Common Crawl dataset using Athena?
2. If yes, what would be the best approach to accomplish this, considering efficiency and cost-effectiveness?
3. Are there any limitations or challenges that I should be aware of?

Any insights, tips, or examples of similar implementations would be greatly appreciated. Thank you in advance for your assistance!
