Sunday 13 December 2020

Can't get rid of unwanted stuff while scraping email addresses

I'm trying to capture email addresses from the landing pages of a few sites using requests together with the re module. The pattern I've used in the script is [\w\.-]+@[\w\.-]+.

When I run the script, I do get email addresses. However, I also get strings that resemble email addresses but are not, such as package versions and image filenames, and I would like to get rid of them.

import re
import requests

links = (
    'http://www.acupuncturetx.com',
    'http://www.hcmed.org',
    'http://www.drmindyboxer.com',
    'http://wendyrobinweir.com',
)

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

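for link in links:
    # Fetch each landing page with a browser-like User-Agent
    r = requests.get(link, headers=headers)
    # Collect every substring shaped like something@something
    emails = re.findall(r"[\w\.-]+@[\w\.-]+", r.text)
    print(emails)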

Current output:

['react@16.5.2', 'react-dom@16.5.2', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
['hh-logo@2x.png', 'hh-logo@2x.png', 'hh-logo@2x.png', 'hh-logo@2x-300x47.png']
['leaflet@1.7.1']
['8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com', 'requirejs-bolt@2.3.6', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wixstores-client-cart-icon@1.797.0', 'wixstores-client-gallery@1.1634.0']
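To narrow down why these strings slip through, here is a minimal offline check of the same pattern. The sample markup is made up for illustration (the unpkg URL and the attribute layout are assumptions, not copied from the pages above); it only shows that the character class on the right of the @ accepts digits and dots, so version tags and retina image names qualify as matches.

```python
import re

# Same pattern as in the script above
pattern = r"[\w\.-]+@[\w\.-]+"

# Hypothetical snippet of page source, invented for this check
sample = (
    '<script src="https://unpkg.com/react@16.5.2/umd/react.js"></script> '
    '<img src="hh-logo@2x.png"> '
    '<a href="mailto:bai@acupuncturetx.com">mail</a>'
)

matches = re.findall(pattern, sample)
print(matches)
# ['react@16.5.2', 'hh-logo@2x.png', 'bai@acupuncturetx.com']
```

Nothing in the pattern requires the part after the @ to end in an alphabetic top-level domain, which is probably where the false positives come from.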

Expected output:

['bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
[]
[]
['wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com']

How can I capture only email addresses and get rid of the unwanted stuff using regex?
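One direction that might work, offered as a rough sketch rather than a definitive answer: require the domain to end in an alphabetic top-level label, then drop candidates that end in a known asset extension or belong to a known tracking host. The names EMAIL_RE, ASSET_SUFFIXES, IGNORED_DOMAINS and extract_emails are hypothetical, and both lists are assumptions that would need tuning for each site. A regex alone cannot tell the Sentry key from a real address, since it is syntactically valid.

```python
import re

# Domain must end in a dot followed by at least two letters,
# which rules out version tags such as react@16.5.2
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}")

# Assumed blocklists, not exhaustive
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".js", ".css")
IGNORED_DOMAINS = ("sentry.wixpress.com",)


def extract_emails(text):
    """Return email-like strings from text, minus asset names and ignored hosts."""
    found = []
    for candidate in EMAIL_RE.findall(text):
        lowered = candidate.lower()
        # Skip retina image names such as hh-logo@2x.png
        if lowered.endswith(ASSET_SUFFIXES):
            continue
        # Skip addresses on hosts known to carry keys, not mailboxes
        if lowered.split("@", 1)[1] in IGNORED_DOMAINS:
            continue
        found.append(candidate)
    return found


# Made-up sample combining strings from the output above
sample = (
    'src="hh-logo@2x.png" 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com '
    'requirejs-bolt@2.3.6 mailto:wendyrobin16@gmail.com'
)
print(extract_emails(sample))
# ['wendyrobin16@gmail.com']
```

In the loop from the script, re.findall(...) could then be swapped for extract_emails(r.text).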


