I'm trying to capture email addresses from the landing pages of a few sites using requests together with the re module. This is the pattern I use in the script: [\w\.-]+@[\w\.-]+
When I run the script, I do get email addresses. However, I also get some unwanted strings that resemble email addresses but are not, and I would like to get rid of them.
import re
import requests
links = (
'http://www.acupuncturetx.com',
'http://www.hcmed.org',
'http://www.drmindyboxer.com',
'http://wendyrobinweir.com',
)
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
for link in links:
    r = requests.get(link,headers=headers)
    emails = re.findall(r"[\w\.-]+@[\w\.-]+",r.text)
    print(emails)
Current output:
['react@16.5.2', 'react-dom@16.5.2', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
['hh-logo@2x.png', 'hh-logo@2x.png', 'hh-logo@2x.png', 'hh-logo@2x-300x47.png']
['leaflet@1.7.1']
['8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com', 'requirejs-bolt@2.3.6', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wixstores-client-cart-icon@1.797.0', 'wixstores-client-gallery@1.1634.0']
Expected output:
['bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
[]
[]
['wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com']
How can I capture only email addresses and get rid of the unwanted matches using regex?
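One possible direction, sketched here under stated assumptions rather than offered as a definitive fix: require the last domain label to be alphabetic, which drops package versions such as react@16.5.2, and then discard matches that end in a static-asset extension or belong to a known non-mailbox domain. The names extract_emails, ASSET_SUFFIXES and IGNORED_DOMAINS are hypothetical, and both lists are assumptions based only on the output shown above; they would likely need extending for other sites.

```python
import re

# Local part, "@", one or more dot-terminated labels, then an alphabetic
# final label of at least two letters that is not followed by more
# word characters. Purely numeric "domains" such as 16.5.2 cannot match.
EMAIL_RE = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+[A-Za-z]{2,}(?![\w-])")

# Assumption: matches ending in these suffixes are file names
# (e.g. hh-logo@2x.png), not mailboxes.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")

# Assumption: domains that show up in page source as tracking keys,
# taken from the output in the question.
IGNORED_DOMAINS = {"sentry.wixpress.com"}


def extract_emails(text):
    """Return email-like matches from text, minus asset names and ignored domains."""
    emails = []
    for candidate in EMAIL_RE.findall(text):
        lowered = candidate.lower()
        if lowered.endswith(ASSET_SUFFIXES):
            continue  # looks like a file name
        if lowered.rsplit("@", 1)[1] in IGNORED_DOMAINS:
            continue  # known non-mailbox domain
        emails.append(candidate)
    return emails
```

In the loop from the script, re.findall(...) would then be replaced by extract_emails(r.text). This remains a heuristic: a regex cannot confirm that a mailbox actually exists, only that a string has a plausible shape.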