I've created a script that fetches newspaper names from a search engine's results when I search for different keywords, such as CMG제약, DB하이텍, etc., using the search box at the top right of the page.
I also set custom dates in the params to restrict the results to those date ranges. The script works fine as long as the search list contains a single keyword.
However, when I put multiple keywords in the search list, the script only returns results for the last keyword. This is the list of keywords I would like to use:
keywords = ['CMG제약','DB하이텍','ES큐브','EV첨단소재']
The script is short, but the long params dict makes it look bigger.
This is what I've tried so far (it works as intended because the list contains a single search keyword):
import requests
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.parse import urljoin
year_list_start = ['2013.01.01','2014.01.02']
year_list_upto = ['2014.01.01','2015.01.01']
base = 'https://search.naver.com/search.naver'
link = 'https://search.naver.com/search.naver'
params = {
    'where': 'news',
    'sm': 'tab_pge',
    'query': '',
    'sort': '1',
    'photo': '0',
    'field': '0',
    'pd': '',
    'ds': '',
    'de': '',
    'cluster_rank': '',
    'mynews': '0',
    'office_type': '0',
    'office_section_code': '0',
    'news_office_checked': '',
    'nso': '',
    'start': '',
}
def fetch_content(s,keyword,link,params):
    for start_date,date_upto in zip(year_list_start,year_list_upto):
        ds = start_date.replace(".","")
        de = date_upto.replace(".","")
        params['query'] = keyword
        params['ds'] = ds
        params['de'] = de
        params['nso'] = f'so:r,p:from{ds}to{de},a:all'
        params['start'] = 1
        while True:
            res = s.get(link,params=params)
            print(res.status_code)
            print(res.url)
            soup = BeautifulSoup(res.text,"lxml")
            if not soup.select_one("ul.list_news .news_area .info_group > a.press"): break
            for item in soup.select("ul.list_news .news_area"):
                newspaper_name = item.select_one(".info_group > a.press").get_text(strip=True).lstrip("=")
                print(newspaper_name)
            if soup.select_one("a.btn_next[aria-disabled='true']"): break
            next_page = soup.select_one("a.btn_next").get("href")
            link = urljoin(base,next_page)
            params = None
if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        keywords = ['CMG제약']
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            future_to_url = {executor.submit(fetch_content, s, keyword, link, params): keyword for keyword in keywords}
            concurrent.futures.as_completed(future_to_url)
How can I make the script work when there is more than one keyword in the search list?
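One likely cause, offered as a sketch rather than a confirmed diagnosis: every worker thread receives the same module-level `params` dict, so each call to `fetch_content` overwrites `query` for all the others. Separately, rebinding `params = None` and `link` inside the function means the second date range would fail at `params['query'] = keyword`, and because the futures' results are never read, that exception stays silent. The example below shows one way to give each keyword and date range its own params dict and to surface worker errors. It makes no network requests, and the names `BASE_PARAMS`, `build_params` and `run` are hypothetical helpers, not part of the original script.

```python
import concurrent.futures

# Trimmed-down template for illustration; the real script has more keys.
BASE_PARAMS = {'where': 'news', 'query': '', 'ds': '', 'de': '', 'nso': '', 'start': ''}


def build_params(keyword, start_date, date_upto, template=BASE_PARAMS):
    """Return a fresh params dict for one keyword and one date range."""
    ds = start_date.replace('.', '')
    de = date_upto.replace('.', '')
    params = dict(template)  # shallow copy, so the shared template is never mutated
    params.update(query=keyword, ds=ds, de=de,
                  nso=f'so:r,p:from{ds}to{de},a:all', start=1)
    return params


def fetch_content(keyword, date_ranges):
    # Stand-in for the real function: it only collects the params it would send.
    # In the real script, each date range would also restart from the base link.
    return [build_params(keyword, start, upto) for start, upto in date_ranges]


def run(keywords, date_ranges):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(fetch_content, kw, date_ranges): kw for kw in keywords}
        for future in concurrent.futures.as_completed(futures):
            # result() re-raises any exception from the worker instead of hiding it
            results[futures[future]] = future.result()
    return results


if __name__ == '__main__':
    ranges = [('2013.01.01', '2014.01.01'), ('2014.01.02', '2015.01.01')]
    for kw, param_sets in run(['CMG제약', 'DB하이텍', 'ES큐브', 'EV첨단소재'], ranges).items():
        print(kw, [p['nso'] for p in param_sets])
```

In the real script, the same idea would mean building the params inside `fetch_content` for each date range, keeping the next-page URL in a local variable instead of overwriting the `link` argument, and iterating over `as_completed(...)` so that `future.result()` is actually called.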
from Script doesn't work when I go for multiple search keywords in the list