Sunday, 23 May 2021

Script doesn't work when I go for multiple search keywords in the list

I've created a script that fetches newspaper names from a search engine's results when I search for different keywords, such as CMG제약, DB하이텍, etc., using the search box at the top right of that page.

I also set custom date ranges in params to restrict the results to those dates. The script works fine as long as the search list contains a single keyword.

However, when I put multiple keywords in the search list, the script only produces results for the last keyword. This is the list of keywords I would like to use:

keywords = ['CMG제약','DB하이텍','ES큐브','EV첨단소재']

The script itself is short; the long params dictionary just makes it look bigger.

This is what I've tried so far (it works as intended because the list contains a single search keyword):

import requests
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.parse import urljoin

year_list_start = ['2013.01.01','2014.01.02']
year_list_upto = ['2014.01.01','2015.01.01']

base = 'https://search.naver.com/search.naver'
link = 'https://search.naver.com/search.naver'
params = {
    'where': 'news',
    'sm': 'tab_pge',
    'query': '',
    'sort': '1',
    'photo': '0',
    'field': '0',
    'pd': '',
    'ds': '',
    'de': '',
    'cluster_rank': '',
    'mynews': '0',
    'office_type': '0',
    'office_section_code': '0',
    'news_office_checked': '',
    'nso': '',
    'start': '',
}

def fetch_content(s,keyword,link,params):
    for start_date,date_upto in zip(year_list_start,year_list_upto):
        ds = start_date.replace(".","")
        de = date_upto.replace(".","")
        params['query'] = keyword
        params['ds'] = ds
        params['de'] = de
        params['nso'] = f'so:r,p:from{ds}to{de},a:all'
        params['start'] = 1

        while True:
            res = s.get(link,params=params)
            print(res.status_code)
            print(res.url)
            soup = BeautifulSoup(res.text,"lxml")
            if not soup.select_one("ul.list_news .news_area .info_group > a.press"): break
            for item in soup.select("ul.list_news .news_area"):
                newspaper_name = item.select_one(".info_group > a.press").get_text(strip=True).lstrip("=")
                print(newspaper_name)

            if soup.select_one("a.btn_next[aria-disabled='true']"): break
            next_page = soup.select_one("a.btn_next").get("href")
            link = urljoin(base,next_page)
            params = None


if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        
        keywords = ['CMG제약']

        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            future_to_url = {executor.submit(fetch_content, s, keyword, link, params): keyword for keyword in keywords}
            concurrent.futures.as_completed(future_to_url)

How can I make the script work when there is more than one keyword in the search list?
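
One likely cause, offered as a reading of the script above rather than a confirmed diagnosis: every call to fetch_content receives the same module-level params dict and writes params['query'] into it, so with several worker threads the keywords can overwrite one another. In addition, the while loop rebinds link and sets params = None, so once a next page has been followed, the second date range would fail at params['query'] = keyword with a TypeError, and that error stays hidden because no future's result is ever read. The sketch below shows one way around this by building a fresh dict for each keyword and date range. BASE_PARAMS (a trimmed copy of params), build_params and plan_requests are hypothetical names introduced for illustration, and plan_requests only returns the parameters it would send instead of calling s.get, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Trimmed stand-in for the params dict in the script above (assumption: the
# remaining keys would be carried over unchanged).
BASE_PARAMS = {'where': 'news', 'query': '', 'ds': '', 'de': '', 'nso': '', 'start': ''}

year_list_start = ['2013.01.01', '2014.01.02']
year_list_upto = ['2014.01.01', '2015.01.01']


def build_params(keyword, start_date, date_upto):
    """Hypothetical helper: return a new params dict for one keyword and date range."""
    ds = start_date.replace('.', '')
    de = date_upto.replace('.', '')
    params = dict(BASE_PARAMS)  # copy, so the shared template is never mutated
    params.update(query=keyword, ds=ds, de=de,
                  nso=f'so:r,p:from{ds}to{de},a:all', start=1)
    return params


def plan_requests(keyword):
    """Stand-in for fetch_content: collect the params instead of sending requests."""
    return [build_params(keyword, start_date, date_upto)
            for start_date, date_upto in zip(year_list_start, year_list_upto)]


keywords = ['CMG제약', 'DB하이텍', 'ES큐브', 'EV첨단소재']

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in input order and re-raises any exception
    # from a worker, so failures are not silently dropped.
    planned = dict(zip(keywords, executor.map(plan_requests, keywords)))

for keyword, requests_for_keyword in planned.items():
    for p in requests_for_keyword:
        print(keyword, p['query'], p['ds'], p['de'])
```

In the real script, the dict returned by build_params would be passed to s.get(link, params=...) for the first page of each date range, and the next-page URL would be kept in a local variable so that each new date range starts again from the base URL.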


