Thursday, 26 August 2021

How to eliminate duplicate data being read from a URL that refreshes, in Python BeautifulSoup

The snippet already works, but I need help filtering out some duplicate results.

Issue #1: After running the script for a few minutes, it displays duplicate results (the same rows again).

Issue #2: Sometimes it misses some data. I'm not sure if this is normal.

Goal #1: Eliminate duplicates; don't process a row if it has already been read.

Goal #2: Possibly read all data in succession, with no gaps (blocks 10205401, 10205402, 10205403, 10205404, and so on). See the sketches after the output below.

from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')   # strips everything except digits, commas and dots

url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'   # not used below
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
previous_block = 0
while True:
    scans += 1
    # headers must be passed as a keyword argument; positionally it would be sent as params
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.find_all('table')[0].find_all('tr')
    print(" -> Whole Page Scanned: ", scans)
    for row in blocktxsInternal[1:]:          # skip the table header row
        txnhashdetails = row.find_all('td')[1].text.strip()
        block = row.find_all('td')[3].text.strip()
        if float(block) > float(previous_block):
            previous_block = block            # remember the newest block seen so far
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)

        if transval >= 0 and block == previous_block:   # only rows from the newest block
            print("    Processing data -> " + txnhashdetails[60:] + "   " + block + "   " + str(transval))
    sleep(1)   # pause between polls; in the original this sat outside the loop and never ran

Current output (after a few minutes of running the script):

-> Whole Page Scanned:  14
 Processing data -> 8490f9   10205401   0.0
 Processing data -> 31f486   10205401   0.753749522929516
 Processing data -> 180ff9   10205401   0.0011
-> Whole Page Scanned:  15              <--- duplicate reads/data
 Processing data -> 8490f9   10205401   0.0                  
 Processing data -> 31f486   10205401   0.753749522929516
 Processing data -> 180ff9   10205401   0.0011
-> Whole Page Scanned:  16              <--- just fine
 Processing data -> 836486   10205402   0.0345
 Processing data -> d05a8a   10205402   1.37
 Processing data -> 0a035d   10205402   0.3742134
-> Whole Page Scanned:  17               <--- missed one (10205403)                   
 Processing data -> e9d7b7   10205404   10.10
 Processing data -> 9079c9   10205404   1.09
 Processing data -> f8a8a0   10205404   100.2
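
One way to get Goal #1 is to remember which transactions have already been printed and skip them on later refreshes. Below is a minimal sketch, assuming the hash in the second column uniquely identifies a transaction; seen_hashes is a name introduced here for illustration, not part of the original snippet:

from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
header = {"User-Agent": "Mozilla/5.0"}

seen_hashes = set()   # hypothetical name: hashes of rows already processed

while True:
    req = requests.get(url, headers=header, timeout=2)
    soup = BeautifulSoup(req.content, 'html.parser')
    for row in soup.find_all('table')[0].find_all('tr')[1:]:   # skip the header row
        cells = row.find_all('td')
        txnhash = cells[1].text.strip()
        if txnhash in seen_hashes:
            continue                        # duplicate from an earlier refresh
        seen_hashes.add(txnhash)
        block = cells[3].text.strip()
        transval = float(trim.sub('', cells[9].text).replace(",", ""))
        print("    Processing data ->", txnhash[60:], block, transval)
    sleep(1)

Over a long run the set grows without bound; pruning hashes once their block is too old to still appear on the page (or pairing the set with a bounded collections.deque) keeps memory flat.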
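Goal #2 is harder to guarantee, because the page only ever shows the most recent transactions: if more rows arrive between two polls than fit on one page, the older rows are gone by the next fetch. A sketch of one mitigation, assuming block numbers only increase and that a block's transactions all fit on one page; last_block is a name introduced here, and the per-row extraction is elided:

last_block = 0   # hypothetical name: newest block number fully processed so far

while True:
    req = requests.get(url, headers=header, timeout=2)
    soup = BeautifulSoup(req.content, 'html.parser')
    rows = soup.find_all('table')[0].find_all('tr')[1:]
    page_blocks = [int(r.find_all('td')[3].text.strip()) for r in rows]
    if last_block and min(page_blocks) > last_block + 1:
        # rows scrolled off the page between polls, so some blocks were never seen
        print("    Missed blocks:", last_block + 1, "to", min(page_blocks) - 1)
    for row in reversed(rows):              # oldest row first, so output stays in block order
        block = int(row.find_all('td')[3].text.strip())
        if block <= last_block:
            continue                        # already handled on an earlier scan
        # ... extract the hash and value and print, as in the snippet above ...
    last_block = max(page_blocks)
    sleep(1)

This processes every block still visible on the page (rather than only the newest one, which is all the block == previous_block test in the question allows) and at least reports when blocks were missed outright; if no gap is acceptable, polling more often or switching to the BscScan API is the more reliable route.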


from How to eliminate duplicate data being read from a URL that refreshes, in Python BeautifulSoup
