I am parsing a large HTML document. I have split the text into "groups", which are separated by \n\n. The entire text sits inside the <font> </font> tags of the document.
Each group has 5 fields: Serial#........., Cust#..........., Customer Name..., BILL TO NO NAME., DATE......
I need to take the Cust#........... from each group and compare it against every other group in the list to look for a duplicate Cust#............
If a duplicate is found, I need to append its BILL TO NO NAME. value to each group that has the same Cust#...........
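The comparison described above could be sketched roughly as follows, assuming each group has already been parsed into a dict. The shortened keys, the `records` list and the `merge_bill_to` name are made up for illustration. Rather than comparing every group against every other group, it collects the BILL TO values per Cust# in one pass and then writes the combined value back to every group whose Cust# occurs more than once.

```python
from collections import defaultdict


def merge_bill_to(records):
    """Return copies of records where groups sharing a Cust# carry every BILL TO value."""
    # First pass: collect all BILL TO values seen for each customer number.
    bill_to_by_cust = defaultdict(list)
    for rec in records:
        bill_to_by_cust[rec["Cust#"]].append(rec["BILL TO NO NAME."])

    # Second pass: rewrite only the groups whose customer number is duplicated.
    merged = []
    for rec in records:
        values = bill_to_by_cust[rec["Cust#"]]
        new_rec = dict(rec)
        if len(values) > 1:
            new_rec["BILL TO NO NAME."] = " ".join(values)
        merged.append(new_rec)
    return merged


# Hypothetical, already-parsed records using values from the sample.
records = [
    {"Cust#": "123456", "BILL TO NO NAME.": "Bill To: 001166 - Some Company"},
    {"Cust#": "123456", "BILL TO NO NAME.": "Bill To: 0165487 - Some Other Company"},
    {"Cust#": "88483", "BILL TO NO NAME.": "Bill To: 0146897 - Some Company"},
]
merged = merge_bill_to(records)
```

Groups with a unique Cust# pass through unchanged, and the input list is not modified.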
Sample html:
Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company\nDATE...... 01/01/00\n\n'Serial#......... sgfdsfd546545645\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n'Serial#......... Jgfdhdgfhgfdh4545\nCust#........... 88483\nCustomer Name... John Smith\nBILL TO NO NAME. Bill To: 0146897 - Some Company\nDATE...... 01/01/00\n\n'Serial#......... JF2SJads5dsafdsaf\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 8888154 - Man Utd\nDATE...... 01/01/00\n\n'Serial#......... JdsfrfdsgHG091797\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'
The output I need is:
Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n',Serial#......... JF2SJads5dsafdsaf\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 8888154 - Man Utd Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'
The expected output above leaves out the duplicate groups for each Cust#..........., only to show the result I am after more clearly. I can sort the duplicates out later.
Here is a shortened version of what I have so far; the rest is not relevant.
from bs4 import BeautifulSoup
import re
import urllib
import os
with open(r'.html', 'r') as f:
    html_string = f.read()
soup = BeautifulSoup(html_string, 'html.parser')
groups = soup.font.text.replace('Serial#', 'xxxSerial#').split('xxx')
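A minimal sketch of how a `groups` list like the one above might be post-processed, assuming every group is a string whose field labels look exactly like the sample. The function name `append_duplicate_bill_to`, the two regular expressions and the `sample` list are illustrative, not part of the original code. Any element without a Cust# line (for example an empty string at the start of the split) is passed through untouched.

```python
import re
from collections import defaultdict

# Patterns assume the labels "Cust#..." and "BILL TO NO NAME." as in the sample.
CUST_RE = re.compile(r"^Cust#\.+ (.*)$", re.MULTILINE)
BILL_RE = re.compile(r"^(BILL TO NO NAME\. )(.*)$", re.MULTILINE)


def append_duplicate_bill_to(groups):
    """Append every BILL TO value to each group that shares a Cust#."""
    bill_to = defaultdict(list)
    parsed = []
    for group in groups:
        cust = CUST_RE.search(group)
        bill = BILL_RE.search(group)
        if not cust or not bill:
            parsed.append((group, None))  # not a complete group, keep as-is
            continue
        key = cust.group(1).strip()
        bill_to[key].append(bill.group(2).strip())
        parsed.append((group, key))

    result = []
    for group, key in parsed:
        if key is not None and len(bill_to[key]) > 1:
            combined = " ".join(bill_to[key])
            # A function replacement avoids backslash escaping in the new text.
            group = BILL_RE.sub(lambda m: m.group(1) + combined, group, count=1)
        result.append(group)
    return result


sample = [
    "",
    "Serial#......... 12345678974566321\nCust#........... 123456\n"
    "Customer Name... Humpfrey Bear\n"
    "BILL TO NO NAME. Bill To: 001166 - Some Company\nDATE...... 01/01/00\n\n",
    "Serial#......... sgfdsfd546545645\nCust#........... 123456\n"
    "Customer Name... Humpfrey Bear\n"
    "BILL TO NO NAME. Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n",
    "Serial#......... Jgfdhdgfhgfdh4545\nCust#........... 88483\n"
    "Customer Name... John Smith\n"
    "BILL TO NO NAME. Bill To: 0146897 - Some Company\nDATE...... 01/01/00\n\n",
]
result = append_duplicate_bill_to(sample)
```

Removing the now-redundant duplicate groups afterwards could be done by keeping only the first group seen for each Cust#.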
from Append text to a group in a html document based on condition