Friday, 28 August 2020

Append text to a group in an HTML document based on a condition

I am parsing a large HTML document. I have split the text into "groups", which are separated by \n\n. All of the text sits inside the document's <font> </font> tags.

Each group has 5 fields: Serial#........., Cust#..........., Customer Name..., BILL TO NO NAME., and DATE......
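As a rough sketch of how one group could be turned into a dictionary of its five fields: this assumes every field sits on its own line and that the line starts with the label exactly as written above. The names `FIELD_LABELS` and `parse_group` are made up for illustration and are not part of my actual code.

```python
# Hypothetical helper: the labels are copied from the field list above.
FIELD_LABELS = [
    "Serial#.........",
    "Cust#...........",
    "Customer Name...",
    "BILL TO NO NAME.",
    "DATE......",
]


def parse_group(group):
    """Return {label: value} for one group of text.

    Assumes one field per line, each line starting with its label.
    Lines that match no label are ignored.
    """
    fields = {}
    for line in group.strip().splitlines():
        for label in FIELD_LABELS:
            if line.startswith(label):
                # Everything after the label is the value.
                fields[label] = line[len(label):].strip()
                break
    return fields
```

Calling `parse_group` on a single group then gives the customer number as `fields["Cust#..........."]`.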

I need to take the Cust#........... value from each group and compare it against every other group in the list to look for a duplicate.

If a duplicate is found, I need to append the BILL TO NO NAME. value of each matching group to every group that shares that Cust#...........
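To make the requirement concrete, here is a minimal sketch of the comparison step. It assumes `groups` is a list of strings, one per group, with each field on its own line. The function name `merge_bill_to` and the two regular expressions are hypothetical. Rather than comparing every group against every other group, it collects the BILL TO NO NAME. values per customer number in a first pass and rewrites each group in a second pass.

```python
import re
from collections import defaultdict

# Assumed line layouts: "Cust#........... <value>" and "BILL TO NO NAME. <value>".
CUST_RE = re.compile(r"^Cust#\.+ *(.*)$", re.MULTILINE)
BILL_RE = re.compile(r"^(BILL TO NO NAME\. *)(.*)$", re.MULTILINE)


def merge_bill_to(groups):
    """Give every group the combined bill-to text of all groups sharing its Cust#."""
    # Pass 1: collect the distinct bill-to values per customer, in order of appearance.
    bill_by_cust = defaultdict(list)
    for group in groups:
        cust = CUST_RE.search(group)
        bill = BILL_RE.search(group)
        if cust and bill:
            key = cust.group(1).strip()
            value = bill.group(2).strip()
            if value not in bill_by_cust[key]:
                bill_by_cust[key].append(value)

    # Pass 2: replace each group's bill-to value with the combined text.
    merged = []
    for group in groups:
        cust = CUST_RE.search(group)
        if cust is None:
            merged.append(group)  # no customer number found, leave untouched
            continue
        combined = " ".join(bill_by_cust[cust.group(1).strip()])
        # A function is used as the replacement so backslashes in the data are not interpreted.
        merged.append(
            BILL_RE.sub(lambda m: m.group(1) + combined, group, count=1)
        )
    return merged
```

Groups whose customer number appears only once come back unchanged, because the combined text is then just their own bill-to value.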

Sample html:

Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company\nDATE...... 01/01/00\n\n'Serial#......... sgfdsfd546545645\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n'Serial#......... Jgfdhdgfhgfdh4545\nCust#........... 88483\nCustomer Name... John Smith\nBILL TO NO NAME. Bill To: 0146897 - Some Company\nDATE...... 01/01/00\n\n'Serial#......... JF2SJads5dsafdsaf\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 8888154 - Man Utd\nDATE...... 01/01/00\n\n'Serial#......... JdsfrfdsgHG091797\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'

The output I need is: Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n',Serial#......... JF2SJads5dsafdsaf\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 8888154 - Man Utd Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'

The expected output above leaves out the duplicate groups (by Cust#...........), but only because I am trying to show the expected result a little more clearly. I can sort the duplicates out later.

Here is a shortened version of what I have so far; the rest is insignificant.

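from bs4 import BeautifulSoup
import re
import urllib
import os

# Read the whole HTML document into a string.
with open(r'.html', 'r') as f:
    html_string = f.read()

soup = BeautifulSoup(html_string, 'html.parser')

# All of the text is inside the <font> tag. Mark the start of each group with
# 'xxx' and split on that marker so every group begins with 'Serial#'.
groups = soup.font.text.replace('Serial#', 'xxxSerial#').split('xxx')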
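A possible alternative to the 'xxx' marker, sketched here on a plain string so it runs without the HTML file: `re.split` with a lookahead splits in front of every 'Serial#' without altering the text. This relies on Python 3.7 or later, where `re.split` accepts patterns that match an empty string. The name `split_groups` is hypothetical, and `text` stands in for `soup.font.text`.

```python
import re


def split_groups(text):
    """Split text into groups, each one starting at 'Serial#'."""
    # The lookahead matches the position just before 'Serial#' and consumes nothing.
    parts = re.split(r"(?=Serial#)", text)
    # Drop empty or whitespace-only pieces, such as anything before the first group.
    return [part for part in parts if part.strip()]
```

With the original approach the first element of `groups` is an empty string whenever the text starts with 'Serial#'; the filter in the last line removes it.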

