How To List All Strings That Contain PA/ Inside An HTML File Using Beautiful Soup

January 31, 2024

I have a program that converts PDFs into HTML, and I needed to extend it so that, after converting, it would search for the PA/ tags and the characters in front of them and save

Solution 1:

    import re
    from bs4 import BeautifulSoup
    html_doc = """
    <html><title>Testing</title><body><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:59px; top:34023px; width:84px; height:32px;"><span style="font-family: YFEHEP+Times-Bold; font-size:17px">JUST SOME TEXT THAT I DON'T WANT TO HAVE ON THE CSV FILE
            <br></span><span style="font-family: YFEHEP+Times-Roman; font-size:16px">PA/00986/17 GTD
            <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:59px; top:34066px; width:84px; height:16px;"><span style="font-family: YFEHEP+Times-Roman; font-size:16px">PA/01008/17 GTD
            <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:59px; top:34105px; width:84px; height:16px;"><span style="font-family: YFEHEP+Times-Roman; font-size:16px">PA/01095/17 GTD
        </body></html>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')
    text = soup.get_text()

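    # Raw string so \S and \s reach the regex engine unchanged.
    # Group 1 is the code after "PA/", group 2 the next whitespace-separated token.
    match = re.findall(r"PA/(\S*)\s*(\S*)", text)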
    print(match)
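
If the goal is to list the full strings that contain PA/ rather than the captured pieces, Beautiful Soup can filter text nodes directly through the `string` argument of `find_all`. The following is only a sketch under assumptions: `html_doc` is a shortened stand-in for the markup above, and `pa_strings` is a name chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the converted PDF markup (assumption, not the real file)
html_doc = """
<html><body>
<div><span>JUST SOME TEXT</span><span>PA/00986/17 GTD<br></span></div>
<div><span>PA/01008/17 GTD<br></span></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all(string=...) returns the text nodes whose content matches the pattern
pa_strings = [s.strip() for s in soup.find_all(string=re.compile(r"PA/"))]
print(pa_strings)
```

Each entry is a whole text node with surrounding whitespace removed, so unrelated text such as the heading span is skipped without calling `get_text()` first.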

For writing to CSV:

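    import csv

    # Python 3: open in text mode with newline='' so csv controls line endings
    with open('ur file.csv', 'w', newline='') as out:
        csv_out = csv.writer(out)
        csv_out.writerow(['first_col', 'second_col'])
        # match is the list of tuples returned by re.findall above
        for row in match:
            csv_out.writerow(row)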
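
The extraction and the CSV step can also be wrapped together so they are easy to check without touching the disk. This is a rough sketch, not part of the original answer: `write_pa_rows` and `PA_PATTERN` are hypothetical names, and the sample text stands in for the output of `soup.get_text()`.

```python
import csv
import io
import re

# Same pattern as in the solution, compiled once as a raw string
PA_PATTERN = re.compile(r"PA/(\S*)\s*(\S*)")

def write_pa_rows(text, out):
    """Write a header plus one CSV row per PA/ match; return the match count."""
    writer = csv.writer(out)
    writer.writerow(["first_col", "second_col"])
    rows = PA_PATTERN.findall(text)  # list of (code, following token) tuples
    writer.writerows(rows)
    return len(rows)

# Stand-in for the text extracted from the HTML (assumption)
sample = "SOME HEADER\nPA/00986/17 GTD\nPA/01008/17 GTD\n"
buffer = io.StringIO()
count = write_pa_rows(sample, buffer)
print(count)
print(buffer.getvalue())
```

Because the function accepts any file-like object, the same call should work with a real file opened as `open(path, 'w', newline='')`.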
