By What Library And How Can I Scrape Texts On An Html By Its Heading And Paragraph Tags?

February 28, 2024

My input will be any web document with no fixed HTML structure. What I want to do is extract the text in each heading (which might be nested) and in its following paragraph tags (mi

Solution 1:

Traversing the tree and collecting all the <p> tags that are under increasing levels of <h> tags can be done with BeautifulSoup:

html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
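            <p>D</p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

counter = 1
all_leafs = []
while True:
    # Look at one heading level at a time: h1, h2, h3, ...
    htag = 'h%d' % counter
    hgroups = soup.find_all(htag)
    print(htag, len(hgroups))
    counter += 1
    # Stop at the first heading level that does not occur in the document
    if len(hgroups) == 0:
        break
    for hgroup in hgroups:
        # Walk every tag that comes after this heading in document order
        for c, descendant in enumerate(hgroup.find_all_next()):
            name = getattr(descendant, "name", None)
            if name == 'p':
                # Pair the heading text with each later paragraph
                all_leafs.append((hgroup.getText(), descendant.getText()))
print(all_leafs)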

...

h1 1
h2 2
h3 2
h4 0
[('House rule', 'A'), ('House rule', 'B'), ('House rule', 'C'), ('House rule', 'D'), ('Rule 1', 'A'), ('Rule 1', 'B'), ('Rule 1', 'C'), ('Rule 1', 'D'), ('Rule 2', 'C'), ('Rule 2', 'D'), ('Rule 2.1', 'C'), ('Rule 2.1', 'D'), ('Rule 2.2', 'D')]
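
Note that this output pairs each heading with every paragraph that comes later in the document, so 'Rule 1' is also paired with 'C' and 'D', which sit under 'Rule 2'. If that is not wanted, one possible variant is to stop scanning as soon as a heading of the same or a higher level is reached. The following is only a sketch of that idea, not part of the original answer: the function name paragraphs_by_heading is hypothetical, and it assumes headings are limited to h1 through h6 and uses the built-in "html.parser" so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Matches the tag names h1 .. h6 and captures the level digit
HEADING = re.compile(r"^h([1-6])$")

def paragraphs_by_heading(html):
    """Return (heading text, paragraph text) pairs, scoped to each heading.

    Hypothetical helper: a paragraph belongs to a heading only if no
    heading of the same or a higher level appears between them.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all(HEADING):
        level = int(heading.name[1])
        # find_all_next(True) yields every later tag in document order
        for node in heading.find_all_next(True):
            match = HEADING.match(node.name)
            if match and int(match.group(1)) <= level:
                break  # a sibling or parent section starts here
            if node.name == "p":
                pairs.append((heading.get_text(strip=True),
                              node.get_text(strip=True)))
    return pairs

if __name__ == "__main__":
    sample = '''
    <h1>House rule</h1>
        <h2>Rule 1</h2>
            <p>A</p>
            <p>B</p>
        <h2>Rule 2</h2>
            <h3>Rule 2.1</h3>
                <p>C</p>
            <h3>Rule 2.2</h3>
                <p>D</p>'''
    for pair in paragraphs_by_heading(sample):
        print(pair)
```

With the sample document, 'Rule 1' is now paired only with 'A' and 'B', while 'House rule' still collects all four paragraphs because every other heading is nested below it.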
