
By What Library And How Can I Scrape Texts On An Html By Its Heading And Paragraph Tags?

My input will be any web document with no fixed HTML structure. What I want to do is extract the text in each heading (which might be nested) and in the paragraph tags that follow it (mi

Solution 1:

Walking through the heading levels in turn (h1, h2, h3, ...) and collecting every <p> tag that comes after each heading in the document can be done with BeautifulSoup:

html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
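            <p>D</p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

counter = 1
all_leafs = []
while True:
    # Look up every heading at the current level: h1, then h2, and so on.
    htag = 'h%d' % counter
    hgroups = soup.find_all(htag)
    print(htag, len(hgroups))
    counter += 1
    # Stop at the first heading level that does not occur in the document.
    if len(hgroups) == 0:
        break
    for hgroup in hgroups:
        # find_all_next() yields every element after the heading in document order.
        for descendant in hgroup.find_all_next():
            name = getattr(descendant, "name", None)
            if name == 'p':
                all_leafs.append((hgroup.get_text(), descendant.get_text()))
print(all_leafs)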

...

h1 1
h2 2
h3 2
h4 0
[('House rule', 'A'), ('House rule', 'B'), ('House rule', 'C'), ('House rule', 'D'), ('Rule 1', 'A'), ('Rule 1', 'B'), ('Rule 1', 'C'), ('Rule 1', 'D'), ('Rule 2', 'C'), ('Rule 2', 'D'), ('Rule 2.1', 'C'), ('Rule 2.1', 'D'), ('Rule 2.2', 'D')]
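
Note that the output above pairs each heading with every paragraph that comes later in the document, so 'Rule 1' is also paired with 'C' and 'D'. If each paragraph should be paired only with the headings it actually sits under, one possible approach is to keep a stack of the currently open headings while reading the document. The following is a minimal sketch of that idea, not the answer's method: it uses the standard library's html.parser instead of BeautifulSoup so that it has no dependencies, the names HeadingPathParser and heading_paths are made up for illustration, and it assumes every <p> and heading has an explicit closing tag.

```python
from html.parser import HTMLParser

# Heading tag name -> nesting level.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingPathParser(HTMLParser):
    """Hypothetical helper: pair each <p> text with the headings open above it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # [(level, heading text)], outermost heading first
        self.pairs = []   # [(tuple of heading texts, paragraph text)]
        self._tag = None  # heading or <p> tag whose text is being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        # Text outside a heading or paragraph is ignored.
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # e.g. an inline </b> inside the paragraph being collected
        text = "".join(self._buf).strip()
        self._tag = None
        if tag == "p":
            path = tuple(title for _, title in self.stack)
            self.pairs.append((path, text))
        else:
            level = HEADINGS[tag]
            # A new heading closes every open heading at the same or a deeper level.
            while self.stack and self.stack[-1][0] >= level:
                self.stack.pop()
            self.stack.append((level, text))


def heading_paths(html):
    """Return [(heading path, paragraph text)] in document order."""
    parser = HeadingPathParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample_html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
            <p>D</p>'''

for path, text in heading_paths(sample_html):
    print(" > ".join(path), "|", text)
```

For the sample document this pairs 'C' only with the path House rule > Rule 2 > Rule 2.1, and 'A' and 'B' only with House rule > Rule 1. The same stack idea should carry over to a BeautifulSoup traversal if that library is preferred.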
