
Beautifulsoup: Findall Doesn't Find The Tags

I'm sorry about the many questions I post, but I have no idea what to do about this bug: when testing this page, with simple ps ab=soup.find('article', {'itemprop':'articleBody'})

Solution 1:

The issue is the parser:

In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")

In [22]: soup = BeautifulSoup(req.content, "lxml")

In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26

In [24]: soup = BeautifulSoup(req.content, "html.parser")

In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1

In [26]: soup = BeautifulSoup(req.content, "html5lib")

In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26

You can see that html5lib and lxml find all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article HTML through the W3C validator (validator.w3.org) produces a long list of errors.
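
To repeat this comparison on your own page, something like the following sketch may help. It assumes bs4 is installed; count_paragraphs and SELECTOR are hypothetical names chosen for illustration, and parsers that are not installed (lxml and html5lib are optional packages) are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same selector as in the session above (attribute value quoted for safety).
SELECTOR = 'article[itemprop="articleBody"] p'


def count_paragraphs(html, parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: number of matching <p> tags} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser library is not installed.
            continue
        counts[name] = len(soup.select(SELECTOR))
    return counts
```

Calling count_paragraphs(req.content) on the downloaded page gives one count per parser. If the counts disagree, the markup is probably malformed and one of the more lenient parsers is likely the safer choice.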

Solution 2:

Your code returns only one p because the parsed soup itself contains only one paragraph inside the article. You can check this by printing what was parsed:

ab = soup.find("article", {"itemprop": "articleBody"})
print(ab)

The output is:

<article class="content link-underline relative body-copy" data-js="content" itemprop="articleBody"><p>Not every update about a superhero movie is worthy of great attention. Take, for example, <a href="http://www.slashfilm.com/aquaman-setting/">the revelation</a> that not all of <em>Aquaman</em> will take place underwater</p></article>

You are searching inside the article tag, and in the parsed tree the article is closed right after the first paragraph, so the search stops at that closing tag. It therefore returns 1 as the number of p tags, which is correct for the tree your code is actually searching.
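
One way to confirm that the article was closed early is to compare the paragraphs inside it with the paragraphs in the whole document. This is only a sketch, assuming bs4 is installed; paragraph_counts is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def paragraph_counts(html, parser="html.parser"):
    """Return (paragraphs inside the article, paragraphs in the whole document)."""
    soup = BeautifulSoup(html, parser)
    article = soup.find("article", {"itemprop": "articleBody"})
    # find() returns None when no matching article exists in the parsed tree.
    inside = len(article.find_all("p")) if article is not None else 0
    total = len(soup.find_all("p"))
    return inside, total
```

If the first number is much smaller than the second, the missing paragraphs were probably parsed as siblings of the article rather than as its children, which points to a parser problem and not to a wrong search.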
