Beautifulsoup: Findall Doesn't Find The Tags
I'm sorry about the many questions I post, but I have no idea what to do about this bug: when testing this page with a simple ab = soup.find('article', {'itemprop': 'articleBody'}) and then looking for the p tags inside it, findAll doesn't find them all.
Solution 1:
The issue is the parser:
In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")
In [22]: soup = BeautifulSoup(req.content, "lxml")
In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26
In [24]: soup = BeautifulSoup(req.content, "html.parser")
In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1
In [26]: soup = BeautifulSoup(req.content, "html5lib")
In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26
You can see that html5lib and lxml get all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article HTML through the W3C validator (validator.w3.org) produces a lot of output.
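If you want to repeat this comparison on your own markup, something like the sketch below may help. The helper name count_by_parser and the sample string are made up for illustration and are not part of the original answer. The sample is deliberately well-formed, so every installed parser should agree on it. Pass in the real page content to look for the kind of disagreement shown in the session.

```python
from bs4 import BeautifulSoup, FeatureNotFound

SELECTOR = 'article[itemprop="articleBody"] p'


def count_by_parser(markup, selector, parsers=("lxml", "html.parser", "html5lib")):
    """Map each parser name to the number of selector matches.

    Parsers that are not installed are reported as None.
    (Hypothetical helper, not part of BeautifulSoup itself.)
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[name] = None
            continue
        counts[name] = len(soup.select(selector))
    return counts


# Hand-written, valid sample markup (an assumption, not the Wired page).
sample = (
    '<article itemprop="articleBody">'
    "<p>first</p><p>second</p>"
    "</article>"
)

counts = count_by_parser(sample, SELECTOR)
print(counts)
```

If the counts differ between parsers for a given page, that is a strong hint the markup is broken and that html5lib or lxml is the safer choice for it.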
Solution 2:
Your code returns only one p because, when you parse the soup and look at what was actually parsed, the article contains only one paragraph. See the code below:
ab = soup.find("article", {"itemprop": "articleBody"})
print(ab)
The output is:
<article class="content link-underline relative body-copy" data-js="content" itemprop="articleBody"><p>Not every update about a superhero movie is worthy of great attention. Take, for example, <a href="http://www.slashfilm.com/aquaman-setting/">the revelation</a> that not all of <em>Aquaman</em> will take place underwater</p></article>
Because you are searching under the article tag, and in the parsed soup that element ends at its closing article tag right after the first paragraph, the search returns 1 as the number of p tags, which is correct for your current code.
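To illustrate the scoping point, here is a minimal sketch on a hand-written string (an assumption, not the actual Wired page) that mimics a tree where the article was closed after its first paragraph. A search on the article element only sees its own descendants, while a search on the whole soup sees every p tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one p inside the article, two more after it closes.
html = (
    "<div>"
    '<article itemprop="articleBody"><p>one</p></article>'
    "<p>two</p><p>three</p>"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
ab = soup.find("article", {"itemprop": "articleBody"})

# find_all on a tag searches only that tag's descendants.
inside_article = len(ab.find_all("p"))
# find_all on the soup searches the whole document.
in_whole_document = len(soup.find_all("p"))

print(inside_article, in_whole_document)
```

Printing ab, as above, is a quick way to check whether the parser put the paragraphs where you expected them.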