
Python Selenium ().text Returns "â€™" Instead Of Apostrophe (')

I am trying to collect the comments on Seeking Alpha (for example: https://seekingalpha.com/article/4243835-teslas-low-2019-capex-harm-growth-story-brand-value). I list one of the

Solution 1:

Your issue is that the apostrophe that's being misinterpreted is not a normal apostrophe character ' but instead the Unicode character for a right single quotation mark: ’ (U+2019). The reason it turns into mojibake is that you're decoding the content incorrectly. It's encoded in UTF-8 (so the character is represented by the three bytes \xe2\x80\x99), but you're decoding it with code page 1252 (where those same three bytes represent three separate characters: â, €, and ™).
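To make the mechanism concrete, here is a minimal sketch that reproduces the mis-decoding in plain Python, without Selenium. The sample string and variable names are illustrative only, and the characters are written as escapes so the snippet does not depend on the encoding of the file it is saved in.

```python
# U+2019 is RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe.
original = "America\u2019s"

# Encoding to UTF-8 turns that one character into three bytes.
utf8_bytes = original.encode("utf-8")   # b'America\xe2\x80\x99s'

# Decoding those bytes as code page 1252 maps each byte to its own character:
# 0xE2 -> U+00E2 (a with circumflex), 0x80 -> U+20AC (euro sign),
# 0x99 -> U+2122 (trade mark sign).
mojibake = utf8_bytes.decode("cp1252")

print(ascii(mojibake))
```

The printed value shows the three replacement code points sitting where the single quotation mark used to be.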

Since you haven't shown much code, I can't offer any suggestions on how to fix the decoding issue, but there is probably a way to request Selenium to use UTF-8 (I'm frankly surprised it's not the default). Alternatively, you might be able to get the raw bytes and decode the text yourself.

While it would be best to avoid the mis-decoding, if you really need to fix up your strings after they've been turned to mojibake, the best approach is probably to re-encode them the same way they were mis-decoded, then decode again, correctly this time:

badtext = 'Americaâ€™s'               # mojibake, as returned by .text
encoded = badtext.encode('cp1252')  # recovers the original bytes
goodtext = encoded.decode('utf-8')  # 'America’s'
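If the repair has to run over many strings, some of which may already be correct, the same round trip can be wrapped in a small helper. This is only a sketch: `fix_mojibake` is a hypothetical name, and it assumes the damage was always UTF-8 bytes mis-decoded as cp1252. Strings that cannot make the round trip are returned unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mis-decoding, if the string looks like one.

    Hypothetical helper: assumes cp1252 was the wrong codec that was used.
    """
    try:
        # Re-encode with the wrong codec to get the original bytes back,
        # then decode those bytes with the right one.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the text was probably never mis-decoded, so leave it alone.
        return text


repaired = fix_mojibake("America\u00e2\u20ac\u2122s")
untouched = fix_mojibake("America\u2019s")
```

The fallback matters because a string that is already correct usually fails one of the two steps. A correctly decoded ’ encodes to the single byte 0x92 in cp1252, which is not valid UTF-8 on its own. A rare legitimate string could still survive the round trip and be altered, so it is worth spot-checking the output.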
