Python Selenium ().text Returns "â€™" Instead Of Apostrophe (')

March 31, 2024

I am trying to collect the comments on Seeking Alpha (for example: https://seekingalpha.com/article/4243835-teslas-low-2019-capex-harm-growth-story-brand-value). I list one of the

Solution 1:

Your issue is that the apostrophe being misinterpreted is not the plain ASCII apostrophe ' but the Unicode right single quotation mark ’ (U+2019). It turns into mojibake because the content is being decoded with the wrong encoding. The text is UTF-8, where ’ is represented by the three bytes \xe2\x80\x99, but you're decoding it as code page 1252 (Windows-1252), where those same three bytes represent three separate characters: â, €, and ™.
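To see the mechanism in isolation, here is a minimal sketch that reproduces the mojibake on purpose. It uses only Python's built-in codecs; the variable names are illustrative, and code points are printed as hex so the output stays readable on any console.

```python
# The right single quotation mark and its UTF-8 encoding.
quote = "\u2019"
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x99'

# Decoding those same bytes as Windows-1252 yields three separate characters.
mojibake = utf8_bytes.decode("cp1252")
print(len(mojibake))                    # 3
print([hex(ord(c)) for c in mojibake])  # ['0xe2', '0x20ac', '0x2122']
```

The three code points are â (U+00E2), € (U+20AC), and ™ (U+2122), which is exactly the sequence that shows up in place of the apostrophe.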

Since you haven't shown much code, I can't offer specific suggestions on how to fix the decoding issue, but there is probably a way to tell Selenium to use UTF-8 (I'm frankly surprised it's not the default). Alternatively, you might be able to get the raw bytes and decode the text yourself.
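As a rough illustration of decoding the bytes yourself, the sketch below names the encoding explicitly at every step instead of relying on a platform default. The `raw` value is a stand-in for bytes from whatever source you have, and the temporary file is only there to show the same idea for text that passes through a file. None of this is specific to Selenium.

```python
import os
import tempfile

# Stand-in for raw bytes obtained from some source (assumed to be UTF-8).
raw = "America\u2019s".encode("utf-8")

# Decode explicitly rather than relying on a platform default.
text = raw.decode("utf-8")

# If the text is written to and read back from a file, name the encoding
# on both the write and the read.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(round_tripped == text)  # True
```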

While it would be best to avoid the mis-decoding, if you really need to fix up your strings after they've been turned to mojibake, the best approach is probably to re-encode them the same way they were mis-decoded, then decode again, correctly this time:

badtext = 'Americaâ€™s'
encoded = badtext.encode('cp1252')
goodtext = encoded.decode('utf-8') # 'America’s'
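If you need to apply this to many strings, some of which may already be correct, the round trip can be wrapped in a small helper. This is a sketch only: `fix_mojibake` is a hypothetical name, and it assumes the damage was specifically UTF-8 bytes read as cp1252. Strings that don't fit that pattern raise an encoding or decoding error and are returned unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-cp1252 damage; return the input unchanged otherwise."""
    try:
        # Recover the original bytes, then decode them the way they
        # should have been decoded in the first place.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # assume the string was never mis-decoded this way.
        return text


# "America" + U+00E2 U+20AC U+2122 + "s" is the mojibake form of America’s.
damaged = "America\u00e2\u20ac\u2122s"
print(fix_mojibake(damaged) == "America\u2019s")          # True
print(fix_mojibake("America\u2019s") == "America\u2019s")  # True (already correct)
```

One caveat: Python's cp1252 codec leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so text that was mis-decoded by a more lenient decoder may not survive this round trip.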
