Incorrect string output in Python

import requests
from lxml import html

# df is assumed to be an existing pandas DataFrame with a 'ga:pagePath' column.
for i in range(30):
    page = requests.get('https://www.example.com' + df.loc[i, 'ga:pagePath'])
    tree = html.fromstring(page.content)
    postalcode2 = tree.xpath('//span[@itemprop="postalCode"]/text()')
    postalcode = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    if not postalcode2 and not postalcode:
        # Neither field was found on the page.
        print(postalcode, postalcode2)
    elif not postalcode2:
        postalcode4 = postalcode[0]
        # postalcode4 = postalcode4.replace(' ','')
        df.loc[i, 'postcode'] = postalcode4
    elif not postalcode:
        postalcode3 = postalcode2[0]
        if 'Â' not in postalcode3:
            postalcode3 = postalcode3.replace('\\xa0', '')
            postalcode3 = postalcode3.replace(' ', '')
        else:
            postalcode3 = postalcode3.replace('\\xa0Â', '')
            postalcode3 = postalcode3.replace(' ', '')
        df.loc[i, 'postcode'] = postalcode3

1 answer

User

#1 · Posted on 2024-05-23 21:37:24

You are probably not handling the web output correctly

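The content attribute of a requests.get response is a bytestring, but HTML content is text. If you do not decode the bytestring before building the HTML tree, you will very likely see extraneous characters that come from the encoding of the text. The right fix, however, is not to keep working with the bytestring and strip those characters afterwards, but to decode the incoming bytestring into text before calling html.fromstring.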
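To see where the stray characters come from, here is a small sketch that needs no network access. It assumes the page is UTF-8 encoded and gets misread as Latin-1, which is one common way a non-breaking space turns into 'Â' followed by '\xa0'. The sample postcode is made up for illustration.

```python
# Hypothetical sample: a postcode containing a non-breaking space (U+00A0).
text = "SW1A\xa02AA"

# What a server would send: the text encoded as UTF-8 bytes.
raw = text.encode("utf-8")

# Decoding with the wrong codec (Latin-1) produces the stray 'Â'.
wrong = raw.decode("latin-1")

# Decoding with the right codec recovers the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

If the output in the question looks like the first printed value, a mismatch of this kind is a likely cause, and decoding correctly removes the need for most of the replace calls.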

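You really should work out the correct encoding from the charset parameter of the Content-Type header, if there is one. As an experiment, you could try replacing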

tree = html.fromstring(page.content)

with

tree = html.fromstring(page.content.decode('utf-8'))

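since many websites use UTF-8 encoding. You may find that the results make more sense, and that you do not need to strip out so much "extraneous" stuff.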
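Rather than hard-coding 'utf-8', the encoding can be taken from the response headers, where it is normally advertised in the charset parameter of Content-Type. The following is a minimal sketch of that idea, not a definitive implementation: the helper names charset_from_content_type and decode_body are hypothetical, and falling back to UTF-8 when no charset is given is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` (an assumption) when no charset is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset named in the header."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The header named a codec Python does not know; use the fallback.
        return raw.decode(default, errors="replace")


# Hypothetical sample bytes standing in for page.content.
raw = "SW1A\xa02AA".encode("utf-8")
print(repr(decode_body(raw, "text/html; charset=UTF-8")))
```

With requests, the header value would be page.headers.get('Content-Type', ''), and the decoded string can then be passed to html.fromstring. requests also exposes page.encoding and the already decoded page.text, which may be enough when the server declares its charset correctly.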
