我对Python完全陌生,当我使用gzip.open()
来处理.gz文件时,会得到一些类似"It’s one of those great ensemble casts that’s incredibly balanced"
的代码。你知道吗
我该怎么处理?我使用的代码:
def _review_reader(file_path):
gz = gzip.open(file_path)
for l in gz:
yield eval(l)
该文件是从json文件压缩的
比如:
{"reviewerID": "A11N155CW1UV02", "asin": "B000H00VBQ", "reviewerName": "AdrianaM", "helpful": [0, 0], "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn\'t appeal to me at all.", "overall": 2.0, "summary": "A little bit boring for me", "unixReviewTime": 1399075200, "reviewTime": "05 3, 2014"}\n
{"reviewerID": "A3BC8O2KCL29V2", "asin": "B000H00VBQ", "reviewerName": "Carol T", "helpful": [0, 0], "reviewText": "I highly recommend this series. It is a must for anyone who is yearning to watch \\"grown up\\" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.", "overall": 5.0, "summary": "Excellent Grown Up TV", "unixReviewTime": 1346630400, "reviewTime": "09 3, 2012"}\n
....
我想得到评论文本,但是有一些代码像’
因为您正在查看JSON数据,所以请使用Python的JSON解析器来加载它。它将自动处理任何嵌入的转义字符,例如
\n
或\"
。你知道吗在读取gzip文件时,重要的是要意识到gzip提供了原始字节。必须通过调用
.decode()
将这些字节显式地调整为文本,为了正确地执行此操作,您需要知道JSON使用了什么文本编码。UTF-8是一个非常安全的默认假设,但是它可能是其他的,这取决于在编写.gz文件时选择了什么。你知道吗解析完JSON后,您可以通过名称访问属性:
如果} module 可以帮助:
reviewText
恰好包含HTML代码而不是纯文本,那么可能需要另一个步骤—HTML解析。^{相关问题 更多 >
编程相关推荐