Unicode text is displayed as u'xxxx instead of Japanese in Python 2.7

Question

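I am processing a large number of Japanese text files with Python and keep running into Unicode problems. I know that .encode("utf-8") turns Japanese text from the u'xxxx form back into readable Japanese, and I am not getting any encoding or decoding errors. However, when I read text from a Unicode file, process it, and write it to a new file, the text ends up as u'xxxx strings rather than the original Japanese. I have tried adding .encode() and .decode() in several places, and also leaving them out, but the result is always the same. Any suggestions are welcome.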

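Specifically, I am writing a spider with the Scrapy library. It extracts text from a file, builds the name of a new file from it, and then writes the contents of the first div of the HTML file to that new file as a string.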

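What confuses me even more is that the text fragments I use to build the file name all display correctly as Japanese, and the file name itself is in Japanese. Is the file content turning into u'xxxx because I call str() on the div? Please see the last part of the code.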
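One way to check the str() suspicion in isolation: .extract() returns a list, and in Python 2 calling str() on a list formats every element with repr(), so unicode elements come out as u'\u...' escape text. The snippet below is a minimal sketch, assuming a hypothetical list named parts that stands in for the extracted div text; it is not taken from the spider itself.

```python
# -*- coding: utf-8 -*-
# `parts` is a hypothetical stand-in for the list of unicode strings
# that response.xpath(...).extract() returns.
parts = [u"吾輩は", u"猫である"]

# str() on the whole list gives the list's repr-style text.
# On Python 2 this contains u'\u543e...' escapes; on Python 3 the
# characters are shown, but brackets and quotes are still included.
as_list_text = str(parts)

# Joining the elements yields a single unicode string with only the text.
body = u"".join(parts)
```

Comparing as_list_text with body in an interpreter should make it clear whether the escapes come from str() rather than from the file or the encoding step.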

Here is my full code (please ignore some of the uglier parts):

def parse_item(self, response):
    original = 0
    author = "noauthor"
    title = "notitle"
    year = "xxxx"
    publisher = "xxxx"
    typer = "xxxx"
    ispub = 0
    filename = response.url.split("/")[-1]
    if "_" in filename:
        filename = filename.split("_")[0]
        if filename.isdigit():
            title = response.xpath("//h1/text()").extract()[0].encode("utf-8")
            author = response.xpath("//h2/text()").extract()[0].encode("utf-8")
            ID = filename
            bibliographic_info = response.xpath("//div[2]/text()").extract()
            for subyear in bibliographic_info:
                ispub = 0
                subyear = subyear.encode("utf-8").strip()
                if "初出：" in subyear:
                    publisher = subyear.split("：")[1]
                    original = 1
                    ispub = 1
                if "入力：" in subyear:
                    typer = subyear.split("：")[1]
                if len(subyear) > 1 and (original == 1) and (ispub == 0):
                    counter = 0
                    while counter < len(subyear):
                        if subyear[counter].isdigit():
                            break
                        counter+=1
                    if counter != len(subyear):
                        year = subyear[counter:(counter+4)]
                    original = 0
    body = str(response.xpath("//div[1]/text()").extract())
    new_filename = author + "_" + title + "_" + publisher + "_" + year + "_" + typer + ".html"
    file = open(new_filename, "a")
    file.write(body.encode("utf-8"))
    file.close()
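
A possible way to write the div text out as readable Japanese is sketched below, under some assumptions: save_body and its parameters are hypothetical names and not part of Scrapy, and parts is assumed to be the list of unicode strings returned by .extract(). The sketch joins the list instead of passing it to str(), and relies on io.open, which is available in Python 2.7 and accepts an encoding argument.

```python
# -*- coding: utf-8 -*-
import io


def save_body(path, parts):
    """Append the extracted text fragments to `path` as UTF-8.

    Hypothetical helper: `parts` is assumed to be a list of unicode strings.
    """
    # Join the fragments instead of calling str() on the list,
    # which would produce repr-style u'\u...' escapes on Python 2.
    body = u"".join(parts)
    # io.open encodes the unicode text to UTF-8 when writing.
    with io.open(path, "a", encoding="utf-8") as out:
        out.write(body)
    return body
```

In parse_item this could replace the last lines, for example by passing new_filename and the result of response.xpath("//div[1]/text()").extract() to save_body.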


1 Answer
