Encoding as UTF-8 when writing CSV

import json
import sys
import csv
import codecs

def main():
    writer = csv.writer(codecs.getwriter("utf-8")(sys.stdout), delimiter="\t")
    for line in sys.stdin:
        line = line.strip()
        data = []
        try:
            data.append(json.loads(line))
        except ValueError as detail:
            continue
        for tweet in data:
            ## deletes any rate limited data
            if tweet.has_key('limit'):
                pass
            else:
                writer.writerow([
                    tweet['id_str'],
                    tweet['user']['screen_name'],
                    tweet['text']
                ])

if __name__ == '__main__':
    main()

2 Answers

User

#1 · Edited 2024-04-27 17:01:17

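I had the same problem. I have a large amount of data from the Twitter firehose, so every possible complication can show up (and already has)!

I solved it with try/except, as follows:

If the dict value is a string (if isinstance(value, basestring)), I try to encode it directly. If it is not a string, I turn it into a string and then encode it.

If that fails, it is because some joker is tweeting odd symbols to mess up my script. In that case, for a string value I first decode it and then re-encode it, value.decode('utf-8').encode('utf-8'); for a non-string value I decode it, turn it into a string, and re-encode it, str(value.decode('utf-8')).encode('utf-8').

Try this: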

import csv

def export_to_csv(list_of_tweet_dicts,export_name="flat_twitter_output.csv"):

    utf8_flat_tweets=[]
    keys = []

    for tweet in list_of_tweet_dicts:
        tmp_tweet = tweet
        for key,value in tweet.iteritems():
            if key not in keys: keys.append(key)

            # convert fields to utf-8 if text
            try:
                if isinstance(value,basestring): 
                    tmp_tweet[key] = value.encode('utf-8')
                else:
                    tmp_tweet[key] = str(value).encode('utf-8')
            except UnicodeError:  # raised when the value is not plain ASCII / already encoded
                if isinstance(value,basestring):
                    tmp_tweet[key] = value.decode('utf-8').encode('utf-8')
                else:
                    tmp_tweet[key] = str(value.decode('utf-8')).encode('utf-8')

        utf8_flat_tweets.append(tmp_tweet)
        del tmp_tweet

    list_of_tweet_dicts = utf8_flat_tweets
    del utf8_flat_tweets

    with open(export_name, 'wb') as f:  # the Python 2 csv module expects a binary-mode file
        dict_writer = csv.DictWriter(f, fieldnames=keys,quoting=csv.QUOTE_ALL)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_tweet_dicts)

    print "exported tweets to '"+export_name+"'"

    return list_of_tweet_dicts
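The function above is written for Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It is an adaptation under stated assumptions, not the code this answer was tested with: the name `export_to_csv_py3` and the helper `to_text` are hypothetical, and any `bytes` values are assumed to hold UTF-8 data. In Python 3 the file object does the encoding, so the per-field `encode` calls go away.

```python
import csv


def export_to_csv_py3(list_of_tweet_dicts, export_name="flat_twitter_output.csv"):
    """Python 3 sketch: write a list of flat tweet dicts to a UTF-8 CSV file."""
    # Collect column names in first-seen order across all tweets.
    keys = []
    for tweet in list_of_tweet_dicts:
        for key in tweet:
            if key not in keys:
                keys.append(key)

    def to_text(value):
        # Assumption: bytes hold UTF-8; undecodable bytes become U+FFFD.
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
        return str(value)

    # Build new row dicts instead of mutating the caller's tweets.
    rows = [{k: to_text(v) for k, v in tweet.items()}
            for tweet in list_of_tweet_dicts]

    # newline="" is what the csv module documentation recommends;
    # encoding="utf-8" makes the file object encode every field on write.
    with open(export_name, "w", encoding="utf-8", newline="") as f:
        dict_writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        dict_writer.writeheader()
        dict_writer.writerows(rows)

    return rows
```

Tweets that lack a key present in other tweets get an empty cell for that column, which is the default `DictWriter` behaviour.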

Hope this helps.

User

#2 · Edited 2024-04-27 17:01:17

From the documentation: https://docs.python.org/2/howto/unicode.html

a = u"string"  # a unicode object; encode() turns it into UTF-8 bytes

encodedstring = a.encode('utf-8')
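As a small illustration of what `encode` does, here is a round trip written for Python 3, where `str` is already Unicode and `encode` returns `bytes`. The sample text and variable names are made up for the example.

```python
text = "na\u00efve caf\u00e9 \u2603"   # str: a sequence of Unicode code points
encoded = text.encode("utf-8")         # bytes: the UTF-8 serialization
decoded = encoded.decode("utf-8")      # back to the original str

# Non-ASCII characters occupy more than one byte in UTF-8,
# so the byte string is longer than the text.
size_difference = len(encoded) - len(text)
```

Decoding with the same codec returns the original text, which is why the bytes can be written to a file and read back without loss.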

If that doesn't work:

Python DictWriter writing UTF-8 encoded CSV files
