Optimization using dictionaries and the zip() function

0 votes
3 answers
657 views
Asked 2025-04-16 09:41

I have a function like this:

import os
import json
import re

def filetxt():
    word_freq = {}
    lvl1      = []
    lvl2      = []
    total_t   = 0
    users     = 0
    text      = []

    for l in range(0,500):
        # Open File
        if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                for i in range(len(text_f)):
                    text.append(text_f[str(i)]['text'])
                    total_t = total_t + 1
        else:
            pass

    # Filter
    occ = 0
    import string
    for i in range(len(text)):
        s = text[i] # Sample string
        a = re.findall(r'(RT)',s)
        b = re.findall(r'(@)',s)
        occ = len(a) + len(b) + occ
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("",""), string.punctuation)


        # Create Wordlist/Dictionary
        word_list = text[i].lower().split(None)

        for word in word_list:
            word_freq[word] = word_freq.get(word, 0) + 1

        keys = word_freq.keys()

        numbo = range(1,len(keys)+1)
        WList = ', '.join(keys)
        NList = str(numbo).strip('[]')
        WList = WList.split(", ")
        NList = NList.split(", ")
        W2N = dict(zip(WList, NList))

        for k in range (0,len(word_list)):
            word_list[k] = W2N[word_list[k]]
        for i in range (0,len(word_list)-1):
            lvl1.append(word_list[i])
            lvl2.append(word_list[i+1])

I profiled the code and found that most of the CPU time is spent in the zip() function and in the join and split parts of the code. I'd like to know whether there is anything I have overlooked that would make this run faster, since the biggest slowdown seems to be in the way I am handling the dictionary and the zip() function. Any help is much appreciated!
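For reference, the kind of profile described above can be produced with the standard cProfile module; a minimal sketch, assuming filetxt is importable in the current session:

import cProfile
import pstats

# Profile one call and show the 10 most expensive entries by cumulative time.
cProfile.run('filetxt()', 'filetxt.prof')
pstats.Stats('filetxt.prof').sort_stats('cumulative').print_stats(10)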

Also, the basic purpose of this function is to load files that each contain roughly 20 tweets, so I may end up processing somewhere around 20,000 to 50,000 files. As output I generate a list of all the distinct words, plus which words follow one another, for example:

1 "love"
2 "pasa"
3 "mirar"
4 "ants"
5 "kers"
6 "morir"
7 "dreaming"
8 "tan"
9 "rapido"
10 "one"
11 "much"
12 "la"
...
10 1
13 12
1 7
12 2
7 3
2 4
3 11
4 8
11 6
8 9
6 5
9 20
5 8
20 25
8 18
25 9
18 17
9 2
...

3 Answers

0

A couple of points. These lines strike me as odd when taken together:

WList = ', '.join(keys)
<snip>
WList = WList.split(", ")

That should simply be WList = list(keys)
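A quick interactive check (with hypothetical values) that the round trip is equivalent for simple single-word keys:

>>> keys = ['love', 'pasa', 'mirar']
>>> ', '.join(keys).split(', ') == list(keys)
True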

Are you sure you need to optimize this? I mean, is it really so slow that it is worth your time? Finally, a description of what the script is supposed to do would be nice, rather than letting us guess it from the code :)

1

I hope you don't mind that I took the liberty of modifying your code into something closer to what I would write myself.

import json
import re
from itertools import izip
def filetxt():
    # keeps track of word count for each word.
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map  = {}
    lvl1      = []
    lvl2      = []
    total_t   = 0
    users     = 0
    text      = []

    ####### You should replace this with a glob (see: glob module)
    for l in range(0,500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one.
                for t in text_f.itervalues():
                    text.append(t['text'])  ## CHECK THIS
        except IOError:
            pass

    total_t = len(text)
    # Filter
    occ = 0
    import string
    for s in text:
        a = re.findall(r'(RT)',s)
        b = re.findall(r'(@)',s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("",""), string.punctuation)


        # make a list of words that are in the text s
        words = s.lower().split(None)

        for word in words:
            # try/except is quicker when we expect not to miss
            # and it will be rare for us not to have
            # a word in our list already.
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)


        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])

What this code gives you is the following: lvl1, for example, is a list of numbers that correspond to words in word_list, so word_list[lvl1[0]] is the first word of the first tweet you processed. lvl2[0] is the index of the word that follows lvl1[0], so you can say that word_list[lvl2[0]] is the word that comes after word_list[lvl1[0]]. This code essentially builds up word_map, word_list and word_freq as it goes.

Please note that what you were doing before, in particular the way you were creating W2N, is not correct. Dictionaries are not ordered. Ordered dictionaries are coming in 3.1, but just forget about that for now. Basically, when you call word_freq.keys(), the order can change every time you add a new word, so there is no consistency. See this example,

>>> x = dict()
>>> x[5] = 2
>>> x
{5: 2}
>>> x[1] = 24
>>> x
{1: 24, 5: 2}
>>> x[10] = 14
>>> x
{1: 24, 10: 14, 5: 2}
>>>

So 5 used to be second, but now it is third.

I also updated it to use a 0 index instead of a 1 index. I don't know why you were using range(1, len(...)+1) rather than just range(len(...)).

Regardless, you should get away from the traditional C/C++/Java habit of thinking about loops in terms of numbers. If you don't need the index, then you don't need it.

Rule of thumb: if you need an index, you probably also need the element at that index, and in that case you should use enumerate. (link)
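Applied to the indexing loop from the question, that suggestion would look roughly like this (a sketch reusing W2N and word_list from the original code):

# instead of: for k in range(0, len(word_list)): word_list[k] = W2N[word_list[k]]
for k, word in enumerate(word_list):
    word_list[k] = W2N[word]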

Hope this helps...
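As for the glob suggestion in the code comment above, a minimal sketch of what the file loop could become (assuming the same directory layout) is:

import glob

# Loop over whichever user_*.json files actually exist instead of
# probing 500 numbered filenames one by one.
for path in glob.glob("C:/Twitter/json/user_*.json"):
    with open(path, "r") as f:
        text_f = json.load(f)
        # ... process text_f exactly as in the loop body above ...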

2

I think what you want is something more like this:

import json
import string
from collections import defaultdict
try:
    rng = xrange   # Python 2: prefer the lazy xrange
except NameError:
    rng = range    # Python 3: range is already lazy

def filetxt():
    users     = 0
    total_t   = 0
    occ       = 0

    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json",'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)

                for txt in (r['text'] for r in tweets.itervalues()):
                    occ += txt.count('RT') + txt.count('@')
                    prev = None
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            pass
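To turn wordcount and wordpairs into the numbered word list and index pairs shown in the question, one possible ending for filetxt (a sketch: 1-based numbering as in the question, each distinct pair printed once) is:

    # Number the distinct words (the numbering order is arbitrary but consistent).
    index = {}
    for i, word in enumerate(wordcount, 1):
        index[word] = i
        print i, '"%s"' % word
    # Print an index pair for every word and a word that followed it.
    for prev, following in wordpairs.items():
        if prev is None:
            continue  # sentinel used before the first word of each tweet
        for word in following:
            print index[prev], index[word]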
