Is there a simpler way to build a dictionary from strings and vectorize the strings? (Python)

Asked 2025-04-17 19:01

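My question is about building a dictionary from strings. It leans toward linguistics or natural language processing and is not the same as the usual "create a dict from a string" question.

Given a list of string sentences, is there a simpler way to build a dictionary of unique words and then turn the sentences into vectors? I know there are external libraries that do this, such as gensim, but I would like to avoid them. This is what I am doing now: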

from itertools import chain

def getKey(dic, value):
  return [k for k,v in sorted(dic.items()) if v == value]

# Vectorize will return a list of tuples and each tuple is made up of
# (<position of word in dictionary>,<number of times it occurs in sentence>)
def vectorize(sentence, dictionary): # is there simpler way to do this?
  vector = []
  for word in sentence.split():
    word_count = sentence.lower().split().count(word)
    dic_pos = getKey(dictionary, word)[0]
    vector.append((dic_pos,word_count))
  return vector

s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?
dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

v1 = vectorize(s1, dictionary)
v2 = vectorize(s2, dictionary)
v3 = vectorize(s3, dictionary)

print(v1)
print(v2)
print(v3)

4 Answers

There are several problems in your code, so let's take them one at a time.

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?

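First, it is probably simpler (although about the same amount of code) to split() each string independently, rather than joining them and then splitting the result.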

uniq = list(set(chain(*map(str.split, (s1, s2, s3)))))

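Also, you only ever seem to use the word lists rather than the actual sentences, so you end up splitting in several places. Why not split them all once, up front?

Meanwhile, since you have to pass s1, s2, and s3 around anyway, why not put them in a collection? You can put the results in a collection too.

So: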

sentences = (s1, s2, s3)
wordlists = [sentence.split() for sentence in sentences]

uniq = list(set(chain.from_iterable(wordlists)))

# ...

vectors = [vectorize(sentence, dictionary) for sentence in sentences]
for vector in vectors:
    print(vector)

dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

You can call dict() on a list comprehension, but a dict comprehension is simpler still. While you're at it, use enumerate instead of the for i in range(len(uniq)) loop.

dictionary = {idx: word for (idx, word) in enumerate(uniq)}

This replaces the entire # ... part of the code above.
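As a quick check of the two spellings mentioned here, the sketch below builds the index-to-word mapping both ways. The `uniq` list is a hand-written stand-in rather than the output of the set-based code, so the indices are only illustrative.

```python
# Hypothetical stand-in for the list of unique words.
uniq = ["this", "is", "a", "foo"]

# Dict comprehension over enumerate, as suggested in this answer.
by_comprehension = {idx: word for idx, word in enumerate(uniq)}

# dict() accepts any iterable of (key, value) pairs, so it can consume
# enumerate directly without an intermediate list.
by_constructor = dict(enumerate(uniq))

print(by_comprehension == by_constructor)
```

Both forms produce the same mapping; which one reads better is a matter of taste.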

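Also, if you want reverse dictionary lookup, this is not the way to do it, because it sorts and scans the whole dictionary on every call: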

def getKey(dic, value):
    return [k for k,v in sorted(dic.items()) if v == value]

Instead, create an inverse dictionary that maps each value to a list of keys.

from collections import defaultdict

def invert_dict(dic):
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

Then, instead of calling your getKey function, just do a normal lookup in the inverse dictionary.
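A small sketch of that lookup, assuming a hand-written `dictionary` in which one word deliberately appears under two keys. The sample data is made up for illustration; it shows why the inverse maps each value to a list of keys.

```python
from collections import defaultdict

def invert_dict(dic):
    """Map each value of dic to the list of keys that hold it."""
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

# Hypothetical index-to-word dictionary; "is" is stored twice on purpose.
dictionary = {0: "this", 1: "is", 2: "a", 3: "is"}
inverse = invert_dict(dictionary)

# A plain key lookup replaces getKey(dictionary, word)[0].
pos = inverse["this"][0]
print(pos, sorted(inverse["is"]))
```

Note that looking up a missing word in a defaultdict returns (and stores) an empty list instead of raising KeyError, so an unknown word fails at the `[0]` step with IndexError.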

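If you need to alternate between modifying and looking up, you probably want some kind of bidirectional dictionary that manages its own inverse. There are many recipes for this on ActiveState, and there may be modules on PyPI as well, but it is not hard to build one yourself. In any case, you don't seem to need that here.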
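For readers curious what "building one yourself" might look like, here is a minimal sketch. The class name `BiDict` and the method `key_for` are hypothetical, and the sketch assumes a strict one-to-one mapping (each key has one value and each value has one key), which is not the only possible design.

```python
class BiDict(object):
    """Minimal one-to-one mapping that keeps its inverse up to date."""

    def __init__(self):
        self.forward = {}   # key -> value
        self.backward = {}  # value -> key

    def __setitem__(self, key, value):
        # Drop any stale pair that involves this key or this value,
        # so both dicts always describe the same set of pairs.
        if key in self.forward:
            del self.backward[self.forward[key]]
        if value in self.backward:
            del self.forward[self.backward[value]]
        self.forward[key] = value
        self.backward[value] = key

    def __getitem__(self, key):
        return self.forward[key]

    def key_for(self, value):
        return self.backward[value]

bd = BiDict()
bd[0] = "this"
bd[1] = "is"
bd[0] = "that"  # overwrites the pair (0, "this")
print(bd[0], bd.key_for("is"))
```

Every write touches both dicts, so lookups in either direction stay constant-time even when writes and reads are interleaved.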

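Finally, let's look at your vectorize function.

The first thing to do, as mentioned above, is to pass in a word list instead of a sentence that has to be split.

There is also no need to re-split the sentence after lower; just use a map or a generator expression over the word list.

In fact, I am not sure why you call lower here at all, since your dictionary was built from the original-case words. I am guessing that is a bug, and that you meant to call lower when building the dictionary too. That is one of the advantages of building the word lists up front in one easy-to-find place: you only need to change this one line: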

wordlists = [sentence.lower().split() for sentence in sentences]

Now your code is already a bit simpler:

def vectorize(wordlist, dictionary):
    vector = []
    for word in wordlist:
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        vector.append((dic_pos,word_count))
    return vector

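Meanwhile, you may notice that vector = [] ... for word in wordlist ... vector.append is exactly what a list comprehension is for. So how do you turn three lines of code into a list comprehension? Easy: refactor the loop body into a function. Like this: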

def vectorize(wordlist, dictionary):
    def vectorize_word(word):
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        return (dic_pos,word_count)
    return [vectorize_word(word) for word in wordlist]
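Putting the pieces of this answer together gives something like the sketch below. Two details are assumptions rather than part of the answer: `vectorize` takes the inverse dictionary instead of calling getKey, and the unique words are sorted so that the indices are reproducible from run to run.

```python
from collections import defaultdict
from itertools import chain

def invert_dict(dic):
    """Map each value of dic to the list of keys that hold it."""
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

def vectorize(wordlist, inverse):
    """Return one (dictionary position, count in sentence) pair per word."""
    return [(inverse[word][0], wordlist.count(word)) for word in wordlist]

sentences = ("this is is a foo", "this is a a bar", "that 's a foobar")

# Lowercase and split once, up front.
wordlists = [sentence.lower().split() for sentence in sentences]

# sorted() is an addition here; a bare set has no stable order.
uniq = sorted(set(chain.from_iterable(wordlists)))
dictionary = {idx: word for idx, word in enumerate(uniq)}
inverse = invert_dict(dictionary)

vectors = [vectorize(wordlist, inverse) for wordlist in wordlists]
for vector in vectors:
    print(vector)
```

As in the original code, a word that occurs twice in a sentence still produces two identical tuples.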

Answered 2025-04-17 by Python大师


If you want to count how many times a word occurs in a sentence, use collections.Counter.
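A minimal sketch of a Counter-based vectorize follows. It assumes a word-to-index dictionary (written by hand here, and not what the question builds), and unlike the question's code it emits one tuple per distinct word rather than one per token.

```python
from collections import Counter

def vectorize(wordlist, word_to_index):
    """Return sorted (index, count) pairs, one per distinct word."""
    counts = Counter(wordlist)  # counts every word in a single pass
    return sorted((word_to_index[word], n) for word, n in counts.items())

# Hypothetical vocabulary, keyed by word so lookup is a plain dict access.
word_to_index = {"a": 0, "foo": 1, "is": 2, "this": 3}
wordlist = "this is is a foo".split()

vector = vectorize(wordlist, word_to_index)
print(vector)
```

Counter walks the sentence once, whereas calling list.count for every word rescans the sentence each time.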

Your code has some problems:

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?
dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

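The part above just creates a dictionary indexed by arbitrary numbers (they come from iterating over a set, which has no notion of an index). The dictionary is then accessed like this: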

def getKey(dic, value):
  return [k for k,v in sorted(dic.items()) if v == value]

This function ignores the whole point of a dictionary: you are supposed to look things up by key, not by value.
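One way to address both points is sketched below, under the assumption that alphabetical order is an acceptable choice of index: sort the unique words so the numbering is reproducible, and key the dictionary by word so the lookup goes from key to value.

```python
s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

words = " ".join([s1, s2, s3]).split()

# Sorting is an assumption made here; it gives every word a stable index
# instead of whatever order the set happens to iterate in.
vocabulary = sorted(set(words))

# Keyed by word, so finding a position is an ordinary key lookup.
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

print(word_to_index["foo"])
```

With this layout getKey is no longer needed at all.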

Also, the idea behind vectorize is unclear. What do you want this function to achieve? You asked for a simpler version of vectorize without telling us what it actually does.

Answered 2025-04-17 by Python大师


Here:

from itertools import chain, count

s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

# convert each sentence into a list of words, because the lists
# will be used twice, to build the dictionary and to vectorize
w1, w2, w3 = all_ws = [s.split() for s in [s1, s2, s3]]

# chain the lists and turn into a set, and then a list, of unique words
index_to_word = list(set(chain(*all_ws)))

# build the inverse mapping of index_to_word, by pairing it with a counter
word_to_index = dict(zip(index_to_word, count()))

# create the vectors of word indices and of word count for each sentence
v1 = [(word_to_index[word], w1.count(word)) for word in w1]
v2 = [(word_to_index[word], w2.count(word)) for word in w2]
v3 = [(word_to_index[word], w3.count(word)) for word in w3]

print(v1)
print(v2)
print(v3)

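A few points to keep in mind:

Dictionaries should only be traversed from key to value; if you need the other direction, create two dictionaries, one the inverse mapping of the other, and keep them both updated, as I did above;
If you need a dictionary whose keys are consecutive integers, just use a list (thanks, Jeff);
Never compute the same thing twice! If you will need a result again later, save it in a variable;
Use list comprehensions wherever you can, for performance, conciseness, and readability.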
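The sketch below illustrates these points on the same three sentences. Sorting the vocabulary and collecting all vectors in a single nested comprehension are choices made for this illustration, not part of the code above.

```python
sentences = ["this is is a foo", "this is a a bar", "that 's a foobar"]

# Split once and keep the result, since it is needed more than once.
all_ws = [s.split() for s in sentences]

# A list already maps consecutive integers to items, so no dict is needed
# in this direction. sorted() is an assumption, for reproducible indices.
index_to_word = sorted(set(word for ws in all_ws for word in ws))

# The inverse mapping, built once from the list.
word_to_index = {word: idx for idx, word in enumerate(index_to_word)}

# One nested comprehension instead of separate v1, v2, v3 lines.
vectors = [[(word_to_index[w], ws.count(w)) for w in ws] for ws in all_ws]

# Round trip: indexing the list recovers the original words.
decoded = [[index_to_word[idx] for idx, _ in vec] for vec in vectors]
print(decoded == all_ws)
```

The round-trip check at the end confirms that the list and the dictionary really are inverses of each other.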

Answered 2025-04-17 by Python大师
