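<p>You can use a set, so that each word is unique. First, we need all the words from all the sentences. To get them, we initialize <code>words</code> as an empty set and then use a list comprehension to add each lowercased word from each sentence (after splitting the sentence on whitespace).</p>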
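<p>Next, we use a dictionary comprehension to build a dictionary keyed on each word in the word set. Each value is a dataframe holding every sentence that contains that word. These are obtained by grouping on the condition, <code>groupby(df.sentences.str.contains(word, case=False))</code>, and then taking the group for which the condition is <code>True</code>.</p>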
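<pre><code># Collect every lowercased word across all sentences; the set keeps them unique.
words = set()
_ = [words.add(word.lower()) for sentence in df.sentences for word in sentence.split()]
# Map each word to the rows whose sentence matches it (case-insensitive).
word_dict = {word: df.groupby(df.sentences.str.contains(word, case=False)).get_group(True)
             for word in words}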
>>> word_dict['temperature']
                                           sentences
0  two long pieces of metal fixed together, each ...
1            the temperature at which a liquid boils
2  a system for measuring temperature that is par...
3  a unit for measuring temperature. Measurements...
4  a system for measuring temperature in which wa...
>>> word_dict['freezes']
                                           sentences
2  a system for measuring temperature that is par...
4  a system for measuring temperature in which wa...
>>> words
{'0',
'100',
'212\xc2\xba',
'32\xc2\xba',
'a',
'amount',
'and',
'are',
'as',
'at',
'bends',
...
</code></pre>
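<p>Note that <code>str.contains</code> treats its argument as a regular expression and matches substrings, so a short word such as <code>at</code> also matches inside <code>temperature</code>. The following is a minimal sketch of a whole-token variant, not the approach used above: it swaps <code>groupby(...).get_group(True)</code> for plain boolean indexing. The sample sentences and the names <code>rows_with_token</code> and <code>token_dict</code> are hypothetical.</p>

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "water freezes at 0 degrees",
]})

# Same tokenization as above: lowercase, split on whitespace.
words = {word.lower() for sentence in df.sentences for word in sentence.split()}


def rows_with_token(frame, word):
    """Return the rows whose sentence contains `word` as a whitespace-delimited token."""
    # re.escape keeps punctuation in a token from being read as regex syntax;
    # the lookarounds require whitespace or a string boundary on both sides.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return frame[frame.sentences.str.contains(pattern, case=False)]


token_dict = {word: rows_with_token(df, word) for word in words}

# Plain substring matching also finds "at" inside "temperature" in row 1.
substring_hits = df[df.sentences.str.contains("at", case=False)].index.tolist()
token_hits = token_dict["at"].index.tolist()
```

<p>With this sample data, <code>substring_hits</code> covers all three rows, while <code>token_hits</code> keeps only the rows where <code>at</code> appears as a word of its own.</p>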
<p>To get a dictionary of the index values for each word:</p>
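<p>A minimal sketch of one way to build such a dictionary, assuming the same case-insensitive <code>str.contains</code> matching used for <code>word_dict</code>. The sample sentences and the name <code>index_dict</code> are hypothetical.</p>

```python
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "water freezes at 0 degrees",
]})

# Unique lowercased words across all sentences.
words = {word.lower() for sentence in df.sentences for word in sentence.split()}

# For each word, keep only the index labels of the matching rows.
index_dict = {
    word: df[df.sentences.str.contains(word, case=False)].index.tolist()
    for word in words
}
```

<p>Each value is a plain list of row labels, so <code>index_dict['freezes']</code> can be passed straight to <code>df.loc</code> to recover the matching sentences.</p>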
<p>Or a boolean indicator matrix:</p>
<pre><code>>>> [df.sentences.str.contains(word, case=False).tolist() for word in word_dict]
[[False, False, True, False, True],
[False, False, False, True, False],
[True, False, False, False, False],
[False, False, True, False, False],
...
</code></pre>
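<p>A nested list loses track of which row belongs to which word. As a rough sketch, the same matrix can be kept in a labeled dataframe with one row per word and one column per sentence. The sample sentences and the name <code>indicator</code> are hypothetical.</p>

```python
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "water freezes at 0 degrees",
]})

# Sorted so that the row order of the matrix is reproducible.
words = sorted({word.lower() for sentence in df.sentences for word in sentence.split()})

# One boolean column per word, then transpose: rows are words, columns are sentence labels.
indicator = pd.DataFrame(
    {word: df.sentences.str.contains(word, case=False) for word in words}
).T
```

<p>A single word's row is then available as <code>indicator.loc['freezes']</code>, and <code>indicator.values.tolist()</code> gives back the nested-list form.</p>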