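<p>My current solution uses pandas' MultiIndex feature. I'm sure it could be improved by making more effective use of numpy, but I believe it will still perform better than the other pure-Python answers:</p>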
<pre><code>import pandas as pd
import numpy as np
# An example data set
df = pd.DataFrame({"sentences": [
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
    "the temperature at which a liquid boils",
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
    "a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})
# Create a new column holding the list of words in each entry of the "sentences" column
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))
# All distinct words in the dataset. Each word becomes its own level of the MultiIndex
names = np.unique(df['words'].sum())
# Create an array of tuples, one tuple for each row of data
# Each tuple contains True if the row has that word in it, and False if it does not
values = df['words'].map(
    lambda words: np.vectorize(
        lambda word: word in words
    )(names)
)
# Build a MultiIndex with one boolean level per word
index = pd.MultiIndex.from_tuples(values, names=names)
# Add the MultiIndex without creating a new data frame
df.set_index(index, inplace=True)
# Find all the rows that have the word 'temperature'
# (exact token match: splitting on " " keeps punctuation, so "temperature." is a different level)
xs = df.xs(True, level='temperature')
print(xs.to_string(index=False))
</code></pre>
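<p>As a sanity check on what the <code>xs</code> lookup returns, the same rows can be selected with a plain boolean mask. This is only a minimal sketch and not part of the solution above: the shortened three-sentence data set and the names <code>mask</code> and <code>matches</code> are made up for illustration. It also shows that a token with trailing punctuation, such as <code>temperature.</code>, does not match.</p>

```python
import pandas as pd

# Shortened, illustrative data set (an assumption, not the full example above)
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature. Measurements are often expressed as a number",
    "two long pieces of metal fixed together",
]})

# Same whitespace tokenization as in the answer above
df["words"] = df["sentences"].apply(lambda sentence: sentence.split(" "))

# True where the row's word list contains the exact token "temperature"
mask = df["words"].map(lambda words: "temperature" in words)

# Keep only the matching sentences
matches = df.loc[mask, "sentences"].tolist()
print(matches)
```

<p>The mask has to be recomputed for every word you look up, whereas the MultiIndex approach pays the cost of building the word table once, up front.</p>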