
Published 2024-05-17 17:24:50




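  • A "temperature" group containing all rows (0, 1, 2, 3 and 4)
  • A "freezes" group containing rows 2 and 4
  • A "the" group containing rows 0, 1, 2 and 3
  • A "metal" group containing only row 0
  • A group for every other word in the data set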
import pandas as pd

# An example data set
df = pd.DataFrame({"sentences": [
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
    "the temperature at which a liquid boils",
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
    "a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})

# Create a new series which is a list of words in each "sentences" column
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))

# Try to group by this new column
df.groupby('words')

# TypeError: unhashable type: 'list'

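However, my code throws an error (shown in the comment above). Since my task is somewhat complex, I know it probably involves more than just calling groupby(). Can someone help me build these word groups with pandas?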





lambda sentence: tuple(sentence.split())
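
The one-line suggestion above swaps the list for a tuple, which is hashable, so it can be used as a group key. A minimal sketch follows, assuming a shortened three-row data set (these sentences are illustrative, not the ones from the question). It suggests that the change removes the error but groups rows by their whole word sequence, not by individual word, so it likely does not produce the per-word groups the question asks for.

```python
import pandas as pd

# Hypothetical, shortened data set for illustration only
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "the temperature at which a liquid boils",
]})

# Tuples are hashable, so they can serve as group keys (lists cannot)
df["words"] = df["sentences"].apply(lambda sentence: tuple(sentence.split()))

# Each distinct tuple becomes one group: rows 0 and 2 share the same key
group_sizes = df.groupby("words").size()
print(group_sizes)
```

Only identical sentences end up in the same group here, which is why the other answers build one group per word instead.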



import pandas as pd
import numpy as np

# An example data set
df = pd.DataFrame({"sentences": [
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
    "the temperature at which a liquid boils",
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
    "a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})

# Create a new series which is a list of words in each "sentences" column
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))

# These are all the words in the dataset. Each word becomes its own level of the MultiIndex
names = np.unique(df['words'].sum())

# Create an array of tuples, one tuple for each row of data
# Each tuple contains True if the row has that word in it, and False if it does not
values = df['words'].map(
    lambda words: np.vectorize(
        lambda word:
            True if word in words else False)(names))

# Make a MultiIndex
index = pd.MultiIndex.from_tuples(values, names=names)

# Add the MultiIndex without creating a new data frame
df.set_index(index, inplace=True)

# Find all the rows that have the word 'temperature'
xs = df.xs(True, level='temperature')
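
The MultiIndex above records, for every row, whether each word is present. A lighter sketch of the same idea keeps those True/False flags as ordinary columns in a separate frame and selects rows with a boolean mask. It assumes `Series.str.get_dummies` with a space separator and a shortened, illustrative data set; the names `has_word` and `with_temperature` are hypothetical.

```python
import pandas as pd

# Hypothetical, shortened data set for illustration only
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "water freezes and boils",
]})

# One column per distinct word; True where the row's sentence contains that word
has_word = df["sentences"].str.get_dummies(sep=" ").astype(bool)

# Rows containing the word 'temperature', selected with a boolean mask
with_temperature = df[has_word["temperature"]]
print(with_temperature)
```

This leaves the original index of `df` untouched, which may be easier to work with than a MultiIndex that has one level per word.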



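Next, after collecting the lowercased words into a set, we use a dictionary comprehension to build a dictionary keyed by each word in that set. Each value is a DataFrame containing every sentence that contains the word. These frames are obtained by grouping on the boolean condition groupby(df.sentences.str.contains(word, case=False)) and then taking the group for which that condition is True.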

words = {word.lower() for sentence in df.sentences for word in sentence.split()}

word_dict = {word: df.groupby(df.sentences.str.contains(word, case=False)).get_group(True) 
             for word in words}

>>> word_dict['temperature']
0  two long pieces of metal fixed together, each ...
1            the temperature at which a liquid boils
2  a system for measuring temperature that is par...
3  a unit for measuring temperature. Measurements...
4  a system for measuring temperature in which wa...

>>> word_dict['freezes']
2  a system for measuring temperature that is par...
4  a system for measuring temperature in which wa...
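
Note that `str.contains` matches substrings, so a short word such as "a" also matches inside longer words like "water". If exact word matches are wanted, one possible sketch splits each sentence into tokens and groups the exploded tokens instead. It assumes `Series.explode` is available (pandas 0.25 or later) and uses a shortened, illustrative data set; `tokens`, `pairs` and `rows_by_word` are hypothetical names.

```python
import pandas as pd

# Hypothetical, shortened data set for illustration only
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "water freezes and boils",
]})

# One entry per (row label, lowercased token)
tokens = df["sentences"].str.lower().str.split().explode()

# Turn the row labels into a column named "index" next to the token column "word"
pairs = tokens.rename("word").reset_index()

# For each word, the row labels of the sentences that contain it as a whole token
rows_by_word = pairs.groupby("word")["index"].unique()

print(rows_by_word["a"].tolist())      # [0, 1]
print(rows_by_word["boils"].tolist())  # [0, 2]

# Substring matching, by contrast, also hits "water" and "and" in row 2
substring_hits = df["sentences"].str.contains("a", regex=False)
print(substring_hits.tolist())         # [True, True, True]
```

The matching sentences for a word can then be retrieved with `df.loc[rows_by_word["boils"]]`.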

>>> words




>>> [df.sentences.str.contains(word, case=False).tolist() for word in word_dict]
[[False, False, True, False, True],
 [False, False, False, True, False],
 [True, False, False, False, False],
 [False, False, True, False, False],
