Matching singular and plural forms with Pandas

Posted 2024-04-26 04:46:23

This question is a follow-up to my previous question, Multiple Phrases Matching Python Pandas. An answer there solved the original problem, but some typical singular/plural issues have come up.

import numpy as np
import pandas as pd

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])

df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

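I only need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:

If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient; otherwise, return false.

The answer was as follows: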

^{pr2}$

I also used the following to fill empty cells with NaN so that I can filter the data easily.

df.loc[df.existence=='', 'existence'] = np.nan

The result is:

print(df)
                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...              NaN
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries              NaN

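This works correctly as long as the plural is formed by simply appending an s, as in almond => almonds or apple => apples. When the plural is irregular, as in strawberry => strawberries, this code returns NaN.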
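
The code from the earlier answer is not reproduced on this page, so the following is only a sketch of the failure mode, under the assumption that the matching allowed an optional trailing s after each ingredient. The names `pattern` and `find` are hypothetical.

```python
import re

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]

# Assumed matching rule: an ingredient followed by an optional "s".
pattern = re.compile(
    r"\b(" + "|".join(re.escape(i) for i in ingredients) + r")s?\b"
)

def find(phrase):
    """Return the matched ingredient, or None when nothing matches."""
    m = pattern.search(phrase)
    return m.group(1) if m else None

print(find("2 eggs"))                # regular plural: matched
print(find("2 small strawberries"))  # y -> ies plural: not matched
```

A rule of this kind cannot see that "strawberries" belongs to "strawberry", because the singular is not a prefix of the plural.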

To improve my code so that it detects such cases, I would like to change my ingredients Series into a {}, as shown below.

#ingredients

#inputwords       #outputword

vanilla extract    vanilla extract 
walnut             walnut
walnuts            walnut
oat                oat
oats               oat
egg                egg
eggs               egg
almond             almond
almonds            almond
strawberry         strawberry
strawberries       strawberry
cherry             cherry
cherries           cherry
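
A table like the one above could be generated rather than typed by hand. The sketch below is one possible way to do that; `naive_plural` and `build_mapping` are hypothetical helpers, and the pluralization rules are deliberately rough (they will not handle irregular nouns such as "leaf" or "tomato").

```python
def naive_plural(word):
    """Hypothetical helper: rough English pluralization rules."""
    # consonant + y  ->  ies  (strawberry -> strawberries)
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    # sibilant endings take "es" (peach -> peaches)
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"

def build_mapping(singulars):
    """Map each singular and its guessed plural back to the singular."""
    mapping = {}
    for name in singulars:
        mapping[name] = name
        mapping[naive_plural(name)] = name
    return mapping

mapping = build_mapping(
    ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry", "cherry"]
)
print(mapping["strawberries"])
```

Unlike the hand-written table, this also adds a plural entry for multi-word ingredients ("vanilla extracts"), which is harmless for lookup purposes.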

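So my logic is: whenever one of the words under #inputwords appears in a phrase, I want to return the word in the other cell. In other words, when strawberry or strawberries appears in a phrase, the code outputs the word {} next to it. My final result would then be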

                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...              NaN
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries       strawberry

I could not find a way to merge this functionality into my existing code or to write new code for it. Can anyone help me?

2 Answers
import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

# Here you create the mapping: every accepted spelling (index) points to the word to return (data)
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] , 
          data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])

# create a function that checks whether any key of the mapping occurs in the row's phrase
def get_match(row):
    match = np.nan
    # Series.iterkv() no longer exists in current pandas; items() yields the same (key, value) pairs
    for key , value in mapping.items():
        if key in row[0]:
            match = value
    return match

# apply this function on each row and store the result in a new column
df['existence'] = df.apply(get_match, axis = 1)
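
The same lookup can also be written without a Python-level loop over rows. The following is a sketch of one alternative, not the answerer's method: it builds a single regular expression from the mapping keys and uses `Series.str.extract` followed by `Series.map`. The column name `val` is taken from the printed output in the question, and `pattern` is a name introduced here for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"val": [
    "1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
    "4 cups rolled oats",
    "1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf", "fsfgsgsfgfg", "2 small strawberries",
]})

mapping = {
    "vanilla extract": "vanilla extract",
    "walnut": "walnut", "walnuts": "walnut",
    "oat": "oat", "oats": "oat",
    "egg": "egg", "eggs": "egg",
    "almond": "almond", "almonds": "almond",
    "strawberry": "strawberry", "strawberries": "strawberry",
    "cherry": "cherry", "cherries": "cherry",
}

# One alternation of all accepted spellings, longest first, on word boundaries.
keys = sorted(mapping, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

# Extract the first spelling found in each phrase, then translate it
# to the output word; phrases without a match stay NaN.
found = df["val"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df["existence"] = found.str.lower().map(mapping)
print(df)
```

The word boundaries mean that "oat" is only reported for a standalone word, so a phrase containing "boat" would not be mistaken for an ingredient, which the plain substring test `key in row[0]` would allow.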

Consider using a stemmer :) http://www.nltk.org/howto/stem.html

Taken directly from their page:

    >>> from nltk.stem.snowball import SnowballStemmer
    >>> stemmer = SnowballStemmer("english")
    >>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
    >>> print(stemmer.stem("having"))
    have
    >>> print(stemmer2.stem("having"))
    having

Refactor your code so that every word in the sentence is stemmed before it is matched against the ingredient list.
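
A minimal sketch of that refactoring might look like the following. It assumes `nltk` is installed; the helper names `stem_text` and `find_ingredient` and the simple regex tokenizer are illustrative choices, not part of NLTK. Both the phrase and the ingredient are stemmed, so singular and plural forms are compared on their common stem.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_text(text):
    """Lower-case, split into words, and stem every word."""
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(stemmer.stem(w) for w in words)

def find_ingredient(phrase, ingredients):
    """Return the first ingredient whose stemmed form occurs in the phrase."""
    stemmed_phrase = " " + stem_text(phrase) + " "
    for ingredient in ingredients:
        # Padding with spaces keeps the comparison on whole words.
        if " " + stem_text(ingredient) + " " in stemmed_phrase:
            return ingredient
    return None

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
phrases = [
    "1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf", "2 small strawberries",
]
results = [find_ingredient(p, ingredients) for p in phrases]
print(results)
```

With this approach only the singular ingredient names need to be listed; no plural lookup table is required.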

nltk is a great tool that does exactly what you are asking for!

Cheers