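This question is an extension of my previous question, Multiple Phrases Matching Python Pandas. Although I worked out a solution from one of the answers there, I then ran into the typical singular/plural problem.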
import numpy as np
import pandas as pd

ingredients = pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])
df = pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
I just need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:
If ingredients(singular or plural) found in phrase in the DataFrame, return the ingredient. Or otherwise, return false.
The answer was as follows:
^{pr2}$
I also applied the following to fill the empty cells with NaN, so that I can easily filter those rows out.
df.loc[df.existence == '', 'existence'] = np.nan
The result is as follows:
print(df)
   val                                                existence
0  1 teaspoons vanilla extract                        vanilla extract
1  2 eggs                                             egg
2  3 cups chopped walnuts                             walnut
3  4 cups rolled oats                                 oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...  NaN
5  6 ounces smoke-flavored almonds, finely chopped    almond
6  sdfgsfgsf                                          NaN
7  fsfgsgsfgfg                                        NaN
8  2 small strawberries                               NaN
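This has been correct so far, but only when the plural is formed by simply appending an s, as in almond => almonds or apple => apples. When a case like strawberry => strawberries comes up, the code reports it as NaN.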
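I want to improve my code so that it detects such cases. To do that, I would like to change my ingredients Series into a two-column table like the following: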
#ingredients
#inputwords      #outputword
vanilla extract  vanilla extract
walnut           walnut
walnuts          walnut
oat              oat
oats             oat
egg              egg
eggs             egg
almond           almond
almonds          almond
strawberry       strawberry
strawberries     strawberry
cherry           cherry
cherries         cherry
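So my logic is: whenever one of the words under #inputwords appears in a phrase, I want to return the word in the cell next to it (#outputword). In other words, when strawberry or strawberries appears in the phrase, the code should return the word beside it in the table, giving this result: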
   val                                                existence
0  1 teaspoons vanilla extract                        vanilla extract
1  2 eggs                                             egg
2  3 cups chopped walnuts                             walnut
3  4 cups rolled oats                                 oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...  NaN
5  6 ounces smoke-flavored almonds, finely chopped    almond
6  sdfgsfgsf                                          NaN
7  fsfgsgsfgfg                                        NaN
8  2 small strawberries                               strawberry
I could not find a way to add this behaviour to my existing code or to write new code that does it. Can anyone help?
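The lookup-table logic described above might be sketched as follows. This is only a minimal sketch under assumptions: the `lookup` dict, the `pattern` regex and the `find_ingredient` helper are hypothetical names introduced for illustration, and the column name `val` is taken from the printed output rather than from the original code.

```python
import re

import numpy as np
import pandas as pd

# Lookup table from the question: each input word maps to its output word.
lookup = {
    "vanilla extract": "vanilla extract",
    "walnut": "walnut", "walnuts": "walnut",
    "oat": "oat", "oats": "oat",
    "egg": "egg", "eggs": "egg",
    "almond": "almond", "almonds": "almond",
    "strawberry": "strawberry", "strawberries": "strawberry",
    "cherry": "cherry", "cherries": "cherry",
}

# Assumption: the phrases live in a column called "val", as in the printed output.
df = pd.DataFrame({"val": [
    "1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
    "4 cups rolled oats",
    "1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf", "fsfgsgsfgfg", "2 small strawberries",
]})

# One alternation over all input words; \b restricts matches to whole words,
# and longer forms are listed first so they are tried before shorter ones.
forms = sorted(lookup, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
    flags=re.IGNORECASE,
)

def find_ingredient(phrase):
    """Return the output word for the first input word found, else NaN."""
    match = pattern.search(phrase)
    return lookup[match.group(1).lower()] if match else np.nan

df["existence"] = df["val"].apply(find_ingredient)
```

With the sample data, row 8 (`2 small strawberries`) comes out as `strawberry`, and phrases containing none of the listed words stay NaN. The cost of this design is that every plural form has to be listed by hand.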
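Consider using a stemmer :) http://www.nltk.org/howto/stem.html
Taken straight from their page:
Before matching against the ingredient list, refactor your code so that every word in the sentence is stemmed.
nltk is a great tool for exactly what you are asking!
Cheers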
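The stemming idea might look roughly like this. It is a sketch, not the answerer's code: it assumes NLTK's `PorterStemmer`, a simple regex tokenizer instead of an NLTK tokenizer, and a column named `val` as in the printed output. `stem_phrase` and `find_ingredient` are hypothetical helper names.

```python
import re

import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(text):
    """Lowercase, keep alphabetic tokens only, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(token) for token in tokens)

ingredients = pd.Series(
    ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
)
# Assumption: the phrases live in a column called "val".
df = pd.DataFrame({"val": [
    "1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
    "4 cups rolled oats",
    "1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf", "fsfgsgsfgfg", "2 small strawberries",
]})

# Stem the ingredient names once, remembering the original spelling.
stemmed_ingredients = {stem_phrase(name): name for name in ingredients}

def find_ingredient(phrase):
    """Return the first ingredient whose stemmed form occurs in the phrase."""
    # Pad with spaces so only whole stemmed tokens (or token runs) match.
    padded = " " + stem_phrase(phrase) + " "
    for stem, name in stemmed_ingredients.items():
        if " " + stem + " " in padded:
            return name
    return np.nan

df["existence"] = df["val"].apply(find_ingredient)
```

Because the same stemmer is applied to both the ingredient names and the phrases, singular and plural forms that reduce to a common stem, such as strawberry and strawberries, should match without listing each plural explicitly.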