Converting plural forms to singular in a text file with Python

3 Answers

User

Reply 1 · edited 2024-04-27 04:46:15

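The NodeBox English Linguistics library contains scripts for converting plural forms to singular and vice versa. Check out the tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization

To convert a plural to its singular, just import the singular module and use the singular() function. It handles words with different endings, irregular forms, and so on.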

from en import singular
print(singular('analyses'))    # analysis
print(singular('planetoids'))  # planetoid
print(singular('children'))    # child

User

Reply 2 · edited 2024-04-27 04:46:15

If you have complex words to singularize, I'd advise against stemming and suggest a proper Python package instead, such as pattern:

from pattern.text.en import singularize

plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']

singles = [singularize(plural) for plural in plurals]
print(singles)

# ['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']

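It isn't perfect (note 'dies' becoming 'dy'), but it's the best I've found. It is 96% accurate according to the documentation: http://www.clips.ua.ac.be/pages/pattern-en#pluralization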

User

Reply 3 · edited 2024-04-27 04:46:15

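It looks like you're fairly familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question. When you read a multi-line file (the word, number csv in your case) with .read(), you read the entire body of the file into one big string.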

def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a

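This is fine, but when you pass the result to stem(), it is one big string rather than a list of words. That means that when you iterate over the input with for word in a, you iterate over each individual character of the input string and apply the stemmer to those individual characters.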

def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a
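
The effect can be seen without NLTK installed. The short sketch below uses a made-up sample string in the same format as the file in the question: iterating over a string yields single characters, whereas splitting it first yields whole lines.

```python
# Hypothetical sample mimicking what a.read() returns for the file in the question.
text = 'soc, 32\nsocs, 1\n'

# Iterating over a string yields one character at a time.
chars = [ch for ch in text]
print(chars[:8])   # ['s', 'o', 'c', ',', ' ', '3', '2', '\n']

# Splitting into lines first gives units a stemmer can actually work with.
lines = text.splitlines()
print(lines)       # ['soc, 32', 'socs, 1']
```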

This definitely doesn't suit your purposes, and there are a few different things we can do.

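1. We can change it so that the input file is read in as a list of lines.
2. We can take the big string and break it into a list ourselves.
3. We can go through the list of lines and stem each line one at a time.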
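
For comparison, the second and third options might look roughly like the sketch below. The helper names split_text and iter_lines are made up for illustration and do not come from the question, and io.StringIO stands in for a real file so the example runs on its own.

```python
import io

def split_text(text):
    """Option 2: break one big string into a list of lines, keeping newlines."""
    return text.splitlines(keepends=True)

def iter_lines(handle):
    """Option 3: yield lower-cased lines one at a time from an open file object."""
    for line in handle:
        yield line.lower()

# io.StringIO stands in for a real file opened with open(f, 'r').
sample = 'Soc, 32\nSocs, 1\n'
print(split_text(sample))                      # ['Soc, 32\n', 'Socs, 1\n']
print(list(iter_lines(io.StringIO(sample))))   # ['soc, 32\n', 'socs, 1\n']
```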

For the sake of expedience, let's go with the first option. This requires changing openfile(f) to the following:

def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will equal ['soc, 32\n', 'soc, 1\n', ...] in your example
        b = [x.lower() for x in a]
        return b

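This should give us b as a list of lines, i.e. ['soc, 32\n', 'soc, 1\n', ...]. So the next problem is what to do with that list of strings when we pass it to stem(). One way is: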

import nltk

def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',')  # break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1]  # put it back together
        b.append(new_line)  # add it to the new list of lines
    return b

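This is definitely a very rough solution, but it should adequately iterate through all of the lines in your input and depluralize them. It is rough because splitting strings and reassembling them isn't particularly fast once you scale up. If you're satisfied with that, though, all that's left is to iterate through the list of new lines and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.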

def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)


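f = input('Please enter a filename: ')  # prompt implied by the output shown below
print(openfile(f))
print(stem(openfile(f)))
print(returnfile(f, stem(openfile(f))))  # returnfile returns nothing, so this prints None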

When I have the following input.txt:

soc, 32
socs, 1
dogs, 8

I get the following stdout:

Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None

And input.txt now looks like this:

soc, 32
soc, 1
dog, 8
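
Because this run overwrites input.txt in place, a failure partway through could lose the original data. A minimal sketch of the safer variant, writing to a separate file, is below. The function name write_lines and the output filename are hypothetical, and the demo writes into a temporary directory so nothing real is touched.

```python
import os
import tempfile

def write_lines(path, lines):
    """Write the processed lines to `path`, leaving the input file untouched."""
    with open(path, 'w') as out:
        out.writelines(lines)

# Demo in a temporary directory so nothing real is overwritten.
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, 'output.txt')   # hypothetical output name
write_lines(out_path, ['soc, 32\n', 'soc, 1\n', 'dog, 8\n'])

with open(out_path) as check:
    written = check.read()
print(written)
```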

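The second question, about merging the numbers for identical words, changes our solution. As suggested in the comments, you should look at using dictionaries to solve this. Rather than doing all of this as one big list, the better (and probably more Pythonic) way is to iterate through each line of your input and stem the words as you process them. I'll write up code for this in a bit, if you're still working on figuring it out.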
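
As a rough illustration of the dictionary idea (a sketch, not the answer author's code): it assumes every line has the form word, number, and it takes the stemming function as a parameter so that nltk.PorterStemmer().stem or pattern's singularize could be passed in. The names merge_counts and strip_s are made up here, and strip_s is a deliberately naive stand-in so the example runs without either library.

```python
from collections import defaultdict

def strip_s(word):
    """Naive stand-in for a real stemmer: drops one trailing 's'."""
    return word[:-1] if word.endswith('s') and len(word) > 1 else word

def merge_counts(lines, stem=strip_s):
    """Sum the numbers of all lines whose words reduce to the same stem."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, number = line.split(',', 1)
        totals[stem(word.strip().lower())] += int(number)
    return dict(totals)

merged = merge_counts(['soc, 32\n', 'socs, 1\n', 'dogs, 8\n'])
print(merged)  # {'soc': 33, 'dog': 8}
```

Each key of the resulting dictionary can then be written back out as one word, number line.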
