Removing the relevant hyphens from text

1 vote

2 answers

534 views

数据工程师

Asked 2025-04-18 01:58

Suppose I have a piece of text like this:

a = "I am inclin- ed to ask simple questions"

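First I want to pick out the hyphenated words, which means checking whether the text contains a hyphen at all. That part is fairly easy: I can use re.search(r"\s*-\s*", a) to check whether the sentence contains a hyphen.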

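1) Next, I want to extract the word fragments before and after the hyphen (in this example, "inclin" and "ed").

2) Then I want to merge them into "inclined" and print all such words.

I am stuck on step 1. Please help.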

Tags: regex, text processing, string manipulation, natural language processing, text analysis, word extraction, word merging

2 Answers

Try this regular expression; it should help:

import re

a = "I am inclin- ed to ask simple questions"

# This gets the whole split word, i.e. "inclin- ed"
m = re.search(r'\S*-(.|\s)\S*', a)
if m:
    print(m.group())
else:
    # re.search returns None when nothing is found in a
    print("not found in a")

You can then strip the unwanted characters from the matched string and collect the pieces into a list.
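As a rough sketch of that last step: the matched text can be split at the hyphen, each piece stripped of whitespace, and the pieces joined. The helper name join_split_word is made up for illustration, and the pattern is a slightly simplified variant of the one above that assumes a single whitespace character follows the hyphen.

```python
import re

a = "I am inclin- ed to ask simple questions"


def join_split_word(fragment):
    # Hypothetical helper: split "inclin- ed" at the hyphen, strip the
    # whitespace around each piece, and return both the pieces and the
    # merged word.
    parts = [part.strip() for part in fragment.split('-')]
    return parts, ''.join(parts)


# Assumption: exactly one whitespace character follows the hyphen.
m = re.search(r'\S*-\s\S*', a)
if m:
    parts, word = join_split_word(m.group())
    print(parts)  # ['inclin', 'ed']
    print(word)   # inclined
```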

Answered 2025-04-18 by Python大师

>>> import re
>>> a = "I am inclin- ed to ask simple questions"
>>> result = re.findall(r'([a-zA-Z]+-)\s+(\w+)', a)
>>> result
[('inclin-', 'ed')]

>>> [first.rstrip('-') + second for first, second in result]
['inclined']

Alternatively, you can have the first group capture the word fragment without the trailing -:

>>> result = re.findall(r'([a-zA-Z]+)-\s+(\w+)', a)
>>> result
[('inclin', 'ed')]
>>> [''.join(item) for item in result]
['inclined']

This also handles multiple matches in the same string:

>>> a = "I am inclin- ed to ask simp- le quest- ions"
>>> result = re.findall(r'([a-zA-Z]+)-\s+(\w+)', a)
>>> [''.join(item) for item in result]
['inclined', 'simple', 'questions']
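If the goal is the repaired sentence rather than a list of the merged words, the same pattern can probably be reused with re.sub. This is only a sketch: the name dehyphenate is hypothetical, and it assumes the hyphen is always followed by whitespace, as in the example text.

```python
import re

# Same pattern as above: letters, a hyphen, whitespace, then the rest of the word.
SPLIT_WORD = re.compile(r'([a-zA-Z]+)-\s+(\w+)')


def dehyphenate(text):
    # Hypothetical helper: replace each split word such as "inclin- ed"
    # with its two captured halves joined together ("inclined").
    return SPLIT_WORD.sub(r'\1\2', text)


a = "I am inclin- ed to ask simp- le quest- ions"
print(dehyphenate(a))  # I am inclined to ask simple questions
```

Because the pattern requires whitespace after the hyphen, an ordinary compound such as "well-known" is left untouched.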

Answered 2025-04-18 by Python大师
