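<p>I'm trying to clean up text for use in a machine learning application. Basically, these are "semi-structured" specification documents, and I'm trying to remove the section numbers that interfere with NLTK's <code>sent_tokenize()</code> function.</p>
<p>Here is a sample of the text I'm working with:</p>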
<pre><code>and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
</code></pre>
<p>I'm trying to remove all of the section markers (e.g. <code>2.3.3</code>, <code>2.4</code>, <code>(b)</code>), but not the numbers in dates.</p>
<p>This is the regex I have so far: <code>[0-9]*\.[0-9]|[0-9]\.</code></p>
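<p>Unfortunately, it also matches part of the date in the last paragraph (<code>2019.</code> becomes <code>201</code>), and I don't really know how to fix this, since I'm not a regex expert.</p>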
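<p>To make the problem concrete, here is a minimal reproduction using a shortened copy of the sample above, followed by a sketch of one direction that might work. It assumes that section markers always sit alone on their own line, as they do in my sample; the names <code>current</code> and <code>marker_line</code> are only for illustration, and the second pattern is untested beyond this excerpt.</p>

```python
import re

# Shortened copy of the sample text from above.
sample = """aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
complete the said work on or before October 15, 2019.
"""

# The pattern I have so far. Its second alternative (one digit followed by a
# dot) also matches the "9." at the end of "2019.".
current = re.compile(r"[0-9]*\.[0-9]|[0-9]\.")
print(current.sub("", sample).splitlines()[-1])

# Hypothetical alternative: only match a marker that fills an entire line,
# either dotted digits such as 2.3.3 / 2.4 or a single letter in parentheses.
# Assumption: markers never share a line with ordinary prose.
marker_line = re.compile(
    r"^[ \t]*(?:\d+(?:\.\d+)*\.?|\([a-z]\))[ \t]*\n",
    re.MULTILINE,
)
cleaned = marker_line.sub("", sample)
print(cleaned)
```

<p>Anchoring on the whole line leaves <code>2019.</code> alone because that line starts with ordinary words, but it would also drop any line that consists of nothing but a number, so it may not fit every document.</p>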
<p>Thanks for your help!</p>