Splitting a long text sequence into paragraphs with Python

Posted 2024-04-23 21:52:10


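I am trying to split a long text sequence into its likely paragraphs. I found this SO question and considered using "nltk.tokenize.texttiling". However, when I ran the code below in a notebook, I got the following error.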

from nltk.tokenize.texttiling import TextTilingTokenizer
import nltk
nltk.download('stopwords')
tt = TextTilingTokenizer(demo_mode=False)
s, ss, d, b = tt.tokenize("Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. Will not accept claims of zero casualties Mihin Lanka good plan but no funds Defence policy mixed with foreign policy was hoping to work with UNP not signing MCC a mistake. Former Parliamentarian, former Chief Executive Officer of somewhere and a contestant at the August 5 Parliamentary election from the National Democratic Front, someones name, has been mired in controversy. He is being investigated for alleged money laundering committed when he was part of the some presidents Government. Someones name, who is today fighting against the some president's administration, spoke to the Newspaper company online on some of the allegations against him.Well, that depends on the perspective that you look at it. If you take Wikipedia, it is an interactive database. A lot of people can go and write anything they want. I have seen what you are referring to and neither have I gone to correct it because everyone has the right to their own view. I think controversy can be defined in many ways. And your interpretation of controversy may differ from mine. I think what happened is, when you look at the past, some of the work that I have done and some of the involvement in terms of governance, that part of governance that I was involved in, and perhaps the effectiveness and perhaps the success I would have had in those spheres obviously made people jealous. And in politics the game is all about who gets ahead of the other. Once again its perception. If I ask you to tell me one thing I have done using thuggery. 
I maybe a little arrogant. But that’s my personal nature. I am a little hot-headed. But I have never done any harm to anyone. Absolutely not. When you look at the history, up to 2015, it was alright. And once we lost power in 2015, I was denied my nomination to contest the General Election. I was then incarcerated for seven months and I found out that it was basically a plot from within. Certain members of the family, very close to the President, didn’t want me back. They didn’t want to give me nomination for reasons which are obvious to them and not to me. Also later on, I found that certain actions taken in terms of keeping me imprisoned, certain meddling that they did with certain aspects of the judiciary was with the involvement of the former President as well as a Minister who was then a very powerful figure. No, I must say I don’t think former President person name or former Prime Minister person name had any hand in the matter. Of course, as soon as we lost power (in 2015), everybody was remanded. I was remanded then for seven months for the purported misuse of a vehicle. Seven years have gone and no charge sheet as yet. Publicly I can’t say this because I will be sued and you will be sued, but there was a certain intervention that was done to keep me inside for a longer period of time. The purpose of why that was done was to deny my nomination, and it happened. ")

Error:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/texttiling.py in _create_token_table(self, token_sequences, par_breaks)
    236             try:
--> 237                 current_par_break = next(pb_iter) #skip break at 0
    238             except StopIteration:

StopIteration: 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
2 frames
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/texttiling.py in _create_token_table(self, token_sequences, par_breaks)
    238             except StopIteration:
    239                 raise ValueError(
--> 240                     "No paragraph breaks were found(text too short perhaps?)"
    241                     )
    242         for ts in token_sequences:

ValueError: No paragraph breaks were found(text too short perhaps?)

A solution to this, or a suggestion of another library that works, would be much appreciated.
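One possible direction, judging from the traceback: the error is raised while the tokenizer walks its list of paragraph breaks, and as far as I can tell from the NLTK source those breaks are detected as blank lines (two consecutive newlines) in the input. The text above is a single unbroken run, so no break is found. The sketch below is not part of NLTK; `add_paragraph_breaks` and `sentences_per_block` are hypothetical names, and the sentence splitting is a naive regex. It only shows how blank lines could be inserted every few sentences before the text is handed to the tokenizer.

```python
import re


def add_paragraph_breaks(text, sentences_per_block=4):
    """Insert a blank line after every `sentences_per_block` sentences.

    Hypothetical helper (not an NLTK function). Sentences are split
    naively at whitespace that follows '.', '!' or '?'.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    blocks = [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]
    # A blank line ("\n\n") between blocks acts as a pseudo paragraph break.
    return "\n\n".join(blocks)


sample = "One. Two. Three. Four. Five."
print(repr(add_paragraph_breaks(sample, sentences_per_block=2)))
# → 'One. Two.\n\nThree. Four.\n\nFive.'
```

The result could then be passed to tt.tokenize(...) in place of the raw string. Breaks that fall very close together appear to be ignored by the tokenizer, so a larger sentences_per_block may be needed for short sentences. Note also that with demo_mode=False, tokenize seems to return a single list of text segments rather than four values, so the result is probably best assigned to one variable instead of being unpacked into s, ss, d, b.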

