Fastest way to replace spaces with underscores for a list of words in text

Published 2024-06-13 19:03:28


Given 1,000,000,000 lines of about 20-50 words each, for example:

Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .
However , others argue that while anti-statism is central , it is inadequate to define anarchism .
Therefore , they argue instead that anarchism entails opposing authority or hierarchical organization in the conduct of human relations , including , but not limited to , the state system .
Proponents of anarchism , known as " anarchists " , advocate stateless societies based on non - hierarchical free association s. As a subtle and anti-dogmatic philosophy , anarchism draws on many currents of thought and strategy .
Anarchism does not offer a fixed body of doctrine from a single particular world view , instead fluxing and flowing as a philosophy .
There are many types and traditions of anarchism , not all of which are mutually exclusive .
Anarchist schools of thought can differ fundamentally , supporting anything from extreme individualism to complete collectivism .
Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications .
Anarchism is often considered a radical left-wing ideology , and much of anarchist economics and anarchist legal philosophy reflect anti-authoritarian interpretations of communism , collectivism , syndicalism , mutualism , or participatory economics .
Anarchism as a mass social movement has regularly endured fluctuations in popularity .
The central tendency of anarchism as a social movement has been represented by anarcho-communism and anarcho-syndicalism , with individualist anarchism being primarily a literary phenomenon which nevertheless did have an impact on the bigger currents and individualists have also participated in large anarchist organizations .
Many anarchists oppose all forms of aggression , supporting self-defense or non-violence ( anarcho-pacifism ) , while others have supported the use of some coercive measures , including violent revolution and propaganda of the deed , on the path to an anarchist society .
Etymology and terminology The term derives from the ancient Greek ἄναρχος , anarchos , meaning " without rulers " , from the prefix ἀν - ( an - , " without " ) + ἀρχός ( arkhos , " leader " , from ἀρχή arkhē , " authority , sovereignty , realm , magistracy " ) + - ισμός ( - ismos , from the suffix - ιζειν , - izein " - izing " ) . "
Anarchists " was the term adopted by Maximilien de Robespierre to attack those on the left whom he had used for his own ends during the French Revolution but was determined to get rid of , though among these " anarchists " there were few who exhibited the social revolt characteristics of later anarchists .
There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation , such as William Godwin and Wilhelm Weitling , but they did not use the word " anarchist " or " anarchism " in describing themselves or their beliefs .
Pierre-Joseph Proudhon was the first political philosopher to call himself an anarchist , making the formal birth of anarchism the mid-nineteenth century .
Since the 1890s from France , the term " libertarianism " has often been used as a synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United States ; its use as a synonym is still common outside the United States .
On the other hand , some use " libertarianism " to refer to individualistic free-market philosophy only , referring to free-market anarchism as " libertarian anarchism " .

And assume I have a list of dictionary terms, each made up of one or more words, for example:

^{pr2}$

I need to find all the sentences that contain these terms and then replace the spaces between the words of each term with underscores.

For example, the text contains this sentence:

Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .

The sentence contains the dictionary term political philosophy, so the output for this sentence needs to be:

Anarchism is often defined as a political_philosophy which holds the state to be undesirable , unnecessary , or harmful .

I could do it like this:

dictionary = sorted(dictionary, key=len, reverse=True)  # replace the longest terms first
for line in text:
    for term in dictionary:
        if term in line:
            line = line.replace(term, term.replace(' ', '_'))

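Assuming I have 10,000 dictionary terms (D) and 1,000,000,000 sentences (S), the complexity of this naive approach would be O(D*S), right? Is there a faster, less brute-force way to achieve the same result?

Is there a way to replace all the spaces inside the terms with underscores in a single pass over each line? That would avoid the inner loop.

What if I index the text first with something like whoosh, then query the index and replace the terms? I would still need something like O(1*S) to do the replacement, right?

The solution does not have to be in Python. Even a Unix command trick with grep/sed/awk is fine, as long as it is subprocess.Popen-able.

Please correct me if my complexity assumptions are wrong, and pardon my ignorance.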


Given a sentence:

This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social political philosophy with political philosophy under the branch of philosophy and some computational linguistics where the cognitive linguistics and psycho cognitive linguistics appears with linguistics

And assuming I have the dictionary:

cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy 

The output should look like this:

This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social_political_philosophy with political_philosophy under the branch of philosophy and some computational_linguistics where the cognitive_linguistics and psycho_cognitive_linguistics appears with linguistics

The goal is to do this for a text file of 10 billion lines with a dictionary of 10-100K phrases.


2 Answers

I would build a regex from your dictionary to match the data.
Then, on the replacement side, use a callback to replace the spaces with _.
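A minimal Python sketch of the regex-plus-callback idea, assuming the dictionary fits in memory. The names `build_pattern` and `underscore` are illustrative only, and the alternation here is a plain hand-built one (escaped with `re.escape`) rather than output from any generator tool. Terms are sorted longest first so that, at a given position, the longest phrase is tried before its shorter prefixes.

```python
import re

def build_pattern(terms):
    # Single-word terms contain no space, so only multi-word terms matter.
    multi = sorted((t for t in terms if " " in t), key=len, reverse=True)
    # Compile once and reuse for every line.
    return re.compile(r"\b(?:" + "|".join(map(re.escape, multi)) + r")\b")

def underscore(match):
    # Callback for re.sub: join the words of the matched term with "_".
    return match.group().replace(" ", "_")

terms = [
    "cognitive linguistics",
    "psycho cognitive linguistics",
    "socio political philosophy",
    "political philosophy",
    "computational linguistics",
    "linguistics",
    "philosophy",
    "social political philosophy",
]
pattern = build_pattern(terms)

sentence = "social political philosophy with political philosophy and linguistics"
result = pattern.sub(underscore, sentence)
print(result)
```

Each line is scanned once by the regex engine instead of once per dictionary term, which is what removes the inner loop of the naive approach.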

I estimate the whole job could be done in under 3 hours.

Fortunately, there is a Ternary Tool (Dictionary) regex generator.

To generate the regex and what is shown below, you will need the trial
version of {a1}.

Some links:
Screenshot of tool
TernaryTool(Dictionary) - Text version Dictionary samples
A 175,000 word Dictionary Regex

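Basically, you generate your own dictionary:
drop in the strings you want to find, then press the Generate button.

After that, all you have to do is read in 5 MB chunks, do a find/replace
with the regex, then append the result to a new file. Rinse and repeat.
Really simple.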
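The chunked loop could look roughly like the sketch below. It is only an illustration under a few assumptions: the file is UTF-8 text, phrases never span a line break, and the function name `process_file` and the tiny two-phrase pattern are made up for the demo. After each read, the chunk is extended to the end of the current line so that no phrase is cut in half at a chunk boundary.

```python
import os
import re
import tempfile

CHUNK_SIZE = 5 * 1024 * 1024  # about 5 MB of text per read

def process_file(src_path, dst_path, pattern, chunk_size=CHUNK_SIZE):
    def underscore(match):
        return match.group().replace(" ", "_")

    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Finish the line the chunk stopped in, so every chunk ends on
            # a line boundary and no phrase is split across two chunks.
            chunk += src.readline()
            dst.write(pattern.sub(underscore, chunk))

# Small demo with a deliberately tiny chunk size to exercise the boundary logic.
pattern = re.compile(r"\b(?:political philosophy|cognitive linguistics)\b")
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("a political philosophy\nsome cognitive linguistics here\n")
    process_file(src, dst, pattern, chunk_size=8)
    with open(dst, encoding="utf-8") as f:
        result = f.read()
print(result)
```

Running the regex over one large string per call, rather than once per line, is what reduces the per-call overhead of the regex engine.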

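Based on your sample (above), here is an estimate of the time needed
to get through 10 billion lines.

This analysis is based on a benchmark that ran the generated regex against the sample input.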

19 lines  (@ 3600 chars)

Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   5
Elapsed Time:    4.03 s,   4034.28 ms,   4034278 µs

////////////////////////////
3606 chars
x 50,000
      
180,300,000  (chars)

or 

20 lines
x 50,000
      
1,000,000  (lines)
=========================
10,000,000,000 lines
/
1,000,000  (lines) per 4 seconds
                    -
40,000 seconds
/
3600 secs per hour
            -
11 hours
////////////////////////////
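The arithmetic of this estimate can be reproduced in a few lines of Python. The figures are the ones from the benchmark listing, with the elapsed time rounded to 4 seconds as in the calculation; the variable names are just for illustration.

```python
# Figures taken from the benchmark listing; 4 s is the rounded elapsed time.
lines_per_run = 20 * 50_000      # 20 lines x 50,000 iterations
chars_per_run = 3606 * 50_000    # 3606 chars x 50,000 iterations
seconds_per_run = 4
total_lines = 10_000_000_000

runs_needed = total_lines / lines_per_run
total_seconds = runs_needed * seconds_per_run
total_hours = total_seconds / 3600

print(f"{lines_per_run:,} lines and {chars_per_run:,} chars per run")
print(f"{total_seconds:,.0f} seconds, about {total_hours:.1f} hours")
print(f"throughput: {chars_per_run / seconds_per_run:,.0f} chars per second")
```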

However, if you read in and process 5 MB chunks
(each as a single string), it will reduce the engine overhead
and bring the time down to 1-3 hours.

This is the regex generated for the sample dictionary (compressed):

^{pr2}$

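Note that the space separator is generated as [ ], one per space.
If you want to change it to a quantified class, just run a
find on (?:\[ \])+ and replace it with whatever you want,
for example \s+ or {}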
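As a rough illustration of that find/replace step in Python, the sketch below rewrites every run of `[ ]` in a generated pattern to `\s+`. The short `generated` pattern is a made-up stand-in for real generator output. Note one consequence: once the pattern accepts arbitrary whitespace, the replacement callback should collapse the whole whitespace run into a single underscore rather than replacing each space.

```python
import re

# Stand-in for a generated pattern that uses [ ] for every space.
generated = r"\b(?:cognitive[ ]linguistics|political[ ]philosophy)\b"

# Replace each run of "[ ]" with \s+; a callable avoids escape handling
# in the replacement string.
relaxed = re.sub(r"(?:\[ \])+", lambda m: r"\s+", generated)
print(relaxed)

def underscore(match):
    # Collapse any whitespace run inside the matched phrase into one "_".
    return re.sub(r"\s+", "_", match.group())

result = re.sub(relaxed, underscore, "a political   philosophy text")
print(result)
```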


Here it is formatted:

 \b 
 (?:
      c
      (?:
           linical [ ] 
           (?: anatomy | psychology )
        |  o
           (?:
                gnitive [ ] 
                (?: neuroscience | psychology | science )
             |  mp
                (?:
                     arative [ ] 
                     (?: anatomy | psychology )
                  |  ound [ ] morphology
                  |  utational [ ] linguistics
                )
             |  rrelation
             |  sm
                (?:
                     etic [ ] dentistry
                  |  o
                     (?: graphy | logy )
                )
           )
        |  r
           (?:
                anio
                (?: logy | metry )
             |  iminology
             |  y
                (?:
                     o
                     (?: biology | genics | nics )
                  |  ptanalysis
                  |  stallography
                )
           )
        |  urvilinear [ ] correlation
        |  y
           (?:
                bernetics
             |  to
                (?: genetics | logy )
           )
      )
   |  de
      (?:
           ixis
        |  mography
        |  nt
           (?:
                al [ ] 
                (?: anatomy | surgery )
             |  istry
           )
      )
   |  p
      (?: hilosophy | olitical [ ] philosophy )
 )
 \b 

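Adding 10,000 phrases is straightforward, and the regex is no larger than
the number of bytes in the phrases plus a little overhead for the
interleaved regex syntax.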
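For readers without the generator tool, a prefix-factored regex of the same general shape can be built from a character trie. The sketch below is not the tool's algorithm, only a small hand-rolled approximation; `build_trie` and `trie_to_regex` are hypothetical helper names. Shared prefixes are emitted once, which is why the pattern stays close to the size of the phrase list.

```python
import re

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks the end of a phrase
    return root

def trie_to_regex(node):
    # One alternative per outgoing character edge.
    branches = []
    for ch in sorted(k for k in node if k):
        piece = "[ ]" if ch == " " else re.escape(ch)
        branches.append(piece + trie_to_regex(node[ch]))
    if not branches:
        return ""
    body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
    if "" in node:
        # A shorter phrase ends here: the continuation is optional and
        # greedy, so the longest phrase is preferred.
        return "(?:" + body + ")?"
    return body

phrases = [
    "cognitive linguistics", "psycho cognitive linguistics",
    "socio political philosophy", "political philosophy",
    "computational linguistics", "linguistics", "philosophy",
    "social political philosophy",
]
pattern = re.compile(r"\b" + trie_to_regex(build_trie(phrases)) + r"\b")

sentence = "social political philosophy and psycho cognitive linguistics with linguistics"
result = pattern.sub(lambda m: m.group().replace(" ", "_"), sentence)
print(pattern.pattern)
print(result)
```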

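One last note: generate the regex from the phrases only,
that is, only from entries made of words separated by horizontal whitespace.

Also, be sure to precompile the regex. That only needs to be done once.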

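If you want the longest match, it is better to map the first word of each phrase to the full phrases that start with it. Then, instead of checking every item in the dictionary, you only sort by length the phrases that can actually occur in the line: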

from collections import defaultdict
from itertools import chain


def get_phrases(fle):
    # Map the first word of each phrase to every phrase that starts with it.
    phrase_dict = defaultdict(list)
    with open(fle) as ph:
        for line in map(str.rstrip, ph):
            first_word = line.partition(" ")[0]
            phrase_dict[first_word].append(line)
    return phrase_dict


def replace(fle, dct):
    with open(fle) as f:
        for line in f:
            # Keep only phrases whose first word occurs in this line, longest first.
            phrases = sorted(chain.from_iterable(dct[word] for word in line.split()
                             if word in dct), reverse=True, key=len)
            for phr in phrases:
                line = line.replace(phr, phr.replace(" ", "_"))
            yield line

Output:

^{pr2}$

A couple of other versions:

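import re
from itertools import chain


def repl(match):
    # Callback for re.sub: join the words of the matched phrase with underscores.
    return match.group().replace(" ", "_")


def by_length(phrase):
    # Longest phrases first; ties broken alphabetically for a stable order.
    return (-len(phrase), phrase)


def replace_re(fle, dct):
    with open(fle) as f:
        for line in f:
            spl = set(line.split())
            phrases = sorted(chain.from_iterable(dct[word] for word in spl if word in dct),
                             key=by_length)
            if phrases:  # an empty alternation would match everywhere
                line = re.sub("|".join(map(re.escape, phrases)), repl, line)
            yield line


def replace_re2(fle, dct):
    cached = {}  # compiled pattern for each distinct tuple of candidate phrases
    with open(fle) as f:
        for line in f:
            phrases = tuple(sorted(chain.from_iterable(dct[word] for word in set(line.split()) if word in dct),
                                   key=by_length))
            if phrases:
                if phrases not in cached:
                    cached[phrases] = re.compile("|".join(map(re.escape, phrases)))
                line = cached[phrases].sub(repl, line)
            yield line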
