Python regular expression fails to extract values - Python 3.x

Published 2024-06-07 07:15:47


I am iterating over a very large (~5 GB) text document that looks like this:

<P ID=912>
bird
dog
dog
dog
</P>

<P ID=5>
aardvark
bird
bird
cat
egret
</P>

<P ID=291>
aardvark
aardvark
aardvark
aardvark
aardvark
bird
dog
fish
fish
fish
</P>

<P ID=621>
aardvark
aardvark
bird
dog
fish
fish
fish
</P>

<P ID=5>
bird
egret
egret
</P>

<P ID=1>
bird
</P>

The document is quite "unordered" in the sense that the IDs are not organized. I need a solution that iterates over each paragraph (delimited by <P ID=x></P> tags, which are always present) and extracts the ID number.

I am using NLTK to tokenize the paragraphs, which works fine; my problem is that I cannot extract the ID from the tag.

from nltk.tokenize import RegexpTokenizer
import re

def get_input(filepath):
    f = open(filepath, 'r')
    content = f.read()
    f.close()
    return content

def main():
    myfile = get_input("data.txt")
    p = r'<P ID=\d+>(.*?)</P>'
    paras = RegexpTokenizer(p)
    for para in paras.tokenize(myfile):
        para_id = re.match(r"<P ID=\d+>", para)
        print("Current paragraph Number: {}".format(para_id))

main()

This results in:

Current paragraph Number: None
Current paragraph Number: None
Current paragraph Number: None
Current paragraph Number: None
Current paragraph Number: None
Current paragraph Number: None

But I want it to look like:

Current paragraph Number: 912
Current paragraph Number: 5
Current paragraph Number: 291
Current paragraph Number: 621
Current paragraph Number: 5
Current paragraph Number: 1

How do I need to change: para_id = re.match("<P ID=\d+>", para)?

Edit: I have also tried: para_id = [i['id'] for i in soup(para, 'html.parser').find_all('p')] but this produced an empty [], and I don't understand why I can't create a soup from a single paragraph.

Note: I should mention this is a minimal example of the code. The real program is much larger and needs NLTK for parsing, since I make heavy use of stop words and text tokenization.


3 Answers

You are capturing only the paragraph text, but you should capture the whole paragraph including the P tags, and only then extract the paragraph's ID. I used your sample in data.txt:

from nltk.tokenize import RegexpTokenizer
import re

def get_input(filepath):
    f = open(filepath, 'r')
    content = f.read()
    f.close()  # don't forget to close the file
    return content

def main():
    myfile = get_input("data.txt")
    # here capture the full paragraph, tags included
    p = r'<P ID=\d+>.*?</P>'
    paras = RegexpTokenizer(p)
    for para in paras.tokenize(myfile):
        # and here just extract the ID
        para_id = re.match(r"<P ID=(\d+)>", para)
        print("Current paragraph Number: {}".format(para_id.group(1)))

main()

Output:

Current paragraph Number: 912
Current paragraph Number: 5
Current paragraph Number: 291
Current paragraph Number: 621
Current paragraph Number: 5
Current paragraph Number: 1
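The same tokenization also works with the standard-library re module alone. A minimal sketch (the inline sample string is an assumption standing in for data.txt; note the re.DOTALL flag, which RegexpTokenizer enables by default, so that .*? can span newlines):

```python
import re

# small stand-in for data.txt (assumption, not the real file)
data = "<P ID=912>\nbird\ndog\n</P>\n\n<P ID=5>\nbird\n</P>\n"

ids = []
# re.DOTALL lets .*? match across newlines
for para in re.findall(r"<P ID=\d+>.*?</P>", data, flags=re.DOTALL):
    para_id = re.match(r"<P ID=(\d+)>", para).group(1)
    ids.append(para_id)
    print("Current paragraph Number: {}".format(para_id))
```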

You are reading the whole 5 GB file into memory; I think you should process it lazily instead. If you only need to print the paragraph IDs, you can scan line by line:

import re


def main():
    with open("data.txt") as f:  # Using context manager to close resource
        for line in f:
            # and here just catch the ID
            match = re.match(r"<P ID=(\d+)>", line)
            if match:
                print("Current paragraph Number: {}".format(match.group(1)))

main()

This produces the same result without loading the whole 5 GB into memory.
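Building on the line-by-line idea: if the NLTK step also needs the paragraph text, a small generator can yield (ID, text) pairs while still reading one line at a time. A sketch, where iter_paragraphs is a hypothetical helper (not part of the answer above) and the tags are assumed to sit on their own lines, as in the sample data:

```python
import re

def iter_paragraphs(path):
    """Yield (paragraph_id, text) pairs without loading the whole file.

    Hypothetical helper; assumes <P ID=n> and </P> each occupy
    their own line, as in the sample data.
    """
    para_id, lines = None, []
    with open(path) as f:
        for line in f:
            m = re.match(r"<P ID=(\d+)>", line)
            if m:
                para_id, lines = m.group(1), []
            elif line.startswith("</P>"):
                yield para_id, "".join(lines)
                para_id = None
            elif para_id is not None:
                lines.append(line)
```

Each yielded text can then be handed to the NLTK pipeline while only one paragraph is held in memory at a time.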

One possible solution is to pass the input to BeautifulSoup after processing it with NLTK:

from bs4 import BeautifulSoup as soup
results = [i['id'] for i in soup(content, 'html.parser').find_all('p')]

Output:

['912', '5', '291', '621', '5', '1']

BeautifulSoup also gives you access to each paragraph's content via .contents:

for i in soup(content, 'html.parser').find_all('p'):
   print(i.contents)

Output:

['\nbird\ndog\ndog\ndog\n']
['\naardvark\nbird\nbird\ncat\negret\n']
['\naardvark\naardvark\naardvark\naardvark\naardvark\nbird\ndog\nfish\nfish\nfish\n']
['\naardvark\naardvark\nbird\ndog\nfish\nfish\nfish\n']
['\nbird\negret\negret\n']
['\nbird\n']
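If you need the animals as token lists rather than raw strings, .get_text() combined with split() gives them directly. A sketch (the inline content string is an assumption standing in for the parsed file):

```python
from bs4 import BeautifulSoup

# stand-in for the file contents (assumption)
content = "<P ID=912>\nbird\ndog\ndog\n</P>\n<P ID=1>\nbird\n</P>\n"

# pair each paragraph's id with its whitespace-split tokens
tokens = [(p['id'], p.get_text().split())
          for p in BeautifulSoup(content, 'html.parser').find_all('p')]
print(tokens)  # [('912', ['bird', 'dog', 'dog']), ('1', ['bird'])]
```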

Use r'(?s)<P\s*ID\s*=\s*(\d+)\s*>(.*?)</P\s*>' with a findall() search.
The ID is in capture group 1 and the content in capture group 2.

Example

>>> input = """
... <P ID=912>
... bird
... dog
... dog
... dog
... </P>
...
... <P ID=5>
... aardvark
... bird
... bird
... cat
... egret
... </P>
...
... <P ID=291>
... aardvark
... aardvark
... aardvark
... aardvark
... aardvark
... bird
... dog
... fish
... fish
... fish
... </P>
...
... <P ID=621>
... aardvark
... aardvark
... bird
... dog
... fish
... fish
... fish
... </P>
...
... <P ID=5>
... bird
... egret
... egret
... </P>
...
... <P ID=1>
... bird
... </P>
... """
>>>
>>> import re
>>> p = re.compile(r'(?s)<P\s*ID\s*=\s*(\d+)\s*>(.*?)</P\s*>')
>>>
>>> for result in p.findall(input):
...     print(result)   # each result is an (ID, content) tuple
...
('912', '\nbird\ndog\ndog\ndog\n')
('5', '\naardvark\nbird\nbird\ncat\negret\n')
('291', '\naardvark\naardvark\naardvark\naardvark\naardvark\nbird\ndog\nfish\nfish\nfish\n')
('621', '\naardvark\naardvark\nbird\ndog\nfish\nfish\nfish\n')
('5', '\nbird\negret\negret\n')
('1', '\nbird\n')
>>>
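Since findall() returns two-tuples here, they can also be unpacked directly in the loop. A short sketch with a trimmed-down sample string (an assumption, not the full input above):

```python
import re

# trimmed-down sample input (assumption)
text = "<P ID=912>\nbird\ndog\n</P>\n\n<P ID=5>\nbird\n</P>\n"

pattern = re.compile(r'(?s)<P\s*ID\s*=\s*(\d+)\s*>(.*?)</P\s*>')
# unpack (ID, content) pairs and split the content into tokens
pairs = [(pid, content.split()) for pid, content in pattern.findall(text)]
print(pairs)  # [('912', ['bird', 'dog']), ('5', ['bird'])]
```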
