How can I iterate over XPath subsets without a very slow for loop?

<html > <head>Title</head> <body> § 720 ILCS 5/10-8.1. (a) text (b) text (1) text (2) text (Source) § 720 ILCS 5/10-9 (a) something (Source) § 720 ILCS 5/10-10. (a) more text (Source) </body> </html>

import lxml.html
import cssselect
import pandas as pd
…
tree = lxml.html.fromstring(raw)
laws = tree.cssselect('p.SECMAIN span.ePub-B')

xpath_str = '''
    //p[@class="SECMAIN"][{i}]/
    following-sibling::p[contains(@class, "INDENT")]
    [count(.|//p[@class="SOURCE"][{i}]/
        preceding-sibling::p[contains(@class, "INDENT")])
    =
    count(//p[@class="SOURCE"][{i}]/
        preceding-sibling::p[contains(@class, "INDENT")])
    ]
'''

paragraphs_dict = {}
paragraphs_dict['text'] = []
paragraphs_dict['n'] = []

# nested for loop:
for n in range(1, len(laws)+1):
    law_paragraphs = tree.xpath(xpath_str.format(i = n)) # call xpath string
    for p in law_paragraphs:
        paragraphs_dict['text'].append(p.text_content()) # store paragraph
        paragraphs_dict['n'].append(n)

1 answer

Answer 1 · posted 2024-05-15 16:30:32

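Consider your XPath expression: for every `n`, you iterate over the `SECMAIN`s up to that number, then iterate twice over the `SOURCE`s to find the matching one, then check all the preceding `INDENT`s and take the intersection of the two node sets. Even with some optimization, the finite state automaton has a lot of work to do! It is probably worse than quadratic (see the comments).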
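To make the cost argument concrete, here is a rough pure-Python model of the idiom `$a[count(. | $b) = count($b)]` that the question's expression relies on. It is only a sketch: the flat list of paragraph classes, the function names and the visit counter are all hypothetical, and a real XPath engine may optimise the evaluation, so the numbers illustrate the shape of the work rather than measure it.

```python
def intersect_like_xpath(a, b):
    """Model of the XPath 1.0 idiom  $a[count(. | $b) = count($b)].

    Returns the nodes of `a` that also occur in `b`, plus a rough count of
    node visits for an evaluator that rebuilds the union for every candidate.
    """
    visits = 0
    kept = []
    for node in a:
        union = set(b) | {node}          # count(. | $b)
        visits += len(b) + 1
        if len(union) == len(set(b)):    # equal only when node is already in b
            kept.append(node)
    return kept, visits


def per_law_queries(classes):
    """Emulate one query per law over a flat list of paragraph classes."""
    secmain = [i for i, c in enumerate(classes) if c == "SECMAIN"]
    source = [i for i, c in enumerate(classes) if c == "SOURCE"]
    indent = [i for i, c in enumerate(classes) if c == "INDENT"]
    rows, total_visits = [], 0
    for n, (start, end) in enumerate(zip(secmain, source), start=1):
        following = [i for i in indent if i > start]   # following-sibling::p
        preceding = [i for i in indent if i < end]     # preceding-sibling::p
        kept, visits = intersect_like_xpath(following, preceding)
        total_visits += visits
        rows.extend((n, i) for i in kept)
    return rows, total_visits


# Hypothetical document: 3 laws with 4, 1 and 1 indented paragraphs.
classes = (["SECMAIN"] + ["INDENT"] * 4 + ["SOURCE"]
           + ["SECMAIN", "INDENT", "SOURCE"] * 2)
rows, total_visits = per_law_queries(classes)
print(rows)  # pairs of (law number, paragraph position)
print(total_visits, "modelled visits for", len(classes), "paragraphs")
```

In this model every law rescans the indented paragraphs of the whole document, so the modelled work grows much faster than the document itself, whereas a single pass touches each paragraph once.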

I would use a more direct approach with a SAX parser:

import xml.sax
import io

class MyContentHandler(xml.sax.ContentHandler):
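    def __init__(self):
        super().__init__()
        self.n = 0 # index of the current SECMAIN
        self.d = {'text': [], 'n': []}
        self.in_indent = False
        self.cur = [] # chunks of text of the current INDENT par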

    def startElement(self, name, attributes):
        cls = attributes.get("class", "") # a <p> may have no class attribute
        if name == "p" and cls == "SECMAIN":
            self.n += 1 # next SECMAIN
        if name == "p" and cls.startswith("INDENT"):
            self.in_indent = True # mark that we are in an INDENT par
            self.cur = [] # to store chunks of text

    def endElement(self, name):
        if name == "p" and self.in_indent:
            self.in_indent = False # mark that we leave an INDENT par
            self.d['text'].append("".join(self.cur)) # append the INDENT text
            self.d['n'].append(self.n) # and the number

    def characters(self, data):
        # https://docs.python.org/3/library/xml.sax.handler.html#xml.sax.handler.ContentHandler.characters
        # "SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks"
        if self.in_indent: # only if an INDENT par:
            self.cur.append(data) # store the chunks

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
handler = MyContentHandler()
parser.setContentHandler(handler)
parser.parse(io.StringIO(raw))

print(handler.d)
# {'text': ['(a) text', '(b) text', '(1) text', '(2) text', '(a) something', '(a) more text'], 'n': [1, 1, 1, 1, 2, 3]}

This should be much faster than the XPath version.
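The same single-pass idea can also be applied to a tree that is already parsed. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml so that it is self-contained. The markup is an assumption: the class names follow the selectors in the question, but the exact layout (sibling `<p>` elements, classes such as `INDENT-1`) and the helper name `collect_indent_paragraphs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: class names follow the question's selectors,
# the exact element layout is assumed.
raw = """<html><head><title>Title</title></head><body>
<p class="SECMAIN"><span class="ePub-B">§ 720 ILCS 5/10-8.1.</span></p>
<p class="INDENT-1">(a) text</p>
<p class="INDENT-1">(b) text</p>
<p class="INDENT-2">(1) text</p>
<p class="INDENT-2">(2) text</p>
<p class="SOURCE">(Source)</p>
<p class="SECMAIN"><span class="ePub-B">§ 720 ILCS 5/10-9</span></p>
<p class="INDENT-1">(a) something</p>
<p class="SOURCE">(Source)</p>
<p class="SECMAIN"><span class="ePub-B">§ 720 ILCS 5/10-10.</span></p>
<p class="INDENT-1">(a) more text</p>
<p class="SOURCE">(Source)</p>
</body></html>"""


def collect_indent_paragraphs(root):
    """Visit every <p> once, in document order, numbering laws as we go."""
    d = {"text": [], "n": []}
    n = 0
    for p in root.iter("p"):
        cls = p.get("class", "")
        if cls == "SECMAIN":
            n += 1                                    # a new law starts here
        elif "INDENT" in cls:
            d["text"].append("".join(p.itertext()))   # full text of the paragraph
            d["n"].append(n)
    return d


root = ET.fromstring(raw)
result = collect_indent_paragraphs(root)
print(result)
```

lxml elements also provide `iter()` and `itertext()`, so a similar loop should work on the tree from the question, though that has not been checked against the real page.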
