Searching for a keyword in <p> text that is split up by <ref> tags

Posted 2024-04-29 03:42:01


I want to search for a keyword in the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.tei-c.org/ns/1.0 /home/pisenberg/grobid/grobid-0.6.1/grobid-home/schemas/xsd/Grobid.xsd"
     xmlns:xlink="http://www.w3.org/1999/xlink">
        <text xml:lang="en">
            <body>
    <div xmlns="http://www.tei-c.org/ns/1.0"><p>text before ref<ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b46">47,</ref><ref type="bibr" target="#b66">67]</ref>text after ref</p></div>
            </body>
        </text>
</TEI>

My code:

from lxml import etree
import os
import csv
from shutil import copyfile
import pandas as pd

teins = {'tei':'http://www.tei-c.org/ns/1.0'} #info on the xml structure

searchterm = "before" #put your search term in lowercase

filepath = "./test.xml"
        
            
with open(filepath,'r', encoding='utf8') as file:  
    try:
        tree = etree.parse(file)
        root = etree.XML(etree.tostring(tree))
        textNode = root.find(".//tei:text",teins)
        for elem in textNode.iter():
            if elem.text:
                if searchterm.lower() in elem.text.lower():
                    print(elem.text)
                
    except Exception as e:  # works on Python 3.x
        print(str(e))

If I search for "before", I get a result: it prints the matching text, "text before ref". If I search for "after", however, nothing is printed.

I suspect that textNode.iter() cannot reach the text inside the <p> tag that comes after the <ref> tags. Does anyone know how to solve this?

Any help would be greatly appreciated.
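
A note on what is probably happening: in the ElementTree data model that lxml follows, an element's .text holds only the text before its first child. Text that comes after a child's closing tag is stored on that child as .tail, so "text after ref" ends up in the .tail of the last <ref>, not in the .text of <p>. The sketch below checks both attributes. The names find_matches, TEI_NS and SAMPLE are made up for illustration, the sample is a trimmed copy of the file above, and the standard-library fallback is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which uses the
    # same .text/.tail model, so the snippet runs without lxml.
    import xml.etree.ElementTree as etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed copy of the XML from the question (hypothetical test fixture).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text xml:lang="en"><body>'
    '<div><p>text before ref<ref type="bibr" target="#b18">[19,</ref>'
    '<ref type="bibr" target="#b46">47,</ref>'
    '<ref type="bibr" target="#b66">67]</ref>text after ref</p></div>'
    '</body></text></TEI>'
)


def find_matches(root, searchterm):
    """Return every .text or .tail string under <text> containing searchterm."""
    text_node = root.find(".//tei:text", TEI_NS)
    needle = searchterm.lower()
    matches = []
    for elem in text_node.iter():
        # elem.text: text before elem's first child.
        # elem.tail: text right after elem's closing tag.
        for chunk in (elem.text, elem.tail):
            if chunk and needle in chunk.lower():
                matches.append(chunk)
    return matches


root = etree.fromstring(SAMPLE)
print(find_matches(root, "before"))
print(find_matches(root, "after"))
```

If the goal is to match against a whole paragraph regardless of where the <ref> tags fall, joining the pieces with "".join(p.itertext()) for each <p> may be a simpler option, since itertext() yields both text and tail strings in document order.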

