How to ignore unwanted patterns in a regular expression

Posted 2024-05-16 14:07:40


I have the following Python code:

import re
from io import BytesIO

import pdfplumber
import requests

# Each URL maps to the zero-based index of the page to extract.
test_case = {
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0514/2020051400555.pdf': 59,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0529/2020052902118.pdf': 55,
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0618/2020061800366.pdf': 47,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0630/2020063002674.pdf': 30,
}

for url, page in test_case.items():
    rq = requests.get(url)
    # The original post called pdfplumber.load, which older releases provided;
    # pdfplumber.open also accepts a file-like object.
    pdf = pdfplumber.open(BytesIO(rq.content))
    txt = pdf.pages[page].extract_text()
    txt = re.sub("([^\x00-\x7F])+", "", txt)  # strip non-ASCII (Chinese) characters
    pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
    try:
        auditor = re.search(pattern, txt, flags=re.MULTILINE).group('auditor').strip()
        print(repr(auditor))
    except AttributeError:
        # re.search returned None: dump the page text and the URL for inspection
        print(txt)
        print('============')
        print(url)

It produces the following output:

'ShineWing'
'ShineWing'
'Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of'
'Hong Kong Financial Reporting Standards issued by the Hong Kong Institute of'

The expected output is:

'ShineWing'
'ShineWing'
'Ernst & Young'
'Elite Partners CPA Limited'

I tried:

pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$(?!Institute)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
This pattern captures the last two cases but not the first two.

pattern = r'.*\n.*?(?P<auditor>^(?!Hong|Kong)[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
This produces the desired results, but ^(?!Hong|Kong) is risky: it could exclude other legitimate results in the future, so it is not a good candidate.

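By contrast, $(?!Institute) is more general and a better fit, but I don't understand why it fails to match in the first two cases. It would be great if there were a way to ignore matches that contain "issued by the Hong Kong Institute of".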

Any suggestions would be greatly appreciated. Thanks, everyone.
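As to why $(?!Institute) misses the first two cases, one plausible reading (an assumption, since the extracted page text is not shown in the post) is that on those pages the auditor name and "Certified Public Accountants" sit on the same line. With re.MULTILINE, $ only matches at a line end, so it forces a line break between the name and the suffix. And because the character right after $ is always a newline or the end of the text, a (?!Institute) placed there can never fail. The sketch below uses made-up sample strings and a simplified pattern, not the real PDF text.

```python
import re

# Simplified suffix; the post's full pattern also handles LLP, PRC, Chartered, etc.
SUFFIX = r'\s*Certified Public Accountants'

# Hypothetical sample texts (not taken from the real PDFs).
same_line = "Audit report\nShineWing Certified Public Accountants\n"
split_line = "Audit report\nErnst & Young\nCertified Public Accountants\n"
wrapped = "issued by the Hong Kong Institute of\nCertified Public Accountants\n"

plain = re.compile(r'(?P<auditor>[A-Z].+?)' + SUFFIX, re.MULTILINE)
anchored = re.compile(r'(?P<auditor>[A-Z].+?)$(?!Institute)' + SUFFIX, re.MULTILINE)

# Name and suffix on one line: the unanchored pattern finds the name ...
print(plain.search(same_line).group('auditor'))      # ShineWing
# ... but `$` demands a line end right after the name, so nothing matches.
print(anchored.search(same_line))                    # None

# Name on its own line: `$` is satisfied and the match goes through.
print(anchored.search(split_line).group('auditor'))  # Ernst & Young

# The lookahead after `$` only ever sees a newline, so it does not stop
# a line that itself ends in "Institute of" from being captured.
print(anchored.search(wrapped).group('auditor'))     # Hong Kong Institute of
```

In other words, $(?!Institute) behaves like a bare $ here; it seems to help in the last two cases only because it rules out same-line matches, not because it detects "Institute".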


1 Answer

User

#1 · Posted 2024-05-16 14:07:40

pattern = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

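This works. The negative lookahead (?!.*Institute) at the start of the auditor group rejects any candidate that has "Institute" later on the same line.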
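To see the difference between the question's pattern and the one in this answer, here is a small sketch on a made-up excerpt. The sample text and the find_auditor helper are assumptions for illustration only; the real PDF text may be laid out differently.

```python
import re

# Pattern from the question (captures the "Institute of" line).
question_pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
# Pattern from the answer (skips candidates followed by "Institute" on the same line).
answer_pattern = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

# Hypothetical excerpt resembling an auditor's report page.
sample = (
    "INDEPENDENT AUDITOR'S REPORT\n"
    "in accordance with Hong Kong Standards on Auditing issued by "
    "the Hong Kong Institute of Certified Public Accountants.\n"
    "Ernst & Young\n"
    "Certified Public Accountants\n"
    "Hong Kong\n"
)


def find_auditor(pattern, text):
    """Return the stripped 'auditor' group, or None if nothing matches."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group('auditor').strip() if match else None


question_result = find_auditor(question_pattern, sample)
answer_result = find_auditor(answer_pattern, sample)

print(repr(question_result))
# 'Hong Kong Standards on Auditing issued by the Hong Kong Institute of'
print(repr(answer_result))
# 'Ernst & Young'
```

One trade-off to keep in mind: the lookahead rejects every candidate start that has "Institute" anywhere later on its line, so an auditor whose own name line contains that word would be skipped as well.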
