Python regex: combining the text outside tags with the text between tags

Posted 2024-04-19 03:17:12

I have the following string (stage 1):

(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)

From this I get to (stage 2):

(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)

What I ultimately want is (stage 3):

(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)

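The way I'm doing this right now is awful.

I take the stage 1 string, strip out all the A tags entirely, and join the remaining text.

Then I take the course numbers from the A tags and put them back into the string from the previous step to get stage 2.

Then, in stage 2, I locate each course and remove everything to its left and right until I hit a (, a ), an "or", or an "and".

Is there a clean way to do this with a regular expression or something else? Thank you.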

1 answer
User
#1 · Posted 2024-04-19 03:17:12
import re

x = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Delete every tag, the "Undergraduate level" prefix, and the "Minimum Grade of <letter>" suffix
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+", "", x))

If the format is always fixed and doesn't vary much, you can do this with re.sub, as above.

See the demo:

https://regex101.com/r/hF7zZ1/2
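
The question also mentions an intermediate stage 2 string. As a rough sketch only, the single re.sub pattern above can be split into two passes so that stage 2 falls out along the way. The names STAGE1, to_stage2 and to_stage3 are made up for this illustration, and the whitespace-collapsing step is an addition that is not part of the pattern above. It still assumes the "Undergraduate level" and "Minimum Grade of X" wording never changes.

```python
import re

# Stage 1 input, copied from the question.
STAGE1 = (
    '(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)'
)


def to_stage2(text):
    """Drop the tags but keep their inner text, then collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]*>", "", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def to_stage3(stage2):
    """Drop the fixed wording before and after each course name."""
    return re.sub(r"Undergraduate level\s*|\s*Minimum Grade of [A-Z]+", "", stage2)


stage2 = to_stage2(STAGE1)
print(stage2)
print(to_stage3(stage2))
```

The second print should give the stage 3 string from the question. If the stage 2 string is never needed, the one-pass version above is shorter.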

Edit:

If the surrounding text varies, try the following instead:

import re

x = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Keep only parentheses, "or"/"and" (with their surrounding spaces), and the text inside each <A>...</A>
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=</A>))", x)))
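
One caveat with the findall pattern above: `\s*or\s*` and `\s*and\s*` also match inside longer words such as "Major" or "Standing" if the surrounding wording ever contains them. A possible variation, offered as a sketch rather than a drop-in replacement, uses `\b` word boundaries and rebuilds the spacing afterwards. TOKEN and extract_courses are hypothetical names, and matching the closing tag case-insensitively is an assumption about the input.

```python
import re

# One token per match: a parenthesis, a whole-word connective, or anchor text.
TOKEN = re.compile(
    r"""
    [()]                    # grouping parentheses
    | \b(?:and|or)\b        # "and" / "or" only as whole words
    | (?<=>)[^<]+(?=</a>)   # text between a tag's ">" and the closing </a>
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_courses(text):
    """Keep parentheses, connectives and anchor text; discard everything else."""
    tokens = [t.strip() for t in TOKEN.findall(text)]
    joined = " ".join(tokens)
    # Remove the spaces that the join put just inside each parenthesis.
    return joined.replace("( ", "(").replace(" )", ")")


sample = (
    '(Major Standing  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="blah">MATH 172</A> Minimum Grade of D)'
)
print(extract_courses(sample))
```

A standalone "and" or "or" in the surrounding wording (for example "Grade of D or better") would still be kept, so this only narrows the problem. It also keeps any parenthesis that appears outside the course groups.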