Stripping WordPress caption shortcodes with Python

Asked 2025-04-18 05:37

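Hello, I'm migrating a WordPress blog to another content management system (CMS). Before uploading to the new platform, I need to strip the opening and closing [caption] tags from the HTML, including the attributes inside the brackets, while keeping the HTML tags nested between them. The rest of the code is here for reference: https://github.com/thmcmahon/wp2nb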

Ideally, I'd like to wrap this in a function, something like:

def strip_caption_tags(content):
    no_captions = do_some_stuff_presumably_regex(content)
    return no_captions

Here is a sample of the data:

<![CDATA[[caption id="attachment_5582" align="alignleft" width="1024" caption="Out on Lake Burley Griffin with members of the Canberra Ice Dragons Paddle Club, January 2014"]<a href="http://www.andrewleigh.com/blog/wp-content/uploads/2014/01/ACT-Dragon-Boat-3.jpg"><img class="size-large wp-image-5582" title="ACT Dragon Boat 3" src="http://www.andrewleigh.com/blog/wp-content/uploads/2014/01/ACT-Dragon-Boat-3-1024x682.jpg" alt="" width="1024" height="682" /></a>[/caption]

<div class="mceTemp"><strong>Ca</strong><strong>l</strong><span style="font-weight: bold;">l for Local Sporting Champions to step up and apply for grants on offer</span></div>
Young people can find it difficult to meet the ongoing and significant costs associated with participation at sporting competitions.

The Local Sporting Champions program is designed to provide financial assistance for young people towards the cost of travel, accommodation, uniforms or equipment when competing, coaching or officiating at an official sports event.

For more information on the Local Sporting Champions program visit the Australian Sports Commission website: <a href="http://www.ausport.gov.au/champions">www.ausport.gov.au/champions</a>.]]>

1 Answer


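This answers your question as asked, though I'm not sure it's the right question to ask about the data conversion. It would probably be simpler to deal with this before exporting the database to XML. Still, if you want to do the replacement in Python with a regular expression, you can do it like this: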

import re

contents = "..."  # placeholder: load your post contents here
contents = re.sub(r'\[/?caption[^\]]*?\]', '', contents)

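About the regular expression:

  • \[ matches a literal opening square bracket [
  • /? optionally matches a slash /
  • caption matches the word caption
  • [^\]]*? lazily matches any characters that are not a closing square bracket ]
  • \] matches a literal closing square bracket ]

This matches both [caption foo="bar"] and [/caption].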
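To tie this back to the function shape requested in the question, here is a minimal sketch that wraps the same pattern in strip_caption_tags. The pattern is written in verbose mode so that each piece carries the explanation from the list above. The constant name CAPTION_TAG and the shortened sample string (with made-up file names) are assumptions for illustration and are not taken from the wp2nb repository.

```python
import re

# Same pattern as in the answer, written in verbose mode so each piece
# can carry its own comment. CAPTION_TAG is a hypothetical name.
CAPTION_TAG = re.compile(
    r"""
    \[          # a literal opening bracket
    /?          # an optional slash (present on the closing tag)
    caption     # the word "caption"
    [^\]]*?     # lazily, any characters that are not a closing bracket
    \]          # a literal closing bracket
    """,
    re.VERBOSE,
)


def strip_caption_tags(content):
    """Remove [caption ...] and [/caption] shortcodes, keeping what is between them."""
    return CAPTION_TAG.sub("", content)


# Shortened, illustrative sample; the file names are placeholders.
sample = (
    '[caption id="attachment_5582" align="alignleft" width="1024" '
    'caption="Out on Lake Burley Griffin"]'
    '<a href="photo.jpg"><img src="photo-1024x682.jpg" alt="" /></a>'
    '[/caption]'
)
print(strip_caption_tags(sample))
# prints: <a href="photo.jpg"><img src="photo-1024x682.jpg" alt="" /></a>
```

One limitation to keep in mind: because the pattern stops at the first ], a caption attribute whose text itself contains a closing bracket would be cut short and leave a fragment behind.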

You can try this example on Regex101, which also gives a more detailed explanation of the pattern.
