使用pythondocx突出显示或加粗文本文件中的字符串？

from docx import Document doc = Document('BSA.docx') peptide_list = ['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA'] def highlight_peptides(text, keywords): text = text.paragraphs[1].text replacement = "\033[91m" + "\\1" + "\033[39m" enter code here`text = re.sub("(" + "|".join(map(re.escape, keywords)) + ")", replacement, text, flags=re.I) highlight_peptides(doc, peptide_list)

1条回答

网友

1楼 · 发布于 2024-05-16 14:27:43

您可以使用third-party ^{} module进行重叠关键字搜索。然后，可能最容易在两次过程中完成匹配：（1）存储每个高亮显示段的开始和结束位置，并组合任何重叠部分：

import regex as re # important - not using the usual re module

def find_keywords(keywords, text):
    """ Return a list of positions where keywords start or end within the text. 
    Where keywords overlap, combine them. """
    pattern = "(" + "|".join(re.escape(word) for word in keywords) + ")"
    r = []
    for match in re.finditer(pattern, text, flags=re.I, overlapped=True):
        start, end = match.span()
        if not r or start > r[-1]: 
            r += [start, end]  # add new segment
        elif end > r[-1]:
            r[-1] = end        # combine with previous segment
    return r

positions = find_keywords(keywords, text)

您的“关键字覆盖率”（突出显示的百分比）可以计算为：

coverage = sum(positions[1::2]) - sum(positions[::2]) # sum of end positions - sum of start positions
percent_coverage = coverage * 100 / len(text)

然后（2）使用^{} properties in ^{}向文本添加格式：

import docx

def highlight_sections_docx(positions, text):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    document = docx.Document()
    p = document.add_paragraph()
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        run = p.add_run(text[start:end])
        if i % 2:  # odd segments are highlighted
            run.bold = True   # or add other formatting - see https://python-docx.readthedocs.io/en/latest/api/text.html#run-objects
    return document

doc = highlight_sections_docx(positions, text)
doc.save("my_word_doc.docx")

或者，您可以突出显示html中的文本，然后使用^{}包将其保存到Word文档：

def highlight_sections(positions, text, start_highlight="<mark>", end_highlight="</mark>"):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    r = ""
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        if i % 2:  # odd segments are highlighted
            r += start_highlight + text[start:end] + end_highlight
        else:      # even segments are not
            r += text[start:end]
    return r

from htmldocx import HtmlToDocx
s = highlight_sections(positions, text, start_highlight="<strong>", end_highlight="</strong>")
html = f"""<html><head></head><body><span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span></body></html>"""
HtmlToDocx().parse_html_string(html).save("my_word_doc.docx")

（<mark>将是比<strong>更适合使用的html标记，但不幸的是，HtmlToDocx没有保留<mark>的任何格式，并且忽略CSS样式）

highlight_sections也可用于输出到控制台：

print(highlight_sections(positions, text, start_highlight="\033[91m", end_highlight="\033[39m"))

。。。或连接到Jupyter/IPython笔记本：

from IPython.core.display import HTML
s = highlight_sections(positions, text)
display(HTML(f"""<span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span>""")

相关问题更多 >

编程相关推荐

热门问题

热门文章