使用pythondocx突出显示或加粗文本文件中的字符串?

2024-05-16 14:27:43 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个“短字符串”列表,例如:

['mkwvtfillfssaysrgv','SSAYSRGVFRRDTHKSEIAH','KPKATEEQLKTVMENFVAFVDKCCA']

我需要匹配word文件(BSA.docx)或.txt文件(无所谓)中包含的“长字符串”,例如:

sp|P02769|ALBU_BOVIN Albumin OS=Bos taurus OX=9913 GN=ALB PE=1 SV=4 MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA

我希望使用python(在终端或jupyter笔记本中)获得以下内容:

  1. 突出显示长字符串中的短字符串匹配项。突出显示样式并不重要,它可以用黄色标记或粗体或下划线突出显示,任何可以跳到眼睛上看是否匹配的东西

  2. 查找长字符串的覆盖范围为((突出显示的字符数)/(长字符串的总长度))*100。请注意,长字符串中以“>;>;”开头的第一行只是一个标识符,需要忽略

以下是第一个任务的当前代码:

from docx import Document

doc = Document('BSA.docx')

peptide_list = ['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']

def highlight_peptides(text, keywords):
    text = text.paragraphs[1].text
    replacement = "\033[91m" + "\\1" + "\033[39m"
    enter code here`text = re.sub("(" + "|".join(map(re.escape, keywords)) + ")", replacement, text, flags=re.I)
    

highlight_peptides(doc, peptide_list)

问题是列表中的前两个短字符串重叠,结果中只有前一个字符串在序列中以红色突出显示

请参阅下面的第一个链接,其中包含我获得的输出结果

current result

请参阅第二个链接以可视化我的“理想”结果

ideal result

在理想情况下,我还包括了第二个任务,即查找序列覆盖率。我不知道如何计算彩色或突出显示的字符


Tags: 文件字符串textgtre列表doc字符
1条回答
网友
1楼 · 发布于 2024-05-16 14:27:43

您可以使用third-party ^{} module进行重叠关键字搜索。然后,可能最容易在两次过程中完成匹配:(1)存储每个高亮显示段的开始和结束位置,并组合任何重叠部分:

import regex as re # important - not using the usual re module

def find_keywords(keywords, text):
    """ Return a list of positions where keywords start or end within the text. 
    Where keywords overlap, combine them. """
    pattern = "(" + "|".join(re.escape(word) for word in keywords) + ")"
    r = []
    for match in re.finditer(pattern, text, flags=re.I, overlapped=True):
        start, end = match.span()
        if not r or start > r[-1]: 
            r += [start, end]  # add new segment
        elif end > r[-1]:
            r[-1] = end        # combine with previous segment
    return r

positions = find_keywords(keywords, text)

您的“关键字覆盖率”(突出显示的百分比)可以计算为:

coverage = sum(positions[1::2]) - sum(positions[::2]) # sum of end positions - sum of start positions
percent_coverage = coverage * 100 / len(text)

然后(2)使用^{} properties in ^{}向文本添加格式:

import docx

def highlight_sections_docx(positions, text):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    document = docx.Document()
    p = document.add_paragraph()
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        run = p.add_run(text[start:end])
        if i % 2:  # odd segments are highlighted
            run.bold = True   # or add other formatting - see https://python-docx.readthedocs.io/en/latest/api/text.html#run-objects
    return document

doc = highlight_sections_docx(positions, text)
doc.save("my_word_doc.docx")

或者,您可以突出显示html中的文本,然后使用^{}包将其保存到Word文档:

def highlight_sections(positions, text, start_highlight="<mark>", end_highlight="</mark>"):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    r = ""
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        if i % 2:  # odd segments are highlighted
            r += start_highlight + text[start:end] + end_highlight
        else:      # even segments are not
            r += text[start:end]
    return r

from htmldocx import HtmlToDocx
s = highlight_sections(positions, text, start_highlight="<strong>", end_highlight="</strong>")
html = f"""<html><head></head><body><span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span></body></html>"""
HtmlToDocx().parse_html_string(html).save("my_word_doc.docx")

<mark>将是比<strong>更适合使用的html标记,但不幸的是,HtmlToDocx没有保留<mark>的任何格式,并且忽略CSS样式)

highlight_sections也可用于输出到控制台:

print(highlight_sections(positions, text, start_highlight="\033[91m", end_highlight="\033[39m"))

。。。或连接到Jupyter/IPython笔记本:

from IPython.core.display import HTML
s = highlight_sections(positions, text)
display(HTML(f"""<span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span>""")

相关问题 更多 >