Parsing code out of text in Python

Posted 2024-05-16 22:21:00


I am analyzing the StackOverflow dump file "Posts.Small.xml" using pySpark. I want to separate the "code blocks" from the "text" within a row. A typical parsed row looks like this:

            ['[u"<p>I want to use a track-bar to change a form\'s opacity.</p>&#xA;&#xA;
        <p>This is my code:</p>&#xA;&#xA;<pre><code>decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;</code></pre>&#xA;&#xA;
    <p>When I try to build it, I get this error:</p>&#xA;&#xA;<blockquote>&#xA;  <p>Cannot implicitly convert type \'decimal\' to \'double\'.
</p>&#xA;</blockquote>&#xA;&#xA;<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.',
             '", u\'This code has worked fine for me in VB.NET in the past.',
             '\', u"</p>&#xA; When setting a form\'s opacity should I use a decimal or double?"]']

I tried "itertools" and a few Python functions, but got no results. My initial code for extracting the row above is:

^{pr2}$

Any ideas would be appreciated!


2 answers

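You can use XPath to extract the contents of the code elements (the lxml library helps here), and then select everything else to get the text content, for example: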

import lxml.etree


data = '''<p>I want to use a track-bar to change a form's opacity.</p>
          <p>This is my code:</p> <pre><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans;</code></pre>
          <p>When I try to build it, I get this error:</p>
          <p>Cannot implicitly convert type 'decimal' to 'double'.</p>
          <p>I tried making <code>trans</code> a <code>double</code>.</p>'''

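# Parse the fragment with lxml's forgiving HTML parser
html = lxml.etree.HTML(data)

# Text nodes that sit directly inside <code> elements
code_blocks = html.xpath('//code/text()')

# Every text node without a <code> ancestor. Selecting by ancestor keeps the
# prose of paragraphs that also contain inline code; whitespace-only nodes are dropped.
text_blocks = [t for t in html.xpath('//text()[not(ancestor::code)]') if t.strip()]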
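If installing lxml is not an option, the same split can be sketched with the standard library's `html.parser`. This is only an illustration under stated assumptions: the class name `CodeTextSplitter` and the helper `split_code_and_text` are made up for this example, and the sample body is a shortened version of the post above.

```python
from html.parser import HTMLParser


class CodeTextSplitter(HTMLParser):
    """Hypothetical helper: collects text inside <code> separately from other text."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#xA; into real characters
        super().__init__(convert_charrefs=True)
        self.code_depth = 0      # greater than 0 while inside a <code> element
        self.code_blocks = []
        self.text_blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth > 0:
            self.code_depth -= 1

    def handle_data(self, data):
        if not data.strip():
            return  # skip whitespace-only runs between tags
        target = self.code_blocks if self.code_depth else self.text_blocks
        target.append(data.strip())


def split_code_and_text(body):
    """Return (code_blocks, text_blocks) for one HTML post body."""
    parser = CodeTextSplitter()
    parser.feed(body)
    parser.close()
    return parser.code_blocks, parser.text_blocks


# Shortened sample body, assumed to look like the row in the question
body = (
    "<p>This is my code:</p>&#xA;&#xA;"
    "<pre><code>decimal trans = trackBar1.Value / 5000;&#xA;"
    "this.Opacity = trans;&#xA;</code></pre>&#xA;&#xA;"
    "<p>I tried making <code>trans</code> a <code>double</code>.</p>"
)
code_blocks, text_blocks = split_code_and_text(body)
print(code_blocks)
print(text_blocks)
```

Because the parser tracks nesting depth rather than matching strings, inline `<code>` fragments end up in `code_blocks` too, and the surrounding sentence is split around them. Whether that is desirable depends on what you do with the text afterwards.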

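The simplest approach is probably to apply a regular expression to the text that matches the tags "<code>" and "</code>". That lets you find the code blocks. You don't say what you want to do with them afterwards, though, so...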

from itertools import zip_longest

sample_paras = [
    """<p>I want to use a track-bar to change a form\'s opacity.</p>&#xA;&#xA;<p>This is my code:</p>&#xA;&#xA;<pre><code>decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;</code></pre>&#xA;&#xA;<p>When I try to build it, I get this error:</p>&#xA;&#xA;<blockquote>&#xA;  <p>Cannot implicitly convert type \'decimal\' to \'double\'. </p>&#xA;</blockquote>&#xA;&#xA;<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.""",
    """This code has worked fine for me in VB.NET in the past.""",
    """</p>&#xA; When setting a form\'s opacity should I use a decimal or double?""",
]

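# Join the fragments of the row into a single string
single_block = " ".join(sample_paras)

import re
# Splitting on <code> and </code> yields pieces that alternate: text, code, text, ...
separate_code = re.split(r"</?code>", single_block)

# Pair the pieces up two at a time, then transpose into (texts, codes);
# fillvalue="" pads the last pair when the number of pieces is odd
text_blocks, code_blocks = zip(*zip_longest(*[iter(separate_code)] * 2, fillvalue=""))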

print("Text:\n")
for t in text_blocks:
    print(" ")
    print(t)

print("\n\nCode:\n")
for t in code_blocks:
    print(" ")
    print(t)
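The split above leaves the `&#xA;` entities and the remaining HTML tags in the output. As a possible follow-up, here is a minimal sketch that treats only `<pre><code>` sections as code blocks, decodes the entities, and strips the other tags from the prose. The function name `split_post` and the two patterns are assumptions made for this example, and regular expressions are only a rough fit for HTML, so treat it as a starting point rather than a robust parser.

```python
import html
import re

# Assumed patterns: block-level code is wrapped as <pre><code>...</code></pre>
CODE_BLOCK = re.compile(r"<pre><code>(.*?)</code></pre>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def split_post(body):
    """Return (code_blocks, text) for one post body (hypothetical helper)."""
    # Pull out the code first, then decode entities such as &#xA; in each piece
    code_blocks = [html.unescape(c).strip() for c in CODE_BLOCK.findall(body)]
    # Remove the code blocks, drop the remaining tags, then decode entities
    remainder = CODE_BLOCK.sub(" ", body)
    text = html.unescape(TAG.sub("", remainder))
    # Collapse runs of whitespace left behind by the removed markup
    text = re.sub(r"\s+", " ", text).strip()
    return code_blocks, text


# Shortened sample body, assumed to look like the row in the question
body = (
    "<p>This is my code:</p>&#xA;&#xA;"
    "<pre><code>decimal trans = trackBar1.Value / 5000;&#xA;"
    "this.Opacity = trans;&#xA;</code></pre>&#xA;&#xA;"
    "<p>I tried making <code>trans</code> a <code>double</code>.</p>"
)
code_blocks, text = split_post(body)
print(code_blocks)
print(text)
```

Entities are decoded only after the tags have been matched, so code that itself contains escaped markup (for example `&lt;code&gt;`) is not mistaken for a real tag. Inline `<code>` words stay in the prose here, which keeps sentences readable.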
