Is there a more Pythonic way to merge two HTML header rows that use colspan?

Asked 2025-04-11 20:42

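I am using Python's BeautifulSoup library to parse some HTML. The problem is that some header rows have differing colspans. By header rows I mean the rows that have to be merged to produce the column headings. A cell in one row may span several columns of the row above or below it, so the text has to be aligned according to those spans. Below is the routine I use to handle this. I use BeautifulSoup to pull the colspan and the contents of each cell in each row. longHeader is the header row with the most items, and spanLong is a list holding the colspan of each cell in that row. This works, but it does not look very Pythonic.

Also, it does not work when the span difference is less than 0. I can fix that with the same approach I have used so far, but before I do, I would like someone to take a quick look and suggest something more Pythonic. I am a former SAS programmer, so I still tend to write code the SAS way.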

longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    for i in range(0,spanDiff):
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
        spanDiff=sumSpanShort-sumSpanLong
        if spanDiff==0:
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            break

print combinedHeader

5 Answers

1

Perhaps look at the zip function for part of the problem:

>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]

>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>> 
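
One thing to keep in mind is that zip pairs items by position and stops at the end of the shorter list, which is why the output above has 15 pairs rather than 19. The following is a minimal sketch with made-up, shortened lists (not the real headers); it compares zip with itertools.zip_longest, which pads the shorter list instead. The import fallback is only there so the sketch runs on both Python 2 and Python 3.

```python
try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

# Shortened, illustrative lists (assumption: not the question's real data)
long_header = ['', 'bananas', '', 'trains']
short_header = ['', 'bunches']

# zip stops as soon as the shorter list runs out
pairs = list(zip(long_header, short_header))

# zip_longest keeps going and fills the gaps with fillvalue
padded = list(zip_longest(long_header, short_header, fillvalue=''))

print(pairs)
print(padded)
```

Neither call knows anything about colspan, so this only covers the pairing step once both rows have been brought to the same length.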

enumerate can help you avoid complicated indexing with counters:

>>> diff_list = []
>>> new_shortlist = []
>>> for place, header in enumerate(short_header):
    diff_list.append(abs(span_short[place] - span_long[place]))

>>> for place, num in enumerate(diff_list):
    if num:
        new_shortlist.extend(short_header[place] for item in range(num+1))
    else:
        new_shortlist.append(short_header[place])


>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',... 
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...

Also, more Pythonic naming may make the code clearer:

    for each in range(len(short_header)):
        sum_span_long += span_long[long_header_count]
        sum_span_short += span_short[each]
        span_diff = sum_span_short - sum_span_long
        if not span_diff:
            combined_header.append...

3

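Here is a modified version of your algorithm. zip is used to iterate over the short row's spans and headers together, and a class object is used to count and step through the long row's items as well as to combine the headers. while is a better fit for the inner loop. (Please forgive the overly short names.)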

class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [ 
       '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
    '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader

2

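You actually have a lot going on in this example.

  1. You have over-processed the Beautiful Soup Tag objects. Just leave them as Tags.

  2. All merge algorithms of this kind are hard. It helps to treat the two things being merged symmetrically.

Here is a version that works directly with the Beautiful Soup Tag objects. Also, this version assumes nothing about the lengths of the two rows.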

def merge3( row1, row2 ):
    def span( cell ):
        # Beautiful Soup returns attribute values as strings, so convert;
        # a missing colspan is treated as 1 (an assumption)
        return int( cell.get('colspan', 1) )
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
        if i1 == len(row1):
            # row1 is used up: emit the remaining row2 cells
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is used up: emit the remaining row1 cells
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1
        else:
            if span(row1[i1]) < span(row2[i2]):
                # row2's cell is wider: repeat its text for the extra cols
                c1= span(row1[i1])
                while c1 != span(row2[i2]):
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif span(row1[i1]) > span(row2[i2]):
                # row1's cell is wider: repeat its text for the extra cols
                c2= span(row2[i2])
                while span(row1[i1]) != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result
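
The idea of treating both rows symmetrically can also be sketched on the plain lists from the question instead of on Tags. This is only an illustration under stated assumptions, not the code above rewritten: cell_edges and merge_headers are hypothetical names, each row is given as a list of cell texts plus a list of integer colspans, and both rows must cover the same total number of columns. It is a different technique from merge3: it computes the column position where every cell ends, then emits one merged label for each segment between consecutive edges of either row, so it does not matter which row has the wider cell.

```python
def cell_edges(spans):
    """Return the right-hand column edge of each cell, e.g. [1, 1, 3] -> [1, 2, 5]."""
    edges, total = [], 0
    for span in spans:
        total += span
        edges.append(total)
    return edges


def merge_headers(cells1, spans1, cells2, spans2):
    """Merge two header rows; neither row is treated as the 'long' one."""
    if sum(spans1) != sum(spans2):
        raise ValueError("both rows must cover the same number of columns")
    edges1, edges2 = cell_edges(spans1), cell_edges(spans2)
    merged = []
    i1 = i2 = 0
    # Walk every edge that appears in either row, left to right
    for edge in sorted(set(edges1) | set(edges2)):
        # Advance each row to the cell that covers the segment ending at this edge
        while edges1[i1] < edge:
            i1 += 1
        while edges2[i2] < edge:
            i2 += 1
        merged.append((cells1[i1] + ' ' + cells2[i2]).strip())
    return merged


longHeader = ['', '', 'bananas', '', '', '', '', '', '', '', '', '', 'trains', '', 'planes', '', '', '', '']
shortHeader = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight', '', 'cargo', '', 'all other', '', '']
spanLong = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]
spanShort = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]

combinedHeader = merge_headers(longHeader, spanLong, shortHeader, spanShort)
print(combinedHeader)
```

Swapping the two rows in the call changes only the word order inside each label, so a negative span difference needs no special case. Unlike the code in the question, the labels are plain strings with surrounding whitespace stripped rather than one-element lists.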
