Is there a more Pythonic way to merge two HTML header rows with colspans?

Asked 2025-04-11 20:42

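I am using Python's BeautifulSoup library to parse some HTML. The problem is that some header rows have different colspans. By header rows I mean the rows that have to be combined to produce the column headings. A column may span several columns in the row above or below it, so the text has to be repeated or shifted according to those spans. Below is the routine I use to do this. I use BeautifulSoup to pull the colspan and the contents of each cell in each row. longHeader holds the contents of the header row with the most items, and spanLong is a list with the colspan of each item in that row. This works, but it does not look very Pythonic.

Also, it does not work if the span difference is less than 0. I can fix that with the same approach I used to get this far, but before I do, I would like someone to take a quick look and suggest a more Pythonic approach. I am a long-time SAS programmer, so I tend to fall back on the SAS way of thinking when I write code.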

longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    for i in range(0,spanDiff):
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
        spanDiff=sumSpanShort-sumSpanLong
        if spanDiff==0:
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            break

print(combinedHeader)


5 Answers

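Take a look at the zip function; it may solve part of the problem. For comparison, here is the output of your script, followed by what zip does with the two header lists: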

>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]

>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>>

enumerate can help you avoid complicated indexing with counters:

>>> diff_list = []
>>> for place, header in enumerate(short_header):
...     diff_list.append(abs(span_short[place] - span_long[place]))
...
>>> new_shortlist = []
>>> for place, num in enumerate(diff_list):
...     if num:
...         new_shortlist.extend(short_header[place] for item in range(num+1))
...     else:
...         new_shortlist.append(short_header[place])
...

>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',... 
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...
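
Note that zip pairs items purely by position and stops at the end of the shorter list, so it only helps once both rows have the same number of items. One way to get there, sketched below under the assumption that both rows cover the same total number of columns, is to repeat each cell text once per column it spans and then zip the two expanded rows. The helper name expand is made up for this illustration, the code assumes Python 3, and the result has one label per underlying table column (29 for the data in the question) rather than one per cell of the long header.

```python
from itertools import chain, repeat

long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]


def expand(cells, spans):
    """Repeat each cell text once for every column the cell spans."""
    return list(chain.from_iterable(
        repeat(text, span) for text, span in zip(cells, spans)))


long_cols = expand(long_header, span_long)
short_cols = expand(short_header, span_short)

# Both lists now have one entry per table column, so zip lines them up.
per_column = [(top + ' ' + bottom).strip()
              for top, bottom in zip(long_cols, short_cols)]
print(per_column[2])  # bananas bunches
```

Because the expansion never compares the two span lists with each other, it does not matter which row has the wider cell at a given position.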

Also, more Pythonic naming (lowercase with underscores) and augmented assignment would make the code easier to read:

    for each in range(len(short_header)):
        sum_span_long += span_long[long_header_count]
        sum_span_short += span_short[each]
        span_diff = sum_span_short - sum_span_long
        if not span_diff:
            combined_header.append...

Answered 2025-04-11 by Python大师


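Here is a modified version of your algorithm. zip is used to iterate over the short spans and the short headers together, and a class object is used to count and step through the long items as well as to combine the headers. A while loop is a better fit for the inner loop. (Please forgive the overly short names.)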

class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [
        '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
        '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader
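
The same walk can also be expressed without a helper class by working with column positions instead of running differences. The sketch below is an illustration rather than part of the answer above: the function name combine_headers is made up, it assumes Python 3 (for itertools.accumulate), and it assumes both rows cover the same total number of columns. For each long cell it looks up the short cell that covers the long cell's first column.

```python
from bisect import bisect_right
from itertools import accumulate


def combine_headers(long_header, span_long, short_header, span_short):
    """Return one combined label per cell of the long header row."""
    # Column index just past each short cell, e.g. spans [1, 1, 3] -> [1, 2, 5].
    short_ends = list(accumulate(span_short))
    combined = []
    start = 0  # first column covered by the current long cell
    for text, span in zip(long_header, span_long):
        # The short cell covering `start` is the first one ending after it.
        short_text = short_header[bisect_right(short_ends, start)]
        combined.append(text + ' ' + short_text)
        start += span
    return combined


long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

combined = combine_headers(long_header, span_long, short_header, span_short)
print(combined[2])  # bananas bunches
```

For the data in the question this produces the same 19 labels as the original loop, as plain strings instead of one-element lists. A long cell that is wider than the short cell beneath it only picks up the first short cell it covers, so this sketch does not fully solve the negative-difference case.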

Answered 2025-04-11 by Python大师


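You actually have a lot going on in this example.

You have over-processed the Beautiful Soup Tag objects by turning them into lists. Leave them as Tags.
All merge algorithms of this kind are hard. It helps to treat the two things being merged symmetrically.

Here is a version that works directly with the Beautiful Soup Tag objects. It also makes no assumption about the lengths of the two rows.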

def merge3( row1, row2 ):
    def span( cell ):
        # colspan is stored as a string and may be absent; default to 1
        return int( cell.get('colspan', 1) )
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
        if i1 == len(row1):
            # row1 is exhausted: take the remaining cells from row2
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is exhausted: take the remaining cells from row1
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1
        else:
            if span(row1[i1]) < span(row2[i2]):
                # Fill extra cols from row1
                c1= span(row1[i1])
                while c1 != span(row2[i2]):
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif span(row1[i1]) > span(row2[i2]):
                # Fill extra cols from row2
                c2= span(row2[i2])
                while span(row1[i1]) != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert span(row1[i1]) == span(row2[i2])
                pass
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result
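
To make the idea of treating both rows symmetrically concrete, here is a minimal two-pointer sketch. It is an illustration under stated assumptions, not the code from this answer: it works on plain (text, colspan) pairs rather than Tag objects, the name merge_rows is made up, and it assumes both rows cover the same total number of columns. At each step it emits the overlap between the current cell of each row and then advances whichever cell has been used up, so neither row needs to be the "long" one.

```python
def merge_rows(row1, row2):
    """Merge two header rows given as lists of (text, colspan) pairs.

    Returns a list of (label, width) pairs, one per stretch of columns
    over which neither row changes cell.
    """
    merged = []
    i = j = 0
    # Columns not yet consumed in the current cell of each row.
    left1 = row1[0][1] if row1 else 0
    left2 = row2[0][1] if row2 else 0
    while i < len(row1) and j < len(row2):
        step = min(left1, left2)  # width of the overlapping stretch
        label = (row1[i][0] + ' ' + row2[j][0]).strip()
        merged.append((label, step))
        left1 -= step
        left2 -= step
        if left1 == 0:  # current row1 cell fully consumed
            i += 1
            left1 = row1[i][1] if i < len(row1) else 0
        if left2 == 0:  # current row2 cell fully consumed
            j += 1
            left2 = row2[j][1] if j < len(row2) else 0
    return merged


# Small example where each row has a wider cell somewhere.
top = [('a', 1), ('b', 3)]
bottom = [('x', 2), ('y', 2)]
print(merge_rows(top, bottom))  # [('a x', 1), ('b x', 1), ('b y', 2)]
```

Because the comparison is on remaining width rather than on which row is longer, the case the question describes as a span difference below 0 is handled by the same loop.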

Answered 2025-04-11 by Python大师
