Is there a more Pythonic way to merge two HTML header rows with colspans?
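I am using Python's BeautifulSoup library to parse some HTML. The problem I have run into is that the colspans differ across header rows. (By header rows I mean the rows that have to be combined to produce the column headings.) One column may span several columns above or below it, so the text has to be appended or prepended according to the spans.
Below is the routine I use to handle this. I use BeautifulSoup to pull the colspan and the contents of each cell in each row. longHeader holds the contents of the header row with the most cells, and spanLong is a list of the colspans of each cell in that row. This works, but it does not look very Pythonic.
Also, it will not work if the span difference is less than 0. I can fix that with the same approach I used to get this far, but before I do, I would like to know whether anyone can take a quick look and suggest a more Pythonic approach. I am a long-time SAS programmer, so I tend to write code as if I were writing SAS.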
longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        # Both rows end on the same column: one long cell pairs with this short cell
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    # The short cell is wider: keep consuming long cells until the spans line up
    for i in range(0,spanDiff):
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
        spanDiff=sumSpanShort-sumSpanLong
        if spanDiff==0:
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            break

print(combinedHeader)
5 Answers
1
Maybe look at the zip function for part of the problem:
>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]
>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>>
enumerate can help you avoid some of the complex indexing with counters:
>>> diff_list = []
>>> for place, header in enumerate(short_header):
...     diff_list.append(abs(span_short[place] - span_long[place]))
...
>>> new_shortlist = []
>>> for place, num in enumerate(diff_list):
...     if num:
...         new_shortlist.extend(short_header[place] for item in range(num+1))
...     else:
...         new_shortlist.append(short_header[place])
...
>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',...
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...
Also, more Pythonic naming may make the code clearer:
for each in range(len(short_header)):
    sum_span_long += span_long[long_header_count]
    sum_span_short += span_short[each]
    span_diff = sum_span_short - sum_span_long
    if not span_diff:
        combined_header.append...
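Returning to the zip suggestion earlier in this answer, one way to lean on it is to expand each row so that every cell is repeated once per grid column it spans; the two expanded rows then have the same length and zip pairs them directly. This is only a sketch written for Python 3, not code from the thread: the helper name expand is hypothetical, and it assumes both rows cover the same total number of columns. Note that it yields one label per underlying column (29 for the sample data), not one per cell of the longer row as in the question's output.

```python
from itertools import chain, repeat

def expand(cells, spans):
    """Repeat each cell's text once for every grid column it spans."""
    return list(chain.from_iterable(
        repeat(text, span) for text, span in zip(cells, spans)))

# Sample data copied from the question
long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]

# After expansion both rows have one entry per grid column, so zip lines them up.
columns = [(top + ' ' + bottom).strip()
           for top, bottom in zip(expand(long_header, span_long),
                                  expand(short_header, span_short))]
print(columns)
```

If one label per original cell is wanted instead, the expanded list can be sampled at the columns where cells begin.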
3
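Here is a modified version of your algorithm. zip is used to iterate over the short row's spans and headers together, and a class object is used to count and step through the long row's items as well as to combine the headers. while is more appropriate for the inner loop. (Forgive the overly short names.)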
class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [
        '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
        '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader
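The running sums in the loop above amount to tracking the column at which each cell starts. As a sketch, assuming both rows cover the same total number of columns, those start columns can be computed up front with itertools.accumulate and looked up with bisect, which removes the manual counters and works regardless of which row has the wider cell. The names cell_starts and merge_headers are hypothetical and do not come from the thread; the code is written for Python 3.

```python
from bisect import bisect_right
from itertools import accumulate

def cell_starts(spans):
    """Return the grid column at which each cell begins."""
    return [0] + list(accumulate(spans))[:-1]

def merge_headers(cells_a, spans_a, cells_b, spans_b):
    """Emit one 'a b' label for every column where a cell starts in either row."""
    starts_a = cell_starts(spans_a)
    starts_b = cell_starts(spans_b)
    merged = []
    for column in sorted(set(starts_a) | set(starts_b)):
        # bisect_right(...) - 1 is the index of the cell covering this column
        text_a = cells_a[bisect_right(starts_a, column) - 1]
        text_b = cells_b[bisect_right(starts_b, column) - 1]
        merged.append(text_a + ' ' + text_b)
    return merged

# Sample data copied from the question
long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]

merged = merge_headers(long_header, span_long, short_header, span_short)
print(merged)
```

For the sample data this gives the same 19 labels as the question's output, as plain strings rather than one-element lists.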
2
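You actually have a lot going on in this example.
You have over-processed the Beautiful Soup Tag objects by turning them into lists. Leave them as Tags.
All merge algorithms of this kind are hard. It helps to treat the two things being merged symmetrically.
Here is a version that should work directly with the Beautiful Soup Tag objects. Also, this version assumes nothing about the lengths of the two rows.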
def merge3( row1, row2 ):
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
        if i1 == len(row1):
            # row1 is exhausted: emit what is left of row2
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is exhausted: emit what is left of row1
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1
        else:
            # colspan attributes are strings, so convert before comparing
            span1= int(row1[i1]['colspan'])
            span2= int(row2[i2]['colspan'])
            if span1 < span2:
                # Fill extra cols from row1
                c1= span1
                while c1 != span2:
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif span1 > span2:
                # Fill extra cols from row2
                c2= span2
                while span1 != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert span1 == span2
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result
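The symmetric treatment suggested above can also be sketched as a two-pointer walk in which each row tracks how many columns of its current cell remain. This is an illustrative variant, not the answerer's code: it works on plain (text, colspan) pairs rather than Tag objects, merge_rows is a hypothetical name, and the sample rows are made up. With Beautiful Soup, such pairs could presumably be built from each cell with tag.get_text() and int(tag.get('colspan', 1)).

```python
def merge_rows(row1, row2):
    """Merge two rows of (text, colspan) pairs, treating both rows the same way."""
    result = []
    it1, it2 = iter(row1), iter(row2)
    # (None, 0) marks an exhausted row
    text1, left1 = next(it1, (None, 0))
    text2, left2 = next(it2, (None, 0))
    while text1 is not None and text2 is not None:
        result.append(text1 + ' ' + text2)
        # Consume the columns the two current cells have in common
        step = min(left1, left2)
        left1 -= step
        left2 -= step
        if left1 == 0:
            text1, left1 = next(it1, (None, 0))
        if left2 == 0:
            text2, left2 = next(it2, (None, 0))
    return result

# Made-up rows: the wider cell is in row1 first, then in row2
top = [('fruit', 2), ('car', 1), ('truck', 1)]
bottom = [('fresh', 1), ('dried', 1), ('vehicles', 2)]
print(merge_rows(top, bottom))
# ['fruit fresh', 'fruit dried', 'car vehicles', 'truck vehicles']
```

Because neither row is privileged, the case the question describes as a negative span difference needs no special handling. The walk stops as soon as either row runs out, so rows with different total widths are truncated to the shorter one.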