Extracting ":"-separated text with BeautifulSoup

Asked 2025-04-17 22:56

The page source contains the following section:

<TR>
<TD width="40%">Company No. <I>(CO.)</I> : <B>056</B></TD>
<TD width="40%">Country Code. <I>(CC.)</I> : <B>3532 </B></TD></TR>
<TR>
<TD>Register <I>(Reg.)</I> : <B>FD522</B></TD>
<TD>Credit<I>(CD.) </I>: <B>YES</B></TD></TR>
<TR>
<TD>Type <I>(TP.)</I> : <B>PRIVATE</B></TD></TR>

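The abbreviated labels such as "CO.", "CC.", "Reg.", "CD." and "TP." are shown in italics, while the values such as "056", "3532" and "FD522" are shown in bold. The two are separated by ":".

I want to extract the labels and the values separately with BeautifulSoup, but I have not managed to.

I am using: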

soup.find_all("td")

but it does not work well. It returns "Company No. (CO.) : 056" as a single string, whereas I want the pieces separately: "Company No.", "CO." and "056".

I also tried:

all_texts = soup.find_all(":")

or:

all_texts = soup.find_all("/b")

and so on, but none of them work.
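
A note on why the last two attempts return nothing: find_all() matches tag names, and neither ":" nor "/b" is a tag name. Below is a minimal sketch that searches for the "i" and "b" tags instead. The variable name html and the choice of "html.parser" are assumptions made for illustration; the markup is the fragment from the question.

```python
from bs4 import BeautifulSoup

# The fragment from the question, held in a string for illustration.
html = """
<TR>
<TD width="40%">Company No. <I>(CO.)</I> : <B>056</B></TD>
<TD width="40%">Country Code. <I>(CC.)</I> : <B>3532 </B></TD></TR>
<TR>
<TD>Register <I>(Reg.)</I> : <B>FD522</B></TD>
<TD>Credit<I>(CD.) </I>: <B>YES</B></TD></TR>
<TR>
<TD>Type <I>(TP.)</I> : <B>PRIVATE</B></TD></TR>
"""
soup = BeautifulSoup(html, "html.parser")

no_match = soup.find_all(":")  # ":" is not a tag name, so this list is empty
abbreviations = [i.get_text(strip=True) for i in soup.find_all("i")]
values = [b.get_text(strip=True) for b in soup.find_all("b")]

print(no_match)
print(abbreviations)
print(values)
```

The parser lowercases tag names, so "i" and "b" match the uppercase <I> and <B> in the source.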

Result

The answers below suggested two approaches, recorded here for reference:

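This approach is meant to get the bold text, but in some cases the last character is missing:

# aa is presumably the result of soup.find_all("td"); it is not defined in the post
for bb in aa:
    cc = bb.get_text()
    dd = cc[cc.find("<b>")+1 : cc.find("</b>")]
    print(dd)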
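
A likely explanation for the missing last character, assuming aa holds the <td> elements: get_text() has already removed the tags, so both find() calls return -1 and the slice becomes cc[0:-1]. The sketch below reproduces this on one cell and shows one possible alternative, reading the <b> element directly. The names td, truncated and bold are introduced here for illustration only.

```python
from bs4 import BeautifulSoup

html = '<td>Company No. <i>(CO.)</i> : <b>056</b></td>'
td = BeautifulSoup(html, "html.parser").td

text = td.get_text()           # tags are gone: "Company No. (CO.) : 056"
start = text.find("<b>") + 1   # find() returns -1, so start is 0
end = text.find("</b>")        # also -1
truncated = text[start:end]    # text[0:-1]: the whole text minus its last character

bold = td.b.get_text(strip=True)  # read the bold element itself instead

print(repr(truncated))
print(repr(bold))
```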

In this approach, "ee" and "ff" hold the label and the value, that is, the text before and after the ":".

for bb in aa:
    cc = bb.get_text()
    dd = cc.split(' :')
    ee = dd[0]   # label: text before the ":"
    ff = dd[-1]  # value: text after the ":"


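3 Answers

This is really just simple string handling rather than a BS4 (Beautiful Soup 4) problem. You can do it as shown below. Note that this may not be the best way; I wrote it like this to keep the explanation clear.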

from bs4 import BeautifulSoup as bsoup

# Name the parser explicitly and close the file once it has been read.
with open("test.html") as ofile:
    soup = bsoup(ofile, "html.parser")

tds = soup.find_all("td")
templist = [td.get_text() for td in tds]

newlist = []
for temp in templist:
    whole = temp.split(":")  # Separate by ":" first.
    half = whole[0].split("(")  # Split the first half on the open paren.
    first = half[0].strip()  # First of three elements.
    second = half[1].replace(")", "").strip()  # Second of three elements.
    third = whole[1].strip()  # Second part of the first split gives the third element.
    newlist.append([first, second, third])

for lst in newlist:
    print(lst)  # Just print it out.

Result:

['Company No.', 'CO.', '056']
['Country Code.', 'CC.', '3532']
['Register', 'Reg.', 'FD522']
['Credit', 'CD.', 'YES']
['Type', 'TP.', 'PRIVATE']
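
For comparison, here is a sketch of the same three-way split that navigates the tags instead of splitting strings. It assumes every cell has the shape shown in the question (leading text, then one <i>, then one <b>); parse_cell is a hypothetical helper name, not something from the thread.

```python
from bs4 import BeautifulSoup

def parse_cell(td):
    """Return (label, abbreviation, value) for a <td> shaped like the sample."""
    label = td.find(string=True).strip()          # first text node in the cell
    abbr = td.i.get_text(strip=True).strip("()")  # italic text without parentheses
    value = td.b.get_text(strip=True)             # bold text
    return label, abbr, value

html = (
    "<TR><TD>Register <I>(Reg.)</I> : <B>FD522</B></TD>"
    "<TD>Credit<I>(CD.) </I>: <B>YES</B></TD></TR>"
)
soup = BeautifulSoup(html, "html.parser")
rows = [parse_cell(td) for td in soup.find_all("td")]

for row in rows:
    print(row)
```

A cell missing its <i> or <b> would raise AttributeError here, so real input may need a guard.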

Let us know if this helps.

Answered 2025-04-17 by Python大师


You don't have to force yourself to use BeautifulSoup functions to separate this data, because each piece has its own delimiter. For example:

<TD width="40%">Company No. <I>(CO.)</I> : <B>056</B></TD>

Company No. is delimited by ".".
(CO.) is delimited by ":".
056 is inside the <B></B> tags.

I suggest using substring operations to pull the data out of each <td>:

# grab all td
all_texts = soup.find_all("td")
for td in all_texts:
    # convert the td into its HTML string (the parser lowercases tag names)
    td = str(td)
    txt1 = td[td.find(">")+1 : td.find("<i>")].strip()     # 1st item: between <td ...> and <i>
    txt2 = td[td.find("<i>")+3 : td.find("</i>")].strip()  # 2nd item: inside <i>...</i>
    txt3 = td[td.find("<b>")+3 : td.find("</b>")].strip()  # 3rd item: inside <b>...</b>
    print(txt1)
    print(txt2)
    print(txt3)
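
Slicing on literal tag strings depends on the exact serialized markup. One alternative, sketched here as an assumption rather than as part of this answer, is to apply a regular expression to the plain text of each cell. The pattern CELL and the helper split_cell are hypothetical names, and the pattern assumes the "label (abbr) : value" layout from the question.

```python
import re

# label, then "(abbr)", then ":", then value; whitespace around each part is ignored
CELL = re.compile(
    r"^\s*(?P<label>.*?)\s*\((?P<abbr>[^)]*)\)\s*:\s*(?P<value>.*?)\s*$"
)

def split_cell(text):
    """Return (label, abbr, value), or None if the text does not fit the layout."""
    m = CELL.match(text)
    if m is None:
        return None
    return m.group("label"), m.group("abbr").strip(), m.group("value")

# Strings as td.get_text() would return them for three of the sample cells.
samples = ["Company No. (CO.) : 056", "Country Code. (CC.) : 3532 ", "Credit(CD.) : YES"]
parsed = [split_cell(s) for s in samples]

for item in parsed:
    print(item)
```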

Answered 2025-04-17 by Python大师


Use findAll to get the right part of the HTML document, then use:

text = soup.get_text()
print(text)

Then use '.split()' to break it into a list.

for line in soup.get_text().split('\n'):
    if line != '':
        print(line.split())
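
One caveat with a plain split() is that it breaks on every space, so a label like "Company No." ends up as two tokens. A possible variation, assuming the cells look like the sample, is to iterate over stripped_strings, which yields one string per text node. The names naive and cells are introduced here for illustration.

```python
from bs4 import BeautifulSoup

html = (
    "<TR><TD>Company No. <I>(CO.)</I> : <B>056</B></TD>\n"
    "<TD>Credit<I>(CD.) </I>: <B>YES</B></TD></TR>"
)
soup = BeautifulSoup(html, "html.parser")

# Whitespace split: the two-word label is broken apart.
naive = soup.td.get_text().split()

# One entry per text node, with surrounding whitespace removed.
cells = [list(td.stripped_strings) for td in soup.find_all("td")]

print(naive)
print(cells)
```

Each inner list then holds the label, the parenthesized abbreviation, the ":" separator and the value, in that order.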

Answered 2025-04-17 by Python大师
