Extract all content between two HTML tags with Beautiful Soup
I'm parsing content with Python and Beautiful Soup and then writing it out to a CSV file, but I'm having trouble getting at certain data. The data is first run through a TidyHTML cleanup of my own making, which strips out the other data I don't need.
The problem is that I need to get all of the data between a set of <h3> tags.
Sample data:
<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
September 14 1880. Discussion of curricular matters. Students are
debarred from taking algebra until they have completed both mental
and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
<ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
President's room of the University building; 11 October 1880. All
members present; 18 October 1880. Regular meeting 2. Moved that the
President wait on the property holders on 12th street and request
them to abate the nuisance on their property; 25 October 1880.
Moved that the senior and junior classes for rhetoricals be...</li></ul>
<h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>`
I need to get everything between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be hard, but I just can't figure it out. I can grab all of the <ul> tags, but that doesn't work, because there isn't a one-to-one relationship between the <h3> tags and the <ul> tags.
The output I'm looking for is:
Pages 1-18|Vol-1-pages-001.pdf|content between the tags
The first two parts haven't been a problem, but getting the content between the tags is difficult for me.
My current code is as follows:
import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'

html_cleanup = {'\r\r\n':'', '\n\n':'', '\n':'', '\r':'', '\r\r': '', '<img src="UOSymbol1.jpg" alt="" />':''}

for infile in glob.glob( os.path.join(html_path, '*.html') ):
    print "current file is: " + infile
    html = open(infile).read()
    for i, j in html_cleanup.iteritems():
        html = html.replace(i, j)

    # parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)
    # print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                             quoting=csv.QUOTE_NONE, escapechar=' ')

    # retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
        if title['href'].startswith('V'):
            # print title.string
            volume.append(title.string)
            i += 1
            # print soup('a')[i]['href']
            fileName.append(soup('a')[i]['href'])
    # print html_to_csv
    # html_to_csv.writerow([volume, fileName])

    # retrieve the summary of each archive and store
    # for body in soup.findAll('ul') or soup.findAll('ol'):
    #     summary.append(body)
    for body in soup.findAll('h3'):
        body.findNextSibling(text=True)
        summary.append(body)

    # print out each field into the csv file
    for c in range(i):
        pages = volume.popleft()
        path = fileName.popleft()
        notes = summary
        if not summary:
            notes = "help"
        if summary:
            notes = summary.popleft()
        html_to_csv.writerow([pages, path, notes])
2 Answers
Answer 1 (score: 0)
If you want to extract the data between <ul><li></li></ul> tags, lxml provides a great feature for this: CSS selectors, via CSSSelector / cssselect.
import lxml.html
import urllib

data = urllib.urlopen('file:///C:/Users/ranveer/st.html').read()  # contains your html snippet
doc = lxml.html.fromstring(data)
elements = doc.cssselect('ul li')  # CSS path (found using the Firebug extension)
for element in elements:
    print element.text_content()
After running the above code you will get all of the text that sits between the ul, li tags. It is a lot cleaner than using Beautiful Soup.
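Since the question actually asks for the content between consecutive <h3> tags, here is a minimal sketch of how that might look with lxml's itersiblings(), assuming the same local file as above; it walks each <h3>'s following siblings until it hits the next <h3>, which mirrors the grouping the asker wants:

import lxml.html
import urllib

data = urllib.urlopen('file:///C:/Users/ranveer/st.html').read()
doc = lxml.html.fromstring(data)

for h3 in doc.cssselect('h3'):
    link = h3.cssselect('a')[0]  # assumes each <h3> wraps one <a>, as in the sample
    notes = []
    # collect the following siblings until the next <h3> is reached
    for sibling in h3.itersiblings():
        if sibling.tag == 'h3':
            break
        notes.append(sibling.text_content())
    print '|'.join([link.text_content(), link.get('href'), ' '.join(notes)])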
If you plan on using lxml, you can also evaluate XPath expressions, like so:
import urllib
from lxml import etree

content = etree.HTML(urllib.urlopen("file:///C:/Users/ranveer/st.html").read())
content_text = content.xpath("html/body/h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href")
print content_text
You can change the XPath according to your requirements.
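One way to generalize that XPath from hard-coded indices to "everything between the i-th and (i+1)-th <h3>" is to count preceding <h3> siblings. A sketch, under the assumption (true for the sample data) that the <h3> and <ul> elements are all siblings and every <h3> contains an <a> with an href:

h3s = content.xpath('//h3')
for i in range(1, len(h3s) + 1):
    href = content.xpath('(//h3)[%d]/a/@href' % i)[0]
    # <ul> elements lying between the i-th and (i+1)-th <h3>
    # have exactly i preceding <h3> siblings
    notes = content.xpath('//ul[count(preceding-sibling::h3) = %d]//text()' % i)
    print href, ''.join(notes)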
Answer 2 (score: 2)
To extract content between </h3> and <h3> tags:
from itertools import takewhile

h3s = soup('h3')  # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # get elements in between
    between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
    # extract text
    print(''.join(getattr(el, 'text', el) for el in between_it))
The code assumes that all <h3> elements are siblings. If that is not the case then you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
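For completeness, here is one way this approach could be wired into the asker's CSV pipeline. This is a minimal sketch, assuming the same Python 2 / BeautifulSoup 3 setup as the question, where html is the cleaned-up markup and csv_path is the output path from the question's code; a None sentinel is appended so the last <h3> collects everything to the end of the document (BeautifulSoup 3's sibling generator ends by yielding None):

import csv
from itertools import takewhile
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)  # html: the cleaned-up markup from the question
html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                         quoting=csv.QUOTE_NONE, escapechar=' ')

h3s = soup('h3')
for h3, h3next in zip(h3s, h3s[1:] + [None]):
    pages = h3.a.string   # e.g. "Pages 1-18"
    path = h3.a['href']   # e.g. "Vol-1-pages-001.pdf"
    # stop at the next <h3>, or at the trailing None for the last section
    between_it = takewhile(lambda el: el is not h3next and el is not None,
                           h3.nextSiblingGenerator())
    notes = ''.join(getattr(el, 'text', el) for el in between_it).strip()
    html_to_csv.writerow([pages, path, notes])

This replaces the three parallel deques from the question with a single loop, so each page range, file name, and summary stay paired by construction.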