Python: scraping text between HTML tags

Posted 2024-04-20 01:17:30


All, this is a continuation of my previous post, but for a different scenario.

I now have a specific case where I need to extract the text between tags.

    data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c10">&nbsp;</DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
<DOCFULL> -->
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 16, 2016 Wednesday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section B; Column 0; Classified; Pg. 16</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3 </SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 16, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>

'''

What I have tried:

import re

# Raw strings avoid invalid-escape warnings; "<", ">" and '"' need no escaping in a regex.
publicationnamepattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">(.*)</SPAN></P>'
copyrightpattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">([^<]*)</SPAN>'

publicationnamepatternvalues = [a.strip("*") for a in re.findall(publicationnamepattern, data)]
copyrightpatternvalues = [a.strip("*") for a in re.findall(copyrightpattern, data)]

print(str(publicationnamepatternvalues))
print(str(copyrightpatternvalues))

Result:

['The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company', 'The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company']

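I need only "The New York Times" in publicationnamepatternvalues and "Copyright 2016 The New York Times Company" in copyrightpatternvalues.

I can't anchor the patterns on more static values, because only these fields are common across the data (for example, The New York Times).

Some records use span classes such as c2, others c4, and so on.

Can anyone help me with this scenario?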


2 Answers

Use BeautifulSoup:

from bs4 import BeautifulSoup

data = '''... your html ...'''

soup = BeautifulSoup(data, 'html.parser')

# <BR> is a void element: current bs4 parsers make <P> its next sibling, not its child
for x in soup.select('div.c0 > br + p.c1'):
    print(x.text)

Result:

The New York Times
Copyright 2016 The New York Times Company
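
For comparison, the regex approach from the question can be kept if the capture stops at the closing `</P>` and the inner tags are stripped afterwards. This is a minimal standard-library sketch, not a replacement for a real parser: `sample` is a shortened, hypothetical stand-in for the question's data, and `paragraph_pattern`, `tag_pattern` and `strip_tags` are names introduced here for illustration.

```python
import re
from html import unescape

# Shortened, hypothetical stand-in for the question's data (same tag structure).
sample = '''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''

# Non-greedy capture: take everything inside the paragraph up to the first </P>,
# without naming any SPAN class, since those vary between records.
paragraph_pattern = re.compile(r'<DIV CLASS="c0"><BR><P CLASS="c1">(.*?)</P>', re.DOTALL)
tag_pattern = re.compile(r'<[^>]+>')


def strip_tags(fragment):
    """Drop markup from a fragment and decode entities such as &amp;."""
    return unescape(tag_pattern.sub('', fragment)).strip()


texts = [strip_tags(match) for match in paragraph_pattern.findall(sample)]
print(texts)
```

Because the pattern no longer mentions a SPAN class, it does not depend on whether a record uses c2, c3 or another class. It still assumes the exact `<DIV CLASS="c0"><BR><P CLASS="c1">` prefix, which is the main reason a parser is usually the more robust choice.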

from bs4 import BeautifulSoup

a="""
data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON Robert. Robert &quot;Bob&quot; 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''
"""
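soup = BeautifulSoup(a, 'html.parser')  # name the parser explicitly to avoid bs4's warning
soup2 = soup.select('div.c0')
list1 = [b.text.strip() for b in soup2]  # text of each div.c0, whitespace trimmed
print(list1)
var1, var2 = list1[1], list1[2]  # publication name, copyright line
print(var1)
print(var2)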

Output:

['1 of 2 DOCUMENTS', 'The New York Times', 'Copyright 2016 The New York Times Company']
The New York Times
Copyright 2016 The New York Times Company
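
Picking `list1[1]` and `list1[2]` by position works for a single document, but the question's data holds several. One possible way to handle that, sketched below, is to classify each `div.c0` text by its content instead of its index. The `texts` list is hypothetical input shaped like `list1` for two documents, and `counter` and `split_fields` are illustrative names, not part of the answer above.

```python
import re

# Hypothetical input: what a list like list1 might contain for two documents.
texts = [
    '1 of 2 DOCUMENTS',
    'The New York Times',
    'Copyright 2016 The New York Times Company',
    '2 of 2 DOCUMENTS',
    'The New York Times',
    'Copyright 2016 The New York Times Company',
]

# Matches the "N of M DOCUMENTS" counter that opens each document.
counter = re.compile(r'\d+ of \d+ DOCUMENTS')


def split_fields(items):
    """Sort div.c0 texts into publication names and copyright lines."""
    names, copyrights = [], []
    for item in items:
        if counter.fullmatch(item):
            continue  # skip the document counter
        if item.startswith('Copyright'):
            copyrights.append(item)
        else:
            names.append(item)
    return names, copyrights


publication_names, copyright_lines = split_fields(texts)
print(publication_names)
print(copyright_lines)
```

This assumes every copyright line starts with the word "Copyright" and that the only other `div.c0` entries are the counter and the publication name, which holds for the sample shown but may not hold for every export.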
