How does text() extract table rows when using XPath with lxml?

Posted 2024-05-16 22:19:53


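The code below outputs exactly what I want, but I don't understand why. Can someone explain the relationship between doc.xpath('//tr') and row.xpath('.//*/text()') when iterating over the rows? Or give a general explanation of //tr and .//*/text()?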

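import lxml.html

# Path to a locally saved copy of the HTML page
fi = '/home/jesc/html/oakland-raiders'

doc = lxml.html.parse(fi)

# Select every tr element in the document
rows = doc.xpath('//tr')

# For each row, collect the text nodes of all elements inside that row
data = [row.xpath('.//*/text()') for row in rows]

for line in data:
    print(line)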

This gives the "tr" table rows I want to extract:

['PASSING']                                                                     
['TEAM', 'ATT', 'COMP', 'PCT', 'YDS', 'AVG', 'YDS/G', 'LONG', 'TD', 'TD%', 'INT', 'INT%', 'SACK', 'YDSL', 'RATE']                                                                       
['Raiders', '596', '379', '63.6', '4051', '6.9', '253.2', '75', '29', '4.9', '7', '1.2', '18.0', '86', '95.3']
['Opponents', '541', '328', '60.6', '4120', '7.9', '257.5', '98', '27', '5.0', '16', '3.0', '25.0', '147', '89.8']
['RUSHING']                                                                     
['TEAM', 'ATT', 'YDS', 'AVG', 'LONG', '20+', 'TD', 'YDS/G', 'FUM', 'FUML', '1DN']
['Raiders', '434', '1922', '4.4', '75', '19', '17', '120.1', '8', '4', '98']    
['Opponents', '421', '1881', '4.5', '64', '10', '18', '117.6', '11', '4', '94'] 
['RECEIVING']                                                                   
['TEAM', 'REC', 'TAR', 'YDS', 'AVG', 'TD', 'LONG', '20+', 'YDS/G', 'FUM', 'FUML', 'YAC', '1DN']
['Raiders', '379', '596', '4137', '10.9', '29', '75', '51', '258.6', '4', '0', '1931', '198']
['Opponents', '328', '541', '4267', '13.0', '27', '98', '61', '266.7', '5', '1', '1858', '188']
['DOWNS']                                                                       
['FIRST DOWNS', 'THIRD DOWNS', 'FOURTH DOWNS', 'PENALTIES']                     
['TEAM', 'TOTAL', 'RUSH', 'PASS', 'PEN', 'MADE', 'ATT', 'PCT', 'MADE', 'ATT', 'PCT', 'TOTAL', 'YDS']
['Raiders', '334', '98', '198', '38', '83', '218', '38.1', '6', '13', '46.2', '147', '1251']
['Opponents', '318', '94', '188', '36', '78', '198', '39.4', '4', '15', '26.7', '115', '1051']
['DEFENSE']                                                                     
['TACKLES', 'SACKS', 'INTERCEPTIONS', 'FUMBLES']                                
['TEAM', 'SOLO', 'AST', 'TOT', 'SACK', 'YDSL', 'TLOSS', 'PD', 'INT', 'YDS', 'LONG', 'TD', 'FF', 'REC', 'TD', 'BK']
['Raiders', '747', '206', '953', '25.0', '147', '40', '70', '16', '147', '40', '1', '21', '14', '0', '2']
['Opponents', '813', '207', '1020', '18.0', '86', '45', '60', '7', '84', '45', '0', '10', '7', '0', '1']
['RETURNING']                                                                   
['KICKOFFS', 'PUNTS']                                                           
['TEAM', 'ATT', 'YDS', 'AVG', 'LONG', 'TD', 'RET', 'RETY', 'AVG', 'LONG', 'TD', 'FC']
['Raiders', '26', '534', '20.5', '50', '0', '41', '380', '9.3', '47', '0', '13']
['Opponents', '45', '896', '19.9', '60', '0', '33', '405', '12.3', '78', '1', '15']
['KICKING']                                                                     
['FIELD GOALS', 'EXTRA POINTS']                                                 
['TEAM', 'FGM', 'FGA', 'PCT', 'LONG', '1-19', '20-29', '30-39', '40-49', '50+', 'XPM', 'XPA', 'PCT']
['Raiders', '29', '35', '82.9', '56', '1-1', '9-9', '6-6', '10-11', '3-8', '37', '39', '94.9']
['Opponents', '22', '26', '84.6', '55', '0-0', '9-9', '5-6', '7-8', '1-3', '35', '39', '89.7']
['PUNTING']                                                                     
['TEAM', 'PUNTS', 'YDS', 'LONG', 'AVG', 'NET', 'BP', 'IN20', 'TB', 'FC', 'RET', 'RETY', 'AVG']
['Raiders', '81', '3937', '72', '48.6', '43.6', '0', '34', '9', '15', '33', '405', '12.3']
['Opponents', '72', '3309', '69', '46.0', '40.7', '0', '20', '5', '13', '41', '380', '9.3']

1 Answer
Answer 1 · posted 2024-05-16 22:19:53

[...] or a general explanation of //tr and .//*/text()?

In plain English, //tr means:

select element nodes called `tr` anywhere in the document, regardless of the current
context

.//*/text() means:

.//*      select element nodes with any name, but only if they are a descendant of the
          current context node (".")
/text()   of those elements selected, select all their immediate child text nodes
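To make the two steps concrete, here is a minimal sketch of the same pattern on a tiny table. The HTML string is a made-up stand-in (the real page is not shown in the question), so the markup and its values are assumptions chosen only for illustration.

```python
import lxml.html

# Hypothetical two-row table; the real input document is not shown.
html = (
    "<html><body><table>"
    "<tr><th>TEAM</th><th>ATT</th></tr>"
    "<tr><td><a>Raiders</a></td><td>596</td></tr>"
    "</table></body></html>"
)

doc = lxml.html.fromstring(html)

# //tr: every tr element anywhere in the document
rows = doc.xpath('//tr')

# .//*/text(): for each row, the text nodes that are direct children
# of any element below that row (th, td, and the nested a)
data = [row.xpath('.//*/text()') for row in rows]

for line in data:
    print(line)
```

Note that 'Raiders' is found even though it sits one level deeper, inside the a element: .//* reaches every descendant element, and /text() then takes the text directly inside each of them.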

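A key difference between the two expressions is that the first one does not take the current context node into account. The second one does, because it starts with a dot (.).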
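The effect of the leading dot can be seen by running both forms from the same row element. This is only a sketch on an invented table; the variable names absolute and relative are mine, not from the question.

```python
import lxml.html

# Hypothetical table with two rows, used only for illustration.
html = (
    "<html><body><table>"
    "<tr><td>A</td><td>B</td></tr>"
    "<tr><td>C</td></tr>"
    "</table></body></html>"
)

doc = lxml.html.fromstring(html)
first_row = doc.xpath('//tr')[0]

# No leading dot: the search starts at the document root,
# even though xpath() is called on first_row.
absolute = first_row.xpath('//td/text()')

# Leading dot: the search is limited to descendants of first_row.
relative = first_row.xpath('.//td/text()')

print(absolute)
print(relative)
```

This is why the list comprehension in the question uses .//*/text(): without the dot, every row would return the text of the whole document.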

I am not sure what exactly confuses you about these expressions. Once you explain a bit more, I will be happy to expand my answer.

Also, if you would like someone to comment on whether your code makes sense, you need to show the input XML document.

Edit in response to comments:

The part that confuses/confused me is in the documentation from http://www.w3schools.com/xml/xpath_syntax.asp it states that //= Selects nodes in the document from the current node that match the selection no matter where they are. What is the selection that it is matching?

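w3schools is known for its inaccurate tutorials; avoid it if you can (by the way, it is not affiliated with the W3C). // on its own does not select anything; by itself it is not a valid expression.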

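Perhaps what they mean by "selection" is whatever comes after //. The correct name for this "selection" is a node test. For example, the following expressions are valid: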

//*
  ^ <  "selection"

which selects element nodes anywhere in the document, or

//@*
  ^^ <  "selection"

which selects attribute nodes anywhere in the document.
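A short sketch of these points in lxml, again on invented markup: //* returns elements, //@* returns attribute values, and a bare // is rejected. The flag name bare_is_valid is hypothetical, and the code catches the general lxml.etree.XPathError base class rather than assuming a specific subclass.

```python
import lxml.html
from lxml import etree

# Hypothetical markup with two attributes, used only for illustration.
html = (
    '<html><body><table id="stats">'
    '<tr class="head"><td>TEAM</td></tr>'
    '</table></body></html>'
)

doc = lxml.html.fromstring(html)

# //* : element nodes anywhere in the document
tags = [el.tag for el in doc.xpath('//*')]

# //@* : attribute nodes anywhere in the document (lxml returns their values)
attrs = doc.xpath('//@*')

# // with no node test after it is not a complete expression
try:
    doc.xpath('//')
    bare_is_valid = True
except etree.XPathError:
    bare_is_valid = False

print(tags)
print(attrs)
print(bare_is_valid)
```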

Is it what proceeds it (*) or what precedes it (.).

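The // axis (the so-called descendant-or-self:: axis; strictly, // is shorthand for /descendant-or-self::node()/) has nothing to do with what precedes or follows it, nor with * or the dot. * simply selects element nodes, and . denotes the current context node.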

So . would be the tr nodes and //* (with asterisk) would be ANY nodes WITHIN the tr nodes?

In this case, yes, exactly. The dot (.) is the tr element node, and //* selects all element nodes inside this tr element, not just its immediate children.
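The difference between immediate children and all descendants can be checked by comparing ./* with .//* on one row. The row below is invented for illustration, and the names children and descendants are mine.

```python
import lxml.html

# Hypothetical row whose first cell wraps its text in a nested element.
html = (
    "<html><body><table>"
    "<tr><td><a>Raiders</a></td><td>596</td></tr>"
    "</table></body></html>"
)

doc = lxml.html.fromstring(html)
row = doc.xpath('//tr')[0]

# ./* : only the element children of the row
children = [el.tag for el in row.xpath('./*')]

# .//* : every element below the row, at any depth
descendants = [el.tag for el in row.xpath('.//*')]

print(children)
print(descendants)
```

With ./*/text() the 'Raiders' text would be missed, since it belongs to the nested a element rather than to the td itself.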
