Getting data from a <td> element with Python

Posted 2024-05-29 11:47:55

I'm writing an agent for Plex, and I'm scraping the HTML table below. I'm fairly new to Python and to web scraping in general.

I'm trying to get the value XXXXXXXXXX.

  1. Data
<table class="d">
    <tbody>
        <tr>
            <th class="ch">title</th>
            <th class="ch">released</th>
            <th class="ch">company</th>
            <th class="ch">type</th>
            <th class="ch">rating</th>
            <th class="ch">category</th>
        </tr>
        <tr>
            <td class="cd" valign="top">
              <a href="/V/6/58996.html">XXXXXXXXXX</a>
            </td>
            <td class="cd">2015</td>
            <td class="cd">My Films</td>
            <td class="cd">&nbsp;</td>
            <td class="cd">&nbsp;</td>
            <td class="cd">General Hardcore</td>
        </tr>
    </tbody>
</table>
  1. Code

Here is part of the code I'm using:

    myTable = HTML.ElementFromURL(searchQuery, sleep=REQUEST_DELAY).xpath('//table[contains(@class,"d")]/tr')
    self.log('SEARCH:: My Table: %s', myTable)

    # This logs the following
    #2019-12-26 00:26:49,329 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: My Table: [<Element tr at 0x5225c30>, <Element tr at 0x5225c00>]


    for myRow in myTable:
        siteTitle = title[0]
        self.log('SEARCH:: Site Title: %s', siteTitle)

        siteTitle = title[0].text_content().strip()
        self.log('SEARCH:: Site Title: %s', siteTitle)

        # This logs the following for <tr>/<th> - ROW 1
        # 2019-12-26 00:26:49,335 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title: <Element th at 0x5225180>
        # 2019-12-26 00:26:49,342 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title: title

        # This logs the following for <tr>/<td> - ROW 2
        # 2019-12-26 00:26:49,362 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title: <Element td at 0x52256f0>
        # 2019-12-26 00:26:49,369 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title:                              #### this is my issue... should be XXXXXXXXXX


        # I can get the href using the following code
        siteURL = myRow.xpath('.//td/a')[0].get('href')
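  1. Question

A. How do I get the value 'XXXXXXXXXX'? I tried using XPath, but it picks up data from another table on the same page. Is there a better way to get the href attribute?

  1. Other

The Python libraries I'm importing are: datetime, linecache, platform, os, re, string, sys, urllib

I can't use BeautifulSoup because this is an agent for Plex, so I assume anyone who wanted to use this agent would have to install BeautifulSoup themselves. That rules it out.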


1 answer
User
#1 · Posted 2024-05-29 11:47:55

How about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''<table class="d">
    <tbody>
        <tr>
            <th class="ch">title</th>
            <th class="ch">released</th>
            <th class="ch">company</th>
            <th class="ch">type</th>
            <th class="ch">rating</th>
            <th class="ch">category</th>
        </tr>
        <tr>
            <td class="cd" valign="top">
              <a href="/V/6/58996.html">XXXXXXXXXX</a>
            </td>
            <td class="cd">2015</td>
            <td class="cd">My Films</td>
            <td class="cd">&nbsp;</td>
            <td class="cd">&nbsp;</td>
            <td class="cd">General Hardcore</td>
        </tr>
    </tbody>
</table>'''
doc = SimplifiedDoc(html)
table = doc.getElement('table','d') # doc.getElement(tag='table',attr='class',value='d')
trs = table.trs.contains('<a ') # table.getElementsByTag('tr').contains('<a ')
for tr in trs:
  a = tr.a
  print (a) 
  print (a.text) # XXXXXXXXXX
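
If adding another package is not an option, the same result may be reachable with lxml alone. The `<Element tr at ...>` entries in the log and the `text_content()` call suggest that Plex's `HTML.ElementFromURL` returns lxml elements, but that is an assumption. The sketch below uses `lxml.html` directly. The `PAGE` string, the extra `sidebar` table, and the `extract_links` helper are made up for illustration. It matches the table with `@class="d"` exactly, because `contains(@class, "d")` also matches any other table whose class merely contains the letter "d", which may be why data from another table showed up.

```python
from lxml import html

# Hypothetical page: a second table whose class also contains the letter "d".
PAGE = """
<html><body>
<table class="sidebar">
  <tr><td><a href="/other.html">Unrelated link</a></td></tr>
</table>
<table class="d">
  <tr>
    <th class="ch">title</th>
    <th class="ch">released</th>
  </tr>
  <tr>
    <td class="cd" valign="top">
      <a href="/V/6/58996.html">XXXXXXXXXX</a>
    </td>
    <td class="cd">2015</td>
  </tr>
</table>
</body></html>
"""


def extract_links(page_source):
    """Return (text, href) pairs for the first-column links of table "d"."""
    root = html.fromstring(page_source)
    # Exact class match, and only rows that contain <td> cells,
    # so the <th> header row is skipped. "//tr" works with or without <tbody>.
    rows = root.xpath('//table[@class="d"]//tr[td]')
    results = []
    for row in rows:
        for link in row.xpath('./td[1]/a'):
            results.append((link.text_content().strip(), link.get('href')))
    return results


print(extract_links(PAGE))  # [('XXXXXXXXXX', '/V/6/58996.html')]
```

Inside the agent, the same two XPath expressions could be applied to the element returned by `HTML.ElementFromURL(...)` in place of `html.fromstring(...)`, assuming it behaves like a regular lxml element.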
