Scraping HTML comments from a web page with Python

Posted 2024-03-28 23:42:56


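This is not necessarily a question. I wrote a piece of code that extracts data from a web page, and I would like to know what you think of it and how it could be improved.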

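I need to know the date of my PhD interview. They don't email us. Instead, they will publish the dates on a web page (the URL in the code below). I noticed that the two PhD positions I'm interested in are currently inside an HTML comment. Both start with the string URBAN.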

I wrote a regular expression to find all the comments

regex = r"<!--(.*?)-->"

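and a for loop that checks whether the word URBAN appears inside any of those comments. If the string is no longer in a comment, it hopefully means they have published the dates.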
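The check described above can be tried offline on a small string before pointing it at the live page. This is only a minimal sketch: `SAMPLE_HTML` and the helper name `comments_containing` are made up for illustration, and the real page layout may differ.

```python
import re

# Hypothetical sample page; the layout of the real page is an assumption.
SAMPLE_HTML = """
<ul>
<li><a href="a.pdf">PHYSICS</a></li>
<!--
<li><a href="b.pdf">URBAN AND REGIONAL DEVELOPMEN</a></li>
<li><a href="b.pdf">URBAN AND REGIONAL DEVELOPMEN - Cluster</a></li>
-->
</ul>
<!-- unrelated note -->
"""

# re.DOTALL lets "." match newlines, so multi-line comments are captured.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def comments_containing(html, word):
    """Return the bodies of the HTML comments that contain `word`."""
    return [body for body in COMMENT_RE.findall(html) if word in body]


hits = comments_containing(SAMPLE_HTML, "URBAN")
print(len(hits))                # number of comments mentioning URBAN
print(hits[0].count("URBAN"))   # occurrences inside the first such comment
```

Using `str.count` on each comment body avoids a second `re.findall` call when only a plain substring is being counted.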

This is my code:

import re
import smtplib
import time

import requests

url = "http://dottorato.polito.it/Esami_accesso.html"

DEBUG = False

# .   matches any character except \n (re.DOTALL lifts that restriction)
# *   zero or more occurrences of the pattern to its left
# ()  capture group
# ?   after * makes the match non-greedy
regex = r"<!--(.*?)-->"

found_in_comment = True

while found_in_comment:
    try:
        r = requests.get(url, timeout=30)
        # Treat HTTP errors as failures so an error page is not mistaken for "published"
        r.raise_for_status()
        html = r.text

        # re.DOTALL makes . match \n as well, so multi-line comments are found
        comments = re.findall(regex, html, re.DOTALL)

        # Keep waiting while some comment contains at least two occurrences of URBAN
        found_in_comment = any(c.count("URBAN") > 1 for c in comments)

        if found_in_comment:
            print('"URBAN AND REGIONAL DEVELOPMEN" found more than once in a comment at '
                  + time.strftime("%H:%M:%S"))
            time.sleep(600)

    except KeyboardInterrupt:
        raise
    except Exception as e:
        print(e)
        print("Going to sleep for 1 min")
        time.sleep(60)

if not DEBUG:
    fromaddr = 'someMail@gmail.com'
    toaddrs = ['otherMail@gmail.com', fromaddr]

    msg = 'Subject: PHD polito\n\n Go to %s' % url

    # Credentials
    username = 'someone'
    password = 'password'

    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(username, password)
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

    print("End of program")

What do you think?

Thanks in advance!

PS: this is part of the HTML comment that contains the word URBAN:

<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN</a></li>
<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN - Cluster Tecnologie per le Smart Communities - Progetto Edifici a Zero Consumo Energetico in Distretti Urbani Intelligenti</a></li>
-->

I'm almost certain they will copy this and paste it outside the comment on the page.
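If that guess is right, a complementary check is to strip every comment from the page and look for the string in what remains. The following is only a sketch under that assumption; the function name `published_outside_comments` and the two sample strings are invented for illustration.

```python
import re

# Matches a whole HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def published_outside_comments(html, word="URBAN"):
    """Return True if `word` still appears once all comments are removed."""
    visible = COMMENT_RE.sub("", html)
    return word in visible


# Hypothetical before/after snapshots of the list item.
commented = "<ul><!-- <li>URBAN AND REGIONAL DEVELOPMEN</li> --></ul>"
published = "<ul><li>URBAN AND REGIONAL DEVELOPMEN</li></ul>"

print(published_outside_comments(commented))
print(published_outside_comments(published))
```

A positive check like this would not fire if the comment were simply deleted without the entry being published, which the "string is gone from the comments" test cannot distinguish.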


1 answer
User
Answer 1 · Posted 2024-03-28 23:42:56

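Another approach (and, I think, a more reliable one) is to use a specialized tool: an HTML parser. For example, using BeautifulSoup, print all the comments that contain the word URBAN: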

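import requests
from bs4 import BeautifulSoup, Comment

url = "http://dottorato.polito.it/Esami_accesso.html"
response = requests.get(url)

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.content, "html.parser")

# Comments are text nodes of type Comment; keep those that mention URBAN
print(soup.find_all(string=lambda text: isinstance(text, Comment) and 'URBAN' in text))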
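If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module. Note that this swaps BeautifulSoup for a different parser, and that `CommentCollector`, `find_comments` and the sample string are names made up for this illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the body of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called by HTMLParser with the text between "<!--" and "-->".
        self.comments.append(data)


def find_comments(html, word):
    """Return the comment bodies in `html` that contain `word`."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return [c for c in parser.comments if word in c]


# Hypothetical sample; the live page would come from requests.get(url).text
sample = "<ul><!-- <li>URBAN AND REGIONAL DEVELOPMEN</li> --><!-- other --></ul>"
matches = find_comments(sample, "URBAN")
print(matches)
```

The polling loop from the question could call `find_comments(r.text, "URBAN")` in place of the regular expression, keeping the rest of the logic unchanged.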
