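The following isn't necessarily a question. I wrote some code that extracts data from a web page, and I'd like to know what you think of it and how it could be improved.
I need to know the dates of the PhD interviews. They don't e-mail us; they will publish the dates here. I noticed that the two PhD positions I'm interested in are both inside HTML comments. Both start with the string URBAN.
I created a regular expression to find all the comments: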
regex = r"<!--(.*?)-->"
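and a for loop that checks whether the word URBAN appears inside those comments. If the string is no longer in a comment, that hopefully means they have published the dates.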
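To illustrate the idea on a small inline sample, here is a minimal sketch of the check in isolation. It assumes Python 3; the sample HTML string and the helper name `still_hidden` are made up for this illustration and are not taken from the real page.

```python
import re

# Same pattern as above: lazily capture the body of each HTML comment.
# re.DOTALL lets "." match newlines, so multi-line comments are captured too.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """<ul>
<li>ARCHITECTURE</li>
<!--
<li>URBAN AND REGIONAL DEVELOPMEN</li>
<li>URBAN AND REGIONAL DEVELOPMEN - Cluster</li>
-->
<!-- unrelated note -->
</ul>"""


def still_hidden(html, keyword="URBAN", minimum=2):
    """Return True if any single comment contains keyword at least minimum times."""
    return any(body.count(keyword) >= minimum for body in COMMENT_RE.findall(html))


print(still_hidden(SAMPLE_HTML))
print(still_hidden("<ul><li>URBAN AND REGIONAL DEVELOPMEN</li></ul>"))
```

The first call sees a comment with two occurrences of URBAN; the second string has no comments at all, which is the state I'm waiting for.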
Here is my code:
import requests, re, time, smtplib

url = "http://dottorato.polito.it/Esami_accesso.html"
DEBUG = False
foundInComment = True

# .   matches anything but \n
# *   0 or more occurrences of the pattern to its left
# ()  groups
# ?   for non-greedy
regex = r"<!--(.*?)-->"

while foundInComment:
    try:
        r = requests.get(url)
        html = r.text
        result = re.findall(regex, html, re.DOTALL)  # re.DOTALL makes . match also \n
        for match in result:
            # One of the comments has to have at least two URBAN
            if len(re.findall("URBAN", match)) > 1:
                foundInComment = True
                print("\"URBAN AND REGIONAL DEVELOPMEN\" found more than once in a comment at "
                      + time.strftime("%H:%M:%S"))
                break
            foundInComment = False
        time.sleep(600)
    except KeyboardInterrupt:
        raise
    except Exception as e:
        print(e)
        print("Going to sleep for 1 min")
        time.sleep(60)

if not DEBUG:
    fromaddr = 'someMail@gmail.com'
    toaddrs = ['otherMail@gmail.com', fromaddr]
    msg = 'Subject: PHD polito\n\n Go to %s' % url
    # Credentials
    username = 'someone'
    password = 'password'
    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()
    server.login(username, password)
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

print("End of program")
What do you think?
Thanks in advance!
PS: This is part of the HTML comment that contains the word URBAN:
<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN</a></li>
<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN - Cluster Tecnologie per le Smart Communities - Progetto Edifici a Zero Consumo Energetico in Distretti Urbani Intelligenti</a></li>
-->
I'm almost certain they will copy this and paste it into the page outside the comment.
Another approach (more reliable, I think) is to use a specialized tool: an HTML parser. For example, a parser can print every comment that contains the word URBAN.
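The library and code from this answer did not survive on this page, so what follows is only a minimal sketch of the parser idea. It assumes Python 3 and swaps in the standard-library `html.parser` module. `CommentCollector`, `comments_containing`, and `SAMPLE_HTML` are illustrative names, not part of the original answer.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment seen while parsing."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # HTMLParser calls this with the text between <!-- and -->.
        self.comments.append(data)


def comments_containing(html, keyword="URBAN"):
    """Return the comments in html whose text contains keyword."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return [c for c in parser.comments if keyword in c]


# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = (
    "<ul>\n"
    "<!-- <li>URBAN AND REGIONAL DEVELOPMEN</li> -->\n"
    "<!-- unrelated note -->\n"
    "<li>ARCHITECTURE</li>\n"
    "</ul>"
)

for comment in comments_containing(SAMPLE_HTML):
    print(comment.strip())
```

In the polling script, the downloaded `r.text` would be passed to `comments_containing` in place of the sample string, and an empty result would suggest the entries have moved out of the comments.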