Python - loop a script until a condition is met, using a different proxy address on each pass
I'm a complete novice who knows almost nothing about Python, and I'm looking for help. I can read some code and tweak variables to suit my needs, but as soon as I need to do something the original code doesn't already do, I'm completely lost.
Here's the situation: I found a script for flagging posts on Craigslist (CL). It was originally written to search every CL site and flag postings containing certain keywords (the script was built to flag every post mentioning "Scientology").
I changed it to search only the CL sites in my region (cutting the list from 437 sites down to 15), and it still searches for specific keywords, which I have changed. I want to automatically flag the people who keep spamming CL with ads, because I do business on CL and spend a lot of time sifting through postings.
I'd like the script to loop until it finds no more matching posts, switching to a different proxy server after each pass. I'd also like a place in the script where I can enter the proxy servers' IP addresses.
Looking forward to your replies.
Here is my modified code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
from twill.commands import * # gives us go()
areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt', 'mendocino', 'modesto', 'monterey', 'redding', 'reno', 'sacramento', 'siskiyou', 'stockton', 'yubasutter', 'reno']
def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]
    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        spam = 'https://post.craigslist.org/flag?flagCode=15&postingID='+num # url for flagging as spam
        go(spam) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in ['http://' + a + '.craigslist.org/' for a in areas]:
    ujam = area + 'search/?query=james+"916+821+0590"+&catAbb=hhh'
    udre = area + 'search/?query="DRE+%23+01902542+"&catAbb=hhh'
    try:
        jam = urllib.urlopen(ujam).read()
        dre = urllib.urlopen(udre).read()
    except:
        print 'tl;dr error for ' + area
    if 'Found: ' in jam:
        print 'Found results for "James 916 821 0590" in ' + area
        expunge(ujam, area)
        print 'All "James 916 821 0590" listings marked as spam for area'
    if 'Found: ' in dre:
        print 'Found results for "DRE # 01902542" in ' + area
        expunge(udre, area)
        print 'All "DRE # 01902542" listings marked as spam for area'
4 Answers
I made a few changes... not sure how well it works, but I didn't hit any errors. Let me know if you find anything wrong or missing. - Thanks!
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib, urllib2

# NOTE: each urllib2.install_opener() call replaces the previously installed
# global opener, so after the five calls below only the last proxy
# (198.154.114.118:3128) is actually used for urllib2 requests.
proxy = urllib2.ProxyHandler({'https': '108.60.219.136:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

proxy2 = urllib2.ProxyHandler({'https': '198.144.186.98:3128'})
opener2 = urllib2.build_opener(proxy2)
urllib2.install_opener(opener2)

proxy3 = urllib2.ProxyHandler({'https': '66.55.153.226:8080'})
opener3 = urllib2.build_opener(proxy3)
urllib2.install_opener(opener3)

proxy4 = urllib2.ProxyHandler({'https': '173.213.113.111:8080'})
opener4 = urllib2.build_opener(proxy4)
urllib2.install_opener(opener4)

proxy5 = urllib2.ProxyHandler({'https': '198.154.114.118:3128'})
opener5 = urllib2.build_opener(proxy5)
urllib2.install_opener(opener5)

areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt',
         'mendocino', 'modesto', 'monterey', 'redding', 'reno',
         'sacramento', 'siskiyou', 'stockton', 'yubasutter']
queries = ['james+"916+821+0590"','"DRE+%23+01902542"']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]
    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        # Opening the flag URL is what flags the posting; urllib2.urlopen()
        # returns a response object, not a URL, so don't pass its result to
        # twill's go() -- the urlopen() call alone does the flagging.
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=15&postingID=' + num)
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=28&postingID=' + num)
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=16&postingID=' + num)

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break
        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for {}'.format(query, area)
            print ''
            print ''
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            print ''
            print ''
            break
        else:
            break
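Since installing several openers in a row keeps only the last one, one way to actually rotate through all the proxies is to build one opener per proxy up front and pick a different one on each pass. A sketch (written so it also runs under Python 3's urllib.request; the proxy addresses are just the ones from this thread, and fetch() is a hypothetical helper):

```python
try:
    import urllib2 as request          # Python 2, as used in this thread
except ImportError:
    import urllib.request as request   # Python 3 fallback

# Proxy addresses from the answer above; swap in your own.
PROXIES = ['108.60.219.136:8080', '198.144.186.98:3128',
           '66.55.153.226:8080', '173.213.113.111:8080',
           '198.154.114.118:3128']

# One opener per proxy, built once. install_opener() would keep only the
# last one, so instead we call opener.open() explicitly per request.
openers = [request.build_opener(request.ProxyHandler({'https': p}))
           for p in PROXIES]

def fetch(url, attempt):
    """Fetch url through a proxy chosen by attempt number (wraps around)."""
    return openers[attempt % len(openers)].open(url)
```

Passing the pass counter as `attempt` then gives each full sweep its own proxy, wrapping back to the first proxy once the list is exhausted.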
You can create a loop that runs until a condition is met like this:

while True:
    if condition:
        break

itertools has some great tricks for looping; see http://docs.python.org/2/library/itertools.html

In particular, have a look at itertools.cycle.

(These are just pointers in the right direction. You could use one of them, or both together, to solve your problem.)
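Putting those two hints together, a minimal sketch of "loop until done, new proxy each pass" (the proxy list comes from this thread; the per-pass result counts are made up purely for illustration):

```python
import itertools

# itertools.cycle() hands the proxies out forever, wrapping around the list.
proxies = itertools.cycle(['108.60.219.136:8080', '198.144.186.98:3128'])

results_per_pass = [2, 1, 0]   # stand-in for "postings found" on each pass
used = []

while True:
    used.append(next(proxies))          # a different proxy every pass
    if results_per_pass.pop(0) == 0:    # condition met: nothing left to flag
        break
```

After three passes the loop stops, and `used` shows the proxies wrapping back around to the first one.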
I made some changes to your code. It looks like the function expunge already loops through all the results on a page, so I'm not sure what kind of loop you need. At the end there is an example of checking whether any results were found, but there is no loop to break out of there.
I don't know anything about changing the proxy or IP.
By the way, 'reno' appears twice in your list of areas.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
from twill.commands import go
areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt',
         'mendocino', 'modesto', 'monterey', 'redding', 'reno',
         'sacramento', 'siskiyou', 'stockton', 'yubasutter']
queries = ['james+"916+821+0590"','"DRE+%23+01902542"']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]
    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        spam = 'https://post.craigslist.org/flag?flagCode=15&postingID='+num # url for flagging as spam
        go(spam) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break
        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for area'.format(query)
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            break
        else:
            break
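If the goal is the behaviour the question asks for -- keep re-running the whole sweep until a full pass flags nothing -- the area/query loop above can be wrapped in an outer while loop. A sketch with the network calls stubbed out (fake_hits is a made-up stand-in for live search results, and flag_pass is a hypothetical helper):

```python
areas = ['sfbay', 'chico']
queries = ['james+"916+821+0590"', '"DRE+%23+01902542"']

# Stubbed search results: in the real script this would be the urllib
# fetch plus the 'Found: ' check from the code above.
fake_hits = {('sfbay', queries[0]): 2}

def flag_pass():
    """One full sweep over all areas/queries; returns postings flagged."""
    flagged = 0
    for area in areas:
        for query in queries:
            hits = fake_hits.get((area, query), 0)
            if hits:
                fake_hits[(area, query)] = hits - 1   # one posting flagged
                flagged += 1
    return flagged

passes = 0
while flag_pass():    # stop once a whole pass finds nothing to flag
    passes += 1
```

Each pass flags one of the two remaining postings, so the loop ends after the third sweep comes back empty.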