Python&beautifulsoup4循环返回重复结果

2024-05-13 18:52:45 发布

您现在位置:Python中文网/ 问答频道 /正文

我想从6岁的时候开始pm.com网站我遇到了一个问题-我的循环似乎返回了重复的结果,例如,它不断重复同一个产品多次,而不同的产品只应出现一次。你知道吗

这是我的密码:

url_list1 = ['https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=1',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=2',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=3',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=4',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=5',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=6',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=7',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=8',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=9',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=10',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=11',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=12',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=13',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=14',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=15',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=16',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=17',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=18',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=19',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=20',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=21',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=22',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=23',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=24',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=25',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=26',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=27',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=28',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=29',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=30',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=31',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=32',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=33',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=34',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=35',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=36',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=37',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=38',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=39',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=40',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=41',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=42',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=43',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=44',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=45'
]


url_list2 = []

for url1 in url_list1:
    data1 = requests.get(url1)
    soup1 = BeautifulSoup(data1.text, 'html.parser')


    productUrls = soup1.findAll('article')


    for url2 in productUrls:
        get_urls = "https://www.6pm.com"+url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)

print(url_list2)

所以第一部分(url\u list1)基本上是一个链接列表。每个链接指向一个页面,其中包含100种选定品牌的产品。当我单击每个链接并在浏览器中打开时,每个页面都包含不同的产品,并且没有重复(我知道)。你知道吗

接下来,我初始化一个空列表(url\u list2),在这里我试图存储所有实际的产品url(因此这个列表应该包含46页*100个产品=大约4600个产品url)。你知道吗

第一个“for”循环遍历url\u list1中的每个链接。productUrls变量是一个列表,它应该存储46页中每个页上的所有“article”元素。你知道吗

第二个嵌套的“for”循环遍历producturl列表并构造实际的产品URL。然后应该将构造的产品URL附加到我之前初始化的空列表URL\u list2中。你知道吗

用print语句测试结果,我注意到产品是重复的而不是不同的。你知道吗

如果在我的浏览器url\u list1中手动打开每个url,我可以在每个页面上看到不同的产品,并且没有发现任何重复的产品,为什么会发生这种情况?你知道吗

非常感谢您的帮助。你知道吗


Tags: httpscomurl产品filtersckpmmen
2条回答

所发生的是,您在浏览器中看到的页面与请求得到的页面不同。要解决这个问题,您必须使会话(请求)保持活动状态。你知道吗

试试这个,对我有用。将大for循环替换为:

with requests.Session() as s:    # < - here we create a session that stays alive
        for url1 in url_list1:
            data1 = s.get(url1)  # < - here we call the links with the same session
            soup1 = BeautifulSoup(data1.text, 'html.parser')

            productUrls = soup1.findAll('article')

            for url2 in productUrls:
                get_urls = "https://www.6pm.com"+url2.find('a', attrs={'itemprop': 'url'})['href']
                url_list2.append(get_urls)

祝你好运!你知道吗

你可以做得更好场景。你呢不需要把所有的urls都收进去请列出清单尝试下面的代码,这是一个简单的方法,你可以实现的结果。你知道吗

from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso"

url_list2 = []
page_num = 1
session = requests.Session()
while page_num <47:
    pageTree = session.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    productUrls = pageSoup.findAll('article')
    for url2 in productUrls:
        get_urls = "https://www.6pm.com"+url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)

    page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?p={}".format(page_num)
    page_num +=1

print(url_list2)
print(len(url_list2))

如果有帮助请告诉我。你知道吗

相关问题 更多 >