Extracting meta keywords in Python?

Posted 2024-06-16 09:39:32


I wrote some code that extracts keywords from websites. Some of the sites do have keyword meta information, but my program returns an empty list. How can I fix this? Any suggestions would be appreciated. The URLs are in the code below; of these 3 URLs, I only get keywords from one site:

Code

import sqlite3

import requests
from bs4 import BeautifulSoup

# cur/con are not shown in the original post; assumed to be a SQLite connection
con = sqlite3.connect('meta.db')
cur = con.cursor()

data = ['http://www.supermap.com', 'http://www.itc.com', 'http://www.astro.com']

i = 1  # i was never initialized in the original
for url in data:
    print(str(i) + " : " + url)
    i = i + 1
    try:
        html = requests.get(url, timeout=60)
        soup3 = BeautifulSoup(html.text, "html.parser")
        meta = soup3.find_all(attrs={"name": 'description'})
        meta1 = soup3.find_all(attrs={"name": 'keywords'})
        t = [link.get("content") for link in meta]
        t1 = [link.get("content") for link in meta1]
        cur.execute("insert into key_meta(url, descript, keywords) values (?, ?, ?)",
                    (url, str(t), str(t1)))
        con.commit()
    except Exception as e:  # the original try block had no except clause
        print("Failed to fetch {}: {}".format(url, e))

2 Answers

I would rewrite this to use CSS selectors that also require the content attribute to be present, and make sure the case of the name attribute's value is handled. Since keywords (and likewise description) may appear capitalized or lowercase, you need to account for that in the CSS selector, otherwise no match will be found. You can do this with the OR syntax (a comma) in the selector:

keywords = [item['content'] for item in soup.select('[name=Keywords][content], [name=keywords][content]')]
descriptions = [item['content'] for item in soup.select('[name=Description][content], [name=description][content]')]
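Alternatively (a sketch, not from the answer above): BeautifulSoup's find_all accepts a compiled regular expression as an attribute value, so a case-insensitive match covers Keywords, keywords, KEYWORDS, and so on without listing each variant. The HTML snippet here is made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page mixing capitalized and lowercase meta names
html = """<html><head>
<meta name="Keywords" content="maps, GIS">
<meta name="description" content="A mapping site">
</head></html>"""

soup = BeautifulSoup(html, "html.parser")

# re.I makes the name match case-insensitive
keywords = [m["content"] for m in
            soup.find_all("meta", attrs={"name": re.compile(r"^keywords$", re.I)})]
descriptions = [m["content"] for m in
                soup.find_all("meta", attrs={"name": re.compile(r"^description$", re.I)})]

print(keywords)      # ['maps, GIS']
print(descriptions)  # ['A mapping site']
```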

The code below collects the required data. It works for 2 of the 3 URLs.

import requests
from bs4 import BeautifulSoup

URLS = ['http://www.astro.com', 'http://www.supermap.com', 'http://www.itc.com']
ATTRIBUTES = ['description', 'keywords', 'Description', 'Keywords']

collected_data = []

for url in URLS:
    entry = {'url': url}
    try:
        r = requests.get(url)
    except Exception as e:
        print('Could not load page {}. Reason: {}'.format(url, str(e)))
        continue
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        meta_list = soup.find_all("meta")
        for meta in meta_list:
            if 'name' in meta.attrs:
                name = meta.attrs['name']
                if name in ATTRIBUTES:
                    entry[name.lower()] = meta.attrs['content']
        if len(entry) == 3:
            collected_data.append(entry)
        else:
            print('Could not find all required attributes for URL {}'.format(url))
    else:
        print('Could not load page {}. Reason: {}'.format(url, r.status_code))
print('Collected meta attributes (TODO - push to DB):')
for entry in collected_data:
    print(entry)
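To finish the TODO in the snippet above, each collected dict could be inserted into the key_meta table from the question. A minimal sqlite3 sketch, assuming the same three-column schema and using an in-memory database and a made-up sample entry for illustration:

```python
import sqlite3

# Hypothetical sample of the dicts built by the loop above
collected_data = [
    {'url': 'http://www.astro.com',
     'description': 'Astrology site',
     'keywords': 'horoscope, astrology'},
]

con = sqlite3.connect(':memory:')  # in-memory DB just for this sketch
cur = con.cursor()
cur.execute("create table if not exists key_meta(url text, descript text, keywords text)")

for entry in collected_data:
    cur.execute("insert into key_meta(url, descript, keywords) values (?, ?, ?)",
                (entry['url'], entry.get('description', ''), entry.get('keywords', '')))
con.commit()

rows = cur.execute("select url, keywords from key_meta").fetchall()
print(rows)  # [('http://www.astro.com', 'horoscope, astrology')]
```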

