Removing duplicate entries from a JSON file - BeautifulSoup

Published 2024-04-19 21:57:33


I'm running a script that scrapes a website for textbook information, and the script works. However, when it writes to the JSON file, it gives me duplicate results. I'm trying to figure out how to remove the duplicates from the JSON file. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/', 
'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.parent.a.text
        item['author'] = container.nextSibling.findNextSibling(text=True)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
        item['source'] = "BC Campus"
        data.append(item) # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:


Tags: https, import, json, data, container, html, page, open
3 Answers

You don't need to remove duplicates of any kind.

The only thing needed is to update your code.

Please keep reading. I have provided a detailed description of this problem below. Also, don't forget to check this gist https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c, which I wrote to debug your code.

» Where is the problem?

I know you are asking this because you are getting duplicate dictionaries.

This is because you selected the h4 elements as containers, and on the given pages https://open.bccampus.ca/find-open-textbooks/ and https://open.bccampus.ca/find-open-textbooks/?start=10, each book's details contain 2 h4 elements.

That's why, instead of getting a list of 20 items (10 per page) as the containers list, you get twice as many: a list of 40 items, where each item is an h4 element.

For those 40 items you may get different values, but the problem lies in selecting the parent: it gives the same element for both h4s of a book, so the text is also the same.

Let's clarify the problem using the assumed pseudo-HTML below.

Note: You can also visit and check https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c, as it contains the Python code I wrote to debug and solve this problem. It may give you some ideas.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>
<li> <!-- 2nd book -->
    <h4>
        <a> Text 3 </a>
    </h4>
    <h4>
        <a> Text 4 </a>
    </h4>
</li>
...
...
<li> <!-- 20th book -->
    <h4>
        <a> Text 39 </a>
    </h4>
    <h4>
        <a> Text 40 </a>
    </h4>
</li>

»» containers = page_soup.find_all("h4") will give the list of h4 elements below.

[
    <h4> <a> Text 1 </a> </h4>,
    <h4> <a> Text 2 </a> </h4>,
    ...
    <h4> <a> Text 40 </a> </h4>
]

»» For your code, the first iteration of the inner for-loop will refer to the element below as the container variable.

<h4>
    <a> Text 1 </a>
</h4>

»» The second iteration will refer to the element below as the container variable.

<h4>
    <a> Text 2 </a>
</h4>

»» In both of the above inner for-loop iterations, container.parent will give the element below.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>

»» And container.parent.a will give the element below.

<a> Text 1 </a>

»» Finally, container.parent.a.text gives the text below as the title for both of the first two books.

Text 1

That's why we get duplicate dictionaries: the dynamically fetched title & author are the same in both iterations.
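The duplication can be reproduced in a few lines of BeautifulSoup against the pseudo-HTML above (a minimal sketch; the markup is the assumed structure, not the real page):

```python
from bs4 import BeautifulSoup

# Assumed structure: each <li> (one book) holds two <h4> elements.
html = """
<li><h4><a href="?uuid=1"> Text 1 </a></h4><h4><a> Text 2 </a></h4></li>
<li><h4><a href="?uuid=2"> Text 3 </a></h4><h4><a> Text 4 </a></h4></li>
"""

page_soup = BeautifulSoup(html, "html.parser")
containers = page_soup.find_all("h4")  # 4 h4 tags here, 2 per book

# Both h4s of a book share the same parent <li>, and .a picks the
# first <a> inside that parent, so every title shows up twice.
titles = [h4.parent.a.text.strip() for h4 in containers]
print(titles)  # ['Text 1', 'Text 1', 'Text 3', 'Text 3']
```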

Let's solve this problem step by step.

» Web page details:

  1. We have the links of 2 web pages.

  2. Each web page has the details of 10 textbooks.

  3. Each book's details contain 2 h4 elements.

  4. In total: 2 pages x 10 books x 2 = 40 h4 elements.

» Our goal:

  1. Our goal is to get an array/list of only 20 dictionaries, not 40.

  2. So we need to iterate over the containers list 2 items at a time, i.e. skipping one item in each iteration.
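The skip-one-of-two iteration can be sketched with plain lists standing in for the h4 tags (hypothetical values, just to show the stepping):

```python
# Pretend these are the 6 h4 tags of 3 books: two per book.
containers = ["book1-h4a", "book1-h4b",
              "book2-h4a", "book2-h4b",
              "book3-h4a", "book3-h4b"]

# range(start, stop, step) visits indices 0, 2, 4, ...
by_range = [containers[i] for i in range(0, len(containers), 2)]

# ...which is equivalent to the extended slice containers[0::2].
by_slice = containers[0::2]

print(by_range)              # ['book1-h4a', 'book2-h4a', 'book3-h4a']
print(by_range == by_slice)  # True
```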

» Modified working code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
  'https://open.bccampus.ca/find-open-textbooks/', 
  'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}
        item['type'] = "Textbook"
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['title'] = containers[index].parent.a.text
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True)

        data.append(item) # add the item to the list

with open("./json/bc-modified-final.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

» Output:

[
    {
        "type": "Textbook",
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "authors": " Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Exploring Movie Construction and Production",
        "authors": " John Reich, SUNY Genesee Community College",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Project Management",
        "authors": " Adrienne Watt",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    ...
    ...
    ...
    {
        "type": "Textbook",
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "authors": " Michelle Bonczek Evory. Western Michigan University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus"
    }
]

Finally, I tried modifying your code and adding more details (description, date & categories) to the dictionary objects.

Python version: 3.6

Dependency: pip install beautifulsoup4

» Modified working code (enhanced version):

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
    'https://open.bccampus.ca/find-open-textbooks/', 
    'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}

        # Store book's information as per given the web page (all 5 are dynamic)
        item['title'] = containers[index].parent.a.text
        item["catagories"] = [a_tag.text for a_tag in containers[index + 1].find_all('a')]
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True).strip()
        item['date'] = containers[index].parent.find_all("strong")[1].findNextSibling(text=True).strip()
        item["description"] = containers[index].parent.p.text.strip()

        # Store extra information (1st is dynamic, last 2 are static)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['type'] = "Textbook"

        data.append(item) # add the item to the list

with open("./json/bc-modified-final-my-own-version.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

» Output (enhanced version):

[
    {
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "catagories": [
            "Ancillary Resources"
        ],
        "authors": "Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "date": "May 3, 2018",
        "description": "Description: The purpose of this textbook is to help learners develop best practices in vital sign measurement. Using a multi-media approach, it will provide opportunities to read about, observe, practice, and test vital sign measurement.",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    {
        "title": "Exploring Movie Construction and Production",
        "catagories": [
            "Adopted"
        ],
        "authors": "John Reich, SUNY Genesee Community College",
        "date": "May 2, 2018",
        "description": "Description: Exploring Movie Construction and Production contains eight chapters of the major areas of film construction and production. The discussion covers theme, genre, narrative structure, character portrayal, story, plot, directing style, cinematography, and editing. Important terminology is defined and types of analysis are discussed and demonstrated. An extended example of how a movie description reflects the setting, narrative structure, or directing style is used throughout the book to illustrate ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    ...
    ...
    ...
    {
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "catagories": [],
        "authors": "Michelle Bonczek Evory. Western Michigan University",
        "date": "Apr 27, 2018",
        "description": "Description: Informed by a writing philosophy that values both spontaneity and discipline, Michelle Bonczek Evory’s Naming the Unnameable: An Approach to Poetry for New Generations  offers practical advice and strategies for developing a writing process that is centered on play and supported by an understanding of America’s rich literary traditions. With consideration to the psychology of invention, Bonczek Evory provides students with exercises aimed to make writing in its early stages a form of play that ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    }
]

That's it. Thanks.

We'd better use a set data structure instead of a list. A set does not preserve insertion order, but unlike a list it does not store duplicates.

Change your code from

data = []

to

data = set()

and

data.append(item)

to

data.add(item)
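One caveat worth adding (my note, not part of the original answer): the items here are dictionaries, which are not hashable, so data.add(item) raises a TypeError. A hashable snapshot such as frozenset(item.items()) works instead, as long as all the values are themselves hashable:

```python
data = set()
items = [
    {"type": "Textbook", "title": "Project Management"},
    {"type": "Textbook", "title": "Project Management"},  # duplicate
]

for item in items:
    # dicts are unhashable, so store a hashable view of each one
    data.add(frozenset(item.items()))

# Convert back to dicts before json.dump()
unique = [dict(fs) for fs in data]
print(len(unique))  # 1
```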

Got it. Here is the solution in case anyone else runs into this problem:

textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
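For larger lists, note that the `item not in textbook_list` check scans the whole list each time, which is O(n^2) overall. A linear-time variant (a sketch, with hypothetical sample data) keeps a set of canonical JSON strings as the "seen" keys; sort_keys makes the key order inside each dict irrelevant:

```python
import json

data = [
    {"title": "Project Management", "source": "BC Campus"},
    {"source": "BC Campus", "title": "Project Management"},  # same book, different key order
    {"title": "Exploring Movie Construction and Production", "source": "BC Campus"},
]

seen = set()
textbook_list = []
for item in data:
    key = json.dumps(item, sort_keys=True)  # canonical, hashable form
    if key not in seen:
        seen.add(key)
        textbook_list.append(item)

print(len(textbook_list))  # 2
```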
