How do I refactor 159 lines of Python code to fix the deprecated list '.append'?

-2 votes
1 answer
74 views
Asked 2025-04-14 15:24

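We are preparing a quote for a client who wants to migrate their data to Hubspot, and we are currently working through the data modelling and database questions so that we can plan properly.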

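While planning the migration we looked at the Hubspot data, and a team member found a piece of code that is helpful in several ways. However, the code uses the (EDIT: list) .append method, which means we need to change how it is written to fit the structure of the data.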

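I originally thought the "'DataFrame' object has no attribute 'append'" error came up because the object was a pandas one. Sorry for the confusion; I assumed all data frames were pandas data frames.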

The code is fairly long and I have a few questions. The full code can be found here: Hubspot Community Data

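  1. All questions are welcome. I hope this is not a silly question; I am confused about how to solve this.
  2. How would you split these 152 lines of code into smaller parts? Ideally each function does only 1 or 2 things and no more, which is what I am aiming for.
  3. How can I refactor or adjust the code below so that the data dictionary works, now that .append is no longer available? Since _append may not be the most efficient option, I am not sure where to start.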

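Edit 1: OK, I am updating the shared code. The error message "return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?" comes from line 90, which is in the first for loop of the generate_data function, right after the data dictionary.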

        company_industry = faker.random_element(
            ["Technology", "Healthcare", "Finance", "Real Estate"]
        )

Hubspot Data Faker

# This code has been created by Michael Kupermann (michael.kupermann@codersunlimited.com or michael@kupermann.com)
# The purpose of this code is to generate dummy data that simulates a realistic dataset for HubSpot CRM.
# This data can then be used for demonstrations, testing, or other purposes that require a representative dataset.
# You need to amend the HubSpot Sales and Service Pipelines before you import the data.
#
# Required Packages:
# 1. Faker: This package is used to generate the fake data for our dataset.
# 2. Pandas: This package is used to handle the data in a tabular format and to write the data to an Excel file.
# 3. datetime: This standard-library module is used to generate realistic date data for the 'close_date' field.
#
# To install the third-party packages, you can use pip, the Python package installer.
# Open your terminal (or command prompt on Windows), and enter the following commands:
# pip install faker
# pip install pandas
# (datetime ships with Python and does not need to be installed.)
#
# If you're using a Jupyter notebook, you can prefix these commands with an exclamation mark:
# !pip install faker
# !pip install pandas

from faker import Faker
import pandas as pd
from datetime import datetime, timedelta


#  Function to generate data for a given country. Here 100 companies with 10 contacts, deals, tickets for each company
def generate_data(
    country,
    company_rows=100,
    contacts_per_company=10,
    deals_per_company=10,
    products_per_deal=10,
):
    # Set the locale for Faker based on the country
    if country == "Germany":
        faker = Faker("de_DE")
    elif country == "United States":
        faker = Faker("en_US")
    elif country == "France":
        faker = Faker("fr_FR")
    elif country == "Italy":
        faker = Faker("it_IT")
    elif country == "Japan":
        faker = Faker("ja_JP")
    elif country == "United Kingdom":
        faker = Faker("en_GB")
    elif country == "Canada":
        faker = Faker("en_CA")
    elif country == "Austria":
        faker = Faker("de_AT")
    elif country == "Switzerland":
        faker = Faker("de_CH")

    # Create a dictionary to hold the data
    data = {
        "company_name": [],
        "company_domain": [],
        "company_industry": [],
        "company_address": [],
        "company_country": [],
        "contact_firstname": [],
        "contact_lastname": [],
        "contact_email": [],
        "contact_phone": [],
        "contact_address": [],
        "contact_country": [],
        "contact_function": [],
        "contact_department": [],
        "deal_name": [],
        "deal_stage": [],
        "deal_amount": [],
        "deal_type": [],
        "deal_source": [],
        "close_date": [],
        "ticket_title": [],
        "ticket_status": [],
        "ticket_priority": [],
        "product_name": [],
        "product_price": [],
        "product_description": [],
        "product_sku": [],
        "product_quantity": [],
    }

    # Loop to generate data for each company
    for _ in range(company_rows):
        company_name = faker.company()
        company_domain = faker.domain_name()
        company_industry = faker.random_element(
            ["Technology", "Healthcare", "Finance", "Real Estate"]
        )
        company_address = faker.address().replace("\n", ", ")
        company_country = country

        # Loop to generate data for each contact
        for _ in range(contacts_per_company):
            contact_firstname = faker.first_name()
            contact_lastname = faker.last_name()
            contact_email = faker.email()
            contact_phone = faker.phone_number()
            contact_address = faker.address().replace("\n", ", ")
            contact_country = country
            contact_function = faker.job()
            contact_department = faker.random_element(
                ["Sales", "Marketing", "Human Resources", "Engineering"]
            )

            # Append generated company and contact data to the lists in the dictionary
            data["company_name"].append(company_name)
            data["company_domain"].append(company_domain)
            data["company_industry"].append(company_industry)
            data["company_address"].append(company_address)
            data["company_country"].append(company_country)

            data["contact_firstname"].append(contact_firstname)
            data["contact_lastname"].append(contact_lastname)
            data["contact_email"].append(contact_email)
            data["contact_phone"].append(contact_phone)
            data["contact_address"].append(contact_address)
            data["contact_country"].append(contact_country)
            data["contact_function"].append(contact_function)
            data["contact_department"].append(contact_department)

            # Generate deal and product data
            data["deal_name"].append(f"Deal-{faker.uuid4()}")
            data["deal_stage"].append(
                faker.random_element(
                    [
                        "Appointment Scheduled",
                        "Qualified To Buy",
                        "Presentation Scheduled",
                        "Decision Maker Brought-In",
                    ]
                )
            )
            data["deal_amount"].append(faker.random_int(min=1000, max=50000))
            data["deal_type"].append(
                faker.random_element(["New Business", "Existing Business"])
            )
            data["deal_source"].append(
                faker.random_element(
                    ["Direct Traffic", "Organic Search", "Paid Search", "Social Media"]
                )
            )
            data["close_date"].append(
                (
                    datetime.today() + timedelta(days=faker.random_int(min=1, max=90))
                ).date()
            )

            # Generate product data
            data["product_name"].append(f"Product-{faker.uuid4()}")
            data["product_price"].append(faker.random_int(min=10, max=1000))
            data["product_description"].append(faker.catch_phrase())
            data["product_sku"].append(faker.random_int(min=10000, max=99999))
            data["product_quantity"].append(faker.random_int(min=1, max=100))

            # Generate ticket data
            data["ticket_title"].append(f"Ticket-{faker.uuid4()}")
            data["ticket_status"].append(
                faker.random_element(
                    ["New", "Waiting on contact", "Waiting on us", "Closed"]
                )
            )
            data["ticket_priority"].append(
                faker.random_element(["Low", "Medium", "High"])
            )

    # Convert the data dictionary to a pandas DataFrame
    df = pd.DataFrame(data)
    return df


# Define the list of countries for which we want to generate data
g7_countries = [
    "Canada",
    "France",
    "Germany",
    "Italy",
    "Japan",
    "United Kingdom",
    "United States",
    "Austria",
    "Switzerland",
]

# Create an empty DataFrame to hold the generated data
result = pd.DataFrame()
for country in g7_countries:
    df = generate_data(country)
    # Append the data for each country to the result DataFrame
    result = result.append(df)

# Write the generated data to an Excel file
result.to_excel(r"C:\~\~\~\hubspot_dummy_data.xlsx", index=False)

1 Answer

1

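Regarding the question in your title: the deprecated method in question is pandas.DataFrame.append. It was deprecated in pandas 1.4 and removed in pandas 2.0, which is why you now get an AttributeError.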

As mentioned in the comments, the append used in the first part of the code you shared is list.append. That method works fine, is fast, and is not deprecated.
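As a minimal sketch of that pattern (the column names and values here are made up for illustration and are much smaller than the dictionary in the question), appending to plain lists and building the DataFrame once at the end runs without any warning:

```python
import pandas as pd

# Plain Python lists inside a dict, as in generate_data().
data = {"company_name": [], "deal_amount": []}

for i in range(3):
    # list.append mutates the list in place; it is unrelated to the
    # removed pandas.DataFrame.append.
    data["company_name"].append(f"Company-{i}")
    data["deal_amount"].append(1000 * (i + 1))

# Convert to a DataFrame only once, after all rows have been collected.
df = pd.DataFrame(data)
print(df)
```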

The problem is in the later part, which uses pandas.DataFrame.append

# Create an empty DataFrame to hold the generated data
result = pd.DataFrame()
for country in g7_countries:
    df = generate_data(country)
    # Append the data for each country to the result DataFrame
    result = result.append(df)

To avoid the AttributeError, you can use concat instead

result_list = []

for country in g7_countries:
    df = generate_data(country)
    result_list.append(df)

result = pd.concat(result_list)
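If you want to try the pattern without installing Faker, here is a self-contained sketch. The generate_data below is a hypothetical stand-in that only returns a tiny DataFrame, and ignore_index=True is an optional extra that is not part of the snippet above: it renumbers the rows, whereas the old result.append(df) kept each frame's own index. Since the file is written with index=False, either choice produces the same Excel output.

```python
import pandas as pd


def generate_data(country, rows=3):
    # Hypothetical stand-in for the question's generate_data(): it returns a
    # small DataFrame so the example runs without Faker installed.
    return pd.DataFrame(
        {
            "company_country": [country] * rows,
            "deal_amount": list(range(rows)),
        }
    )


countries = ["Canada", "France", "Germany"]

# Build one DataFrame per country, then concatenate once at the end.
frames = [generate_data(country) for country in countries]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0..2
# for every country.
result = pd.concat(frames, ignore_index=True)

print(result.shape)
```

Collecting the frames in a list and calling pd.concat once also avoids copying the growing result on every loop iteration, which is what repeated appending did.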

This code generates some fake data that you probably only want for testing. I don't think there is any need to optimize it.
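If you do decide to split generate_data up anyway (your question 2), one low-effort first step might be to pull the locale selection out of the if/elif chain into a lookup table. This is only a sketch: FAKER_LOCALES and locale_for are hypothetical names that are not in the original script, and the mapping is copied from the question.

```python
# Country-to-locale table taken from the if/elif chain in the question.
FAKER_LOCALES = {
    "Germany": "de_DE",
    "United States": "en_US",
    "France": "fr_FR",
    "Italy": "it_IT",
    "Japan": "ja_JP",
    "United Kingdom": "en_GB",
    "Canada": "en_CA",
    "Austria": "de_AT",
    "Switzerland": "de_CH",
}


def locale_for(country):
    """Return the Faker locale string for a country, or raise a clear error."""
    try:
        return FAKER_LOCALES[country]
    except KeyError:
        # The original chain would leave `faker` undefined for an unknown
        # country; failing early makes that case easier to diagnose.
        raise ValueError(f"No Faker locale configured for {country!r}") from None
```

Inside generate_data you would then write faker = Faker(locale_for(country)) in place of the nine-branch chain.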
