Can't get certain elements from a variable

-1 votes
3 answers
93 views
Asked 2025-04-14 16:35

I'm scraping this website, https://www.immobilienscout24.de/Suche/de/berlin/berlin/haus-kaufen?enteredFrom=one_step_search, and from this variable:

square_meter_str = text.replace('m²', '').replace(',', '').strip()

I get these values:

232 1.264 8426 990 175 801 140 599 117 581 72 160 110 266 145 519 290 917 151 520 116 0 172 206 46 1.024 479 1.040 1.18904 424 29148 1.007 135 599 156 444

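This variable is supposed to hold the square meters inside the house, but it also picks up the square meters outside the house. I only want the square meters inside the house, such as 232, 8426, 175, 140 and so on (that is, every odd-numbered value). How can I do that, or can I change the code so that it only scrapes the square meters inside the house?

P.S. Here is the full code of the function that contains the variable: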

def extract_information(soup):
    information = soup.find_all('dd', class_='font-highlight font-tabular')
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]

    for link, info in zip(links, information):
        house_link = f"https://www.immobilienscout24.de{link}"

        # Extract and print only the prices and square meters
        text = info.getText().strip()
        try:
            if '€' in text:
                # Remove euro sign and convert to int
                price = int(text.replace('€', '').replace('.', '').replace(',', '').strip())
            elif 'm²' in text:
                # Remove square meter sign and convert to int
                square_meter_str = text.replace('m²', '').replace(',', '').strip()
                print(square_meter_str)

                # Handle cases where dot is used as a thousand separator
                if '.' in square_meter_str and square_meter_str.count('.') == 1:
                    # If there is only one dot, consider it as a thousand separator
                    square_meter_str = square_meter_str.replace('.', '')
                elif '.' in square_meter_str and square_meter_str.count('.') > 1:
                    # If there are multiple dots, keep only the last one as the decimal point
                    square_meter_str = square_meter_str.rsplit('.', 1)[0] + square_meter_str.rsplit('.', 1)[1].replace('.', '')

                # Convert to float, handle the case when it's zero
                square_meter = float(square_meter_str) if square_meter_str else 0.0

                # Check if square_meter is non-zero before division
                if square_meter != 0:
                    price_per_square_meter = price / square_meter

                    # Check if price_per_square_meter is within the specified range
                    if 2000 <= price_per_square_meter <= 3000:
                        # Append data to the list for saving to Excel
                        house_data['House link'].append(house_link)
                        house_data['Price per square meter [€]'].append(price_per_square_meter)
        except:
            print("Price or the living space information is missing.")

    return house_data

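I tried saving the values to a list and then looping over that list to take every second value, but it didn't work. I even tried asking ChatGPT, but that didn't help either.

3 Answers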

0

Update

I found a solution. I changed this part:

    information = soup.find_all('dd', class_='font-highlight font-tabular')
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]

    for link, info in zip(links, information):
        house_link = f"https://www.immobilienscout24.de{link}"

        # Extract and print only the prices and square meters
        text = info.getText().strip()

to this:

    containers = soup.find_all('div', class_='grid grid-flex grid-align-center grid-justify-space-between')

    # Iterate over containers
    for container in containers:
        # Find all dd elements within the container
        informations = container.find_all('dd')

        # Extract links from a elements within the container
        links = container.find_all('a')
        links_list = [f"https://www.immobilienscout24.de{link.get('href')}" for link in links]

        # Extract text from dd elements
        variable = [information.getText() for information in informations]

Here is the code for the whole function:

def extract_information_house(soup):
    # Find all containers
    containers = soup.find_all('div', class_='grid grid-flex grid-align-center grid-justify-space-between')

    # Iterate over containers
    for container in containers:
        # Find all dd elements within the container
        informations = container.find_all('dd')

        # Extract links from a elements within the container
        links = container.find_all('a')
        links_list = [f"https://www.immobilienscout24.de{link.get('href')}" for link in links]

        # Extract text from dd elements
        variable = [information.getText() for information in informations]

        # Extract price and square meter information from the correct positions
        for i in range(0, len(variable), 2):
            # Check if there are enough elements in the variable list
            if i + 1 < len(variable):
                price_text = variable[i]
                square_meter_text = variable[i + 1]
                print(f"0. {square_meter_text}")
                # Check if both price and square meter information are present
                if price_text and square_meter_text:
                    try:
                        # Remove euro sign and convert price to int
                        price = int(price_text.replace('€', '').replace('.', '').replace(',', '').strip())

                        # Remove square meter sign and convert to float
                        square_meter_str = square_meter_text.replace('m²', '').strip()
    
                        # Treat the decimal comma as a decimal point
                        square_meter_str = square_meter_str.replace(',', '.')
                    
                        # Replace the first dot with an empty string to prevent decimal issues
                        if square_meter_str.count('.') == 2:
                            square_meter_str = square_meter_str.replace('.', '', 1)
                

                        # Check if there are three digits after the dot and remove the dot if true
                        if '.' in square_meter_str and len(square_meter_str.split('.')[1]) == 3:
                            square_meter_str = square_meter_str.replace('.', '', 1)

                        # Convert square meter to float
                        square_meter = float(square_meter_str) if square_meter_str else 0.0
                        
                        # Check if square_meter is non-zero before division
                        if square_meter != 0:
                            price_per_square_meter = round(price / square_meter, 2)
            
                            # Check if price_per_square_meter is within the specified range
                            if 2000 <= price_per_square_meter <= 3000:
                                # Append data to the list for saving to Excel
                                house_data['House link'].extend(links_list)
                                house_data['Price per square meter [€]'].append(price_per_square_meter)
                    except (ValueError, IndexError):
                        #print("Price or the living space information is missing or invalid.")
                        pass

    return house_data
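
The dot and comma handling above can also be written as one small helper. This is only a sketch, not the code from this answer: `parse_german_number` is a hypothetical name, and it assumes the site consistently uses German number formatting (a dot for thousands, a comma for decimals), which replaces the "three digits after the dot" heuristic.

```python
import re


def parse_german_number(text):
    """Parse text such as '1.264 m²' or '120,5 m²' into a float.

    Assumption: '.' groups thousands and ',' marks the decimals.
    Returns None when the text contains no usable number.
    """
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    # Drop trailing punctuation, remove thousands dots, use '.' for decimals
    number = match.group(0).rstrip(".,")
    number = number.replace(".", "").replace(",", ".")
    try:
        return float(number)
    except ValueError:
        return None


print(parse_german_number("1.264 m²"))     # 1264.0
print(parse_german_number("120,5 m²"))     # 120.5
print(parse_german_number("1.189,04 m²"))  # 1189.04
```

Returning None instead of 0.0 keeps "no value" distinct from a real zero, so the caller can skip the listing before dividing.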
0

If you want to scrape the list and use some of its attributes, your function would look roughly like this:

def extract_information(soup):
    information = soup.find_all('dd', class_='font-highlight font-tabular')
    categories = [info.find_next_siblings("dt", class_="font-tabular onlyLarge font-xs attribute-label") for info in information]
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]

    for link, info, cat in zip(links, information, categories):
        house_link = f"https://www.immobilienscout24.de{link}"
        # Extract and print only the prices and square meters
        text = info.getText().strip()
        cat_text = cat[0].getText().strip()

        try:
            if '€' in text:
                # Remove euro sign and convert to int
                price = int(text.replace('€', '').replace('.', '').replace(',', '').strip())
            elif 'm²' in text and cat_text == "Wohnfläche":
                # Remove square meter sign and convert to int
                square_meter_str = text.replace('m²', '').replace(',', '').strip()
                print(square_meter_str)

                # Handle cases where dot is used as a thousand separator
                if '.' in square_meter_str and square_meter_str.count('.') == 1:
                    # If there is only one dot, consider it as a thousand separator
                    square_meter_str = square_meter_str.replace('.', '')
                elif '.' in square_meter_str and square_meter_str.count('.') > 1:
                    # If there are multiple dots, keep only the last one as the decimal point
                    square_meter_str = square_meter_str.rsplit('.', 1)[0] + square_meter_str.rsplit('.', 1)[1].replace('.', '')

                # Convert to float, handle the case when it's zero
                square_meter = float(square_meter_str) if square_meter_str else 0.0

                # Check if square_meter is non-zero before division
                if square_meter != 0:
                    price_per_square_meter = price / square_meter

                    # Check if price_per_square_meter is within the specified range
                    if 2000 <= price_per_square_meter <= 3000:
                        # Append data to the list for saving to Excel
                        house_data['House link'].append(house_link)
                        house_data['Price per square meter [€]'].append(price_per_square_meter)
        except:
            print("Price or the living space information is missing.")
    return house_data
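
The idea of the function above is to select values by their label rather than by their position. A stripped-down sketch of that idea, with the HTML parsing left out: `living_area_values` and the sample pairs are hypothetical, and the labels other than "Wohnfläche" are assumptions about what the page shows.

```python
def living_area_values(pairs, wanted_label="Wohnfläche"):
    """Return the values whose label equals wanted_label."""
    return [value for label, value in pairs if label.strip() == wanted_label]


# Hypothetical (label, value) pairs, as they might be read from each
# label element and its value element on the results page.
sample_pairs = [
    ("Kaufpreis", "595.000 €"),
    ("Wohnfläche", "232 m²"),
    ("Grundstück", "1.264 m²"),
    ("Kaufpreis", "250.000 €"),
    ("Wohnfläche", "84 m²"),
    ("Grundstück", "990 m²"),
]

print(living_area_values(sample_pairs))  # ['232 m²', '84 m²']
```

Filtering on the label keeps working when a listing has no plot area, which is where position-based approaches tend to break.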
1

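I haven't tested your function, but if you want to pull the elements you need out of the result, you can split the string on spaces and then take the odd-numbered elements.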

square_meters = square_meter_str.split(" ")[::2]

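split turns the original string into a list of strings. Slicing with a step of 2 then returns every element in an odd-numbered position (the first, third, fifth, and so on).

A small note: your code comments mention converting to int. To do that, you also have to convert the result. For example,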

square_meters = list(map(int,square_meter_str.split(" ")[::2]))

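With the specific output you posted, this raises an error, because not all of the values can be converted to int (for example, '1.264'). Filtering those values out avoids the error:

list(map(int, [m for m in square_meter_str.split(" ") if "." not in m]))

or

list(map(int, [m for m in square_meter_str.split(" ")[::2] if "." not in m]))

That gives you a list of ints.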
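
A short sketch of the slicing approach on a shortened sample of the posted output. Instead of dropping values that contain a dot, it removes the dot, under the assumption that it is a thousands separator and that none of the values has decimals.

```python
# Shortened sample of the output posted in the question, joined with spaces
square_meter_str = "232 1.264 8426 990 175 801"

values = square_meter_str.split()
inside = values[::2]    # 1st, 3rd, 5th value (indices 0, 2, 4)
outside = values[1::2]  # 2nd, 4th, 6th value (indices 1, 3, 5)

# Assumption: '.' is a thousands separator, so it can simply be removed
inside_ints = [int(v.replace(".", "")) for v in inside]
outside_ints = [int(v.replace(".", "")) for v in outside]

print(inside_ints)   # [232, 8426, 175]
print(outside_ints)  # [1264, 990, 801]
```

This only works while every listing reports both values in the same order. A single listing with one value missing shifts everything after it.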

Edit

After looking at the page in question, I see that the output is a single string rather than a list. One solution could be round(float(text.replace('m²', '').replace(',', '.').split("- ")[0]))
