Cannot get certain elements from a variable

-1 votes

3 answers

93 views

Asked 2025-04-14 16:35

I am scraping this site, https://www.immobilienscout24.de/Suche/de/berlin/berlin/haus-kaufen?enteredFrom=one_step_search, and from this variable:

square_meter_str = text.replace('m²', '').replace(',', '').strip()

I get these values:

232 1.264 8426 990 175 801 140 599 117 581 72 160 110 266 145 519 290 917 151 520 116 0 172 206 46 1.024 479 1.040 1.18904 424 29148 1.007 135 599 156 444

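This variable is supposed to hold the square meters inside the house, but it also picks up the square meters outside the house. I only want the interior values, such as 232, 8426, 175, 140 and so on (that is, every value in an odd position). How can I do that, or how can I change the code so that it only scrapes the square meters inside the house?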

P.S. This is the full code of the function that contains the variable:

def extract_information(soup):
    information = soup.find_all('dd', class_='font-highlight font-tabular')
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]

    for link, info in zip(links, information):
        house_link = f"https://www.immobilienscout24.de{link}"

        # Extract and print only the prices and square meters
        text = info.getText().strip()
        try:
            if '€' in text:
                # Remove euro sign and convert to int
                price = int(text.replace('€', '').replace('.', '').replace(',', '').strip())
            elif 'm²' in text:
                # Remove square meter sign and convert to int
                square_meter_str = text.replace('m²', '').replace(',', '').strip()
                print(square_meter_str)

                # Handle cases where dot is used as a thousand separator
                if '.' in square_meter_str and square_meter_str.count('.') == 1:
                    # If there is only one dot, consider it as a thousand separator
                    square_meter_str = square_meter_str.replace('.', '')
                elif '.' in square_meter_str and square_meter_str.count('.') > 1:
                    # If there are multiple dots, keep only the last one as the decimal point
                    square_meter_str = square_meter_str.rsplit('.', 1)[0] + square_meter_str.rsplit('.', 1)[1].replace('.', '')

                # Convert to float, handle the case when it's zero
                square_meter = float(square_meter_str) if square_meter_str else 0.0

                # Check if square_meter is non-zero before division
                if square_meter != 0:
                    price_per_square_meter = price / square_meter

                    # Check if price_per_square_meter is within the specified range
                    if 2000 <= price_per_square_meter <= 3000:
                        # Append data to the list for saving to Excel
                        house_data['House link'].append(house_link)
                        house_data['Price per square meter [€]'].append(price_per_square_meter)
        except:
            print("Price or the living space information is missing.")

    return house_data

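I tried saving the values to a list and then looping over that list to take every second value, but it did not work. I even asked ChatGPT, but that did not help either.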


3 Answers

Update

I found a solution. I changed this part:

information = soup.find_all('dd', class_='font-highlight font-tabular')
links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]

for link, info in zip(links, information):
    house_link = f"https://www.immobilienscout24.de{link}"

    # Extract and print only the prices and square meters
    text = info.getText().strip()

to this part:

containers = soup.find_all('div', class_='grid grid-flex grid-align-center grid-justify-space-between')

# Iterate over containers
for container in containers:
    # Find all dd elements within the container
    informations = container.find_all('dd')

    # Extract links from a elements within the container
    links = container.find_all('a')
    links_list = [f"https://www.immobilienscout24.de{link.get('href')}" for link in links]

    # Extract text from dd elements
    variable = [information.getText() for information in informations]

Here is the code for the whole function:

def extract_information_house(soup):
    # Find all containers
    containers = soup.find_all('div', class_='grid grid-flex grid-align-center grid-justify-space-between')

    # Iterate over containers
    for container in containers:
        # Find all dd elements within the container
        informations = container.find_all('dd')

        # Extract links from a elements within the container
        links = container.find_all('a')
        links_list = [f"https://www.immobilienscout24.de{link.get('href')}" for link in links]

        # Extract text from dd elements
        variable = [information.getText() for information in informations]

        # Extract price and square meter information from the correct positions
        for i in range(0, len(variable), 2):
            # Check if there are enough elements in the variable list
            if i + 1 < len(variable):
                price_text = variable[i]
                square_meter_text = variable[i + 1]
                print(f"0. {square_meter_text}")
                # Check if both price and square meter information are present
                if price_text and square_meter_text:
                    try:
                        # Remove euro sign and convert price to int
                        price = int(price_text.replace('€', '').replace('.', '').replace(',', '').strip())

                        # Remove square meter sign and convert to float
                        square_meter_str = square_meter_text.replace('m²', '').strip()
    
                        # Replace the decimal comma with a dot
                        square_meter_str = square_meter_str.replace(',', '.')

                        # Two dots mean the first one is a thousands separator, so drop it
                        if square_meter_str.count('.') == 2:
                            square_meter_str = square_meter_str.replace('.', '', 1)

                        # Check if there are three digits after the dot and remove the dot if true
                        if '.' in square_meter_str and len(square_meter_str.split('.')[1]) == 3:
                            square_meter_str = square_meter_str.replace('.', '', 1)

                        # Convert square meter to float
                        square_meter = float(square_meter_str) if square_meter_str else 0.0
                        
                        # Check if square_meter is non-zero before division
                        if square_meter != 0:
                            price_per_square_meter = round(price / square_meter, 2)
            
                            # Check if price_per_square_meter is within the specified range
                            if 2000 <= price_per_square_meter <= 3000:
                                # Append data to the list for saving to Excel
                                house_data['House link'].extend(links_list)
                                house_data['Price per square meter [€]'].append(price_per_square_meter)
                    except (ValueError, IndexError):
                        #print("Price or the living space information is missing or invalid.")
                        pass

    return house_data
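
The separator handling in the function above can also be written as one small helper. This is only a minimal sketch, assuming the site always uses German number notation (dot for thousands, comma for decimals); the helper name `parse_german_number` is hypothetical and not part of the original code.

```python
def parse_german_number(text: str) -> float:
    """Parse a German-formatted value such as '1.189,04 m²' into a float.

    Assumption: '.' is always a thousands separator and ',' is always
    the decimal separator.
    """
    # Strip the unit signs and surrounding whitespace
    cleaned = text.replace('m²', '').replace('€', '').strip()
    # Drop thousands separators, then turn the decimal comma into a dot
    cleaned = cleaned.replace('.', '').replace(',', '.')
    return float(cleaned)


print(parse_german_number('1.264 m²'))     # 1264.0
print(parse_german_number('1.189,04 m²'))  # 1189.04
print(parse_german_number('84,26 m²'))     # 84.26
```

An empty or non-numeric string raises `ValueError`, which the `except (ValueError, IndexError)` clause in the function above would already catch.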

Answered 2025-04-14 by Python大师


If you want to scrape some attributes from the listing and use them, your function would look roughly like this:

def extract_information(soup):
    information = soup.find_all('dd', class_='font-highlight font-tabular')
    categories = [info.find_next_siblings("dt", class_="font-tabular onlyLarge font-xs attribute-label") for info in information]
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]

    for link, info, cat in zip(links, information, categories):
        house_link = f"https://www.immobilienscout24.de{link}"
        # Extract and print only the prices and square meters
        text = info.getText().strip()
        cat_text = cat[0].getText().strip()

        try:
            if '€' in text:
                # Remove euro sign and convert to int
                price = int(text.replace('€', '').replace('.', '').replace(',', '').strip())
            elif 'm²' in text and cat_text == "Wohnfläche":
                # Remove square meter sign and convert to int
                square_meter_str = text.replace('m²', '').replace(',', '').strip()
                print(square_meter_str)

                # Handle cases where dot is used as a thousand separator
                if '.' in square_meter_str and square_meter_str.count('.') == 1:
                    # If there is only one dot, consider it as a thousand separator
                    square_meter_str = square_meter_str.replace('.', '')
                elif '.' in square_meter_str and square_meter_str.count('.') > 1:
                    # If there are multiple dots, keep only the last one as the decimal point
                    square_meter_str = square_meter_str.rsplit('.', 1)[0] + square_meter_str.rsplit('.', 1)[1].replace('.', '')

                # Convert to float, handle the case when it's zero
                square_meter = float(square_meter_str) if square_meter_str else 0.0

                # Check if square_meter is non-zero before division
                if square_meter != 0:
                    price_per_square_meter = price / square_meter

                    # Check if price_per_square_meter is within the specified range
                    if 2000 <= price_per_square_meter <= 3000:
                        # Append data to the list for saving to Excel
                        house_data['House link'].append(house_link)
                        house_data['Price per square meter [€]'].append(price_per_square_meter)
        except:
            print("Price or the living space information is missing.")
    return house_data
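
The idea of selecting a value by its label rather than by its position can be shown in isolation. The sketch below assumes the label and value texts have already been read from the `dt` and `dd` elements of one listing; only the label "Wohnfläche" comes from the code above, while the other labels, the sample values, and the helper name `living_space_text` are made up for illustration.

```python
# Hypothetical (label, value) pairs for one listing; the real page
# may use different labels and values.
listing = [
    ('Kaufpreis', '1.250.000 €'),
    ('Wohnfläche', '232 m²'),
    ('Grundstück', '1.264 m²'),
]


def living_space_text(pairs):
    """Return the value labelled 'Wohnfläche', or None if there is none."""
    for label, value in pairs:
        if label.strip() == 'Wohnfläche':
            return value
    return None


print(living_space_text(listing))  # 232 m²
```

Matching on the label keeps working when a listing omits a value or lists them in a different order, which position-based slicing cannot guarantee.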

Answered 2025-04-14 by Python大师


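I have not tested your function, but if you want to extract the elements you need from the result, you can split the string on spaces and then take the elements in the odd positions.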

square_meters = square_meter_str.split(" ")[::2]

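The split turns the original string into a list of strings. A slice with a step of 2 then returns every element in an odd position (the 1st, 3rd, 5th and so on).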
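
As a quick illustration, here is the slice applied to the first eight values from the question. It is only a sketch and assumes the values really are held in one space-separated string.

```python
# First eight values of the output shown in the question
square_meter_str = "232 1.264 8426 990 175 801 140 599"

values = square_meter_str.split(" ")
inner = values[::2]    # 1st, 3rd, 5th, ... values
outer = values[1::2]   # 2nd, 4th, 6th, ... values

print(inner)  # ['232', '8426', '175', '140']
print(outer)  # ['1.264', '990', '801', '599']
```

The result is still a list of strings, so any arithmetic needs a conversion step afterwards.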

A small note: your code comment mentions converting to an integer. To do that, you also need to convert the result, for example:

square_meters = list(map(int,square_meter_str.split(" ")[::2]))

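With the specific output you posted, this raises an error, because not every value can be converted to an integer. You can filter out the values that contain a dot: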

list(map(int,[m for m in square_meter_str.split(" ") if not "." in m]))

or

list(map(int,[m for m in square_meter_str.split(" ")[::2] if not "." in m]))

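This gives you a list of integers.

Edit

After looking at the page in question, I see that the output is a single string for each element rather than a list. One solution could be round(float(text.replace('m²', '').replace(',', '.').split("- ")[0]))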

Answered 2025-04-14 by Python大师

