Can't get specific elements from a variable
I am scraping this website: https://www.immobilienscout24.de/Suche/de/berlin/berlin/haus-kaufen?enteredFrom=one_step_search, and from this variable:
square_meter_str = text.replace('m²', '').replace(',', '').strip()
I get these values:
232 1.264 8426 990 175 801 140 599 117 581 72 160 110 266 145 519 290 917 151 520 116 0 172 206 46 1.024 479 1.040 1.18904 424 29148 1.007 135 599 156 444
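This variable is supposed to hold the square meters of the inside of the house (the living space), but it also picks up the square meters of the outside (the plot). I only want the living space values, such as 232, 8426, 175, 140 and so on (that is, every odd-positioned value). How can I do that, or how can I change the code so that it scrapes only the living space?
P.S. Here is the full code of the function that contains the variable: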
def extract_information(soup):
    information = soup.find_all('dd', class_='font-highlight font-tabular')
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]
    for link, info in zip(links, information):
        house_link = f"https://www.immobilienscout24.de{link}"
        # Extract and print only the prices and square meters
        text = info.getText().strip()
        try:
            if '€' in text:
                # Remove euro sign and convert to int
                price = int(text.replace('€', '').replace('.', '').replace(',', '').strip())
            elif 'm²' in text:
                # Remove square meter sign and convert to int
                square_meter_str = text.replace('m²', '').replace(',', '').strip()
                print(square_meter_str)
                # Handle cases where dot is used as a thousand separator
                if '.' in square_meter_str and square_meter_str.count('.') == 1:
                    # If there is only one dot, consider it as a thousand separator
                    square_meter_str = square_meter_str.replace('.', '')
                elif '.' in square_meter_str and square_meter_str.count('.') > 1:
                    # If there are multiple dots, keep only the last one as the decimal point
                    square_meter_str = square_meter_str.rsplit('.', 1)[0] + square_meter_str.rsplit('.', 1)[1].replace('.', '')
                # Convert to float, handle the case when it's zero
                square_meter = float(square_meter_str) if square_meter_str else 0.0
                # Check if square_meter is non-zero before division
                if square_meter != 0:
                    price_per_square_meter = price / square_meter
                    # Check if price_per_square_meter is within the specified range
                    if 2000 <= price_per_square_meter <= 3000:
                        # Append data to the list for saving to Excel
                        house_data['House link'].append(house_link)
                        house_data['Price per square meter [€]'].append(price_per_square_meter)
        except:
            print("Price or the living space information is missing.")
    return house_data
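I tried saving the values to a list and looping over it to take every second value, but that didn't work. I even tried asking ChatGPT, but that didn't help either.
3 Answers
Update
I found a solution. I changed this part: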
information = soup.find_all('dd', class_='font-highlight font-tabular')
links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]
for link, info in zip(links, information):
    house_link = f"https://www.immobilienscout24.de{link}"
    # Extract and print only the prices and square meters
    text = info.getText().strip()
to this part:
containers = soup.find_all('div', class_='grid grid-flex grid-align-center grid-justify-space-between')
# Iterate over containers
for container in containers:
    # Find all dd elements within the container
    informations = container.find_all('dd')
    # Extract links from a elements within the container
    links = container.find_all('a')
    links_list = [f"https://www.immobilienscout24.de{link.get('href')}" for link in links]
    # Extract text from dd elements
    variable = [information.getText() for information in informations]
Here is the code of the whole function:
def extract_information_house(soup):
    # Find all containers
    containers = soup.find_all('div', class_='grid grid-flex grid-align-center grid-justify-space-between')
    # Iterate over containers
    for container in containers:
        # Find all dd elements within the container
        informations = container.find_all('dd')
        # Extract links from a elements within the container
        links = container.find_all('a')
        links_list = [f"https://www.immobilienscout24.de{link.get('href')}" for link in links]
        # Extract text from dd elements
        variable = [information.getText() for information in informations]
        # Extract price and square meter information from the correct positions
        for i in range(0, len(variable), 2):
            # Check if there are enough elements in the variable list
            if i + 1 < len(variable):
                price_text = variable[i]
                square_meter_text = variable[i + 1]
                print(f"0. {square_meter_text}")
                # Check if both price and square meter information are present
                if price_text and square_meter_text:
                    try:
                        # Remove euro sign and convert price to int
                        price = int(price_text.replace('€', '').replace('.', '').replace(',', '').strip())
                        # Remove the square meter sign
                        square_meter_str = square_meter_text.replace('m²', '').strip()
                        # Replace decimal commas with dots
                        square_meter_str = square_meter_str.replace(',', '.')
                        # With two dots, drop the first one (thousands separator)
                        if square_meter_str.count('.') == 2:
                            square_meter_str = square_meter_str.replace('.', '', 1)
                        # Check if there are three digits after the dot and remove the dot if true
                        if '.' in square_meter_str and len(square_meter_str.split('.')[1]) == 3:
                            square_meter_str = square_meter_str.replace('.', '', 1)
                        # Convert square meter to float
                        square_meter = float(square_meter_str) if square_meter_str else 0.0
                        # Check if square_meter is non-zero before division
                        if square_meter != 0:
                            price_per_square_meter = round(price / square_meter, 2)
                            # Check if price_per_square_meter is within the specified range
                            if 2000 <= price_per_square_meter <= 3000:
                                # Append data to the list for saving to Excel
                                house_data['House link'].extend(links_list)
                                house_data['Price per square meter [€]'].append(price_per_square_meter)
                    except (ValueError, IndexError):
                        #print("Price or the living space information is missing or invalid.")
                        pass
    return house_data
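The separator handling in this function could also live in a small helper of its own. The following is only a minimal sketch: it assumes the site formats numbers the German way ('.' as thousands separator, ',' as decimal separator), and `parse_german_number` is a hypothetical name, not something from the site or from BeautifulSoup.

```python
from typing import Optional


def parse_german_number(text: str) -> Optional[float]:
    """Parse a German-formatted number such as '1.264,5 m²' into a float.

    Assumption: '.' is a thousands separator and ',' is the decimal separator.
    Returns None when the remaining text is not a number.
    """
    # Drop the unit signs used on the page
    cleaned = text.replace('m²', '').replace('€', '').strip()
    # Remove thousands separators first, then turn the decimal comma into a dot
    cleaned = cleaned.replace('.', '').replace(',', '.')
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_german_number('1.264 m²'))   # 1264.0
print(parse_german_number('232,5 m²'))   # 232.5
print(parse_german_number('450.000 €'))  # 450000.0
print(parse_german_number('n/a'))        # None
```

Returning None instead of 0.0 keeps a missing value distinguishable from a real zero, so the caller can skip the listing before dividing.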
If you want to scrape the list and make use of some of its attributes, your function would look roughly like this:
def extract_information(soup):
    information = soup.find_all('dd', class_='font-highlight font-tabular')
    categories = [info.find_next_siblings("dt", class_="font-tabular onlyLarge font-xs attribute-label") for info in information]
    links = [info.find_previous('div', class_='grid-item').find('a')['href'] for info in information]
    for link, info, cat in zip(links, information, categories):
        house_link = f"https://www.immobilienscout24.de{link}"
        # Extract and print only the prices and square meters
        text = info.getText().strip()
        cat_text = cat[0].getText().strip()
        try:
            if '€' in text:
                # Remove euro sign and convert to int
                price = int(text.replace('€', '').replace('.', '').replace(',', '').strip())
            elif 'm²' in text and cat_text == "Wohnfläche":
                # Remove square meter sign and convert to int
                square_meter_str = text.replace('m²', '').replace(',', '').strip()
                print(square_meter_str)
                # Handle cases where dot is used as a thousand separator
                if '.' in square_meter_str and square_meter_str.count('.') == 1:
                    # If there is only one dot, consider it as a thousand separator
                    square_meter_str = square_meter_str.replace('.', '')
                elif '.' in square_meter_str and square_meter_str.count('.') > 1:
                    # If there are multiple dots, keep only the last one as the decimal point
                    square_meter_str = square_meter_str.rsplit('.', 1)[0] + square_meter_str.rsplit('.', 1)[1].replace('.', '')
                # Convert to float, handle the case when it's zero
                square_meter = float(square_meter_str) if square_meter_str else 0.0
                # Check if square_meter is non-zero before division
                if square_meter != 0:
                    price_per_square_meter = price / square_meter
                    # Check if price_per_square_meter is within the specified range
                    if 2000 <= price_per_square_meter <= 3000:
                        # Append data to the list for saving to Excel
                        house_data['House link'].append(house_link)
                        house_data['Price per square meter [€]'].append(price_per_square_meter)
        except:
            print("Price or the living space information is missing.")
    return house_data
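The idea here is to pick the value by its label ("Wohnfläche") rather than by its position. A minimal sketch of that idea on plain data, assuming each listing has already been reduced to (label, value) pairs; the sample pairs and the name `living_space_text` are hypothetical, and only the label "Wohnfläche" is taken from the code above.

```python
from typing import List, Optional, Tuple

# Hypothetical (label, value) pairs for one listing, e.g. read from its dt/dd elements.
# Only the label "Wohnfläche" comes from the function above; the others are made up.
listing_pairs = [
    ("Kaufpreis", "450.000 €"),
    ("Wohnfläche", "232 m²"),
    ("Grundstück", "1.264 m²"),
]


def living_space_text(pairs: List[Tuple[str, str]]) -> Optional[str]:
    """Return the value labelled 'Wohnfläche', or None if the listing has none."""
    for label, value in pairs:
        if label.strip() == "Wohnfläche":
            return value
    return None


print(living_space_text(listing_pairs))  # 232 m²
```

Selecting by label keeps working when a listing omits the plot size, which is where purely positional approaches tend to go out of step.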
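I haven't tested your function, but if you just want to pull the elements you need out of the result, you can split the string on spaces and then take the elements at the odd positions.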
square_meters = square_meter_str.split(" ")[::2]
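split turns the original string into a list of strings. Slicing it with a step of 2 then returns every element at an odd position (the 1st, 3rd, 5th and so on).
A small note: the comments in your code mention converting to an integer. To do that, you also have to convert the result. For example,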
square_meters = list(map(int,square_meter_str.split(" ")[::2]))
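With the specific result you posted, this raises an error, because not every value can be converted to an integer. You can skip the values that contain a dot: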
list(map(int,[m for m in square_meter_str.split(" ") if not "." in m]))
or
list(map(int,[m for m in square_meter_str.split(" ")[::2] if not "." in m]))
Either of these gives you a list of integers.
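To make the slicing concrete, here is a minimal sketch on a shortened, hypothetical sample string shaped like the printed values, under the assumption that living space and plot size strictly alternate:

```python
# Shortened, hypothetical sample: living space and plot size alternate.
raw = "232 1.264 175 801 140 599"

tokens = raw.split(" ")
living_space = tokens[::2]   # indices 0, 2, 4, ...
plot_size = tokens[1::2]     # indices 1, 3, 5, ...

# These particular living space values contain no dots, so int() works on all of them
living_space_ints = list(map(int, living_space))

print(living_space)       # ['232', '175', '140']
print(plot_size)          # ['1.264', '801', '599']
print(living_space_ints)  # [232, 175, 140]
```

This only holds while every listing contributes exactly two values. If one listing lacks a plot size, every later pair shifts by one position.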
Edit
After looking at the page in question, I see that the output is a string rather than a list. One possible solution is round(float(text.replace('m²', '').replace(',', '.').split("- ")[0]))