What is the simplest way to find out whether two lists in Python have any items in common?

high=['e202', 'e407', 'e450', 'e250', 'e341', 'e211', 'e621', 'e200', 'e452', 'e481', 'e340', 'e223', 'e451', 'e338', 'e220', 'e252', 'e339', 'e212', 'e224', 'e491', 'e222', 'e251', 'e407a', 'e492', 'e221', 'e473', 'e210', 'e343', 'e482', 'e228', 'e155', 'e243', 'e226', 'e494', 'e459', 'e493', 'e213']

df['high'] = np.zeros(len(df))  # create a dummy column with only zeros at first

def common_member(a, b):
    # return True if the two lists have at least one common element
    a_set = set(a)
    b_set = set(b)
    if a_set & b_set:
        return True
    else:
        return False

i = 0
while i < len(df.additives):
    if common_member(df.additives.iloc[i], high) == True:
        df['high'][i] = 1  # change the dummy to 1 in the given row
    i = i + 1

1 Answer

User
#1 · Posted 2024-05-29 10:07:56

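Mainly, your problem is that iterating over things in pandas can be very slow. It may be that the row-by-row assignment makes pandas clone the entire DataFrame once per row.
So let's pull all the values out with df.additives.values before iterating and see how that goes; then we can create a new column of booleans in a single assignment.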
import random
import string
import time

import pandas as pd

start_time = time.time()

high = set(['e202', 'e407', 'e450', 'e250', 'e341', 'e211', 'e621', 'e200', 'e452', 'e481', 'e340', 'e223', 'e451', 'e338', 'e220', 'e252', 'e339', 'e212', 'e224', 'e491', 'e222', 'e251', 'e407a', 'e492', 'e221', 'e473', 'e210', 'e343', 'e482', 'e228', 'e155', 'e243', 'e226', 'e494', 'e459', 'e493', 'e213'])

def make_ingredients():
    # build a list of 99 random 4-character strings as junk data
    return [''.join(random.choices(string.ascii_uppercase + string.digits, k=4)) for i in range(1, 100)]

# one row that is guaranteed to contain an additive from `high`
sample_ingredients = make_ingredients()
sample_ingredients.append('e202')

list_of_ingredients = [make_ingredients() for i in range(1, 350000)]
list_of_ingredients.append(sample_ingredients)

checkpoint_time = time.time()
checkpoint_delta = checkpoint_time - start_time
checkpoint_string = time.strftime("%H:%M:%S", time.gmtime(checkpoint_delta))
print(f'Time to create junk data: {checkpoint_string}')

df = pd.DataFrame({'id': range(len(list_of_ingredients)), 'additives': list_of_ingredients})

# compute every flag first, then assign the whole column at once
df["high"] = [len(set(additives).intersection(high)) > 0 for additives in df.additives.values]
print(df)

intersection_delta = time.time() - checkpoint_time
intersection_string = time.strftime("%H:%M:%S", time.gmtime(intersection_delta))
print(f'Time to check for intersections: {intersection_string}')
On my laptop, this produces:
Time to create junk data: 00:01:21
            id                                          additives   high
0            0  [GBI5, 5ZH5, AUSE, GU8C, Z5WJ, NU56, GJ1M, 8EN...  False
1            1  [JPC7, PZ3P, 7PV1, DP6O, 4OZ9, 3UN0, 3116, MXW...  False
2            2  [1RJP, BG6O, PMI9, Y9PD, W9NF, 25A8, QB6C, 490...  False
3            3  [3WCC, 6682, O0BY, JT52, AG8H, 0HKC, VV7N, 5YU...  False
4            4  [ZOGO, 6V4B, NBJZ, 0U93, 0P2G, 8TIH, B15Y, A7I...  False
...        ...                                                ...    ...
349995  349995  [5G6W, QRPL, D3ZH, XIA8, GG8X, H401, 7RU3, 8VY...  False
349996  349996  [ZLJJ, Q8YG, NCE8, ULBT, 6VFU, B24E, EYU5, SM0...  False
349997  349997  [4UJ0, HYD3, UPQ4, 1H8F, 2MKR, LSAM, M7KC, CWF...  False
349998  349998  [LFER, 44CC, 214W, FXU4, 3F4V, UCRD, 8O8F, SBD...  False
349999  349999  [KZJY, 28MA, TDUL, ANBM, SD1A, 69FT, 9TYY, VTF...   True

[350000 rows x 3 columns]
Time to check for intersections: 00:00:03
So yes, checking all 350,000 rows for set intersections takes about three seconds. :)
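As a possible variation on the intersection check, here is a minimal sketch that uses set.isdisjoint instead of building an intersection. The three-row DataFrame and the helper name has_common_member are made up for illustration, and only three of the codes from the question's list are used. I have not timed this against the version above, so treat any speed difference as an assumption to measure on your own data.

```python
import pandas as pd

# A few of the additive codes from the question; the full list works the same way.
high = {'e202', 'e407', 'e450'}

# Hypothetical sample data standing in for the real `additives` column.
df = pd.DataFrame({
    'additives': [
        ['e100', 'e202'],   # shares 'e202' with `high`
        ['e300', 'e301'],   # nothing in common
        [],                 # empty list, nothing in common
    ]
})

def has_common_member(items, reference=high):
    """Return True if `items` shares at least one element with `reference`."""
    # isdisjoint accepts any iterable and can stop at the first match,
    # so no per-row set or intersection has to be built.
    return not reference.isdisjoint(items)

# Compute all flags in a plain list, then assign the column once.
df['high'] = [has_common_member(additives) for additives in df['additives']]
print(df['high'].tolist())
```

The design choice is the same as in the answer: the per-row work happens in ordinary Python, and the DataFrame is written to only once.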
