Answers to the question: Partial string matching on a large dataset

Partial string matching on a large dataset


I've been working on this for a while and can't figure out what to do next. I'm fairly new to pandas and Python.

The dataset is 15,000 product names. The formats all differ: some names have several dashes (up to 6), some have hyphens, the lengths vary, and every row is a product name with its variants appended.

When I run partial string matching on the large dataset, the code I'm using keeps returning only the first letter instead of the partial string. It works fine on the small dataset I used for testing.

I assume this is because:
<ol>
<li>I haven't created a stop condition that matches the full partial string</li>
<li>it compares the words character by character and stops as soon as it finds a difference.</li>
</ol>
What is the best way to overcome this on a large dataset, and what am I missing? Or will I have to do this manually?

Original test dataset:
<pre><code>1.star t-shirt-large-red
2.star t-shirt-large-blue
3.star t-shirt-small-red
4.beautiful rainbow skirt small
5.long maxwell logan jeans- light blue -32L-28W
6.long maxwell logan jeans- Dark blue -32L-28W
</code></pre>
Desired dataset/output: [code block missing from the source]

Here is the code I got help with in a previous question:
<pre><code>import pandas as pd
from os.path import commonprefix

# Cross join every name with every other name
df['onkey'] = 1
df1 = pd.merge(df[['name','onkey']], df[['name','onkey']], on='onkey')
df1['list'] = df1.apply(lambda x: [x.name_x, x.name_y], axis=1)

# Common prefix of each pair, and its length
df1['COL1'] = df1['list'].apply(lambda x: commonprefix(x))
df1['COL1_num'] = df1['COL1'].apply(lambda x: len(x))
df1 = df1[(df1['COL1_num'] != 0)]
df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
df = df.rename(columns={'name': 'name_x'})
df = pd.merge(df, df1[['name_x','COL1']], on='name_x', how='left')

# Split the remainder of each name into variant columns
df['len'] = df['COL1'].apply(lambda x: len(x))
df['other'] = df.apply(lambda x: x.name_x[x.len:], axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1] == '-' else x)
df['other'] = df['other'].apply(lambda x: x.split('-'))
df = df[['COL1','other']]
df = pd.concat([df['COL1'], df['other'].apply(pd.Series)], axis=1)
</code></pre>
Result on the test dataset:
<pre><code>                       COL1           0     1    2
0              star t-shirt       large   red  NaN
1              star t-shirt       large  blue  NaN
2              star t-shirt       small   red  NaN
3   beautiful rainbow skirt       small   NaN  NaN
4  long maxwell logan jeans  light blue   32L  28W
5  long maxwell logan jeans   Dark blue   32L  28W
</code></pre>
*************Update*************
<ol>
<li>Below is the product input list; some products have variants and some don't.</li>
<li>Searching for duplicate strings to tell products with variants from products without them finds nothing: because the variants are appended to the end of each string, every row counts as a unique value.</li>
<li>So what I want is to group partial or similar strings together (longest match), extract the longest matching string in each group, and put the differences into the other columns.</li>
<li>If a product/string is unique, just write it to the column that holds the extracted longest string.</li>
</ol>
Input:
<pre><code>star t-shirt-large-red
star t-shirt-large-blue
star t-shirt-small-red
beautiful rainbow skirt small
long maxwell logan jeans- light blue -32L-28W
long maxwell logan jeans- Dark blue -32L-28W
Organic and natural candy - 3 Pack - Mint
Organic and natural candy - 3 Pack - Vanilla
Organic and natural candy - 3 Pack - Strawberry
Organic and natural candy - 3 Pack - Chocolate
Organic and natural candy - 3 Pack - Banana
Organic and natural candy - 3 Pack - Cola
Organic and natural candy - 12 Pack Assorted
Morgan T-shirt Company - Small/Medium-Blue
Morgan T-shirt Company - Medium/Large-Blue
Morgan T-shirt Company - Medium/Large-red
Morgan T-shirt Company - Small/Medium-Red
Morgan T-shirt Company - Small/Medium-Green
Morgan T-shirt Company - Medium/Large-Green
Nelly dress leopard small
</code></pre>
Desired output:
<pre><code>col1                       col2          col3        col4
star t-shirt               large         red
star t-shirt               large         blue
star t-shirt               small         red
beautiful rainbow skirt    small
Long maxwell logan jeans   light blue    32L         28W
Long maxwell logan jeans   Dark blue     32L         28W
Organic and natural candy  3 Pack        Mint
Organic and natural candy  3 Pack        Vanilla
Organic and natural candy  3 Pack        Strawberry
Organic and natural candy  3 Pack        Chocolate
Organic and natural candy  3 Pack        Banana
Organic and natural candy  3 Pack        Cola
Organic and natural candy  12 Pack       Assorted
Morgan T-shirt Company     Small/Medium  Blue
Morgan T-shirt Company     Medium/Large  Blue
Morgan T-shirt Company     Medium/Large  Red
Morgan T-shirt Company     Small/Medium  Red
Morgan T-shirt Company     Small/Medium  Green
Morgan T-shirt Company     Medium/Large  Green
Nelly dress                Leopard       Small
Bijoux Princess PJ-set
Lemon tank top             Yellow        Medium
</code></pre>


1 Answer

Anonymous, 1 day ago

A simple solution that is easy to understand, debug, and extend is the following.

Suppose your initial product names are stored in a list called <code>strings</code>. Then the solution is:
<pre><code>mydf = pd.concat(
    [pd.DataFrame([make_row(row, 4)],
                  columns=['COL1', 'COL2', 'COL3', 'COL4'])
     for row in strings],
    ignore_index=True)
</code></pre>
where the parsing function <code>make_row</code> is defined as: [code block missing from the source]

The first line, which defines <code>cols</code>, could also simply be <code>cols = string.split('-')</code>, in which case you can strip the surrounding whitespace afterwards with:
<pre><code># Strip whitespace from every non-null cell and keep the result
mydf = mydf.applymap(lambda x: x if pd.isnull(x) else str.strip(x))
</code></pre>
In your example I see that some product names themselves contain a hyphen. In that case you may want to "clean" them beforehand (or inside <code>make_row</code>, as you prefer) with something like:
<pre><code>strings = [item.replace('t-shirt', 'tshirt') for item in strings]
</code></pre>
Sample input:
<pre><code>strings = ['1.one-two-three',
           '2. one-two',
           '3.one-two-three-four',
           '4.one - two -three -four ']
</code></pre>
Output:
<pre><code>  COL1 COL2   COL3  COL4
0  one  two  three   NaN
1  one  two    NaN   NaN
2  one  two  three  four
3  one  two  three  four
</code></pre>
Output for the data in the question (after correcting the mistake in item 4):
<pre><code>                       COL1        COL2  COL3 COL4
0               star tshirt       large   red  NaN
1               star tshirt       large  blue  NaN
2               star tshirt       small   red  NaN
3   beautiful rainbow skirt       small   NaN  NaN
4  long maxwell logan jeans  light blue   32L  28W
5  long maxwell logan jeans   Dark blue   32L  28W
</code></pre>
Edit: If you also want to "group" the items, you can either:

a) after obtaining the DataFrame as above, use <code>sort_values</code> (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html" rel="nofollow noreferrer">pandas doc</a>) on column COL1, so that the rows belonging to the same product simply appear one after another, or

b) use <code>groupby</code> to get an actual grouped DataFrame:
<pre><code>grouped_df = mydf.groupby("COL1")
</code></pre>
You can then fetch each group like this:
<pre><code>grouped_df.get_group("star tshirt")
</code></pre>
which produces the following output:
<pre><code>          COL1   COL2  COL3 COL4
0  star tshirt  large   red  NaN
1  star tshirt  large  blue  NaN
2  star tshirt  small   red  NaN
</code></pre>
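The definition of <code>make_row</code> did not survive in this copy of the answer, so the following is only a minimal sketch of what such a function might look like, not the answerer's original code. It assumes that each name may start with an item number such as "1." (as in the sample input), that variants are separated by hyphens, and that extra pieces beyond the requested column count can be dropped. The regular expression and the truncation rule are assumptions made for illustration.

```python
import re

import pandas as pd


def make_row(string, n_cols):
    """Hypothetical stand-in for the answer's make_row (assumed behaviour)."""
    # Assumption: drop a leading item number such as "1." or "2. ".
    string = re.sub(r'^\s*\d+\.\s*', '', string)
    # Split on hyphens and trim the whitespace around each piece.
    cols = [part.strip() for part in string.split('-')]
    # Assumption: keep at most n_cols pieces, then pad short rows with NaN.
    cols = cols[:n_cols]
    return cols + [float('nan')] * (n_cols - len(cols))


strings = ['1.one-two-three',
           '2. one-two',
           '3.one-two-three-four',
           '4.one - two -three -four ']
columns = ['COL1', 'COL2', 'COL3', 'COL4']

# Build all rows first, then create the DataFrame in one call.
mydf = pd.DataFrame([make_row(s, 4) for s in strings], columns=columns)
```

Building the list of rows and constructing the DataFrame once avoids concatenating 15,000 one-row frames, which may matter at the size mentioned in the question. Note that a name containing its own hyphen (such as "t-shirt") would still be split by this sketch, which is why the answer suggests cleaning such names first.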
