Cleaning a URL column in a pandas DataFrame

date        | URLs                                        | Count
-----------------------------------------------------------------
17-mar-2014 | www.example.com/abcdef&=randstring          | 20
10-mar-2016 | www.example.com/xyzabc                      | 12
14-apr-2015 | www.example.com/abcdef                      | 11
12-mar-2016 | www.example.com/abcdef/randstring           | 30
15-mar-2016 | www.example.com/abcdef                      | 10
17-feb-2016 | www.example.com/xyzabc&=randstring          | 15
17-mar-2016 | www.example.com/abcdef&=someotherrandstring | 12

2 Answers

User

#1 · edited 2024-05-16 05:31:12

I think this has more to do with regular expressions than with pandas. Try using pandas apply to change the column.

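import re

import pandas as pd

# Escaped dots match a literal '.'; the pattern is compiled once, not per row.
URL_PATTERN = re.compile(r'(www\.example\.com/[a-zA-Z]*)')


def clear_url(origin_url):
    """Keep the host plus the first run of letters after the slash."""
    r = URL_PATTERN.search(origin_url)
    if r:
        return r.group(1)
    # No match: leave the URL unchanged.
    return origin_url


d = [
    {'id': 1, 'url': 'www.example.com/abcdef&=randstring'},
    {'id': 2, 'url': 'www.example.com/abcdef'},
    {'id': 3, 'url': 'www.example.com/xyzabc&=randstring'}
]
df = pd.DataFrame(d)

print('origin_df')
print(df)

# Apply the cleaner to every value in the url column.
df['url'] = df['url'].apply(clear_url)
print('new_df')
print(df)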

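Output: after the apply call, the url column contains www.example.com/abcdef, www.example.com/abcdef, and www.example.com/xyzabc.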

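User

#2 · edited 2024-05-16 05:31:12

I think you can use str.extract with a regex that matches a run of letters (a-z, A-Z) between www. and .com, followed by another run of letters after the /: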

print(df.URLs.str.extract(r'(www\.[a-zA-Z]*\.com/[a-zA-Z]*)', expand=False))
0    www.example.com/abcdef
1    www.example.com/xyzabc
2    www.example.com/abcdef
3    www.example.com/abcdef
4    www.example.com/abcdef
5    www.example.com/xyzabc
6    www.example.com/abcdef
Name: URLs, dtype: object
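
The extract approach above can be tried end to end with a minimal, self-contained sketch. It assumes the question's table is loaded into a DataFrame named df with columns date, URLs, and Count. The fillna fallback for rows that do not match is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "date": ["17-mar-2014", "10-mar-2016", "14-apr-2015", "12-mar-2016",
             "15-mar-2016", "17-feb-2016", "17-mar-2016"],
    "URLs": [
        "www.example.com/abcdef&=randstring",
        "www.example.com/xyzabc",
        "www.example.com/abcdef",
        "www.example.com/abcdef/randstring",
        "www.example.com/abcdef",
        "www.example.com/xyzabc&=randstring",
        "www.example.com/abcdef&=someotherrandstring",
    ],
    "Count": [20, 12, 11, 30, 10, 15, 12],
})

# Escaped dots match literal "." characters; the single capture group keeps
# the host plus the first run of letters after the slash.
pattern = r"(www\.[a-zA-Z]*\.com/[a-zA-Z]*)"
cleaned = df["URLs"].str.extract(pattern, expand=False)

# Assumption for illustration: rows that do not match come back as NaN,
# so fall back to the original URL instead of losing the value.
df["URLs"] = cleaned.fillna(df["URLs"])

print(df)
```

With expand=False and a single capture group, str.extract returns a Series, so the result can be assigned straight back to the column.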
