Delete strings from a file using Python

Posted 2024-04-25 03:50:34


I have a CSV file:

ID,"address","used_at","active_seconds","pageviews"
0a1d796327284ebb443f71d85cb37db9,"vk.com",2016-01-29 22:10:52,3804,115
0a1d796327284ebb443f71d85cb37db9,"2gis.ru",2016-01-29 22:48:52,214,24
0a1d796327284ebb443f71d85cb37db9,"yandex.ru",2016-01-29 22:14:30,4,2
0a1d796327284ebb443f71d85cb37db9,"worldoftanks.ru",2016-01-29 22:10:30,41,2

I need to delete the strings (rows) that contain certain words. There are 117 such words.

I tried:

^{pr2}$

But with 117 words this runs too slowly. Afterwards I build a pivot_table, and the words I tried to delete still show up in it as columns.

^{3}$

Those columns contain only 0.

How can I delete the rows faster, so that these words do not end up as columns?


2 Answers

The fastest way I know to process a CSV file is to load it into a DataFrame with the pandas package.

import pandas as pd

df = pd.read_csv(the_path_of_your_file, header=0)
# .ix was removed in pandas 1.0; .loc is the label-based replacement
df.loc[df['address'] == 'yandex.ru', 'address'] = ''

This replaces every cell containing 'yandex.ru' with an empty string. You can then write the result back to CSV:

^{pr2}$
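The write-back step can be sketched as follows. This is only a minimal sketch, assuming `DataFrame.to_csv` with `index=False`. The inline sample data and the in-memory buffer are illustrative stand-ins for a real file path and are not part of the original answer.

```python
import io
import pandas as pd

# Inline sample standing in for the real file (assumption for illustration).
csv_text = '''ID,"address","used_at","active_seconds","pageviews"
0a1d796327284ebb443f71d85cb37db9,"vk.com",2016-01-29 22:10:52,3804,115
0a1d796327284ebb443f71d85cb37db9,"yandex.ru",2016-01-29 22:14:30,4,2
'''
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Blank out the matching cells, as in the answer above.
df.loc[df['address'] == 'yandex.ru', 'address'] = ''

# Write back; index=False keeps the row index out of the output.
out = io.StringIO()  # swap for a file path to write to disk
df.to_csv(out, index=False)
written = out.getvalue()
```

Passing a path string instead of the buffer writes the same content to disk.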

If you want to delete the rows where that URL appears, use:

df = df.drop(df[df.address == 'yandex.ru'].index)

IIUC you can filter the rows with a boolean mask built from isin, as the session below does:

print(df)
                                 ID          address              used_at  \
0  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
1  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
2  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
3  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30   
4  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
0            3804        115  
1            3804        115  
2             214         24  
3               4          2  
4              41          2  

words = ['vk.com','yandex.ru']

print(~df.address.isin(words))
0    False
1    False
2     True
3    False
4     True
Name: address, dtype: bool

print(df[~df.address.isin(words)])
                                 ID          address              used_at  \
2  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
4  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
2             214         24  
4              41          2  
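For a Python 3 session, the filter above can be sketched as below. The second half rests on an assumption about the question's intent: if the 117 words should match as substrings of `address` rather than as whole values, `Series.str.contains` with an escaped alternation pattern is one option. The sample frame and the short `words` list are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'address': ['vk.com', 'vk.com', '2gis.ru', 'yandex.ru', 'worldoftanks.ru'],
    'pageviews': [115, 115, 24, 2, 2],
})
words = ['vk.com', 'yandex.ru']  # stands in for the full 117-word list

# Exact match: keep rows whose address is NOT in the list.
exact = df[~df['address'].isin(words)]

# Substring match (assumption: words may appear inside longer addresses).
# re.escape keeps '.' literal instead of matching any character.
pattern = '|'.join(re.escape(w) for w in words)
partial = df[~df['address'].str.contains(pattern, regex=True)]
```

Both variants are vectorized, so they avoid looping over the rows once per word.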

Then use ^{}:

^{pr2}$

Another solution is to drop the rows where some column is 0 (for example pageviews):

print(df)

                                 ID          address              used_at  \
0  0a1d796327284ebb443f71d85cb37db9       youtube.ru  2016-01-29 22:10:52   
1            0a1d796327284ebfsffsdf       youtube.ru  2016-01-29 22:10:52   
2  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
3  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
4  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30   
5  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
0            3804          0  
1            3804          0  
2            3804        115  
3             214         24  
4               4          2  
5              41          2  

print(df.pageviews != 0)
0    False
1    False
2     True
3     True
4     True
5     True
Name: pageviews, dtype: bool

print(df[df.pageviews != 0])
                                 ID          address              used_at  \
2  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
3  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
4  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30   
5  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
2            3804        115  
3             214         24  
4               4          2  
5              41          2  
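If more than one column should be checked for zeros, the same mask idea extends naturally. The sketch below assumes a row should be dropped when any of the listed columns is 0. The column list `cols` and the small frame are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'address': ['youtube.ru', 'vk.com', '2gis.ru'],
    'active_seconds': [3804, 0, 214],
    'pageviews': [0, 115, 24],
})
cols = ['active_seconds', 'pageviews']  # hypothetical columns to check

# Keep a row only if every checked column is non-zero.
kept = df[(df[cols] != 0).all(axis=1)]
```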

print(df[df.pageviews != 0].pivot_table(index='ID', columns='address', values='pageviews'))
address                           2gis.ru  vk.com  worldoftanks.ru  yandex.ru
ID                                                                           
0a1d796327284ebb443f71d85cb37db9       24     115                2          2
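To tie this back to the question, here is a Python 3 sketch that filters before pivoting, so the unwanted addresses never become columns. The frame is a cut-down version of the sample above, and `words` is a hypothetical stand-in for the real 117-word list.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['0a1d796327284ebb443f71d85cb37db9'] * 5,
    'address': ['youtube.ru', 'vk.com', '2gis.ru', 'yandex.ru', 'worldoftanks.ru'],
    'pageviews': [0, 115, 24, 2, 2],
})
words = ['vk.com', 'yandex.ru']  # hypothetical list of addresses to exclude

# Combine both conditions in one boolean mask before pivoting.
mask = (df['pageviews'] != 0) & ~df['address'].isin(words)
pivot = df[mask].pivot_table(index='ID', columns='address', values='pageviews')
```

Because the rows are removed before `pivot_table` runs, no all-zero columns are created for the excluded addresses.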
