I have text in my dataset and want to convert it to one-hot encoding.

Posted 2024-05-13 12:00:09


array(['ftp_data', 'other', 'private', 'http', 'remote_job', 'name',
   'netbios_ns', 'eco_i', 'mtp', 'telnet', 'finger', 'domain_u',
   'supdup', 'uucp_path', 'Z39_50', 'smtp', 'csnet_ns', 'uucp',
   'netbios_dgm', 'urp_i', 'auth', 'domain', 'ftp', 'bgp', 'ldap',
   'ecr_i', 'gopher', 'vmnet', 'systat', 'http_443', 'efs', 'whois',
   'imap4', 'iso_tsap', 'echo', 'klogin', 'link', 'sunrpc', 'login',
   'kshell', 'sql_net', 'time', 'hostnames', 'exec', 'ntp_u',
   'discard', 'nntp', 'courier', 'ctf', 'ssh', 'daytime', 'shell',
   'netstat', 'pop_3', 'nnsp', 'IRC', 'pop_2', 'printer', 'tim_i',
   'pm_dump', 'red_i', 'netbios_ssn', 'rje', 'X11', 'urh_i',
   'http_8001', 'aol', 'http_2784', 'tftp_u', 'harvest'], dtype=object)

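This is one feature (column) of my dataset. The array above lists its unique values: there are 70 of them, and each one is treated as a category. I want to convert this feature to one-hot encoding. To be specific, if a row contains 'ftp_data', it should be encoded as 1000000....., and so on for every row.

I know one approach: assign a number to each word, replace the words in the dataset with those numbers, and then apply a one-hot encoding method. Is there another way to convert my dataset directly from words to one-hot encoding? Can anyone help me find a way to do this in pandas?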


1 answer
User
#1 · Posted 2024-05-13 12:00:09

I think you are looking for pandas.get_dummies:

import pandas as pd

s=pd.Series(['ftp_data', 'other', 'private', 'http', 'remote_job', 'name',
   'netbios_ns', 'eco_i', 'mtp', 'telnet', 'finger', 'domain_u',
   'supdup', 'uucp_path', 'Z39_50', 'smtp', 'csnet_ns', 'uucp',
   'netbios_dgm', 'urp_i', 'auth', 'domain', 'ftp', 'bgp', 'ldap',
   'ecr_i', 'gopher', 'vmnet', 'systat', 'http_443', 'efs', 'whois',
   'imap4', 'iso_tsap', 'echo', 'klogin', 'link', 'sunrpc', 'login',
   'kshell', 'sql_net', 'time', 'hostnames', 'exec', 'ntp_u',
   'discard', 'nntp', 'courier', 'ctf', 'ssh', 'daytime', 'shell',
   'netstat', 'pop_3', 'nnsp', 'IRC', 'pop_2', 'printer', 'tim_i',
   'pm_dump', 'red_i', 'netbios_ssn', 'rje', 'X11', 'urh_i',
   'http_8001', 'aol', 'http_2784', 'tftp_u', 'harvest'])
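# get_dummies builds one 0/1 indicator column per category.
# Transpose so each row is a category, join that row's 70 values into one
# string, then sort descending so the strings read 1000..., 0100..., 0010...
one_hot = (
    pd.get_dummies(s, dtype=int)
    .T
    .apply(lambda x: ''.join(x.astype(str).tolist()), axis=1)
    .sort_values(ascending=False)
)
print(one_hot)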



ftp_data      1000000000000000000000000000000000000000000000...
other         0100000000000000000000000000000000000000000000...
private       0010000000000000000000000000000000000000000000...
http          0001000000000000000000000000000000000000000000...
remote_job    0000100000000000000000000000000000000000000000...
                                    ...                        
http_8001     0000000000000000000000000000000000000000000000...
aol           0000000000000000000000000000000000000000000000...
http_2784     0000000000000000000000000000000000000000000000...
tftp_u        0000000000000000000000000000000000000000000000...
harvest       0000000000000000000000000000000000000000000000...
Length: 70, dtype: object

print(one_hot.head(50))

ftp_data       1000000000000000000000000000000000000000000000...
other          0100000000000000000000000000000000000000000000...
private        0010000000000000000000000000000000000000000000...
http           0001000000000000000000000000000000000000000000...
remote_job     0000100000000000000000000000000000000000000000...
name           0000010000000000000000000000000000000000000000...
netbios_ns     0000001000000000000000000000000000000000000000...
eco_i          0000000100000000000000000000000000000000000000...
mtp            0000000010000000000000000000000000000000000000...
telnet         0000000001000000000000000000000000000000000000...
finger         0000000000100000000000000000000000000000000000...
domain_u       0000000000010000000000000000000000000000000000...
supdup         0000000000001000000000000000000000000000000000...
uucp_path      0000000000000100000000000000000000000000000000...
Z39_50         0000000000000010000000000000000000000000000000...
smtp           0000000000000001000000000000000000000000000000...
csnet_ns       0000000000000000100000000000000000000000000000...
uucp           0000000000000000010000000000000000000000000000...
netbios_dgm    0000000000000000001000000000000000000000000000...
urp_i          0000000000000000000100000000000000000000000000...
auth           0000000000000000000010000000000000000000000000...
domain         0000000000000000000001000000000000000000000000...
ftp            0000000000000000000000100000000000000000000000...
bgp            0000000000000000000000010000000000000000000000...
ldap           0000000000000000000000001000000000000000000000...
ecr_i          0000000000000000000000000100000000000000000000...
gopher         0000000000000000000000000010000000000000000000...
vmnet          0000000000000000000000000001000000000000000000...
systat         0000000000000000000000000000100000000000000000...
http_443       0000000000000000000000000000010000000000000000...
efs            0000000000000000000000000000001000000000000000...
whois          0000000000000000000000000000000100000000000000...
imap4          0000000000000000000000000000000010000000000000...
iso_tsap       0000000000000000000000000000000001000000000000...
echo           0000000000000000000000000000000000100000000000...
klogin         0000000000000000000000000000000000010000000000...
link           0000000000000000000000000000000000001000000000...
sunrpc         0000000000000000000000000000000000000100000000...
login          0000000000000000000000000000000000000010000000...
kshell         0000000000000000000000000000000000000001000000...
sql_net        0000000000000000000000000000000000000000100000...
time           0000000000000000000000000000000000000000010000...
hostnames      0000000000000000000000000000000000000000001000...
exec           0000000000000000000000000000000000000000000100...
ntp_u          0000000000000000000000000000000000000000000010...
discard        0000000000000000000000000000000000000000000001...
nntp           0000000000000000000000000000000000000000000000...
courier        0000000000000000000000000000000000000000000000...
ctf            0000000000000000000000000000000000000000000000...
ssh            0000000000000000000000000000000000000000000000...
dtype: object
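
The strings above are convenient for display. If the goal is to encode a column inside the dataset itself, get_dummies can also be applied to a DataFrame directly. The following is only a minimal sketch: the column names "service" and "duration" and the four sample rows are assumptions made up for illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical miniature dataset; the column names and values are
# placeholders for illustration only.
df = pd.DataFrame({
    "duration": [0, 2, 5, 1],
    "service": ["ftp_data", "http", "private", "http"],
})

# Replace the text column with one 0/1 indicator column per category.
# Other columns are left untouched.
encoded = pd.get_dummies(df, columns=["service"], dtype=int)
print(encoded)
```

Each new column is named after the original column and the category (for example service_http), and the new columns appear in alphabetical order of the category names rather than in order of first appearance.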

As floats:

print(one_hot.astype(float))

ftp_data      1.000000e+69
other         1.000000e+68
private       1.000000e+67
http          1.000000e+66
remote_job    1.000000e+65
                  ...     
http_8001     1.000000e+04
aol           1.000000e+03
http_2784     1.000000e+02
tftp_u        1.000000e+01
harvest       1.000000e+00
Length: 70, dtype: float64

Note that astype(int) raises an error here, since a 70-digit number is too large for pandas' 64-bit integer type.
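
If integer values are needed, two possible workarounds are sketched below. This is not part of the original answer: the category names svc_00 to svc_69 are hypothetical stand-ins for the 70 real values, zero-padded so that alphabetical order matches their original order.

```python
import pandas as pd

# Hypothetical stand-ins for the 70 category names (assumption for
# illustration); zero-padding keeps alphabetical order equal to list order.
s = pd.Series([f"svc_{i:02d}" for i in range(70)])

dummies = pd.get_dummies(s, dtype=int)

# Option 1: keep the encoding as a 70 x 70 matrix of 0s and 1s instead of
# packing each row into a single huge number.
matrix = dummies.to_numpy()

# Option 2: build the 70-character bit strings as in the answer, then convert
# them with Python's built-in int, which has arbitrary precision.
bit_strings = dummies.T.apply(lambda row: "".join(row.astype(str).tolist()), axis=1)
as_int = {name: int(bits) for name, bits in bit_strings.items()}

print(matrix.shape)
print(as_int["svc_69"])
```

The matrix form is usually what downstream numeric code expects. The plain dict in option 2 is used so that the large Python integers are not forced back into a fixed-width dtype.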
