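I am trying to create unique IDs from a list of words. I want the numbers to be globally unique: if another list comes along, the same word should get the same ID. For example, the ID for "density" might be 151111911, and it would be the same if "density" appears in a different list.

As you can see, my current approach using id and intern does not work: the ID for rrb is exactly the same as the ID for lrb.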
featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']
featureVector = mydefaultdict(mydouble)
for featureID, featureVal in enumerate(featureList):
    print "featureID is", featureID
    print "featureVal is ", featureVal
    print "Encoded feature value is", id(intern(str(featureVal.encode("utf-8"))))
    featureVector[featureID] = featureVal
featureID is 0
featureVal is guinea
Encoded feature value is 4569583120.0
featureID is 1
featureVal is bissau
Encoded feature value is 4569581632.0
featureID is 2
featureVal is compared
Encoded feature value is 4569583120.0
featureID is 3
featureVal is countriesthe
Encoded feature value is 4567944360.0
featureID is 4
featureVal is population
Encoded feature value is 4347153072.0
featureID is 5
featureVal is density
Encoded feature value is 4455561472.0
featureID is 6
featureVal is guinea
Encoded feature value is 4569581632.0
featureID is 7
featureVal is bissau
Encoded feature value is 4569583120.0
featureID is 8
featureVal is similar
Encoded feature value is 4496118144.0
featureID is 9
featureVal is iran
Encoded feature value is 4569583120.0
featureID is 10
featureVal is afghanistan
Encoded feature value is 4569581632.0
featureID is 11
featureVal is cameroon
Encoded feature value is 4569583120.0
featureID is 12
featureVal is panama
Encoded feature value is 4569581632.0
featureID is 13
featureVal is montenegro
Encoded feature value is 4569583120.0
featureID is 14
featureVal is guinea
Encoded feature value is 4569581632.0
featureID is 15
featureVal is belarus
Encoded feature value is 4569583120.0
featureID is 16
featureVal is palau
Encoded feature value is 4569581632.0
featureID is 17
featureVal is location_slot
Encoded feature value is 4567944360.0
featureID is 18
featureVal is south
Encoded feature value is 4569583120.0
featureID is 19
featureVal is africa
Encoded feature value is 4569581632.0
featureID is 20
featureVal is respective
Encoded feature value is 4569583120.0
featureID is 21
featureVal is population
Encoded feature value is 4347153072.0
featureID is 22
featureVal is density
Encoded feature value is 4455561472.0
featureID is 23
featureVal is lrb
Encoded feature value is 4537993216.0
featureID is 24
featureVal is capita
Encoded feature value is 4569581632.0
featureID is 25
featureVal is per
Encoded feature value is 4455914152.0
featureID is 26
featureVal is square
Encoded feature value is 4347127296.0
featureID is 27
featureVal is kilometer
Encoded feature value is 4569581632.0
featureID is 28
featureVal is rrb
Encoded feature value is 4537993216.0
featureID is 29
featureVal is global
Encoded feature value is 4346597072.0
featureID is 30
featureVal is rank
Encoded feature value is 4346629984.0
featureID is 31
featureVal is number_slot
Encoded feature value is 4569583120.0
featureID is 32
featureVal is years
Encoded feature value is 4569581632.0
featureID is 33
featureVal is growthguinea
Encoded feature value is 4567944360.0
featureID is 34
featureVal is bissau
Encoded feature value is 4569583120.0
featureID is 35
featureVal is population
Encoded feature value is 4347153072.0
featureID is 36
featureVal is density
Encoded feature value is 4455561472.0
featureID is 37
featureVal is positive
Encoded feature value is 4514096160.0
featureID is 38
featureVal is growth
Encoded feature value is 4569583120.0
featureID is 39
featureVal is lrb
Encoded feature value is 4537993216.0
featureID is 40
featureVal is rrb
Encoded feature value is 4537993216.0
featureID is 41
featureVal is last
Encoded feature value is 4346568112.0
featureID is 42
featureVal is years
Encoded feature value is 4569583120.0
featureID is 43
featureVal is lrb
Encoded feature value is 4537993216.0
featureID is 44
featureVal is rrb
Encoded feature value is 4537993216.0
featureID is 45
featureVal is LOCATION_SLOT~-appos+LOCATION~-prep_of
Encoded feature value is 4538026784.0
featureID is 46
featureVal is LOCATION~-prep_of+that~-prep_to
Encoded feature value is 6043251168.0
featureID is 47
featureVal is that~-prep_to+similar~prep_with
Encoded feature value is 6043251168.0
featureID is 48
featureVal is similar~prep_with+density~prep_of
Encoded feature value is 6043251168.0
featureID is 49
featureVal is density~prep_of+NUMBER~appos
Encoded feature value is 6043251168.0
featureID is 50
featureVal is NUMBER~appos+NUMBER~amod
Encoded feature value is 6043247024.0
featureID is 51
featureVal is NUMBER~amod+NUMBER_SLOT
Encoded feature value is 6043247024.0
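What am I doing wrong? I need to convert the words to floats or other numbers because the sentences above will be fed to a classifier that requires numeric/vectorized features.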
You could use the words themselves, hashes of the words, or even convert the strings to numbers.
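As a rough sketch of the last two options (Python 3 here, unlike the Python 2 code in the question): the helper names stable_word_id and word_as_number are made up for illustration, and hashlib is used on the assumption that the ID must survive across runs, since the built-in hash() of a str is randomized per process in Python 3 unless PYTHONHASHSEED is set.

```python
import hashlib


def stable_word_id(word, digits=9):
    """Hypothetical helper: derive an integer ID from the SHA-256 digest of the word."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then keep `digits` decimal digits.
    return int.from_bytes(digest[:8], "big") % (10 ** digits)


def word_as_number(word):
    """Hypothetical helper: read the UTF-8 bytes of the word as one big integer."""
    return int.from_bytes(word.encode("utf-8"), "big")


list_a = ["population", "density", "lrb", "rrb"]
list_b = ["density", "growth"]

ids_a = {w: stable_word_id(w) for w in list_a}
ids_b = {w: stable_word_id(w) for w in list_b}

# The same word maps to the same ID no matter which list it came from.
print(ids_a["density"] == ids_b["density"])

# Byte-based numbers never collide, but they grow with the word length.
print(float(word_as_number("lrb")), float(word_as_number("rrb")))
```

Truncating a hash to 9 digits makes collisions between different words possible, and the byte-based number stops being exact as a float once it exceeds 2**53, so which trade-off is acceptable depends on the vocabulary size and the classifier.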
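This follows from what the docs say about intern and id: an interned string is not kept alive once nothing references it, and an id is only unique among objects that exist at the same time. By the time the next string is interned, the previous one may already have been freed, so the new string can occasionally get the same id. So keep the references in a container. I would use a dict: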
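A minimal sketch of that idea, written for Python 3 where intern lives in sys (in the Python 2 code from the question it is a built-in). The fresh helper is hypothetical and only simulates the encode/str round trip by building a new string object at runtime.

```python
import sys

words = ["lrb", "capita", "rrb", "lrb"]


def fresh(word):
    """Hypothetical helper: build a new string object with the same characters."""
    return "".join(list(word))


interned = {}  # interned string -> itself; the dict keeps every string alive
for w in words:
    s = sys.intern(fresh(w))
    interned.setdefault(s, s)

# While the dict holds the references, distinct words cannot share an id.
ids = {w: id(s) for w, s in interned.items()}
for w in ("lrb", "capita", "rrb"):
    print(w, ids[w])
```

These ids are only stable inside one running process, so they fix the lrb/rrb clash but still would not give the same number for "density" in a later run.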
Perhaps the simplest way is to use a defaultdict together with itertools.count, with a float as the starting value.
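A minimal sketch of this approach, assuming Python 3; the starting value 1.0 and the names counter, word_ids and feature_list are arbitrary choices for illustration.

```python
from collections import defaultdict
from itertools import count

# count(1.0) yields 1.0, 2.0, 3.0, ...; every unseen word pulls the next value.
counter = count(1.0)
word_ids = defaultdict(lambda: next(counter))

feature_list = ["population", "density", "lrb", "capita", "rrb", "density", "lrb"]
encoded = [word_ids[w] for w in feature_list]
print(encoded)  # [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 3.0]
```

The IDs stay consistent across lists only as long as the same word_ids mapping is reused (or saved and reloaded) for every list.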
We can do some other checks:
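For instance, a sketch along these lines (Python 3; the mapping is rebuilt here so the snippet runs on its own, and the word lists are made up) checks the properties the question asks for:

```python
from collections import defaultdict
from itertools import count

counter = count(1.0)
word_ids = defaultdict(lambda: next(counter))

# Encode two separate lists with the same mapping.
first = [word_ids[w] for w in ["population", "density", "lrb", "rrb"]]
second = [word_ids[w] for w in ["growth", "density", "rrb"]]

checks = {
    "lrb and rrb differ": word_ids["lrb"] != word_ids["rrb"],
    "density is shared across lists": first[1] == second[1],
    "all ids are floats": all(isinstance(v, float) for v in word_ids.values()),
    "ids are unique per word": len(set(word_ids.values())) == len(word_ids),
}
for name, ok in checks.items():
    print(name, ok)
```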