Unable to get a unique ID for a string in Python 2.7

Posted 2024-05-23 13:44:39


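I'm trying to create unique IDs from a list of words. I want these numbers to be globally unique: if another list comes along, the same word should get the same ID. For example, the ID for "density" might be 151111911, and if "density" appears in a different list it should get that same ID.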

As you can see, my current approach using id and intern does not work: the ID for rrb is exactly the same as the ID for lrb.

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

featureVector = mydefaultdict(mydouble)

for featureID,featureVal in enumerate(featureList):
        print "featureID is",featureID
        print "featureVal is ",featureVal
        print "Encoded feature value is", id(intern(str(featureVal.encode("utf-8"))))
        featureVector[featureID] = featureVal


featureID is 0
featureVal is  guinea
Encoded feature value is 4569583120.0
featureID is 1
featureVal is  bissau
Encoded feature value is 4569581632.0
featureID is 2
featureVal is  compared
Encoded feature value is 4569583120.0
featureID is 3
featureVal is  countriesthe
Encoded feature value is 4567944360.0
featureID is 4
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 5
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 6
featureVal is  guinea
Encoded feature value is 4569581632.0
featureID is 7
featureVal is  bissau
Encoded feature value is 4569583120.0
featureID is 8
featureVal is  similar
Encoded feature value is 4496118144.0
featureID is 9
featureVal is  iran
Encoded feature value is 4569583120.0
featureID is 10
featureVal is  afghanistan
Encoded feature value is 4569581632.0
featureID is 11
featureVal is  cameroon
Encoded feature value is 4569583120.0
featureID is 12
featureVal is  panama
Encoded feature value is 4569581632.0
featureID is 13
featureVal is  montenegro
Encoded feature value is 4569583120.0
featureID is 14
featureVal is  guinea
Encoded feature value is 4569581632.0
featureID is 15
featureVal is  belarus
Encoded feature value is 4569583120.0
featureID is 16
featureVal is  palau
Encoded feature value is 4569581632.0
featureID is 17
featureVal is  location_slot
Encoded feature value is 4567944360.0
featureID is 18
featureVal is  south
Encoded feature value is 4569583120.0
featureID is 19
featureVal is  africa
Encoded feature value is 4569581632.0
featureID is 20
featureVal is  respective
Encoded feature value is 4569583120.0
featureID is 21
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 22
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 23
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 24
featureVal is  capita
Encoded feature value is 4569581632.0
featureID is 25
featureVal is  per
Encoded feature value is 4455914152.0
featureID is 26
featureVal is  square
Encoded feature value is 4347127296.0
featureID is 27
featureVal is  kilometer
Encoded feature value is 4569581632.0
featureID is 28
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 29
featureVal is  global
Encoded feature value is 4346597072.0
featureID is 30
featureVal is  rank
Encoded feature value is 4346629984.0
featureID is 31
featureVal is  number_slot
Encoded feature value is 4569583120.0
featureID is 32
featureVal is  years
Encoded feature value is 4569581632.0
featureID is 33
featureVal is  growthguinea
Encoded feature value is 4567944360.0
featureID is 34
featureVal is  bissau
Encoded feature value is 4569583120.0
featureID is 35
featureVal is  population
Encoded feature value is 4347153072.0
featureID is 36
featureVal is  density
Encoded feature value is 4455561472.0
featureID is 37
featureVal is  positive
Encoded feature value is 4514096160.0
featureID is 38
featureVal is  growth
Encoded feature value is 4569583120.0
featureID is 39
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 40
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 41
featureVal is  last
Encoded feature value is 4346568112.0
featureID is 42
featureVal is  years
Encoded feature value is 4569583120.0
featureID is 43
featureVal is  lrb
Encoded feature value is 4537993216.0
featureID is 44
featureVal is  rrb
Encoded feature value is 4537993216.0
featureID is 45
featureVal is  LOCATION_SLOT~-appos+LOCATION~-prep_of
Encoded feature value is 4538026784.0
featureID is 46
featureVal is  LOCATION~-prep_of+that~-prep_to
Encoded feature value is 6043251168.0
featureID is 47
featureVal is  that~-prep_to+similar~prep_with
Encoded feature value is 6043251168.0
featureID is 48
featureVal is  similar~prep_with+density~prep_of
Encoded feature value is 6043251168.0
featureID is 49
featureVal is  density~prep_of+NUMBER~appos
Encoded feature value is 6043251168.0
featureID is 50
featureVal is  NUMBER~appos+NUMBER~amod
Encoded feature value is 6043247024.0
featureID is 51
featureVal is  NUMBER~amod+NUMBER_SLOT
Encoded feature value is 6043247024.0

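What am I doing wrong? The reason I need to convert the words to floats or numbers is that the sentences above will be fed into a classifier that requires numeric/vectorized features.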


3 Answers

You can use the word itself, a hash of the word, or even convert the string to a number.
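As a minimal sketch of the hashing option: the built-in hash() is not guaranteed to return the same value across interpreter builds or runs, so this example assumes a digest from hashlib instead. The function name stable_word_id and the num_bits parameter are made up for illustration and are not part of the original question or answers.

```python
import hashlib

def stable_word_id(word, num_bits=32):
    """Hypothetical helper: map a word to an integer derived only from its UTF-8 bytes."""
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    # Keep only the low num_bits bits so the value fits exactly in a float.
    return int(digest, 16) % (2 ** num_bits)

words = [u"density", u"lrb", u"rrb", u"density"]
ids = [stable_word_id(w) for w in words]

for word, word_id in zip(words, ids):
    print(word, float(word_id))
```

Because the ID depends only on the characters of the word, "density" gets the same number in every list and every run. Truncating to 32 bits makes collisions between different words possible in principle, so for a very large vocabulary a wider num_bits (or an explicit lookup table) may be the safer choice.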

From the docs:

Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.

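By the time the next string is interned, the previous one may already have been freed, and the new string may occasionally get the same id. So keep the references in a container. I'd use a dict: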

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

# dict of id:featureVal pairs 
seen = {}

for featureID,featureVal in enumerate(featureList):
    print "featureID is",featureID
    print "featureVal is ",featureVal
    interned = intern(str(featureVal.encode("utf-8")))
    interned_id = id(interned)

    # ensure that no other string with the same id has been seen
    assert interned_id not in seen or seen[interned_id] == featureVal

    # change this to seen[interned_id] = None and you'll (probably) get AssertionError
    # from the line above
    seen[interned_id] = interned

    print "Encoded feature value is", interned_id

Perhaps the simplest approach is to use a defaultdict together with itertools.count, with a float as the starting value, for example:

from collections import defaultdict
from itertools import count

# Start from 1.0 and increment by one - can change to start from any value or even add a step
# eg: `count(716345.0, 9)` will start at 716345.0 and increment by 9 for new keys
unique_id = defaultdict(lambda c=count(1.0): next(c))
featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot']
for feature in featureList:
    print(feature, unique_id[feature])

This prints:

guinea 1.0
bissau 2.0
compared 3.0
countriesthe 4.0
population 5.0
density 6.0
guinea 1.0
bissau 2.0
similar 7.0
iran 8.0
afghanistan 9.0
cameroon 10.0
panama 11.0
montenegro 12.0
guinea 1.0
belarus 13.0
palau 14.0
location_slot 15.0

We can do some other checks:

unique_id['cameroon'] 
# 10.0
unique_id['this is new']
# 16.0
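The IDs from this approach stay consistent only as long as the same unique_id mapping is reused for every list. A short sketch of that, assuming a hypothetical helper named encode that is not part of the answer above:

```python
from collections import defaultdict
from itertools import count

# One shared mapping: a new word gets the next float, a known word keeps its ID.
unique_id = defaultdict(lambda c=count(1.0): next(c))

def encode(words):
    """Hypothetical helper: turn a list of words into a list of float IDs."""
    return [unique_id[w] for w in words]

first = encode([u"population", u"density", u"lrb", u"rrb"])
second = encode([u"density", u"growth", u"rrb"])

print(first)   # [1.0, 2.0, 3.0, 4.0]
print(second)  # [2.0, 5.0, 4.0]
```

Here "density" and "rrb" keep their IDs in the second list, while "lrb" and "rrb" get different ones. The mapping lives only in memory, so to keep the IDs stable between separate runs it would have to be saved and reloaded (for example as a plain dict).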
