String similarity comparison with Python and SQLite (Levenshtein distance / edit distance)

Posted 2024-05-23 22:04:00


Is there a string similarity measure available in Python + SQLite, for example with the sqlite3 module?

Example use case:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")

This query should match the row with ID 1, but not the row with ID 2:

(the query was not preserved in this copy of the question)

How can this be done in SQLite + Python?

What I have found so far:

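  • Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one word into another, could be useful, but I am not sure whether an official implementation exists in SQLite (I have seen a few custom implementations, such as this one).

  • Damerau-Levenshtein is the same, except that it also allows the transposition of two adjacent characters; it is also called edit distance.

  • I know I could define a function myself, but implementing such a distance is non-trivial (doing natural-language comparisons super efficiently inside a database really is non-trivial), which is why I wanted to see whether Python/SQLite already provides such a tool.

  • SQLite has FTS (full-text search) features: FTS3, FTS4, FTS5.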

    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
    CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
    SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
    SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
    

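    But I have not found a way to compare strings with such a "similarity distance": FTS features such as MATCH do not seem to offer a similarity measure that tolerates changed letters.

  • Moreover, this answer shows that: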

    SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
    A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

    The latter (the fact that "jmups" cannot be found to be similar to "jumps") unfortunately makes it impractical for my use case.
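
The "define a function" option from the list above can be sketched as follows. This is a minimal illustration, not an official SQLite feature: it registers a pure-Python Levenshtein function through sqlite3's create_function. The SQL function name levenshtein, the search string, and the threshold 6 are choices made for this example only, and a Python callback evaluated per row will be slow on large tables.

```python
import sqlite3


def levenshtein(a, b):
    """Plain Levenshtein distance with unit cost per insertion, deletion, substitution."""
    if a is None or b is None:
        return None  # propagate SQL NULL
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


conn = sqlite3.connect(':memory:')
# Hypothetical SQL name 'levenshtein', taking 2 arguments.
conn.create_function('levenshtein', 2, levenshtein)
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.executemany('INSERT INTO mytable VALUES (?, ?)',
              [(1, 'hello world, guys'), (2, 'hello there everybody')])
# The threshold 6 is an assumption chosen for this illustration.
c.execute('SELECT id, description FROM mytable '
          'WHERE levenshtein(description, ?) < 6', ('hel o wrold guy',))
rows = c.fetchall()
print(rows)  # on Python 3: [(1, 'hello world, guys')]
```

Row 2 is rejected simply because it is six characters longer than the search string, so its distance cannot be below 6.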


1 answer
User
Answer #1 · posted 2024-05-23 22:04:00

Here is a ready-to-use example, test.py:

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <  UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
c.execute("SELECT * FROM mytable WHERE editdist3(description, 'hel o wrold guy') < 600")
print(c.fetchall())
# Output: [(1, u'hello world, guys')]  (Python 2; Python 3 prints the string without the u prefix)
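
If the extension has not been built yet, load_extension raises sqlite3.OperationalError. A small guard such as the following sketch can make that failure easier to read. The helper name try_load_extension is hypothetical, and the hasattr check assumes that Python builds compiled without extension support simply lack the method.

```python
import sqlite3


def try_load_extension(conn, path):
    """Return True if the extension at `path` loaded, False otherwise."""
    if not hasattr(conn, 'enable_load_extension'):
        return False  # this Python build has no extension-loading support
    conn.enable_load_extension(True)
    try:
        conn.load_extension(path)
        return True
    except sqlite3.OperationalError:
        return False  # file missing, wrong architecture, or incompatible SQLite
    finally:
        conn.enable_load_extension(False)  # always switch loading back off


db = sqlite3.connect(':memory:')
if try_load_extension(db, './spellfix'):
    print('spellfix loaded, editdist3 is available')
else:
    print('could not load ./spellfix; build it first for your platform')
```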

Important note: the editdist3 distance is normalized so that

the value of 100 is used for insertion and deletion and 150 is used for substitution
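
To get a feel for these costs, here is a pure-Python sketch of a weighted edit distance that uses the same default values (100 for insertion and deletion, 150 for substitution). It is only an illustration of the arithmetic and is not the spellfix implementation, so editdist3 may return different numbers for some inputs.

```python
def weighted_edit_distance(a, b, ins_cost=100, del_cost=100, sub_cost=150):
    """Edit distance with separate costs for insertion, deletion, substitution."""
    previous = [j * ins_cost for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [i * del_cost]
        for j, ch_b in enumerate(b, start=1):
            match_or_sub = previous[j - 1] + (0 if ch_a == ch_b else sub_cost)
            current.append(min(previous[j] + del_cost,     # delete ch_a
                               current[j - 1] + ins_cost,  # insert ch_b
                               match_or_sub))              # keep or substitute
        previous = current
    return previous[-1]


print(weighted_edit_distance('abc', 'abcd'))        # 100: one insertion
print(weighted_edit_distance('abc', 'abd'))         # 150: one substitution
print(weighted_edit_distance('kitten', 'sitting'))  # 400: two substitutions + one insertion
```

Under these costs, a condition such as `< 600` admits, for instance, up to five insertions or deletions (500) or three substitutions (450).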


Here is what to do first on Windows:

  1. Download https://sqlite.org/2016/sqlite-src-3110100.zip and https://sqlite.org/2016/sqlite-amalgamation-3110100.zip, and unzip both.

  2. Replace C:\Python27\DLLs\sqlite3.dll with the newer sqlite3.dll from here. If you skip this step, you will later get sqlite3.OperationalError: The specified procedure could not be found.

  3. Run:

    (the command block was not preserved in this copy of the answer)

    or

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64
    cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
    python test.py
    

    (For MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgamation-3230100/ -o spellfix.dll)

Here is how to do it on Debian Linux:

(based on this answer)

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

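Here is how to do it on Debian Linux with an older Python:

If the distribution's Python is somewhat old, another method is needed. Since the sqlite3 module is built into Python, upgrading it does not seem straightforward (pip install --upgrade pysqlite would only upgrade the pysqlite module, not the underlying SQLite library). So if, for example, import sqlite3; print(sqlite3.sqlite_version) reports 3.8.2, this method works: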

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]
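
The version check mentioned above can be scripted, which helps when deciding which build route to follow. This is a sketch under one assumption: the minimum (3, 11, 1) is borrowed from the 3110100 archives used earlier purely for illustration, since the answer does not state which SQLite version is actually required. The helper name is hypothetical.

```python
import sqlite3


def sqlite_library_older_than(minimum):
    """Compare the SQLite library linked into Python against a version tuple."""
    return sqlite3.sqlite_version_info < tuple(minimum)


# sqlite_version is the underlying C library, not the Python module version.
print('SQLite library version: %s' % sqlite3.sqlite_version)

# (3, 11, 1) is an illustrative threshold, not a documented requirement.
if sqlite_library_older_than((3, 11, 1)):
    print('older SQLite library: the tarball-based build may be the one to try')
else:
    print('newer SQLite library: the regular build steps may be enough')
```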
