为什么同一个SQLite查询在只获取两倍结果时要慢30倍?

2024-04-30 06:29:00 发布

您现在位置:Python中文网/ 问答频道 /正文

我已经花了大约一周的时间来加速我正在使用的查询,并在这里提出了几个问题(How can I speed up fetching the results after running an sqlite query?Is it normal that sqlite.fetchall() is so slow?How to use min() and max() in an efficient way?)。

在这里给出的答案的非常有用的帮助下,我成功地获得了sqlite查询所需的时间,它花费了100.95秒,而fetchall花费了:1485.43。这还不够,所以在尝试了一些不同的索引之后,我设法将一个样本的查询时间降到了0.08秒,而fetchall时间降到了54.97秒。所以我想我终于把事情搞得够快了。

然后查询运行到下一个示例,占用0.58秒,而fetchall占用3952.80秒。对于第三个示例,查询花费了1.01秒和1970.67秒来获取。

第一个样本提取12951行,第二个样本提取24972行,第三个样本提取6470行。 我很好奇为什么第一个样本提取行要快得多,而第二个样本只有大约一半的提取量。


代码(spectrumFeature_inputValues是(1,),(2,)和(3,),来自所使用的3个示例。)

self.cursor.execute('begin')
self.cursor.execute("EXPLAIN QUERY PLAN "+
                    "SELECT precursor_id, feature_table_id "+
                    "FROM `MSMS_precursor` "+
                    "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
                    "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
                    "AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
                    "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'EXPLAIN QUERY PLAN: '
print self.cursor.fetchall()
import time
time0 = time.time()
self.cursor.execute("SELECT precursor_id, feature_table_id "+
                    "FROM `MSMS_precursor` "+
                    "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
                    "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
                    "AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
                    "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'query took:',time.time()-time0,'seconds'
time0 = time.time()
precursorFeatureIds = self.cursor.fetchall()
print 'it fetched:',len(precursorFeatureIds),'rows'
print 'fetchall took',time.time()-time0,'seconds'
time0 = time.time()
for precursorAndFeatureID in precursorFeatureIds:
    feature_has_MSMS_precursor_inputValues = (precursorAndFeatureID[0], precursorAndFeatureID[1])
    self.cursor.execute("INSERT INTO `feature_has_MSMS_precursor` VALUES(?,?)", feature_has_MSMS_precursor_inputValues)
print 'inserting took',time.time()-time0,'seconds'
self.connection.commit()

结果是:

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0754859447479 seconds
it fetched: 12951 rows
fetchall took 54.2855291367 seconds
inserting took 0.602859973907 seconds
It took 54.9704811573 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.579694032669 seconds
it fetched: 24972 rows
fetchall took 3950.08093309 seconds
inserting took 2.11575508118 seconds
 It took 3952.80745602 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 1.01185703278 seconds
it fetched: 6470 rows
fetchall took 1970.622962 seconds
inserting took 0.673867940903 seconds
It took 1972.31343699 seconds

SQLite create语句:

-- -----------------------------------------------------
-- Table `feature`
-- -----------------------------------------------------
CREATE  TABLE IF NOT EXISTS `feature` (
  `feature_table_id` INT PRIMARY KEY NOT NULL ,
  `feature_id` VARCHAR(40) NOT NULL ,
  `intensity` DOUBLE NOT NULL ,
  `overallquality` DOUBLE NOT NULL ,
  `charge` INT NOT NULL ,
  `content` VARCHAR(45) NOT NULL ,
  `intensity_cutoff` DOUBLE NOT NULL,
  `mzMin` DOUBLE NULL ,
  `mzMax` DOUBLE NULL ,
  `rtMin` DOUBLE NULL ,
  `rtMax` DOUBLE NULL ,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_feature_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_mzMin_feature` ON `feature` (`mzMin` ASC); 
  CREATE INDEX `fk_mzMax_feature` ON `feature` (`mzMax` ASC); 
  CREATE INDEX `fk_rtMin_feature` ON `feature` (`rtMin` ASC); 
  CREATE INDEX `fk_rtMax_feature` ON `feature` (`rtMax` ASC);

DROP TABLE IF EXISTS `spectrum`;
-- -----------------------------------------------------
-- Table `spectrum`
-- -----------------------------------------------------
CREATE  TABLE IF NOT EXISTS `spectrum` (
  `spectrum_id` INT PRIMARY KEY NOT NULL ,
  `spectrum_index` INT NOT NULL ,
  `ms_level` INT NOT NULL ,
  `base_peak_mz` DOUBLE NOT NULL ,
  `base_peak_intensity` DOUBLE NOT NULL ,
  `total_ion_current` DOUBLE NOT NULL ,
  `lowest_observes_mz` DOUBLE NOT NULL ,
  `highest_observed_mz` DOUBLE NOT NULL ,
  `scan_start_time` DOUBLE NOT NULL ,
  `ion_injection_time` DOUBLE,
  `binary_data_mz` BLOB NOT NULL,
  `binary_data_rt` BLOB NOT NULL,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

CREATE INDEX `fk_spectrum_spectrum_id_1` ON  `spectrum` (`spectrum_id` ASC);
CREATE INDEX `fk_spectrum_scahn_start_time_1` ON  `spectrum` (`scan_start_time` ASC);

DROP TABLE IF EXISTS `feature_has_MSMS_precursor`;
-- -----------------------------------------------------
-- Table `spectrum_has_feature`
-- -----------------------------------------------------
CREATE  TABLE IF NOT EXISTS `feature_has_MSMS_precursor` (
  `MSMS_precursor_precursor_id` INT NOT NULL ,
  `feature_feature_table_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_has_feature_spectrum1`
    FOREIGN KEY (`MSMS_precursor_precursor_id` )
    REFERENCES `MSMS_precursor` (`precursor_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION,
  CONSTRAINT `fk_spectrum_has_feature_feature1`
    FOREIGN KEY (`feature_feature_table_id` )
    REFERENCES `feature` (`feature_table_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_feature_has_MSMS_precursor_feature1` ON `feature_has_MSMS_precursor` (`feature_feature_table_id` ASC);
  CREATE INDEX `fk_feature_has_MSMS_precursor_precursor1` ON `feature_has_MSMS_precursor` (`MSMS_precursor_precursor_id` ASC);

如您所见,我已经从频谱和特性中的mzrt值中创建了索引,因为我发现大部分时间都花在比较这些数字上。

为什么第一个样本比第二个和第三个样本快得多?查询时间与fetchall时间有什么关系?最重要的是,我有办法加快速度吗?


更新1:

在和同事交谈之后,可能是因为将一个点与二维空间(rtMin、rtMax、mzMin、mzMax)进行比较需要n^2次。这个大致对应于第二个fetchall占用的时间略大于60^2秒(第一个fetchall占用的时间约为该时间),并且它检索的行数略小于行数的两倍。但这并不能回答我的任何问题。


更新2:

我试着按照评论中的建议使用R*tree。我做了一张新桌子:

CREATE VIRTUAL TABLE convexhull_edges USING rtree(
   feature_feature_table_id,             
   rtMin, rtMax,      
   mzMin, mzMax,       
); 

并将我的查询更改为:

self.cursor.execute("SELECT precursor_id, feature_table_id "+
                    "FROM `MSMS_precursor` "+
                    "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
                    "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "INNER JOIN `convexhull_edges` ON convexhull_edges.feature_feature_table_id = feature.feature_table_id "
                    "WHERE spectrum.scan_start_time BETWEEN convexhull_edges.rtMin AND convexhull_edges.rtMax "+
                    "AND MSMS_precursor.ion_mz BETWEEN convexhull_edges.mzMin AND convexhull_edges.mzMax "+
                    "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)

结果如下:

EXPLAIN QUERY PLAN: 
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0572800636292 seconds
it fetched: 13140 rows
fetchall took 34.4445540905 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.819370031357 seconds
it fetched: 25402 rows
fetchall took 3625.72873998 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.878498077393 seconds
it fetched: 6761 rows
fetchall took 1419.34246588 seconds
inserting took 0.340960025787 seconds
It took 1420.56637716 seconds

所以比我以前的方法快了一点,但还是不够快。下一步我将尝试web-bod的解决方案。


更新3

使用WebBod的解决方案,我得到了以下时间:

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0521960258484 seconds
it fetched: 13052 rows
fetchall took 90.5810132027 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum  USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.278959989548 seconds
it fetched: 25195 rows
fetchall took 4310.6012361 seconds

第三个遗憾的是因为重新启动而没有完成。所以这比我的第一个解决方案快一点,但是比使用R*Tree慢一点


更新4

在处理另一个异常缓慢的查询时,我看到它进入了一个不间断的睡眠状态(参见this question)。所以我在运行这个查询时检查了top,它在R和D状态之间切换,将CPU使用率从100%降低到50%。这可能就是为什么它在提供所有解决方案的情况下运行得如此缓慢的原因。


更新5

我迁移到MySQL,但得到的结果是一样的。


Tags: idindextimetablestartnullfeaturespectrum
3条回答

考虑对查询中涉及的表使用covering indices

您确实在select语句以及相应的inner joinwhere子句中获取了有限数量的列。通过使用一个包含列的覆盖索引,您将得到一个非常快速的查询,即您将删除scan table,而不是^{} using a ^{}

尝试在表中使用这些索引:

CREATE INDEX `fk_covering_feature` ON `feature` (`msrun_msrun_id`,`mzMin`,`mzMax`,`rtMin`,`rtMax`,`feature_table_id`);
CREATE INDEX `fk_covering_spectrum` ON  `spectrum` (`msrun_msrun_id`,`scan_start_time`,`spectrum_id`);
CREATE INDEX `fk_covering_MSMS_precursor` ON  `MSMS_precursor` (`spectrum_spectrum_id`,`ion_mz`,`precursor_id`);

在追求速度时,还应该提示查询计划器理解msrun msrun id是一个常量,用于检查featurespectrum表。在查询中添加常量测试,方法是将此附加测试放在查询的末尾(并通过spectrumFeature_InputValues两次):

"AND spectrum.msrun_msrun_id = ?"

我建议您尝试使用R*Tree index,它们是为高效的范围查询而设计的。


我实际上并没有太多使用R*Tree,只是阅读了文档,但我认为您可能使用错误。您可以尝试将查询更改为使用

WHERE convexhull_edges.rtMin <= spectrum.scan_start_time AND convexhull_edges.rtMax >= spectrum.scan_start_time AND
convexhull_edges.mzMin <= MSMS_precursor.ion_mz AND convexhull_edges.mzMax >= MSMS_precursor.ion_mz

这应该相当于您当前的查询,但我认为应该更快(您应该从R*树中选择一个范围,而不是将点与范围进行比较)

执行时间与每个表中的行数成几何比例,而不是算术比例,例如

3 tables with 10 rows each => 1,000 comparision

3 tables with 10, 10 and 40 rows => 4,000 comparisons

3 tables with 20 rows each => 8,000 comparisons

您可能会重新调整查询的因子,以避免某些连接/游标—何时需要答案?

你能这样做吗:

SELECT precursor_id, feature_table_id 
FROM MSMS_precursor
INNER JOIN 

    (
        SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, msrun_msrun_id

        FROM spectrum
        INNER JOIN 

           (select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
            from feature
            where feature.msrun_msrun_id = 'value'
           ) subquery 

        ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id
        WHERE 
            spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax 
    ) subquery

    ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id 

WHERE 
    MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax 

使用子查询可以减少表之间的比较次数-可以在搜索合适的前体之前快速筛选出不需要的特征,然后筛选出不相关的光谱。

我不使用SQLLite-但原则仍然适用。

更新:修复了SQL中的错误

注:

你不必担心这些,你只会得到:

  • feature.msrun_msrun_id='值'
  • 这些特征的光谱和光谱扫描的起始时间 subquery.rtMin和subquery.rtMax
  • 这些光谱的前体,以及subquery.mzMin和 子查询.mzMax

5月18日更新:

这是索引!!!您在搜索字段上有索引,但在参与联接的字段上没有索引-外键索引确实提高了性能:

CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC); 
CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC); 
CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC); 

相关问题 更多 >