Kibana Index Management does not update the document count

I have started using elasticsearch-dsl to work with Elasticsearch and Kibana, following this guide: https://elasticsearch-dsl.readthedocs.io/en/latest/index.html#persistence-example

Everything seems to work fine. However, when I refresh the stats in Kibana's Index Management panel, the document count does not update until I run a search (it may be a coincidence, but I doubt it).

This is the code I use to insert documents into Elasticsearch:

import datetime

from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=['localhost'])

# df is a pandas DataFrame with one row per document to index
for index, doc in df.iterrows():
    new_cluster = Cluster(meta={'id': doc.url_hashed},
                          title=doc.title,
                          cluster=doc.cluster,
                          url=doc.url,
                          paper=doc.paper,
                          published=doc.published,
                          entered=datetime.datetime.now())
    new_cluster.save()

where Cluster is the custom Document class that defines the structure of the index:

from datetime import datetime
from elasticsearch_dsl import Document, Date, Integer, Keyword, Text
from elasticsearch_dsl.connections import connections

class Cluster(Document):
    title = Text(analyzer='standard', fields={'raw': Keyword()})
    cluster = Integer()
    url = Text()
    paper = Text()
    published = Date()
    entered = Date()

    class Index:
        name = 'cluster'

    def save(self, **kwargs):
        return super(Cluster, self).save(**kwargs)
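
As a side note, the persistence example in the guide also calls init() on the Document subclass once, so that the index and its mapping exist before the first document is saved. A minimal sketch (run once, e.g. right after create_connection):

# create the "cluster" index and its mapping in Elasticsearch
Cluster.init()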

This is the panel I am looking at: https://www.screencast.com/t/zpEhv66Np After running the for loop above and clicking the "Reload index" button in Kibana, the numbers stay the same. They only change once I run a search from my script (added just for testing):

from elasticsearch_dsl import Search

s2 = Search(using=client, index="cluster")   # client: the low-level Elasticsearch client
test_df = pd.DataFrame(d.to_dict() for d in s2.scan())

Why does this happen? Thanks a lot!


1 Answer

First of all, you have a single node (which probably acts as both master and data node), and Index Management shows your index health as yellow. That means no replica shards are assigned: with only one node a replica cannot be allocated, because a replica is a copy of a primary shard that has to live on a different node (so one replica requires at least two data nodes). To bring the cluster back to green, set the number of replicas for this index to 0:

PUT /<YOUR_INDEX>/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}
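
If you prefer to do this from Python rather than the Kibana console, here is a minimal sketch using the low-level client that elasticsearch-dsl already manages (the exact parameter name, body vs. settings, depends on your elasticsearch-py version):

from elasticsearch_dsl.connections import connections

es = connections.get_connection()  # the low-level Elasticsearch client
es.indices.put_settings(index="cluster",
                        body={"index": {"number_of_replicas": 0}})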

As for the document count: after a bulk of indexing operations, a flush has to happen before the documents are written to disk. From the docs:

Flushing an index is the process of making sure that any data that is currently only stored in the transaction log is also permanently stored in the Lucene index. When restarting, Elasticsearch replays any unflushed operations from the transaction log into the Lucene index to bring it back into the state that it was in before the restart. Elasticsearch automatically triggers flushes as needed, using heuristics that trade off the size of the unflushed transaction log against the cost of performing each flush.

Once each operation has been flushed it is permanently stored in the Lucene index.

Basically, when you index N documents in a batch you do not see them immediately, because they have not yet been written to the Lucene index. You can trigger a flush manually once the bulk operation has finished:

POST /<YOUR_INDEX>/_flush

and then check the number of documents in the index:

GET _cat/indices?v&s=index
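
The same two steps from Python, again as a sketch against the low-level elasticsearch-py client:

es = connections.get_connection()
es.indices.flush(index="cluster")            # persist pending operations (like POST /cluster/_flush)
print(es.count(index="cluster")["count"])    # number of documents Elasticsearch reports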

You can also have the index refreshed automatically every N seconds, for example:

PUT /<YOUR_INDEX>/_settings
{
    "index" : {
        "refresh_interval" : "1s"
    }
} 
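
Or, from Python, a sketch that sets the interval and/or forces an immediate refresh after your loop (again assuming the standard elasticsearch-py client methods):

es = connections.get_connection()
es.indices.put_settings(index="cluster",
                        body={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index="cluster")  # or trigger a single refresh right away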

You can read more about this in the docs, but my advice is: as long as the document count ends up matching the number of documents you indexed, do not worry about it, and use the Kibana Dev Tools rather than the Index Management GUI.
