I have a MongoDB collection with more than 20 million documents (and growing fast). Some documents have a 'user_id' field (others don't).
I regularly need to check whether some user_ids exist in the collection, and there are a lot of 'some': 10k to 100k ids at a time.
How can I do that efficiently?
My first thought was to build one big query:
$or : [ { 'user_id' : ... } , { ... } , ... ]
With 10k to 100k ids this is slow, and I don't think it's the right way to do it. Someone told me about Redis' Bloom filters, but it seems Mongo doesn't have anything like that.
Do you see a better way to do it? Bloom filters, i.e. less than 100% accuracy, would be acceptable in my case.
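For reference, here is roughly what I have in mind on the client side: a minimal Bloom filter sketch in Python (class name and parameters are made up, nothing Mongo-specific):

import hashlib

class BloomFilter:
    # Minimal Bloom filter: false positives are possible, false negatives are not.
    def __init__(self, size_bits=8 * 1000 * 1000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Usage: preload once from the known user_ids, then check membership in memory.
bf = BloomFilter()
for uid in existing_user_ids:   # hypothetical iterable of the known user_ids
    bf.add(uid)
print(12345 in bf)              # True may be a false positive; False is definitive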
You can try using $in to do a set comparison:
db.users.find({ _id : { $in : [ 1, 2, 3, 4 ] } })
If you have an index on the field you are searching over, this operation should be fast. If you don't have an index and anticipate needing to repeat this query often, you should build one. As mentioned in the comments, a sparse index would suit your collection, since the user_id field is only present in some of the documents.
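For example, with pymongo, building a sparse index could look like this (the database name is just a placeholder):

from pymongo import MongoClient

client = MongoClient()
db = client["anonymised"]

# A sparse index only includes documents that actually contain the user_id
# field, which keeps the index small when many documents lack it.
db.users.create_index("user_id", sparse=True)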
I did a benchmark in IPython on a collection of ~200M documents, on a test database on a relatively high-spec laptop:
import random
from pymongo import MongoClient

client = MongoClient()
db = client["anonymised"]

# generate 100k random ids for testing purposes
ids = [ random.randint(0, int(1e6)) for _ in range(100000) ]

%time db.users.count({ "_id" : { "$in" : ids } })
CPU times: user 182 ms, sys: 75.5 ms, total: 257 ms
Wall time: 16.1 s
# returns 32631
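Note that count only tells you how many of the candidate ids exist. If you also need to know which ones, a projected find over the same $in filter works along the same lines (a sketch, not benchmarked):

# Which of the candidate ids actually exist? Projecting only _id keeps the
# documents returned over the wire small; the set difference gives the missing ids.
found = { doc["_id"] for doc in db.users.find({ "_id" : { "$in" : ids } }, { "_id" : 1 }) }
missing = set(ids) - found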
If you want to improve on this, have a look at sharding the database to keep the more important fraction of it in active memory. In a production environment where the entire working set is in memory, this operation would presumably be faster.
By contrast, the '$or' approach took:
query = [ { "_id" : v } for v in ids ]

%time db.users.count({ "$or" : query })
CPU times: user 1.4 s, sys: 682 ms, total: 2.08 s
Wall time: 35min 30s