MongoDB: fastest way to compare two large sets


I have a MongoDB collection with more than 20 million documents (and it's growing fast). Some documents have a 'user_id' field (others don't).

I regularly need to check whether some user_ids exist in the collection. And there can be a lot of 'some': between 10k and 100k ids.

What's the fastest way to do that?

My first thought was to build one big query:

$or : [ { 'user_id' : ... } , { ... } , ... ] 

With 10k to 100k ids... it's slow, and I don't think it's the right way to do it. Someone told me about Redis' Bloom filters, but it seems Mongo has nothing like that.

Do you see a better way to do it? Bloom filters, i.e. < 100% accuracy, would be acceptable in my case.

You can try using $in for the set comparison:

db.users.find({ _id : { $in : [ 1, 2, 3, 4 ] } }) 

If you have an index on the field you are searching over, the operation should be fast. If you don't have an index, and you anticipate needing to repeat this query often, you should build one. As mentioned in the comments, a sparse index would suit your collection, since the user_id field is only present in some of the documents.
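For example, a minimal pymongo sketch for building that index (assuming the same connection and database names as the benchmark below; note the benchmark itself queries _id, which always has an index, so it skips this step):

from pymongo import MongoClient

client = MongoClient()
db = client["anonymised"]

# Sparse index: only documents that actually contain a user_id field
# get index entries, which keeps the index compact for this collection.
db.users.create_index("user_id", sparse=True)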

I ran a benchmark in IPython on a collection of ~200m documents, using a test database on a relatively high-spec laptop:

import random
from pymongo import MongoClient

client = MongoClient()
db = client["anonymised"]

# generate 100k random ids for testing purposes
ids = [ random.randint(0, int(1e6)) for i in range(100000) ]

%time db.users.count({ "_id" : { "$in" : ids } })
CPU times: user 182 ms, sys: 75.5 ms, total: 257 ms
Wall time: 16.1 s
# returns 32631
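A caveat I haven't benchmarked: with much larger id lists the query document itself grows, and a single query must stay under MongoDB's 16 MB BSON document limit. A hypothetical helper that splits the ids into chunks avoids that (using count_documents, the modern replacement for the deprecated count(); the chunk size is illustrative, not a benchmarked value):

def count_existing(collection, ids, chunk_size=10000):
    # Query the ids in chunks so each query document stays small.
    total = 0
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        total += collection.count_documents({ "_id" : { "$in" : chunk } })
    return total

# e.g. count_existing(db.users, ids)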

If you want to improve on this, have a look at sharding the database to keep a more important fraction of it in active memory. In a production environment with the entire working set in memory, this operation would presumably be faster.

By contrast, the '$or' approach took:

query = [ { "_id" : v } for v in ids ]

%time db.users.count({ "$or" : query })
CPU times: user 1.4 s, sys: 682 ms, total: 2.08 s
Wall time: 35min 30s
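Finally, on the Bloom filter idea: MongoDB has no built-in support, but since < 100% accuracy is acceptable you could maintain one client-side. Below is a minimal pure-Python sketch using the standard sizing formulas and double hashing; the class name, parameters, and populate loop are all illustrative, and you still pay one full collection scan to build it, plus an add() for every user_id inserted afterwards:

import hashlib
import math

class BloomFilter:
    # No false negatives; false-positive rate tunable via error_rate.
    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m bits and k hash functions for n = capacity items.
        self.m = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / capacity) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        # Derive k bit positions from two 64-bit halves of a SHA-256 digest
        # (double hashing: h1 + i*h2 mod m).
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

# Populate once from the collection (one full scan), then test in memory:
# bf = BloomFilter(capacity=20_000_000, error_rate=0.01)
# for doc in db.users.find({ "user_id" : { "$exists" : True } }, { "user_id" : 1 }):
#     bf.add(doc["user_id"])
# "uid in bf" -> False means definitely absent; True means present
# with ~99% confidence (1% false positives at this error_rate).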
