MongoDB - Optimizing for random reads
First of all, I'm using MongoDB 3.0 with the new WiredTiger storage engine, with snappy compression enabled.
The use case I'm trying to understand and optimize from a technical point of view is the following:
I have a fairly large collection, 500 million documents taking about 180 GB including indexes.
Example document:
{ _id: 123234, type: "car", color: "blue", description: "bla bla" }
All queries consist of finding documents with a specific field value. So:
thing.find( { type: "car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern of this data will be completely random. At a given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
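For reference, here's the minimal shape of the index and query I'm describing (just a sketch; collection name taken from the example above):

// Single-field index on type (MongoDB 3.0 shell syntax)
db.thing.createIndex( { type: 1 } )

// The index makes the lookup itself cheap, but the matched documents
// still have to come from cache or disk
db.thing.find( { type: "car" } )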
What this means in my mind is that caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache is the indexes. Is an estimation of the working set hard, if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use, and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find that MongoDB works quite well on fairly limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time, and obviously instantly if the result set is already in the cache. But as stated, this will most likely not be the typical case. It is not critical that the queries are lightning fast; it is more important that they are dependable and that the database runs in a stable manner.
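For completeness, the only configuration knob I've really looked at is the cache size: mongod 3.0 takes a --wiredTigerCacheSizeGB option to cap it, and the shell can show how the cache is actually being used. A rough check (stat names as serverStatus reports them under WiredTiger):

var cache = db.serverStatus().wiredTiger.cache
cache["maximum bytes configured"]      // configured cache ceiling
cache["bytes currently in the cache"]  // how much of it is in use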
Edit:
I guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived, but in essence that is what the queries look like. When I mentioned multiple indexes, I meant the type and color fields in that example. So documents will be queried on using these fields. For now, I only care about returning all documents that have a specific type, color, etc. Naturally, I plan to only query on fields that I have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. With 500 million documents, each of these indexes is about 2.5 GB and fits easily in RAM.
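In case it matters, this is how I'm measuring the index sizes from the shell (collection name as in the example):

db.thing.stats().indexSizes   // size of each index, in bytes
db.thing.totalIndexSize()     // combined size of all indexes, in bytes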
Regarding the average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats(), so I guess it refers to the compressed data on disk, and not how much space it takes up once in RAM.
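For reference, these are the db.stats() fields I'm reading this from; as far as I understand, storageSize is where the on-disk compression would show up:

var s = db.stats()
s.avgObjSize    // average object size, in bytes
s.dataSize      // logical size of all documents
s.storageSize   // space allocated on disk (where compression shows, if I read the docs right)
s.indexSize     // total size of all indexes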
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads uniformly at random over type (which is what I'm taking "I have no idea what range of documents will be accessed" to mean), then you should see fairly stable performance from the database. You will be doing some stable proportion of reads from cache, purely by luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything else that will help, besides just better hardware. Indexes should remain in RAM because they'll constantly be in use.
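One cheap sanity check either way: make sure the planner is actually using the index and not falling back to a collection scan. Something like this (collection and field names taken from your question):

// "executionStats" shows the winning plan plus how many index keys
// and documents were examined for the query
db.thing.find( { type: "car" } ).explain("executionStats")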
I suppose more information would help. You mention only one simple query on type, but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of the docs of a certain type, or all of them? What do inserts and updates to this collection look like?
Also, if the documents are being read uniformly at random over the dataset, then the working set is all of the data.