I am using Titan 0.4 on Cassandra and have an indexed key ("ip_address" in my case) that is non-unique, for performance and scalability reasons. The challenge is that the graph allows duplicate vertices. I am running a background task to clean up duplicate vertices in the graph by iterating through all vertices. What is the best way or approach to identify duplicate vertices in the graph? The estimated size of the graph in production is around 10M ~ 15M vertices, or more than that. Is there any feature in the Titan index that helps identify duplicates? Thanks in advance.
Index creation Gremlin script:
g.makeKey("ip_address").dataType(String.class).indexed("standard", Vertex.class).make();
I would start with a Titan/Hadoop job:
g.V().ip_address.groupCount()
Then use the IP addresses with a count > 1 to clean up / merge the duplicated vertices in OLTP mode.
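The selection step described above can be sketched in plain Java, independent of Titan. This is only an illustration under stated assumptions: the counts map stands in for the output of the groupCount job, the vertex ids stand in for the result of an index lookup on "ip_address", and the rule "keep the smallest id" is an arbitrary choice, not something Titan prescribes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides which ip_address values are duplicated
// and, per value, which vertex survives and which ones get merged away.
public class DuplicateMergePlan {

    // Return the keys whose count is greater than 1, sorted for stable output.
    // The counts map is assumed to come from the groupCount job.
    static List<String> duplicatedKeys(Map<String, Long> counts) {
        List<String> dups = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                dups.add(e.getKey());
            }
        }
        Collections.sort(dups);
        return dups;
    }

    // Pick the vertex to keep for one duplicated key.
    // Assumption: the smallest id wins; any deterministic rule would do.
    static long survivor(List<Long> vertexIds) {
        if (vertexIds.isEmpty()) {
            throw new IllegalArgumentException("no vertex ids given");
        }
        return Collections.min(vertexIds);
    }

    // List the vertices to merge into the survivor and then remove.
    static List<Long> toRemove(List<Long> vertexIds) {
        long keep = survivor(vertexIds);
        List<Long> rest = new ArrayList<>();
        for (long id : vertexIds) {
            if (id != keep) {
                rest.add(id);
            }
        }
        Collections.sort(rest);
        return rest;
    }
}
```

In the actual OLTP pass, each id in toRemove would have its edges and properties moved to the survivor before the vertex is deleted, with a commit per IP address so that a failure only affects one group.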