Titan: Identify and remove duplicate vertices in a graph


I am using Titan 0.4 on Cassandra and have an indexed key ("ip_address" in this case) that is non-unique, for performance and scalability. The challenge is that the graph allows duplicate vertices. I am running a background task to clean up duplicate vertices by iterating through all vertices in the graph. What is the best way or approach to identify duplicate vertices? The estimated size of the graph in production is around 10M ~ 15M vertices, or more. Is there any feature in the Titan index that helps identify duplicates? Thanks in advance.

Index creation Gremlin script:

g.makeKey("ip_address").dataType(String.class).indexed("standard", Vertex.class).make();

I would start with a Titan/Hadoop job:

g.V().ip_address.groupCount()

Then use the IP addresses with a count > 1 to clean up / merge the duplicated vertices in OLTP mode.
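
The "count > 1" filtering step can be illustrated with a minimal in-memory sketch. This is plain Java and does not call the Titan or Hadoop API: the class `DuplicateGroups`, the method `findDuplicates`, and the vertex-id-to-IP map are all hypothetical stand-ins for whatever the group-count job actually outputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Titan: models the "count > 1" step in memory.
class DuplicateGroups {

    // Groups vertex ids by their ip_address value and keeps only the
    // groups that contain more than one vertex id (the duplicates).
    static Map<String, List<Long>> findDuplicates(Map<Long, String> ipByVertexId) {
        Map<String, List<Long>> groups = new TreeMap<>();
        for (Map.Entry<Long, String> entry : ipByVertexId.entrySet()) {
            groups.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                  .add(entry.getKey());
        }
        // Drop every ip_address that maps to a single vertex.
        groups.values().removeIf(ids -> ids.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        // Assumed sample data: vertex id -> ip_address.
        Map<Long, String> sample = new LinkedHashMap<>();
        sample.put(1L, "10.0.0.1");
        sample.put(2L, "10.0.0.2");
        sample.put(3L, "10.0.0.1");
        System.out.println(findDuplicates(sample));
    }
}
```

Each remaining group would then be handled in OLTP mode: one vertex in the list is kept as the survivor, and the others are merged into it and removed. How the merge is done (which properties and edges to move) depends on the data model and is not covered by this sketch.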

