java - Check for differences between two (large) files


I want to write a relatively simple program that can back up files from my computer to a remote location and encrypt them in the process, while also computing a diff between the local and the remote files to see which ones have changed and need to be updated. (Well, not really a diff: I'm content with seeing whether a file has changed at all, not what has changed.)

I am aware that there are programs out there that already do this (rsync, or others based on duplicity). I'm not trying to reinvent the wheel; it's just supposed to be a learning experience for myself.

My question is about the diff part of the project. I have made some assumptions and written some sample code to test them out, but I would like to know if you see something I might have missed, if the assumptions are just plain wrong, or if there's something that could go wrong in a particular constellation.

Assumption 1: If the files are not of equal length, they cannot be the same (i.e. a modification must have taken place).
Assumption 2: If two files are the same (i.e. no modification has taken place), any byte subset of these two files will have the same hash.
Assumption 3: If a byte subset of the two files is found that does not result in the same hash, the two files are not the same (i.e. they have been modified).

The code is written in Java, and the hashing algorithm used is BLAKE-512, using the Java implementation by Marc Greim.
_file1 and _file2 are two files larger than 1.5 GB, of type java.io.File.

    public boolean comparestream() throws IOException {
        int i = 0;
        int step = 4096;
        boolean equal = false;

        FileInputStream fi1 = new FileInputStream(_file1);
        FileInputStream fi2 = new FileInputStream(_file2);

        byte[] fi1content = new byte[step];
        byte[] fi2content = new byte[step];

        if (_file1.length() == _file2.length()) { // assumption 1
            while (i * step < _file1.length()) {
                fi1.read(fi1content, 0, step); // assumption 2
                fi2.read(fi2content, 0, step); // assumption 2

                equal = blake512.isequal(fi1content, fi2content); // assumption 2

                if (!equal) { // assumption 3
                    break;
                }

                ++i;
            }
        }

        fi1.close();
        fi2.close();
        return equal;
    }

The calculation for two equal 1.5 GB files takes around 4.2 seconds. Times are of course much shorter when the files differ, especially when they are of different length, since it then returns immediately.
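For the remote case, where both files cannot be streamed side by side, one option is to compute a single digest per file and compare it with a digest stored at backup time. The following is only a minimal sketch under assumptions: it uses SHA-256 from the JDK's `MessageDigest` instead of BLAKE-512, and the class `FileDigest` and its method names are hypothetical, not part of any library mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only.
final class FileDigest {

    // Streams the file through SHA-256 (assumed stand-in for BLAKE-512)
    // and returns the 32-byte digest.
    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so only
            // the n bytes actually read are fed into the digest.
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    // True if the file's current digest matches a previously stored one.
    // A mismatch proves a change; a match only makes a change very unlikely.
    static boolean matchesStoredDigest(Path file, byte[] storedDigest)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(sha256(file), storedDigest);
    }
}
```

With this approach only the digest has to travel between the two machines, at the cost of always reading the whole local file.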

Thank you for any suggestions :)
...I hope this isn't too broad.

While your assumptions are correct, they won't protect you from rare false positives (when the method says the files are equal when they aren't):

Assumption 2: If two files are the same (i.e. no modification has taken place), any byte subset will have the same hash.

This is right, but because of hash collisions you can end up in a situation where the hashes of two chunks are the same even though the chunks themselves differ.
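When both files are readable locally, the collision risk can be avoided entirely by comparing the chunks byte for byte instead of hashing them. The sketch below is one possible way to do that, not a definitive implementation; the class `FileCompare` and the method `sameContent` are hypothetical names, and it assumes Java 9 or later for `readNBytes` and the ranged `Arrays.equals`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class FileCompare {

    // Returns true only if both files contain exactly the same bytes.
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // assumption 1: different lengths, different files
        }
        final int step = 4096;
        byte[] bufA = new byte[step];
        byte[] bufB = new byte[step];
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            while (true) {
                // readNBytes blocks until 'step' bytes are read or the
                // stream ends, unlike read(), which may return fewer.
                int nA = inA.readNBytes(bufA, 0, step);
                int nB = inB.readNBytes(bufB, 0, step);
                // Compare only the bytes actually read in this round.
                if (nA != nB || !Arrays.equals(bufA, 0, nA, bufB, 0, nB)) {
                    return false;
                }
                if (nA < step) {
                    return true; // both streams reached their end
                }
            }
        }
    }
}
```

A direct comparison cannot produce a false positive, and it skips the cost of hashing. On Java 12 or later, `Files.mismatch(a, b)` from the standard library performs a comparable check and returns -1 when the contents are identical.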

