* Normalize Unicode paths in the database so the same database can be used across different platforms
* Check whether Unicode normalization should be applied without using a regexp
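A minimal sketch of the idea, not the actual change: the function name `normalize_path`, the choice of NFC, and the `str.isascii()` shortcut are all assumptions made for illustration. Pure-ASCII strings are never altered by normalization, so a cheap ASCII check can stand in for a regexp.

```python
import unicodedata

def normalize_path(path, form="NFC"):
    """Return path in the given Unicode normalization form.

    Pure-ASCII strings are unaffected by normalization, so they are
    returned untouched without calling unicodedata.normalize.
    """
    if path.isascii():
        return path
    return unicodedata.normalize(form, path)

# The same file name in decomposed form (e + combining accent) and in
# composed form (single precomposed character).
decomposed = "cafe\u0301.txt"
composed = "caf\u00e9.txt"
same_key = normalize_path(decomposed) == normalize_path(composed)
```

With both spellings mapped to one form, a path stored on one platform can be looked up on another that reports the name differently.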
1. Use 2 new data structures:
   - paths (set): contains all the files in the actual filesystem
   - hashes (dictionary): replaces the sqlite query with dict[hash] = set(db paths)
2. Minimal unit tests created with bats (bash script)
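The two structures could look roughly like this. It is a sketch under assumptions: the helper names (`collect_paths`, `index_by_hash`, `find_renamed_from`) are hypothetical, and the database rows are taken to be plain (path, hash) pairs.

```python
import os
from collections import defaultdict

def collect_paths(root):
    """Walk root and return the set of all file paths found under it."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.add(os.path.join(dirpath, name))
    return paths

def index_by_hash(db_rows):
    """Build dict[hash] = set(db paths) from (path, hash) pairs."""
    hashes = defaultdict(set)
    for path, digest in db_rows:
        hashes[digest].add(path)
    return hashes

def find_renamed_from(new_hash, hashes, paths):
    """Return a stored path with the same hash that is missing on disk, or None."""
    for old_path in sorted(hashes.get(new_hash, ())):
        if old_path not in paths:
            return old_path
    return None
```

Looking up a hash becomes a dictionary access instead of one sqlite query per file, and membership in `paths` answers whether a stored path still exists.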
See https://github.com/ambv/bitrot/issues/23 for details.
Was failing on my machine; traced it to line 189, which used a hard-coded 'utf-8' encoding. Everywhere else uses FSENCODING (which on my machine is 'mbcs'), so I replaced it here as well.
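For illustration only, assuming FSENCODING is derived from `sys.getfilesystemencoding()`; the helpers `to_bytes` and `to_text` are hypothetical names, not functions from the code base. The point is that encoding and decoding go through the same constant rather than a literal 'utf-8'.

```python
import sys

# Assumed to mirror the module-level constant mentioned above.
FSENCODING = sys.getfilesystemencoding()

def to_bytes(path):
    """Encode a text path with the filesystem encoding, not a fixed 'utf-8'."""
    return path.encode(FSENCODING)

def to_text(raw):
    """Decode a bytes path with the same encoding used to produce it."""
    return raw.decode(FSENCODING)

round_trip = to_text(to_bytes("notes.txt"))
```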
I had a semi-corrupt encfs (which I detected thanks to this tool!). A
file was only partially readable: somewhere in the middle, an IOError
occurred. Essentially, this is a corrupt file system, which is exactly
what this tool is meant to help detect, so this class of errors
shouldn't be suppressed by "-q".
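A sketch of the intended behaviour, with hypothetical names (`hash_file`, `check_file`) and an assumed SHA-1 digest and message format: read errors are reported even when the quiet flag is set, while routine output stays suppressed.

```python
import hashlib
import sys

def hash_file(path, chunk_size=16384):
    """Return the SHA-1 hex digest of the file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_file(path, quiet=False, err=sys.stderr):
    """Hash one file; return the digest, or None if it could not be read."""
    try:
        digest = hash_file(path)
    except OSError as e:  # IOError is an alias of OSError on Python 3
        # Reported even when quiet is set: an unreadable file is a finding.
        print("warning: cannot compute hash of {}: {}".format(path, e), file=err)
        return None
    if not quiet:
        print("ok: {}".format(path))
    return digest
```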
I get that this is some kind of a grey area due to the underlying race
condition (files vanishing after they have been scanned). However, if we
can't stat() a file, there can be many different causes -- the file having
vanished is just one of them. Since this tool is meant to help detect
bit rot and corrupt file systems, I'd rather be informed about
un-stat-able files.
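One way this could look, as a sketch: `stat_or_report` is a hypothetical helper and the message format is an assumption. Any OSError from stat() is reported together with its reason instead of being silently skipped.

```python
import os
import sys

def stat_or_report(path, err=sys.stderr):
    """Return os.stat(path), or None after reporting why stat() failed."""
    try:
        return os.stat(path)
    except OSError as e:
        # A vanished file (ENOENT) is only one possible cause; permission
        # problems or I/O errors end up here too, so the reason is printed.
        print("warning: cannot stat {}: {}".format(path, e.strerror), file=err)
        return None
```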
When renaming a file, its hash can't be used in the WHERE
condition of the UPDATE statement, since there _can_ be more
than one file with the same hash, and not all of them are
renamed, just the one that no longer exists. So we need to use
the old path (now non-existent) to specify the record to
update.
To make the code clearer, I also set the hash explicitly
in the UPDATE statement.
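The following sketch shows the difference on an in-memory database. The table layout and the helper `record_rename` are assumptions for illustration; the real schema and statement may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema for illustration; the real table may differ.
conn.execute("CREATE TABLE bitrot (path TEXT PRIMARY KEY, mtime INTEGER, hash TEXT)")
conn.executemany(
    "INSERT INTO bitrot VALUES (?, ?, ?)",
    [("a/copy1.txt", 100, "abc123"), ("a/copy2.txt", 100, "abc123")],
)

def record_rename(conn, old_path, new_path, new_mtime, new_hash):
    """Point the record for old_path at new_path, matching on the old path."""
    conn.execute(
        "UPDATE bitrot SET mtime=?, path=?, hash=? WHERE path=?",
        (new_mtime, new_path, new_hash, old_path),
    )

# Only copy2 was renamed; copy1 shares its hash and must stay untouched.
record_rename(conn, "a/copy2.txt", "b/renamed.txt", 200, "abc123")
rows = conn.execute("SELECT path, mtime FROM bitrot ORDER BY path").fetchall()
```

With `WHERE hash=?` instead, both rows would match, and under this assumed schema the update would try to give them the same path.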