Compare commits

No commits in common. "main" and "0.9.2" have entirely different histories.
main ... 0.9.2

8 changed files with 211 additions and 641 deletions

@ -1,13 +0,0 @@
root = true
[*]
trim_trailing_whitespace = true
insert_final_newline = true
[*.{py,pyx,pxd,pxi,yml,h}]
indent_size = 4
indent_style = space
[ext/*.{c,cpp,h}]
indent_size = 4
indent_style = tab

21
LICENSE
@ -1,21 +0,0 @@
The MIT License (MIT)
Copyright (c) 2013 Łukasz Langa
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@ -22,83 +22,31 @@ will report all errors, e.g. files that changed on the hard drive but
still have the same modification date. still have the same modification date.
All paths stored in ``.bitrot.db`` are relative so it's safe to rescan All paths stored in ``.bitrot.db`` are relative so it's safe to rescan
a folder after moving it to another drive. Just remember to move it in a folder after moving it to another drive.
a way that doesn't touch modification dates. Otherwise the checksum
database is useless.
Performance Performance
----------- -----------
Obviously depends on how fast the underlying drive is. Historically Obviously depends on how fast the underlying drive is. Since bandwidth
the script was single-threaded because back in 2013 checksum for checksum calculations is greater than your drive's data transfer
calculations on a single core still outran typical drives, including rate, even when comparing mobile CPUs vs. SSD drives, the script is
the mobile SSDs of the day. In 2020 this is no longer the case so the single-threaded.
script now uses a process pool to calculate SHA1 hashes and perform
`stat()` calls.
No rigorous performance tests have been done. Scanning a ~1000 file No rigorous performance tests have been done. Scanning a ~1000 files
directory totalling ~5 GB takes 2.2s on a 2018 MacBook Pro 15" with totalling ~4 GB takes 20 seconds on a 2015 Macbook Air (SM0256G SSD).
a AP0512M SSD. Back in 2013, that same feat on a 2015 MacBook Air with This is with cold disk cache.
a SM0256G SSD took over 20 seconds.
On that same 2018 MacBook Pro 15", scanning a 60+ GB music library takes Some other tests back from 2013: a typical 5400 RPM laptop hard drive
24 seconds. Back in 2013, with a typical 5400 RPM laptop hard drive scanning a 60+ GB music library took around 15 minutes. On an OCZ
it took around 15 minutes. How times have changed! Vertex 3 SSD drive ``bitrot`` was able to scan a 100 GB Aperture library
in under 10 minutes. Both tests on HFS+.
Tests If you'd like to contribute some more rigorous benchmarks or any
----- performance improvements, I'm accepting pull requests! :)
There's a simple but comprehensive test scenario using
`pytest <https://pypi.org/p/pytest>`_ and
`pytest-order <https://pypi.org/p/pytest-order>`_.
Install::
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv)$ pip install -e .[test]
Run::
(.venv)$ pytest -x
==================== test session starts ====================
platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/ambv/Documents/Python/bitrot
plugins: order-1.1.0
collected 12 items
tests/test_bitrot.py ............ [100%]
==================== 12 passed in 15.05s ====================
Change Log Change Log
---------- ----------
1.0.1
~~~~~
* officially remove Python 2 support that was broken since 1.0.0
anyway; now the package works with Python 3.8+ because of a few
features
1.0.0
~~~~~
* significantly sped up execution on solid state drives by using
a process pool executor to calculate SHA1 hashes and perform `stat()`
calls; use `-w1` if your runs on slow magnetic drives were
negatively affected by this change
* sped up execution by pre-loading all SQLite-stored hashes to memory
and doing comparisons using Python sets
* all UTF-8 filenames are now normalized to NFKD in the database to
enable cross-operating system checks
* the SQLite database is now vacuumed to minimize its size
* bugfix: additional Python 3 fixes when Unicode names were encountered
0.9.2 0.9.2
~~~~~ ~~~~~
@ -223,14 +171,8 @@ Authors
------- -------
Glued together by `Łukasz Langa <mailto:lukasz@langa.pl>`_. Multiple Glued together by `Łukasz Langa <mailto:lukasz@langa.pl>`_. Multiple
improvements by improvements by `Yang Zhang <mailto:yaaang@gmail.com>`_, `Jean-Louis
`Ben Shepherd <mailto:bjashepherd@gmail.com>`_, Fuchs <mailto:ganwell@fangorn.ch>`_, `Phil Lundrigan
`Jean-Louis Fuchs <mailto:ganwell@fangorn.ch>`_, <mailto:philipbl@cs.utah.edu>`_, `Ben Shepherd
`Marcus Linderoth <marcus@thingsquare.com>`_, <mailto:bjashepherd@gmail.com>`_, and `Peter Hofmann
`p1r473 <mailto:subwayjared@gmail.com>`_, <mailto:scm@uninformativ.de>`_.
`Peter Hofmann <mailto:scm@uninformativ.de>`_,
`Phil Lundrigan <mailto:philipbl@cs.utah.edu>`_,
`Reid Williams <rwilliams@ideo.com>`_,
`Stan Senotrusov <senotrusov@gmail.com>`_,
`Yang Zhang <mailto:yaaang@gmail.com>`_, and
`Zhuoyun Wei <wzyboy@wzyboy.org>`_.
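
The 1.0.0 change log entry in the README diff above says that SHA1 hashing and `stat()` calls moved into a process pool. The following is a minimal sketch of that idea, not the project's actual code: `sha1_of_file`, `hash_one`, and `hash_many` are hypothetical names chosen for illustration, and only the 16384-byte chunk size is taken from the diff.

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

CHUNK_SIZE = 16384  # same value as DEFAULT_CHUNK_SIZE in the diff


def sha1_of_file(path, chunk_size=CHUNK_SIZE):
    """Return the hex SHA1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_one(path, chunk_size=CHUNK_SIZE):
    """Worker (hypothetical helper): stat and hash a single file."""
    st = os.stat(path)
    return path, st.st_size, int(st.st_mtime), sha1_of_file(path, chunk_size)


def hash_many(paths, workers=None, chunk_size=CHUNK_SIZE):
    """Hash `paths` in a process pool and return {path: (size, mtime, sha1)}.

    Results are collected in completion order, so the caller must not rely
    on the order of `paths` being preserved while iterating.
    """
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(hash_one, p, chunk_size) for p in paths]
        for future in as_completed(futures):
            path, size, mtime, digest = future.result()
            results[path] = (size, mtime, digest)
    return results
```

Calling `hash_many(paths, workers=1)` would correspond roughly to the `-w1` setting the change log recommends for slow magnetic drives. On platforms that spawn worker processes, the call should sit under an `if __name__ == "__main__":` guard.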

30
bin/bitrot Normal file
@ -0,0 +1,30 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from bitrot import run_from_command_line
run_from_command_line()

@ -1,34 +0,0 @@
[build-system]
requires = ["setuptools", "setuptools-scm[toml]"]
build-backend = "setuptools.build_meta"
[project]
name = "bitrot"
authors = [
{name = "Łukasz Langa", email = "lukasz@langa.pl"},
]
description = "Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay."
readme = "README.rst"
requires-python = ">=3.8"
keywords = ["file", "checksum", "database"]
license = {text = "MIT"}
classifiers = [
"Development Status :: 5 - Production/Stable",
"Natural Language :: English",
"Programming Language :: Python :: 3",
"Topic :: System :: Filesystems",
"Topic :: System :: Monitoring",
"Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = []
dynamic = ["version"]
[project.optional-dependencies]
test = ["pytest", "pytest-order"]
[project.scripts]
bitrot = "bitrot:run_from_command_line"
[tool.setuptools_scm]
tag_regex = "^(?P<version>v\\d+(?:\\.\\d+){0,2}[^\\+]*)(?:\\+.*)?$"

74
setup.py Normal file
@ -0,0 +1,74 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import codecs
import os
import sys
from setuptools import setup, find_packages
current_dir = os.path.abspath(os.path.dirname(__file__))
ld_file = codecs.open(os.path.join(current_dir, 'README.rst'), encoding='utf8')
try:
long_description = ld_file.read()
finally:
ld_file.close()
# We let it die a horrible tracebacking death if reading the file fails.
# We couldn't sensibly recover anyway: we need the long description.
sys.path.insert(0, current_dir + os.sep + 'src')
from bitrot import VERSION
release = ".".join(str(num) for num in VERSION)
setup(
name = 'bitrot',
version = release,
author = u'Łukasz Langa',
author_email = 'lukasz@langa.pl',
description = ("Detects bit rotten files on the hard drive to save your "
"precious photo and music collection from slow decay."),
long_description = long_description,
url = 'https://github.com/ambv/bitrot/',
keywords = 'file checksum database',
platforms = ['any'],
license = 'MIT',
package_dir = {'': 'src'},
packages = find_packages('src'),
py_modules = ['bitrot'],
scripts = ['bin/bitrot'],
include_package_data = True,
zip_safe = False, # if only because of the readme file
install_requires = [
],
classifiers = [
'Development Status :: 4 - Beta',
'License :: OSI Approved :: MIT License',
'Natural Language :: English',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python',
'Topic :: System :: Filesystems',
'Topic :: System :: Monitoring',
'Topic :: Software Development :: Libraries :: Python Modules',
]
)

220
src/bitrot.py Executable file → Normal file
@ -1,4 +1,5 @@
#!/usr/bin/env python3 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa # Copyright (C) 2013 by Łukasz Langa
@ -20,7 +21,10 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE. # THE SOFTWARE.
from __future__ import annotations from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import argparse import argparse
import atexit import atexit
@ -34,29 +38,18 @@ import stat
import sys import sys
import tempfile import tempfile
import time import time
import unicodedata
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import freeze_support
from importlib.metadata import version, PackageNotFoundError
DEFAULT_CHUNK_SIZE = 16384 # block size in HFS+; 4X the block size in ext4 DEFAULT_CHUNK_SIZE = 16384 # block size in HFS+; 4X the block size in ext4
DOT_THRESHOLD = 200 DOT_THRESHOLD = 200
VERSION = (0, 9, 2)
IGNORED_FILE_SYSTEM_ERRORS = {errno.ENOENT, errno.EACCES} IGNORED_FILE_SYSTEM_ERRORS = {errno.ENOENT, errno.EACCES}
FSENCODING = sys.getfilesystemencoding() FSENCODING = sys.getfilesystemencoding()
try:
VERSION = version("bitrot")
except PackageNotFoundError:
VERSION = "1.0.1"
def normalize_path(path): if sys.version[0] == '2':
path_uni = path.decode(FSENCODING) str = type(u'text')
if FSENCODING in ('utf-8', 'UTF-8'): # use `bytes` for bytestrings
return unicodedata.normalize('NFKD', path_uni)
return path_uni
def sha1(path, chunk_size): def sha1(path, chunk_size):
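
The `normalize_path` helper on the `main` side of the hunk above decodes a byte path and, for UTF-8 file systems, normalizes it to NFKD. Here is a small standalone sketch of why that matters. As an assumption made for illustration, the encoding is passed as a parameter instead of being read from the module-level `FSENCODING`.

```python
import unicodedata


def normalize_path(path, fsencoding="utf-8"):
    """Decode a bytes path; normalize to NFKD when the encoding is UTF-8."""
    path_uni = path.decode(fsencoding)
    if fsencoding.lower() == "utf-8":
        return unicodedata.normalize("NFKD", path_uni)
    return path_uni


# The same visible file name, as two different byte sequences:
composed = "caf\u00e9.txt".encode("utf-8")     # precomposed e-acute (NFC form)
decomposed = "cafe\u0301.txt".encode("utf-8")  # 'e' + combining acute (NFD form)

assert composed != decomposed
# After normalization both map to the same database key.
assert normalize_path(composed) == normalize_path(decomposed)
```

Without this step, a database written on a file system that stores decomposed names could report every accented file as missing when it is checked on a system that stores precomposed names.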
@ -102,17 +95,16 @@ def get_sqlite3_cursor(path, copy=False):
def list_existing_paths(directory, expected=(), ignored=(), follow_links=False): def list_existing_paths(directory, expected=(), ignored=(), follow_links=False):
"""list_existing_paths(b'/dir') -> ([path1, path2, ...], total_size) """list_existing_paths('/dir') -> ([path1, path2, ...], total_size)
Returns a tuple with a set of existing files in `directory` and its subdirectories Returns a tuple with a list with existing files in `directory` and their
and their `total_size`. If directory was a bytes object, so will be the returned `total_size`.
paths.
Doesn't add entries listed in `ignored`. Doesn't add symlinks if Doesn't add entries listed in `ignored`. Doesn't add symlinks if
`follow_links` is False (the default). All entries present in `expected` `follow_links` is False (the default). All entries present in `expected`
must be files (can't be directories or symlinks). must be files (can't be directories or symlinks).
""" """
paths = set() paths = []
total_size = 0 total_size = 0
for path, _, files in os.walk(directory): for path, _, files in os.walk(directory):
for f in files: for f in files:
@ -137,46 +129,12 @@ def list_existing_paths(directory, expected=(), ignored=(), follow_links=False):
else: else:
if not stat.S_ISREG(st.st_mode) or p in ignored: if not stat.S_ISREG(st.st_mode) or p in ignored:
continue continue
paths.add(p) paths.append(p)
total_size += st.st_size total_size += st.st_size
paths.sort()
return paths, total_size return paths, total_size
def compute_one(path, chunk_size):
"""Return a tuple with (unicode path, size, mtime, sha1). Takes a binary path."""
p_uni = normalize_path(path)
try:
st = os.stat(path)
except OSError as ex:
if ex.errno in IGNORED_FILE_SYSTEM_ERRORS:
# The file disappeared between listing existing paths and
# this run or is (temporarily?) locked with different
# permissions. We'll just skip it for now.
print(
'\rwarning: `{}` is currently unavailable for '
'reading: {}'.format(
p_uni, ex,
),
file=sys.stderr,
)
raise BitrotException
raise # Not expected? https://github.com/ambv/bitrot/issues/
try:
new_sha1 = sha1(path, chunk_size)
except (IOError, OSError) as e:
print(
'\rwarning: cannot compute hash of {} [{}]'.format(
p_uni, errno.errorcode[e.args[0]],
),
file=sys.stderr,
)
raise BitrotException
return p_uni, st.st_size, int(st.st_mtime), new_sha1
class BitrotException(Exception): class BitrotException(Exception):
pass pass
@ -184,7 +142,7 @@ class BitrotException(Exception):
class Bitrot(object): class Bitrot(object):
def __init__( def __init__(
self, verbosity=1, test=False, follow_links=False, commit_interval=300, self, verbosity=1, test=False, follow_links=False, commit_interval=300,
chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(), chunk_size=DEFAULT_CHUNK_SIZE,
): ):
self.verbosity = verbosity self.verbosity = verbosity
self.test = test self.test = test
@ -193,7 +151,6 @@ class Bitrot(object):
self.chunk_size = chunk_size self.chunk_size = chunk_size
self._last_reported_size = '' self._last_reported_size = ''
self._last_commit_ts = 0 self._last_commit_ts = 0
self.pool = ProcessPoolExecutor(max_workers=workers)
def maybe_commit(self, conn): def maybe_commit(self, conn):
if time.time() < self._last_commit_ts + self.commit_interval: if time.time() < self._last_commit_ts + self.commit_interval:
@ -223,55 +180,67 @@ class Bitrot(object):
errors = [] errors = []
current_size = 0 current_size = 0
missing_paths = self.select_all_paths(cur) missing_paths = self.select_all_paths(cur)
hashes = self.select_all_hashes(cur)
paths, total_size = list_existing_paths( paths, total_size = list_existing_paths(
b'.', expected=missing_paths, ignored={bitrot_db, bitrot_sha512}, b'.', expected=missing_paths, ignored={bitrot_db, bitrot_sha512},
follow_links=self.follow_links, follow_links=self.follow_links,
) )
paths_uni = set(normalize_path(p) for p in paths)
futures = [self.pool.submit(compute_one, p, self.chunk_size) for p in paths]
for future in as_completed(futures): for p in paths:
p_uni = p.decode(FSENCODING)
try: try:
p_uni, new_size, new_mtime, new_sha1 = future.result() st = os.stat(p)
except BitrotException: except OSError as ex:
if ex.errno in IGNORED_FILE_SYSTEM_ERRORS:
# The file disappeared between listing existing paths and
# this run or is (temporarily?) locked with different
# permissions. We'll just skip it for now.
print(
'\rwarning: `{}` is currently unavailable for '
'reading: {}'.format(
p_uni, ex,
),
file=sys.stderr,
)
continue continue
current_size += new_size raise # Not expected? https://github.com/ambv/bitrot/issues/
new_mtime = int(st.st_mtime)
current_size += st.st_size
if self.verbosity: if self.verbosity:
self.report_progress(current_size, total_size) self.report_progress(current_size, total_size)
if p_uni not in missing_paths: missing_paths.discard(p_uni)
# We are not expecting this path, it wasn't in the database yet. try:
# It's either new or a rename. Let's handle that. new_sha1 = sha1(p, self.chunk_size)
except (IOError, OSError) as e:
print(
'\rwarning: cannot compute hash of {} [{}]'.format(
p, errno.errorcode[e.args[0]],
),
file=sys.stderr,
)
continue
cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE '
'path=?', (p_uni,))
row = cur.fetchone()
if not row:
stored_path = self.handle_unknown_path( stored_path = self.handle_unknown_path(
cur, p_uni, new_mtime, new_sha1, paths_uni, hashes cur, p_uni, new_mtime, new_sha1,
) )
self.maybe_commit(conn) self.maybe_commit(conn)
if p_uni == stored_path: if p_uni == stored_path:
new_paths.append(p_uni) new_paths.append(p) # FIXME: shouldn't that be p_uni?
missing_paths.discard(p_uni)
else: else:
renamed_paths.append((stored_path, p_uni)) renamed_paths.append((stored_path, p_uni))
missing_paths.discard(stored_path) missing_paths.discard(stored_path)
continue continue
# At this point we know we're seeing an expected file.
missing_paths.discard(p_uni)
cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE path=?',
(p_uni,))
row = cur.fetchone()
if not row:
print(
'\rwarning: path disappeared from the database while running:',
p_uni,
file=sys.stderr,
)
continue
stored_mtime, stored_sha1, stored_ts = row stored_mtime, stored_sha1, stored_ts = row
if int(stored_mtime) != new_mtime: if int(stored_mtime) != new_mtime:
updated_paths.append(p_uni) updated_paths.append(p)
cur.execute('UPDATE bitrot SET mtime=?, hash=?, timestamp=? ' cur.execute('UPDATE bitrot SET mtime=?, hash=?, timestamp=? '
'WHERE path=?', 'WHERE path=?',
(new_mtime, new_sha1, ts(), p_uni)) (new_mtime, new_sha1, ts(), p_uni))
@ -279,11 +248,11 @@ class Bitrot(object):
continue continue
if stored_sha1 != new_sha1: if stored_sha1 != new_sha1:
errors.append(p_uni) errors.append(p)
print( print(
'\rerror: SHA1 mismatch for {}: expected {}, got {}.' '\rerror: SHA1 mismatch for {}: expected {}, got {}.'
' Last good hash checked on {}.'.format( ' Last good hash checked on {}.'.format(
p_uni, stored_sha1, new_sha1, stored_ts p, stored_sha1, new_sha1, stored_ts
), ),
file=sys.stderr, file=sys.stderr,
) )
@ -293,9 +262,6 @@ class Bitrot(object):
conn.commit() conn.commit()
if not self.test:
cur.execute('vacuum')
if self.verbosity: if self.verbosity:
cur.execute('SELECT COUNT(path) FROM bitrot') cur.execute('SELECT COUNT(path) FROM bitrot')
all_count = cur.fetchone()[0] all_count = cur.fetchone()[0]
@ -317,10 +283,6 @@ class Bitrot(object):
) )
def select_all_paths(self, cur): def select_all_paths(self, cur):
"""Return a set of all distinct paths in the bitrot database.
The paths are Unicode and are normalized if FSENCODING was UTF-8.
"""
result = set() result = set()
cur.execute('SELECT path FROM bitrot') cur.execute('SELECT path FROM bitrot')
row = cur.fetchone() row = cur.fetchone()
@ -329,20 +291,6 @@ class Bitrot(object):
row = cur.fetchone() row = cur.fetchone()
return result return result
def select_all_hashes(self, cur):
"""Return a dict where keys are hashes and values are sets of paths.
The paths are Unicode and are normalized if FSENCODING was UTF-8.
"""
result = {}
cur.execute('SELECT hash, path FROM bitrot')
row = cur.fetchone()
while row:
rhash, rpath = row
result.setdefault(rhash, set()).add(rpath)
row = cur.fetchone()
return result
def report_progress(self, current_size, total_size): def report_progress(self, current_size, total_size):
size_fmt = '\r{:>6.1%}'.format(current_size/(total_size or 1)) size_fmt = '\r{:>6.1%}'.format(current_size/(total_size or 1))
if size_fmt == self._last_reported_size: if size_fmt == self._last_reported_size:
@ -355,7 +303,6 @@ class Bitrot(object):
def report_done( def report_done(
self, total_size, all_count, error_count, new_paths, updated_paths, self, total_size, all_count, error_count, new_paths, updated_paths,
renamed_paths, missing_paths): renamed_paths, missing_paths):
"""Print a report on what happened. All paths should be Unicode here."""
print('\rFinished. {:.2f} MiB of data read. {} errors found.' print('\rFinished. {:.2f} MiB of data read. {} errors found.'
''.format(total_size/1024/1024, error_count)) ''.format(total_size/1024/1024, error_count))
if self.verbosity == 1: if self.verbosity == 1:
@ -372,21 +319,21 @@ class Bitrot(object):
print('{} entries new:'.format(len(new_paths))) print('{} entries new:'.format(len(new_paths)))
new_paths.sort() new_paths.sort()
for path in new_paths: for path in new_paths:
print(' ', path) print(' ', path.decode(FSENCODING))
if updated_paths: if updated_paths:
print('{} entries updated:'.format(len(updated_paths))) print('{} entries updated:'.format(len(updated_paths)))
updated_paths.sort() updated_paths.sort()
for path in updated_paths: for path in updated_paths:
print(' ', path) print(' ', path.decode(FSENCODING))
if renamed_paths: if renamed_paths:
print('{} entries renamed:'.format(len(renamed_paths))) print('{} entries renamed:'.format(len(renamed_paths)))
renamed_paths.sort() renamed_paths.sort()
for path in renamed_paths: for path in renamed_paths:
print( print(
' from', ' from',
path[0], path[0].decode(FSENCODING),
'to', 'to',
path[1], path[1].decode(FSENCODING),
) )
if missing_paths: if missing_paths:
print('{} entries missing:'.format(len(missing_paths))) print('{} entries missing:'.format(len(missing_paths)))
@ -398,39 +345,38 @@ class Bitrot(object):
if self.test and self.verbosity: if self.test and self.verbosity:
print('warning: database file not updated on disk (test mode).') print('warning: database file not updated on disk (test mode).')
def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1, paths_uni, hashes): def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1):
"""Either add a new entry to the database or update the existing entry """Either add a new entry to the database or update the existing entry
on rename. on rename.
`cur` is the database cursor. `new_path` is the new Unicode path. Returns `new_path` if the entry was indeed new or the `stored_path` (e.g.
`paths_uni` are Unicode paths seen on disk during this run of Bitrot. outdated path) if there was a rename.
`hashes` is a dictionary selected from the database, keys are hashes, values
are sets of Unicode paths that are stored in the DB under the given hash.
Returns `new_path` if the entry was indeed new or the `old_path` (e.g.
outdated path stored in the database for this hash) if there was a rename.
""" """
cur.execute('SELECT mtime, path, timestamp FROM bitrot WHERE hash=?',
(new_sha1,))
rows = cur.fetchall()
for row in rows:
stored_mtime, stored_path, stored_ts = row
if os.path.exists(stored_path):
# file still exists, move on
continue
for old_path in hashes.get(new_sha1, ()): # update the path in the database
if old_path not in paths_uni:
# File of the same hash used to exist but no longer does.
# Let's treat `new_path` as a renamed version of that `old_path`.
cur.execute( cur.execute(
'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?', 'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?',
(new_mtime, new_path, ts(), old_path), (new_mtime, new_path, ts(), stored_path),
) )
return old_path
else: return stored_path
# Either we haven't found `new_sha1` at all in the database, or all
# currently stored paths for this hash still point to existing files. # no rename, just a new file with the same hash
# Let's insert a new entry for what appears to be a new file.
cur.execute( cur.execute(
'INSERT INTO bitrot VALUES (?, ?, ?, ?)', 'INSERT INTO bitrot VALUES (?, ?, ?, ?)',
(new_path, new_mtime, new_sha1, ts()), (new_path, new_mtime, new_sha1, ts()),
) )
return new_path return new_path
def get_path(directory=b'.', ext=b'db'): def get_path(directory=b'.', ext=b'db'):
"""Compose the path to the selected bitrot file.""" """Compose the path to the selected bitrot file."""
return os.path.join(directory, b'.bitrot.' + ext) return os.path.join(directory, b'.bitrot.' + ext)
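
The `main` side of `handle_unknown_path` above decides between "rename" and "new file" from two in-memory structures: the hash-to-paths dictionary loaded from the database and the set of paths seen on disk. The decision can be sketched without any database access. `classify_unknown_path` is a hypothetical name, and sorting the candidate paths is an assumption added here for deterministic output.

```python
def classify_unknown_path(new_path, new_sha1, paths_on_disk, hashes):
    """Return ("renamed", old_path) or ("new", new_path).

    `hashes` maps a SHA1 hex digest to the set of paths stored under it.
    `paths_on_disk` is the set of paths seen during the current scan.
    """
    for old_path in sorted(hashes.get(new_sha1, ())):
        if old_path not in paths_on_disk:
            # A file with this hash was recorded but is gone from disk:
            # treat `new_path` as its renamed successor.
            return "renamed", old_path
    # Hash unknown, or every stored path with this hash still exists.
    return "new", new_path


hashes = {"aaa": {"./old.txt"}, "bbb": {"./kept.txt"}}
on_disk = {"./kept.txt", "./moved.txt", "./copy.txt"}

print(classify_unknown_path("./moved.txt", "aaa", on_disk, hashes))
print(classify_unknown_path("./copy.txt", "bbb", on_disk, hashes))
```

The first call reports a rename from `./old.txt`, because that path is stored under the same hash but no longer exists on disk. The second reports a new file, because `./kept.txt` is still present, so `./copy.txt` is treated as a copy, not a move.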
@ -516,8 +462,6 @@ def update_sha512_integrity(verbosity=1):
def run_from_command_line(): def run_from_command_line():
global FSENCODING global FSENCODING
freeze_support()
parser = argparse.ArgumentParser(prog='bitrot') parser = argparse.ArgumentParser(prog='bitrot')
parser.add_argument( parser.add_argument(
'-l', '--follow-links', action='store_true', '-l', '--follow-links', action='store_true',
@ -543,14 +487,11 @@ def run_from_command_line():
help='just test against an existing database, don\'t update anything') help='just test against an existing database, don\'t update anything')
parser.add_argument( parser.add_argument(
'--version', action='version', '--version', action='version',
version=f"%(prog)s {VERSION}") version='%(prog)s {}.{}.{}'.format(*VERSION))
parser.add_argument( parser.add_argument(
'--commit-interval', type=float, default=300, '--commit-interval', type=float, default=300,
help='min time in seconds between commits ' help='min time in seconds between commits '
'(0 commits on every operation)') '(0 commits on every operation)')
parser.add_argument(
'-w', '--workers', type=int, default=os.cpu_count(),
help='run this many workers (use -w1 for slow magnetic disks)')
parser.add_argument( parser.add_argument(
'--chunk-size', type=int, default=DEFAULT_CHUNK_SIZE, '--chunk-size', type=int, default=DEFAULT_CHUNK_SIZE,
help='read files this many bytes at a time') help='read files this many bytes at a time')
@ -576,7 +517,6 @@ def run_from_command_line():
follow_links=args.follow_links, follow_links=args.follow_links,
commit_interval=args.commit_interval, commit_interval=args.commit_interval,
chunk_size=args.chunk_size, chunk_size=args.chunk_size,
workers=args.workers,
) )
if args.fsencoding: if args.fsencoding:
FSENCODING = args.fsencoding FSENCODING = args.fsencoding

@ -1,348 +0,0 @@
"""
NOTE: those tests are ordered and require pytest-order to run correctly.
"""
from __future__ import annotations
import getpass
import os
from pathlib import Path
import shlex
import shutil
import subprocess
import sys
from textwrap import dedent
import pytest
TMP = Path("/tmp/")
ReturnCode = int
StdOut = list[str]
StdErr = list[str]
def bitrot(*args: str) -> tuple[ReturnCode, StdOut, StdErr]:
cmd = [sys.executable, "-m", "bitrot"]
cmd.extend(args)
res = subprocess.run(shlex.join(cmd), shell=True, capture_output=True)
stdout = (res.stdout or b"").decode("utf8")
stderr = (res.stderr or b"").decode("utf8")
return res.returncode, lines(stdout), lines(stderr)
def bash(script, empty_dir: bool = False) -> bool:
username = getpass.getuser()
test_dir = TMP / f"bitrot-dir-{username}"
if empty_dir and test_dir.is_dir():
os.chdir(TMP)
shutil.rmtree(test_dir)
test_dir.mkdir(exist_ok=True)
os.chdir(test_dir)
preamble = """
set -euxo pipefail
LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
"""
if script:
# We need to wait a second for modification timestamps to differ so that
# the ordering of the output stays the same every run of the tests.
preamble += """
sleep 1
"""
script_path = TMP / "bitrot-test.bash"
script_path.write_text(dedent(preamble + script))
script_path.chmod(0o755)
out = subprocess.run(["bash", str(script_path)], capture_output=True)
if out.returncode:
print(f"Non-zero return code {out.returncode} when running {script_path}")
if out.stdout:
print(out.stdout)
if out.stderr:
print(out.stderr)
return False
return True
def lines(s: str) -> list[str]:
r"""Only return non-empty lines that weren't killed by \r."""
return [
line.rstrip()
for line in s.splitlines(keepends=True)
if line and line.rstrip() and line[-1] != "\r"
]
@pytest.mark.order(1)
def test_command_exists() -> None:
rc, out, err = bitrot("--help")
assert rc == 0
assert not err
assert out[0].startswith("usage:")
assert bash("", empty_dir=True)
@pytest.mark.order(2)
def test_new_files_in_a_tree_dir() -> None:
assert bash(
"""
mkdir -p nonemptydirs/dir2/
touch nonemptydirs/dir2/new-file-{a,b}.txt
echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
# assert out[0] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[1] == "2 entries in the database. 2 entries new:"
assert out[2] == " ./nonemptydirs/dir2/new-file-a.txt"
assert out[3] == " ./nonemptydirs/dir2/new-file-b.txt"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(3)
def test_modified_files_in_a_tree_dir() -> None:
assert bash(
"""
echo $RANDOM >> nonemptydirs/dir2/new-file-a.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "2 entries in the database. 1 entries updated:"
assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(4)
def test_renamed_files_in_a_tree_dir() -> None:
assert bash(
"""
mv nonemptydirs/dir2/new-file-a.txt nonemptydirs/dir2/new-file-a.txt2
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "2 entries in the database. 1 entries renamed:"
o3 = " from ./nonemptydirs/dir2/new-file-a.txt to ./nonemptydirs/dir2/new-file-a.txt2"
assert out[3] == o3
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(5)
def test_deleted_files_in_a_tree_dir() -> None:
assert bash(
"""
rm nonemptydirs/dir2/new-file-a.txt2
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "1 entries in the database. 1 entries missing:"
assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt2"
assert out[4] == "Updating bitrot.sha512... done."


@pytest.mark.order(5)
def test_new_files_and_modified_files_in_a_tree_dir() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $fil >> more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "8 entries in the database. 7 entries new:"
    assert out[3] == " ./more-files-a.txt"
    assert out[4] == " ./more-files-b.txt"
    assert out[5] == " ./more-files-c.txt"
    assert out[6] == " ./more-files-d.txt"
    assert out[7] == " ./more-files-e.txt"
    assert out[8] == " ./more-files-f.txt"
    assert out[9] == " ./more-files-g.txt"
    assert out[10] == "1 entries updated:"
    assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
    assert out[12] == "Updating bitrot.sha512... done."


@pytest.mark.order(6)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $fil $RANDOM >> nonemptydirs/pl-more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
        mv more-files-a.txt more-files-a.txt2
        rm more-files-g.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "14 entries in the database. 7 entries new:"
    assert out[3] == " ./nonemptydirs/pl-more-files-a.txt"
    assert out[4] == " ./nonemptydirs/pl-more-files-b.txt"
    assert out[5] == " ./nonemptydirs/pl-more-files-c.txt"
    assert out[6] == " ./nonemptydirs/pl-more-files-d.txt"
    assert out[7] == " ./nonemptydirs/pl-more-files-e.txt"
    assert out[8] == " ./nonemptydirs/pl-more-files-f.txt"
    assert out[9] == " ./nonemptydirs/pl-more-files-g.txt"
    assert out[10] == "1 entries updated:"
    assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
    assert out[12] == "1 entries renamed:"
    assert out[13] == " from ./more-files-a.txt to ./more-files-a.txt2"
    assert out[14] == "1 entries missing:"
    assert out[15] == " ./more-files-g.txt"
    assert out[16] == "Updating bitrot.sha512... done."


@pytest.mark.order(7)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir_2() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $RANDOM >> nonemptydirs/pl2-more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/pl-more-files-a.txt
        mv nonemptydirs/pl-more-files-b.txt nonemptydirs/pl-more-files-b.txt2
        cp nonemptydirs/pl-more-files-g.txt nonemptydirs/pl2-more-files-g.txt2
        cp nonemptydirs/pl-more-files-d.txt nonemptydirs/pl2-more-files-d.txt2
        rm more-files-f.txt nonemptydirs/pl-more-files-c.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "21 entries in the database. 9 entries new:"
    assert out[3] == " ./nonemptydirs/pl2-more-files-a.txt"
    assert out[4] == " ./nonemptydirs/pl2-more-files-b.txt"
    assert out[5] == " ./nonemptydirs/pl2-more-files-c.txt"
    assert out[6] == " ./nonemptydirs/pl2-more-files-d.txt"
    assert out[7] == " ./nonemptydirs/pl2-more-files-d.txt2"
    assert out[8] == " ./nonemptydirs/pl2-more-files-e.txt"
    assert out[9] == " ./nonemptydirs/pl2-more-files-f.txt"
    assert out[10] == " ./nonemptydirs/pl2-more-files-g.txt"
    assert out[11] == " ./nonemptydirs/pl2-more-files-g.txt2"
    assert out[12] == "1 entries updated:"
    assert out[13] == " ./nonemptydirs/pl-more-files-a.txt"
    assert out[14] == "1 entries renamed:"
    o15 = " from ./nonemptydirs/pl-more-files-b.txt to ./nonemptydirs/pl-more-files-b.txt2"
    assert out[15] == o15
    assert out[16] == "2 entries missing:"
    assert out[17] == " ./more-files-f.txt"
    assert out[18] == " ./nonemptydirs/pl-more-files-c.txt"
    assert out[19] == "Updating bitrot.sha512... done."


@pytest.mark.order(8)
def test_3278_files() -> None:
    assert bash(
        """
        mkdir -p alotfiles/here; cd alotfiles/here
        # create a 320 KiB file (327680 bytes)
        dd if=/dev/urandom of=masterfile bs=1 count=327680
        # split it into 3277 files of up to 100 bytes each; with masterfile that makes 3278
        split -b 100 -a 10 masterfile
        """
    )
    rc, out, err = bitrot()
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    o2 = "3299 entries in the database, 3278 new, 0 updated, 0 renamed, 0 missing."
    assert out[2] == o2


@pytest.mark.order(9)
def test_3278_files_2() -> None:
    assert bash(
        """
        mv alotfiles/here alotfiles/here-moved
        """
    )
    rc, out, err = bitrot()
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    o2 = "3299 entries in the database, 0 new, 0 updated, 3278 renamed, 0 missing."
    assert out[2] == o2


@pytest.mark.order(10)
def test_rotten_file() -> None:
    assert bash(
        """
        touch non-rotten-file
        dd if=/dev/zero of=rotten-file bs=1k count=1000 &>/dev/null
        # let's make sure they share the same timestamp
        touch -r non-rotten-file rotten-file
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "3301 entries in the database. 2 entries new:"
    assert out[3] == " ./non-rotten-file"
    assert out[4] == " ./rotten-file"


@pytest.mark.order(11)
def test_rotten_file_2() -> None:
    assert bash(
        """
        # modify the rotten file...
        dd if=/dev/urandom of=rotten-file bs=1k count=10 seek=1k conv=notrunc &>/dev/null
        # ...but revert the modification date
        touch -r non-rotten-file rotten-file
        """
    )
    rc, out, err = bitrot("-q")
    assert rc == 1
    assert not out
    e = (
        "error: SHA1 mismatch for ./rotten-file: expected"
        " 8fee1653e234fee8513245d3cb3e3c06d071493e, got"
    )
    assert err[0].startswith(e)
    assert err[1] == "error: There were 1 errors found."


@pytest.mark.order("last")
def test_cleanup() -> None:
    username = getpass.getuser()
    test_dir = TMP / f"bitrot-dir-{username}"
    if test_dir.is_dir():
        os.chdir(TMP)
        shutil.rmtree(test_dir)