Compare commits

No commits in common. "main" and "0.9.2" have entirely different histories.
main ... 0.9.2

8 changed files with 211 additions and 641 deletions

@ -1,13 +0,0 @@
root = true
[*]
trim_trailing_whitespace = true
insert_final_newline = true
[*.{py,pyx,pxd,pxi,yml,h}]
indent_size = 4
indent_style = space
[ext/*.{c,cpp,h}]
indent_size = 4
indent_style = tab

21
LICENSE

@ -1,21 +0,0 @@
The MIT License (MIT)
Copyright (c) 2013 Łukasz Langa
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@ -22,83 +22,31 @@ will report all errors, e.g. files that changed on the hard drive but
still have the same modification date.
All paths stored in ``.bitrot.db`` are relative so it's safe to rescan
a folder after moving it to another drive. Just remember to move it in
a way that doesn't touch modification dates. Otherwise the checksum
database is useless.
a folder after moving it to another drive.
Performance
-----------
Obviously depends on how fast the underlying drive is. Historically
the script was single-threaded because back in 2013 checksum
calculations on a single core still outran typical drives, including
the mobile SSDs of the day. In 2020 this is no longer the case so the
script now uses a process pool to calculate SHA1 hashes and perform
`stat()` calls.
Obviously depends on how fast the underlying drive is. Since bandwidth
for checksum calculations is greater than your drive's data transfer
rate, even when comparing mobile CPUs vs. SSD drives, the script is
single-threaded.
No rigorous performance tests have been done. Scanning a ~1000 file
directory totalling ~5 GB takes 2.2s on a 2018 MacBook Pro 15" with
an AP0512M SSD. Back in 2013, that same feat on a 2015 MacBook Air with
a SM0256G SSD took over 20 seconds.
No rigorous performance tests have been done. Scanning ~1000 files
totalling ~4 GB takes 20 seconds on a 2015 MacBook Air (SM0256G SSD).
This is with cold disk cache.
On that same 2018 MacBook Pro 15", scanning a 60+ GB music library takes
24 seconds. Back in 2013, with a typical 5400 RPM laptop hard drive
it took around 15 minutes. How times have changed!
Some other tests back from 2013: a typical 5400 RPM laptop hard drive
scanning a 60+ GB music library took around 15 minutes. On an OCZ
Vertex 3 SSD drive ``bitrot`` was able to scan a 100 GB Aperture library
in under 10 minutes. Both tests on HFS+.
Tests
-----
There's a simple but comprehensive test scenario using
`pytest <https://pypi.org/p/pytest>`_ and
`pytest-order <https://pypi.org/p/pytest-order>`_.
Install::
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv)$ pip install -e .[test]
Run::
(.venv)$ pytest -x
==================== test session starts ====================
platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/ambv/Documents/Python/bitrot
plugins: order-1.1.0
collected 12 items
tests/test_bitrot.py ............ [100%]
==================== 12 passed in 15.05s ====================
If you'd like to contribute some more rigorous benchmarks or any
performance improvements, I'm accepting pull requests! :)
Change Log
----------
1.0.1
~~~~~
* officially remove Python 2 support that was broken since 1.0.0
anyway; now the package works with Python 3.8+ because of a few
features
1.0.0
~~~~~
* significantly sped up execution on solid state drives by using
a process pool executor to calculate SHA1 hashes and perform `stat()`
calls; use `-w1` if your runs on slow magnetic drives were
negatively affected by this change
* sped up execution by pre-loading all SQLite-stored hashes to memory
and doing comparisons using Python sets
* all UTF-8 filenames are now normalized to NFKD in the database to
enable cross-operating system checks
* the SQLite database is now vacuumed to minimize its size
* bugfix: additional Python 3 fixes when Unicode names were encountered
0.9.2
~~~~~
@ -223,14 +171,8 @@ Authors
-------
Glued together by `Łukasz Langa <mailto:lukasz@langa.pl>`_. Multiple
improvements by
`Ben Shepherd <mailto:bjashepherd@gmail.com>`_,
`Jean-Louis Fuchs <mailto:ganwell@fangorn.ch>`_,
`Marcus Linderoth <marcus@thingsquare.com>`_,
`p1r473 <mailto:subwayjared@gmail.com>`_,
`Peter Hofmann <mailto:scm@uninformativ.de>`_,
`Phil Lundrigan <mailto:philipbl@cs.utah.edu>`_,
`Reid Williams <rwilliams@ideo.com>`_,
`Stan Senotrusov <senotrusov@gmail.com>`_,
`Yang Zhang <mailto:yaaang@gmail.com>`_, and
`Zhuoyun Wei <wzyboy@wzyboy.org>`_.
improvements by `Yang Zhang <mailto:yaaang@gmail.com>`_, `Jean-Louis
Fuchs <mailto:ganwell@fangorn.ch>`_, `Phil Lundrigan
<mailto:philipbl@cs.utah.edu>`_, `Ben Shepherd
<mailto:bjashepherd@gmail.com>`_, and `Peter Hofmann
<mailto:scm@uninformativ.de>`_.
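
Note on the 1.0.0 change log entry in this README diff: it says UTF-8 filenames are normalized to NFKD so that checks work across operating systems. A minimal sketch of why that matters, using only the standard library. The `same_name` helper is hypothetical and is not part of bitrot; it only illustrates that two byte-wise different spellings of one name compare equal after normalization.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two file names after NFKD normalization (hypothetical helper)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# The same visible name, stored two ways: one precomposed code point
# versus a base letter followed by a combining accent.
composed = "caf\u00e9.txt"
decomposed = "cafe\u0301.txt"

print(composed == decomposed)           # raw strings differ
print(same_name(composed, decomposed))  # normalized forms match
```

Without normalization, a database written on one system could list such a file as missing when it is rescanned on a system that stores the other form.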

30
bin/bitrot Normal file

@ -0,0 +1,30 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from bitrot import run_from_command_line
run_from_command_line()

@ -1,34 +0,0 @@
[build-system]
requires = ["setuptools", "setuptools-scm[toml]"]
build-backend = "setuptools.build_meta"
[project]
name = "bitrot"
authors = [
{name = "Łukasz Langa", email = "lukasz@langa.pl"},
]
description = "Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay."
readme = "README.rst"
requires-python = ">=3.8"
keywords = ["file", "checksum", "database"]
license = {text = "MIT"}
classifiers = [
"Development Status :: 5 - Production/Stable",
"Natural Language :: English",
"Programming Language :: Python :: 3",
"Topic :: System :: Filesystems",
"Topic :: System :: Monitoring",
"Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = []
dynamic = ["version"]
[project.optional-dependencies]
test = ["pytest", "pytest-order"]
[project.scripts]
bitrot = "bitrot:run_from_command_line"
[tool.setuptools_scm]
tag_regex = "^(?P<version>v\\d+(?:\\.\\d+){0,2}[^\\+]*)(?:\\+.*)?$"
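
Note on `tag_regex` in the removed configuration: the pattern only accepts tags that start with a literal `v`. A small sketch of what it captures, assuming the TOML escapes (`\\d`, `\\.`, `\\+`) resolve to the plain regex shown below. `version_from_tag` is a hypothetical helper for illustration, and how setuptools-scm consumes the captured group is not covered here.

```python
import re
from typing import Optional

# Same pattern as `tag_regex`, with the TOML string escapes already resolved.
TAG_REGEX = re.compile(r"^(?P<version>v\d+(?:\.\d+){0,2}[^\+]*)(?:\+.*)?$")


def version_from_tag(tag: str) -> Optional[str]:
    """Return the captured `version` group, or None if the tag does not match."""
    match = TAG_REGEX.match(tag)
    return match.group("version") if match else None


for tag in ("v1.0.1", "v1.0.1+local", "0.9.2"):
    print(tag, "->", version_from_tag(tag))
```

Under this pattern a bare tag such as `0.9.2` does not match, which may be relevant when comparing against the `0.9.2` side of this diff.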

74
setup.py Normal file

@ -0,0 +1,74 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import codecs
import os
import sys
from setuptools import setup, find_packages
current_dir = os.path.abspath(os.path.dirname(__file__))
ld_file = codecs.open(os.path.join(current_dir, 'README.rst'), encoding='utf8')
try:
long_description = ld_file.read()
finally:
ld_file.close()
# We let it die a horrible tracebacking death if reading the file fails.
# We couldn't sensibly recover anyway: we need the long description.
sys.path.insert(0, current_dir + os.sep + 'src')
from bitrot import VERSION
release = ".".join(str(num) for num in VERSION)
setup(
name = 'bitrot',
version = release,
author = u'Łukasz Langa',
author_email = 'lukasz@langa.pl',
description = ("Detects bit rotten files on the hard drive to save your "
"precious photo and music collection from slow decay."),
long_description = long_description,
url = 'https://github.com/ambv/bitrot/',
keywords = 'file checksum database',
platforms = ['any'],
license = 'MIT',
package_dir = {'': 'src'},
packages = find_packages('src'),
py_modules = ['bitrot'],
scripts = ['bin/bitrot'],
include_package_data = True,
zip_safe = False, # if only because of the readme file
install_requires = [
],
classifiers = [
'Development Status :: 4 - Beta',
'License :: OSI Approved :: MIT License',
'Natural Language :: English',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python',
'Topic :: System :: Filesystems',
'Topic :: System :: Monitoring',
'Topic :: Software Development :: Libraries :: Python Modules',
]
)

236
src/bitrot.py Executable file → Normal file

@ -1,4 +1,5 @@
#!/usr/bin/env python3
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
@ -20,7 +21,10 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from __future__ import annotations
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import argparse
import atexit
@ -34,29 +38,18 @@ import stat
import sys
import tempfile
import time
import unicodedata
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import freeze_support
from importlib.metadata import version, PackageNotFoundError
DEFAULT_CHUNK_SIZE = 16384 # block size in HFS+; 4X the block size in ext4
DOT_THRESHOLD = 200
VERSION = (0, 9, 2)
IGNORED_FILE_SYSTEM_ERRORS = {errno.ENOENT, errno.EACCES}
FSENCODING = sys.getfilesystemencoding()
try:
VERSION = version("bitrot")
except PackageNotFoundError:
VERSION = "1.0.1"
def normalize_path(path):
path_uni = path.decode(FSENCODING)
if FSENCODING in ('utf-8', 'UTF-8'):
return unicodedata.normalize('NFKD', path_uni)
return path_uni
if sys.version[0] == '2':
str = type(u'text')
# use `bytes` for bytestrings
def sha1(path, chunk_size):
@ -102,17 +95,16 @@ def get_sqlite3_cursor(path, copy=False):
def list_existing_paths(directory, expected=(), ignored=(), follow_links=False):
"""list_existing_paths(b'/dir') -> ([path1, path2, ...], total_size)
"""list_existing_paths('/dir') -> ([path1, path2, ...], total_size)
Returns a tuple with a set of existing files in `directory` and its subdirectories
and their `total_size`. If directory was a bytes object, so will be the returned
paths.
Returns a tuple with a list with existing files in `directory` and their
`total_size`.
Doesn't add entries listed in `ignored`. Doesn't add symlinks if
`follow_links` is False (the default). All entries present in `expected`
must be files (can't be directories or symlinks).
"""
paths = set()
paths = []
total_size = 0
for path, _, files in os.walk(directory):
for f in files:
@ -137,46 +129,12 @@ def list_existing_paths(directory, expected=(), ignored=(), follow_links=False):
else:
if not stat.S_ISREG(st.st_mode) or p in ignored:
continue
paths.add(p)
paths.append(p)
total_size += st.st_size
paths.sort()
return paths, total_size
def compute_one(path, chunk_size):
"""Return a tuple with (unicode path, size, mtime, sha1). Takes a binary path."""
p_uni = normalize_path(path)
try:
st = os.stat(path)
except OSError as ex:
if ex.errno in IGNORED_FILE_SYSTEM_ERRORS:
# The file disappeared between listing existing paths and
# this run or is (temporarily?) locked with different
# permissions. We'll just skip it for now.
print(
'\rwarning: `{}` is currently unavailable for '
'reading: {}'.format(
p_uni, ex,
),
file=sys.stderr,
)
raise BitrotException
raise # Not expected? https://github.com/ambv/bitrot/issues/
try:
new_sha1 = sha1(path, chunk_size)
except (IOError, OSError) as e:
print(
'\rwarning: cannot compute hash of {} [{}]'.format(
p_uni, errno.errorcode[e.args[0]],
),
file=sys.stderr,
)
raise BitrotException
return p_uni, st.st_size, int(st.st_mtime), new_sha1
class BitrotException(Exception):
pass
@ -184,7 +142,7 @@ class BitrotException(Exception):
class Bitrot(object):
def __init__(
self, verbosity=1, test=False, follow_links=False, commit_interval=300,
chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(),
chunk_size=DEFAULT_CHUNK_SIZE,
):
self.verbosity = verbosity
self.test = test
@ -193,7 +151,6 @@ class Bitrot(object):
self.chunk_size = chunk_size
self._last_reported_size = ''
self._last_commit_ts = 0
self.pool = ProcessPoolExecutor(max_workers=workers)
def maybe_commit(self, conn):
if time.time() < self._last_commit_ts + self.commit_interval:
@ -223,55 +180,67 @@ class Bitrot(object):
errors = []
current_size = 0
missing_paths = self.select_all_paths(cur)
hashes = self.select_all_hashes(cur)
paths, total_size = list_existing_paths(
b'.', expected=missing_paths, ignored={bitrot_db, bitrot_sha512},
follow_links=self.follow_links,
)
paths_uni = set(normalize_path(p) for p in paths)
futures = [self.pool.submit(compute_one, p, self.chunk_size) for p in paths]
for future in as_completed(futures):
for p in paths:
p_uni = p.decode(FSENCODING)
try:
p_uni, new_size, new_mtime, new_sha1 = future.result()
except BitrotException:
continue
st = os.stat(p)
except OSError as ex:
if ex.errno in IGNORED_FILE_SYSTEM_ERRORS:
# The file disappeared between listing existing paths and
# this run or is (temporarily?) locked with different
# permissions. We'll just skip it for now.
print(
'\rwarning: `{}` is currently unavailable for '
'reading: {}'.format(
p_uni, ex,
),
file=sys.stderr,
)
continue
current_size += new_size
raise # Not expected? https://github.com/ambv/bitrot/issues/
new_mtime = int(st.st_mtime)
current_size += st.st_size
if self.verbosity:
self.report_progress(current_size, total_size)
if p_uni not in missing_paths:
# We are not expecting this path, it wasn't in the database yet.
# It's either new or a rename. Let's handle that.
missing_paths.discard(p_uni)
try:
new_sha1 = sha1(p, self.chunk_size)
except (IOError, OSError) as e:
print(
'\rwarning: cannot compute hash of {} [{}]'.format(
p, errno.errorcode[e.args[0]],
),
file=sys.stderr,
)
continue
cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE '
'path=?', (p_uni,))
row = cur.fetchone()
if not row:
stored_path = self.handle_unknown_path(
cur, p_uni, new_mtime, new_sha1, paths_uni, hashes
cur, p_uni, new_mtime, new_sha1,
)
self.maybe_commit(conn)
if p_uni == stored_path:
new_paths.append(p_uni)
missing_paths.discard(p_uni)
new_paths.append(p) # FIXME: shouldn't that be p_uni?
else:
renamed_paths.append((stored_path, p_uni))
missing_paths.discard(stored_path)
continue
# At this point we know we're seeing an expected file.
missing_paths.discard(p_uni)
cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE path=?',
(p_uni,))
row = cur.fetchone()
if not row:
print(
'\rwarning: path disappeared from the database while running:',
p_uni,
file=sys.stderr,
)
continue
stored_mtime, stored_sha1, stored_ts = row
if int(stored_mtime) != new_mtime:
updated_paths.append(p_uni)
updated_paths.append(p)
cur.execute('UPDATE bitrot SET mtime=?, hash=?, timestamp=? '
'WHERE path=?',
(new_mtime, new_sha1, ts(), p_uni))
@ -279,11 +248,11 @@ class Bitrot(object):
continue
if stored_sha1 != new_sha1:
errors.append(p_uni)
errors.append(p)
print(
'\rerror: SHA1 mismatch for {}: expected {}, got {}.'
' Last good hash checked on {}.'.format(
p_uni, stored_sha1, new_sha1, stored_ts
p, stored_sha1, new_sha1, stored_ts
),
file=sys.stderr,
)
@ -293,9 +262,6 @@ class Bitrot(object):
conn.commit()
if not self.test:
cur.execute('vacuum')
if self.verbosity:
cur.execute('SELECT COUNT(path) FROM bitrot')
all_count = cur.fetchone()[0]
@ -317,10 +283,6 @@ class Bitrot(object):
)
def select_all_paths(self, cur):
"""Return a set of all distinct paths in the bitrot database.
The paths are Unicode and are normalized if FSENCODING was UTF-8.
"""
result = set()
cur.execute('SELECT path FROM bitrot')
row = cur.fetchone()
@ -329,20 +291,6 @@ class Bitrot(object):
row = cur.fetchone()
return result
def select_all_hashes(self, cur):
"""Return a dict where keys are hashes and values are sets of paths.
The paths are Unicode and are normalized if FSENCODING was UTF-8.
"""
result = {}
cur.execute('SELECT hash, path FROM bitrot')
row = cur.fetchone()
while row:
rhash, rpath = row
result.setdefault(rhash, set()).add(rpath)
row = cur.fetchone()
return result
def report_progress(self, current_size, total_size):
size_fmt = '\r{:>6.1%}'.format(current_size/(total_size or 1))
if size_fmt == self._last_reported_size:
@ -355,7 +303,6 @@ class Bitrot(object):
def report_done(
self, total_size, all_count, error_count, new_paths, updated_paths,
renamed_paths, missing_paths):
"""Print a report on what happened. All paths should be Unicode here."""
print('\rFinished. {:.2f} MiB of data read. {} errors found.'
''.format(total_size/1024/1024, error_count))
if self.verbosity == 1:
@ -372,21 +319,21 @@ class Bitrot(object):
print('{} entries new:'.format(len(new_paths)))
new_paths.sort()
for path in new_paths:
print(' ', path)
print(' ', path.decode(FSENCODING))
if updated_paths:
print('{} entries updated:'.format(len(updated_paths)))
updated_paths.sort()
for path in updated_paths:
print(' ', path)
print(' ', path.decode(FSENCODING))
if renamed_paths:
print('{} entries renamed:'.format(len(renamed_paths)))
renamed_paths.sort()
for path in renamed_paths:
print(
' from',
path[0],
path[0].decode(FSENCODING),
'to',
path[1],
path[1].decode(FSENCODING),
)
if missing_paths:
print('{} entries missing:'.format(len(missing_paths)))
@ -398,38 +345,37 @@ class Bitrot(object):
if self.test and self.verbosity:
print('warning: database file not updated on disk (test mode).')
def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1, paths_uni, hashes):
def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1):
"""Either add a new entry to the database or update the existing entry
on rename.
`cur` is the database cursor. `new_path` is the new Unicode path.
`paths_uni` are Unicode paths seen on disk during this run of Bitrot.
`hashes` is a dictionary selected from the database, keys are hashes, values
are sets of Unicode paths that are stored in the DB under the given hash.
Returns `new_path` if the entry was indeed new or the `old_path` (e.g.
outdated path stored in the database for this hash) if there was a rename.
Returns `new_path` if the entry was indeed new or the `stored_path` (e.g.
outdated path) if there was a rename.
"""
cur.execute('SELECT mtime, path, timestamp FROM bitrot WHERE hash=?',
(new_sha1,))
rows = cur.fetchall()
for row in rows:
stored_mtime, stored_path, stored_ts = row
if os.path.exists(stored_path):
# file still exists, move on
continue
for old_path in hashes.get(new_sha1, ()):
if old_path not in paths_uni:
# File of the same hash used to exist but no longer does.
# Let's treat `new_path` as a renamed version of that `old_path`.
cur.execute(
'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?',
(new_mtime, new_path, ts(), old_path),
)
return old_path
else:
# Either we haven't found `new_sha1` at all in the database, or all
# currently stored paths for this hash still point to existing files.
# Let's insert a new entry for what appears to be a new file.
# update the path in the database
cur.execute(
'INSERT INTO bitrot VALUES (?, ?, ?, ?)',
(new_path, new_mtime, new_sha1, ts()),
'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?',
(new_mtime, new_path, ts(), stored_path),
)
return new_path
return stored_path
# no rename, just a new file with the same hash
cur.execute(
'INSERT INTO bitrot VALUES (?, ?, ?, ?)',
(new_path, new_mtime, new_sha1, ts()),
)
return new_path
def get_path(directory=b'.', ext=b'db'):
"""Compose the path to the selected bitrot file."""
@ -516,8 +462,6 @@ def update_sha512_integrity(verbosity=1):
def run_from_command_line():
global FSENCODING
freeze_support()
parser = argparse.ArgumentParser(prog='bitrot')
parser.add_argument(
'-l', '--follow-links', action='store_true',
@ -543,14 +487,11 @@ def run_from_command_line():
help='just test against an existing database, don\'t update anything')
parser.add_argument(
'--version', action='version',
version=f"%(prog)s {VERSION}")
version='%(prog)s {}.{}.{}'.format(*VERSION))
parser.add_argument(
'--commit-interval', type=float, default=300,
help='min time in seconds between commits '
'(0 commits on every operation)')
parser.add_argument(
'-w', '--workers', type=int, default=os.cpu_count(),
help='run this many workers (use -w1 for slow magnetic disks)')
parser.add_argument(
'--chunk-size', type=int, default=DEFAULT_CHUNK_SIZE,
help='read files this many bytes at a time')
@ -576,7 +517,6 @@ def run_from_command_line():
follow_links=args.follow_links,
commit_interval=args.commit_interval,
chunk_size=args.chunk_size,
workers=args.workers,
)
if args.fsencoding:
FSENCODING = args.fsencoding
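
Note on the two `handle_unknown_path` variants in this file: the `main` side decides between "new file" and "rename" from in-memory data (a hash-to-paths dict and the set of paths seen on disk), while the `0.9.2` side queries the database and calls `os.path.exists()` for each candidate. A minimal sketch of the in-memory decision, with no database involved. `find_renamed_source` is a hypothetical name, and sorting the candidates is an assumption added here to make the result deterministic; the code in the diff iterates the set directly.

```python
from typing import Dict, Optional, Set


def find_renamed_source(
    new_sha1: str,
    hashes: Dict[str, Set[str]],
    paths_on_disk: Set[str],
) -> Optional[str]:
    """Return a stored path with the same hash that no longer exists on disk.

    None means the file should be treated as new: either the hash is unknown,
    or every stored path with that hash still exists (a copy, not a rename).
    """
    for old_path in sorted(hashes.get(new_sha1, ())):
        if old_path not in paths_on_disk:
            return old_path
    return None


# Hypothetical data: "./a.txt" was moved to "./a.txt2"; "./copy.txt" still exists.
hashes = {"abc": {"./a.txt", "./copy.txt"}, "def": {"./b.txt"}}
on_disk = {"./a.txt2", "./copy.txt", "./b.txt", "./new.txt"}

print(find_renamed_source("abc", hashes, on_disk))
print(find_renamed_source("def", hashes, on_disk))
```

Working from sets avoids one `SELECT` and one filesystem call per unknown file, which is presumably part of why the `main` side pre-loads hashes into memory.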

@ -1,348 +0,0 @@
"""
NOTE: those tests are ordered and require pytest-order to run correctly.
"""
from __future__ import annotations
import getpass
import os
from pathlib import Path
import shlex
import shutil
import subprocess
import sys
from textwrap import dedent
import pytest
TMP = Path("/tmp/")
ReturnCode = int
StdOut = list[str]
StdErr = list[str]
def bitrot(*args: str) -> tuple[ReturnCode, StdOut, StdErr]:
cmd = [sys.executable, "-m", "bitrot"]
cmd.extend(args)
res = subprocess.run(shlex.join(cmd), shell=True, capture_output=True)
stdout = (res.stdout or b"").decode("utf8")
stderr = (res.stderr or b"").decode("utf8")
return res.returncode, lines(stdout), lines(stderr)
def bash(script, empty_dir: bool = False) -> bool:
username = getpass.getuser()
test_dir = TMP / f"bitrot-dir-{username}"
if empty_dir and test_dir.is_dir():
os.chdir(TMP)
shutil.rmtree(test_dir)
test_dir.mkdir(exist_ok=True)
os.chdir(test_dir)
preamble = """
set -euxo pipefail
LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
"""
if script:
# We need to wait a second for modification timestamps to differ so that
# the ordering of the output stays the same every run of the tests.
preamble += """
sleep 1
"""
script_path = TMP / "bitrot-test.bash"
script_path.write_text(dedent(preamble + script))
script_path.chmod(0o755)
out = subprocess.run(["bash", str(script_path)], capture_output=True)
if out.returncode:
print(f"Non-zero return code {out.returncode} when running {script_path}")
if out.stdout:
print(out.stdout)
if out.stderr:
print(out.stderr)
return False
return True
def lines(s: str) -> list[str]:
r"""Only return non-empty lines that weren't killed by \r."""
return [
line.rstrip()
for line in s.splitlines(keepends=True)
if line and line.rstrip() and line[-1] != "\r"
]
@pytest.mark.order(1)
def test_command_exists() -> None:
rc, out, err = bitrot("--help")
assert rc == 0
assert not err
assert out[0].startswith("usage:")
assert bash("", empty_dir=True)
@pytest.mark.order(2)
def test_new_files_in_a_tree_dir() -> None:
assert bash(
"""
mkdir -p nonemptydirs/dir2/
touch nonemptydirs/dir2/new-file-{a,b}.txt
echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
# assert out[0] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[1] == "2 entries in the database. 2 entries new:"
assert out[2] == " ./nonemptydirs/dir2/new-file-a.txt"
assert out[3] == " ./nonemptydirs/dir2/new-file-b.txt"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(3)
def test_modified_files_in_a_tree_dir() -> None:
assert bash(
"""
echo $RANDOM >> nonemptydirs/dir2/new-file-a.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "2 entries in the database. 1 entries updated:"
assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(4)
def test_renamed_files_in_a_tree_dir() -> None:
assert bash(
"""
mv nonemptydirs/dir2/new-file-a.txt nonemptydirs/dir2/new-file-a.txt2
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "2 entries in the database. 1 entries renamed:"
o3 = " from ./nonemptydirs/dir2/new-file-a.txt to ./nonemptydirs/dir2/new-file-a.txt2"
assert out[3] == o3
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(5)
def test_deleted_files_in_a_tree_dir() -> None:
assert bash(
"""
rm nonemptydirs/dir2/new-file-a.txt2
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "1 entries in the database. 1 entries missing:"
assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt2"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(5)
def test_new_files_and_modified_files_in_a_tree_dir() -> None:
assert bash(
"""
for fil in {a,b,c,d,e,f,g}; do
echo $fil >> more-files-$fil.txt
done
echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "8 entries in the database. 7 entries new:"
assert out[3] == " ./more-files-a.txt"
assert out[4] == " ./more-files-b.txt"
assert out[5] == " ./more-files-c.txt"
assert out[6] == " ./more-files-d.txt"
assert out[7] == " ./more-files-e.txt"
assert out[8] == " ./more-files-f.txt"
assert out[9] == " ./more-files-g.txt"
assert out[10] == "1 entries updated:"
assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
assert out[12] == "Updating bitrot.sha512... done."
@pytest.mark.order(6)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir() -> None:
assert bash(
"""
for fil in {a,b,c,d,e,f,g}; do
echo $fil $RANDOM >> nonemptydirs/pl-more-files-$fil.txt
done
echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
mv more-files-a.txt more-files-a.txt2
rm more-files-g.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "14 entries in the database. 7 entries new:"
assert out[3] == " ./nonemptydirs/pl-more-files-a.txt"
assert out[4] == " ./nonemptydirs/pl-more-files-b.txt"
assert out[5] == " ./nonemptydirs/pl-more-files-c.txt"
assert out[6] == " ./nonemptydirs/pl-more-files-d.txt"
assert out[7] == " ./nonemptydirs/pl-more-files-e.txt"
assert out[8] == " ./nonemptydirs/pl-more-files-f.txt"
assert out[9] == " ./nonemptydirs/pl-more-files-g.txt"
assert out[10] == "1 entries updated:"
assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
assert out[12] == "1 entries renamed:"
assert out[13] == " from ./more-files-a.txt to ./more-files-a.txt2"
assert out[14] == "1 entries missing:"
assert out[15] == " ./more-files-g.txt"
assert out[16] == "Updating bitrot.sha512... done."
@pytest.mark.order(7)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir_2() -> None:
assert bash(
"""
for fil in {a,b,c,d,e,f,g}; do
echo $RANDOM >> nonemptydirs/pl2-more-files-$fil.txt
done
echo $RANDOM >> nonemptydirs/pl-more-files-a.txt
mv nonemptydirs/pl-more-files-b.txt nonemptydirs/pl-more-files-b.txt2
cp nonemptydirs/pl-more-files-g.txt nonemptydirs/pl2-more-files-g.txt2
cp nonemptydirs/pl-more-files-d.txt nonemptydirs/pl2-more-files-d.txt2
rm more-files-f.txt nonemptydirs/pl-more-files-c.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "21 entries in the database. 9 entries new:"
assert out[3] == " ./nonemptydirs/pl2-more-files-a.txt"
assert out[4] == " ./nonemptydirs/pl2-more-files-b.txt"
assert out[5] == " ./nonemptydirs/pl2-more-files-c.txt"
assert out[6] == " ./nonemptydirs/pl2-more-files-d.txt"
assert out[7] == " ./nonemptydirs/pl2-more-files-d.txt2"
assert out[8] == " ./nonemptydirs/pl2-more-files-e.txt"
assert out[9] == " ./nonemptydirs/pl2-more-files-f.txt"
assert out[10] == " ./nonemptydirs/pl2-more-files-g.txt"
assert out[11] == " ./nonemptydirs/pl2-more-files-g.txt2"
assert out[12] == "1 entries updated:"
assert out[13] == " ./nonemptydirs/pl-more-files-a.txt"
assert out[14] == "1 entries renamed:"
o15 = " from ./nonemptydirs/pl-more-files-b.txt to ./nonemptydirs/pl-more-files-b.txt2"
assert out[15] == o15
assert out[16] == "2 entries missing:"
assert out[17] == " ./more-files-f.txt"
assert out[18] == " ./nonemptydirs/pl-more-files-c.txt"
assert out[19] == "Updating bitrot.sha512... done."
@pytest.mark.order(8)
def test_3278_files() -> None:
assert bash(
"""
mkdir -p alotfiles/here; cd alotfiles/here
# create a 320KB file
dd if=/dev/urandom of=masterfile bs=1 count=327680
# split it in 3277 files (instantly) + masterfile = 3278
split -b 100 -a 10 masterfile
"""
)
rc, out, err = bitrot()
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
o2 = "3299 entries in the database, 3278 new, 0 updated, 0 renamed, 0 missing."
assert out[2] == o2
@pytest.mark.order(9)
def test_3278_files_2() -> None:
assert bash(
"""
mv alotfiles/here alotfiles/here-moved
"""
)
rc, out, err = bitrot()
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
o2 = "3299 entries in the database, 0 new, 0 updated, 3278 renamed, 0 missing."
assert out[2] == o2
@pytest.mark.order(10)
def test_rotten_file() -> None:
assert bash(
"""
touch non-rotten-file
dd if=/dev/zero of=rotten-file bs=1k count=1000 &>/dev/null
# let's make sure they share the same timestamp
touch -r non-rotten-file rotten-file
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "3301 entries in the database. 2 entries new:"
assert out[3] == " ./non-rotten-file"
assert out[4] == " ./rotten-file"
@pytest.mark.order(11)
def test_rotten_file_2() -> None:
assert bash(
"""
# modify the rotten file...
dd if=/dev/urandom of=rotten-file bs=1k count=10 seek=1k conv=notrunc &>/dev/null
# ...but revert the modification date
touch -r non-rotten-file rotten-file
"""
)
rc, out, err = bitrot("-q")
assert rc == 1
assert not out
e = (
"error: SHA1 mismatch for ./rotten-file: expected"
" 8fee1653e234fee8513245d3cb3e3c06d071493e, got"
)
assert err[0].startswith(e)
assert err[1] == "error: There were 1 errors found."
@pytest.mark.order("last")
def test_cleanup() -> None:
username = getpass.getuser()
test_dir = TMP / f"bitrot-dir-{username}"
if test_dir.is_dir():
os.chdir(TMP)
shutil.rmtree(test_dir)