Compare commits


26 Commits
0.9.2 ... main

Author SHA1 Message Date
Łukasz Langa
3f5eb8a0ab
Remove the test directory after successful test run 2023-08-02 13:49:12 +02:00
Łukasz Langa
955b370815
Test requirements now live in pyproject.toml 2023-08-02 13:46:28 +02:00
Łukasz Langa
ad04f72da6
Move freeze_support to the main file, remove unused bin/bitrot 2023-08-02 13:38:31 +02:00
Łukasz Langa
0e9391d087
Fix typo in README 2023-08-02 13:08:15 +02:00
Łukasz Langa
87e15913a5
Move to pyproject.toml, drop Python 2 2023-08-02 13:00:30 +02:00
Łukasz Langa
929fb39782
Move tests to pytest 2023-08-02 12:17:02 +02:00
Łukasz Langa
7f9a2e2efc
Remove unused 'wait' import 2020-06-18 20:06:53 +02:00
p1r473
6168723f5b
Unused variable deletion (#42) 2020-06-18 20:05:08 +02:00
Łukasz Langa
67e7b8c904
v1.0.0 2020-05-18 00:15:24 +02:00
Łukasz Langa
0dc3390b7f
Use a process pool to calculate hashes and perform stat()
Fixes #23
2020-05-17 22:55:35 +02:00
Łukasz Langa
104e07b66b
Fix trailing whitespace in the test 2020-05-17 22:54:27 +02:00
Łukasz Langa
52677d2b5d
Optimization: don't SELECT the path twice if it's not there 2020-05-17 21:58:30 +02:00
Łukasz Langa
45ab4501ee
Make handle_unknown_path more readable 2020-05-17 21:50:09 +02:00
Łukasz Langa
8ee84344e8
Simplify normalization and Unicode handling 2020-05-17 21:18:48 +02:00
Łukasz Langa
7608b56ea6
Remove trailing whitespace 2020-05-17 21:17:19 +02:00
Łukasz Langa
8e9e37094d
Claim 3.7 and 3.8 compatibility, can't be bothered to check 3.6 2020-05-17 20:19:10 +02:00
Łukasz Langa
9af31192c2
Make tests more robust and readable 2020-05-17 20:17:56 +02:00
Łukasz Langa
c73646d2e1
Update ACKS 2020-05-17 18:47:23 +02:00
Łukasz Langa
8ec9ea9629
Use NFKD instead of NFKC because that's what macOS uses by default 2020-05-17 18:33:23 +02:00
Łukasz Langa
c27c259282
Add MIT license to make it explicit 2020-05-17 18:28:17 +02:00
Stan Senotrusov
74f043b3ca
Normalize unicode paths in the database (#37)
* Normalize unicode paths in the database to enable use of the same database across different platforms

* Check if unicode normalization should be applied without regexp
2020-05-17 18:27:05 +02:00
Reid Williams
4ea0a57e0a
Add and remove unnecessary / needed decodes (#38)
* remove unecessary decode

* add needed decode

Co-authored-by: Reid Williams <reid@computable.io>
2020-05-17 17:07:47 +02:00
Zhuoyun Wei
6d82ff93b1 Fix a typo in README (#35) 2018-02-26 10:13:56 -08:00
p1r473
a043402114 Vacuuming (#34)
Added database vacuuming to shrink DB size on hard drive of old hashes that went missing
2017-06-13 13:34:44 -07:00
liloman
a8e52626ef Swap sqlite cursor with dictionary and set data structures (#24)
1. Use 2 new data structures:
-paths (set) contains all the files in the actual filesystem
-hashes (dictionary) substitute the sqlite query with dict[hash] = set(db paths)

2. Minimal unitary tests created with bats (bash script)

See https://github.com/ambv/bitrot/issues/23 for details.
2017-03-03 10:16:46 -08:00
Lukasz Langa
6b4a1fd46a Typo in README 2016-11-01 12:06:01 -07:00
8 changed files with 641 additions and 211 deletions

13
.editorconfig Normal file

@ -0,0 +1,13 @@
root = true
[*]
trim_trailing_whitespace = true
insert_final_newline = true
[*.{py,pyx,pxd,pxi,yml,h}]
indent_size = 4
indent_style = space
[ext/*.{c,cpp,h}]
indent_size = 4
indent_style = tab

21
LICENSE Normal file

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2013 Łukasz Langa
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@ -22,31 +22,83 @@ will report all errors, e.g. files that changed on the hard drive but
still have the same modification date.
All paths stored in ``.bitrot.db`` are relative so it's safe to rescan
a folder after moving it to another drive.
a folder after moving it to another drive. Just remember to move it in
a way that doesn't touch modification dates. Otherwise the checksum
database is useless.
Performance
-----------
Obviously depends on how fast the underlying drive is. Since bandwidth
for checksum calculations is greater than your drive's data transfer
rate, even when comparing mobile CPUs vs. SSD drives, the script is
single-threaded.
Obviously depends on how fast the underlying drive is. Historically
the script was single-threaded because back in 2013 checksum
calculations on a single core still outran typical drives, including
the mobile SSDs of the day. In 2020 this is no longer the case so the
script now uses a process pool to calculate SHA1 hashes and perform
`stat()` calls.
No rigorous performance tests have been done. Scanning a ~1000 files
totalling ~4 GB takes 20 seconds on a 2015 Macbook Air (SM0256G SSD).
This is with cold disk cache.
No rigorous performance tests have been done. Scanning a ~1000 file
directory totalling ~5 GB takes 2.2s on a 2018 MacBook Pro 15" with
an AP0512M SSD. Back in 2013, that same feat on a 2015 MacBook Air with
a SM0256G SSD took over 20 seconds.
Some other tests back from 2013: a typical 5400 RPM laptop hard drive
scanning a 60+ GB music library took around 15 minutes. On an OCZ
Vertex 3 SSD drive ``bitrot`` was able to scan a 100 GB Aperture library
in under 10 minutes. Both tests on HFS+.
On that same 2018 MacBook Pro 15", scanning a 60+ GB music library takes
24 seconds. Back in 2013, with a typical 5400 RPM laptop hard drive
it took around 15 minutes. How times have changed!
If you'd like to contribute some more rigorous benchmarks or any
performance improvements, I'm accepting pull requests! :)
Tests
-----
There's a simple but comprehensive test scenario using
`pytest <https://pypi.org/p/pytest>`_ and
`pytest-order <https://pypi.org/p/pytest-order>`_.
Install::
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv)$ pip install -e .[test]
Run::
(.venv)$ pytest -x
==================== test session starts ====================
platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/ambv/Documents/Python/bitrot
plugins: order-1.1.0
collected 12 items
tests/test_bitrot.py ............ [100%]
==================== 12 passed in 15.05s ====================
Change Log
----------
1.0.1
~~~~~
* officially remove Python 2 support, which had been broken since 1.0.0
anyway; the package now requires Python 3.8+ because it relies on a few
newer features
1.0.0
~~~~~
* significantly sped up execution on solid state drives by using
a process pool executor to calculate SHA1 hashes and perform `stat()`
calls; use `-w1` if your runs on slow magnetic drives were
negatively affected by this change
* sped up execution by pre-loading all SQLite-stored hashes to memory
and doing comparisons using Python sets
* all UTF-8 filenames are now normalized to NFKD in the database to
enable cross-operating system checks
* the SQLite database is now vacuumed to minimize its size
* bugfix: additional Python 3 fixes when Unicode names were encountered
0.9.2
~~~~~
@ -171,8 +223,14 @@ Authors
-------
Glued together by `Łukasz Langa <mailto:lukasz@langa.pl>`_. Multiple
improvements by `Yang Zhang <mailto:yaaang@gmail.com>`_, `Jean-Louis
Fuchs <mailto:ganwell@fangorn.ch>`_, `Phil Lundrigan
<mailto:philipbl@cs.utah.edu>`_, `Ben Shepherd
<mailto:bjashepherd@gmail.com`, and `Peter Hofmann
<mailto:scm@uninformativ.de>`_.
improvements by
`Ben Shepherd <mailto:bjashepherd@gmail.com>`_,
`Jean-Louis Fuchs <mailto:ganwell@fangorn.ch>`_,
`Marcus Linderoth <marcus@thingsquare.com>`_,
`p1r473 <mailto:subwayjared@gmail.com>`_,
`Peter Hofmann <mailto:scm@uninformativ.de>`_,
`Phil Lundrigan <mailto:philipbl@cs.utah.edu>`_,
`Reid Williams <rwilliams@ideo.com>`_,
`Stan Senotrusov <senotrusov@gmail.com>`_,
`Yang Zhang <mailto:yaaang@gmail.com>`_, and
`Zhuoyun Wei <wzyboy@wzyboy.org>`_.

View File

@ -1,30 +0,0 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from bitrot import run_from_command_line
run_from_command_line()

34
pyproject.toml Normal file

@ -0,0 +1,34 @@
[build-system]
requires = ["setuptools", "setuptools-scm[toml]"]
build-backend = "setuptools.build_meta"
[project]
name = "bitrot"
authors = [
{name = "Łukasz Langa", email = "lukasz@langa.pl"},
]
description = "Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay."
readme = "README.rst"
requires-python = ">=3.8"
keywords = ["file", "checksum", "database"]
license = {text = "MIT"}
classifiers = [
"Development Status :: 5 - Production/Stable",
"Natural Language :: English",
"Programming Language :: Python :: 3",
"Topic :: System :: Filesystems",
"Topic :: System :: Monitoring",
"Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = []
dynamic = ["version"]
[project.optional-dependencies]
test = ["pytest", "pytest-order"]
[project.scripts]
bitrot = "bitrot:run_from_command_line"
[tool.setuptools_scm]
tag_regex = "^(?P<version>v\\d+(?:\\.\\d+){0,2}[^\\+]*)(?:\\+.*)?$"
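To see which Git tags the `tag_regex` above accepts, the pattern can be tried directly with Python's `re` module. This is only a sketch of the pattern's matching behaviour; it does not show what setuptools-scm does with the captured value afterwards, and `version_from_tag` is a hypothetical helper name.

```python
import re

# The pattern from [tool.setuptools_scm], with TOML's doubled backslashes undone.
TAG_REGEX = re.compile(r"^(?P<version>v\d+(?:\.\d+){0,2}[^\+]*)(?:\+.*)?$")


def version_from_tag(tag):
    """Return the captured `version` group, or None if the tag doesn't match."""
    match = TAG_REGEX.match(tag)
    return match.group("version") if match else None


for tag in ("v1.0.0", "v1.0.1+local", "v1.0.0rc1", "1.0.0"):
    print(tag, "->", version_from_tag(tag))
```

The pattern requires a leading `v`, so a bare `1.0.0` tag is not matched, and anything from a `+` onwards is left out of the captured group.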

View File

@ -1,74 +0,0 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2013 by Łukasz Langa
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import codecs
import os
import sys
from setuptools import setup, find_packages
current_dir = os.path.abspath(os.path.dirname(__file__))
ld_file = codecs.open(os.path.join(current_dir, 'README.rst'), encoding='utf8')
try:
long_description = ld_file.read()
finally:
ld_file.close()
# We let it die a horrible tracebacking death if reading the file fails.
# We couldn't sensibly recover anyway: we need the long description.
sys.path.insert(0, current_dir + os.sep + 'src')
from bitrot import VERSION
release = ".".join(str(num) for num in VERSION)
setup(
name = 'bitrot',
version = release,
author = u'Łukasz Langa',
author_email = 'lukasz@langa.pl',
description = ("Detects bit rotten files on the hard drive to save your "
"precious photo and music collection from slow decay."),
long_description = long_description,
url = 'https://github.com/ambv/bitrot/',
keywords = 'file checksum database',
platforms = ['any'],
license = 'MIT',
package_dir = {'': 'src'},
packages = find_packages('src'),
py_modules = ['bitrot'],
scripts = ['bin/bitrot'],
include_package_data = True,
zip_safe = False, # if only because of the readme file
install_requires = [
],
classifiers = [
'Development Status :: 4 - Beta',
'License :: OSI Approved :: MIT License',
'Natural Language :: English',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python',
'Topic :: System :: Filesystems',
'Topic :: System :: Monitoring',
'Topic :: Software Development :: Libraries :: Python Modules',
]
)

236
src/bitrot.py Normal file → Executable file

@ -1,5 +1,4 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#!/usr/bin/env python3
# Copyright (C) 2013 by Łukasz Langa
@ -21,10 +20,7 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import annotations
import argparse
import atexit
@ -38,18 +34,29 @@ import stat
import sys
import tempfile
import time
import unicodedata
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import freeze_support
from importlib.metadata import version, PackageNotFoundError
DEFAULT_CHUNK_SIZE = 16384 # block size in HFS+; 4X the block size in ext4
DOT_THRESHOLD = 200
VERSION = (0, 9, 2)
IGNORED_FILE_SYSTEM_ERRORS = {errno.ENOENT, errno.EACCES}
FSENCODING = sys.getfilesystemencoding()
try:
VERSION = version("bitrot")
except PackageNotFoundError:
VERSION = "1.0.1"
if sys.version[0] == '2':
str = type(u'text')
# use `bytes` for bytestrings
def normalize_path(path):
    path_uni = path.decode(FSENCODING)
    if FSENCODING in ('utf-8', 'UTF-8'):
        return unicodedata.normalize('NFKD', path_uni)
    return path_uni
def sha1(path, chunk_size):
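The `normalize_path` hunk above normalizes decoded paths to NFKD so that the same database can be used across operating systems. A small sketch of why that matters, working on plain `str` values rather than the byte paths the real function receives (`normalize_name` is a hypothetical stand-in):

```python
import unicodedata

composed = "caf\u00e9.txt"     # "é" stored as a single code point
decomposed = "cafe\u0301.txt"  # "e" followed by a combining acute accent


def normalize_name(name):
    """Hypothetical stand-in for the str half of normalize_path."""
    return unicodedata.normalize("NFKD", name)


# The two spellings render identically but compare unequal as raw strings.
print(composed == decomposed)
# After NFKD normalization both collapse to the decomposed form.
print(normalize_name(composed) == normalize_name(decomposed))
```

One consequence of choosing the compatibility form (NFKD rather than NFD): compatibility characters are folded too, so a name containing the ligature `\ufb01` normalizes to plain `fi`.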
@ -95,16 +102,17 @@ def get_sqlite3_cursor(path, copy=False):
def list_existing_paths(directory, expected=(), ignored=(), follow_links=False):
"""list_existing_paths('/dir') -> ([path1, path2, ...], total_size)
"""list_existing_paths(b'/dir') -> ([path1, path2, ...], total_size)
Returns a tuple with a list with existing files in `directory` and their
`total_size`.
Returns a tuple with a set of existing files in `directory` and its subdirectories
and their `total_size`. If directory was a bytes object, so will be the returned
paths.
Doesn't add entries listed in `ignored`. Doesn't add symlinks if
`follow_links` is False (the default). All entries present in `expected`
must be files (can't be directories or symlinks).
"""
paths = []
paths = set()
total_size = 0
for path, _, files in os.walk(directory):
for f in files:
@ -129,12 +137,46 @@ def list_existing_paths(directory, expected=(), ignored=(), follow_links=False):
else:
if not stat.S_ISREG(st.st_mode) or p in ignored:
continue
paths.append(p)
paths.add(p)
total_size += st.st_size
paths.sort()
return paths, total_size
def compute_one(path, chunk_size):
    """Return a tuple with (unicode path, size, mtime, sha1). Takes a binary path."""
    p_uni = normalize_path(path)
    try:
        st = os.stat(path)
    except OSError as ex:
        if ex.errno in IGNORED_FILE_SYSTEM_ERRORS:
            # The file disappeared between listing existing paths and
            # this run or is (temporarily?) locked with different
            # permissions. We'll just skip it for now.
            print(
                '\rwarning: `{}` is currently unavailable for '
                'reading: {}'.format(
                    p_uni, ex,
                ),
                file=sys.stderr,
            )
            raise BitrotException
        raise # Not expected? https://github.com/ambv/bitrot/issues/
    try:
        new_sha1 = sha1(path, chunk_size)
    except (IOError, OSError) as e:
        print(
            '\rwarning: cannot compute hash of {} [{}]'.format(
                p_uni, errno.errorcode[e.args[0]],
            ),
            file=sys.stderr,
        )
        raise BitrotException
    return p_uni, st.st_size, int(st.st_mtime), new_sha1
class BitrotException(Exception):
pass
@ -142,7 +184,7 @@ class BitrotException(Exception):
class Bitrot(object):
def __init__(
self, verbosity=1, test=False, follow_links=False, commit_interval=300,
chunk_size=DEFAULT_CHUNK_SIZE,
chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(),
):
self.verbosity = verbosity
self.test = test
@ -151,6 +193,7 @@ class Bitrot(object):
self.chunk_size = chunk_size
self._last_reported_size = ''
self._last_commit_ts = 0
self.pool = ProcessPoolExecutor(max_workers=workers)
def maybe_commit(self, conn):
if time.time() < self._last_commit_ts + self.commit_interval:
@ -180,67 +223,55 @@ class Bitrot(object):
errors = []
current_size = 0
missing_paths = self.select_all_paths(cur)
hashes = self.select_all_hashes(cur)
paths, total_size = list_existing_paths(
b'.', expected=missing_paths, ignored={bitrot_db, bitrot_sha512},
follow_links=self.follow_links,
)
paths_uni = set(normalize_path(p) for p in paths)
futures = [self.pool.submit(compute_one, p, self.chunk_size) for p in paths]
for p in paths:
p_uni = p.decode(FSENCODING)
for future in as_completed(futures):
try:
st = os.stat(p)
except OSError as ex:
if ex.errno in IGNORED_FILE_SYSTEM_ERRORS:
# The file disappeared between listing existing paths and
# this run or is (temporarily?) locked with different
# permissions. We'll just skip it for now.
print(
'\rwarning: `{}` is currently unavailable for '
'reading: {}'.format(
p_uni, ex,
),
file=sys.stderr,
)
continue
p_uni, new_size, new_mtime, new_sha1 = future.result()
except BitrotException:
continue
raise # Not expected? https://github.com/ambv/bitrot/issues/
new_mtime = int(st.st_mtime)
current_size += st.st_size
current_size += new_size
if self.verbosity:
self.report_progress(current_size, total_size)
missing_paths.discard(p_uni)
try:
new_sha1 = sha1(p, self.chunk_size)
except (IOError, OSError) as e:
print(
'\rwarning: cannot compute hash of {} [{}]'.format(
p, errno.errorcode[e.args[0]],
),
file=sys.stderr,
)
continue
cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE '
'path=?', (p_uni,))
row = cur.fetchone()
if not row:
if p_uni not in missing_paths:
# We are not expecting this path, it wasn't in the database yet.
# It's either new or a rename. Let's handle that.
stored_path = self.handle_unknown_path(
cur, p_uni, new_mtime, new_sha1,
cur, p_uni, new_mtime, new_sha1, paths_uni, hashes
)
self.maybe_commit(conn)
if p_uni == stored_path:
new_paths.append(p) # FIXME: shouldn't that be p_uni?
new_paths.append(p_uni)
missing_paths.discard(p_uni)
else:
renamed_paths.append((stored_path, p_uni))
missing_paths.discard(stored_path)
continue
# At this point we know we're seeing an expected file.
missing_paths.discard(p_uni)
cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE path=?',
(p_uni,))
row = cur.fetchone()
if not row:
print(
'\rwarning: path disappeared from the database while running:',
p_uni,
file=sys.stderr,
)
continue
stored_mtime, stored_sha1, stored_ts = row
if int(stored_mtime) != new_mtime:
updated_paths.append(p)
updated_paths.append(p_uni)
cur.execute('UPDATE bitrot SET mtime=?, hash=?, timestamp=? '
'WHERE path=?',
(new_mtime, new_sha1, ts(), p_uni))
@ -248,11 +279,11 @@ class Bitrot(object):
continue
if stored_sha1 != new_sha1:
errors.append(p)
errors.append(p_uni)
print(
'\rerror: SHA1 mismatch for {}: expected {}, got {}.'
' Last good hash checked on {}.'.format(
p, stored_sha1, new_sha1, stored_ts
p_uni, stored_sha1, new_sha1, stored_ts
),
file=sys.stderr,
)
@ -262,6 +293,9 @@ class Bitrot(object):
conn.commit()
if not self.test:
cur.execute('vacuum')
if self.verbosity:
cur.execute('SELECT COUNT(path) FROM bitrot')
all_count = cur.fetchone()[0]
@ -283,6 +317,10 @@ class Bitrot(object):
)
def select_all_paths(self, cur):
"""Return a set of all distinct paths in the bitrot database.
The paths are Unicode and are normalized if FSENCODING was UTF-8.
"""
result = set()
cur.execute('SELECT path FROM bitrot')
row = cur.fetchone()
@ -291,6 +329,20 @@ class Bitrot(object):
row = cur.fetchone()
return result
def select_all_hashes(self, cur):
"""Return a dict where keys are hashes and values are sets of paths.
The paths are Unicode and are normalized if FSENCODING was UTF-8.
"""
result = {}
cur.execute('SELECT hash, path FROM bitrot')
row = cur.fetchone()
while row:
rhash, rpath = row
result.setdefault(rhash, set()).add(rpath)
row = cur.fetchone()
return result
def report_progress(self, current_size, total_size):
size_fmt = '\r{:>6.1%}'.format(current_size/(total_size or 1))
if size_fmt == self._last_reported_size:
@ -303,6 +355,7 @@ class Bitrot(object):
def report_done(
self, total_size, all_count, error_count, new_paths, updated_paths,
renamed_paths, missing_paths):
"""Print a report on what happened. All paths should be Unicode here."""
print('\rFinished. {:.2f} MiB of data read. {} errors found.'
''.format(total_size/1024/1024, error_count))
if self.verbosity == 1:
@ -319,21 +372,21 @@ class Bitrot(object):
print('{} entries new:'.format(len(new_paths)))
new_paths.sort()
for path in new_paths:
print(' ', path.decode(FSENCODING))
print(' ', path)
if updated_paths:
print('{} entries updated:'.format(len(updated_paths)))
updated_paths.sort()
for path in updated_paths:
print(' ', path.decode(FSENCODING))
print(' ', path)
if renamed_paths:
print('{} entries renamed:'.format(len(renamed_paths)))
renamed_paths.sort()
for path in renamed_paths:
print(
' from',
path[0].decode(FSENCODING),
path[0],
'to',
path[1].decode(FSENCODING),
path[1],
)
if missing_paths:
print('{} entries missing:'.format(len(missing_paths)))
@ -345,37 +398,38 @@ class Bitrot(object):
if self.test and self.verbosity:
print('warning: database file not updated on disk (test mode).')
def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1):
def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1, paths_uni, hashes):
"""Either add a new entry to the database or update the existing entry
on rename.
Returns `new_path` if the entry was indeed new or the `stored_path` (e.g.
outdated path) if there was a rename.
`cur` is the database cursor. `new_path` is the new Unicode path.
`paths_uni` are Unicode paths seen on disk during this run of Bitrot.
`hashes` is a dictionary selected from the database, keys are hashes, values
are sets of Unicode paths that are stored in the DB under the given hash.
Returns `new_path` if the entry was indeed new or the `old_path` (e.g.
outdated path stored in the database for this hash) if there was a rename.
"""
cur.execute('SELECT mtime, path, timestamp FROM bitrot WHERE hash=?',
(new_sha1,))
rows = cur.fetchall()
for row in rows:
stored_mtime, stored_path, stored_ts = row
if os.path.exists(stored_path):
# file still exists, move on
continue
# update the path in the database
for old_path in hashes.get(new_sha1, ()):
if old_path not in paths_uni:
# File of the same hash used to exist but no longer does.
# Let's treat `new_path` as a renamed version of that `old_path`.
cur.execute(
'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?',
(new_mtime, new_path, ts(), old_path),
)
return old_path
else:
# Either we haven't found `new_sha1` at all in the database, or all
# currently stored paths for this hash still point to existing files.
# Let's insert a new entry for what appears to be a new file.
cur.execute(
'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?',
(new_mtime, new_path, ts(), stored_path),
'INSERT INTO bitrot VALUES (?, ?, ?, ?)',
(new_path, new_mtime, new_sha1, ts()),
)
return stored_path
# no rename, just a new file with the same hash
cur.execute(
'INSERT INTO bitrot VALUES (?, ?, ?, ?)',
(new_path, new_mtime, new_sha1, ts()),
)
return new_path
return new_path
def get_path(directory=b'.', ext=b'db'):
"""Compose the path to the selected bitrot file."""
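The rewritten `handle_unknown_path` above decides between "rename" and "new file" using two in-memory structures instead of a `SELECT` plus `os.path.exists()`. The decision on its own, stripped of the database writes, can be sketched like this; `classify_unknown_path` and the sample data are hypothetical.

```python
def classify_unknown_path(new_path, new_sha1, paths_on_disk, hashes):
    """Return ("renamed", old_path) or ("new", new_path).

    `hashes` maps a hash to the set of paths stored under it in the
    database; `paths_on_disk` is the set of paths seen during this run.
    """
    for old_path in hashes.get(new_sha1, ()):
        if old_path not in paths_on_disk:
            # Same content used to live at a path that is now gone:
            # treat the new path as a rename of that entry.
            return "renamed", old_path
    # Hash unknown, or every stored path for it still exists: a new file.
    return "new", new_path


hashes = {"aaa": {"./a.txt"}, "bbb": {"./b.txt"}}
paths_on_disk = {"./a.txt2", "./b.txt", "./copy-of-b.txt"}

print(classify_unknown_path("./a.txt2", "aaa", paths_on_disk, hashes))
print(classify_unknown_path("./copy-of-b.txt", "bbb", paths_on_disk, hashes))
```

A copy of a file that still exists is reported as new, not as a rename, because the stored path for that hash is still present on disk.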
@ -462,6 +516,8 @@ def update_sha512_integrity(verbosity=1):
def run_from_command_line():
global FSENCODING
freeze_support()
parser = argparse.ArgumentParser(prog='bitrot')
parser.add_argument(
'-l', '--follow-links', action='store_true',
@ -487,11 +543,14 @@ def run_from_command_line():
help='just test against an existing database, don\'t update anything')
parser.add_argument(
'--version', action='version',
version='%(prog)s {}.{}.{}'.format(*VERSION))
version=f"%(prog)s {VERSION}")
parser.add_argument(
'--commit-interval', type=float, default=300,
help='min time in seconds between commits '
'(0 commits on every operation)')
parser.add_argument(
'-w', '--workers', type=int, default=os.cpu_count(),
help='run this many workers (use -w1 for slow magnetic disks)')
parser.add_argument(
'--chunk-size', type=int, default=DEFAULT_CHUNK_SIZE,
help='read files this many bytes at a time')
@ -517,6 +576,7 @@ def run_from_command_line():
follow_links=args.follow_links,
commit_interval=args.commit_interval,
chunk_size=args.chunk_size,
workers=args.workers,
)
if args.fsencoding:
FSENCODING = args.fsencoding

348
tests/test_bitrot.py Normal file

@ -0,0 +1,348 @@
"""
NOTE: these tests are ordered and require pytest-order to run correctly.
"""
from __future__ import annotations
import getpass
import os
from pathlib import Path
import shlex
import shutil
import subprocess
import sys
from textwrap import dedent
import pytest
TMP = Path("/tmp/")
ReturnCode = int
StdOut = list[str]
StdErr = list[str]
def bitrot(*args: str) -> tuple[ReturnCode, StdOut, StdErr]:
cmd = [sys.executable, "-m", "bitrot"]
cmd.extend(args)
res = subprocess.run(shlex.join(cmd), shell=True, capture_output=True)
stdout = (res.stdout or b"").decode("utf8")
stderr = (res.stderr or b"").decode("utf8")
return res.returncode, lines(stdout), lines(stderr)
def bash(script, empty_dir: bool = False) -> bool:
username = getpass.getuser()
test_dir = TMP / f"bitrot-dir-{username}"
if empty_dir and test_dir.is_dir():
os.chdir(TMP)
shutil.rmtree(test_dir)
test_dir.mkdir(exist_ok=True)
os.chdir(test_dir)
preamble = """
set -euxo pipefail
LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
"""
if script:
# We need to wait a second for modification timestamps to differ so that
# the ordering of the output stays the same every run of the tests.
preamble += """
sleep 1
"""
script_path = TMP / "bitrot-test.bash"
script_path.write_text(dedent(preamble + script))
script_path.chmod(0o755)
out = subprocess.run(["bash", str(script_path)], capture_output=True)
if out.returncode:
print(f"Non-zero return code {out.returncode} when running {script_path}")
if out.stdout:
print(out.stdout)
if out.stderr:
print(out.stderr)
return False
return True
def lines(s: str) -> list[str]:
    r"""Only return non-empty lines that weren't killed by \r."""
    return [
        line.rstrip()
        for line in s.splitlines(keepends=True)
        if line and line.rstrip() and line[-1] != "\r"
    ]
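The `lines` helper above is what lets the tests index into `out[...]` without tripping over the `\r`-terminated progress updates. A short illustration, reusing the helper as written in this diff on a made-up output string (the sample text is an assumption, shaped like the messages asserted in the tests):

```python
from __future__ import annotations


def lines(s: str) -> list[str]:
    r"""Only return non-empty lines that weren't killed by \r."""
    return [
        line.rstrip()
        for line in s.splitlines(keepends=True)
        if line and line.rstrip() and line[-1] != "\r"
    ]


# Progress percentages end in "\r" and are overwritten on a terminal;
# only the newline-terminated lines survive the filter.
raw = "\r 50.0%\r100.0%\rFinished. 0 errors found.\n2 entries new:\n"
print(lines(raw))  # ['Finished. 0 errors found.', '2 entries new:']
```

`str.splitlines` treats a lone `\r` as a line boundary, which is why each progress update becomes its own "line" ending in `\r` and can be dropped.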
@pytest.mark.order(1)
def test_command_exists() -> None:
rc, out, err = bitrot("--help")
assert rc == 0
assert not err
assert out[0].startswith("usage:")
assert bash("", empty_dir=True)
@pytest.mark.order(2)
def test_new_files_in_a_tree_dir() -> None:
assert bash(
"""
mkdir -p nonemptydirs/dir2/
touch nonemptydirs/dir2/new-file-{a,b}.txt
echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
# assert out[0] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[1] == "2 entries in the database. 2 entries new:"
assert out[2] == " ./nonemptydirs/dir2/new-file-a.txt"
assert out[3] == " ./nonemptydirs/dir2/new-file-b.txt"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(3)
def test_modified_files_in_a_tree_dir() -> None:
assert bash(
"""
echo $RANDOM >> nonemptydirs/dir2/new-file-a.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "2 entries in the database. 1 entries updated:"
assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(4)
def test_renamed_files_in_a_tree_dir() -> None:
assert bash(
"""
mv nonemptydirs/dir2/new-file-a.txt nonemptydirs/dir2/new-file-a.txt2
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "2 entries in the database. 1 entries renamed:"
o3 = " from ./nonemptydirs/dir2/new-file-a.txt to ./nonemptydirs/dir2/new-file-a.txt2"
assert out[3] == o3
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(5)
def test_deleted_files_in_a_tree_dir() -> None:
assert bash(
"""
rm nonemptydirs/dir2/new-file-a.txt2
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "1 entries in the database. 1 entries missing:"
assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt2"
assert out[4] == "Updating bitrot.sha512... done."
@pytest.mark.order(5)
def test_new_files_and_modified_files_in_a_tree_dir() -> None:
assert bash(
"""
for fil in {a,b,c,d,e,f,g}; do
echo $fil >> more-files-$fil.txt
done
echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
"""
)
rc, out, err = bitrot("-v")
assert rc == 0
assert not err
assert out[0] == "Checking bitrot.db integrity... ok."
# assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
assert out[2] == "8 entries in the database. 7 entries new:"
assert out[3] == " ./more-files-a.txt"
assert out[4] == " ./more-files-b.txt"
assert out[5] == " ./more-files-c.txt"
assert out[6] == " ./more-files-d.txt"
assert out[7] == " ./more-files-e.txt"
assert out[8] == " ./more-files-f.txt"
assert out[9] == " ./more-files-g.txt"
assert out[10] == "1 entries updated:"
assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
assert out[12] == "Updating bitrot.sha512... done."


@pytest.mark.order(6)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $fil $RANDOM >> nonemptydirs/pl-more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
        mv more-files-a.txt more-files-a.txt2
        rm more-files-g.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "14 entries in the database. 7 entries new:"
    assert out[3] == " ./nonemptydirs/pl-more-files-a.txt"
    assert out[4] == " ./nonemptydirs/pl-more-files-b.txt"
    assert out[5] == " ./nonemptydirs/pl-more-files-c.txt"
    assert out[6] == " ./nonemptydirs/pl-more-files-d.txt"
    assert out[7] == " ./nonemptydirs/pl-more-files-e.txt"
    assert out[8] == " ./nonemptydirs/pl-more-files-f.txt"
    assert out[9] == " ./nonemptydirs/pl-more-files-g.txt"
    assert out[10] == "1 entries updated:"
    assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
    assert out[12] == "1 entries renamed:"
    assert out[13] == " from ./more-files-a.txt to ./more-files-a.txt2"
    assert out[14] == "1 entries missing:"
    assert out[15] == " ./more-files-g.txt"
    assert out[16] == "Updating bitrot.sha512... done."


@pytest.mark.order(7)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir_2() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $RANDOM >> nonemptydirs/pl2-more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/pl-more-files-a.txt
        mv nonemptydirs/pl-more-files-b.txt nonemptydirs/pl-more-files-b.txt2
        cp nonemptydirs/pl-more-files-g.txt nonemptydirs/pl2-more-files-g.txt2
        cp nonemptydirs/pl-more-files-d.txt nonemptydirs/pl2-more-files-d.txt2
        rm more-files-f.txt nonemptydirs/pl-more-files-c.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "21 entries in the database. 9 entries new:"
    assert out[3] == " ./nonemptydirs/pl2-more-files-a.txt"
    assert out[4] == " ./nonemptydirs/pl2-more-files-b.txt"
    assert out[5] == " ./nonemptydirs/pl2-more-files-c.txt"
    assert out[6] == " ./nonemptydirs/pl2-more-files-d.txt"
    assert out[7] == " ./nonemptydirs/pl2-more-files-d.txt2"
    assert out[8] == " ./nonemptydirs/pl2-more-files-e.txt"
    assert out[9] == " ./nonemptydirs/pl2-more-files-f.txt"
    assert out[10] == " ./nonemptydirs/pl2-more-files-g.txt"
    assert out[11] == " ./nonemptydirs/pl2-more-files-g.txt2"
    assert out[12] == "1 entries updated:"
    assert out[13] == " ./nonemptydirs/pl-more-files-a.txt"
    assert out[14] == "1 entries renamed:"
    o15 = " from ./nonemptydirs/pl-more-files-b.txt to ./nonemptydirs/pl-more-files-b.txt2"
    assert out[15] == o15
    assert out[16] == "2 entries missing:"
    assert out[17] == " ./more-files-f.txt"
    assert out[18] == " ./nonemptydirs/pl-more-files-c.txt"
    assert out[19] == "Updating bitrot.sha512... done."


@pytest.mark.order(8)
def test_3278_files() -> None:
    assert bash(
        """
        mkdir -p alotfiles/here; cd alotfiles/here
        # create a 320KB file
        dd if=/dev/urandom of=masterfile bs=1 count=327680
        # split it in 3277 files (instantly) + masterfile = 3278
        split -b 100 -a 10 masterfile
        """
    )
    rc, out, err = bitrot()
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    o2 = "3299 entries in the database, 3278 new, 0 updated, 0 renamed, 0 missing."
    assert out[2] == o2


@pytest.mark.order(9)
def test_3278_files_2() -> None:
    assert bash(
        """
        mv alotfiles/here alotfiles/here-moved
        """
    )
    rc, out, err = bitrot()
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    o2 = "3299 entries in the database, 0 new, 0 updated, 3278 renamed, 0 missing."
    assert out[2] == o2


@pytest.mark.order(10)
def test_rotten_file() -> None:
    assert bash(
        """
        touch non-rotten-file
        dd if=/dev/zero of=rotten-file bs=1k count=1000 &>/dev/null
        # let's make sure they share the same timestamp
        touch -r non-rotten-file rotten-file
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "3301 entries in the database. 2 entries new:"
    assert out[3] == " ./non-rotten-file"
    assert out[4] == " ./rotten-file"


@pytest.mark.order(11)
def test_rotten_file_2() -> None:
    assert bash(
        """
        # modify the rotten file...
        dd if=/dev/urandom of=rotten-file bs=1k count=10 seek=1k conv=notrunc &>/dev/null
        # ...but revert the modification date
        touch -r non-rotten-file rotten-file
        """
    )
    rc, out, err = bitrot("-q")
    assert rc == 1
    assert not out
    e = (
        "error: SHA1 mismatch for ./rotten-file: expected"
        " 8fee1653e234fee8513245d3cb3e3c06d071493e, got"
    )
    assert err[0].startswith(e)
    assert err[1] == "error: There were 1 errors found."


@pytest.mark.order("last")
def test_cleanup() -> None:
    username = getpass.getuser()
    test_dir = TMP / f"bitrot-dir-{username}"
    if test_dir.is_dir():
        os.chdir(TMP)
        shutil.rmtree(test_dir)