Noah Levitt
17a5fabb75
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
...
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
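For context on the commit above: Python's `tempfile.SpooledTemporaryFile` keeps data in memory until it grows past `max_size`, then spills to a real temporary file, which suits payloads that are usually small but occasionally very large (such as videos). The sketch below is only an illustration of that technique. The helper name `buffer_payload` and the 512 KiB threshold are assumptions for the example, not taken from warcprox.

```python
import tempfile


def buffer_payload(chunks, max_size=512 * 1024):
    """Hypothetical helper: collect payload chunks into a spooled buffer.

    Data stays in memory while it is at most `max_size` bytes; once a
    write pushes it past that, the buffer rolls over to a temp file on disk.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_size)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf


if __name__ == "__main__":
    payload = buffer_payload([b"x" * 1024] * 4, max_size=1024)
    print(len(payload.read()))
    payload.close()
```

The caller reads from the returned object the same way whether or not it has rolled over to disk.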
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
...
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390
Apply blackout only when dedup URL equals request URL
2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a
New --blackout-period option to skip writing redundant revisits to WARC
...
Add option `--blackout-period` (default=0)
When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it is inside the `blackout_period`, skip
writing the record to WARC.
Add some unit tests.
This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
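The blackout check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the warcprox implementation: the function name `in_blackout`, its parameters, and the handling of a missing date are assumptions made for the example, and the blackout period is assumed to be in seconds.

```python
from datetime import datetime, timedelta


def in_blackout(dedup_date, now, blackout_period):
    """Hypothetical check: should a revisit record be skipped?

    dedup_date: datetime of the original capture (from the dedup info)
    now: datetime of the current capture
    blackout_period: window in seconds; 0 disables the check (the default)
    """
    if blackout_period <= 0 or dedup_date is None:
        return False
    # Skip writing if the duplicate falls inside the blackout window.
    return (now - dedup_date).total_seconds() <= blackout_period


if __name__ == "__main__":
    first_seen = datetime(2018, 7, 21, 11, 0, 0)
    print(in_blackout(first_seen, first_seen + timedelta(seconds=30), 60))
```

With the default of 0 the function always returns False, so every revisit record is still written.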
Noah Levitt
fbce243787
bump dev version after pull request
2018-07-19 11:18:31 -05:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
...
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
fde443070c
dumb mistake
2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904
hopefully fix a trough dedup concurrency bug
2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2
some logging improvements
2018-07-18 19:25:43 -05:00
Noah Levitt
f4cf782922
test should expose trough dedup concurrency bug
2018-07-18 19:23:24 -05:00
Noah Levitt
67392930f6
Merge pull request #97 from nlevitt/fix-travis-clean
...
run trough with python 3.6 plus travis cleanup
2018-07-18 16:38:08 -05:00
Noah Levitt
46d5b0e82c
run trough with python 3.6 plus travis cleanup
...
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126
also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403
record request method in crawl log if not GET
2018-07-17 13:47:52 -05:00
Noah Levitt
8c22c55955
back to dev version number
2018-07-17 12:04:08 -05:00
Noah Levitt
6786a668b1
2.4b2 for pypi
2.4b2
2018-07-17 12:03:26 -05:00
Noah Levitt
8022257a57
setuptools likes README.rst not readme.rst
2018-07-17 16:35:05 +00:00
Noah Levitt
ec7a0bf569
log exception and continue 🤞 if schema reg fails
...
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e73cbcb6b3
log stack trace in case batch postprocessor raises
...
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
e8cb3afa71
bump dev version after merge
2018-05-31 16:52:37 -07:00
Noah Levitt
a1356709df
Merge pull request #93 from nlevitt/docs
...
docs
2018-05-30 15:57:50 -07:00
Noah Levitt
6f43286b07
more edits
2018-05-30 14:46:14 -07:00
Noah Levitt
9434a1ccd8
more little edits
2018-05-30 14:26:10 -07:00
Noah Levitt
f5bcec20a9
explain a bit about mitm
2018-05-30 14:12:58 -07:00
Noah Levitt
68ede68e5f
little edits
2018-05-29 17:35:33 -07:00
Noah Levitt
cd6e30fe36
describe the last two remaining fields
2018-05-29 17:28:04 -07:00
Noah Levitt
4a87a08230
fixlets
2018-05-29 17:09:14 -07:00
Noah Levitt
8877259b7d
more progress on documenting "limits"
2018-05-29 16:57:15 -07:00
Noah Levitt
6256ec6a07
add another "wait" to fix failing test
2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2
fix bug in limits enforcement
...
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
07dc978f09
docs still in progress
2018-05-25 17:36:26 -07:00
Noah Levitt
195faa5cff
new checks exposing bug in limits enforcement
2018-05-25 17:35:32 -07:00
Noah Levitt
1e76ed3302
working on "limits" and "soft-limits"
2018-05-25 16:38:19 -07:00
Noah Levitt
2c850876e8
explain warcprox-meta "blocks"
2018-05-25 16:06:12 -07:00
Noah Levitt
4bd49b61a9
starting to explain some warcprox-meta fields
2018-05-25 15:26:26 -07:00
Noah Levitt
401de22600
short section on stats
2018-05-25 14:46:19 -07:00
Noah Levitt
02e96188c3
barely starting to flesh out warcprox-meta section
2018-05-25 10:33:45 -07:00
Noah Levitt
b562170403
explain deduplication
2018-05-25 10:32:42 -07:00
Noah Levitt
b26a5d2d73
starting to talk about warcprox-meta
2018-05-22 15:00:36 -07:00
Noah Levitt
36f6696552
fix failure message in test_return_capture_timestamp
2018-05-22 15:00:10 -07:00
Noah Levitt
44ca939cb6
double the backticks
2018-05-22 12:02:49 -07:00
Noah Levitt
efc51a4361
stubby api docs
2018-05-22 11:59:06 -07:00
Noah Levitt
b7ebc38491
rename README.rst -> readme.rst
2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe
add some debug logging in BatchTroughLoader
2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b
just one should_dedup() for trough dedup
...
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
d834ac3e59
only run tests in py3
2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05
fix trough deployment in Dockerfile
2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944
fix test_dedup_min_text_size failure?
...
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579
rewrite test_dedup_min_size() to account for
...
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
e23af32e94
we want to save all captures to the big "captures"
...
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba
default values for dedup_min_text_size et al
...
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00