783 Commits

Author SHA1 Message Date
Noah Levitt
f082db62cf take all the queues and active requests into
account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e send nice 503s and avoid scary stack traces at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d fix failing test 2018-10-26 13:44:27 -07:00
Noah Levitt
e993b0c28c fix shutdown
at shutdown, abort active connections, but allow completed fetches to
finish processing

this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
2018-10-26 13:21:15 -07:00
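The ordering problem described in e993b0c28c can be sketched with a bounded queue between two stages. This is only an illustration, not warcprox's code: `stage_a`, `stage_b`, `run_pipeline` and `SENTINEL` are hypothetical names. The point it shows is that the downstream stage keeps draining the queue until the upstream stage has signalled that it is done, so a blocking `put()` upstream can always make progress.

```python
import queue
import threading

# Hypothetical marker meaning "upstream is finished, nothing more will arrive".
SENTINEL = object()

def stage_a(inputs, out_q):
    """Upstream stage: enqueue every item, then signal completion."""
    for item in inputs:
        # put() blocks while the queue is full. That is safe only because
        # stage_b keeps pulling items until it sees SENTINEL.
        out_q.put(item)
    out_q.put(SENTINEL)

def stage_b(in_q, results):
    """Downstream stage: keep consuming until upstream says it is done."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item)

def run_pipeline(inputs, maxsize=2):
    """Run both stages; the small maxsize forces stage_a to block often."""
    q = queue.Queue(maxsize=maxsize)
    results = []
    a = threading.Thread(target=stage_a, args=(inputs, q))
    b = threading.Thread(target=stage_b, args=(q, results))
    b.start()
    a.start()
    a.join()  # upstream finishes first ...
    b.join()  # ... and only then does downstream exit
    return results
```

If `stage_b` exited before `stage_a` had finished, `stage_a` would block forever on a full queue, which is the hang the commit message describes.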
Noah Levitt
4f01772782 enforce limits on WARCPROX_WRITE_RECORD requests
should make test from previous commit pass
2018-10-10 18:24:54 -07:00
Noah Levitt
4c0dfb432e failing test for new feature, enforcing limits on
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
57e1b82e3d bump version after merge 2018-09-19 13:03:59 -07:00
Noah Levitt
d8edc551ba
Merge pull request #105 from nlevitt/host-port-in-log-name
include warcprox host and port in filenames
2018-09-19 13:03:19 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
45aed2e4f6
Merge pull request #104 from nlevitt/arch-svg
replace pencil drawing with nice diagram by James
2018-09-17 17:13:42 -07:00
Noah Levitt
741436ddcb replace pencil drawing with nice diagram by James
Kafader
2018-09-17 17:11:51 -07:00
Noah Levitt
ea7257a2b6
Merge pull request #103 from nlevitt/love
Love
2018-08-20 14:26:02 -07:00
Noah Levitt
4f04172374 fix bug 2018-08-20 12:07:51 -07:00
Noah Levitt
8dfb63f70d readable stack traces, thanks py.test 2018-08-20 12:07:23 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
de01700c54 tweak max threads option handling 2018-08-20 11:13:14 -07:00
Noah Levitt
bfe3f18126 set socket timeout for tor .onion fetching 2018-08-20 11:11:13 -07:00
Noah Levitt
2e71d86072 WARCPROX_WRITE_RECORD respect buffer size setting 2018-08-20 11:09:53 -07:00
Noah Levitt
e4befeec14 --help-hidden for help on hidden args 2018-08-20 11:08:32 -07:00
Noah Levitt
1d1a73536a half-baked readme section on warcprox architecture 2018-08-20 11:05:58 -07:00
Noah Levitt
8f51ba4ab9 bump dev version number after merge 2018-08-16 17:09:35 -07:00
Noah Levitt
8be7ddee2b
Merge pull request #100 from nlevitt/karl-copy-edits
Karl's copy edits
2018-08-16 17:08:14 -07:00
Noah Levitt
9da5e86b67 restore 80 column lines 2018-08-16 16:32:55 -07:00
Karl-Rainer Blumenthal
fa6b98cf4e Copy edits updated
Edits for readability updated as per https://github.com/internetarchive/warcprox/pull/95#discussion_r200491731

@nlevitt please go ahead and apply your < 80 lines retroactively and I'll refrain from that in future PRs.
2018-08-16 16:31:23 -07:00
Karl-Rainer Blumenthal
b72192d3d0 Copy edits 2018-08-16 16:31:05 -07:00
Noah Levitt
f8b86a0122 update cryptography dep version
github tells me there's a vulnerability <2.3
2018-08-16 12:54:30 -07:00
Noah Levitt
17a5fabb75 use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
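For readers unfamiliar with the standard library class named in 17a5fabb75, a minimal sketch of buffering a payload with `tempfile.SpooledTemporaryFile` follows. The helper name `buffer_payload` and the 512 KiB threshold are assumptions for illustration, not values taken from warcprox.

```python
import tempfile

def buffer_payload(chunks, max_in_memory=512 * 1024):
    """Collect payload chunks into a file-like buffer.

    The buffer stays in memory until more than max_in_memory bytes have
    been written, then transparently rolls over to a temporary file on
    disk. The threshold here is an assumed value.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf
```

Small payloads never touch the disk, while a large video is spilled to a temporary file instead of being held entirely in memory.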
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390 Apply blackout on when dedup URL equals request URL 2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set, and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it is inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
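The check described in 2c2c1d008a can be sketched as follows. This is a hedged illustration rather than the code from the pull request: the function name `in_blackout` and its parameters are hypothetical, and treating the period as a number of seconds is an assumption.

```python
from datetime import datetime, timedelta, timezone

def in_blackout(dedup_date, now, blackout_period):
    """Return True if a revisit record should be skipped.

    dedup_date      -- datetime of the earlier capture (from dedup info)
    now             -- datetime of the current capture
    blackout_period -- assumed to be seconds; 0 disables the check,
                       matching the default mentioned in the commit
    """
    if blackout_period <= 0:
        return False
    return now - dedup_date <= timedelta(seconds=blackout_period)
```

For example, with a 60 second period, a duplicate whose original capture was 30 seconds ago would be skipped, while one captured 90 seconds ago would still be written as a revisit record.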
Noah Levitt
fbce243787 bump dev version after pull request 2018-07-19 11:18:31 -05:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
fde443070c dumb mistake 2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904 hopefully fix a trough dedup concurrency bug 2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2 some logging improvements 2018-07-18 19:25:43 -05:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
67392930f6
Merge pull request #97 from nlevitt/fix-travis-clean
run trough with python 3.6 plus travis cleanup
2018-07-18 16:38:08 -05:00
Noah Levitt
46d5b0e82c run trough with python 3.6 plus travis cleanup
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126

also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
8c22c55955 back to dev version number 2018-07-17 12:04:08 -05:00
Noah Levitt
6786a668b1 2.4b2 for pypi 2.4b2 2018-07-17 12:03:26 -05:00
Noah Levitt
8022257a57 setuptools likes README.rst not readme.rst 2018-07-17 16:35:05 +00:00
Noah Levitt
ec7a0bf569 log exception and continue 🤞 if schema reg fails
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e73cbcb6b3 log stack trace in case batch postprocessor raises
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
e8cb3afa71 bump dev version after merge 2018-05-31 16:52:37 -07:00
Noah Levitt
a1356709df
Merge pull request #93 from nlevitt/docs
docs
2018-05-30 15:57:50 -07:00
Noah Levitt
6f43286b07 more edits 2018-05-30 14:46:14 -07:00
Noah Levitt
9434a1ccd8 more little edits 2018-05-30 14:26:10 -07:00
Noah Levitt
f5bcec20a9 explain a bit about mitm 2018-05-30 14:12:58 -07:00
Noah Levitt
68ede68e5f little edits 2018-05-29 17:35:33 -07:00