829 Commits

Author SHA1 Message Date
Noah Levitt
eab9181129 Merge branch 'master' into qa
* master:
  bump version after merge
  include warcprox host and port in filenames
  replace pencil drawing with nice diagram by James
  fix bug
  readable stack traces, thanks py.test
  --quiet means NOTICE level logging
  tweak max threads option handling
  set socket timeout for tor .onion fetching
  WARCPROX_WRITE_RECORD respect buffer size setting
  --help-hidden for help on hidden args
  half-baked readme section on warcprox architecture
  bump dev version number after merge
  restore 80 column lines
  Copy edits updated
  Copy edits
  update cryptography dep version
  use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
  Apply blackout on when dedup URL equals request URL
  New --blackout-period option to skip writing redundant revisits to WARC
2018-09-28 11:12:18 -07:00
Noah Levitt
57e1b82e3d bump version after merge 2018-09-19 13:03:59 -07:00
Noah Levitt
d8edc551ba
Merge pull request #105 from nlevitt/host-port-in-log-name
include warcprox host and port in filenames
2018-09-19 13:03:19 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
45aed2e4f6
Merge pull request #104 from nlevitt/arch-svg
replace pencil drawing with nice diagram by James
2018-09-17 17:13:42 -07:00
Noah Levitt
741436ddcb replace pencil drawing with nice diagram by James
Kafader
2018-09-17 17:11:51 -07:00
Noah Levitt
ea7257a2b6
Merge pull request #103 from nlevitt/love
Love
2018-08-20 14:26:02 -07:00
Noah Levitt
4f04172374 fix bug 2018-08-20 12:07:51 -07:00
Noah Levitt
8dfb63f70d readable stack traces, thanks py.test 2018-08-20 12:07:23 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
de01700c54 tweak max threads option handling 2018-08-20 11:13:14 -07:00
Noah Levitt
bfe3f18126 set socket timeout for tor .onion fetching 2018-08-20 11:11:13 -07:00
Noah Levitt
2e71d86072 WARCPROX_WRITE_RECORD respect buffer size setting 2018-08-20 11:09:53 -07:00
Noah Levitt
e4befeec14 --help-hidden for help on hidden args 2018-08-20 11:08:32 -07:00
Noah Levitt
1d1a73536a half-baked readme section on warcprox architecture 2018-08-20 11:05:58 -07:00
Noah Levitt
8f51ba4ab9 bump dev version number after merge 2018-08-16 17:09:35 -07:00
Noah Levitt
8be7ddee2b
Merge pull request #100 from nlevitt/karl-copy-edits
Karl's copy edits
2018-08-16 17:08:14 -07:00
Noah Levitt
9da5e86b67 restore 80 column lines 2018-08-16 16:32:55 -07:00
Karl-Rainer Blumenthal
fa6b98cf4e Copy edits updated
Edits for readability updated as per https://github.com/internetarchive/warcprox/pull/95#discussion_r200491731

@nlevitt please go ahead and apply your < 80 lines retroactively and I'll refrain from that in future PRs.
2018-08-16 16:31:23 -07:00
Karl-Rainer Blumenthal
b72192d3d0 Copy edits 2018-08-16 16:31:05 -07:00
Noah Levitt
f8b86a0122 update cryptography dep version
github tells me there's a vulnerability <2.3
2018-08-16 12:54:30 -07:00
Noah Levitt
17a5fabb75 use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
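Note on 17a5fabb75: the sketch below illustrates the general technique the commit message names, not warcprox's actual code. `tempfile.SpooledTemporaryFile` holds data in memory until it grows past `max_size`, then transparently moves it to a file on disk, which suits payloads that are usually small but occasionally large (such as videos). The helper name `buffer_payload`, the 512 KiB threshold, and the chunk size are assumptions made for illustration.

```python
import io
import tempfile


def buffer_payload(stream, max_in_memory=512 * 1024, chunk_size=65536):
    """Copy a readable binary stream into a SpooledTemporaryFile.

    Hypothetical helper, not part of warcprox. Data stays in memory
    until more than `max_in_memory` bytes have been written, after
    which the standard library spills it to a temporary file on disk.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf


if __name__ == "__main__":
    payload = buffer_payload(io.BytesIO(b"small payload"))
    print(payload.read())
```

The caller reads from the returned object the same way in both cases; only the storage behind it differs.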
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390 Apply blackout on when dedup URL equals request URL 2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set, and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it is inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
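Note on 2c2c1d008a: the check described in the commit body can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the pull request: the function name `in_blackout`, the choice of seconds as the unit of the period, and the inclusive comparison at the boundary are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def in_blackout(dedup_date, blackout_period, now=None):
    """Return True if a revisit record should be skipped.

    Hypothetical helper, not part of warcprox.
    dedup_date      -- timezone-aware datetime of the earlier capture
                       (assumed to come from the dedup info), or None
    blackout_period -- period length in seconds (assumed unit);
                       0 disables the check, matching default=0
    now             -- injectable current time, for testing
    """
    if blackout_period <= 0 or dedup_date is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    # Skip writing when the earlier capture is recent enough.
    return now - dedup_date <= timedelta(seconds=blackout_period)


if __name__ == "__main__":
    now = datetime(2018, 7, 21, 12, 0, tzinfo=timezone.utc)
    recent = now - timedelta(seconds=30)
    print(in_blackout(recent, 60, now=now))
```

With a period of 0 the function always returns False, so every revisit record would still be written, which is consistent with the option's stated default.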
Noah Levitt
966c386ac3 Merge branch 'master' into qa
* master:
  bump dev version after pull request
  dumb mistake
  hopefully fix a trough dedup concurrency bug
  some logging improvements
  test should expose trough dedup concurrency bug
  run trough with python 3.6 plus travis cleanup
  record request method in crawl log if not GET
  back to dev version number
  2.4b2 for pypi
  setuptools likes README.rst not readme.rst
2018-07-19 11:19:27 -05:00
Noah Levitt
fbce243787 bump dev version after pull request 2018-07-19 11:18:31 -05:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
fde443070c dumb mistake 2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904 hopefully fix a trough dedup concurrency bug 2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2 some logging improvements 2018-07-18 19:25:43 -05:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
67392930f6
Merge pull request #97 from nlevitt/fix-travis-clean
run trough with python 3.6 plus travis cleanup
2018-07-18 16:38:08 -05:00
Noah Levitt
46d5b0e82c run trough with python 3.6 plus travis cleanup
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126

also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
8c22c55955 back to dev version number 2018-07-17 12:04:08 -05:00
Noah Levitt
6786a668b1 2.4b2 for pypi (tag: 2.4b2) 2018-07-17 12:03:26 -05:00
Noah Levitt
8022257a57 setuptools likes README.rst not readme.rst 2018-07-17 16:35:05 +00:00
Noah Levitt
a3f5313850 Merge branch 'master' into qa
* master:
  log exception and continue 🤞 if schema reg fails
  log stack trace in case batch postprocessor raises
  bump dev version after merge
  more edits
  more little edits
  explain a bit about mitm
  little edits
  describe the last two remaining fields
  fixlets
  more progress on documenting "limits"
  add another "wait" to fix failing test
  fix bug in limits enforcement
  docs still in progress
  new checks exposing bug in limits enforcement
  working on "limits" and "soft-limits"
  explain warcprox-meta "blocks"
  starting to explain some warcprox-meta fields
  short section on stats
  barely starting to flesh out warcprox-meta section
  explain deduplication
  starting to talk about warcprox-meta
  fix failure message in test_return_capture_timestamp
  double the backticks
  stubby api docs
  rename README.rst -> readme.rst
  add some debug logging in BatchTroughLoader
  just one should_dedup() for trough dedup
  only run tests in py3
  fix trough deployment in Dockerfile
  fix test_dedup_min_text_size failure?
  rewrite test_dedup_min_size() to account for
  we want to save all captures to the big "captures"
  default values for dedup_min_text_size et al
2018-06-01 15:49:24 -07:00
Noah Levitt
ec7a0bf569 log exception and continue 🤞 if schema reg fails
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e73cbcb6b3 log stack trace in case batch postprocessor raises
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
e8cb3afa71 bump dev version after merge 2018-05-31 16:52:37 -07:00
Noah Levitt
a1356709df
Merge pull request #93 from nlevitt/docs
docs
2018-05-30 15:57:50 -07:00
Noah Levitt
6f43286b07 more edits 2018-05-30 14:46:14 -07:00
Noah Levitt
9434a1ccd8 more little edits 2018-05-30 14:26:10 -07:00
Noah Levitt
f5bcec20a9 explain a bit about mitm 2018-05-30 14:12:58 -07:00
Noah Levitt
68ede68e5f little edits 2018-05-29 17:35:33 -07:00
Noah Levitt
cd6e30fe36 describe the last two remaining fields 2018-05-29 17:28:04 -07:00
Noah Levitt
4a87a08230 fixlets 2018-05-29 17:09:14 -07:00
Noah Levitt
8877259b7d more progress on documenting "limits" 2018-05-29 16:57:15 -07:00