208 Commits

Author SHA1 Message Date
Barbara Miller
f782f8a985
Merge pull request #162 from internetarchive/fixes-malformed-crawl-log-lines
Fixes malformed crawl log lines
2021-04-01 12:19:03 -07:00
Adam Miller
7f406b7942 Trying to fix tests that only fail during ci 2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa Add test cases for space in content type header and exception messages 2021-03-31 23:22:04 +00:00
Noah Levitt
962e407483 Merge branch 'use-trough-lib' into qa
* use-trough-lib:
  trough uses py3.5+ async syntax
  use trough.client instead of warcprox.trough
2019-11-19 13:31:47 -08:00
Noah Levitt
c7ddeea2f0 Merge remote-tracking branch 'origin/master' into qa
* origin/master:
  bump version after merge
  Another exception when trying to close a WARC file
  bump version after merges
  try to fix test failing due to url-encoding
  Use "except Exception" to catch all exception types
  Set connection pool maxsize=6
  Handle ValueError when trying to close WARC file
  Skip cdx dedup for volatile URLs with session params
  Increase remote_connection_pool maxsize
  bump version
  add missing import
  avoid this problem
2019-11-19 13:31:34 -08:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
d1b52f8d80 try to fix test failing due to url-encoding
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
not sure what changed, maybe the requests library, though i can't
reproduce locally, but explicitly decoding should fix the problem
2019-09-23 11:16:48 -07:00
Barbara Miller
f906312800 Merge branch 'dedup-fixes' into qa 2019-06-14 15:04:00 -07:00
Barbara Miller
c0fcf59c86 rm test not matching use case 2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2 more tests 2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622 test_dedup_buckets_multiple 2019-06-13 17:57:29 -07:00
Barbara Miller
6ee7ab36a2 fix tests too 2019-05-31 17:36:13 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
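A minimal sketch of the idea in f51f2ec225, not the code from the commit: pick 504 for timeouts and 502 for other upstream connection failures, and send the standard reason phrase instead of `str(e)`. The function name `status_for_exception` and the exact exception-to-status mapping are assumptions made for illustration.

```python
import socket
from http import HTTPStatus


def status_for_exception(exc):
    """Return (status code, reason phrase) for an upstream failure.

    Hypothetical helper: the mapping below is an assumption, not
    warcprox's actual rules.
    """
    if isinstance(exc, socket.timeout):
        # The remote server did not answer in time.
        status = HTTPStatus.GATEWAY_TIMEOUT
    elif isinstance(exc, (ConnectionError, socket.gaierror)):
        # Connection refused/reset, or the hostname did not resolve.
        status = HTTPStatus.BAD_GATEWAY
    else:
        status = HTTPStatus.INTERNAL_SERVER_ERROR
    # Use the fixed phrase; str(exc) never reaches the status line.
    return status.value, status.phrase
```

The exception text can still go to the log or the response body, where a long or multi-line message does no harm.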
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is to connect to the remote server successfully but raise a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call to `prox_rec_res.begin(...)`, which
calls `http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
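The behaviour described in 89041e83b4 could be sketched roughly as follows. This is an illustration under assumptions, not the warcprox implementation: the cache here is a plain dict with an expiry time, and `BAD_TTL_SECONDS`, `is_known_bad` and `begin_download` are hypothetical names. Only `http.client.RemoteDisconnected` and the cache name `bad_hostnames_ports` come from the commit message.

```python
import http.client
import time

BAD_TTL_SECONDS = 60      # assumed expiry, not taken from warcprox
bad_hostnames_ports = {}  # "hostname:port" -> expiry (monotonic seconds)


def is_known_bad(hostname, port):
    """True if hostname:port was recently recorded as failing."""
    expiry = bad_hostnames_ports.get("%s:%s" % (hostname, port))
    return expiry is not None and expiry > time.monotonic()


def begin_download(begin, hostname, port):
    """Call begin(); on RemoteDisconnected, remember the bad target."""
    try:
        return begin()
    except http.client.RemoteDisconnected:
        key = "%s:%s" % (hostname, port)
        bad_hostnames_ports[key] = time.monotonic() + BAD_TTL_SECONDS
        raise
```

Because the cache is shared state, a test that makes `localhost` fail leaves an entry behind, which is why the commit clears the cache in the two affected unit tests.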
Noah Levitt
f207e32f50 followup on IncompleteRead 2019-04-15 00:17:50 -07:00
Noah Levitt
5ced2588d4 failing test test_incomplete_read 2019-04-13 17:33:38 -07:00
Noah Levitt
ac3d238a3d new snakebite git url 2019-04-08 11:11:51 -07:00
Noah Levitt
3f08639553 still seeing a warning but 🤷‍♂️ 2019-03-21 16:00:36 -07:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Noah Levitt
cb2a07bff2 account for surt fix in urlcanon 0.3.0 2019-03-21 12:59:32 -07:00
Noah Levitt
321c638a62 Merge branch 'warc-close-api' into qa
* warc-close-api:
  WarcWriterProcessor.close_for_prefix()
  ThreadPoolExecutor no longer used
  remove --writer-threads option
2019-01-08 11:28:52 -08:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API so that code outside of warcprox proper (in a third-party
plugin, for example) can close open warcs promptly when it knows they
are finished.
2019-01-08 11:27:11 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
6dc4abfa84 Merge branch 'master' into qa
* master:
  3 hour hard timeout on urls without content-length
  use predictable id in service registry
2018-11-14 13:01:17 -08:00
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
64e92d8953 Merge branch 'fix-seconds-behind' into qa
* fix-seconds-behind:
  datetimes with timezone in status because...
  be clear about timezone in timestamps
  take all the queues and active requests into...
2018-10-31 12:36:26 -07:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
7e45c11501 Merge branch 'master' into qa
* master:
  fix shutdown
  enforce limits on WARCPROX_WRITE_RECORD requests
  failing test for new feature, enforcing limits on
2018-10-26 13:24:52 -07:00
Noah Levitt
4c0dfb432e failing test for new feature, enforcing limits on
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
eab9181129 Merge branch 'master' into qa
* master:
  bump version after merge
  include warcprox host and port in filenames
  replace pencil drawing with nice diagram by James
  fix bug
  readable stack traces, thanks py.test
  --quiet means NOTICE level logging
  tweak max threads option handling
  set socket timeout for tor .onion fetching
  WARCPROX_WRITE_RECORD respect buffer size setting
  --help-hidden for help on hidden args
  half-baked readme section on warcprox architecture
  bump dev version number after merge
  restore 80 column lines
  Copy edits updated
  Copy edits
  update cryptography dep version
  use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
  Apply blackout on when dedup URL equals request URL
  New --blackout-period option to skip writing redundant revisits to WARC
2018-09-28 11:12:18 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
17a5fabb75 use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
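For context on 17a5fabb75, a short sketch of how `tempfile.SpooledTemporaryFile` behaves: data stays in memory until it exceeds `max_size`, then spills to a file on disk, so a large video does not have to sit in RAM. The helper name `buffer_payload` and the 512 KiB threshold are assumptions for illustration, not values from warcprox.

```python
import tempfile


def buffer_payload(chunks, max_in_memory=512 * 1024):
    """Collect payload chunks into a file-like buffer.

    Hypothetical helper. Small payloads stay in memory; anything
    larger than max_in_memory bytes is rolled over to disk.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf
```

The caller reads from the returned object the same way in both cases, so the code that writes the record does not need to know where the bytes live.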
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390 Apply blackout on when dedup URL equals request URL 2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it's inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
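The check described in 2c2c1d008a might look something like this. It is a sketch under stated assumptions rather than the code from the pull request: `dedup_info` is assumed to be a dict whose `"date"` value is a timezone-aware `datetime`, `blackout_period` is assumed to be a number of seconds, and the function name `in_blackout` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def in_blackout(dedup_info, blackout_period, now=None):
    """True if a revisit record falls inside the blackout period.

    Assumes dedup_info["date"] is an aware datetime and
    blackout_period is in seconds (0 disables the feature).
    """
    if not blackout_period or not dedup_info or "date" not in dedup_info:
        return False
    now = now or datetime.now(timezone.utc)
    age = now - dedup_info["date"]
    return age <= timedelta(seconds=blackout_period)
```

With the default of 0 the function always returns False, so every revisit record is still written, matching the "default=0" behaviour in the commit message.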
Noah Levitt
966c386ac3 Merge branch 'master' into qa
* master:
  bump dev version after pull request
  dumb mistake
  hopefully fix a trough dedup concurrency bug
  some logging improvements
  test should expose trough dedup concurrency bug
  run trough with python 3.6 plus travis cleanup
  record request method in crawl log if not GET
  back to dev version number
  2.4b2 for pypi
  setuptools likes README.rst not readme.rst
2018-07-19 11:19:27 -05:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
46d5b0e82c run trough with python 3.6 plus travis cleanup
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126

also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
a3f5313850 Merge branch 'master' into qa
* master:
  log exception and continue 🤞 if schema reg fails
  log stack trace in case batch postprocessor raises
  bump dev version after merge
  more edits
  more little edits
  explain a bit about mitm
  little edits
  describe the last two remaining fields
  fixlets
  more progress on documenting "limits"
  add another "wait" to fix failing test
  fix bug in limits enforcement
  docs still in progress
  new checks exposing bug in limits enforcement
  working on "limits" and "soft-limits"
  explain warcprox-meta "blocks"
  starting to explain some warcprox-meta fields
  short section on stats
  barely starting to flesh out warcprox-meta section
  explain deduplication
  starting to talk about warcprox-meta
  fix failure message in test_return_capture_timestamp
  double the backticks
  stubby api docs
  rename README.rst -> readme.rst
  add some debug logging in BatchTroughLoader
  just one should_dedup() for trough dedup
  only run tests in py3
  fix trough deployment in Dockerfile
  fix test_dedup_min_text_size failure?
  rewrite test_dedup_min_size() to account for
  we want to save all captures to the big "captures"
  default values for dedup_min_text_size et al
2018-06-01 15:49:24 -07:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
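To illustrate the fix in d9e0ed31f2, here is a rough sketch of a limit check that skips limits whose bucket the request does not belong to. The data shapes are assumptions made for the example: limit keys of the form `bucket/tally/field`, a list of bucket names for the request, and stats as nested dicts. The function name `exceeded_limit` is hypothetical.

```python
def exceeded_limit(limits, request_buckets, stats):
    """Return the key of the first limit reached, or None.

    Assumed shapes (illustrative only):
      limits:          {"job1/total/urls": 10}
      request_buckets: ["job1"]
      stats:           {"job1": {"total": {"urls": 10}}}
    """
    for key, limit in limits.items():
        bucket, tally, field = key.rsplit("/", 2)
        if bucket not in request_buckets:
            # The url is not counted in this bucket, so the
            # limit does not apply to this request.
            continue
        value = stats.get(bucket, {}).get(tally, {}).get(field, 0)
        if value >= limit:
            return key
    return None
```

Without the bucket membership check, a request tallied only in `job2` would be rejected as soon as `job1` hit its limit, which is the kind of bug the commit message describes.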
Noah Levitt
195faa5cff new checks exposing bug in limits enforcement 2018-05-25 17:35:32 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
d834ac3e59 only run tests in py3 2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05 fix trough deployment in Dockerfile 2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00