160 Commits

Author SHA1 Message Date
Barbara Miller
8f10fce93a resetting to Jul 1 updates 2022-08-03 19:58:54 -07:00
Barbara Miller
ab172189fd Merge branch 'tls-fingerprint' into qa 2022-07-01 11:11:54 -07:00
Adam Miller
731cfe80cc Adding url canonicalization tests and handling of edge cases to reduce log noise 2022-04-26 23:48:54 +00:00
Adam Miller
1e3d22aba4 Better handle non-ascii urls for crawl log hop info 2022-04-20 22:48:28 +00:00
Barbara Miller
f782f8a985
Merge pull request #162 from internetarchive/fixes-malformed-crawl-log-lines
Fixes malformed crawl log lines
2021-04-01 12:19:03 -07:00
Adam Miller
7f406b7942 Trying to fix tests that only fail during ci 2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa Add test cases for space in content type header and exception messages 2021-03-31 23:22:04 +00:00
Noah Levitt
962e407483 Merge branch 'use-trough-lib' into qa
* use-trough-lib:
  trough uses py3.5+ async syntax
  use trough.client instead of warcprox.trough
2019-11-19 13:31:47 -08:00
Noah Levitt
c7ddeea2f0 Merge remote-tracking branch 'origin/master' into qa
* origin/master:
  bump version after merge
  Another exception when trying to close a WARC file
  bump version after merges
  try to fix test failing due to url-encoding
  Use "except Exception" to catch all exception types
  Set connection pool maxsize=6
  Handle ValueError when trying to close WARC file
  Skip cdx dedup for volatile URLs with session params
  Increase remote_connection_pool maxsize
  bump version
  add missing import
  avoid this problem
2019-11-19 13:31:34 -08:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
d1b52f8d80 try to fix test failing due to url-encoding
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
not sure what changed, maybe the requests library, though i can't
reproduce locally, but explicitly decoding should fix the problem
2019-09-23 11:16:48 -07:00
Barbara Miller
f906312800 Merge branch 'dedup-fixes' into qa 2019-06-14 15:04:00 -07:00
Barbara Miller
c0fcf59c86 rm test not matching use case 2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2 more tests 2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622 test_dedup_buckets_multiple 2019-06-13 17:57:29 -07:00
Barbara Miller
6ee7ab36a2 fix tests too 2019-05-31 17:36:13 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` as in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is to connect to the remote server successfully but raise a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. Its caused by call `prox_rec_res.begin(...)` which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
Noah Levitt
f207e32f50 followup on IncompleteRead 2019-04-15 00:17:50 -07:00
Noah Levitt
5ced2588d4 failing test test_incomplete_read 2019-04-13 17:33:38 -07:00
Noah Levitt
3f08639553 still seeing a warning but 🤷‍♂️ 2019-03-21 16:00:36 -07:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Noah Levitt
cb2a07bff2 account for surt fix in urlcanon 0.3.0 2019-03-21 12:59:32 -07:00
Noah Levitt
321c638a62 Merge branch 'warc-close-api' into qa
* warc-close-api:
  WarcWriterProcessor.close_for_prefix()
  ThreadPoolExecutor no longer used
  remove --writer-threads option
2019-01-08 11:28:52 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
6dc4abfa84 Merge branch 'master' into qa
* master:
  3 hour hard timeout on urls without content-length
  use predictable id in service registry
2018-11-14 13:01:17 -08:00
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
64e92d8953 Merge branch 'fix-seconds-behind' into qa
* fix-seconds-behind:
  datetimes with timezone in status because...
  be clear about timezone in timestamps
  take all the queues and active requests into...
2018-10-31 12:36:26 -07:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
7e45c11501 Merge branch 'master' into qa
* master:
  fix shutdown
  enforce limits on WARCPROX_WRITE_RECORD requests
  failing test for new feature, enforcing limits on
2018-10-26 13:24:52 -07:00
Noah Levitt
4c0dfb432e failing test for new feature, enforcing limits on
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
eab9181129 Merge branch 'master' into qa
* master:
  bump version after merge
  include warcprox host and port in filenames
  replace pencil drawing with nice diagram by James
  fix bug
  readable stack traces, thanks py.test
  --quiet means NOTICE level logging
  tweak max threads option handling
  set socket timeout for tor .onion fetching
  WARCPROX_WRITE_RECORD respect buffer size setting
  --help-hidden for help on hidden args
  half-baked readme section on warcprox architecture
  bump dev version number after merge
  restore 80 column lines
  Copy edits updated
  Copy edits
  update cryptography dep version
  use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
  Apply blackout on when dedup URL equals request URL
  New --blackout-period option to skip writing redundant revisits to WARC
2018-09-28 11:12:18 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
966c386ac3 Merge branch 'master' into qa
* master:
  bump dev version after pull request
  dumb mistake
  hopefully fix a trough dedup concurrency bug
  some logging improvements
  test should expose trough dedup concurrency bug
  run trough with python 3.6 plus travis cleanup
  record request method in crawl log if not GET
  back to dev version number
  2.4b2 for pypi
  setuptools likes README.rst not readme.rst
2018-07-19 11:19:27 -05:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
a3f5313850 Merge branch 'master' into qa
* master:
  log exception and continue 🤞 if schema reg fails
  log stack trace in case batch postprocessor raises
  bump dev version after merge
  more edits
  more little edits
  explain a bit about mitm
  little edits
  describe the last two remaining fields
  fixlets
  more progress on documenting "limits"
  add another "wait" to fix failing test
  fix bug in limits enforcement
  docs still in progress
  new checks exposing bug in limits enforcement
  working on "limits" and "soft-limits"
  explain warcprox-meta "blocks"
  starting to explain some warcprox-meta fields
  short sectioni on stats
  barely starting to flesh out warcprox-meta section
  explain deduplication
  starting to talk about warcprox-meta
  fix failure message in test_return_capture_timestamp
  double the backticks
  stubby api docs
  rename README.rst -> readme.rst
  add some debug logging in BatchTroughLoader
  just one should_dedup() for trough dedup
  only run tests in py3
  fix trough deployment in Dockerfile
  fix test_dedup_min_text_size failure?
  rewrite test_dedup_min_size() to account for
  we want to save all captures to the big "captures"
  default values for dedup_min_text_size et al
2018-06-01 15:49:24 -07:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
195faa5cff new checks exposing bug in limits enforcement 2018-05-25 17:35:32 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
7afa92f102 Merge branch 'master' into qa
* master:
  support "captures-bucket" for backward compatibility
  Add hidden CLI option --dedup-only-with-bucket
  dedup-bucket is required in Warcprox-Meta to do dedup
  Rename captures-bucket to dedup-bucket in Warcprox-Meta
  bump dev version number after #86
  Use DedupableMixin in RethinkCapturesDedup
  Fix travis-ci unit test issue
  Add unit tests
  Remove method decorate_with_dedup_info
  Use DedupableMixin in all dedup classes
  default to 100 proxy threads, 1 warc writer thread
  include warc writer worker threads in profiling
  cap the number of urls queued for warc writing
  oops! /status has been lying about queued urls
  Configurable min dedupable size for text/binary resources
  bump dev version number after PR
  Fix Accept-Encoding request header
  CDX dedup improvements
  bump dev version number after PR
  make test server multithreaded so tests will pass
  always call socket.shutdown() to close connections
  bump dev version number
  close connection when truncating response
  test another request after truncated response
  close all remote connections at shutdown
  tweak tests to make them pass now that keepalive
  enable keepalive on test http server
  more logging
  remove some debug logging
  this is some logging meant to debug the mysterious
  work around odd problem (see comment in code)
2018-05-09 15:43:52 -07:00
Noah Levitt
15830fc5a2 support "captures-bucket" for backward compatibility 2018-05-09 15:43:39 -07:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Vangelis Banos
9dac806ca1 Fix travis-ci unit test issue
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950

We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11 Add unit tests
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.

Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
Noah Levitt
38e2a87f31 make test server multithreaded so tests will pass 2018-04-05 17:59:10 -07:00