187 Commits

Author SHA1 Message Date
Vangelis Banos
2068c037ea Create warcprox.certauth and drop certauth dependency
Copy certauth.py and tests_certauth.gr from `certauth==1.1.6`
b526eb2bfd

Change only imports.

Drop unused imports.

Update setup.py: drop `certauth` and add `pyopenssl`.
2024-07-09 11:56:06 +00:00
Adam Miller
731cfe80cc Adding url canonicalization tests and handling of edge cases to reduce log noise 2022-04-26 23:48:54 +00:00
Adam Miller
1e3d22aba4 Better handle non-ascii urls for crawl log hop info 2022-04-20 22:48:28 +00:00
Adam Miller
7f406b7942 Trying to fix tests that only fail during ci 2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa Add test cases for space in content type header and exception messages 2021-03-31 23:22:04 +00:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
d1b52f8d80 try to fix test failing due to url-encoding
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
not sure what changed, maybe the requests library, though i can't
reproduce locally, but explicitly decoding should fix the problem
2019-09-23 11:16:48 -07:00
Barbara Miller
c0fcf59c86 rm test not matching use case 2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2 more tests 2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622 test_dedup_buckets_multiple 2019-06-13 17:57:29 -07:00
Barbara Miller
6ee7ab36a2 fix tests too 2019-05-31 17:36:13 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` as in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is to connect to the remote server successfully but raise a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. Its caused by call `prox_rec_res.begin(...)` which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
Noah Levitt
f207e32f50 followup on IncompleteRead 2019-04-15 00:17:50 -07:00
Noah Levitt
5ced2588d4 failing test test_incomplete_read 2019-04-13 17:33:38 -07:00
Noah Levitt
ac3d238a3d new snakebite git url 2019-04-08 11:11:51 -07:00
Noah Levitt
3f08639553 still seeing a warning but 🤷‍♂️ 2019-03-21 16:00:36 -07:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Noah Levitt
cb2a07bff2 account for surt fix in urlcanon 0.3.0 2019-03-21 12:59:32 -07:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API to allow some code from outside of warcprox proper (in a
third-party plugin for example) can close open warcs promptly when it
knows they are finished.
2019-01-08 11:27:11 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
4c0dfb432e failing test for new feature, enforcing limits on
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
17a5fabb75 use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390 Apply blackout on when dedup URL equals request URL 2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info` and its inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
46d5b0e82c run trough with python 3.6 plus travis cleanup
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126

also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
195faa5cff new checks exposing bug in limits enforcement 2018-05-25 17:35:32 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
d834ac3e59 only run tests in py3 2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05 fix trough deployment in Dockerfile 2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
15830fc5a2 support "captures-bucket" for backward compatibility 2018-05-09 15:43:39 -07:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Vangelis Banos
9dac806ca1 Fix travis-ci unit test issue
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950

We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11 Add unit tests
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.

Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
Noah Levitt
38e2a87f31 make test server multithreaded so tests will pass 2018-04-05 17:59:10 -07:00
Noah Levitt
385014c322 always call socket.shutdown() to close connections 2018-04-04 17:49:08 -07:00
Noah Levitt
595e819961 test another request after truncated response
to check for hangs or timeouts
2018-04-04 15:45:13 -07:00
Noah Levitt
3f9ecbacac tweak tests to make them pass now that keepalive
is enabled on the test server
2018-04-04 15:41:54 -07:00