Vangelis Banos
2068c037ea
Create warcprox.certauth and drop certauth dependency
...
Copy certauth.py and tests_certauth.gr from `certauth==1.1.6`
b526eb2bfd
Change only imports.
Drop unused imports.
Update setup.py: drop `certauth` and add `pyopenssl`.
2024-07-09 11:56:06 +00:00
Adam Miller
731cfe80cc
Adding url canonicalization tests and handling of edge cases to reduce log noise
2022-04-26 23:48:54 +00:00
Adam Miller
1e3d22aba4
Better handle non-ascii urls for crawl log hop info
2022-04-20 22:48:28 +00:00
Adam Miller
7f406b7942
Trying to fix tests that only fail during ci
2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa
Add test cases for space in content type header and exception messages
2021-03-31 23:22:04 +00:00
Noah Levitt
fe19bb268f
use trough.client instead of warcprox.trough
...
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
d1b52f8d80
try to fix test failing due to url-encoding
...
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
not sure what changed, maybe the requests library, though i can't
reproduce locally, but explicitly decoding should fix the problem
2019-09-23 11:16:48 -07:00
Barbara Miller
c0fcf59c86
rm test not matching use case
2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2
more tests
2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622
test_dedup_buckets_multiple
2019-06-13 17:57:29 -07:00
Barbara Miller
6ee7ab36a2
fix tests too
2019-05-31 17:36:13 -07:00
Noah Levitt
f51f2ec225
some tweaks to error responses
...
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Vangelis Banos
89041e83b4
Catch RemoteDisconnected case when starting downloading
...
A common error is to connect to the remote server successfully but raise a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call to `prox_rec_res.begin(...)`, which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.
Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
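A minimal sketch of the behaviour described in the commit above, not the actual warcprox code. The `FakeResponse` class and the `begin_download` helper are hypothetical, and a plain `set` stands in for the `bad_hostnames_ports` cache, whose real type is not stated here.

```python
import http.client

# Hypothetical stand-in for the bad_hostnames_ports cache named in the commit.
bad_hostnames_ports = set()


class FakeResponse:
    """Hypothetical response whose begin() fails like a dropped connection."""

    def begin(self):
        raise http.client.RemoteDisconnected(
            "Remote end closed connection without response")


def begin_download(response, hostname, port):
    """Start reading the response; remember the target if the remote hangs up."""
    try:
        response.begin()
    except http.client.RemoteDisconnected:
        # The connection succeeded but the server closed it before sending
        # a status line, so record the target as bad and re-raise.
        bad_hostnames_ports.add("%s:%s" % (hostname, port))
        raise
```

A test that relies on `localhost` being reachable would call `bad_hostnames_ports.clear()` first, which is the kind of cleanup the commit applies to two unit tests.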
Noah Levitt
f207e32f50
followup on IncompleteRead
2019-04-15 00:17:50 -07:00
Noah Levitt
5ced2588d4
failing test test_incomplete_read
2019-04-13 17:33:38 -07:00
Noah Levitt
ac3d238a3d
new snakebite git url
2019-04-08 11:11:51 -07:00
Noah Levitt
3f08639553
still seeing a warning but 🤷♂️
2019-03-21 16:00:36 -07:00
Noah Levitt
a25971e06b
appease some warnings
2019-03-21 14:17:24 -07:00
Noah Levitt
cb2a07bff2
account for surt fix in urlcanon 0.3.0
2019-03-21 12:59:32 -07:00
Noah Levitt
150c1e67c6
WarcWriterProcessor.close_for_prefix()
...
New API to allow some code from outside of warcprox proper (in a
third-party plugin for example) to close open warcs promptly when it
knows they are finished.
2019-01-08 11:27:11 -08:00
Noah Levitt
0882a2b174
remove --writer-threads option
...
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
bb50a6c7ff
use predictable id in service registry
...
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
f082db62cf
take all the queues and active requests into...
...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
4c0dfb432e
failing test for new feature, enforcing limits on
...
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
269e9604c1
include warcprox host and port in filenames
...
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
5654bcbeb8
--quiet means NOTICE level logging
...
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
17a5fabb75
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
...
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
...
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390
Apply blackout only when dedup URL equals request URL
2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a
New --blackout-period option to skip writing redundant revisits to WARC
...
Add option `--blackout-period` (default=0)
When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it is inside the `blackout_period`, skip
writing the record to WARC.
Add some unit tests.
This is an improved implementation based on @nlevitt's comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
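The blackout check described in the commit above can be sketched roughly as follows. The function name `in_blackout_period` and its parameters are hypothetical. The sketch also assumes the period is given in seconds and that the date from `dedup_info` has already been parsed into a `datetime`.

```python
from datetime import datetime, timedelta, timezone


def in_blackout_period(dedup_date, now, blackout_period):
    """Return True if a revisit record should be skipped.

    dedup_date: datetime of the earlier capture (assumed already parsed).
    now: datetime of the current capture.
    blackout_period: length of the blackout in seconds (assumed unit);
        0 disables the check, matching the default mentioned in the commit.
    """
    if blackout_period <= 0:
        return False
    return now - dedup_date <= timedelta(seconds=blackout_period)


now = datetime(2018, 7, 21, 12, 0, 0, tzinfo=timezone.utc)
recent = now - timedelta(seconds=30)
print(in_blackout_period(recent, now, 60))  # revisit falls inside the blackout
```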
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
...
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
f4cf782922
test should expose trough dedup concurrency bug
2018-07-18 19:23:24 -05:00
Noah Levitt
46d5b0e82c
run trough with python 3.6 plus travis cleanup
...
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126
also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403
record request method in crawl log if not GET
2018-07-17 13:47:52 -05:00
Noah Levitt
6256ec6a07
add another "wait" to fix failing test
2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2
fix bug in limits enforcement
...
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
195faa5cff
new checks exposing bug in limits enforcement
2018-05-25 17:35:32 -07:00
Noah Levitt
36f6696552
fix failure message in test_return_capture_timestamp
2018-05-22 15:00:10 -07:00
Noah Levitt
d834ac3e59
only run tests in py3
2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05
fix trough deployment in Dockerfile
2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944
fix test_dedup_min_text_size failure?
...
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579
rewrite test_dedup_min_size() to account for
...
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
15830fc5a2
support "captures-bucket" for backward compatibility
2018-05-09 15:43:39 -07:00
Vangelis Banos
9baa2e22d5
Rename captures-bucket to dedup-bucket in Warcprox-Meta
2018-05-04 13:26:38 +00:00
Vangelis Banos
9dac806ca1
Fix travis-ci unit test issue
...
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950
We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11
Add unit tests
...
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.
Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
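As an illustration of what the tests in the commit above exercise, here is a small sketch of a minimum-size gate for dedup. The function `eligible_for_dedup` and its parameters are hypothetical. How a payload exactly at the threshold is treated is an assumption, and the examples avoid that boundary.

```python
def eligible_for_dedup(payload_size, is_text, min_text_size=0, min_binary_size=0):
    """Decide whether a payload is large enough to be deduplicated.

    payload_size: size of the response payload in bytes.
    is_text: True for text responses, False for binary ones.
    min_text_size / min_binary_size: thresholds corresponding to
        --dedup-min-text-size and --dedup-min-binary-size.
    The >= comparison at the boundary is an assumption of this sketch.
    """
    threshold = min_text_size if is_text else min_binary_size
    return payload_size >= threshold


# The two dummy responses from the commit: 2-byte text and 4-byte binary,
# with thresholds of 3 and 5 bytes respectively.
print(eligible_for_dedup(2, True, min_text_size=3, min_binary_size=5))
print(eligible_for_dedup(4, False, min_text_size=3, min_binary_size=5))
```

Both dummy responses fall below their thresholds, so neither would be deduplicated under this sketch.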
Noah Levitt
38e2a87f31
make test server multithreaded so tests will pass
2018-04-05 17:59:10 -07:00
Noah Levitt
385014c322
always call socket.shutdown() to close connections
2018-04-04 17:49:08 -07:00
Noah Levitt
595e819961
test another request after truncated response
...
to check for hangs or timeouts
2018-04-04 15:45:13 -07:00
Noah Levitt
3f9ecbacac
tweak tests to make them pass now that keepalive
...
is enabled on the test server
2018-04-04 15:41:54 -07:00