160 Commits

Author SHA1 Message Date
Vangelis Banos
56f0118374 Replace timestamp parameter with more generic request/response syntax
Replace timestamp parameter with more generic extra_response_headers={}

When the request has the header ``Warcprox-Meta: {"accept":["capture-metadata"]}``,
the response has the following header:
``Warcprox-Meta: {"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}``

Update unit test
2017-10-31 10:49:10 +00:00
Vangelis Banos
3d9a22b6c7 Return capture timestamp
When the client request has the HTTP header ``Warcprox-Meta: {"return-capture-timestamp": 1}``,
add the WARC record timestamp to the response in the following HTTP header:
``Warcprox-Meta: {"capture-timestamp": "%Y-%m-%d %H:%M:%S"}``.

Add unit test.
2017-10-29 18:48:08 +00:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Noah Levitt
9b8043d3a2 greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code 2017-10-06 17:00:35 -07:00
Noah Levitt
0de10791aa Merge pull request #35 from vbanos/dedup-redundant-code
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00
Vangelis Banos
4e7d8fa917 Remove deleted `close` method call from test. 2017-09-29 06:36:37 +00:00
Noah Levitt
faae23d764 allow very long request header lines, to support large warcprox-meta header values 2017-09-27 17:29:55 -07:00
Noah Levitt
30b69c5838 make test pass with py27 2017-08-07 16:21:08 -07:00
Noah Levitt
8a768dcd44 fix crawl log test to avoid any dedup collisions 2017-08-07 14:06:53 -07:00
Noah Levitt
edcc2cc296 fix crawl log test 2017-08-07 13:23:51 -07:00
Noah Levitt
ecb07fc9cd heritrix-style crawl log support 2017-08-07 13:07:54 -07:00
Noah Levitt
7aed867c90 disallow slash and backslash in warc-prefix 2017-08-07 11:30:52 -07:00
Noah Levitt
b23e485898 simplify recovery of stats batch in case of exception saving them (not sure what was wrong with summy_merge, but this is simpler) 2017-06-22 16:54:04 -07:00
Noah Levitt
c0ee9c6093 avoid holding the lock, which makes all warc writer threads block, while doing rethinkdb operations, in RethinkStatsDb 2017-06-22 16:17:25 -07:00
Noah Levitt
1500341875 use %r instead of calling repr() 2017-06-07 16:05:47 -07:00
Noah Levitt
95dfa54968 get rid of dbm, switch to sqlite, for easier portability, clarity around threading 2017-05-24 13:57:09 -07:00
Noah Levitt
99dd840d20 use "ttl" for updated doublethink svc reg api 2017-05-23 10:37:39 -07:00
Noah Levitt
aca0b881c6 make sure records are written to warc in a predictable order to make tests pass consistently 2017-05-19 16:34:27 -07:00
Noah Levitt
ef5dd2e4ae multiple warc writer threads (hacked in with little thought to code organization) 2017-05-19 16:10:44 -07:00
Noah Levitt
338e5cd878 comment out debug logging thing 2017-04-28 11:08:41 -07:00
Noah Levitt
ca7625b18d set via header on request and response, record request via in warc (because it is sent to the remote site), do not record response via in warc (because it is not sent by the remote site) 2017-04-28 11:07:33 -07:00
Noah Levitt
47680cc17d let test_choose_a_port_for_me pass when service registry is missing, i.e. when not running with rethinkdb 2017-04-17 12:05:39 -07:00
Noah Levitt
3d87ed61be whoops, stop warcprox and join thread in test_choose_a_port_for_me 2017-04-17 11:47:22 -07:00
Noah Levitt
1900dfac08 test choosing port 0 which means, let the system choose one for me, and fix a bug in service registry reporting of the port 2017-04-17 11:45:37 -07:00
Noah Levitt
21a9a26f51 fix some obsolete calls 2017-04-17 11:00:43 -07:00
Noah Levitt
f17584836e add another field to status api and service registry, "threads", the size of the proxy server thread pool 2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
da26b25ac3 accept failures from the tor test 2017-03-28 12:55:30 -07:00
Noah Levitt
89643b7497 make the status api test pass in python 2 2017-03-23 10:13:14 -07:00
Noah Levitt
8caae0d7d3 new api, http://{warcprox_host}:{port}/status returns status info json 2017-03-23 09:56:51 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
842bfd651c rethinkstuff -> doublethink 2017-03-02 15:06:26 -08:00
Noah Levitt
1c7564ee6a really fix tests for python2 2017-02-02 10:09:03 -08:00
Noah Levitt
859c93f390 comment out unused code that fails in py2 2017-02-01 15:42:02 -08:00
Noah Levitt
ddb60876a3 WARCPROX_WRITE_RECORD is exempt from method filter 2017-02-01 15:30:22 -08:00
Noah Levitt
4b505c524b new flag dedup_ok and warcprox-meta field dedup-ok which can be used to prevent deduplication against particular entries in the rethinkdb big captures table 2017-01-13 17:29:05 -08:00
Noah Levitt
de7a23325b a test for alex's method filter 2016-11-15 12:42:25 -08:00
Noah Levitt
3b167459e3 change tested idns to valid idna2008 now that requests 2.12.0 enforces that (for better or worse, see https://github.com/kennethreitz/requests/issues/3687) 2016-11-15 12:08:07 -08:00
Noah Levitt
fa1e8d3af4 allow travis-ci failures for python-nightly and also test 3.6-dev (but allow failures);
enable the onion site tor test because apparently travis-ci is allowing me to
install tor now, see https://travis-ci.org/internetarchive/warcprox/jobs/169101744
although https://github.com/travis-ci/apt-package-whitelist/issues/1753 is still open
2016-10-19 18:24:25 -07:00
Noah Levitt
314be33707 new test that reveals connection hang on https urls missing a content-length http response header (not chunked and server leaves connection open) -- reported by Alex Osborne 2016-10-19 13:43:44 -07:00
Noah Levitt
46c24833ff emoji idn fails with python 2.7, so test with a BMP unicode character 2016-06-29 17:16:50 -05:00
Noah Levitt
33775d360a comment out segfaulting test 2016-06-29 16:47:54 -05:00
Noah Levitt
a59871e17b idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci) 2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b switching from host limits to domain limits, which apply in aggregate to the host and subdomains 2016-06-29 14:56:14 -05:00
Noah Levitt
2c8b194090 really only apply host limits to the host 2016-06-28 15:53:29 -05:00
Noah Levitt
04c4b63f03 renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well 2016-06-28 15:35:02 -05:00
Noah Levitt
320df0565e support "soft limits" which result in a different response code (430) than regular (hard) limits (which result in a 420) 2016-06-27 16:07:20 -05:00
Noah Levitt
fabd732b7f couple of fixes for host limits 2016-06-24 21:58:37 -05:00
Noah Levitt
2fe0c2f25b support for tallying substats of a configured bucket by host, and enforcing host limits using those stats, with tests 2016-06-24 20:04:27 -05:00
Noah Levitt
d48e2c462d add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__ 2016-06-16 00:04:59 +00:00