58 Commits

Author SHA1 Message Date
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8 more stats available from /status (and in rethinkdb services table) 2017-12-28 17:07:02 -08:00
Noah Levitt
eacf070a2a dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass) 2017-12-21 15:45:39 -08:00
Noah Levitt
95b2b86487 better error message for bad WARCPROX_WRITE_RECORD request 2017-11-15 23:41:44 +00:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
3a0f6e0947 fix payload digest by pulling calculation up one level where content has already been transfer-decoded 2017-11-10 17:18:22 -08:00
Noah Levitt
78c6137016 fix crawl log handling of WARCPROX_WRITE_RECORD request 2017-11-09 12:35:10 -08:00
Noah Levitt
5c18054d37 Merge branch 'master' into crawl-log
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire an exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
  greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
  avoid TypeError: 'NoneType' object is not iterable exception at shutdown
  wait for rethinkdb indexes to be ready
  Remove deleted ``close`` method call from test.
  bump dev version number after merging pull requests
  Add missing "," in deps
  Remove tox.ini, move warcio to test_requires
  allow very long request header lines, to support large warcprox-meta header values
  Remove redundant stop() & sync() dedup methods
  Remove redundant close method from DedupDb and RethinkDedupDb
  Remove unused imports
  Add missing packages from setup.py, add tox config.
  fix python2 tests
  don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
  fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-09 11:13:02 -08:00
Vangelis Banos
ca3121102e Move Warcprox-Meta header construction to warcproxy 2017-11-02 08:24:28 +00:00
Vangelis Banos
56f0118374 Replace timestamp parameter with more generic request/response syntax
Replace timestamp parameter with more generic extra_response_headers={}

When the request is sent with --header ``Warcprox-Meta: {"accept":["capture-metadata"]}``,
the response has the following header:
``Warcprox-Meta: {"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}``

Update unit test
2017-10-31 10:49:10 +00:00
Vangelis Banos
3d9a22b6c7 Return capture timestamp
When the client request has the HTTP header ``Warcprox-Meta: {"return-capture-timestamp": 1}``,
add the WARC record timestamp to the response in the following HTTP header:
``Warcprox-Meta: {"capture-timestamp": "%Y-%m-%d %H:%M:%S"}``.

Add unit test.
2017-10-29 18:48:08 +00:00
Vangelis Banos
66b4c35322 Remove unused imports 2017-09-24 11:15:30 +00:00
Noah Levitt
8bfda9f4b3 fix python2 tests 2017-09-20 11:03:36 -07:00
Noah Levitt
1bca9d0324 don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string 2017-09-18 14:45:16 -07:00
Noah Levitt
ecb07fc9cd heritrix-style crawl log support 2017-08-07 13:07:54 -07:00
Noah Levitt
7aed867c90 disallow slash and backslash in warc-prefix 2017-08-07 11:30:52 -07:00
Noah Levitt
f17584836e add another field to status api and service registry, "threads", the size of the proxy server thread pool 2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
8caae0d7d3 new api, http://{warcprox_host}:{port}/status returns status info json 2017-03-23 09:56:51 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
7c1d5796a3 fix problem in python 2 where warcprox was always single-threaded, because of "old-style" class inheritance issues 2017-02-06 10:56:54 -08:00
Noah Levitt
adb264b40e treat limit value of null, zero, or negative as meaning "unlimited" 2017-02-03 16:20:15 -08:00
Noah Levitt
719380e612 refactor some general mitm proxy stuff into mitmproxy.py 2016-10-19 15:32:58 -07:00
Noah Levitt
15eeaebde5 fix for connection hang on https urls missing a content-length http response header 2016-10-19 13:45:46 -07:00
Noah Levitt
5eed7061b1 do not require --kafka-capture-feed-topic to make the kafka capture feed work (it can be configured per job or per site) 2016-07-05 11:51:56 -05:00
Noah Levitt
a59871e17b idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci) 2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b switching from host limits to domain limits, which apply in aggregate to the host and subdomains 2016-06-29 14:56:14 -05:00
Noah Levitt
2c8b194090 really only apply host limits to the host 2016-06-28 15:53:29 -05:00
Noah Levitt
04c4b63f03 renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well 2016-06-28 15:35:02 -05:00
Noah Levitt
320df0565e support "soft limits" which result in a different response code (430) than regular (hard) limits (which result in a 420) 2016-06-27 16:07:20 -05:00
Noah Levitt
2fe0c2f25b support for tallying substats of a configured bucket by host, and enforcing host limits using those stats, with tests 2016-06-24 20:04:27 -05:00
Noah Levitt
4bb3556709 implement enforcement of Warcprox-Meta header block rules; includes automated tests 2016-05-10 23:11:47 +00:00
Noah Levitt
4fd17be339 started adding some docstrings, and moved some of the more generally man-in-the-middle recording proxy code from warcproxy.py into mitmproxy.py 2016-05-10 01:11:17 -07:00
Noah Levitt
0809c78486 add Strict-Transport-Security to list of http response headers to swallow, to avoid some problems with HSTS when browsing through warcprox (doesn't solve the case of preloaded HSTS though) 2016-04-08 23:26:20 -07:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
422672408a fix this error
File "/home/nlevitt/workspace/warcprox/warcprox/warcproxy.py", line 256, in _proxy_request
    return recorded_url
UnboundLocalError: local variable 'recorded_url' referenced before assignment
2016-03-04 21:02:47 +00:00
Noah Levitt
918fdd3e9b heuristic to set size of thread pool based on open files limit, to hopefully fix problem where warcprox got stuck because it ran out of file handles 2016-03-04 20:59:11 +00:00
Noah Levitt
00dc9eed84 new option --onion-tor-socks-proxy, host:port of tor socks proxy, used only to connect to .onion sites 2016-01-26 18:47:08 -08:00
Noah Levitt
734b2f5396 limit max number of threads to 500; make sure connection with proxy client has a timeout; log errors from connection with proxy client 2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446 hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements 2016-01-26 18:47:08 -08:00
Noah Levitt
0171cdd01d fixes for python 2.7 2016-01-26 18:47:08 -08:00
Noah Levitt
dd1c7b5f7d don't implement __del__, maybe it can cause mem leaks; bunch of logging to try to detect leaks 2016-01-26 18:47:08 -08:00
Noah Levitt
12432b23ae for captures table generate canonical surt with scheme:// 2016-01-26 18:47:08 -08:00
Noah Levitt
c02c98e369 make sure warc headers are bytes 2016-01-26 18:47:08 -08:00
Noah Levitt
b30218027e get "mimetype" (without ;params) from content-type in one place in RecordedUrl, and also note host and duration (time spent serving request) 2016-01-26 18:47:08 -08:00
Noah Levitt
decb985250 add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it 2016-01-26 18:47:08 -08:00
Noah Levitt
44a62111fb support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...} 2016-01-26 18:47:08 -08:00
Noah Levitt
ab4e90c4b8 make warc-date follow warc spec "timestamp shall represent the instant that data capture for record creation began" 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
cc71c331a1 modify response headers from server, always send connection:close to proxy client 2016-01-26 18:47:08 -08:00