111 Commits

Author SHA1 Message Date
Vangelis Banos
202d664f39 Improve CdxServerDedup implementation
Replace ``_split_timestamp`` with ``datetime.strptime`` in
``warcprox.dedup``.

Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.

Make ``--cdxserver-dedup`` option help more explanatory.
2017-10-20 20:00:02 +00:00
Vangelis Banos
fc5f39ffed Add CDX Server based deduplication
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
828a2c3dcf get all the tests to pass with ./tests/run-tests.sh 2017-10-13 15:54:05 -07:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Noah Levitt
4eda89f232 trough for deduplication initial proof-of-concept-ish code 2017-10-06 17:03:56 -07:00
Vangelis Banos
66b4c35322 Remove unused imports 2017-09-24 11:15:30 +00:00
Noah Levitt
a3f84097ee Merge branch 'master' into crawl-log
* master:
  no SIGQUIT on windows, so no SIGQUIT handler
  https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
  fix --size option (https://github.com/internetarchive/warcprox/issues/31)
  fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29)
2017-09-07 12:28:07 -07:00
Noah Levitt
b89f834ce3 no SIGQUIT on windows, so no SIGQUIT handler 2017-09-07 12:01:51 -07:00
Noah Levitt
c73fdd91f8 Merge pull request #32 from internetarchive/trough
hello --plugin, goodbye kafka feed
2017-09-07 10:31:42 -07:00
Noah Levitt
db0f36c745 fix --size option (https://github.com/internetarchive/warcprox/issues/31) 2017-09-05 12:43:55 -07:00
Noah Levitt
7e55568851 fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29) 2017-09-05 12:20:22 -07:00
Noah Levitt
ecb07fc9cd heritrix-style crawl log support 2017-08-07 13:07:54 -07:00
Noah Levitt
0cf283f058 can't see any reason to split the main() like this (anymore?) 2017-08-03 15:19:57 -07:00
Noah Levitt
c0cb59e5af Merge branch 'master' into trough
* master:
  hidden argument --rethinkdb-big-table-name
  try to fix https://github.com/internetarchive/warcprox/issues/27
2017-08-03 11:22:27 -07:00
Noah Levitt
13ee68ce4a hidden argument --rethinkdb-big-table-name 2017-07-20 12:53:59 -07:00
Noah Levitt
9ea3540d63 fix misuse of += 2017-06-28 14:19:06 -07:00
Noah Levitt
2c95a1f2ee remove kafka feed code 2017-06-28 13:12:30 -07:00
Noah Levitt
4c32394256 new option --plugin 2017-06-28 12:53:34 -07:00
Noah Levitt
e31302a6e3 hide kafka options as first step toward removing them 2017-06-28 12:03:48 -07:00
Noah Levitt
2f0c4454ac try not to let problems responding to kill -QUIT (which prints stack trace of each thread) kill the whole process 2017-06-12 16:51:50 -07:00
Noah Levitt
95dfa54968 get rid of dbm, switch to sqlite, for easier portability, clarity around threading 2017-05-24 13:57:09 -07:00
Noah Levitt
ef5dd2e4ae multiple warc writer threads (hacked in with little thought to code organization) 2017-05-19 16:10:44 -07:00
Noah Levitt
fd770b71bc revert stuff accidentally committed as part of eea582c6db9ed6d :( 2017-05-11 11:56:01 -07:00
Noah Levitt
2a0c8c28c9 improvements to run-benchmark.py, primarily to actually make multiple requests in parallel 2017-05-10 18:01:56 +00:00
Noah Levitt
eea582c6db rewrite run-benchmarks.py for aiohttp2 2017-05-08 20:56:32 -07:00
Noah Levitt
cbefa37fd9 make --queue-size and --max-threads hidden options work 2017-04-11 16:29:57 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
73d934d0a4 turn down kafka log level 2017-03-27 22:42:46 +00:00
Noah Levitt
842bfd651c rethinkstuff -> doublethink 2017-03-02 15:06:26 -08:00
Alex Osborne
90031a2058 add --method-filter option 2016-11-15 23:26:13 +11:00
Noah Levitt
6000237c47 workaround for nasty python/ssl deadlock that has been affecting warcprox, same issue as https://github.com/pyca/cryptography/issues/2911 2016-09-23 15:54:31 +01:00
Noah Levitt
5eed7061b1 do not require --kafka-capture-feed-topic to make the kafka capture feed work (it can be configured per job or per site) 2016-07-05 11:51:56 -05:00
Noah Levitt
b82d82b5f1 command line utility warcprox-ensure-rethinkdb-tables, creates rethinkdb tables if they don't already exist... warcprox normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster 2016-06-30 15:24:40 -05:00
Noah Levitt
6410e4c8c7 reorganize WarcproxController.run_until_shutdown, moving parts of it into new start() and shutdown() methods, for easier integration into a separate python program 2016-06-27 14:18:21 -05:00
Noah Levitt
4bb3556709 implement enforcement of Warcprox-Meta header block rules; includes automated tests 2016-05-10 23:11:47 +00:00
Noah Levitt
4fd17be339 started adding some docstrings, and moved some of the more generally man-in-the-middle recording proxy code from warcproxy.py into mitmproxy.py 2016-05-10 01:11:17 -07:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
918fdd3e9b heuristic to set size of thread pool based on open files limit, to hopefully fix problem where warcprox got stuck because it ran out of file handles 2016-03-04 20:59:11 +00:00
Noah Levitt
00dc9eed84 new option --onion-tor-socks-proxy, host:port of tor socks proxy, used only to connect to .onion sites 2016-01-26 18:47:08 -08:00
Noah Levitt
734b2f5396 limit max number of threads to 500; make sure connection with proxy client has a timeout; log errors from connection with proxy client 2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446 hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements 2016-01-26 18:47:08 -08:00
Noah Levitt
a9fc550453 oops, argparse.SUPPRESS isn't supposed to be in quotes 2016-01-26 18:47:08 -08:00
Noah Levitt
3e2696525b make sure svcreg is set 2016-01-26 18:47:08 -08:00
Noah Levitt
d7d992731c register self for service discovery 2016-01-26 18:47:08 -08:00
Noah Levitt
2169369dab working on benchmarking code... so far they seem to reveal that warcprox behaves poorly under load (perhaps timeouts are configured too short?) 2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2016-01-26 18:47:08 -08:00
Noah Levitt
3b9345e7d7 use nicer rethinkdbstuff.Rethinker api 2016-01-26 18:47:08 -08:00
Noah Levitt
d98f03012b kafka capture feed, for druid 2016-01-26 18:47:08 -08:00
Noah Levitt
44a62111fb support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...} 2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f tests pass with big rethinkdb captures table 2016-01-26 18:47:08 -08:00