Noah Levitt
908988c4f0
wait for rethinkdb indexes to be ready
2017-10-06 16:57:39 -07:00
Noah Levitt
0de10791aa
Merge pull request #35 from vbanos/dedup-redundant-code
...
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00
Noah Levitt
9aa330ecb3
Merge pull request #34 from vbanos/remove-unused
...
Remove unused imports
2017-09-28 14:34:58 -07:00
Noah Levitt
faae23d764
allow very long request header lines, to support large warcprox-meta header values
2017-09-27 17:29:55 -07:00
Vangelis Banos
eb266f198d
Remove redundant stop() & sync() dedup methods
...
As with my previous commits, these methods do nothing.
I think they are here because the author uses the
same style elsewhere in the code (e.g.
``warcprox.stats.StatsDb``), where similar methods exist.
2017-09-24 13:44:13 +00:00
Vangelis Banos
d035147e3e
Remove redundant close method from DedupDb and RethinkDedupDb
...
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.
Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.
2017-09-24 13:36:12 +00:00
Vangelis Banos
66b4c35322
Remove unused imports
2017-09-24 11:15:30 +00:00
Noah Levitt
8bfda9f4b3
fix python2 tests
2017-09-20 11:03:36 -07:00
Noah Levitt
1bca9d0324
don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
2017-09-18 14:45:16 -07:00
Noah Levitt
a8adaaf527
Merge pull request #30 from trifle/master
...
allow zero warc_writer_threads
2017-09-12 13:46:12 -07:00
Noah Levitt
a3f84097ee
Merge branch 'master' into crawl-log
...
* master:
no SIGQUIT on windows, so no SIGQUIT handler
https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
fix --size option (https://github.com/internetarchive/warcprox/issues/31 )
fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29 )
2017-09-07 12:28:07 -07:00
Noah Levitt
b89f834ce3
no SIGQUIT on windows, so no SIGQUIT handler
2017-09-07 12:01:51 -07:00
Noah Levitt
c73fdd91f8
Merge pull request #32 from internetarchive/trough
...
hello --plugin, goodbye kafka feed
2017-09-07 10:31:42 -07:00
Noah Levitt
db0f36c745
fix --size option ( https://github.com/internetarchive/warcprox/issues/31 )
2017-09-05 12:43:55 -07:00
Noah Levitt
7e55568851
fix --playback-port option ( https://github.com/internetarchive/warcprox/issues/29 )
2017-09-05 12:20:22 -07:00
Pascal Jürgens
940af4e888
fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-08-18 15:52:34 +02:00
Noah Levitt
bac45a9df2
create crawl log dir at startup if it doesn't exist
2017-08-08 11:54:57 -07:00
Noah Levitt
ecb07fc9cd
heritrix-style crawl log support
2017-08-07 13:07:54 -07:00
Noah Levitt
7aed867c90
disallow slash and backslash in warc-prefix
2017-08-07 11:30:52 -07:00
Noah Levitt
0cf283f058
can't see any reason to split the main() like this (anymore?)
2017-08-03 15:19:57 -07:00
Noah Levitt
c0cb59e5af
Merge branch 'master' into trough
...
* master:
hidden argument --rethinkdb-big-table-name
try to fix https://github.com/internetarchive/warcprox/issues/27
2017-08-03 11:22:27 -07:00
Noah Levitt
13ee68ce4a
hidden argument --rethinkdb-big-table-name
2017-07-20 12:53:59 -07:00
Noah Levitt
b1a8fecd9d
try to fix https://github.com/internetarchive/warcprox/issues/27
2017-07-07 14:54:55 -07:00
Noah Levitt
ad3e6f405d
call stop() at shutdown if present on plugins
2017-06-28 16:40:20 -07:00
Noah Levitt
9ea3540d63
fix misuse of +=
2017-06-28 14:19:06 -07:00
Noah Levitt
2c95a1f2ee
remove kafka feed code
2017-06-28 13:12:30 -07:00
Noah Levitt
4c32394256
new option --plugin
2017-06-28 12:53:34 -07:00
Noah Levitt
e31302a6e3
hide kafka options as first step toward removing them
2017-06-28 12:03:48 -07:00
Noah Levitt
b23e485898
simplify recovery of stats batch in case of exception saving them (not sure what was wrong with summy_merge, but this is simpler)
2017-06-22 16:54:04 -07:00
Noah Levitt
c0ee9c6093
avoid holding the lock, which makes all warc writer threads block, while doing rethinkdb operations, in RethinkStatsDb
2017-06-22 16:17:25 -07:00
Noah Levitt
24082c2e8c
don't wait for queue to be empty to do idle rollovers, because sometimes warcprox can stay busy for a long, long time
2017-06-22 15:04:01 -07:00
Noah Levitt
2f0c4454ac
try not to let problems responding to kill -QUIT (which prints stack trace of each thread) kill the whole process
2017-06-12 16:51:50 -07:00
Noah Levitt
808950abb4
recover properly from exception updating stats in rethinkdb
2017-06-12 16:51:45 -07:00
Noah Levitt
1500341875
use %r instead of calling repr()
2017-06-07 16:05:47 -07:00
Noah Levitt
2f93cdcad9
use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316
2017-05-25 17:38:20 +00:00
Noah Levitt
95dfa54968
get rid of dbm, switch to sqlite, for easier portability, clarity around threading
2017-05-24 13:57:09 -07:00
Noah Levitt
99dd840d20
use "ttl" for updated doublethink svc reg api
2017-05-23 10:37:39 -07:00
Noah Levitt
ef5dd2e4ae
multiple warc writer threads (hacked in with little thought to code organization)
2017-05-19 16:10:44 -07:00
Noah Levitt
a3dde3d97f
fix mistake (incorrect interpretation of concurrent.futures.ThreadPoolExecutor internals) that caused unnecessary waits, and unnecessarily long waits, before calling socket.accept()
2017-05-12 14:18:35 -07:00
Noah Levitt
fd770b71bc
revert stuff accidentally committed as part of eea582c6db9ed6d :(
2017-05-11 11:56:01 -07:00
Noah Levitt
2a0c8c28c9
improvements to run-benchmark.py, primarily to actually make multiple requests in parallel
2017-05-10 18:01:56 +00:00
Noah Levitt
eea582c6db
rewrite run-benchmarks.py for aiohttp2
2017-05-08 20:56:32 -07:00
Noah Levitt
c87ff90bc1
move more stuff in do_COMMAND inside the try block so that exceptions result in a 500 response
2017-05-05 13:44:46 -07:00
Noah Levitt
c642565ad8
bump up the socket backlog argument to try to stop kernel closing attempted connections on linux
2017-05-05 18:49:56 +00:00
Noah Levitt
b2f08535ae
set method when creating ProxyingRecordingHTTPResponse so that it knows when to close the connection, and HEAD requests don't sit around trying to read more data until socket timeout
2017-05-04 12:54:04 -07:00
Noah Levitt
11e11f4e68
early trace-level logging of the requestline
2017-05-03 18:39:57 -07:00
Noah Levitt
c0e6c219ca
python2 fixes
2017-04-28 11:12:17 -07:00
Noah Levitt
ca7625b18d
set via header on request and response, record request via in warc (because it is sent to the remote site), do not record response via in warc (because it is not sent by the remote site)
2017-04-28 11:07:33 -07:00
Noah Levitt
1900dfac08
test choosing port 0 which means, let the system choose one for me, and fix a bug in service registry reporting of the port
2017-04-17 11:45:37 -07:00
Noah Levitt
21a9a26f51
fix some obsolete calls
2017-04-17 11:00:43 -07:00