Noah Levitt
2772b80fab
bump version after merge
2019-05-14 15:50:59 -07:00
Vangelis Banos
89d987a181
Cache bad target hostname:port to avoid reconnection attempts
If connection to a hostname:port fails, add it to a `TTLCache` with
60 sec expiration time. Subsequent requests to the same hostname:port
return really quickly as we check the cache and avoid trying a new
network connection.
The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.
Adding the `cachetools` dependency was necessary because stdlib has no
ready-made expiring in-memory cache. The library has no dependencies of
its own, has good test coverage and seems maintained. It also supports
Python 3.7.
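A minimal sketch of the idea described above, not the code from this commit. To stay stdlib-only it hand-rolls a tiny expiring set in place of `cachetools.TTLCache`; the names `ExpiringSet`, `bad_hosts`, `connect` and the `opener` parameter are hypothetical.

```python
import socket
import time


class ExpiringSet:
    """Tiny stand-in for cachetools.TTLCache: remembers keys for ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._expires = {}  # key -> time after which the key is forgotten

    def add(self, key):
        self._expires[key] = self.clock() + self.ttl

    def __contains__(self, key):
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[key]  # stale entry, allow a fresh attempt
            return False
        return True


# 60 second expiration, as in the commit message
bad_hosts = ExpiringSet(ttl=60)


def connect(host, port, opener=socket.create_connection):
    """Open a connection, failing fast for host:port pairs that failed recently."""
    key = "%s:%s" % (host, port)
    if key in bad_hosts:
        raise ConnectionError("%s failed recently, not retrying yet" % key)
    try:
        return opener((host, port))
    except OSError:
        bad_hosts.add(key)  # remember the failure, then report it as usual
        raise
```

The first failure costs a real connection attempt; repeats within the 60 seconds are rejected from memory, and once the entry expires the next request tries the network again.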
2019-05-09 10:03:16 +00:00
Noah Levitt
41d7f0be53
bump version after merges
2019-05-06 16:49:35 -07:00
Noah Levitt
dfc081fff8
do not write incorrect warc-payload-digest to request records
see https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378
2019-05-02 14:25:29 -07:00
Noah Levitt
38d6e4337d
handle graceful shutdown failure
print stack trace and kill myself -9
2019-04-24 13:14:12 -07:00
Noah Levitt
de01d498cb
requests/urllib3 version conflict
2019-04-24 12:11:20 -07:00
Noah Levitt
3298128e0c
deal with bad content-type header
we had bad stuff get into a crawl log because of a url that returned a
Content-Type header value with spaces in it (but no semicolon)
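One way to keep such a value out of a space-delimited log line, as a sketch only. The function name `crawl_log_mimetype` and the `-` placeholder for a missing value are assumptions, and this is not necessarily how the commit handles it.

```python
def crawl_log_mimetype(content_type):
    """Reduce a Content-Type header value to one whitespace-free token."""
    if not content_type:
        return "-"
    # drop parameters such as "; charset=utf-8"
    value = content_type.split(";", 1)[0]
    # a malformed value may still contain spaces; keep only the first token
    tokens = value.split()
    return tokens[0] if tokens else "-"
```

With this, `text/html charset=utf-8` (spaces, no semicolon) is logged as `text/html` and cannot shift the remaining fields of the log line.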
2019-04-24 10:40:22 -07:00
Noah Levitt
f207e32f50
followup on IncompleteRead
2019-04-15 00:17:50 -07:00
Noah Levitt
5de2569430
bump version after #124 merge
2019-04-13 18:11:02 -07:00
Noah Levitt
98b3c1f80b
bump version after merge
2019-04-09 21:52:31 +00:00
Noah Levitt
2ca84ae023
bump version after merge
2019-04-08 11:50:27 -07:00
Noah Levitt
794cc29c80
bump version
2019-03-21 16:04:05 -07:00
Noah Levitt
cb2a07bff2
account for surt fix in urlcanon 0.3.0
2019-03-21 12:59:32 -07:00
Noah Levitt
1e0a0ca63a
every change is a point release now
2019-03-21 12:38:29 -07:00
Vangelis Banos
436a27b19e
Upgrade PyYAML to >=5.1
2019-03-21 19:34:52 +00:00
Vangelis Banos
878ab0977f
Use YAML instead of JSON
Add PyYAML<=3.13 dependency.
2019-03-21 19:18:55 +00:00
Noah Levitt
c70bf2e2b9
debugging a shutdown issue
2019-02-27 12:36:35 -08:00
Noah Levitt
adca46427d
back to dev version number
2019-02-12 15:04:22 -08:00
Noah Levitt
5a7a4ff710
pypi release
2019-02-12 15:00:22 -08:00
Noah Levitt
cb72af015a
fix idle rollover
2019-01-21 10:37:09 -08:00
Noah Levitt
a780f1774c
back to dev version number
2019-01-17 17:15:33 -08:00
Noah Levitt
e07ee3630e
2.4b3 for pypi
2019-01-09 15:15:37 -08:00
Noah Levitt
3ea5c36e7f
add idna as dep with version acceptable to other deps
because
<kenji> my understanding is that pip cannot fully resolve version
constraints for indirect dependencies.
2019-01-09 15:10:37 -08:00
Noah Levitt
1ea8a06a69
3 hour hard timeout on urls without content-length
so that indefinite streams like icecast radio stations don't hang
forever
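A rough sketch of a hard deadline of this kind, not the code from this commit. `read_body`, its parameters and the chunk size are hypothetical; only the 3 hour figure comes from the message above.

```python
import time

HARD_TIMEOUT = 3 * 60 * 60  # seconds, the 3 hour limit from the commit message


def read_body(stream, content_length=None, hard_timeout=HARD_TIMEOUT,
              clock=time.monotonic, chunk_size=8192):
    """Read a response body, giving up on unbounded streams after hard_timeout."""
    # only responses without content-length get the hard deadline
    deadline = clock() + hard_timeout if content_length is None else None
    chunks = []
    while True:
        if deadline is not None and clock() > deadline:
            raise TimeoutError("no content-length and hard timeout exceeded")
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        chunks.append(chunk)
    return b"".join(chunks)
```

The deadline is only checked between reads, so a per-read socket timeout is still needed to cover a single read that never returns.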
2018-11-12 15:57:37 -08:00
Noah Levitt
bb50a6c7ff
use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
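To illustrate why a predictable id helps, here is a sketch with an in-memory dict standing in for the real service registry. The id format and the names `service_id`, `register` and `registry` are assumptions.

```python
registry = {}  # stand-in for the real service registry


def service_id(role, host, port):
    """Build an id from values that stay the same across restarts."""
    return "%s:%s:%s" % (role, host, port)


def register(role, host, port, **details):
    """Insert or overwrite the entry for this role/host/port."""
    entry = dict(details, role=role, host=host, port=port)
    registry[service_id(role, host, port)] = entry
    return entry
```

Because the id does not depend on a random value or the process id, registering again after a restart overwrites the old entry instead of leaving a stale one next to it.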
2018-11-12 15:11:23 -08:00
Noah Levitt
4f836e9179
bump version number
2018-11-06 11:29:33 -08:00
Noah Levitt
1460040789
bump version after merge
2018-10-31 16:23:00 -07:00
Noah Levitt
52f2ac0f4e
send nice 503s and avoid scary stack traces at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d
fix failing test
2018-10-26 13:44:27 -07:00
Noah Levitt
e993b0c28c
fix shutdown
at shutdown, abort active connections, but allow completed fetches to
finish processing
this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
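The ordering problem can be shown with a toy two-stage pipeline. This is a sketch under assumptions, not warcprox's processor chain: `run_pipeline`, `stage_a` and `stage_b` are hypothetical names, and the fix shown is simply that B keeps draining until A has finished.

```python
import queue
import threading


def run_pipeline(items, maxsize=2):
    """Two-stage pipeline where stage B outlives stage A."""
    q = queue.Queue(maxsize=maxsize)
    a_finished = threading.Event()
    processed = []

    def stage_a():
        for item in items:
            q.put(item)  # blocks while the queue is full
        a_finished.set()

    def stage_b():
        while True:
            # sample the flag before polling, so an empty queue seen after
            # the flag was set really means nothing more is coming
            finished = a_finished.is_set()
            try:
                item = q.get(timeout=0.05)
            except queue.Empty:
                if finished:
                    return
                continue
            processed.append(item)

    threads = [threading.Thread(target=stage_a),
               threading.Thread(target=stage_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

If `stage_b` returned as soon as shutdown was requested, `stage_a` would block in `q.put()` once the queue filled up, which is the hang described above.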
2018-10-26 13:21:15 -07:00
Noah Levitt
4f01772782
enforce limits on WARCPROX_WRITE_RECORD requests
should make test from previous commit pass
2018-10-10 18:24:54 -07:00
Noah Levitt
57e1b82e3d
bump version after merge
2018-09-19 13:03:59 -07:00
Noah Levitt
8f51ba4ab9
bump dev version number after merge
2018-08-16 17:09:35 -07:00
Noah Levitt
f8b86a0122
update cryptography dep version
github tells me there's a vulnerability in versions <2.3
2018-08-16 12:54:30 -07:00
Noah Levitt
17a5fabb75
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD payloads
because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
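For reference, a small sketch of how `tempfile.SpooledTemporaryFile` buffers a payload. The helper name `buffer_payload` and the 512 KiB threshold are assumptions, not values from this commit.

```python
import tempfile


def buffer_payload(chunks, max_in_memory=512 * 1024):
    """Collect payload chunks, spilling to disk once max_in_memory is exceeded."""
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf
```

Small payloads stay in memory; a big video rolls over to a temporary file on disk, so memory use stays bounded by the threshold.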
2018-08-16 11:08:36 -07:00
Noah Levitt
fbce243787
bump dev version after pull request
2018-07-19 11:18:31 -05:00
Noah Levitt
2df82bd403
record request method in crawl log if not GET
2018-07-17 13:47:52 -05:00
Noah Levitt
8c22c55955
back to dev version number
2018-07-17 12:04:08 -05:00
Noah Levitt
6786a668b1
2.4b2 for pypi
2018-07-17 12:03:26 -05:00
Noah Levitt
8022257a57
setuptools likes README.rst not readme.rst
2018-07-17 16:35:05 +00:00
Noah Levitt
ec7a0bf569
log exception and continue 🤞 if schema reg fails
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e8cb3afa71
bump dev version after merge
2018-05-31 16:52:37 -07:00
Noah Levitt
b7ebc38491
rename README.rst -> readme.rst
2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe
add some debug logging in BatchTroughLoader
2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b
just one should_dedup() for trough dedup
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
e23af32e94
we want to save all captures to the big "captures" table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba
default values for dedup_min_text_size et al
because they may be missing in case warcprox is used as a library
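A sketch of the kind of fallback this implies. The helper name `option` and the default of 0 are assumptions; only the option name `dedup_min_text_size` comes from the commit subject.

```python
from types import SimpleNamespace


def option(options, name, default):
    """Read an option, falling back to default if it is missing or None."""
    value = getattr(options, name, None)
    return default if value is None else value


# a caller using warcprox as a library may pass options without the attribute
library_options = SimpleNamespace()
min_text_size = option(library_options, "dedup_min_text_size", 0)
```

Code that compares a payload size against the setting can then rely on getting a number instead of hitting a missing attribute or a `None`.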
2018-05-15 11:22:10 -07:00
Noah Levitt
15830fc5a2
support "captures-bucket" for backward compatibility
2018-05-09 15:43:39 -07:00
Noah Levitt
6f6a88fc0b
bump dev version number after #86
2018-05-03 12:36:16 -07:00
Noah Levitt
a1930495af
default to 100 proxy threads, 1 warc writer thread
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00