Noah Levitt
3ea5c36e7f
add idna as dep with version acceptable to other deps
...
because
<kenji> my understanding is that pip cannot fully resolve version
constraints for indirect dependencies.
2019-01-09 15:10:37 -08:00
Noah Levitt
1ea8a06a69
3 hour hard timeout on urls without content-length
...
so that indefinite streams like icecast radio stations don't hang
forever
2018-11-12 15:57:37 -08:00
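A minimal sketch of the idea in 1ea8a06a69, not the warcprox implementation: when a response has no content-length, stop reading once a hard deadline passes. The function name `read_with_hard_timeout`, the `read_chunk` callable, and the injectable `clock` are hypothetical names introduced for illustration; only the 3 hour figure comes from the commit subject.

```python
import time

HARD_TIMEOUT_SECS = 3 * 60 * 60  # the 3 hours named in the commit subject


def read_with_hard_timeout(read_chunk, content_length=None,
                           hard_timeout=HARD_TIMEOUT_SECS,
                           clock=time.monotonic):
    """Yield chunks from read_chunk() until it returns b''.

    Hypothetical helper: if content_length is None (e.g. an endless
    icecast stream), also stop once hard_timeout seconds have elapsed.
    """
    start = clock()
    while True:
        # Only responses of unknown length are subject to the deadline.
        if content_length is None and clock() - start > hard_timeout:
            break
        chunk = read_chunk()
        if not chunk:
            break
        yield chunk
```

Passing the clock as a parameter keeps the deadline logic testable without waiting; a real proxy would also need socket-level timeouts so that a single blocked read cannot outlive the deadline.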
Noah Levitt
bb50a6c7ff
use predictable id in service registry
...
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
4f836e9179
bump version number
2018-11-06 11:29:33 -08:00
Noah Levitt
9837d3e3a6
make sure we always format WARC-Date properly
...
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:
    def warc_datetime_str(d):
        s = d.isoformat()
        if '.' in s:
            s = s[:s.find('.')]
        return (s + 'Z').encode('utf-8')
isoformat() adds a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part if it's zero. In that case we don't get inside the if statement
there and don't chop off the offset.
Theoretically this case should only happen once in every million
records, but in practice we are seeing it more often than that (maybe in
the ballpark of 1/1000). It could be that there's a codepath that
produces a timestamp with no microsecond part but I'm not seeing that in
the warcprox code.
In any case, this is the fix.
2018-11-06 11:21:12 -08:00
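For illustration, one way to format WARC-Date so the result never depends on whether the microsecond part is zero. This is a sketch under assumptions, not necessarily the fix applied in 9837d3e3a6: the function name `warc_date` is hypothetical, and naive datetimes are assumed to already be in UTC.

```python
from datetime import datetime, timezone


def warc_date(d):
    """Format a datetime as a WARC-Date value such as b'2018-11-04T06:34:35Z'.

    Hypothetical helper; naive datetimes are assumed to be UTC.
    """
    if d.tzinfo is not None:
        # Normalize aware datetimes to UTC, then drop tzinfo so that no
        # "+00:00" offset can end up in the output.
        d = d.astimezone(timezone.utc).replace(tzinfo=None)
    # An explicit format string emits neither fractional seconds nor an
    # offset, unlike isoformat().
    return d.strftime('%Y-%m-%dT%H:%M:%SZ').encode('ascii')
```

Using an explicit format string avoids the string surgery on `isoformat()` output that made the original behavior depend on the presence of a '.'.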
Noah Levitt
1460040789
bump version after merge
2018-10-31 16:23:00 -07:00
jkafader
9a59299f98
Merge pull request #106 from nlevitt/fix-seconds-behind
...
take all the queues and active requests into...
2018-10-31 15:21:53 -07:00
Noah Levitt
2f98d93467
datetimes with timezone in status because...
...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
Noah Levitt
dbf868a74d
be clear about timezone in timestamps
2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf
take all the queues and active requests into...
...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e
send nice 503s and avoid scary stack traces...
...
... at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d
fix failing test
2018-10-26 13:44:27 -07:00
Noah Levitt
e993b0c28c
fix shutdown
...
at shutdown, abort active connections, but allow completed fetches to
finish processing
this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
2018-10-26 13:21:15 -07:00
Noah Levitt
4f01772782
enforce limits on WARCPROX_WRITE_RECORD requests
...
should make test from previous commit pass
2018-10-10 18:24:54 -07:00
Noah Levitt
4c0dfb432e
failing test for new feature, enforcing limits on
...
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
57e1b82e3d
bump version after merge
2018-09-19 13:03:59 -07:00
Noah Levitt
d8edc551ba
Merge pull request #105 from nlevitt/host-port-in-log-name
...
include warcprox host and port in filenames
2018-09-19 13:03:19 -07:00
Noah Levitt
269e9604c1
include warcprox host and port in filenames
...
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
45aed2e4f6
Merge pull request #104 from nlevitt/arch-svg
...
replace pencil drawing with nice diagram by James
2018-09-17 17:13:42 -07:00
Noah Levitt
741436ddcb
replace pencil drawing with nice diagram by James
...
Kafader
2018-09-17 17:11:51 -07:00
Noah Levitt
ea7257a2b6
Merge pull request #103 from nlevitt/love
...
Love
2018-08-20 14:26:02 -07:00
Noah Levitt
4f04172374
fix bug
2018-08-20 12:07:51 -07:00
Noah Levitt
8dfb63f70d
readable stack traces, thanks py.test
2018-08-20 12:07:23 -07:00
Noah Levitt
5654bcbeb8
--quiet means NOTICE level logging
...
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
de01700c54
tweak max threads option handling
2018-08-20 11:13:14 -07:00
Noah Levitt
bfe3f18126
set socket timeout for tor .onion fetching
2018-08-20 11:11:13 -07:00
Noah Levitt
2e71d86072
WARCPROX_WRITE_RECORD respect buffer size setting
2018-08-20 11:09:53 -07:00
Noah Levitt
e4befeec14
--help-hidden for help on hidden args
2018-08-20 11:08:32 -07:00
Noah Levitt
1d1a73536a
half-baked readme section on warcprox architecture
2018-08-20 11:05:58 -07:00
Noah Levitt
8f51ba4ab9
bump dev version number after merge
2018-08-16 17:09:35 -07:00
Noah Levitt
8be7ddee2b
Merge pull request #100 from nlevitt/karl-copy-edits
...
Karl's copy edits
2018-08-16 17:08:14 -07:00
Noah Levitt
9da5e86b67
restore 80 column lines
2018-08-16 16:32:55 -07:00
Karl-Rainer Blumenthal
fa6b98cf4e
Copy edits updated
...
Edits for readability updated as per https://github.com/internetarchive/warcprox/pull/95#discussion_r200491731
@nlevitt please go ahead and apply your < 80 lines retroactively and I'll refrain from that in future PRs.
2018-08-16 16:31:23 -07:00
Karl-Rainer Blumenthal
b72192d3d0
Copy edits
2018-08-16 16:31:05 -07:00
Noah Levitt
f8b86a0122
update cryptography dep version
...
github tells me there's a vulnerability <2.3
2018-08-16 12:54:30 -07:00
Noah Levitt
17a5fabb75
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
...
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
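A short sketch of why `tempfile.SpooledTemporaryFile` suits large payloads like those mentioned in 17a5fabb75: data stays in memory up to a threshold and spills to disk beyond it. The helper name `buffer_payload` and the 512 KiB threshold are assumptions for illustration, not values taken from warcprox.

```python
import tempfile


def buffer_payload(chunks, max_in_memory=512 * 1024):
    """Copy an iterable of byte chunks into a spooled temp file.

    Hypothetical helper: payloads up to max_in_memory bytes stay in
    RAM; larger ones are rolled over to a file on disk automatically.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf
```

The caller reads the result like any binary file object and should close it when done, which also removes any on-disk spill file.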
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
...
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390
Apply blackout only when dedup URL equals request URL
2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a
New --blackout-period option to skip writing redundant revisits to WARC
...
Add option `--blackout-period` (default=0)
When set, if the record is a duplicate (revisit record), check the
datetime of `dedup_info` and, if it's inside the `blackout_period`, skip
writing the record to WARC.
Add some unit tests.
This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
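The check described in 2c2c1d008a could look roughly like the sketch below. It is not the code from the pull request: the function name `in_blackout` is hypothetical, the period is assumed to be in seconds, and the dedup timestamp is assumed to arrive as a timezone-aware `datetime` rather than in whatever form `dedup_info` actually stores it.

```python
from datetime import datetime, timedelta, timezone


def in_blackout(dedup_date, blackout_period, now=None):
    """Return True if a revisit record should be skipped.

    Hypothetical helper. dedup_date is when the original capture was
    recorded; blackout_period is assumed to be in seconds, with 0 (the
    default named in the commit) disabling the feature.
    """
    if blackout_period <= 0:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    # Skip the revisit if the original capture is recent enough.
    return now - dedup_date <= timedelta(seconds=blackout_period)
```

A writer would call this only for records already identified as duplicates, and skip the write when it returns True.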
Noah Levitt
fbce243787
bump dev version after pull request
2018-07-19 11:18:31 -05:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
...
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
fde443070c
dumb mistake
2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904
hopefully fix a trough dedup concurrency bug
2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2
some logging improvements
2018-07-18 19:25:43 -05:00
Noah Levitt
f4cf782922
test should expose trough dedup concurrency bug
2018-07-18 19:23:24 -05:00
Noah Levitt
67392930f6
Merge pull request #97 from nlevitt/fix-travis-clean
...
run trough with python 3.6 plus travis cleanup
2018-07-18 16:38:08 -05:00
Noah Levitt
46d5b0e82c
run trough with python 3.6 plus travis cleanup
...
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126
also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403
record request method in crawl log if not GET
2018-07-17 13:47:52 -05:00
Noah Levitt
8c22c55955
back to dev version number
2018-07-17 12:04:08 -05:00
Noah Levitt
6786a668b1
2.4b2 for pypi
2.4b2
2018-07-17 12:03:26 -05:00