Noah Levitt
dbf868a74d
be clear about timezone in timestamps
2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf
take all the queues and active requests into account when calculating
the `seconds_behind` number, and include the timestamp
`earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e
send nice 503s and avoid scary stack traces at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d
fix failing test
2018-10-26 13:44:27 -07:00
Noah Levitt
e993b0c28c
fix shutdown
...
at shutdown, abort active connections, but allow completed fetches to
finish processing
this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
2018-10-26 13:21:15 -07:00
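The blocking behaviour described in e993b0c28c (a producer stuck on a full queue after its consumer has exited) can be illustrated with a small sketch. This is not the fix the commit applies, which aborts active connections and lets completed fetches finish; it is a generic mitigation, and the helper name `enqueue_or_give_up`, the `stop` event, and the poll interval are all hypothetical.

```python
import queue
import threading

def enqueue_or_give_up(q, item, stop, poll_interval=0.05):
    """Put item on a bounded queue, giving up once `stop` is set.

    Hypothetical helper, not warcprox code. A plain q.put(item) would block
    forever if the queue is full and the consumer has already exited.
    Returns True if the item was enqueued, False if shutdown came first.
    """
    while not stop.is_set():
        try:
            q.put(item, timeout=poll_interval)
            return True
        except queue.Full:
            continue  # queue still full; re-check the stop flag and retry
    return False

q = queue.Queue(maxsize=1)
stop = threading.Event()
q.put("url-1")   # queue is now full and nothing is consuming from it
stop.set()       # simulate shutdown having been requested
result = enqueue_or_give_up(q, "url-2", stop)
```

With the stop flag set and the queue full, the call returns instead of blocking, which is the property the race condition above lacks.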
Noah Levitt
4f01772782
enforce limits on WARCPROX_WRITE_RECORD requests
...
should make test from previous commit pass
2018-10-10 18:24:54 -07:00
Noah Levitt
269e9604c1
include warcprox host and port in filenames
...
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
4f04172374
fix bug
2018-08-20 12:07:51 -07:00
Noah Levitt
5654bcbeb8
--quiet means NOTICE level logging
...
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
de01700c54
tweak max threads option handling
2018-08-20 11:13:14 -07:00
Noah Levitt
bfe3f18126
set socket timeout for tor .onion fetching
2018-08-20 11:11:13 -07:00
Noah Levitt
2e71d86072
WARCPROX_WRITE_RECORD respect buffer size setting
2018-08-20 11:09:53 -07:00
Noah Levitt
e4befeec14
--help-hidden for help on hidden args
2018-08-20 11:08:32 -07:00
Noah Levitt
17a5fabb75
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
...
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
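For readers unfamiliar with the class named in 17a5fabb75, here is a minimal sketch of how `tempfile.SpooledTemporaryFile` buffers a payload in memory and spills to disk past a size threshold. The function name `buffer_payload` and the 512 KiB threshold are assumptions for illustration, not values taken from warcprox.

```python
import tempfile

def buffer_payload(chunks, max_in_memory=512 * 1024):
    """Copy an iterable of byte chunks into a SpooledTemporaryFile.

    Hypothetical helper. Data is held in memory until the file grows past
    max_in_memory bytes, after which it rolls over to a real temporary file.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf

with buffer_payload([b"abc", b"def"]) as f:
    data = f.read()
```

The caller reads the payload the same way whether it stayed in memory or rolled over to disk, which suits payloads whose size is not known in advance.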
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
...
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390
Apply blackout only when dedup URL equals request URL
2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a
New --blackout-period option to skip writing redundant revisits to WARC
...
Add option `--blackout-period` (default=0)
When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info` and, if it is inside the `blackout_period`, skip
writing the record to WARC.
Add some unit tests.
This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
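The blackout check described in 2c2c1d008a might look roughly like the sketch below. It assumes the period is expressed in seconds and that the earlier capture's date is available as a timezone-aware `datetime`; the function name `in_blackout` and its parameters are hypothetical, not the warcprox implementation.

```python
from datetime import datetime, timedelta, timezone

def in_blackout(dedup_date, now, blackout_period):
    """Return True if a revisit record should be skipped.

    Hypothetical sketch. dedup_date is the datetime of the earlier capture
    (or None if the record is not a duplicate); blackout_period is in
    seconds, with 0 meaning the feature is disabled.
    """
    if not blackout_period or dedup_date is None:
        return False
    return (now - dedup_date).total_seconds() <= blackout_period

now = datetime(2018, 7, 21, 12, 0, 0, tzinfo=timezone.utc)
recent = now - timedelta(seconds=30)
old = now - timedelta(hours=2)
```

With a 60 second period, `in_blackout(recent, now, 60)` is true and `in_blackout(old, now, 60)` is false, so only the recent duplicate would be left out of the WARC.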
Noah Levitt
fde443070c
dumb mistake
2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904
hopefully fix a trough dedup concurrency bug
2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2
some logging improvements
2018-07-18 19:25:43 -05:00
Noah Levitt
2df82bd403
record request method in crawl log if not GET
2018-07-17 13:47:52 -05:00
Noah Levitt
8022257a57
setuptools likes README.rst not readme.rst
2018-07-17 16:35:05 +00:00
Noah Levitt
ec7a0bf569
log exception and continue 🤞 if schema reg fails
...
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e73cbcb6b3
log stack trace in case batch postprocessor raises
...
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
6256ec6a07
add another "wait" to fix failing test
2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2
fix bug in limits enforcement
...
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
2c850876e8
explain warcprox-meta "blocks"
2018-05-25 16:06:12 -07:00
Noah Levitt
4bd49b61a9
starting to explain some warcprox-meta fields
2018-05-25 15:26:26 -07:00
Noah Levitt
b7ebc38491
rename README.rst -> readme.rst
2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe
add some debug logging in BatchTroughLoader
2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b
just one should_dedup() for trough dedup
...
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
e23af32e94
we want to save all captures to the big "captures"
...
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba
default values for dedup_min_text_size et al
...
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Noah Levitt
15830fc5a2
support "captures-bucket" for backward compatibility
2018-05-09 15:43:39 -07:00
Vangelis Banos
abb54e42d1
Add hidden CLI option --dedup-only-with-bucket
...
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
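The behaviour described in abb54e42d1 can be sketched as a small predicate. This is an illustrative reduction, assuming `Warcprox-Meta` has already been parsed into a dict; the free function and its parameter names are hypothetical and stand in for whatever `should_dedup` actually inspects.

```python
def should_dedup(warcprox_meta, dedup_only_with_bucket=False):
    """Decide whether to attempt dedup for a request.

    Hypothetical sketch. warcprox_meta is the parsed Warcprox-Meta header
    as a dict, or None if the request did not send one.
    """
    if dedup_only_with_bucket:
        # With the option on, dedup only requests that name a bucket.
        return bool(warcprox_meta) and "dedup-bucket" in warcprox_meta
    return True

with_bucket = {"dedup-bucket": "my-crawl"}
without_bucket = {"warc-prefix": "my-crawl"}
```

Under this sketch a request without `dedup-bucket` is still deduplicated unless the option is turned on.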
Vangelis Banos
432e42803c
dedup-bucket is required in Warcprox-Meta to do dedup
...
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5
Rename captures-bucket to dedup-bucket in Warcprox-Meta
2018-05-04 13:26:38 +00:00
Noah Levitt
f76b43f2a3
Merge pull request #86 from vbanos/configurable-dedup-size-limits
...
Configurable min dedupable size for text/binary resources
2018-05-03 12:35:43 -07:00
Vangelis Banos
255d359ad4
Use DedupableMixin in RethinkCapturesDedup
...
I note that we didn't do any payload_size check at all here.
2018-04-24 17:06:56 +00:00
Vangelis Banos
6dce8cc644
Remove method decorate_with_dedup_info
...
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36
Use DedupableMixin in all dedup classes
...
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Noah Levitt
a1930495af
default to 100 proxy threads, 1 warc writer thread
...
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a
include warc writer worker threads in profiling
2018-04-11 22:35:37 +00:00
Noah Levitt
cc8fb4c608
cap the number of urls queued for warc writing
2018-04-11 22:29:50 +00:00
Noah Levitt
cb0dea3739
oops! /status has been lying about queued urls
2018-04-11 22:05:31 +00:00
Vangelis Banos
d32bf743bd
Configurable min dedupable size for text/binary resources
...
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.
New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`
New utility method `RecordedUrl.is_text`.
2018-04-09 15:52:44 +00:00
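A minimal sketch of the mixin described in d32bf743bd follows. The constructor arguments and the `FakeRecordedUrl` stand-in are hypothetical, and the strict `>` comparison is an assumption chosen so that the default of `0` matches the earlier `payload_size() > 0` check.

```python
from dataclasses import dataclass

@dataclass
class FakeRecordedUrl:
    """Hypothetical stand-in for a recorded url, for illustration only."""
    text: bool
    size: int

    def is_text(self):
        return self.text

    def payload_size(self):
        return self.size

class DedupableMixin:
    """Sketch of a mixin applying separate size floors to text and binary."""

    def __init__(self, min_text_size=0, min_binary_size=0):
        self.min_text_size = min_text_size
        self.min_binary_size = min_binary_size

    def is_dedupable(self, recorded_url):
        if recorded_url.is_text():
            return recorded_url.payload_size() > self.min_text_size
        return recorded_url.payload_size() > self.min_binary_size

dedup = DedupableMixin(min_text_size=100, min_binary_size=1000)
small_html = FakeRecordedUrl(text=True, size=50)
large_image = FakeRecordedUrl(text=False, size=5000)
```

Here `small_html` falls below the text floor and would not be deduplicated, while `large_image` clears the binary floor.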
Vangelis Banos
cce0c705fb
Fix Accept-Encoding request header
2018-04-06 19:55:19 +00:00
Vangelis Banos
7c5c5da9b7
CDX dedup improvements
...
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.
Increase CDX dedup max workers from 200 to 400 in order to handle more
load.
Set `user-agent: warcprox` for HTTP requests we send to the CDX server. It's
useful to identify and monitor `warcprox` requests.
Pass HTTP headers to connection pool on init and not on each request.
2018-04-06 19:55:19 +00:00
Noah Levitt
385014c322
always call socket.shutdown() to close connections
2018-04-04 17:49:08 -07:00
Noah Levitt
7ef0612fa6
close connection when truncating response
2018-04-04 15:45:32 -07:00