444 Commits

Author SHA1 Message Date
Noah Levitt
c70bf2e2b9 debugging a shutdown issue 2019-02-27 12:36:35 -08:00
Noah Levitt
2824ee6a5b omfg too many warcs 2019-02-12 14:59:54 -08:00
Vangelis Banos
99fb998e1d log LRU cache info every 1000 requests
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
Vangelis Banos
660989939e Remove cli option cdxserver-dedup-lru-cache-size
LRU cache is always enabled for cdxserver dedup module with a default
cache size of 1024.
2019-02-12 20:43:27 +00:00
Vangelis Banos
1133715331 Enable cdx dedup lru cache by default
use default value 1024
2019-02-12 08:28:15 +00:00
Vangelis Banos
53f13d3536 Use in-memory LRU cache in CDX Server dedup
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using stdlib `lru_cache` method.

Cache memory info is available on `INFO` logging outputs like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
``
2019-02-07 09:08:11 +00:00
Vangelis Banos
e04ffa5a36 Change default --cdxserver-dedup-max-threads from 400 to 50 2019-01-23 18:34:33 +00:00
Vangelis Banos
25281376f6 Configurable max threads in CdxServerDedupLoader
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.

We need to be able to tweak this setting because it creates too many CDX
queries which cause problems with our production CDX servers.
2019-01-23 11:07:46 +00:00
Noah Levitt
cb72af015a fix idle rollover 2019-01-21 10:37:09 -08:00
Noah Levitt
8fd1af1d04 offer WarcproxController to plugin constructors
because plugin needs to get at stuff, especially the warc writer
processor, for this close api to be useful
2019-01-09 22:47:04 +00:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API to allow some code from outside of warcprox proper (in a
third-party plugin for example) can close open warcs promptly when it
knows they are finished.
2019-01-08 11:27:11 -08:00
Noah Levitt
79d09d013b ThreadPoolExecutor no longer used
it was part of the multi-threaded warc writer implementation
2019-01-08 11:15:20 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
1ea8a06a69 3 hour hard timeout on urls without content-length
so that indefinite streams like icecast radio stations don't hang
forever
2018-11-12 15:57:37 -08:00
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
9837d3e3a6 make sure we always format WARC-Date properly
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:

    def warc_datetime_str(d):
        s = d.isoformat()
        if '.' in s:
            s = s[:s.find('.')]
        return (s + 'Z').encode('utf-8')

isoformat() adds a timestamp like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part if it's zero. In that case we don't get inside the if statement
there and don't chop off the timestamp.

Theoretically this case should only happen once in every million
records, but in practice we are seeing it more often than that (maybe in
the ballpark of 1/1000). It could be that there's a codepath that
produces a timestamp with no microsecond part but I'm not seeing that in
the warcprox code.

In any case, this is the fix.
2018-11-06 11:21:12 -08:00
Noah Levitt
2f98d93467 datetimes with timezone in status because...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
Noah Levitt
dbf868a74d be clear about timezone in timestamps 2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e send nice 503s and avoid scary stack traces...
... at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d fix failing test 2018-10-26 13:44:27 -07:00
Noah Levitt
e993b0c28c fix shutdown
at shutdown, abort active connections, but allow completed fetches to
finish processing

this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
2018-10-26 13:21:15 -07:00
Noah Levitt
4f01772782 enforce limits on WARCPROX_WRITE_RECORD requests
should make test from previous commit pass
2018-10-10 18:24:54 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
4f04172374 fix bug 2018-08-20 12:07:51 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
de01700c54 tweak max threads option handling 2018-08-20 11:13:14 -07:00
Noah Levitt
bfe3f18126 set socket timeout for tor .onion fetching 2018-08-20 11:11:13 -07:00
Noah Levitt
2e71d86072 WARCPROX_WRITE_RECORD respect buffer size setting 2018-08-20 11:09:53 -07:00
Noah Levitt
e4befeec14 --help-hidden for help on hidden args 2018-08-20 11:08:32 -07:00
Noah Levitt
17a5fabb75 use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
payloads. because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390 Apply blackout on when dedup URL equals request URL 2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info` and its inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
Noah Levitt
fde443070c dumb mistake 2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904 hopefully fix a trough dedup concurrency bug 2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2 some logging improvements 2018-07-18 19:25:43 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
8022257a57 setuptools likes README.rst not readme.rst 2018-07-17 16:35:05 +00:00
Noah Levitt
ec7a0bf569 log exception and continue 🤞 if schema reg fails
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e73cbcb6b3 log stack trace in case batch postprocessor raises
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
2c850876e8 explain warcprox-meta "blocks" 2018-05-25 16:06:12 -07:00
Noah Levitt
4bd49b61a9 starting to explain some warcprox-meta fields 2018-05-25 15:26:26 -07:00
Noah Levitt
b7ebc38491 rename README.rst -> readme.rst 2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe add some debug logging in BatchTroughLoader 2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b just one should_dedup() for trough dedup
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
e23af32e94 we want to save all captures to the big "captures"
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba default values for dedup_min_text_size et al
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00