833 Commits

Author SHA1 Message Date
Noah Levitt
21731a2dfe
Merge pull request #122 from nlevitt/avoid-oserror
avoid exception sending error to client
2019-04-09 14:51:28 -07:00
Noah Levitt
7560c0946d avoid exception sending error to client
this is a slightly different approach to
https://github.com/internetarchive/warcprox/pull/121
2019-04-09 21:16:45 +00:00
Noah Levitt
2ca84ae023 bump version after merge 2019-04-08 11:50:27 -07:00
Noah Levitt
4893a8eac0
Merge pull request #119 from vbanos/max-headers
Increase the MAXHEADERS limit of http client
2019-04-08 11:50:08 -07:00
Noah Levitt
c048b05d46
Merge pull request #120 from nlevitt/travis-trough
fixing travis build
2019-04-08 11:25:35 -07:00
Noah Levitt
ac3d238a3d new snakebite git url 2019-04-08 11:11:51 -07:00
Vangelis Banos
0cab6fc4bf Increase the MAXHEADERS limit of http client
`http.client` has an arbitrary limit of MAXHEADERS=100. If a response
from a target URL has more headers than that, it raises an HTTPException
and the request fails. (The target pages are perfectly fine apart from
having more than 100 headers).
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113

We increase this limit to 7000. We currently use this in production WBM.
We bumped into the same issue trying to replay pages with too many
HTTP headers. We increased the limit progressively from 100 to 500, 1000
etc and we found that 7000 is a good place to stop.
2019-04-08 16:13:14 +00:00
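A minimal sketch of the kind of change 0cab6fc4bf describes, assuming the limit is raised by overriding the private module constant `http.client._MAXHEADERS` (the name CPython uses at the linked source line; the commit message writes it as MAXHEADERS). The helper name `raise_max_headers` is hypothetical and not taken from warcprox.

```python
import http.client

def raise_max_headers(limit=7000):
    """Raise http.client's header-count limit and return the previous value.

    Assumption: the limit lives in the private constant _MAXHEADERS,
    which http.client reads each time it parses response headers.
    """
    previous = http.client._MAXHEADERS
    http.client._MAXHEADERS = limit
    return previous

previous = raise_max_headers(7000)
print(previous, "->", http.client._MAXHEADERS)
```

Because `_MAXHEADERS` is a private name, it could change between Python versions, so an override like this is worth re-checking on interpreter upgrades.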
Noah Levitt
794cc29c80 bump version 2019-03-21 16:04:05 -07:00
Noah Levitt
5633ae6a9c
Merge pull request #117 from nlevitt/travis-py37
travis-ci python 3.7
2019-03-21 16:03:43 -07:00
Noah Levitt
3f08639553 still seeing a warning but 🤷‍♂️ 2019-03-21 16:00:36 -07:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Noah Levitt
f2eebae641 Merge branch 'master' into travis-py37
* master:
  account for surt fix in urlcanon 0.3.0
  every change is a point release now
  Upgrade PyYAML to >=5.1
  Use YAML instead of JSON
  Add option to load logging conf from JSON file
2019-03-21 13:48:58 -07:00
Noah Levitt
a291de086d
Merge pull request #118 from nlevitt/urlcanon-surt-fix
account for surt fix in urlcanon 0.3.0
2019-03-21 13:48:29 -07:00
Noah Levitt
cb2a07bff2 account for surt fix in urlcanon 0.3.0 2019-03-21 12:59:32 -07:00
Noah Levitt
1e0a0ca63a every change is a point release now 2019-03-21 12:38:29 -07:00
Noah Levitt
df7b46d94f
Merge pull request #116 from vbanos/logging-config-file
Add option to load logging conf from YAML file
2019-03-21 12:37:24 -07:00
Vangelis Banos
436a27b19e Upgrade PyYAML to >=5.1 2019-03-21 19:34:52 +00:00
Noah Levitt
b0367a9c82 fix pypy3? see:
https://docs.travis-ci.com/user/languages/python/
2019-03-21 12:25:51 -07:00
Vangelis Banos
878ab0977f Use YAML instead of JSON
Add PyYAML<=3.13 dependency.
2019-03-21 19:18:55 +00:00
Noah Levitt
c8f1c64494 travis-ci python 3.7 2019-03-21 12:15:39 -07:00
Vangelis Banos
6e6b43eb79 Add option to load logging conf from JSON file
New option `--logging-conf-file` to load `logging` conf from a JSON
file.

Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig` because JSON format is much better (nesting
is supported) and it's easier to detect errors.
2019-03-20 11:53:32 +00:00
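A rough sketch of what loading `logging` configuration from a JSON file can look like, as described in 6e6b43eb79. It assumes the file holds a dictionary in the schema accepted by the stdlib `logging.config.dictConfig`; the function name `load_logging_conf` and the example config are hypothetical, not warcprox code.

```python
import json
import logging
import logging.config
import os
import tempfile

def load_logging_conf(path):
    """Read a JSON file and apply it with logging.config.dictConfig."""
    with open(path) as f:
        conf = json.load(f)
    logging.config.dictConfig(conf)
    return conf

# Hypothetical example config with nested sections, the property the
# commit message cites as JSON's advantage.
example = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"plain": {"format": "%(levelname)s %(name)s %(message)s"}},
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"}
    },
    "loggers": {"example": {"level": "DEBUG", "handlers": ["console"]}},
}

# Write the example to a temporary file, then load it the way a
# --logging-conf-file option might.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)

loaded = load_logging_conf(tmp.name)
os.unlink(tmp.name)
logging.getLogger("example").debug("logging configured from JSON")
```

A malformed file fails at `json.load` and an invalid schema fails at `dictConfig`, so errors surface at startup rather than later.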
Noah Levitt
c70bf2e2b9 debugging a shutdown issue 2019-02-27 12:36:35 -08:00
Noah Levitt
adca46427d back to dev version number 2019-02-12 15:04:22 -08:00
Noah Levitt
5a7a4ff710 pypi release 2.4b6 2019-02-12 15:00:22 -08:00
Noah Levitt
2824ee6a5b omfg too many warcs 2019-02-12 14:59:54 -08:00
Noah Levitt
dde2c3efda
Merge pull request #114 from vbanos/cdx-dedup-lru-cache
Use in-memory LRU cache in CDX Server dedup
2019-02-12 14:35:44 -08:00
Vangelis Banos
99fb998e1d log LRU cache info every 1000 requests
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
Vangelis Banos
660989939e Remove cli option cdxserver-dedup-lru-cache-size
LRU cache is always enabled for cdxserver dedup module with a default
cache size of 1024.
2019-02-12 20:43:27 +00:00
Vangelis Banos
1133715331 Enable cdx dedup lru cache by default
use default value 1024
2019-02-12 08:28:15 +00:00
Vangelis Banos
53f13d3536 Use in-memory LRU cache in CDX Server dedup
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using the stdlib
`functools.lru_cache` decorator.

Cache memory info is available on `INFO` logging outputs like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
2019-02-07 09:08:11 +00:00
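To illustrate where a `CacheInfo(...)` line like the one in 53f13d3536 comes from, here is a small sketch using `functools.lru_cache` with the `maxsize=1024` mentioned in the commit. The `lookup` function is a hypothetical stand-in for a CDX server query, not the warcprox implementation.

```python
from functools import lru_cache

calls = []  # records which keys actually reached the simulated backend

@lru_cache(maxsize=1024)
def lookup(digest_key):
    """Hypothetical stand-in for a CDX server dedup query."""
    calls.append(digest_key)
    return {"key": digest_key}

# Repeated keys are answered from the cache without calling the backend.
for key in ["a", "b", "a", "a", "c"]:
    lookup(key)

info = lookup.cache_info()
print(info)  # CacheInfo(hits=2, misses=3, maxsize=1024, currsize=3)
```

Only three of the five lookups reach the backend, which is the saving the cache is meant to provide for repeated dedup queries.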
Noah Levitt
98f50ca296
Merge pull request #113 from vbanos/cdxserver-dedup-max-threads
Configurable max threads in CdxServerDedupLoader
2019-01-23 10:44:04 -08:00
Vangelis Banos
e04ffa5a36 Change default --cdxserver-dedup-max-threads from 400 to 50 2019-01-23 18:34:33 +00:00
Vangelis Banos
25281376f6 Configurable max threads in CdxServerDedupLoader
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.

We need to be able to tweak this setting because 400 threads create too
many concurrent CDX queries, which cause problems with our production CDX
servers.
2019-01-23 11:07:46 +00:00
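A minimal sketch of wiring a CLI option to a thread pool size, along the lines of 25281376f6. It assumes the loader uses `concurrent.futures.ThreadPoolExecutor` (suggested by `max_workers` in the commit message); the `make_pool` helper and the doubling workload are hypothetical.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cdxserver-dedup-max-threads",
    dest="cdxserver_dedup_max_threads",
    type=int,
    default=400,  # default stated in this commit; a later commit lowers it to 50
)

def make_pool(args):
    """Build the executor with the configured upper bound on worker threads."""
    return ThreadPoolExecutor(max_workers=args.cdxserver_dedup_max_threads)

args = parser.parse_args(["--cdxserver-dedup-max-threads", "50"])
with make_pool(args) as pool:
    # Placeholder workload standing in for CDX queries.
    results = list(pool.map(lambda n: n * 2, range(5)))
print(results)
```

Capping `max_workers` bounds how many queries can be in flight at once, which is the knob the commit message says was needed to protect the CDX servers.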
Noah Levitt
cb72af015a fix idle rollover 2019-01-21 10:37:09 -08:00
Noah Levitt
a780f1774c back to dev version number 2019-01-17 17:15:33 -08:00
Noah Levitt
16e3302d36
Merge pull request #109 from nlevitt/warc-close-api
Warc close api
2019-01-17 17:15:13 -08:00
Noah Levitt
e07ee3630e 2.4b3 for pypi 2.4b3 2019-01-09 15:15:37 -08:00
Noah Levitt
3ea5c36e7f add idna as dep with version acceptable to other deps
because
<kenji> my understanding is that pip cannot fully resolve version
constraints for indirect dependencies.
2019-01-09 15:10:37 -08:00
Noah Levitt
8fd1af1d04 offer WarcproxController to plugin constructors
because plugin needs to get at stuff, especially the warc writer
processor, for this close api to be useful
2019-01-09 22:47:04 +00:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API so that code outside of warcprox proper (in a third-party
plugin, for example) can close open warcs promptly when it knows they
are finished.
2019-01-08 11:27:11 -08:00
Noah Levitt
79d09d013b ThreadPoolExecutor no longer used
it was part of the multi-threaded warc writer implementation
2019-01-08 11:15:20 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
1ea8a06a69 3 hour hard timeout on urls without content-length
so that indefinite streams like icecast radio stations don't hang
forever
2018-11-12 15:57:37 -08:00
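One way to picture a hard timeout on an open-ended stream, as in 1ea8a06a69, is a read loop that checks a deadline between chunks. This is only a sketch under assumptions: the function `read_with_deadline`, its parameters, and the injectable `clock` are hypothetical, and the actual warcprox change may work differently.

```python
import io
import time

THREE_HOURS = 3 * 60 * 60  # the hard timeout named in the commit, in seconds

def read_with_deadline(stream, max_seconds=THREE_HOURS, chunk_size=8192,
                       clock=time.monotonic):
    """Read until EOF or until max_seconds have elapsed.

    Returns (data, timed_out). The deadline is checked between reads,
    so a single blocking read is not interrupted by this sketch.
    """
    deadline = clock() + max_seconds
    chunks = []
    while True:
        if clock() >= deadline:
            return b"".join(chunks), True
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks), False
        chunks.append(chunk)

# A finite stream ends normally, well before the deadline.
data, timed_out = read_with_deadline(io.BytesIO(b"abc"))
print(data, timed_out)
```

In a real proxy the socket would also need its own read timeout so that one stalled read cannot outlast the deadline.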
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
4f836e9179 bump version number 2018-11-06 11:29:33 -08:00
Noah Levitt
9837d3e3a6 make sure we always format WARC-Date properly
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:

    def warc_datetime_str(d):
        s = d.isoformat()
        if '.' in s:
            s = s[:s.find('.')]
        return (s + 'Z').encode('utf-8')

`isoformat()` adds a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part if it's zero. In that case we don't get inside the if statement
there and don't chop off the offset.

Theoretically this case should only happen once in every million
records, but in practice we are seeing it more often than that (maybe in
the ballpark of 1/1000). It could be that there's a codepath that
produces a timestamp with no microsecond part but I'm not seeing that in
the warcprox code.

In any case, this is the fix.
2018-11-06 11:21:12 -08:00
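The behaviour described in 9837d3e3a6 can be reproduced with the function quoted in the commit message. The replacement `format_warc_date` below is a hypothetical fix based on `strftime`, shown only to illustrate the idea; it is not claimed to be the code the commit introduced.

```python
from datetime import datetime, timezone

def warc_datetime_str(d):
    # Function quoted in the commit message, reproduced unchanged.
    s = d.isoformat()
    if '.' in s:
        s = s[:s.find('.')]
    return (s + 'Z').encode('utf-8')

def format_warc_date(d):
    # Hypothetical fix: normalize aware datetimes to UTC, then format
    # explicitly so no offset or fractional part can leak through.
    # Naive datetimes are assumed to already be in UTC.
    if d.tzinfo is not None:
        d = d.astimezone(timezone.utc)
    return d.strftime('%Y-%m-%dT%H:%M:%SZ').encode('utf-8')

whole_second = datetime(2018, 11, 4, 6, 34, 35, 0, tzinfo=timezone.utc)
fractional = whole_second.replace(microsecond=123456)

buggy = warc_datetime_str(whole_second)  # b'2018-11-04T06:34:35+00:00Z'
ok = warc_datetime_str(fractional)       # b'2018-11-04T06:34:35Z'
fixed = format_warc_date(whole_second)   # b'2018-11-04T06:34:35Z'
print(buggy, ok, fixed)
```

With a nonzero microsecond, slicing at the '.' happens to remove the offset as well, which is why the malformed date only shows up when the microsecond field is exactly zero.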
Noah Levitt
1460040789 bump version after merge 2018-10-31 16:23:00 -07:00
jkafader
9a59299f98
Merge pull request #106 from nlevitt/fix-seconds-behind
take all the queues and active requests into...
2018-10-31 15:21:53 -07:00
Noah Levitt
2f98d93467 datetimes with timezone in status because...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
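For 2f98d93467, the difference between a naive and a timezone-aware datetime can be sketched as follows. The `status` dictionary and its `start_time` key are hypothetical illustrations, not the actual warcprox status document.

```python
import json
from datetime import datetime, timezone

# A naive datetime carries no timezone; an aware one states it explicitly.
naive = datetime(2018, 10, 31, 18, 0, 21)
aware = naive.replace(tzinfo=timezone.utc)

# Hypothetical status payload: the aware value serializes with its offset.
status = {"start_time": aware.isoformat()}
print(naive.isoformat())   # 2018-10-31T18:00:21
print(json.dumps(status))  # {"start_time": "2018-10-31T18:00:21+00:00"}
```

Carrying the offset removes any ambiguity for consumers of the status data, including stores that require timezones on dates, as the commit message notes for rethinkdb.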
Noah Levitt
dbf868a74d be clear about timezone in timestamps 2018-10-30 13:17:33 -07:00