Noah Levitt
50d29bdf80
Merge pull request #128 from vbanos/recordedurl-compile
...
Compile RecordedUrl regex to improve performance
2019-05-06 15:52:28 -07:00
Noah Levitt
dfc081fff8
do not write incorrect warc-payload-digest to request records
see https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378
2019-05-02 14:25:29 -07:00
Vangelis Banos
be7048844b
Compile RecordedUrl regex to improve performance
...
Minor optimisation.
2019-05-02 07:11:24 +00:00
Noah Levitt
38d6e4337d
handle graceful shutdown failure
...
print stack trace and kill myself -9
2019-04-24 13:14:12 -07:00
Noah Levitt
de01d498cb
requests/urllib3 version conflict
2019-04-24 12:11:20 -07:00
Noah Levitt
3298128e0c
deal with bad content-type header
...
we had bad stuff get into a crawl log because of a url that returned a
Content-Type header value with spaces in it (but no semicolon)
2019-04-24 10:40:22 -07:00
Noah Levitt
f207e32f50
followup on IncompleteRead
2019-04-15 00:17:50 -07:00
Noah Levitt
5de2569430
bump version after #124 merge
2019-04-13 18:11:02 -07:00
Noah Levitt
10327d28c9
Merge pull request #124 from nlevitt/incomplete-read
...
IncompleteRead fix with test
2019-04-13 18:10:14 -07:00
Noah Levitt
0d268659ab
handle incomplete read
...
see Vangelis's writeup at https://github.com/internetarchive/warcprox/pull/123
2019-04-13 17:46:52 -07:00
Noah Levitt
5ced2588d4
failing test test_incomplete_read
2019-04-13 17:33:38 -07:00
Noah Levitt
98b3c1f80b
bump version after merge
2019-04-09 21:52:31 +00:00
Noah Levitt
21731a2dfe
Merge pull request #122 from nlevitt/avoid-oserror
...
avoid exception sending error to client
2019-04-09 14:51:28 -07:00
Noah Levitt
7560c0946d
avoid exception sending error to client
...
this is a slightly different approach to
https://github.com/internetarchive/warcprox/pull/121
2019-04-09 21:16:45 +00:00
Noah Levitt
2ca84ae023
bump version after merge
2019-04-08 11:50:27 -07:00
Noah Levitt
4893a8eac0
Merge pull request #119 from vbanos/max-headers
...
Increase the MAXHEADERS limit of http client
2019-04-08 11:50:08 -07:00
Noah Levitt
c048b05d46
Merge pull request #120 from nlevitt/travis-trough
...
fixing travis build
2019-04-08 11:25:35 -07:00
Noah Levitt
ac3d238a3d
new snakebite git url
2019-04-08 11:11:51 -07:00
Vangelis Banos
0cab6fc4bf
Increase the MAXHEADERS limit of http client
...
`http.client` has an arbitrary limit of MAXHEADERS=100. If a target URL
has more it raises an HTTPException and the request fails. (The target
pages are perfectly fine besides having more than 100 headers).
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113
We increase this limit to 7000. We currently use this in production WBM.
We bumped into the same issue trying to replay pages with too many
HTTP headers. We increased the limit progressively from 100 to 500, 1000
etc and we found that 7000 is a good place to stop.
2019-04-08 16:13:14 +00:00
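
The header-limit change described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual patch: it assumes the limit is held in CPython's private module attribute `http.client._MAXHEADERS` (the commit message writes it as MAXHEADERS), and the helper names `raise_header_limit` and `count_parsed_headers` are hypothetical.

```python
import http.client
import io


def raise_header_limit(limit=7000):
    """Raise the stdlib header-count limit (assumed to be the private
    attribute http.client._MAXHEADERS) and return the new value."""
    http.client._MAXHEADERS = limit
    return http.client._MAXHEADERS


def count_parsed_headers(n):
    """Build a header block with n synthetic headers and parse it with
    http.client.parse_headers, returning how many headers were read."""
    raw = b"".join(b"X-Header-%d: v\r\n" % i for i in range(n)) + b"\r\n"
    message = http.client.parse_headers(io.BytesIO(raw))
    return len(message.items())


if __name__ == "__main__":
    raise_header_limit(7000)
    # With the default limit of 100 this call would raise HTTPException.
    print(count_parsed_headers(150))
```

Because the attribute is private, a patch like this may need revisiting when the Python version changes.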
Noah Levitt
794cc29c80
bump version
2019-03-21 16:04:05 -07:00
Noah Levitt
5633ae6a9c
Merge pull request #117 from nlevitt/travis-py37
...
travis-ci python 3.7
2019-03-21 16:03:43 -07:00
Noah Levitt
3f08639553
still seeing a warning but 🤷‍♂️
2019-03-21 16:00:36 -07:00
Noah Levitt
a25971e06b
appease some warnings
2019-03-21 14:17:24 -07:00
Noah Levitt
f2eebae641
Merge branch 'master' into travis-py37
...
* master:
account for surt fix in urlcanon 0.3.0
every change is a point release now
Upgrade PyYAML to >=5.1
Use YAML instead of JSON
Add option to load logging conf from JSON file
2019-03-21 13:48:58 -07:00
Noah Levitt
a291de086d
Merge pull request #118 from nlevitt/urlcanon-surt-fix
...
account for surt fix in urlcanon 0.3.0
2019-03-21 13:48:29 -07:00
Noah Levitt
cb2a07bff2
account for surt fix in urlcanon 0.3.0
2019-03-21 12:59:32 -07:00
Noah Levitt
1e0a0ca63a
every change is a point release now
2019-03-21 12:38:29 -07:00
Noah Levitt
df7b46d94f
Merge pull request #116 from vbanos/logging-config-file
...
Add option to load logging conf from YAML file
2019-03-21 12:37:24 -07:00
Vangelis Banos
436a27b19e
Upgrade PyYAML to >=5.1
2019-03-21 19:34:52 +00:00
Noah Levitt
b0367a9c82
fix pypy3? see:
...
https://docs.travis-ci.com/user/languages/python/
2019-03-21 12:25:51 -07:00
Vangelis Banos
878ab0977f
Use YAML instead of JSON
...
Add PyYAML<=3.13 dependency.
2019-03-21 19:18:55 +00:00
Noah Levitt
c8f1c64494
travis-ci python 3.7
2019-03-21 12:15:39 -07:00
Vangelis Banos
6e6b43eb79
Add option to load logging conf from JSON file
...
New option `--logging-conf-file` to load `logging` conf from a JSON
file.
Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig` because JSON format is much better (nesting
is supported) and it's easier to detect errors.
2019-03-20 11:53:32 +00:00
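
As a rough illustration of the approach in the commit above, a JSON logging configuration can be handed to the stdlib `logging.config.dictConfig`. This is a sketch under assumptions: the configuration contents, the logger name `example.app`, and the function `configure_logging_from_json` are made up for the example, and how the project wires this to `--logging-conf-file` is not shown here.

```python
import json
import logging
import logging.config

# Hypothetical configuration; a real file would be read from disk.
EXAMPLE_CONF = """
{
  "version": 1,
  "disable_existing_loggers": false,
  "formatters": {
    "plain": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"}
  },
  "handlers": {
    "console": {
      "class": "logging.StreamHandler",
      "formatter": "plain",
      "level": "DEBUG"
    }
  },
  "loggers": {
    "example.app": {
      "level": "DEBUG",
      "handlers": ["console"],
      "propagate": false
    }
  }
}
"""


def configure_logging_from_json(text):
    """Parse a JSON document and apply it as the logging configuration."""
    conf = json.loads(text)  # malformed JSON fails here with a clear error
    logging.config.dictConfig(conf)
    return conf


if __name__ == "__main__":
    configure_logging_from_json(EXAMPLE_CONF)
    logging.getLogger("example.app").debug("logging configured from JSON")
```

The nested `formatters`, `handlers`, and `loggers` sections are the kind of structure the flat `configparser` format expresses less directly.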
Noah Levitt
c70bf2e2b9
debugging a shutdown issue
2019-02-27 12:36:35 -08:00
Noah Levitt
adca46427d
back to dev version number
2019-02-12 15:04:22 -08:00
Noah Levitt
5a7a4ff710
pypi release
2.4b6
2019-02-12 15:00:22 -08:00
Noah Levitt
2824ee6a5b
omfg too many warcs
2019-02-12 14:59:54 -08:00
Noah Levitt
dde2c3efda
Merge pull request #114 from vbanos/cdx-dedup-lru-cache
...
Use in-memory LRU cache in CDX Server dedup
2019-02-12 14:35:44 -08:00
Vangelis Banos
99fb998e1d
log LRU cache info every 1000 requests
...
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
Vangelis Banos
660989939e
Remove cli option cdxserver-dedup-lru-cache-size
...
LRU cache is always enabled for cdxserver dedup module with a default
cache size of 1024.
2019-02-12 20:43:27 +00:00
Vangelis Banos
1133715331
Enable cdx dedup lru cache by default
...
use default value 1024
2019-02-12 08:28:15 +00:00
Vangelis Banos
53f13d3536
Use in-memory LRU cache in CDX Server dedup
...
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using stdlib `lru_cache` method.
Cache memory info is available on `INFO` logging outputs like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
2019-02-07 09:08:11 +00:00
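
The caching idea in the commit above can be sketched with the stdlib `functools.lru_cache`. This is an illustrative sketch only: `query_cdx_server`, `cached_lookup`, and `lookup` are hypothetical names, the CDX query is replaced by a stub, and the every-1000-requests logging interval is taken from a later commit in this log.

```python
import logging
from functools import lru_cache

logger = logging.getLogger("dedup-sketch")


def query_cdx_server(digest_key, url):
    """Hypothetical stand-in for the real CDX server request."""
    return None


@lru_cache(maxsize=1024)
def cached_lookup(digest_key, url):
    """Memoize lookups keyed on (digest_key, url); arguments must be hashable."""
    return query_cdx_server(digest_key, url)


_request_count = 0


def lookup(digest_key, url):
    """Look up a dedup entry, logging cache statistics every 1000 requests."""
    global _request_count
    _request_count += 1
    if _request_count % 1000 == 0:
        logger.info("%s", cached_lookup.cache_info())
    return cached_lookup(digest_key, url)


if __name__ == "__main__":
    lookup("sha1:abc", "http://example.com/")
    lookup("sha1:abc", "http://example.com/")
    print(cached_lookup.cache_info())
```

Note that `lru_cache` also caches negative results (here `None`), which may or may not be desirable for dedup lookups.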
Noah Levitt
98f50ca296
Merge pull request #113 from vbanos/cdxserver-dedup-max-threads
...
Configurable max threads in CdxServerDedupLoader
2019-01-23 10:44:04 -08:00
Vangelis Banos
e04ffa5a36
Change default --cdxserver-dedup-max-threads from 400 to 50
2019-01-23 18:34:33 +00:00
Vangelis Banos
25281376f6
Configurable max threads in CdxServerDedupLoader
...
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.
We need to be able to tweak this setting because it creates too many CDX
queries which cause problems with our production CDX servers.
2019-01-23 11:07:46 +00:00
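
A bounded worker pool driven by a CLI option, as the commit above describes, might look roughly like this. It is a sketch under assumptions: only the option name `--cdxserver-dedup-max-threads` comes from the log, the default of 50 follows the later follow-up commit, and `build_parser`, `make_pool`, and `fake_cdx_query` are hypothetical.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def build_parser():
    """Argument parser exposing the thread-count option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cdxserver-dedup-max-threads",
        dest="cdxserver_dedup_max_threads",
        type=int,
        default=50,
        help="maximum number of concurrent CDX dedup queries",
    )
    return parser


def make_pool(args):
    """Create a thread pool whose size caps the number of in-flight queries."""
    return ThreadPoolExecutor(max_workers=args.cdxserver_dedup_max_threads)


def fake_cdx_query(url):
    """Hypothetical stand-in for a CDX server request."""
    return len(url)


if __name__ == "__main__":
    args = build_parser().parse_args(["--cdxserver-dedup-max-threads", "10"])
    with make_pool(args) as pool:
        print(list(pool.map(fake_cdx_query, ["http://a/", "http://bb/"])))
```

Capping `max_workers` limits how many queries can hit the CDX servers at once, which is the concern the commit message raises.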
Noah Levitt
cb72af015a
fix idle rollover
2019-01-21 10:37:09 -08:00
Noah Levitt
a780f1774c
back to dev version number
2019-01-17 17:15:33 -08:00
Noah Levitt
16e3302d36
Merge pull request #109 from nlevitt/warc-close-api
...
Warc close api
2019-01-17 17:15:13 -08:00
Noah Levitt
e07ee3630e
2.4b3 for pypi
2.4b3
2019-01-09 15:15:37 -08:00
Noah Levitt
3ea5c36e7f
add idna as dep with version acceptable to other deps
...
because
<kenji> my understanding is that pip cannot fully resolve version
constraints for indirect dependencies.
2019-01-09 15:10:37 -08:00