90 Commits

Author SHA1 Message Date
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Vangelis Banos
8f20fc014e Skip cdx dedup for volatile URLs with session params
A lot of cdx dedup requests fail. Checking production logs, we see that
we try to dedup URLs that are certainly volatile and session-specific.
We can skip them to reduce cdx dedup load. We won't find any matches
anyway since they contain session-specific vars.

We suggest skipping cdx dedup for URLs that include `JSESSIONID=`,
`session=` or `sess=`. These are common session URL params; there could
be many more.

Example URLs:
```
/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf-7wp8lx0m-x4vb22ys

https://tw.popin.cc/popin_discovery/recommend?mode=new&url=https%3A%2F%2Fwww.nownews.com%2Fcat%2Fpolitics%2Fmilitary%2F&&device=pc&media=www.nownews.com&extra=other&agency=cnplus&topn=100&ad=100&r_category=all&country=tw&redirect=false&infinite=nownews&infinite_domain=m.nownews.com&piuid=43757d2474f09288b8410a9f2a40acf1&info=eyJ1c2VyX3RkX29zIjoib3RoZXIiLCJ1c2VyX3RkX29zX3ZlcnNpb24iOiIwLjAuMCIsInVzZXJfdGRfYnJvd3NlciI6IkNocm9tZSIsInVzZXJfdGRfYnJvd3Nlcl92ZXJzaW9uIjoiNzQuMC4zNzI5IiwidXNlcl90ZF9zY3JlZW4iOiIxNjAweDEwMDAiLCJ1c2VyX3RkX3ZpZXdwb3J0IjoiMTEwMHg3ODQiLCJ1c2VyX3RkX3VzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoWDExOyBMaW51eCB4ODZfNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIFVidW50dSBDaHJvbWl1bS83NC4wLjM3MjkuMTY5IENocm9tZS83NC4wLjM3MjkuMTY5IFNhZmFyaS81MzcuMzYiLCJ1c2VyX3RkX3JlZmVycmVyIjoiIiwidXNlcl90ZF9wYXRoIjoiL2NhdC9wb2xpdGljcy9taWxpdGFyeS8iLCJ1c2VyX3RkX2NoYXJzZXQiOiJ1dGYtOCIsInVzZXJfdGRfbGFuZ3VhZ2UiOiJlbi11cyIsInVzZXJfdGRfY29sb3IiOiIyNC1iaXQiLCJ1c2VyX3RkX3RpdGxlIjoiJUU4JUJCJThEJUU2JUFEJUE2JTIwJTdDJTIwTk9XbmV3cyUyMCVFNCVCQiU4QSVFNiU5NyVBNSVFNiU5NiVCMCVFOCU4MSU5RSIsInVzZXJfdGRfdXJsIjoiaHR0cHM6Ly93d3cubm93bmV3cy5jb20vY2F0L3BvbGl0aWNzL21pbGl0YXJ5LyIsInVzZXJfdGRfcGxhdGZvcm0iOiJMaW51eCB4ODZfNjQiLCJ1c2VyX3RkX2hvc3QiOiJ3d3cubm93bmV3cy5jb20iLCJ1c2VyX2RldmljZSI6InBjIiwidXNlcl90aW1lIjoxNTYyMDAxMzkyNzY2fQ==&session=13927861b5403&callback=_p6_8e102dd0c975

http://c.statcounter.com/text.php?sc_project=4092884&java=1&security=10fe3b6b&u1=915B47A927524F10185B2F074074BDCB&sc_random=0.017686960888044556&jg=310&rr=1.1.1.1.1.1.1.1.1&resolution=1600&h=1000&camefrom=&u=http%3A//buchlatech.blogspot.com/search/label/prototype&t=Buchla%20Tech%3A%20prototype&rcat=d&rdomo=d&rdomg=310&bb=0&sc_snum=1&sess=cfa820&p=0&text=2
```
2019-09-20 06:31:15 +00:00
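The session-parameter check described in commit 8f20fc014e can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `SESSION_MARKERS` and `is_volatile_url` are hypothetical, and restricting the substring match to the query string is an assumption.

```python
from urllib.parse import urlsplit

# Markers named in the commit message; hypothetical constant name.
SESSION_MARKERS = ("JSESSIONID=", "session=", "sess=")


def is_volatile_url(url):
    """Return True if the query string contains a known session marker.

    Assumption: only the query string is inspected, so a path segment
    such as /session/ does not trigger a skip on its own.
    """
    query = urlsplit(url).query
    return any(marker in query for marker in SESSION_MARKERS)


def should_query_cdx(url):
    """Skip the CDX dedup lookup for volatile, session-specific URLs."""
    return not is_volatile_url(url)
```

A plain substring match is cheap but imprecise (it would also match a parameter named `nosession=`); for load reduction that trade-off seems acceptable, since a false positive only means one skipped dedup lookup.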
Barbara Miller
957bd079e8 WIP (untested): handle multiple dedup-buckets, rw or ro 2019-05-30 19:27:46 -07:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Vangelis Banos
99fb998e1d log LRU cache info every 1000 requests
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
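One way to throttle the cache logging from commit 99fb998e1d is a simple request counter. The sketch below is an assumption about the shape of such code: the class `CachedLookup`, the logger name, and the stand-in `_fetch` method are all hypothetical; only the "every 1000 requests" interval and the cache size of 1024 come from the commit messages.

```python
import functools
import logging

logger = logging.getLogger("cdx_dedup_sketch")  # hypothetical logger name


class CachedLookup:
    LOG_EVERY = 1000  # interval taken from the commit message

    def __init__(self):
        self.request_count = 0
        # Wrap the bound method so each instance gets its own cache.
        self.cached_fetch = functools.lru_cache(maxsize=1024)(self._fetch)

    def _fetch(self, key):
        # Stand-in for the real CDX server query.
        return key.upper()

    def lookup(self, key):
        self.request_count += 1
        # Log cache statistics only on every LOG_EVERY-th request.
        if self.request_count % self.LOG_EVERY == 0:
            logger.info("%s", self.cached_fetch.cache_info())
        return self.cached_fetch(key)
```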
Vangelis Banos
660989939e Remove cli option cdxserver-dedup-lru-cache-size
LRU cache is always enabled for cdxserver dedup module with a default
cache size of 1024.
2019-02-12 20:43:27 +00:00
Vangelis Banos
53f13d3536 Use in-memory LRU cache in CDX Server dedup
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using stdlib `lru_cache` method.

Cache memory info is available on `INFO` logging outputs like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
2019-02-07 09:08:11 +00:00
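For readers unfamiliar with the stdlib decorator mentioned in commit 53f13d3536, here is a minimal sketch of how `functools.lru_cache` produces the `CacheInfo(...)` line shown above. The function `cdx_lookup` and its arguments are hypothetical stand-ins for the real CDX query.

```python
from functools import lru_cache

backend_calls = []  # records which lookups actually reached the "server"


@lru_cache(maxsize=1024)
def cdx_lookup(digest_key, url):
    """Hypothetical stand-in for a CDX dedup request."""
    backend_calls.append((digest_key, url))
    return None  # pretend the CDX server found no earlier capture


cdx_lookup("sha1:abc", "http://example.com/")
cdx_lookup("sha1:abc", "http://example.com/")  # served from the cache
info = cdx_lookup.cache_info()
```

Note that `lru_cache` also caches `None` results, so a negative lookup is not retried while it remains in the cache.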
Vangelis Banos
25281376f6 Configurable max threads in CdxServerDedupLoader
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.

We need to be able to tweak this setting because it creates too many CDX
queries which cause problems with our production CDX servers.
2019-01-23 11:07:46 +00:00
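The option from commit 25281376f6 could be wired into a thread pool along these lines. This is a sketch under assumptions: the `dest` name and the helper `make_dedup_pool` are hypothetical; the flag name and the default of 400 come from the commit message.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cdxserver-dedup-max-threads",
    dest="cdxserver_dedup_max_threads",  # hypothetical attribute name
    type=int,
    default=400,
    help="maximum number of concurrent CDX dedup queries",
)


def make_dedup_pool(args):
    """Build the executor that runs CDX dedup queries (hypothetical helper)."""
    return ThreadPoolExecutor(max_workers=args.cdxserver_dedup_max_threads)
```

Lowering the value caps how many CDX queries can be in flight at once, which is the knob the commit message asks for.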
Noah Levitt
fde443070c dumb mistake 2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904 hopefully fix a trough dedup concurrency bug 2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2 some logging improvements 2018-07-18 19:25:43 -05:00
Noah Levitt
ec7a0bf569 log exception and continue 🤞 if schema reg fails
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
997d4341fe add some debug logging in BatchTroughLoader 2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b just one should_dedup() for trough dedup
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
af863c6dba default values for dedup_min_text_size et al
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Vangelis Banos
abb54e42d1 Add hidden CLI option --dedup-only-with-bucket
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
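The behaviour of `--dedup-only-with-bucket` in commit abb54e42d1 can be illustrated with a small predicate. This is a sketch, not the actual `DedupableMixin.should_dedup`: the function name `bucket_allows_dedup` and its signature are assumptions; `warcprox_meta` stands for the parsed `Warcprox-Meta` header.

```python
def bucket_allows_dedup(warcprox_meta, dedup_only_with_bucket=False):
    """Decide whether a request may be deduplicated.

    warcprox_meta: dict parsed from the Warcprox-Meta header, or None.
    dedup_only_with_bucket: mirrors the hidden CLI option.
    """
    if not dedup_only_with_bucket:
        # Option off: the bucket key is not required.
        return True
    return bool(warcprox_meta) and "dedup-bucket" in warcprox_meta
```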
Vangelis Banos
432e42803c dedup-bucket is required in Warcprox-Meta to do dedup
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Vangelis Banos
6dce8cc644 Remove method decorate_with_dedup_info
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.

The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36 Use DedupableMixin in all dedup classes
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Vangelis Banos
d32bf743bd Configurable min dedupable size for text/binary resources
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.

New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`

New utility method `RecordedUrl.is_text`.
2018-04-09 15:52:44 +00:00
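A minimal sketch of the size check introduced in commit d32bf743bd, assuming a recorded URL object that exposes `payload_size()` and `is_text()`. The classes `FakeRecordedUrl` and `DedupableMixinSketch` are hypothetical; only the option names, the default of `0`, and the method name `is_dedupable` come from the commit message.

```python
class FakeRecordedUrl:
    """Hypothetical stand-in for warcprox's RecordedUrl."""

    def __init__(self, size, text):
        self._size = size
        self._text = text

    def payload_size(self):
        return self._size

    def is_text(self):
        return self._text


class DedupableMixinSketch:
    def __init__(self, min_text_size=0, min_binary_size=0):
        # Defaults of 0 reproduce the old `payload_size() > 0` check.
        self.min_text_size = min_text_size
        self.min_binary_size = min_binary_size

    def is_dedupable(self, recorded_url):
        """Dedup only payloads larger than the limit for their type."""
        if recorded_url.is_text():
            return recorded_url.payload_size() > self.min_text_size
        return recorded_url.payload_size() > self.min_binary_size
```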
Vangelis Banos
cce0c705fb Fix Accept-Encoding request header 2018-04-06 19:55:19 +00:00
Vangelis Banos
7c5c5da9b7 CDX dedup improvements
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.

Increase CDX dedup max workers from 200 to 400 in order to handle more
load.

Set `user-agent: warcprox` for HTTP requests we send to CDX server. It's
useful to identify and monitor `warcprox` requests.

Pass HTTP headers to connection pool on init and not on each request.
2018-04-06 19:55:19 +00:00
Vangelis Banos
0d8fe4a38f Disable retries and set timeout=2.0 for CDX Dedup server
It's better to skip CDX server dedup than slow down when it's
unresponsive.

Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Vangelis Banos
5631eaced1 Parallelize CDX Server dedup queries 2018-01-23 23:16:35 +00:00
Noah Levitt
7fb78ef1df parallelize trough dedup queries
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
2018-01-19 16:33:15 -08:00
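The per-bucket parallelism described in commit 7fb78ef1df might look like the sketch below: group the batch by dedup bucket, then issue one query per bucket through a thread pool. The names `batch_lookup` and `query_bucket`, the `(bucket, url)` batch shape, and the `max_workers` value are assumptions for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def query_bucket(bucket, urls):
    """Hypothetical stand-in for one HTTP request to trough per bucket."""
    return {url: None for url in urls}


def batch_lookup(batch, query=query_bucket, max_workers=8):
    """Run one query per dedup bucket, all buckets in parallel.

    batch: iterable of (bucket, url) pairs.
    Returns a dict mapping each bucket to its query result.
    """
    by_bucket = defaultdict(list)
    for bucket, url in batch:
        by_bucket[bucket].append(url)
    if not by_bucket:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            bucket: pool.submit(query, bucket, urls)
            for bucket, urls in by_bucket.items()
        }
        # result() blocks until that bucket's query has finished.
        return {bucket: fut.result() for bucket, fut in futures.items()}
```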
Noah Levitt
57abab100c handle case where warc record id is missing
... from trough dedup. Not sure why this error happened but we shouldn't
need that field anyway.
2018-01-19 14:38:54 -08:00
Vangelis Banos
1c50235305 Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it's used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
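As a rough illustration of commit 1c50235305, the cookie string can be passed through unchanged into the request headers. The `dest` name and the helper `cdx_headers` are hypothetical; the flag name and the cookie string format come from the commit message.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cdxserver-dedup-cookies",
    dest="cdxserver_dedup_cookies",  # hypothetical attribute name
    default=None,
    help='cookies for the CDX server, e.g. "cookie1=val1;cookie2=val2"',
)


def cdx_headers(args):
    """Build headers for CDX requests; add Cookie only when configured."""
    headers = {}
    if args.cdxserver_dedup_cookies:
        headers["Cookie"] = args.cdxserver_dedup_cookies
    return headers
```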
Noah Levitt
c933cb3119 batch storing for trough dedup 2018-01-17 16:49:28 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
5354648512 Merge branch 'master' into wip-postfetch-chain
* master:
  fix running_stats thing
  Update CdxServerDedup unit test
  Check writer._fname in unit test
  Configurable CdxServerDedup urllib3 connection pool size
  Yet another unit test fix
  Change the writer unit test
  fix github problem with unit test
  Another fix for the unit test
  Fix writer unit test
  Add WarcWriter warc_filename unit test
  Fix warc_filename default value
  Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
d4bbaf10b7 batch trough dedup loader 2018-01-16 11:37:56 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Vangelis Banos
e59fed2b6f Configurable CdxServerDedup urllib3 connection pool size
urllib3 pool has default ``maxsize=1``
http://urllib3.readthedocs.io/en/latest/advanced-usage.html.
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```

We set value: ```cdxserver_maxsize = args.writer_threads or 200```.

Note that the ideal would be to use this
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup, there is a dependency and we cannot
use it.
2018-01-15 17:43:34 +00:00
Noah Levitt
c5f33bda7a trough dedup - handle case of no warc records written 2017-11-30 12:55:39 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
d7aea40b05 move trough client into separate module 2017-11-13 12:52:45 -08:00
Noah Levitt
895683e062 more cleanly separate trough client code from the rest of TroughDedup 2017-11-13 12:45:49 -08:00
Noah Levitt
43c36cae10 update payload_digest reference in trough dedup for changes in commit 3a0f6e0947 2017-11-13 12:27:31 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
3a0f6e0947 fix payload digest by pulling calculation up one level where content has already been transfer-decoded 2017-11-10 17:18:22 -08:00
Noah Levitt
cdd747f48e eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks 2017-11-10 13:37:09 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
700056cc04 fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly 2017-11-09 13:10:57 -08:00
Noah Levitt
3dbfc06e68 on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point) 2017-11-03 14:16:09 -07:00
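The cache-and-invalidate pattern from commits 147b097a53 and 3dbfc06e68 can be sketched as follows. This is not the trough client code: the class `UrlCache`, the helper `request_with_invalidation`, and the `resolve` callback are hypothetical names used to show the idea of dropping a cached URL after a failed request so the next call resolves a fresh one.

```python
class UrlCache:
    """Cache read or write URLs per segment (hypothetical helper)."""

    def __init__(self, resolve):
        self._resolve = resolve  # callable: segment_id -> url
        self._cache = {}

    def get(self, segment_id):
        if segment_id not in self._cache:
            self._cache[segment_id] = self._resolve(segment_id)
        return self._cache[segment_id]

    def invalidate(self, segment_id):
        self._cache.pop(segment_id, None)


def request_with_invalidation(cache, segment_id, do_request):
    """Use the cached URL; on any error, forget it and re-raise."""
    url = cache.get(segment_id)
    try:
        return do_request(url)
    except Exception:
        # The next request will resolve a fresh, hopefully working, URL.
        cache.invalidate(segment_id)
        raise
```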
Noah Levitt
147b097a53 cache trough read and write urls 2017-11-03 13:48:00 -07:00
Noah Levitt
ab99fe52b9 update trough dedup to use new segment manager api to register schema sql 2017-11-03 12:39:26 -07:00