Barbara Miller
b91a7d1d89
more updates qa prototyping
2023-06-28 17:34:26 -07:00
Barbara Miller
ef75164f8b
fixes for qa prototyping
2023-06-27 17:19:40 -07:00
Barbara Miller
d9145eefb5
LimitRecords, more LimitRevisitsPGMixin
2023-06-26 22:49:33 -07:00
Barbara Miller
08f2903f14
LimitRevisitsPGMixin
2023-06-22 19:29:53 -07:00
Barbara Miller
5075920415
limit revisits mixin
2023-06-21 17:25:41 -07:00
Barbara Miller
9e8ea5bb45
fix logging buglet iii
2021-12-29 12:06:18 -08:00
Barbara Miller
bc3d1e6d00
fix logging buglet ii
2021-12-29 11:55:39 -08:00
Barbara Miller
5d8fbf7038
fix logging buglet
2021-12-29 10:25:04 -08:00
Barbara Miller
d7aec77597
faster, likely
2021-12-16 18:36:00 -08:00
Barbara Miller
bcaf293081
better logging
2021-12-09 12:19:45 -08:00
Barbara Miller
7d4c8dcb4e
recorded_url.do_not_archive = True
2021-12-08 11:04:09 -08:00
Barbara Miller
da089e0a92
bytes not str
2021-12-06 20:33:16 -08:00
Barbara Miller
3eeccd0016
more hash_plus_url
2021-12-06 19:43:27 -08:00
Barbara Miller
5e5a74f204
str, not object
2021-12-06 19:33:10 -08:00
Barbara Miller
b67f1ad0f3
add logging
2021-12-06 17:29:27 -08:00
Barbara Miller
e744075913
python 3.5 version, mostly
2021-12-02 11:46:39 -08:00
Barbara Miller
1476bfec8c
discard batch hash+url match
2021-12-02 11:17:59 -08:00
Adam Miller
36784de174
Merge branch 'master' into adds-logging-for-failed-connections
2020-09-23 19:18:41 +00:00
Vangelis Banos
8078ee7af9
DedupableMixin.should_dedup() improvement
...
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
2020-08-15 09:17:39 +00:00
Noah Levitt
90fba01514
make trough dependency optional
2020-01-08 13:37:01 -08:00
Noah Levitt
469b41773a
fix logging config which trough interfered with
2020-01-07 15:19:03 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
...
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00
Adam Miller
4ceebe1fa9
Moving more variables from RecordedUrl to RequiredUrl
2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247
Refactor failed requests into new class.
2020-01-03 20:43:47 +00:00
Noah Levitt
ac959c6db5
change trough dedup `date` type to varchar
...
This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html . `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
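
A minimal sketch of the sqlite behavior described above, using Python's stdlib `sqlite3`. The table and column layout here is hypothetical and is not the actual trough dedup schema; it only shows that a timestamp string written to a column declared `datetime` comes back as unmodified text.

```python
import sqlite3

# In-memory database; the schema is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table dedup_demo (digest_key varchar(100) primary key, date datetime)"
)

# Store the timestamp exactly as a string, in the format quoted above.
conn.execute(
    "insert into dedup_demo values (?, ?)",
    ("sha1:abc", "2019-11-19T01:23:45Z"),
)

# typeof() reports the storage class sqlite actually used for the value.
stored_value, storage_class = conn.execute(
    "select date, typeof(date) from dedup_demo"
).fetchone()
conn.close()
```

Because the value is not a well-formed number, sqlite keeps it as text regardless of the declared column type, so declaring the column `varchar` documents what is already happening.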
2019-11-19 13:33:59 -08:00
Noah Levitt
fe19bb268f
use trough.client instead of warcprox.trough
...
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Vangelis Banos
8f20fc014e
Skip cdx dedup for volatile URLs with session params
...
A lot of cdx dedup requests fail. Checking production logs, we see that
we try to dedup URLs that are certainly volatile and session-specific.
We can skip them to reduce cdx dedup load. We won't find any matches
anyway since they contain session-specific vars.
We suggest skipping cdx dedup for URLs that include `JSESSIONID=`,
`session=` or `sess=`. These are common session URL params; there could
be many more.
Example URLs:
```
/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf-7wp8lx0m-x4vb22ys
https://tw.popin.cc/popin_discovery/recommend?mode=new&url=https%3A%2F%2Fwww.nownews.com%2Fcat%2Fpolitics%2Fmilitary%2F&&device=pc&media=www.nownews.com&extra=other&agency=cnplus&topn=100&ad=100&r_category=all&country=tw&redirect=false&infinite=nownews&infinite_domain=m.nownews.com&piuid=43757d2474f09288b8410a9f2a40acf1&info=eyJ1c2VyX3RkX29zIjoib3RoZXIiLCJ1c2VyX3RkX29zX3ZlcnNpb24iOiIwLjAuMCIsInVzZXJfdGRfYnJvd3NlciI6IkNocm9tZSIsInVzZXJfdGRfYnJvd3Nlcl92ZXJzaW9uIjoiNzQuMC4zNzI5IiwidXNlcl90ZF9zY3JlZW4iOiIxNjAweDEwMDAiLCJ1c2VyX3RkX3ZpZXdwb3J0IjoiMTEwMHg3ODQiLCJ1c2VyX3RkX3VzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoWDExOyBMaW51eCB4ODZfNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIFVidW50dSBDaHJvbWl1bS83NC4wLjM3MjkuMTY5IENocm9tZS83NC4wLjM3MjkuMTY5IFNhZmFyaS81MzcuMzYiLCJ1c2VyX3RkX3JlZmVycmVyIjoiIiwidXNlcl90ZF9wYXRoIjoiL2NhdC9wb2xpdGljcy9taWxpdGFyeS8iLCJ1c2VyX3RkX2NoYXJzZXQiOiJ1dGYtOCIsInVzZXJfdGRfbGFuZ3VhZ2UiOiJlbi11cyIsInVzZXJfdGRfY29sb3IiOiIyNC1iaXQiLCJ1c2VyX3RkX3RpdGxlIjoiJUU4JUJCJThEJUU2JUFEJUE2JTIwJTdDJTIwTk9XbmV3cyUyMCVFNCVCQiU4QSVFNiU5NyVBNSVFNiU5NiVCMCVFOCU4MSU5RSIsInVzZXJfdGRfdXJsIjoiaHR0cHM6Ly93d3cubm93bmV3cy5jb20vY2F0L3BvbGl0aWNzL21pbGl0YXJ5LyIsInVzZXJfdGRfcGxhdGZvcm0iOiJMaW51eCB4ODZfNjQiLCJ1c2VyX3RkX2hvc3QiOiJ3d3cubm93bmV3cy5jb20iLCJ1c2VyX2RldmljZSI6InBjIiwidXNlcl90aW1lIjoxNTYyMDAxMzkyNzY2fQ==&session=13927861b5403&callback=_p6_8e102dd0c975
http://c.statcounter.com/text.php?sc_project=4092884&java=1&security=10fe3b6b&u1=915B47A927524F10185B2F074074BDCB&sc_random=0.017686960888044556&jg=310&rr=1.1.1.1.1.1.1.1.1&resolution=1600&h=1000&camefrom=&u=http%3A//buchlatech.blogspot.com/search/label/prototype&t=Buchla%20Tech%3A%20prototype&rcat=d&rdomo=d&rdomg=310&bb=0&sc_snum=1&sess=cfa820&p=0&text=2
```
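
The check described above could be sketched as a plain substring test. The names `SESSION_MARKERS` and `has_session_param` are hypothetical and are not taken from the warcprox source; the marker list is just the three params named in this message.

```python
# Session-style URL params named in the commit message (assumed list).
SESSION_MARKERS = ("JSESSIONID=", "session=", "sess=")


def has_session_param(url):
    """Return True if the URL contains any known session marker."""
    return any(marker in url for marker in SESSION_MARKERS)


# A caller would skip the cdx dedup lookup when this returns True.
volatile = has_session_param(
    "/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf"
)
stable = has_session_param("http://example.com/page?id=1")
```

The match is case-sensitive and purely textual, so it can miss other session params and does not parse the query string.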
2019-09-20 06:31:15 +00:00
Barbara Miller
957bd079e8
WIP (untested): handle multiple dedup-buckets, rw or ro
2019-05-30 19:27:46 -07:00
Noah Levitt
a25971e06b
appease some warnings
2019-03-21 14:17:24 -07:00
Vangelis Banos
99fb998e1d
log LRU cache info every 1000 requests
...
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
Vangelis Banos
660989939e
Remove cli option cdxserver-dedup-lru-cache-size
...
LRU cache is always enabled for cdxserver dedup module with a default
cache size of 1024.
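
A minimal sketch of the caching pattern, assuming a hypothetical `cdx_lookup` function standing in for the real CDX server query. Only `functools.lru_cache` and its `cache_info()` method are real stdlib APIs here.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def cdx_lookup(digest_key, url):
    """Placeholder for a CDX server request; always reports no match."""
    return None


# The second identical call is answered from the cache.
cdx_lookup("sha1:abc", "http://example.com/")
cdx_lookup("sha1:abc", "http://example.com/")

# CacheInfo(hits=..., misses=..., maxsize=..., currsize=...)
info = cdx_lookup.cache_info()
```

`lru_cache` requires hashable arguments, which is why the sketch passes the digest and URL as plain strings.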
2019-02-12 20:43:27 +00:00
Vangelis Banos
53f13d3536
Use in-memory LRU cache in CDX Server dedup
...
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using stdlib `lru_cache` method.
Cache memory info is available on `INFO` logging outputs like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
2019-02-07 09:08:11 +00:00
Vangelis Banos
25281376f6
Configurable max threads in CdxServerDedupLoader
...
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.
We need to be able to tweak this setting because it creates too many CDX
queries which cause problems with our production CDX servers.
2019-01-23 11:07:46 +00:00
Noah Levitt
fde443070c
dumb mistake
2018-07-18 20:10:30 -05:00
Noah Levitt
d3314d7904
hopefully fix a trough dedup concurrency bug
2018-07-18 19:26:16 -05:00
Noah Levitt
b7e12a3ec2
some logging improvements
2018-07-18 19:25:43 -05:00
Noah Levitt
ec7a0bf569
log exception and continue 🤞 if schema reg fails
...
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
997d4341fe
add some debug logging in BatchTroughLoader
2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b
just one should_dedup() for trough dedup
...
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
af863c6dba
default values for dedup_min_text_size et al
...
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Vangelis Banos
abb54e42d1
Add hidden CLI option --dedup-only-with-bucket
...
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
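
As a rough illustration of that rule, assuming `Warcprox-Meta` has already been parsed into a dict. The function name `bucket_allows_dedup` and its parameters are hypothetical, not the warcprox implementation.

```python
def bucket_allows_dedup(warcprox_meta, dedup_only_with_bucket):
    """Decide whether the bucket rule permits dedup for a request.

    warcprox_meta: parsed Warcprox-Meta dict, or None if the header is absent.
    dedup_only_with_bucket: True when the CLI option is set.
    """
    if not dedup_only_with_bucket:
        # Option not set: the bucket rule places no restriction.
        return True
    return bool(warcprox_meta) and "dedup-bucket" in warcprox_meta


with_bucket = bucket_allows_dedup({"dedup-bucket": "my-bucket"}, True)
without_bucket = bucket_allows_dedup({}, True)
option_off = bucket_allows_dedup(None, False)
```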
2018-05-04 20:50:54 +00:00
Vangelis Banos
432e42803c
dedup-bucket is required in Warcprox-Meta to do dedup
...
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5
Rename captures-bucket to dedup-bucket in Warcprox-Meta
2018-05-04 13:26:38 +00:00
Vangelis Banos
6dce8cc644
Remove method decorate_with_dedup_info
...
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36
Use DedupableMixin in all dedup classes
...
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Vangelis Banos
d32bf743bd
Configurable min dedupable size for text/binary resources
...
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.
New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`
New utility method `RecordedUrl.is_text`.
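
The size check could look roughly like the following. This is a sketch under stated assumptions: the standalone function and its parameter names are illustrative and do not reproduce the `DedupableMixin` code.

```python
def is_dedupable(payload_size, is_text, min_text_size=0, min_binary_size=0):
    """Return True if the payload is larger than the limit for its type.

    With both limits at their default of 0 this reduces to the earlier
    check, payload_size > 0.
    """
    threshold = min_text_size if is_text else min_binary_size
    return payload_size > threshold


# A 500-byte text payload is skipped when the text limit is 1000 bytes,
# while a binary payload of the same size still qualifies.
small_text = is_dedupable(500, is_text=True, min_text_size=1000)
small_binary = is_dedupable(500, is_text=False, min_text_size=1000)
```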
2018-04-09 15:52:44 +00:00
Vangelis Banos
cce0c705fb
Fix Accept-Encoding request header
2018-04-06 19:55:19 +00:00
Vangelis Banos
7c5c5da9b7
CDX dedup improvements
...
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.
Increase CDX dedup max workers from 200 to 400 in order to handle more
load.
Set `user-agent: warcprox` for HTTP requests we send to CDX server. It's
useful to identify and monitor `warcprox` requests.
Pass HTTP headers to connection pool on init and not on each request.
2018-04-06 19:55:19 +00:00
Vangelis Banos
0d8fe4a38f
Disable retries and set timeout=2.0 for CDX Dedup server
...
It's better to skip CDX server dedup than slow down when it's
unresponsive.
Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
Noah Levitt
824c194142
make plugin api more flexible
2018-01-24 16:07:45 -08:00