DedupableMixin.should_dedup() improvement

When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
There is no point spending time on dedup for a record that will not be
written to WARC anyway.
Vangelis Banos 2020-08-15 09:17:39 +00:00
parent de9219e646
commit 8078ee7af9

@@ -48,8 +48,12 @@ class DedupableMixin(object):
         size compared with min text/binary dedup size options.
         When we use option --dedup-only-with-bucket, `dedup-buckets` is required
         in Warcprox-Meta to perform dedup.
+        If recorded_url.do_not_archive is True, we skip dedup. This record will
+        not be written to WARC anyway.
         Return Boolean.
         """
+        if recorded_url.do_not_archive:
+            return False
         if self.dedup_only_with_bucket and "dedup-buckets" not in recorded_url.warcprox_meta:
             return False
         if recorded_url.is_text():
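
The order of checks in this hunk can be tried out in isolation with the sketch below. It is a simplified model, not the warcprox code: `FakeRecordedUrl`, `SketchDedupable`, the `size` attribute, and the two `min_*_size` parameters are hypothetical stand-ins, and the size comparison at the end is only an assumption based on the docstring ("size compared with min text/binary dedup size options").

```python
class FakeRecordedUrl:
    """Hypothetical stand-in for a recorded URL; for illustration only."""

    def __init__(self, do_not_archive=False, warcprox_meta=None,
                 text=True, size=0):
        self.do_not_archive = do_not_archive
        self.warcprox_meta = warcprox_meta or {}
        self._text = text
        self.size = size  # assumed payload size in bytes

    def is_text(self):
        return self._text


class SketchDedupable:
    """Simplified model of the check order; not the real DedupableMixin."""

    def __init__(self, dedup_only_with_bucket=False,
                 min_text_size=0, min_binary_size=0):
        self.dedup_only_with_bucket = dedup_only_with_bucket
        self.min_text_size = min_text_size
        self.min_binary_size = min_binary_size

    def should_dedup(self, recorded_url):
        # New early exit: a record that will not be archived is never deduped.
        if recorded_url.do_not_archive:
            return False
        # Existing check: bucket-only mode requires "dedup-buckets" metadata.
        if (self.dedup_only_with_bucket
                and "dedup-buckets" not in recorded_url.warcprox_meta):
            return False
        # Assumed size thresholds, one for text and one for binary payloads.
        if recorded_url.is_text():
            return recorded_url.size > self.min_text_size
        return recorded_url.size > self.min_binary_size
```

Because the `do_not_archive` test comes first, neither the bucket lookup nor the size comparison runs for records that would be dropped by the WARC writer anyway.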