111 Commits

Author SHA1 Message Date
Vangelis Banos
56e0b17dc9 New option --subdir-prefix
Save WARCs in subdirectories equal to the current value of Warcprox-Meta['warc-prefix'].
E.g. if warc-prefix=='spn2' and --dir=/warcs, save them in /warcs/spn2/.
2024-06-03 21:21:19 +00:00
Vangelis Banos
9fd5a22502 fix typo 2023-10-17 06:12:28 +00:00
Vangelis Banos
3d653e023c Add SOCKS proxy options
Add options `--socks-proxy`, `--socks-proxy-username,
`--socks-proxy-password`.

If enabled, all traffic is routed throught the SOCKS proxy.
2023-10-16 18:33:42 +00:00
Vangelis Banos
ca0197330d Add port to custom WARC filename vars 2020-01-08 21:19:48 +00:00
Noah Levitt
469b41773a fix logging config which trough interfered with 2020-01-07 15:19:03 -08:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Vangelis Banos
436a27b19e Upgrade PyYAML to >=5.1 2019-03-21 19:34:52 +00:00
Vangelis Banos
878ab0977f Use YAML instead of JSON
Add PyYAML<=3.13 dependency.
2019-03-21 19:18:55 +00:00
Vangelis Banos
6e6b43eb79 Add option to load logging conf from JSON file
New option `--logging-conf-file` to load `logging` conf from a JSON
file.

Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig` because JSON format is much better (nesting
is supported) and its easier to detect errors.
2019-03-20 11:53:32 +00:00
Noah Levitt
c70bf2e2b9 debugging a shutdown issue 2019-02-27 12:36:35 -08:00
Vangelis Banos
e04ffa5a36 Change default --cdxserver-dedup-max-threads from 400 to 50 2019-01-23 18:34:33 +00:00
Vangelis Banos
25281376f6 Configurable max threads in CdxServerDedupLoader
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.

We need to be able to tweak this setting because it creates too many CDX
queries which cause problems with our production CDX servers.
2019-01-23 11:07:46 +00:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API to allow some code from outside of warcprox proper (in a
third-party plugin for example) can close open warcs promptly when it
knows they are finished.
2019-01-08 11:27:11 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
e4befeec14 --help-hidden for help on hidden args 2018-08-20 11:08:32 -07:00
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set and if the record is a duplicate (revisit record), check the
datetime of `dedup_info` and its inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
Noah Levitt
8022257a57 setuptools likes README.rst not readme.rst 2018-07-17 16:35:05 +00:00
Noah Levitt
b7ebc38491 rename README.rst -> readme.rst 2018-05-21 22:18:28 +00:00
Vangelis Banos
abb54e42d1 Add hidden CLI option --dedup-only-with-bucket
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
Vangelis Banos
d32bf743bd Configurable min dedupable size for text/binary resources
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.

New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`

New utility method `RecordedUrl.is_text`.
2018-04-09 15:52:44 +00:00
Vangelis Banos
eda0656737 Configurable tmp file max memory size
We use `tempfile.SpooledTemporaryFile(max_size=512*1024)` to keep
recorded data before writing them to WARC.
Data are kept in memory when they are smaller than `max_size`, else they
are written to disk.

We add option `--tmp-file-max-memory-size` to make this configurable.
A higher value means less /tmp disk I/O and higher overall performance but
also increased memory usage.
2018-03-07 08:00:18 +00:00
Vangelis Banos
e8d0fd0f3c Add Warc-Truncated: lenght header
Add ``ProxyingRecordingHTTPResponse.truncated`` and
``RecordedUrl.truncated`` attributes.
Set ``truncated = b'length'` when max resource size limit applies.
Add `Warc-Truncated: length` header to WARC record.
2018-02-10 21:30:56 +00:00
Vangelis Banos
d6b5c8bb39 Add option to limit max resource size
Add hidden option ``--max-resource-size`` which indicates the max file size of
a target resource in bytes. If the size is over the limit, an exception is
raised.
2018-02-09 14:48:11 +00:00
Vangelis Banos
9474a7ae7f Rename remote-server-timeout to socket-timeout
Also apply it to both remote target and local proxy client connections.
2018-02-07 15:48:42 -08:00
Vangelis Banos
428a03689f Make remote server connection timeout configurable
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-02-07 15:48:42 -08:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Vangelis Banos
1c50235305 Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, its used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
Noah Levitt
87cdd855d4 fix import to fix plugins 2018-01-18 11:28:23 -08:00
Noah Levitt
a974ec86fa fixes to make tests pass 2018-01-17 15:33:41 -08:00
Noah Levitt
5354648512 Merge branch 'master' into wip-postfetch-chain
* master:
  fix running_stats thing
  Update CdxServerDedup unit test
  Chec writer._fname in unit test
  Configurable CdxServerDedup urllib3 connection pool size
  Yet another unit test fix
  Change the writer unit test
  fix github problem with unit test
  Another fix for the unit test
  Fix writer unit test
  Add WarcWriter warc_filename unit test
  Fix warc_filename default value
  Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
b43ab751f0 fix running_stats thing 2018-01-15 17:28:20 -08:00
Noah Levitt
6ab73764ea make run-benchmarks.py work (with no args) 2018-01-15 17:15:36 -08:00
Noah Levitt
d7208d89c6
Merge pull request #50 from vbanos/cdxserverdedup-maxsize
Configurable CdxServerDedup urllib3 connection pool size
2018-01-15 16:46:37 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e very incomplete work on new postfetch processor chain 2018-01-15 14:45:02 -08:00
Vangelis Banos
e59fed2b6f Configurable CdxServerDedup urllib3 connection pool size
urllib3 pool has default ``maxsize=1``
http://urllib3.readthedocs.io/en/latest/advanced-usage.html.
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```

We set value: ```cdxserver_maxsize = args.writer_threads or 200```.

Note that the ideal would be to use this
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup, there is a dependency and we cannot
use it.
2018-01-15 17:43:34 +00:00
Vangelis Banos
ae23011d84 Configurable WARC filenames
New ``--warc-filename`` CLI parameter with default value:
``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous
hard-coded WARC filename format).

Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno},
{randomtoken}, {hostname}, {shorthostname}`` to define custom WARC
filenames.
2018-01-08 12:13:05 +00:00
Noah Levitt
c966f7f6e8 more stats available from /status (and in rethindkb services table) 2017-12-28 17:07:02 -08:00
Noah Levitt
500ffad7e4 implementation of special prefix "-" which means "do not archive" 2017-12-21 14:33:30 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
61a7c234e8 fix warcprox-ensure-rethinkdb-tables and add tests 2017-11-28 10:38:38 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
78c6137016 fix crawl log handling of WARCPROX_WRITE_RECORD request 2017-11-09 12:35:10 -08:00
Noah Levitt
5c18054d37 Merge branch 'master' into crawl-log
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire and exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
  greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
  avoid TypeError: 'NoneType' object is not iterable exception at shutdown
  wait for rethinkdb indexes to be ready
  Remove deleted ``close`` method call from test.
  bump dev version number after merging pull requests
  Add missing "," in deps
  Remove tox.ini, move warcio to test_requires
  allow very long request header lines, to support large warcprox-meta header values
  Remove redundant stop() & sync() dedup methods
  Remove redundant close method from DedupDb and RethinkDedupDb
  Remove unused imports
  Add missing packages from setup.py, add tox config.
  fix python2 tests
  don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
  fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-09 11:13:02 -08:00
Noah Levitt
ed49eea4d5 Merge branch 'master' into trough-dedup
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire and exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-11-02 16:34:52 -07:00
Vangelis Banos
c9f1feb3db Add hidden --no-warc-open-suffix CLI option
By default warcprox adds `.open` suffix in open WARC files. Using this
option we disable that. The option does not appear on the program help.
2017-10-26 19:44:22 +00:00