945 Commits

Author SHA1 Message Date
Barbara Miller
bcaf293081 better logging 2021-12-09 12:19:45 -08:00
Barbara Miller
7d4c8dcb4e recorded_url.do_not_archive = True 2021-12-08 11:04:09 -08:00
Barbara Miller
da089e0a92 bytes not str 2021-12-06 20:33:16 -08:00
Barbara Miller
3eeccd0016 more hash_plus_url 2021-12-06 19:43:27 -08:00
Barbara Miller
5e5a74f204 str, not object 2021-12-06 19:33:10 -08:00
Barbara Miller
b67f1ad0f3 add logging 2021-12-06 17:29:27 -08:00
Barbara Miller
e6a1a7dd7e increase trough dedup batch window 2021-12-06 17:29:02 -08:00
Barbara Miller
e744075913 python 3.5 version, mostly 2021-12-02 11:46:39 -08:00
Barbara Miller
1476bfec8c discard batch hash+url match 2021-12-02 11:17:59 -08:00
Barbara Miller
0e23a31a31
Merge pull request #161 from internetarchive/fixes-malformed-crawl-log-lines
Checking for content type header consisting of only empty spaces and r…
2021-04-21 15:31:17 -07:00
Adam Miller
7f406b7942 Trying to fix tests that only fail during ci 2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa Add test cases for space in content type header and exception messages 2021-03-31 23:22:04 +00:00
Adam Miller
e0732ffaf4 Checking for content type header consisting of only empty spaces and removing spaces from exception messages in json section 2021-03-29 22:22:19 +00:00
Adam Miller
b8057825d8
Merge pull request #158 from galgeek/failed_url.timestamp
set failed_url.timestamp
2020-09-30 14:49:17 -07:00
Barbara Miller
e2e2c02802 set failed_url.timestamp 2020-09-30 11:47:17 -07:00
jkafader
f19ead0058
Merge pull request #145 from internetarchive/adds-logging-for-failed-connections
Adds logging for failed connections
2020-09-23 12:22:12 -07:00
Adam Miller
36784de174 Merge branch 'master' into adds-logging-for-failed-connections 2020-09-23 19:18:41 +00:00
Barbara Miller
ce1f32dc41
Merge pull request #154 from internetarchive/galgeek-version-update
bump version
2020-08-18 09:30:28 -07:00
Barbara Miller
ae11daedc1
bump version 2020-08-18 09:29:57 -07:00
Barbara Miller
456698fe06
Merge pull request #153 from vbanos/should-dedup-impr
Thanks, @vbanos!
2020-08-17 14:04:49 -07:00
Barbara Miller
d90367f21f
Merge pull request #152 from cclauss/patch-1
Thank you, @cclauss!
2020-08-15 08:49:59 -07:00
Vangelis Banos
8078ee7af9 DedupableMixin.should_dedup() improvement
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
2020-08-15 09:17:39 +00:00
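The early exit described in the commit message above can be sketched as follows. This is a minimal illustration, not the actual warcprox code: the `should_dedup` function here, its `min_size` parameter, and the `payload_size` attribute are hypothetical stand-ins. Only the `do_not_archive` flag comes from the commit message.

```python
from types import SimpleNamespace


def should_dedup(recorded_url, min_size=0):
    """Hypothetical stand-in for DedupableMixin.should_dedup()."""
    # A URL flagged do_not_archive is never written to WARC,
    # so a dedup lookup for it would be wasted work.
    if getattr(recorded_url, "do_not_archive", False):
        return False
    # Assumed size rule for illustration: only payloads larger than
    # min_size are considered for deduplication.
    return recorded_url.payload_size > min_size


# Two illustrative records that differ only in the do_not_archive flag.
archived = SimpleNamespace(do_not_archive=False, payload_size=2048)
skipped = SimpleNamespace(do_not_archive=True, payload_size=2048)
```

Putting the flag check first means the cheaper test short-circuits before any size rule or database lookup is reached.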
Christian Clauss
c649355285
setup.py: Add Python 3.8 2020-08-06 17:58:00 +02:00
Christian Clauss
21351094ec
Travis CI: Add Python 3.8 to testing 2020-08-06 17:27:15 +02:00
Adam Miller
edeae3b21a Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors. 2020-07-22 21:36:39 +00:00
Noah Levitt
b34419543f Oops! 2020-05-06 14:52:32 -07:00
Noah Levitt
5e397e9bca Elide unnecessary params 2020-05-06 14:28:00 -07:00
Noah Levitt
d0b21f5dc4 Undo accidentally committed code 2020-05-06 14:27:34 -07:00
Noah Levitt
36711c0148 try to fix .travis.yml 2020-05-06 14:19:19 -07:00
Noah Levitt
a5e9c27223 Share code, handle exception during CONNECT 2020-05-06 09:54:17 -07:00
Noah Levitt
de9219e646 require more recent urllib3
to avoid this error: https://github.com/internetarchive/warcprox/issues/148

2020-01-28 14:42:44,851 2023 ERROR MitmProxyHandler(tid=2037,started=2020-01-28T20:42:44.834551,client=127.0.0.1:49100) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:442) problem processing request 'GET / HTTP/1.1': TypeError("connection_from_host() got an unexpected keyword argument 'pool_kwargs'",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 413, in do_COMMAND
    self._connect_to_remote_server()
  File "/usr/local/lib/python3.5/dist-packages/warcprox/warcproxy.py", line 189, in _connect_to_remote_server
    return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 277, in _connect_to_remote_server
    pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
2020-02-06 10:10:53 -08:00
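The TypeError in the traceback above arises when a caller passes a keyword argument that an older version of a function does not accept. One way to detect this before calling is to inspect the signature. The sketch below uses two hypothetical stand-in functions rather than urllib3 itself, and `accepts_keyword` is an illustrative helper, not part of warcprox or urllib3.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


# Hypothetical stand-ins for an older and a newer library signature.
def old_connection_from_host(host, port=None, scheme="http"):
    return (host, port, scheme)


def new_connection_from_host(host, port=None, scheme="http", pool_kwargs=None):
    return (host, port, scheme, pool_kwargs)
```

The commit itself takes the simpler route of requiring a more recent urllib3, which avoids any runtime check.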
Noah Levitt
5c15582be5
Merge pull request #147 from nlevitt/fix-travis-jan2020
tests need trough
2020-01-08 14:29:16 -08:00
Noah Levitt
47731c61c1 tests need trough 2020-01-08 14:05:04 -08:00
Noah Levitt
90fba01514 make trough dependency optional 2020-01-08 13:37:01 -08:00
Noah Levitt
a8cd53bfe4 bump version, trough dep version 2020-01-08 13:24:00 -08:00
Noah Levitt
ee6bc151e1
Merge pull request #146 from vbanos/warc-filename-port
Add port to custom WARC filename vars
2020-01-08 13:22:50 -08:00
Vangelis Banos
ca0197330d Add port to custom WARC filename vars 2020-01-08 21:19:48 +00:00
Noah Levitt
469b41773a fix logging config which trough interfered with 2020-01-07 15:19:03 -08:00
Noah Levitt
91fcc054c4 bump version after merge 2020-01-07 14:42:40 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00
Noah Levitt
f54e1b37c7 bump version after merge 2020-01-07 14:40:58 -08:00
Noah Levitt
47ec5d7644
Merge pull request #143 from nlevitt/use-trough-lib
use trough.client instead of warcprox.trough
2020-01-07 14:40:41 -08:00
Adam Miller
4ceebe1fa9 Moving more variables from RecordedUrl to RequiredUrl 2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247 Refactor failed requests into new class. 2020-01-03 20:43:47 +00:00
Adam Miller
f9c9443d2f Beginning modifications to pass along a dummy RecordedUrl on connection timeout for logging 2019-12-11 01:54:11 +00:00
Noah Levitt
ac959c6db5 change trough dedup date type to varchar
This is a backwards-compatible change whose purpose is to clarify the
existing usage.

In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.

Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.

Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
2019-11-19 13:33:59 -08:00
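The point about sqlite column types in the commit message above can be checked with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the table name `dedup` and the column `digest_key` are illustrative and not necessarily the real trough schema. It stores the same timestamp string in a column declared `datetime` and in one declared `varchar(100)`, then reports the storage type sqlite actually used.

```python
import sqlite3


def stored_type_and_value(declared_type, value):
    """Insert value into a column of the given declared type and
    return (sqlite storage type, value read back)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Illustrative schema; declared_type is interpolated only
        # because column types cannot be bound as parameters.
        conn.execute(
            "create table dedup ("
            "digest_key varchar(100) primary key, date %s)" % declared_type
        )
        conn.execute(
            "insert into dedup (digest_key, date) values (?, ?)",
            ("sha1:example", value),
        )
        return conn.execute("select typeof(date), date from dedup").fetchone()
    finally:
        conn.close()


as_datetime = stored_type_and_value("datetime", "2019-11-19T01:23:45Z")
as_varchar = stored_type_and_value("varchar(100)", "2019-11-19T01:23:45Z")
```

In both cases the string is stored as text and read back unchanged, which is consistent with the commit's description of the change as backwards compatible.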
Noah Levitt
ad652b407c trough uses py3.5+ async syntax
so don't test 3.4; also we know warcprox requires py3 now so don't test
py2
2019-11-19 11:58:56 -08:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
f77c152037 bump version after merge 2019-09-26 11:49:07 -07:00
Noah Levitt
22d786f72e
Merge pull request #142 from vbanos/fix-close-rename
Another exception when trying to close a WARC file
2019-09-26 11:20:27 -07:00