Barbara Miller
bc3d1e6d00
fix logging buglet ii
2021-12-29 11:55:39 -08:00
Barbara Miller
5d8fbf7038
fix logging buglet
2021-12-29 10:25:04 -08:00
Barbara Miller
d7aec77597
faster, likely
2021-12-16 18:36:00 -08:00
Barbara Miller
bcaf293081
better logging
2021-12-09 12:19:45 -08:00
Barbara Miller
7d4c8dcb4e
recorded_url.do_not_archive = True
2021-12-08 11:04:09 -08:00
Barbara Miller
da089e0a92
bytes not str
2021-12-06 20:33:16 -08:00
Barbara Miller
3eeccd0016
more hash_plus_url
2021-12-06 19:43:27 -08:00
Barbara Miller
5e5a74f204
str, not object
2021-12-06 19:33:10 -08:00
Barbara Miller
b67f1ad0f3
add logging
2021-12-06 17:29:27 -08:00
Barbara Miller
e6a1a7dd7e
increase trough dedup batch window
2021-12-06 17:29:02 -08:00
Barbara Miller
e744075913
python 3.5 version, mostly
2021-12-02 11:46:39 -08:00
Barbara Miller
1476bfec8c
discard batch hash+url match
2021-12-02 11:17:59 -08:00
Barbara Miller
0e23a31a31
Merge pull request #161 from internetarchive/fixes-malformed-crawl-log-lines
...
Checking for content type header consisting of only empty spaces and r…
2021-04-21 15:31:17 -07:00
Adam Miller
7f406b7942
Trying to fix tests that only fail during ci
2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa
Add test cases for space in content type header and exception messages
2021-03-31 23:22:04 +00:00
Adam Miller
e0732ffaf4
Checking for content type header consisting of only empty spaces and removing spaces from exception messages in json section
2021-03-29 22:22:19 +00:00
Adam Miller
b8057825d8
Merge pull request #158 from galgeek/failed_url.timestamp
...
set failed_url.timestamp
2020-09-30 14:49:17 -07:00
Barbara Miller
e2e2c02802
set failed_url.timestamp
2020-09-30 11:47:17 -07:00
jkafader
f19ead0058
Merge pull request #145 from internetarchive/adds-logging-for-failed-connections
...
Adds logging for failed connections
2020-09-23 12:22:12 -07:00
Adam Miller
36784de174
Merge branch 'master' into adds-logging-for-failed-connections
2020-09-23 19:18:41 +00:00
Barbara Miller
ce1f32dc41
Merge pull request #154 from internetarchive/galgeek-version-update
...
bump version
2020-08-18 09:30:28 -07:00
Barbara Miller
ae11daedc1
bump version
2020-08-18 09:29:57 -07:00
Barbara Miller
456698fe06
Merge pull request #153 from vbanos/should-dedup-impr
...
Thanks, @vbanos!
2020-08-17 14:04:49 -07:00
Barbara Miller
d90367f21f
Merge pull request #152 from cclauss/patch-1
...
Thank you, @cclauss!
2020-08-15 08:49:59 -07:00
Vangelis Banos
8078ee7af9
DedupableMixin.should_dedup() improvement
...
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
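
A minimal sketch of the early exit described above, not the actual warcprox code. The `RecordedUrl` stand-in, the `payload_size` attribute, and the `min_size` threshold are assumptions made for illustration; only `should_dedup` and `do_not_archive` are named in the commit message.

```python
class RecordedUrl:
    """Hypothetical stand-in for warcprox's recorded URL object."""

    def __init__(self, payload_size, do_not_archive=False):
        self.payload_size = payload_size
        self.do_not_archive = do_not_archive


class DedupableMixin:
    """Illustrative only; the size threshold is an assumed parameter."""

    def __init__(self, min_size=1):
        self.min_size = min_size

    def should_dedup(self, recorded_url):
        # A URL flagged do_not_archive is never written to WARC,
        # so a dedup lookup for it would be wasted work.
        if recorded_url.do_not_archive:
            return False
        # Assumed remaining criterion: payload is large enough to bother.
        return recorded_url.payload_size >= self.min_size


dedup = DedupableMixin(min_size=1)
print(dedup.should_dedup(RecordedUrl(payload_size=1024)))
print(dedup.should_dedup(RecordedUrl(payload_size=1024, do_not_archive=True)))
```

Putting the flag check first means the cheaper test short-circuits any later, more expensive criteria.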
2020-08-15 09:17:39 +00:00
Christian Clauss
c649355285
setup.py: Add Python 3.8
2020-08-06 17:58:00 +02:00
Christian Clauss
21351094ec
Travis CI: Add Python 3.8 to testing
2020-08-06 17:27:15 +02:00
Adam Miller
edeae3b21a
Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors.
2020-07-22 21:36:39 +00:00
Noah Levitt
b34419543f
Oops!
2020-05-06 14:52:32 -07:00
Noah Levitt
5e397e9bca
Elide unnecessary params
2020-05-06 14:28:00 -07:00
Noah Levitt
d0b21f5dc4
Undo accidentally committed code
2020-05-06 14:27:34 -07:00
Noah Levitt
36711c0148
try to fix .travis.yml
2020-05-06 14:19:19 -07:00
Noah Levitt
a5e9c27223
Share code, handle exception during CONNECT
2020-05-06 09:54:17 -07:00
Noah Levitt
de9219e646
require more recent urllib3
...
to avoid this error: https://github.com/internetarchive/warcprox/issues/148
2020-01-28 14:42:44,851 2023 ERROR MitmProxyHandler(tid=2037,started=2020-01-28T20:42:44.834551,client=127.0.0.1:49100) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:442) problem processing request 'GET / HTTP/1.1': TypeError("connection_from_host() got an unexpected keyword argument 'pool_kwargs'",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 413, in do_COMMAND
    self._connect_to_remote_server()
  File "/usr/local/lib/python3.5/dist-packages/warcprox/warcproxy.py", line 189, in _connect_to_remote_server
    return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 277, in _connect_to_remote_server
    pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
2020-02-06 10:10:53 -08:00
Noah Levitt
5c15582be5
Merge pull request #147 from nlevitt/fix-travis-jan2020
...
tests need trough
2020-01-08 14:29:16 -08:00
Noah Levitt
47731c61c1
tests need trough
2020-01-08 14:05:04 -08:00
Noah Levitt
90fba01514
make trough dependency optional
2020-01-08 13:37:01 -08:00
Noah Levitt
a8cd53bfe4
bump version, trough dep version
2020-01-08 13:24:00 -08:00
Noah Levitt
ee6bc151e1
Merge pull request #146 from vbanos/warc-filename-port
...
Add port to custom WARC filename vars
2020-01-08 13:22:50 -08:00
Vangelis Banos
ca0197330d
Add port to custom WARC filename vars
2020-01-08 21:19:48 +00:00
Noah Levitt
469b41773a
fix logging config which trough interfered with
2020-01-07 15:19:03 -08:00
Noah Levitt
91fcc054c4
bump version after merge
2020-01-07 14:42:40 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
...
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00
Noah Levitt
f54e1b37c7
bump version after merge
2020-01-07 14:40:58 -08:00
Noah Levitt
47ec5d7644
Merge pull request #143 from nlevitt/use-trough-lib
...
use trough.client instead of warcprox.trough
2020-01-07 14:40:41 -08:00
Adam Miller
4ceebe1fa9
Moving more variables from RecordedUrl to RequiredUrl
2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247
Refactor failed requests into new class.
2020-01-03 20:43:47 +00:00
Adam Miller
f9c9443d2f
Beginning modifications to pass along a dummy RecordedUrl on connection timeout for logging
2019-12-11 01:54:11 +00:00
Noah Levitt
ac959c6db5
change trough dedup `date` type to varchar
...
This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html . `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
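
The sqlite behavior described above can be checked with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names below are hypothetical and are not trough's actual dedup schema.

```python
import sqlite3


def stored_types():
    """Store the same timestamp string under two declared column types."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical tables; only the declared type of `date` differs.
    conn.execute(
        "create table dedup_datetime (digest_key varchar primary key, date datetime)")
    conn.execute(
        "create table dedup_varchar (digest_key varchar primary key, date varchar)")
    value = "2019-11-19T01:23:45Z"
    results = {}
    for table in ("dedup_datetime", "dedup_varchar"):
        conn.execute(
            "insert into {} values (?, ?)".format(table), ("sha1:abc", value))
        # typeof() reports the storage class sqlite actually used for the value.
        results[table] = conn.execute(
            "select date, typeof(date) from {}".format(table)).fetchone()
    conn.close()
    return results


for table, (value, storage_class) in sorted(stored_types().items()):
    print(table, value, storage_class)
```

Both tables hand back the original string with storage class `text`, so the declared type only documents intent here.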
2019-11-19 13:33:59 -08:00
Noah Levitt
ad652b407c
trough uses py3.5+ async syntax
...
so don't test 3.4; also we know warcprox requires py3 now so don't test
py2
2019-11-19 11:58:56 -08:00