Barbara Miller
c027659001
Merge pull request #167 from galgeek/WT-31
fix logging buglet iii
2021-12-29 12:14:56 -08:00
Barbara Miller
9e8ea5bb45
fix logging buglet iii
2021-12-29 12:06:18 -08:00
Barbara Miller
bc3d1e6d00
fix logging buglet ii
2021-12-29 11:55:39 -08:00
Barbara Miller
6b372e2f3f
Merge pull request #166 from galgeek/WT-31
fix logging buglet
2021-12-29 11:04:03 -08:00
Barbara Miller
5d8fbf7038
fix logging buglet
2021-12-29 10:25:04 -08:00
Barbara Miller
a969430b37
Merge pull request #163 from internetarchive/idna2_10
idna==2.10
2021-12-28 13:50:23 -08:00
Barbara Miller
aeecb6515f
bump version
2021-12-28 11:58:30 -08:00
Adam Miller
e1eddb8fa7
Merge pull request #165 from galgeek/WT-31
in-batch dedup
2021-12-28 11:52:41 -08:00
Barbara Miller
d7aec77597
faster, likely
2021-12-16 18:36:00 -08:00
Barbara Miller
bcaf293081
better logging
2021-12-09 12:19:45 -08:00
Barbara Miller
7d4c8dcb4e
recorded_url.do_not_archive = True
2021-12-08 11:04:09 -08:00
Barbara Miller
da089e0a92
bytes not str
2021-12-06 20:33:16 -08:00
Barbara Miller
3eeccd0016
more hash_plus_url
2021-12-06 19:43:27 -08:00
Barbara Miller
5e5a74f204
str, not object
2021-12-06 19:33:10 -08:00
Barbara Miller
b67f1ad0f3
add logging
2021-12-06 17:29:27 -08:00
Barbara Miller
e6a1a7dd7e
increase trough dedup batch window
2021-12-06 17:29:02 -08:00
Barbara Miller
e744075913
python 3.5 version, mostly
2021-12-02 11:46:39 -08:00
Barbara Miller
1476bfec8c
discard batch hash+url match
2021-12-02 11:17:59 -08:00
Barbara Miller
e61099ff5f
idna==2.10
2021-04-27 10:26:45 -07:00
Barbara Miller
0e23a31a31
Merge pull request #161 from internetarchive/fixes-malformed-crawl-log-lines
Checking for content type header consisting of only empty spaces and r…
2021-04-21 15:31:17 -07:00
Adam Miller
7f406b7942
Trying to fix tests that only fail during ci
2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa
Add test cases for space in content type header and exception messages
2021-03-31 23:22:04 +00:00
Adam Miller
e0732ffaf4
Checking for content type header consisting of only empty spaces and removing spaces from exception messages in json section
2021-03-29 22:22:19 +00:00
Adam Miller
b8057825d8
Merge pull request #158 from galgeek/failed_url.timestamp
set failed_url.timestamp
2020-09-30 14:49:17 -07:00
Barbara Miller
e2e2c02802
set failed_url.timestamp
2020-09-30 11:47:17 -07:00
jkafader
f19ead0058
Merge pull request #145 from internetarchive/adds-logging-for-failed-connections
Adds logging for failed connections
2020-09-23 12:22:12 -07:00
Adam Miller
36784de174
Merge branch 'master' into adds-logging-for-failed-connections
2020-09-23 19:18:41 +00:00
Barbara Miller
ce1f32dc41
Merge pull request #154 from internetarchive/galgeek-version-update
bump version
2020-08-18 09:30:28 -07:00
Barbara Miller
ae11daedc1
bump version
2020-08-18 09:29:57 -07:00
Barbara Miller
456698fe06
Merge pull request #153 from vbanos/should-dedup-impr
Thanks, @vbanos!
2020-08-17 14:04:49 -07:00
Barbara Miller
d90367f21f
Merge pull request #152 from cclauss/patch-1
Thank you, @cclauss!
2020-08-15 08:49:59 -07:00
Vangelis Banos
8078ee7af9
DedupableMixin.should_dedup() improvement
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
2020-08-15 09:17:39 +00:00
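The early exit described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual warcprox code: the class here is a hypothetical stand-in that only borrows the names `DedupableMixin`, `should_dedup` and `do_not_archive` from the message, and any further eligibility checks the real method performs are left out.

```python
from types import SimpleNamespace


class DedupableMixin:
    """Hypothetical stand-in for the mixin named in the commit message."""

    def should_dedup(self, recorded_url):
        # A URL flagged do_not_archive is never written to WARC,
        # so looking it up in the dedup store would be wasted work.
        if getattr(recorded_url, "do_not_archive", False):
            return False
        # Assumption: the real method applies further checks here
        # (payload size, digest availability, and so on).
        return True


# Illustrative objects standing in for recorded URLs.
skipped = SimpleNamespace(do_not_archive=True)
archived = SimpleNamespace(do_not_archive=False)
```

With these stand-ins, `DedupableMixin().should_dedup(skipped)` returns `False` while the same call on `archived` returns `True`.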
Christian Clauss
c649355285
setup.py: Add Python 3.8
2020-08-06 17:58:00 +02:00
Christian Clauss
21351094ec
Travis CI: Add Python 3.8 to testing
2020-08-06 17:27:15 +02:00
Adam Miller
edeae3b21a
Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors.
2020-07-22 21:36:39 +00:00
Noah Levitt
b34419543f
Oops!
2020-05-06 14:52:32 -07:00
Noah Levitt
5e397e9bca
Elide unnecessary params
2020-05-06 14:28:00 -07:00
Noah Levitt
d0b21f5dc4
Undo accidentally committed code
2020-05-06 14:27:34 -07:00
Noah Levitt
36711c0148
try to fix .travis.yml
2020-05-06 14:19:19 -07:00
Noah Levitt
a5e9c27223
Share code, handle exception during CONNECT
2020-05-06 09:54:17 -07:00
Noah Levitt
de9219e646
require more recent urllib3
to avoid this error: https://github.com/internetarchive/warcprox/issues/148
2020-01-28 14:42:44,851 2023 ERROR MitmProxyHandler(tid=2037,started=2020-01-28T20:42:44.834551,client=127.0.0.1:49100) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:442) problem processing request 'GET / HTTP/1.1': TypeError("connection_from_host() got an unexpected keyword argument 'pool_kwargs'",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 413, in do_COMMAND
    self._connect_to_remote_server()
  File "/usr/local/lib/python3.5/dist-packages/warcprox/warcproxy.py", line 189, in _connect_to_remote_server
    return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 277, in _connect_to_remote_server
    pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
2020-02-06 10:10:53 -08:00
Noah Levitt
5c15582be5
Merge pull request #147 from nlevitt/fix-travis-jan2020
tests need trough
2020-01-08 14:29:16 -08:00
Noah Levitt
47731c61c1
tests need trough
2020-01-08 14:05:04 -08:00
Noah Levitt
90fba01514
make trough dependency optional
2020-01-08 13:37:01 -08:00
Noah Levitt
a8cd53bfe4
bump version, trough dep version
2020-01-08 13:24:00 -08:00
Noah Levitt
ee6bc151e1
Merge pull request #146 from vbanos/warc-filename-port
Add port to custom WARC filename vars
2020-01-08 13:22:50 -08:00
Vangelis Banos
ca0197330d
Add port to custom WARC filename vars
2020-01-08 21:19:48 +00:00
Noah Levitt
469b41773a
fix logging config which trough interfered with
2020-01-07 15:19:03 -08:00
Noah Levitt
91fcc054c4
bump version after merge
2020-01-07 14:42:40 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00