Barbara Miller
cff2b19745
Merge branch 'WT-31' into qa
2021-12-29 10:25:30 -08:00
Barbara Miller
5d8fbf7038
fix logging buglet
2021-12-29 10:25:04 -08:00
Barbara Miller
48f48c34cd
Merge branch 'WT-31' into qa
2021-12-16 18:45:00 -08:00
Barbara Miller
d7aec77597
faster, likely
2021-12-16 18:36:00 -08:00
Barbara Miller
6e65b5ff55
Merge branch 'WT-31' into qa
2021-12-09 12:20:09 -08:00
Barbara Miller
bcaf293081
better logging
2021-12-09 12:19:45 -08:00
Barbara Miller
1d3e3b3671
Merge branch 'WT-31' into qa
2021-12-08 11:04:27 -08:00
Barbara Miller
7d4c8dcb4e
recorded_url.do_not_archive = True
2021-12-08 11:04:09 -08:00
Barbara Miller
69529e5845
Merge branch 'WT-31' into qa
2021-12-06 20:33:37 -08:00
Barbara Miller
da089e0a92
bytes not str
2021-12-06 20:33:16 -08:00
Barbara Miller
2ceb0f69f1
Merge branch 'WT-31' into qa
2021-12-06 19:43:37 -08:00
Barbara Miller
3eeccd0016
more hash_plus_url
2021-12-06 19:43:27 -08:00
Barbara Miller
a8944ddea3
Merge branch 'WT-31' into qa
2021-12-06 19:33:32 -08:00
Barbara Miller
5e5a74f204
str, not object
2021-12-06 19:33:10 -08:00
Barbara Miller
533234162e
str, not object
2021-12-06 19:32:35 -08:00
Barbara Miller
85bb6ff437
Merge branch 'WT-31' into qa
2021-12-06 17:30:25 -08:00
Barbara Miller
b67f1ad0f3
add logging
2021-12-06 17:29:27 -08:00
Barbara Miller
e6a1a7dd7e
increase trough dedup batch window
2021-12-06 17:29:02 -08:00
Barbara Miller
4e7a4c3eae
Merge branch 'WT-31' into qa
2021-12-02 11:46:58 -08:00
Barbara Miller
e744075913
python 3.5 version, mostly
2021-12-02 11:46:39 -08:00
Barbara Miller
16412d64dc
Merge branch 'WT-31' into qa
2021-12-02 11:18:44 -08:00
Barbara Miller
1476bfec8c
discard batch hash+url match
2021-12-02 11:17:59 -08:00
Barbara Miller
bab938f080
Merge branch 'idna2_10' into qa
2021-04-27 10:28:25 -07:00
Barbara Miller
e61099ff5f
idna==2.10
2021-04-27 10:26:45 -07:00
Barbara Miller
0e23a31a31
Merge pull request #161 from internetarchive/fixes-malformed-crawl-log-lines
...
Checking for content type header consisting of only empty spaces and r…
2021-04-21 15:31:17 -07:00
Barbara Miller
f782f8a985
Merge pull request #162 from internetarchive/fixes-malformed-crawl-log-lines
...
Fixes malformed crawl log lines
2021-04-01 12:19:03 -07:00
Adam Miller
7f406b7942
Trying to fix tests that only fail during ci
2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa
Add test cases for space in content type header and exception messages
2021-03-31 23:22:04 +00:00
Adam Miller
e0732ffaf4
Checking for content type header consisting of only empty spaces and removing spaces from exception messages in json section
2021-03-29 22:22:19 +00:00
Adam Miller
b8057825d8
Merge pull request #158 from galgeek/failed_url.timestamp
...
set failed_url.timestamp
2020-09-30 14:49:17 -07:00
Barbara Miller
30db93a47e
Merge branch 'failed_url.timestamp' into qa
2020-09-30 11:47:48 -07:00
Barbara Miller
e2e2c02802
set failed_url.timestamp
2020-09-30 11:47:17 -07:00
Barbara Miller
c21d77335f
Merge branch 'controller_fix_etc' into qa
2020-09-30 11:00:59 -07:00
Barbara Miller
42676cfb35
check record_url.timestamp
2020-09-30 11:00:35 -07:00
Barbara Miller
8c855ec4db
Merge branch 'controller_fix_etc' into qa
2020-09-25 16:03:39 -07:00
Barbara Miller
e29d377dfd
fix for TypeError
2020-09-25 15:58:47 -07:00
jkafader
f19ead0058
Merge pull request #145 from internetarchive/adds-logging-for-failed-connections
...
Adds logging for failed connections
2020-09-23 12:22:12 -07:00
Adam Miller
36784de174
Merge branch 'master' into adds-logging-for-failed-connections
2020-09-23 19:18:41 +00:00
Adam Miller
aaaf1bff7c
Merge pull request #155 from internetarchive/adds-logging-for-failed-connections
...
Expanding logging to handle DNS failures, print error message to craw…
2020-08-20 15:02:05 -07:00
Barbara Miller
ce1f32dc41
Merge pull request #154 from internetarchive/galgeek-version-update
...
bump version
2020-08-18 09:30:28 -07:00
Barbara Miller
ae11daedc1
bump version
2020-08-18 09:29:57 -07:00
Barbara Miller
456698fe06
Merge pull request #153 from vbanos/should-dedup-impr
...
Thanks, @vbanos!
2020-08-17 14:04:49 -07:00
Barbara Miller
d90367f21f
Merge pull request #152 from cclauss/patch-1
...
Thank you, @cclauss!
2020-08-15 08:49:59 -07:00
Vangelis Banos
8078ee7af9
DedupableMixin.should_dedup() improvement
...
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
2020-08-15 09:17:39 +00:00
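As a rough illustration of the idea in commit 8078ee7af9 (skip dedup work for URLs that will never reach the WARC), here is a minimal sketch. It is not the actual `DedupableMixin.should_dedup()` code: the `RecordedUrl` stand-in, its `payload_size` field, and the `min_size` threshold are assumptions made for the example; only the `do_not_archive` flag comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class RecordedUrl:
    """Hypothetical stand-in for a recorded URL; not warcprox's real class."""
    url: str
    payload_size: int
    do_not_archive: bool = False


def should_dedup(recorded_url, min_size=0):
    """Return True only if a dedup lookup is worth doing for this URL."""
    # Early exit: a URL flagged do_not_archive is not written to WARC,
    # so any dedup lookup for it would be wasted work.
    if recorded_url.do_not_archive:
        return False
    # Assumed size threshold, included only to give the function a second rule.
    return recorded_url.payload_size > min_size
```

Putting the flag check first means the cheaper test runs before any size or content-based rule.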
Christian Clauss
c649355285
setup.py: Add Python 3.8
2020-08-06 17:58:00 +02:00
Christian Clauss
21351094ec
Travis CI: Add Python 3.8 to testing
2020-08-06 17:27:15 +02:00
Adam Miller
edeae3b21a
Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors.
2020-07-22 21:36:39 +00:00
Barbara Miller
ca3d5d4edd
Merge pull request #151 from vbanos/fix-runtime-error
...
Fix runtime error
2020-07-09 15:12:00 -07:00
Vangelis Banos
89e6745274
Handle RuntimeError
...
Sometimes when warcprox runs for several days under load it freezes
and the last error in the log is:
```
WARNING:warcprox.warcproxy.WarcProxy:exception processing request
<socket.socket fd=53, family=AddressFamily.AF_INET,
type=SocketKind.SOCK_STREAM, proto=0, laddr=('207.241.225.241', 8003),
raddr=('207.241.225.241', 40738)> from ('207.241.225.241', 40738)
Traceback (most recent call last):
File "/usr/lib/python3.7/socketserver.py", line 316, in
_handle_request_noblock
self.process_request(request, client_address)
File "/opt/spn2/lib/python3.7/site-packages/warcprox/mitmproxy.py",
line 641, in process_request
self.process_request_thread, request, client_address)
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 172, in
submit
self._adjust_thread_count()
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 193, in
_adjust_thread_count
t.start()
File "/usr/lib/python3.7/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
The process seems to run but it doesn't respond to any connection, not
even `status` requests.
We handle this exception and allow it to continue operation.
2020-07-08 16:48:05 +00:00
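The pattern described in commit 89e6745274 can be sketched roughly as follows: catch the `RuntimeError` raised while handing a request to a thread pool, log it, and return instead of letting the exception escape. This is an assumption-laden sketch, not the code from `warcprox/mitmproxy.py`: the `submit_request` helper and the logger name are hypothetical, and whatever the real fix does beyond catching the exception is not shown in this log.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("runtime_error_sketch")  # hypothetical logger name


def submit_request(pool, handler, request, client_address):
    """Submit handler(request, client_address) to the pool.

    Returns the Future on success, or None if the pool could not
    accept the work because submit() raised RuntimeError.
    """
    try:
        return pool.submit(handler, request, client_address)
    except RuntimeError as exc:
        # e.g. "can't start new thread" when the process has run out of threads
        logger.warning(
            "could not process request from %s: %s", client_address, exc)
        return None
```

A caller would typically drop or close the connection when `None` comes back, so the server loop keeps accepting later requests.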
jkafader
73a787ac88
Merge pull request #149 from internetarchive/adds-logging-for-failed-connections
...
Adds logging for failed connections
2020-06-18 14:20:18 -07:00