Periodic check-in of daily crawling/ingest.
Overall ingest status, past 30 days:
SELECT ingest_file_result.ingest_type, ingest_file_result.status, COUNT(*)
FROM ingest_file_result
LEFT JOIN ingest_request
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE ingest_request.created >= NOW() - '30 day'::INTERVAL
AND ingest_request.ingest_type = 'pdf'
AND ingest_request.ingest_request_source = 'fatcat-changelog'
GROUP BY ingest_file_result.ingest_type, ingest_file_result.status
ORDER BY COUNT DESC
LIMIT 20;
ingest_type | status | count
-------------+-------------------------+--------
pdf | no-pdf-link | 158474
pdf | spn2-cdx-lookup-failure | 135344
pdf | success | 127938
pdf | spn2-error | 65411
pdf | gateway-timeout | 63112
pdf | blocked-cookie | 26338
pdf | terminal-bad-status | 24853
pdf | link-loop | 15699
pdf | spn2-error:job-failed | 13862
pdf | redirect-loop | 11432
pdf | cdx-error | 2376
pdf | too-many-redirects | 2186
pdf | wrong-mimetype | 2142
pdf | forbidden | 1758
pdf | spn2-error:no-status | 972
pdf | not-found | 820
pdf | bad-redirect | 536
pdf | read-timeout | 392
pdf | wayback-error | 251
pdf | remote-server-error | 220
(20 rows)
Hrm, `no-pdf-link` is the largest single status: 158,474 of the 654,116 results in the 20 rows above (about 24%), more than `success` (127,938).
Broken domains, past 30 days:
SELECT domain, status, COUNT((domain, status))
FROM (
SELECT
ingest_file_result.ingest_type,
ingest_file_result.status,
substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
FROM ingest_file_result
LEFT JOIN ingest_request
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
-- ingest_request.created >= NOW() - '3 day'::INTERVAL
ingest_file_result.updated >= NOW() - '30 day'::INTERVAL
AND ingest_request.ingest_type = 'pdf'
AND ingest_request.ingest_request_source = 'fatcat-changelog'
) t1
WHERE t1.domain != ''
AND t1.status != 'success'
GROUP BY domain, status
ORDER BY COUNT DESC
LIMIT 25;
domain | status | count
-------------------------+-------------------------+-------
zenodo.org | no-pdf-link | 39678
osf.io | gateway-timeout | 29809
acervus.unicamp.br | no-pdf-link | 21978
osf.io | terminal-bad-status | 18727
zenodo.org | spn2-cdx-lookup-failure | 17008
doi.org | spn2-cdx-lookup-failure | 15503
www.degruyter.com | no-pdf-link | 15122
ieeexplore.ieee.org | spn2-error:job-failed | 12921
osf.io | spn2-cdx-lookup-failure | 11123
www.tandfonline.com | blocked-cookie | 8096
www.morressier.com | no-pdf-link | 4655
ieeexplore.ieee.org | spn2-cdx-lookup-failure | 4580
pubs.acs.org | blocked-cookie | 4415
www.frontiersin.org | no-pdf-link | 4163
www.degruyter.com | spn2-cdx-lookup-failure | 3788
www.taylorfrancis.com | no-pdf-link | 3568
www.sciencedirect.com | no-pdf-link | 3128
www.taylorfrancis.com | spn2-cdx-lookup-failure | 3116
acervus.unicamp.br | spn2-cdx-lookup-failure | 2797
www.mdpi.com | spn2-cdx-lookup-failure | 2719
brill.com | link-loop | 2681
linkinghub.elsevier.com | spn2-cdx-lookup-failure | 2657
www.sciencedirect.com | spn2-cdx-lookup-failure | 2546
apps.crossref.org | no-pdf-link | 2537
onlinelibrary.wiley.com | blocked-cookie | 2528
(25 rows)
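The `domain` column in the query above comes from `substring(terminal_url FROM '[^/]+://([^/]*)')`, which keeps everything between `://` and the next `/`. A rough Python equivalent, handy for checking what the pattern captures before running it over the whole table. The function name `terminal_domain` is made up for this sketch and is not part of sandcrawler:

```python
import re
from typing import Optional

# Same pattern as the SQL substring() call: a scheme, "://", then
# everything up to (but not including) the next "/".
DOMAIN_RE = re.compile(r"[^/]+://([^/]*)")


def terminal_domain(terminal_url: Optional[str]) -> Optional[str]:
    """Return the host part of a URL, or None if it is missing or has no scheme."""
    if not terminal_url:
        return None
    match = DOMAIN_RE.search(terminal_url)
    return match.group(1) if match else None


print(terminal_domain("https://repository.dri.ie/catalog/t148v5941"))
```

Note that the capture group includes any port or userinfo in the URL, so `host:8080` would be counted as a separate domain from `host`.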
Summary of significant domains and status, past 30 days, minus spn2-cdx-lookup-failure:
SELECT domain, status, count
FROM (
SELECT domain, status, COUNT((domain, status)) as count
FROM (
SELECT
ingest_file_result.ingest_type,
ingest_file_result.status,
substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
FROM ingest_file_result
LEFT JOIN ingest_request
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_file_result.updated >= NOW() - '30 day'::INTERVAL
AND ingest_request.ingest_type = 'pdf'
AND ingest_request.ingest_request_source = 'fatcat-changelog'
AND ingest_file_result.status != 'spn2-cdx-lookup-failure'
) t1
WHERE t1.domain != ''
GROUP BY CUBE (domain, status)
) t2
WHERE count > 200
ORDER BY domain ASC , count DESC;
domain | status | count
-----------------------------------------------------------------+-----------------------+--------
academic.oup.com | | 2405
academic.oup.com | no-pdf-link | 1240
academic.oup.com | link-loop | 1010
acervus.unicamp.br | | 21980
acervus.unicamp.br | no-pdf-link | 21978 **
aclanthology.org | | 208
acp.copernicus.org | | 365
acp.copernicus.org | success | 356
aip.scitation.org | | 1071
aip.scitation.org | blocked-cookie | 843
aip.scitation.org | redirect-loop | 227
apps.crossref.org | | 2537
apps.crossref.org | no-pdf-link | 2537
arxiv.org | | 17817
arxiv.org | success | 17370
arxiv.org | terminal-bad-status | 320
asmedigitalcollection.asme.org | | 401
asmedigitalcollection.asme.org | link-loop | 364
assets.researchsquare.com | | 3706
assets.researchsquare.com | success | 3706
avmj.journals.ekb.eg | | 605
avmj.journals.ekb.eg | success | 595
bfa.journals.ekb.eg | | 224
bfa.journals.ekb.eg | success | 214
biorxiv.org | redirect-loop | 895
biorxiv.org | | 895
birdsoftheworld.org | | 286
birdsoftheworld.org | no-pdf-link | 285
bmjopen.bmj.com | success | 232
bmjopen.bmj.com | | 232
books.openedition.org | | 396
books.openedition.org | no-pdf-link | 396
brill.com | | 4272
brill.com | link-loop | 2681
brill.com | no-pdf-link | 1410
cas.columbia.edu | | 1038
cas.columbia.edu | no-pdf-link | 1038 **
cdr.lib.unc.edu | | 513
cdr.lib.unc.edu | success | 469
chemrxiv.org | | 278
chemrxiv.org | success | 275
classiques-garnier.com | | 531
classiques-garnier.com | no-pdf-link | 487 *
content.iospress.com | | 275
content.iospress.com | link-loop | 230
cris.maastrichtuniversity.nl | | 318
cris.maastrichtuniversity.nl | success | 284
cyberleninka.ru | | 1165
cyberleninka.ru | success | 1134
deepblue.lib.umich.edu | | 289
dergipark.org.tr | | 1185
dergipark.org.tr | success | 774
dergipark.org.tr | no-pdf-link | 320
didaktorika.gr | | 688
didaktorika.gr | redirect-loop | 688
digi.ub.uni-heidelberg.de | | 292
digi.ub.uni-heidelberg.de | no-pdf-link | 292
direct.mit.edu | | 236
direct.mit.edu | no-pdf-link | 207 *
dl.acm.org | | 2319
dl.acm.org | blocked-cookie | 2230
dmtcs.episciences.org | | 733
dmtcs.episciences.org | success | 730
doi.ala.org.au | no-pdf-link | 2373 **
doi.ala.org.au | | 2373
doi.org | | 732
doi.org | terminal-bad-status | 673
downloads.hindawi.com | success | 1452
downloads.hindawi.com | | 1452
drive.google.com | | 216
drive.google.com | no-pdf-link | 211
dtb.bmj.com | | 674
dtb.bmj.com | link-loop | 669
easy.dans.knaw.nl | no-pdf-link | 261 *
easy.dans.knaw.nl | | 261
ebooks.marilia.unesp.br | | 688
ebooks.marilia.unesp.br | no-pdf-link | 688 *
ehp.niehs.nih.gov | | 766
ehp.niehs.nih.gov | blocked-cookie | 765
ejournal.mandalanursa.org | | 307
ejournal.mandalanursa.org | success | 305
elib.spbstu.ru | | 264
elib.spbstu.ru | redirect-loop | 257
elibrary.ru | | 1367
elibrary.ru | redirect-loop | 1169
elibrary.vdi-verlag.de | | 1251
elibrary.vdi-verlag.de | no-pdf-link | 646
elibrary.vdi-verlag.de | link-loop | 537
elifesciences.org | | 328
elifesciences.org | success | 323
figshare.com | | 803
figshare.com | no-pdf-link | 714 *
files.osf.io | | 745
files.osf.io | success | 614
hammer.purdue.edu | | 244
hammer.purdue.edu | no-pdf-link | 243
heiup.uni-heidelberg.de | | 277
heiup.uni-heidelberg.de | no-pdf-link | 268
hkvalidate.perfdrive.com | no-pdf-link | 370 *
hkvalidate.perfdrive.com | | 370
ieeexplore.ieee.org | | 16675
ieeexplore.ieee.org | spn2-error:job-failed | 12927
ieeexplore.ieee.org | success | 1952
ieeexplore.ieee.org | too-many-redirects | 1193
ieeexplore.ieee.org | no-pdf-link | 419
jamanetwork.com | | 339
jamanetwork.com | success | 216
jmstt.ntou.edu.tw | | 244
jmstt.ntou.edu.tw | success | 241
journal.ipb.ac.id | | 229
journal.ipb.ac.id | success | 206
journal.nafe.org | | 221
journals.aps.org | | 614
journals.aps.org | gateway-timeout | 495
journals.asm.org | | 463
journals.asm.org | blocked-cookie | 435
journals.flvc.org | | 230
journals.lww.com | | 1300
journals.lww.com | link-loop | 1284
journals.openedition.org | | 543
journals.openedition.org | success | 311
journals.ub.uni-heidelberg.de | | 357
journals.ub.uni-heidelberg.de | success | 311
jov.arvojournals.org | | 431
jov.arvojournals.org | no-pdf-link | 422 *
kiss.kstudy.com | | 303
kiss.kstudy.com | no-pdf-link | 303 *
library.iated.org | | 364
library.iated.org | redirect-loop | 264
library.seg.org | blocked-cookie | 301
library.seg.org | | 301
link.aps.org | redirect-loop | 442
link.aps.org | | 442
linkinghub.elsevier.com | | 515
linkinghub.elsevier.com | gateway-timeout | 392
mc.sbm.org.br | | 224
mc.sbm.org.br | success | 224
mdpi-res.com | | 742
mdpi-res.com | success | 742
mdsoar.org | | 220
mediarep.org | | 269
mediarep.org | success | 264
medrxiv.org | redirect-loop | 290
medrxiv.org | | 290
muse.jhu.edu | | 429
muse.jhu.edu | terminal-bad-status | 391
mvmj.journals.ekb.eg | | 306
oapub.org | | 292
oapub.org | success | 289
onepetro.org | | 426
onepetro.org | link-loop | 406
onlinelibrary.wiley.com | | 2835
onlinelibrary.wiley.com | blocked-cookie | 2531
onlinelibrary.wiley.com | redirect-loop | 264
open.library.ubc.ca | | 569
open.library.ubc.ca | no-pdf-link | 425 *
opendata.uni-halle.de | | 407
opendata.uni-halle.de | success | 263
osf.io | | 49022
osf.io | gateway-timeout | 29810
osf.io | terminal-bad-status | 18731
osf.io | spn2-error | 247
osf.io | not-found | 205
oxford.universitypressscholarship.com | | 392
oxford.universitypressscholarship.com | link-loop | 233
panor.ru | no-pdf-link | 433 *
panor.ru | | 433
papers.ssrn.com | | 1630
papers.ssrn.com | link-loop | 1598
pdf.sciencedirectassets.com | | 3063
pdf.sciencedirectassets.com | success | 3063
peerj.com | | 464
peerj.com | no-pdf-link | 303 *
periodicos.ufpe.br | | 245
periodicos.ufpe.br | success | 232
periodicos.unb.br | | 230
periodicos.unb.br | success | 221
preprints.jmir.org | | 548
preprints.jmir.org | cdx-error | 499
publications.rwth-aachen.de | | 213
publikationen.bibliothek.kit.edu | | 346
publikationen.bibliothek.kit.edu | success | 314
publikationen.uni-tuebingen.de | | 623
publikationen.uni-tuebingen.de | no-pdf-link | 522 *
publons.com | no-pdf-link | 934 *
publons.com | | 934
pubs.acs.org | | 4507
pubs.acs.org | blocked-cookie | 4406
pubs.rsc.org | | 1638
pubs.rsc.org | link-loop | 1054
pubs.rsc.org | redirect-loop | 343
pubs.rsc.org | success | 201
repositorio.ufu.br | | 637
repositorio.ufu.br | success | 607
repository.dri.ie | | 1852
repository.dri.ie | no-pdf-link | 1852 **
repository.library.brown.edu | | 293
repository.library.brown.edu | no-pdf-link | 291 *
res.mdpi.com | | 10367
res.mdpi.com | success | 10360
retrovirology.biomedcentral.com | | 230
revistas.ufrj.br | | 284
revistas.ufrj.br | success | 283
revistas.uptc.edu.co | | 385
revistas.uptc.edu.co | success | 344
royalsocietypublishing.org | | 231
rsdjournal.org | | 347
rsdjournal.org | success | 343
s3-ap-southeast-2.amazonaws.com | | 400
s3-ap-southeast-2.amazonaws.com | success | 392
s3-eu-west-1.amazonaws.com | | 2096
s3-eu-west-1.amazonaws.com | success | 2091
s3-euw1-ap-pe-df-pch-content-store-p.s3.eu-west-1.amazonaws.com | | 289
s3-euw1-ap-pe-df-pch-content-store-p.s3.eu-west-1.amazonaws.com | success | 286
s3.ca-central-1.amazonaws.com | | 202
sage.figshare.com | | 242
sage.figshare.com | no-pdf-link | 241
sajeb.org | | 246
sajeb.org | no-pdf-link | 243
scholar.dkyobobook.co.kr | | 332
scholar.dkyobobook.co.kr | no-pdf-link | 328 *
search.mandumah.com | | 735
search.mandumah.com | redirect-loop | 726
secure.jbs.elsevierhealth.com | | 1112
secure.jbs.elsevierhealth.com | blocked-cookie | 1108
stm.bookpi.org | no-pdf-link | 468 *
stm.bookpi.org | | 468
storage.googleapis.com | | 1012
storage.googleapis.com | success | 1012
tandf.figshare.com | | 469
tandf.figshare.com | no-pdf-link | 466
teses.usp.br | | 739
teses.usp.br | success | 730
tidsskrift.dk | | 360
tidsskrift.dk | success | 346
tiedejaedistys.journal.fi | | 224
tind-customer-agecon.s3.amazonaws.com | success | 332
tind-customer-agecon.s3.amazonaws.com | | 332
valep.vc.univie.ac.at | no-pdf-link | 280
valep.vc.univie.ac.at | | 280
watermark.silverchair.com | | 1729
watermark.silverchair.com | success | 1719
www.academia.edu | | 387
www.academia.edu | no-pdf-link | 386
www.ahajournals.org | | 430
www.ahajournals.org | blocked-cookie | 413
www.atenaeditora.com.br | | 572
www.atenaeditora.com.br | terminal-bad-status | 513
www.atlantis-press.com | success | 722
www.atlantis-press.com | | 722
www.aup-online.com | | 419
www.aup-online.com | no-pdf-link | 419 *
www.beck-elibrary.de | | 269
www.beck-elibrary.de | no-pdf-link | 268 *
www.biodiversitylibrary.org | no-pdf-link | 528 *
www.biodiversitylibrary.org | | 528
www.bloomsburycollections.com | | 623
www.bloomsburycollections.com | no-pdf-link | 605 *
www.cabi.org | | 2191
www.cabi.org | no-pdf-link | 2186 *
www.cairn.info | | 1283
www.cairn.info | no-pdf-link | 713
www.cairn.info | link-loop | 345
www.cambridge.org | | 4128
www.cambridge.org | no-pdf-link | 1531
www.cambridge.org | success | 1441
www.cambridge.org | link-loop | 971
www.cureus.com | no-pdf-link | 526 *
www.cureus.com | | 526
www.dbpia.co.kr | | 637
www.dbpia.co.kr | redirect-loop | 631
www.deboni.he.com.br | | 382
www.deboni.he.com.br | success | 381
www.degruyter.com | | 17783
www.degruyter.com | no-pdf-link | 15102
www.degruyter.com | success | 2584
www.dovepress.com | | 480
www.dovepress.com | success | 472
www.e-manuscripta.ch | | 1350
www.e-manuscripta.ch | no-pdf-link | 1350 *
www.e-periodica.ch | | 1276
www.e-periodica.ch | no-pdf-link | 1275
www.e-rara.ch | | 202
www.e-rara.ch | no-pdf-link | 202
www.elgaronline.com | | 495
www.elgaronline.com | link-loop | 290
www.elibrary.ru | | 922
www.elibrary.ru | no-pdf-link | 904
www.emerald.com | | 2155
www.emerald.com | no-pdf-link | 1936 *
www.emerald.com | success | 219
www.eurekaselect.com | | 518
www.eurekaselect.com | no-pdf-link | 516 *
www.frontiersin.org | | 4163
www.frontiersin.org | no-pdf-link | 4162 **
www.hanser-elibrary.com | | 444
www.hanser-elibrary.com | blocked-cookie | 444
www.hanspub.org | | 334
www.hanspub.org | no-pdf-link | 314
www.idunn.no | | 1736
www.idunn.no | link-loop | 596
www.idunn.no | success | 577
www.idunn.no | no-pdf-link | 539
www.igi-global.com | terminal-bad-status | 458
www.igi-global.com | | 458
www.ijcai.org | | 533
www.ijcai.org | success | 532
www.ijraset.com | success | 385
www.ijraset.com | | 385
www.inderscience.com | | 712
www.inderscience.com | no-pdf-link | 605 *
www.ingentaconnect.com | | 456
www.ingentaconnect.com | no-pdf-link | 413 *
www.internationaljournalssrg.org | | 305
www.internationaljournalssrg.org | no-pdf-link | 305 *
www.isca-speech.org | | 2392
www.isca-speech.org | no-pdf-link | 2391 **
www.journals.uchicago.edu | | 228
www.journals.uchicago.edu | blocked-cookie | 227
www.jstage.jst.go.jp | | 1492
www.jstage.jst.go.jp | success | 1185
www.jstage.jst.go.jp | no-pdf-link | 289
www.jstor.org | | 301
www.jurology.com | | 887
www.jurology.com | redirect-loop | 887
www.karger.com | | 318
www.liebertpub.com | | 507
www.liebertpub.com | blocked-cookie | 496
www.morressier.com | | 4781
www.morressier.com | no-pdf-link | 4655 **
www.ncl.ecu.edu | | 413
www.ncl.ecu.edu | success | 413
www.nomos-elibrary.de | | 526
www.nomos-elibrary.de | no-pdf-link | 391
www.oecd-ilibrary.org | no-pdf-link | 1170 **
www.oecd-ilibrary.org | | 1170
www.openagrar.de | no-pdf-link | 221
www.openagrar.de | | 221
www.osapublishing.org | | 900
www.osapublishing.org | link-loop | 615
www.osapublishing.org | no-pdf-link | 269
www.osti.gov | | 630
www.osti.gov | link-loop | 573
www.oxfordlawtrove.com | no-pdf-link | 476 *
www.oxfordlawtrove.com | | 476
www.pdcnet.org | | 298
www.pdcnet.org | terminal-bad-status | 262
www.pedocs.de | | 203
www.pnas.org | | 222
www.preprints.org | | 372
www.preprints.org | success | 366
www.repository.cam.ac.uk | | 801
www.repository.cam.ac.uk | success | 359
www.repository.cam.ac.uk | no-pdf-link | 239
www.research-collection.ethz.ch | | 276
www.research-collection.ethz.ch | terminal-bad-status | 274
www.revistas.usp.br | | 207
www.revistas.usp.br | success | 204
www.rina.org.uk | no-pdf-link | 1009 **
www.rina.org.uk | | 1009
www.schweizerbart.de | no-pdf-link | 202
www.schweizerbart.de | | 202
www.scielo.br | | 544
www.scielo.br | redirect-loop | 526
www.sciencedirect.com | | 3901
www.sciencedirect.com | no-pdf-link | 3127 **
www.sciencedirect.com | link-loop | 701
www.sciendo.com | | 384
www.sciendo.com | success | 363
www.sciengine.com | | 225
www.scirp.org | | 209
www.spandidos-publications.com | | 205
www.tandfonline.com | | 8925
www.tandfonline.com | blocked-cookie | 8099
www.tandfonline.com | terminal-bad-status | 477
www.tandfonline.com | redirect-loop | 322
www.taylorfrancis.com | | 6119
www.taylorfrancis.com | no-pdf-link | 3567
www.taylorfrancis.com | link-loop | 2169
www.taylorfrancis.com | terminal-bad-status | 353
www.thieme-connect.de | | 1047
www.thieme-connect.de | redirect-loop | 472
www.thieme-connect.de | spn2-error:job-failed | 343
www.tib.eu | | 206
www.trp.org.in | | 311
www.trp.org.in | success | 311
www.un-ilibrary.org | no-pdf-link | 597 *
www.un-ilibrary.org | | 597
www.vr-elibrary.de | | 775
www.vr-elibrary.de | blocked-cookie | 774
www.wjgnet.com | | 204
www.wjgnet.com | no-pdf-link | 204
www.worldscientific.com | | 974
www.worldscientific.com | blocked-cookie | 971
www.worldwidejournals.com | | 242
www.worldwidejournals.com | no-pdf-link | 203
www.wto-ilibrary.org | no-pdf-link | 295
www.wto-ilibrary.org | | 295
www.zora.uzh.ch | | 222
zenodo.org | | 49460
zenodo.org | no-pdf-link | 39721
zenodo.org | success | 8954
zenodo.org | wrong-mimetype | 562
| | 445919
| no-pdf-link | 168035
| success | 140875
| gateway-timeout | 31809
| blocked-cookie | 26431
| terminal-bad-status | 25625
| link-loop | 19006
| spn2-error:job-failed | 13962
| redirect-loop | 12512
| wrong-mimetype | 2302
| spn2-error | 1689
| too-many-redirects | 1203
| bad-redirect | 732
| cdx-error | 539
| not-found | 420
| spn2-error:no-status | 256
(419 rows)
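The blank cells in the table above come from `GROUP BY CUBE (domain, status)`: a row with a blank status is the total for that domain, a row with a blank domain is the total for that status, and the row with both blank is the grand total. A minimal Python sketch of the same aggregation, using made-up rows and a hypothetical `cube_counts` helper (an illustration only, not how the database computes it):

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Key = Tuple[Optional[str], Optional[str]]


def cube_counts(rows: Iterable[Tuple[str, str]]) -> "Counter[Key]":
    """Count (domain, status) pairs the way GROUP BY CUBE (domain, status) groups them."""
    counts: "Counter[Key]" = Counter()
    for domain, status in rows:
        counts[(domain, status)] += 1  # one domain, one status
        counts[(domain, None)] += 1    # per-domain total (blank status column)
        counts[(None, status)] += 1    # per-status total (blank domain column)
        counts[(None, None)] += 1      # grand total (both columns blank)
    return counts


# Made-up example rows, not taken from the ingest database.
example_rows = [
    ("a.example", "success"),
    ("a.example", "no-pdf-link"),
    ("b.example", "success"),
]
for key, count in sorted(cube_counts(example_rows).items(), key=str):
    print(key, count)
```

Because the outer query keeps only rows with `count > 200`, the status rows shown for a domain may not add up to that domain's total row.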
Get random subsets by terminal domain:
\x auto
SELECT
ingest_request.link_source_id AS link_source_id,
ingest_request.base_url as base_url ,
ingest_file_result.terminal_url as terminal_url
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_request.created >= NOW() - '30 day'::INTERVAL
AND ingest_request.ingest_type = 'pdf'
AND ingest_request.ingest_request_source = 'fatcat-changelog'
AND ingest_file_result.status = 'no-pdf-link'
AND ingest_file_result.terminal_url LIKE '%//DOMAIN/%'
ORDER BY random()
LIMIT 5;
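`DOMAIN` in the query above is a placeholder to be replaced with each domain under investigation. A small sketch of building the `LIKE` pattern for one domain. The helper name `domain_like_pattern` is hypothetical, and the escaping of `%`, `_` and backslash is an addition here (the original query substitutes the domain as-is), assuming PostgreSQL's default backslash escape character for `LIKE`:

```python
def domain_like_pattern(domain: str) -> str:
    """Build the '%//DOMAIN/%' LIKE pattern, escaping LIKE wildcards in the domain."""
    escaped = (
        domain.replace("\\", "\\\\")  # escape the escape character first
        .replace("%", "\\%")          # literal percent sign
        .replace("_", "\\_")          # literal underscore
    )
    return f"%//{escaped}/%"


print(domain_like_pattern("www.isca-speech.org"))
```

The pattern requires a `/` after the domain, so a terminal URL with no path at all would not match.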
## acervus.unicamp.br
Previously flagged as messy (2021-05_daily_improvements.md)
## cas.columbia.edu
-[ RECORD 1 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7916/d8-2ety-qm51
base_url | https://doi.org/10.7916/d8-2ety-qm51
terminal_url | https://cas.columbia.edu/cas/login?TARGET=https%3A%2F%2Fdlc.library.columbia.edu%2Fusers%2Fauth%2Fsaml%2Fcallback
-[ RECORD 2 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7916/d8-0zf6-d167
base_url | https://doi.org/10.7916/d8-0zf6-d167
terminal_url | https://cas.columbia.edu/cas/login?TARGET=https%3A%2F%2Fdlc.library.columbia.edu%2Fusers%2Fauth%2Fsaml%2Fcallback
-[ RECORD 3 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7916/d8-k6ha-sn43
base_url | https://doi.org/10.7916/d8-k6ha-sn43
terminal_url | https://cas.columbia.edu/cas/login?TARGET=https%3A%2F%2Fdlc.library.columbia.edu%2Fusers%2Fauth%2Fsaml%2Fcallback
-[ RECORD 4 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7916/d8-bj6t-eb07
base_url | https://doi.org/10.7916/d8-bj6t-eb07
terminal_url | https://cas.columbia.edu/cas/login?TARGET=https%3A%2F%2Fdlc.library.columbia.edu%2Fusers%2Fauth%2Fsaml%2Fcallback
-[ RECORD 5 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7916/d8-xjac-j502
base_url | https://doi.org/10.7916/d8-xjac-j502
terminal_url | https://cas.columbia.edu/cas/login?TARGET=https%3A%2F%2Fdlc.library.columbia.edu%2Fusers%2Fauth%2Fsaml%2Fcallback
These are not public: every terminal URL is the same Columbia CAS login page (a login wall).
DONE: '/login?TARGET=' as a login wall pattern
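A minimal sketch of what such a pattern check could look like. This is an assumption about the general shape, not the actual sandcrawler code: the names `LOGINWALL_URL_PATTERNS` and `looks_like_loginwall` are hypothetical, and only the `/login?TARGET=` substring comes from the notes above.

```python
from typing import List

# Hypothetical pattern list; only '/login?TARGET=' is taken from these notes.
LOGINWALL_URL_PATTERNS: List[str] = ["/login?TARGET="]


def looks_like_loginwall(terminal_url: str) -> bool:
    """Return True if the terminal URL contains a known login-wall substring."""
    return any(pattern in terminal_url for pattern in LOGINWALL_URL_PATTERNS)


url = (
    "https://cas.columbia.edu/cas/login?TARGET="
    "https%3A%2F%2Fdlc.library.columbia.edu%2Fusers%2Fauth%2Fsaml%2Fcallback"
)
print(looks_like_loginwall(url))
```

A plain substring match keeps the rule cheap to evaluate, and it could record a more specific status than `no-pdf-link` for these results.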
## doi.ala.org.au
Previously flagged as dataset repository; datacite metadata is wrong. (2021-05_daily_improvements.md)
NOTE: look at ingesting datasets
## www.isca-speech.org
-[ RECORD 1 ]--+----------------------------------------------------------------------------------
link_source_id | 10.21437/interspeech.2014-84
base_url | https://doi.org/10.21437/interspeech.2014-84
terminal_url | https://www.isca-speech.org/archive/interspeech_2014/li14b_interspeech.html
-[ RECORD 2 ]--+----------------------------------------------------------------------------------
link_source_id | 10.21437/interspeech.2004-319
base_url | https://doi.org/10.21437/interspeech.2004-319
terminal_url | https://www.isca-speech.org/archive/interspeech_2004/delcroix04_interspeech.html
-[ RECORD 3 ]--+----------------------------------------------------------------------------------
link_source_id | 10.21437/interspeech.2006-372
base_url | https://doi.org/10.21437/interspeech.2006-372
terminal_url | https://www.isca-speech.org/archive/interspeech_2006/lei06c_interspeech.html
-[ RECORD 4 ]--+----------------------------------------------------------------------------------
link_source_id | 10.21437/interspeech.2015-588
base_url | https://doi.org/10.21437/interspeech.2015-588
terminal_url | https://www.isca-speech.org/archive/interspeech_2015/polzehl15b_interspeech.html
-[ RECORD 5 ]--+----------------------------------------------------------------------------------
link_source_id | 10.21437/interspeech.2006-468
base_url | https://doi.org/10.21437/interspeech.2006-468
terminal_url | https://www.isca-speech.org/archive/interspeech_2006/chitturi06b_interspeech.html
Bespoke site. Added rule to sandcrawler.
NOTE: re-ingest/recrawl all isca-speech.org no-pdf-link terminal URLs (fatcat-ingest?)
## www.morressier.com
-[ RECORD 1 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1115/1.0002858v
base_url | https://doi.org/10.1115/1.0002858v
terminal_url | https://www.morressier.com/article/development-new-single-highdensity-heatflux-gauges-unsteady-heat-transfer-measurements-rotating-transonic-turbine/60f162805d86378f03b49af5
-[ RECORD 2 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1115/1.0003896v
base_url | https://doi.org/10.1115/1.0003896v
terminal_url | https://www.morressier.com/article/experimental-investigation-proton-exchange-membrane-fuel-cell-platinum-nafion-along-inplane-direction/60f16d555d86378f03b50038
-[ RECORD 3 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1115/1.0004476v
base_url | https://doi.org/10.1115/1.0004476v
terminal_url | https://www.morressier.com/article/effect-air-release-agents-performance-results-fabric-lined-bushings/60f16d585d86378f03b502d5
-[ RECORD 4 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1115/1.0001286v
base_url | https://doi.org/10.1115/1.0001286v
terminal_url | https://www.morressier.com/article/development-verification-modelling-practice-cfd-calculations-obtain-current-loads-fpso/60f15d3fe537565438d70ece
-[ RECORD 5 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1115/1.0000315v
base_url | https://doi.org/10.1115/1.0000315v
terminal_url | https://www.morressier.com/article/fire-event-analysis-fire-frequency-estimation-japanese-nuclear-power-plant/60f15a6f5d86378f03b43874
Many of these seem to be presentations, as both video and slides. PDFs seem broken though.
NOTE: add to list of interesting rich media to crawl/preserve (video+slides+data)
## www.oecd-ilibrary.org
Paywall (2021-05_daily_improvements.md)
## www.rina.org.uk
-[ RECORD 1 ]--+-------------------------------------------------------
link_source_id | 10.3940/rina.ws.2002.10
base_url | https://doi.org/10.3940/rina.ws.2002.10
terminal_url | https://www.rina.org.uk/showproducts.html?product=4116
-[ RECORD 2 ]--+-------------------------------------------------------
link_source_id | 10.3940/rina.pass.2003.16
base_url | https://doi.org/10.3940/rina.pass.2003.16
terminal_url | https://www.rina.org.uk/showproducts.html?product=3566
-[ RECORD 3 ]--+-------------------------------------------------------
link_source_id | 10.3940/rina.icsotin.2013.15
base_url | https://doi.org/10.3940/rina.icsotin.2013.15
terminal_url | https://www.rina.org.uk/showproducts.html?product=8017
-[ RECORD 4 ]--+-------------------------------------------------------
link_source_id | 10.3940/rina.wfa.2010.23
base_url | https://doi.org/10.3940/rina.wfa.2010.23
terminal_url | https://www.rina.org.uk/showproducts.html?product=8177
-[ RECORD 5 ]--+-------------------------------------------------------
link_source_id | 10.3940/rina.icsotin15.2015.01
base_url | https://doi.org/10.3940/rina.icsotin15.2015.01
terminal_url | https://www.rina.org.uk/showproducts.html?product=7883
Site is broken in some way
## www.sciencedirect.com
-[ RECORD 1 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1016/j.jhlste.2021.100332
base_url | https://doi.org/10.1016/j.jhlste.2021.100332
terminal_url | https://www.sciencedirect.com/science/article/abs/pii/S1473837621000332
-[ RECORD 2 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1016/j.hazadv.2021.100006
base_url | https://doi.org/10.1016/j.hazadv.2021.100006
terminal_url | https://www.sciencedirect.com/science/article/pii/S2772416621000061/pdfft?md5=e51bfd495bb53073c7a379d25cb11a32&pid=1-s2.0-S2772416621000061-main.pdf
-[ RECORD 3 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1016/b978-0-12-822844-9.00009-8
base_url | https://doi.org/10.1016/b978-0-12-822844-9.00009-8
terminal_url | https://www.sciencedirect.com/science/article/pii/B9780128228449000098
-[ RECORD 4 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1016/j.colcom.2021.100490
base_url | https://doi.org/10.1016/j.colcom.2021.100490
terminal_url | https://www.sciencedirect.com/science/article/abs/pii/S2215038221001308
-[ RECORD 5 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1016/b978-0-323-85245-6.00012-6
base_url | https://doi.org/10.1016/b978-0-323-85245-6.00012-6
terminal_url | https://www.sciencedirect.com/science/article/pii/B9780323852456000126
These `no-pdf-link` results seem to simply not be OA, which is expected for much of the
domain.
## repository.dri.ie
link_source_id | base_url | terminal_url
-----------------------+---------------------------------------+---------------------------------------------
10.7486/dri.t148v5941 | https://doi.org/10.7486/dri.t148v5941 | https://repository.dri.ie/catalog/t148v5941
10.7486/dri.2z119c98f | https://doi.org/10.7486/dri.2z119c98f | https://repository.dri.ie/catalog/2z119c98f
10.7486/dri.qf8621102 | https://doi.org/10.7486/dri.qf8621102 | https://repository.dri.ie/catalog/qf8621102
10.7486/dri.js95m457t | https://doi.org/10.7486/dri.js95m457t | https://repository.dri.ie/catalog/js95m457t
10.7486/dri.c534vb726 | https://doi.org/10.7486/dri.c534vb726 | https://repository.dri.ie/catalog/c534vb726
"Digital repository of Ireland"
Historical scanned content. Bespoke site. Fixed.
NOTE: recrawl/retry this domain
## www.frontiersin.org
-[ RECORD 1 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3389/978-2-88971-147-5
base_url | https://doi.org/10.3389/978-2-88971-147-5
terminal_url | https://www.frontiersin.org/research-topics/9081/neuroimaging-approaches-to-the-study-of-tinnitus-and-hyperacusis
-[ RECORD 2 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3389/fnins.2021.722592
base_url | https://doi.org/10.3389/fnins.2021.722592
terminal_url | https://www.frontiersin.org/articles/10.3389/fnins.2021.722592/full
-[ RECORD 3 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3389/fcell.2021.683209
base_url | https://doi.org/10.3389/fcell.2021.683209
terminal_url | https://www.frontiersin.org/articles/10.3389/fcell.2021.683209/full
-[ RECORD 4 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3389/fmicb.2021.692474
base_url | https://doi.org/10.3389/fmicb.2021.692474
terminal_url | https://www.frontiersin.org/articles/10.3389/fmicb.2021.692474/full
-[ RECORD 5 ]--+------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3389/fneur.2021.676527
base_url | https://doi.org/10.3389/fneur.2021.676527
terminal_url | https://www.frontiersin.org/articles/10.3389/fneur.2021.676527/full
All the `/research-topics/` URLs are out of scope.
NOTE: recrawl missing frontiersin.org articles for PDFs
NOTE: recrawl missing frontiersin.org articles for XML (?)
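A minimal sketch of the scope check implied above, assuming we only want to drop `/research-topics/` landing pages and leave every other URL alone. The function name `frontiers_in_scope` is hypothetical, not an existing sandcrawler helper.

```python
from urllib.parse import urlparse


def frontiers_in_scope(terminal_url: str) -> bool:
    """Return False for Frontiers "research topic" landing pages, True otherwise."""
    parsed = urlparse(terminal_url)
    if parsed.hostname != "www.frontiersin.org":
        # This check only has an opinion about frontiersin.org URLs.
        return True
    return not parsed.path.startswith("/research-topics/")
```

Applied to the sample above, record 1 would be skipped and records 2 through 5 (the `/articles/.../full` URLs) would stay in the recrawl set.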
-------
## direct.mit.edu
Previously "not available" (2021-05_daily_improvements.md)
## figshare.com
-[ RECORD 1 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.6084/m9.figshare.15052236.v6
base_url | https://doi.org/10.6084/m9.figshare.15052236.v6
terminal_url | https://figshare.com/articles/software/RCL-tree_rar/15052236/6
-[ RECORD 2 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.6084/m9.figshare.14907846.v5
base_url | https://doi.org/10.6084/m9.figshare.14907846.v5
terminal_url | https://figshare.com/articles/book/Conservation_of_Limestone_Ecosystems_of_Malaysia_Part_I_Acknowledgements_Methodology_Overview_of_limestone_outcrops_in_Malaysia_References_Detailed_information_on_limestone_outcrops_of_the_states_Johor_Negeri_Sembilan_Terengganu_Selangor_Pe/14907846/5
-[ RECORD 3 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.6084/m9.figshare.15157614.v1
base_url | https://doi.org/10.6084/m9.figshare.15157614.v1
terminal_url | https://figshare.com/articles/software/code_for_NN-A72265C/15157614/1
-[ RECORD 4 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.6084/m9.figshare.15172926.v1
base_url | https://doi.org/10.6084/m9.figshare.15172926.v1
terminal_url | https://figshare.com/articles/preprint/History_of_the_internet/15172926/1
-[ RECORD 5 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.6084/m9.figshare.16532574.v1
base_url | https://doi.org/10.6084/m9.figshare.16532574.v1
terminal_url | https://figshare.com/articles/media/Helen_McConnell_How_many_trees_do_you_think_you_have_planted_/16532574/1
NOTE: can determine the content type (software, book, preprint, media, etc) from the redirect URL, I guess. This is helpful for ingest!
Could also potentially correct fatcat release_type using this info.
We seem to be getting the ones we can (eg, papers) just fine
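A rough sketch of reading the content type out of the redirect URL, assuming the path shape seen in the samples above (`/articles/<type>/<slug>/<id>/<version>`). The function name is hypothetical, and any mapping from these strings to fatcat `release_type` values is deliberately left out because it is not established here.

```python
from typing import Optional
from urllib.parse import urlparse


def figshare_content_type(terminal_url: str) -> Optional[str]:
    """Return the type segment of a figshare article URL, or None if not recognized."""
    parsed = urlparse(terminal_url)
    if parsed.hostname != "figshare.com":
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Assumed shape: articles/<type>/<slug>/<id>[/<version>]
    if len(parts) >= 3 and parts[0] == "articles":
        return parts[1]
    return None
```

For the five sampled records this would yield `software`, `book`, `software`, `preprint`, and `media`, which could be used to skip non-paper content before attempting PDF ingest.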
## hkvalidate.perfdrive.com
Should be skipping/bailing on this domain, but for some reason we are not.
-[ RECORD 1 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3847/1538-4357/ac05cc
base_url | https://doi.org/10.3847/1538-4357/ac05cc
terminal_url | https://hkvalidate.perfdrive.com/?ssa=1716a049-aeaa-4a89-8f82-bd733adaa2e7&ssb=43981203877&ssc=https%3A%2F%2Fiopscience.iop.org%2Farticle%2F10.3847%2F1538-4357%2Fac05cc&ssi=0774dd12-8427-4e27-a2ac-759c8cc2ec0e&ssk=support@shieldsquare.com&ssm=07370915269044035109047683305266&ssn=e69c743cc3d66619f960f924b562160d637e8d7f1b0f-d3bb-44d4-b075ed&sso=75a8bd85-4a097fb40f99bfb9c97b0a4ca0a38fd6d79513a466e82cc7&ssp=92054607321628531005162856888275586&ssq=33809984098158010864140981653938424553916&ssr=MjA3LjI0MS4yMjUuMTM5&sst=Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/74.0.3729.169%20Safari/537.36&ssv=&ssw=
-[ RECORD 2 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3847/1538-4357/ac0429
base_url | https://doi.org/10.3847/1538-4357/ac0429
terminal_url | https://hkvalidate.perfdrive.com/?ssa=12bca70d-0af4-4241-9c9b-384befd96a88&ssb=92559232428&ssc=https%3A%2F%2Fiopscience.iop.org%2Farticle%2F10.3847%2F1538-4357%2Fac0429&ssi=cff72ab0-8427-4acd-a0e7-db1b04cf7ce7&ssk=support@shieldsquare.com&ssm=27895673282814430105287068829605&ssn=9af36a8e10efd239c9367a2f31dde500f7455c4d5f45-bf11-4b99-ad29ea&sso=26bd22d2-b23e1bd9558f2fd9ed0768ef1acecb24715d1d463328a229&ssp=16502500621628222613162823304820671&ssq=11469693950387070477339503456478590533604&ssr=MjA3LjI0MS4yMjUuMTYw&sst=Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/74.0.3729.169%20Safari/537.36&ssv=&ssw=
-[ RECORD 3 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.1149/1945-7111/ac1a85
base_url | https://doi.org/10.1149/1945-7111/ac1a85
terminal_url | https://hkvalidate.perfdrive.com/?ssa=b0fef51a-0f44-476e-b951-3341bde6aa67&ssb=84929220393&ssc=https%3A%2F%2Fiopscience.iop.org%2Farticle%2F10.1149%2F1945-7111%2Fac1a85&ssi=48c05577-8427-4421-acd3-735ca29a46e6&ssk=support@shieldsquare.com&ssm=81129482524077974103852241068134&ssn=cf6c261d2b20d518b2ebe57e40ffaec9ab4cd1955dcb-7877-4f5b-bc3b1e&sso=1d196cae-6850f1ed8143e460f2bfbb61a8ae15cfe6b53d3bcdc528ca&ssp=99289867941628195224162819241830491&ssq=16897595632212421273956322948987630170313&ssr=MjA3LjI0MS4yMjUuMjM2&sst=Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/74.0.3729.169%20Safari/537.36&ssv=&ssw=
-[ RECORD 4 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.35848/1882-0786/ac1b0d
base_url | https://doi.org/10.35848/1882-0786/ac1b0d
terminal_url | https://hkvalidate.perfdrive.com/?ssa=6debdd23-c46b-4b40-b73c-d5540f04454e&ssb=95627212532&ssc=https%3A%2F%2Fiopscience.iop.org%2Farticle%2F10.35848%2F1882-0786%2Fac1b0d&ssi=78b34ff9-8427-4d07-a0db-78a3aa2c7332&ssk=support@shieldsquare.com&ssm=54055111549093989106852695053789&ssn=cb51949e15a02cb99a8d0b57c4d06327b72e8d5c87a8-d006-4ffa-939ffb&sso=1b7fd62d-8107746fe28fca252fd45ffa403937e272bf75b452b68d4a&ssp=77377533171628212164162820021422494&ssq=02679025218797637682252187852000657274192&ssr=MjA3LjI0MS4yMzMuMTIx&sst=Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/74.0.3729.169%20Safari/537.36&ssv=&ssw=
-[ RECORD 5 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.3847/1538-4357/ac05ba
base_url | https://doi.org/10.3847/1538-4357/ac05ba
terminal_url | https://hkvalidate.perfdrive.com/?ssa=f127eb3d-6a05-459d-97f2-499715c04b13&ssb=06802230353&ssc=https%3A%2F%2Fiopscience.iop.org%2Farticle%2F10.3847%2F1538-4357%2Fac05ba&ssi=8d087719-8427-4046-91fb-5e96af401560&ssk=support@shieldsquare.com&ssm=21056861072205974105064006574997&ssn=d05a73cff6d9af57acd6e2c366e716176752e1164d39-b9a7-408c-837d11&sso=d3f38d1e-a562a19195042d7e471a5e4fab03b6ca16ff1711c7c61804&ssp=68781137401628744693162877909483738&ssq=79454859841502433261398415426689546750534&ssr=MjA3LjI0MS4yMzIuMTg5&sst=Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/74.0.3729.169%20Safari/537.36&ssv=&ssw=
Was failing to re-check against the blocklist at the end of attempts.
Could retry all of these to update their status, but it is probably not worth it.
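A sketch of the kind of final check described above, assuming the blocklist is a plain set of domains (the `BLOCKED_DOMAINS` contents and both function names are hypothetical, not the actual sandcrawler code). It also shows that the originally requested publisher URL is recoverable from the `ssc` query parameter visible in the sampled terminal URLs.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical blocklist contents, for illustration only.
BLOCKED_DOMAINS = {"hkvalidate.perfdrive.com"}


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def original_target(url: str) -> Optional[str]:
    """Return the percent-decoded `ssc` parameter (the pre-redirect URL), if present."""
    values = parse_qs(urlparse(url).query).get("ssc")
    return values[0] if values else None
```

The point is that `is_blocked()` needs to run on the terminal URL after the last redirect hop, not only on the initial request URL.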
## jov.arvojournals.org
link_source_id | base_url | terminal_url
-----------------------+---------------------------------------+-------------------------------------------------------------
10.1167/jov.21.9.1933 | https://doi.org/10.1167/jov.21.9.1933 | https://jov.arvojournals.org/article.aspx?articleid=2777021
10.1167/jov.21.9.2910 | https://doi.org/10.1167/jov.21.9.2910 | https://jov.arvojournals.org/article.aspx?articleid=2777561
10.1167/jov.21.9.1895 | https://doi.org/10.1167/jov.21.9.1895 | https://jov.arvojournals.org/article.aspx?articleid=2777057
10.1167/jov.21.9.2662 | https://doi.org/10.1167/jov.21.9.2662 | https://jov.arvojournals.org/article.aspx?articleid=2777793
10.1167/jov.21.9.2246 | https://doi.org/10.1167/jov.21.9.2246 | https://jov.arvojournals.org/article.aspx?articleid=2777441
These seem to just not be published/available yet.
But they also use watermark.silverchair.com
NOTE: re-crawl (force-retry?) all non-recent papers with fatcat-ingest
NOTE: for watermark.silverchair.com terminal bad-status, re-crawl from initial URL (base_url) using heritrix
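A small sketch of selecting seed URLs for that heritrix recrawl, assuming ingest result rows are available as dicts with `base_url`, `terminal_url`, and `status` keys (the `status` key name and the function name are assumptions; `terminal-bad-status` is the status string used elsewhere in these notes).

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def silverchair_recrawl_urls(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Collect de-duplicated base_urls whose terminal hop failed on watermark.silverchair.com."""
    seen = set()
    seeds = []
    for row in rows:
        host = urlparse(row["terminal_url"]).hostname
        if host != "watermark.silverchair.com":
            continue
        if row["status"] != "terminal-bad-status":
            continue
        if row["base_url"] not in seen:
            seen.add(row["base_url"])
            seeds.append(row["base_url"])
    return seeds
```

Seeding from `base_url` rather than the terminal URL presumably matters because the watermark.silverchair.com links look like short-lived tokens, so only a crawl that follows the full redirect chain would get a fresh one.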
## kiss.kstudy.com
Previously unable to download (2021-05_daily_improvements.md)
## open.library.ubc.ca
link_source_id | base_url | terminal_url
--------------------+------------------------------------+----------------------------------------------------------------------------------
10.14288/1.0400664 | https://doi.org/10.14288/1.0400664 | https://open.library.ubc.ca/collections/bcnewspapers/nelsondaily/items/1.0400664
10.14288/1.0401189 | https://doi.org/10.14288/1.0401189 | https://open.library.ubc.ca/collections/bcnewspapers/nelsondaily/items/1.0401189
10.14288/1.0401487 | https://doi.org/10.14288/1.0401487 | https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0401487
10.14288/1.0400994 | https://doi.org/10.14288/1.0400994 | https://open.library.ubc.ca/collections/bcnewspapers/nelsondaily/items/1.0400994
10.14288/1.0401312 | https://doi.org/10.14288/1.0401312 | https://open.library.ubc.ca/collections/bcnewspapers/nelsondaily/items/1.0401312
Historical newspapers, out of scope?
Video content:
https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0401487
Another video: https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0400764
NOTE: add video link to alternative content demo ingest: https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0400764
NOTE: handle this related withdrawn notice? https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0401512
## panor.ru
link_source_id | base_url | terminal_url
-------------------------+-----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.33920/med-14-2108-06 | https://doi.org/10.33920/med-14-2108-06 | https://panor.ru/articles/otsenka-dinamiki-pokazateley-morfofunktsionalnykh-kharakteristik-kozhi-upatsientov-s-spr-pod-vliyaniem-kompleksnoy-fototerapii/66351.html
10.33920/nik-02-2105-01 | https://doi.org/10.33920/nik-02-2105-01 | https://panor.ru/articles/innovatsionnost-obrazovatelnykh-tekhnologiy-kak-istoricheski-oposredovannyy-fenomen/65995.html
10.33920/pro-1-2101-10 | https://doi.org/10.33920/pro-1-2101-10 | https://panor.ru/articles/obespechenie-bezopasnosti-na-promyshlennykh-predpriyatiyakh-s-pomoshchyu-sredstv-individualnoy-zashchity/66299.html
10.33920/sel-4-2008-04 | https://doi.org/10.33920/sel-4-2008-04 | https://panor.ru/articles/osobennosti-regulirovaniya-zemelnykh-otnosheniy-na-prigranichnykh-territoriyakh-rossiyskoy-federatsii/66541.html
10.33920/pro-2-2104-03 | https://doi.org/10.33920/pro-2-2104-03 | https://panor.ru/articles/organizatsiya-samorazvivayushchegosya-proizvodstva-v-realnykh-usloviyakh/65054.html
"The full version of the article is available only to subscribers of the journal"
Paywall
## peerj.com
Previously: this is HTML of reviews (2021-05_daily_improvements.md)
NOTE: Should be HTML ingest, possibly special case scope
## publons.com
Previously: this is HTML (2021-05_daily_improvements.md)
NOTE: Should be HTML ingest, possibly special case scope (length of works)
## stm.bookpi.org
link_source_id | base_url | terminal_url
-----------------------------+---------------------------------------------+----------------------------------------------------
10.9734/bpi/nfmmr/v7/11547d | https://doi.org/10.9734/bpi/nfmmr/v7/11547d | https://stm.bookpi.org/NFMMR-V7/article/view/3231
10.9734/bpi/ecafs/v1/9773d | https://doi.org/10.9734/bpi/ecafs/v1/9773d | https://stm.bookpi.org/ECAFS-V1/article/view/3096
10.9734/bpi/mpebm/v5/3391f | https://doi.org/10.9734/bpi/mpebm/v5/3391f | https://stm.bookpi.org/MPEBM-V5/article/view/3330
10.9734/bpi/castr/v13/3282f | https://doi.org/10.9734/bpi/castr/v13/3282f | https://stm.bookpi.org/CASTR-V13/article/view/2810
10.9734/bpi/hmms/v13 | https://doi.org/10.9734/bpi/hmms/v13 | https://stm.bookpi.org/HMMS-V13/issue/view/274
These are... just abstracts of articles within a book? Weird. Maybe sketchy? DOIs via Crossref
## www.cabi.org
link_source_id | base_url | terminal_url
--------------------------+------------------------------------------+----------------------------------------------------
10.1079/dfb/20133414742 | https://doi.org/10.1079/dfb/20133414742 | https://www.cabi.org/cabreviews/review/20133414742
10.1079/dmpd/20056500471 | https://doi.org/10.1079/dmpd/20056500471 | https://www.cabi.org/cabreviews/review/20056500471
10.1079/dmpp/20056600544 | https://doi.org/10.1079/dmpp/20056600544 | https://www.cabi.org/cabreviews/review/20056600544
10.1079/dmpd/20056500117 | https://doi.org/10.1079/dmpd/20056500117 | https://www.cabi.org/cabreviews/review/20056500117
10.1079/dmpp20056600337 | https://doi.org/10.1079/dmpp20056600337 | https://www.cabi.org/cabreviews/review/20056600337
Reviews? But just abstracts?
## www.cureus.com
-[ RECORD 1 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7759/cureus.17547
base_url | https://doi.org/10.7759/cureus.17547
terminal_url | https://www.cureus.com/articles/69542-tramadol-induced-jerks
-[ RECORD 2 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7759/cureus.16867
base_url | https://doi.org/10.7759/cureus.16867
terminal_url | https://www.cureus.com/articles/66793-advanced-squamous-cell-carcinoma-of-gall-bladder-masquerading-as-liver-abscess-with-review-of-literature-review-on-advanced-biliary-tract-cancer
-[ RECORD 3 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7759/cureus.17425
base_url | https://doi.org/10.7759/cureus.17425
terminal_url | https://www.cureus.com/articles/67438-attitudes-and-knowledge-of-medical-students-towards-healthcare-for-lesbian-gay-bisexual-and-transgender-seniors-impact-of-a-case-based-discussion-with-facilitators-from-the-community
-[ RECORD 4 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7759/cureus.17313
base_url | https://doi.org/10.7759/cureus.17313
terminal_url | https://www.cureus.com/articles/67258-utilizing-google-trends-to-track-online-interest-in-elective-hand-surgery-during-the-covid-19-pandemic
-[ RECORD 5 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
link_source_id | 10.7759/cureus.16943
base_url | https://doi.org/10.7759/cureus.16943
terminal_url | https://www.cureus.com/articles/19364-small-bowel-obstruction-a-rare-presentation-of-the-inferior-pancreaticoduodenal-artery-pseudoaneurysm-bleed
Ugh, stupid "email to get PDF" gate, but ingest seems to work anyway?
NOTE: re-crawl/re-ingest all (eg, fatcat-ingest or similar)
## www.e-manuscripta.ch
link_source_id | base_url | terminal_url
------------------------------+----------------------------------------------+-------------------------------------------------------------------
10.7891/e-manuscripta-114031 | https://doi.org/10.7891/e-manuscripta-114031 | https://www.e-manuscripta.ch/swa/doi/10.7891/e-manuscripta-114031
10.7891/e-manuscripta-112064 | https://doi.org/10.7891/e-manuscripta-112064 | https://www.e-manuscripta.ch/zut/doi/10.7891/e-manuscripta-112064
10.7891/e-manuscripta-112176 | https://doi.org/10.7891/e-manuscripta-112176 | https://www.e-manuscripta.ch/zut/doi/10.7891/e-manuscripta-112176
10.7891/e-manuscripta-115200 | https://doi.org/10.7891/e-manuscripta-115200 | https://www.e-manuscripta.ch/swa/doi/10.7891/e-manuscripta-115200
10.7891/e-manuscripta-114008 | https://doi.org/10.7891/e-manuscripta-114008 | https://www.e-manuscripta.ch/swa/doi/10.7891/e-manuscripta-114008
Historical docs, single pages, but do have full PDF downloads.
NOTE: re-ingest
## www.inderscience.com
Previously: paywall (2021-05_daily_improvements.md)
## www.un-ilibrary.org
link_source_id | base_url | terminal_url
----------------------------+--------------------------------------------+-------------------------------------------------------------
10.18356/9789210550307 | https://doi.org/10.18356/9789210550307 | https://www.un-ilibrary.org/content/books/9789210550307
10.18356/9789210586719c011 | https://doi.org/10.18356/9789210586719c011 | https://www.un-ilibrary.org/content/books/9789210586719c011
10.18356/9789210058575c014 | https://doi.org/10.18356/9789210058575c014 | https://www.un-ilibrary.org/content/books/9789210058575c014
10.18356/9789210550307c020 | https://doi.org/10.18356/9789210550307c020 | https://www.un-ilibrary.org/content/books/9789210550307c020
10.18356/9789213631423c005 | https://doi.org/10.18356/9789213631423c005 | https://www.un-ilibrary.org/content/books/9789213631423c005
Books and chapters. The site doesn't seem to offer actual downloads?
# Re-Ingest / Re-Crawl
Using fatcat-ingest helper tool.
- www.isca-speech.org doi_prefix:10.21437
doi:* doi_prefix:10.21437 in_ia:false
9,233
./fatcat_ingest.py --allow-non-oa query 'doi:* doi_prefix:10.21437' > /srv/fatcat/tasks/2021-09-03_ingest_isca.json
=> Counter({'ingest_request': 9221, 'elasticsearch_release': 9221, 'estimate': 9221})
- repository.dri.ie doi_prefix:10.7486
doi:* in_ia:false doi_prefix:10.7486
56,532
./fatcat_ingest.py --allow-non-oa query 'doi:* doi_prefix:10.7486' > /srv/fatcat/tasks/2021-09-03_ingest_dri.json
=> Counter({'ingest_request': 56532, 'elasticsearch_release': 56532, 'estimate': 56532})
- *.arvojournals.org doi_prefix:10.1167 (force recrawl if no-pdf-link)
25,598
many are meeting abstracts
./fatcat_ingest.py --allow-non-oa query doi_prefix:10.1167 > /srv/fatcat/tasks/2021-09-03_ingest_arvo.json
=> Counter({'ingest_request': 25598, 'elasticsearch_release': 25598, 'estimate': 25598})
- www.cureus.com doi_prefix:10.7759
1,537
./fatcat_ingest.py --allow-non-oa query doi_prefix:10.7759 > /srv/fatcat/tasks/2021-09-03_ingest_cureus.json
=> Counter({'ingest_request': 1535, 'elasticsearch_release': 1535, 'estimate': 1535})
- www.e-manuscripta.ch doi_prefix:10.7891 10.7891/e-manuscripta
110,945
TODO: all are marked 'unpublished', but that is actually probably right?
- www.frontiersin.org doi_prefix:10.3389 (both PDF and XML!)
doi:* in_ia:false doi_prefix:10.3389
212,370
doi:10.3389/conf.* => most seem to be just abstracts? how many like this?
container_id:kecnf6vtpngn7j2avgfpdyw5ym => "topics" (2.2k)
fatcat-cli search release 'doi:* in_ia:false doi_prefix:10.3389 !container_id:kecnf6vtpngn7j2avgfpdyw5ym' --index-json -n0 | jq '[.ident, .container_id, .doi] | @tsv' -r | rg -v 10.3389/conf | pv -l | gzip > frontiers_to_crawl.tsv.gz
=> 191k
but many might be components? this is actually kind of a mess
fatcat-cli search release 'doi:* in_ia:false doi_prefix:10.3389 !container_id:kecnf6vtpngn7j2avgfpdyw5ym !type:component stage:published' --index-json -n0 | jq '[.ident, .container_id, .doi] | @tsv' -r | rg -v 10.3389/conf | pv -l | gzip > frontiers_to_crawl.tsv.gz
=> 19.2k
./fatcat_ingest.py --allow-non-oa query 'doi:* in_ia:false doi_prefix:10.3389 !container_id:kecnf6vtpngn7j2avgfpdyw5ym !type:component stage:published' | rg -v 10.3389/conf > /srv/fatcat/tasks/2021-09-03_frontiers.json
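For reference, a Python sketch of the `jq ... | rg -v 10.3389/conf` step in the pipelines above, assuming each input line is one JSON document carrying the `ident`, `container_id`, and `doi` fields that the jq expression reads. The function name is hypothetical, and the exclusion is a plain substring match rather than the regex match `rg` performs.

```python
import json
from typing import Iterable, Iterator


def frontiers_rows(lines: Iterable[str], exclude: str = "10.3389/conf") -> Iterator[str]:
    """Yield ident/container_id/doi TSV rows, skipping conference-abstract DOIs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        doi = doc.get("doi") or ""
        if exclude in doi:
            continue
        # Missing or null fields become empty strings, as jq's @tsv does for null.
        yield "\t".join([doc.get("ident") or "", doc.get("container_id") or "", doi])
```

This could be fed from `sys.stdin` in place of the jq and rg stages; the gzip and `pv -l` steps are unchanged.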
# Remaining Tasks / Domains (TODO)
more complex crawling/content:
- add video link to alternative content demo ingest: https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0400764
- watermark.silverchair.com: if terminal-bad-status, then do recrawl via heritrix with base_url
- www.morressier.com: interesting site for rich web crawling/preservation (video+slides+data)
- doi.ala.org.au: possible dataset ingest source
- peerj.com, at least reviews, should be HTML ingest? or are some PDF?
- publons.com should be HTML ingest, possibly special case for scope
- frontiersin.org: any 'component' releases with PDF file are probably a metadata bug
other tasks:
- handle this related withdrawn notice? https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0401512
- push/deploy sandcrawler changes