aboutsummaryrefslogtreecommitdiffstats
path: root/notes/ingest/2020-03-04_mag.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2020-05-26 14:47:17 -0700
committerBryan Newbold <bnewbold@archive.org>2020-05-26 14:47:17 -0700
commit5dd8785d710cf7d067afdc691069bfa74406e06a (patch)
tree8ff16b25cee10f38127caf7fdb266d41fea12d83 /notes/ingest/2020-03-04_mag.md
parent4598ea9242d1001e473e6340342afea854868577 (diff)
downloadsandcrawler-5dd8785d710cf7d067afdc691069bfa74406e06a.tar.gz
sandcrawler-5dd8785d710cf7d067afdc691069bfa74406e06a.zip
ingests: normalize file names; commit updates
Diffstat (limited to 'notes/ingest/2020-03-04_mag.md')
-rw-r--r--notes/ingest/2020-03-04_mag.md576
1 files changed, 0 insertions, 576 deletions
diff --git a/notes/ingest/2020-03-04_mag.md b/notes/ingest/2020-03-04_mag.md
deleted file mode 100644
index 428ce05..0000000
--- a/notes/ingest/2020-03-04_mag.md
+++ /dev/null
@@ -1,576 +0,0 @@
-
-Rough plan:
-
-- run bulk and/or regular ingest requests for just those of AIT partners (200k?)
-- persist ingest requests (22 million or so)
-- run bulk ingest over 'no status' / 'no match' requests (aka, those not in unpaywall)
-- crawl those which are no-capture
-
-
-## Generate Requests
-
-Newer version of `mag_ingest_request.sh` script requires venv with urlcanon
-installed.
-
-Starting with the 2020-01-23 MAG dump, will generate a full ingest request set
-(including DOI `ext_id` when available), with any dominant domains removed (eg,
-arxiv.org):
-
- export LC_ALL=C
- cat PaperUrls_mag_url_doi.all.txt | rg -a -v arxiv.org | rg -a "://" | ./mag_ingest_request.py - --created-date 2020-01-23 | pv -l > ingest_requests_mag-2020-01-23.doi.json
- => previously 25.6M
- => 25.6M 2:29:43 [2.85k/s]
-
- export LC_ALL=C
- zcat PaperUrls_mag_url_pmid.txt.gz | rg -a -v arxiv.org | rg -a "://" | ./mag_ingest_request.py - --created-date 2020-01-23 --pmid | pv -l > ingest_requests_mag-2020-01-23.pmid.json
- => 4.3M 0:25:45 [2.78k/s]
-
- export LC_ALL=C
- cat ingest_requests_mag-2020-01-23.json | jq -r "[.base_url, .ext_ids.doi] | @tsv" | sort -u -S 4G > ingest_requests_mag-2020-01-23.full.seed_id
-
- zcat PaperUrls_PaperExtendedAttributes_pdf.txt.gz | wc -l
- => 6,504,907
-
- zcat PaperUrls_mag_url_pmid.txt.gz | wc -l
- => 4,369,832
-
- cat ingest_requests_mag-2020-01-23.json | jq .ext_ids.doi -r | rg -a -v '^null$' | wc -l
- => previously 15,707,405
- => 15,702,581
-
- cat ingest_requests_mag-2020-01-23.pmid.json | jq .base_url -r | rg ' ' | wc -l
- => 0
- URL encoding seems to be working
-
-## Persist Ingest Requests
-
-First pmid ingest requests, then the all/doi file. The reason to do this order
-is that the all/doi file will have some rows with no DOI (and thus no
-`ext_id`), while the PMID file will not.
-
- # small sample
- head /schnell/mag/20200123/ingest_requests_mag-2020-01-23.pmid.json | ./persist_tool.py ingest-request -
- Worker: Counter({'total': 10, 'skip-result-fields': 10})
- JSON lines pushed: Counter({'total': 10, 'pushed': 10})
-
- cat /schnell/mag/20200123/ingest_requests_mag-2020-01-23.pmid.json | ./persist_tool.py ingest-request -
- => 4.3M 0:16:46 [4.27k/s]
- Worker: Counter({'total': 4295026, 'insert-requests': 4241862, 'update-requests': 0})
- JSON lines pushed: Counter({'total': 4295026, 'pushed': 4295026})
- => hit a bug on first attempt, which is why total/insert results don't match
-
- cat /schnell/mag/20200123/ingest_requests_mag-2020-01-23.doi.json | ./persist_tool.py ingest-request -
- => 25.6M 2:21:54 [3.01k/s]
- Worker: Counter({'total': 25596559, 'insert-requests': 21348393, 'update-requests': 0})
- JSON lines pushed: Counter({'pushed': 25596559, 'total': 25596559})
-
-
-## Crawl/Dupe Status
-
- SELECT ingest_file_result.status, COUNT(*)
- FROM ingest_request
- LEFT JOIN ingest_file_result
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- GROUP BY status
- ORDER BY COUNT DESC
- LIMIT 20;
-
-After just PMID links:
-
- status | count
- ---------------------+---------
- | 3000115
- success | 1126881
- no-capture | 69459
- terminal-bad-status | 30259
- redirect-loop | 11656
- no-pdf-link | 2836
- wrong-mimetype | 1456
- link-loop | 1259
- wayback-error | 1232
- cdx-error | 932
- null-body | 85
- petabox-error | 50
- bad-redirect | 1
- (13 rows)
-
-After all links:
-
- SELECT COUNT(*)
- FROM ingest_request
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag';
- => 25596563
-
-
- status | count
- ---------------------+----------
- | 21130841
- success | 3915682
- no-capture | 391813
- terminal-bad-status | 76488
- redirect-loop | 44202
- wrong-mimetype | 16418
- no-pdf-link | 10995
- wayback-error | 3679
- cdx-error | 3414
- link-loop | 2098
- null-body | 709
- petabox-error | 221
- bad-gzip-encoding | 2
- bad-redirect | 1
- (14 rows)
-
-Somewhat more un-ingested than expected.
-
-Dump requests:
-
- COPY (
- SELECT row_to_json(ingest_request.*) FROM ingest_request
- LEFT JOIN ingest_file_result
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- AND ingest_file_result.status IS NULL
- ) TO '/grande/snapshots/mag_noingest_20200305.rows.json';
- => COPY 21,130,841
-
-Transform and shuf:
-
- ./scripts/ingestrequest_row2json.py /grande/snapshots/mag_noingest_20200305.rows.json | pv -l | shuf | gzip > /grande/snapshots/mag_noingest_20200305.shuf.json.gz
- => 21.1M 0:18:57 [18.6k/s]
-
-## Bulk Ingest Partner Output
-
-These are subsets of the full list from potential AIT-S partners; want to run
-these through the pipeline before the full batch. Duplication against the full
-batch should be minimal.
-
-Size:
-
- bnewbold@ia601101$ cat ingest_requests_mag-2020-01-23.cornell.json | jq .ext_ids.doi | rg -v '^null$' | wc -l
- 29007
- bnewbold@ia601101$ wc -l ingest_requests_mag-2020-01-23.cornell.json
- 34265 ingest_requests_mag-2020-01-23.cornell.json
-
-Test ingest:
-
- head -n200 ingest_requests_mag-2020-01-23.cornell.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-Full ingests:
-
- cat ingest_requests_mag-2020-01-23.cornell.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
- cat ingest_requests_mag-2020-01-23.alberta.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
- cat ingest_requests_mag-2020-01-23.columbia.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
- cat ingest_requests_mag-2020-01-23.emory.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
- cat ingest_requests_mag-2020-01-23.stanford.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-## Bulk Ingest
-
-Shard it into batches of roughly 1 million:
-
- cd /grande/snapshots/
- zcat /grande/snapshots/mag_noingest_20200305.shuf.json.gz | split -n r/20 -d - mag_noingest_20200305.ingest_request.split_ --additional-suffix=.json
-
-Add a single batch like:
-
- cat mag_noingest_20200305.ingest_request.split_00.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
- partner ingests (see above)
- => 2020-03-05 12:49: 118,396
- 1056543 mag_noingest_20200305.ingest_request.split_00.json
- => 2020-03-05 14:34: 1,055,224
- => check on stats/ratios; filter by ingest update time?
- 1056542 mag_noingest_20200305.ingest_request.split_01.json
- 1056542 mag_noingest_20200305.ingest_request.split_02.json
- 1056542 mag_noingest_20200305.ingest_request.split_03.json
- 1056542 mag_noingest_20200305.ingest_request.split_04.json
- 1056542 mag_noingest_20200305.ingest_request.split_05.json
- 1056542 mag_noingest_20200305.ingest_request.split_06.json
- 1056542 mag_noingest_20200305.ingest_request.split_07.json
- 1056542 mag_noingest_20200305.ingest_request.split_08.json
- 1056542 mag_noingest_20200305.ingest_request.split_09.json
- => 2020-03-05 18:04: 10,009,297
- => 2020-03-06 16:53: 6,553,946
- 1056542 mag_noingest_20200305.ingest_request.split_10.json
- 1056542 mag_noingest_20200305.ingest_request.split_11.json
- 1056542 mag_noingest_20200305.ingest_request.split_12.json
- 1056542 mag_noingest_20200305.ingest_request.split_13.json
- 1056542 mag_noingest_20200305.ingest_request.split_14.json
- 1056542 mag_noingest_20200305.ingest_request.split_15.json
- 1056542 mag_noingest_20200305.ingest_request.split_16.json
- 1056542 mag_noingest_20200305.ingest_request.split_17.json
- 1056542 mag_noingest_20200305.ingest_request.split_18.json
- 1056542 mag_noingest_20200305.ingest_request.split_19.json
- => 2020-03-06 16:59: 17,001,032
-
-Stats from bulk ingest:
-
- SELECT ingest_file_result.status, COUNT(*)
- FROM ingest_request
- LEFT JOIN ingest_file_result
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- GROUP BY status
- ORDER BY COUNT DESC
- LIMIT 20;
-
- status | count
- ---------------------+----------
- no-capture | 12237193
- success | 11991293
- no-pdf-link | 521691
- redirect-loop | 437192
- terminal-bad-status | 231181
- link-loop | 92633
- cdx-error | 33631
- wrong-mimetype | 28638
- wayback-error | 19651
- null-body | 2682
- petabox-error | 727
- | 47
- bad-redirect | 44
- bad-gzip-encoding | 7
- (14 rows)
-
-Failures by domain:
-
- SELECT domain, status, COUNT((domain, status))
- FROM (
- SELECT
- ingest_file_result.ingest_type,
- ingest_file_result.status,
- substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
- FROM ingest_file_result
- LEFT JOIN ingest_request
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_file_result.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- ) t1
- WHERE t1.domain != ''
- AND t1.status != 'success'
- AND t1.status != 'no-capture'
- GROUP BY domain, status
- ORDER BY COUNT DESC
- LIMIT 30;
-
- domain | status | count
- --------------------------------------+---------------------+--------
- dialnet.unirioja.es | redirect-loop | 240967
- onlinelibrary.wiley.com | no-pdf-link | 147696
- agupubs.onlinelibrary.wiley.com | no-pdf-link | 72639
- iopscience.iop.org | terminal-bad-status | 69591
- febs.onlinelibrary.wiley.com | no-pdf-link | 49874
- www.researchgate.net | redirect-loop | 42859
- journals.sagepub.com | no-pdf-link | 27448
- papers.ssrn.com | redirect-loop | 27328
- dialnet.unirioja.es | terminal-bad-status | 20320
- physoc.onlinelibrary.wiley.com | no-pdf-link | 20232
- science.sciencemag.org | link-loop | 17811
- espace.library.uq.edu.au | redirect-loop | 17185
- bpspubs.onlinelibrary.wiley.com | no-pdf-link | 15785
- obgyn.onlinelibrary.wiley.com | no-pdf-link | 15301
- anthrosource.onlinelibrary.wiley.com | no-pdf-link | 13746
- www.tandfonline.com | no-pdf-link | 13303
- aasldpubs.onlinelibrary.wiley.com | no-pdf-link | 11070
- link.springer.com | redirect-loop | 10594
- www.redalyc.org:9081 | no-pdf-link | 10515
- watermark.silverchair.com | terminal-bad-status | 9739
- www.bmj.com | link-loop | 9389
- www.repository.naturalis.nl | redirect-loop | 8213
- bjp.rcpsych.org | link-loop | 8045
- aslopubs.onlinelibrary.wiley.com | no-pdf-link | 7814
- nph.onlinelibrary.wiley.com | no-pdf-link | 7801
- iopscience.iop.org | redirect-loop | 7697
- journals.tubitak.gov.tr | wrong-mimetype | 7159
- www.biorxiv.org | wrong-mimetype | 7067
- www.erudit.org | redirect-loop | 6819
- besjournals.onlinelibrary.wiley.com | no-pdf-link | 6254
- (30 rows)
-
-Domains to follow-up (eg, sandcrawler ingest tests/tweaks):
-- dialnet.unirioja.es | redirect-loop | 240967
-- www.researchgate.net | redirect-loop | 42859
-- www.redalyc.org:9081 | no-pdf-link | 10515
-- www.repository.naturalis.nl | redirect-loop | 8213
-- bjp.rcpsych.org | link-loop | 8045
-- journals.tubitak.gov.tr | wrong-mimetype | 7159
-- www.erudit.org | redirect-loop | 6819
-
-The dialnet.unirioja.es ones may be worth re-crawling via heritrix?
-
-Top uncrawled domains:
-
- SELECT domain, status, COUNT((domain, status))
- FROM (
- SELECT
- ingest_file_result.ingest_type,
- ingest_file_result.status,
- substring(ingest_file_result.base_url FROM '[^/]+://([^/]*)') AS domain
- FROM ingest_file_result
- LEFT JOIN ingest_request
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_file_result.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- ) t1
- WHERE t1.domain != ''
- AND t1.status = 'no-capture'
- GROUP BY domain, status
- ORDER BY COUNT DESC
- LIMIT 30;
-
- domain | status | count
- ---------------------------------+------------+--------
- ieeexplore.ieee.org | no-capture | 957835
- link.springer.com | no-capture | 394121
- www.researchgate.net | no-capture | 376974
- cyberleninka.ru | no-capture | 376012
- iopscience.iop.org | no-capture | 348791
- papers.ssrn.com | no-capture | 286860
- dergipark.org.tr | no-capture | 217556
- dialnet.unirioja.es | no-capture | 214398
- academic.oup.com | no-capture | 212364
- www.tandfonline.com | no-capture | 148940
- journals.sagepub.com | no-capture | 144695
- www.papersearch.net | no-capture | 138986
- absimage.aps.org | no-capture | 111976
- apps.dtic.mil | no-capture | 106984
- www.cambridge.org | no-capture | 97533
- www.bmj.com | no-capture | 92437
- bioone.org | no-capture | 87573
- science.sciencemag.org | no-capture | 75723
- shodhganga.inflibnet.ac.in:8080 | no-capture | 75395
- www.jstor.org | no-capture | 73230
- works.bepress.com | no-capture | 68747
- www.scielo.org.co | no-capture | 59650
- hrcak.srce.hr | no-capture | 59332
- muse.jhu.edu | no-capture | 57828
- onlinelibrary.wiley.com | no-capture | 55621
- www.jbc.org | no-capture | 54608
- www.jstage.jst.go.jp | no-capture | 53631
- www.redalyc.org | no-capture | 50406
- lup.lub.lu.se | no-capture | 47469
- www.dtic.mil | no-capture | 41820
- (30 rows)
-
-## Heritrix Seedlist Generation
-
-Dump ingest requests (filtered for some domains that don't expect to crawl via
-heritrix):
-
- COPY (
- SELECT row_to_json(ingest_request.*) FROM ingest_request
- LEFT JOIN ingest_file_result
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- AND ingest_file_result.status = 'no-capture'
- AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
- AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
- AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
- AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
- AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
- AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
- AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
- ) TO '/grande/snapshots/mag_nocapture_20200313.rows.json';
- => COPY 11714199
-
- # in sandcrawler pipenv
- ./scripts/ingestrequest_row2json.py /grande/snapshots/mag_nocapture_20200313.rows.json > /grande/snapshots/mag_nocapture_20200313.json
-
-## Bulk Ingest of Heritrix Content
-
-Small sample:
-
- head -n 1000 mag_nocapture_20200313.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-Full run:
-
- cat mag_nocapture_20200313.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
- 2020-04-07 12:19 (pacific): 11,703,871
-
-## Post-bulk-ingest
-
-Around 2020-04-28, seems like main wave of bulk ingest is complete. Will need
-to re-try things like cdx-error.
-
-Current status:
-
- status | count
- -------------------------------+----------
- success | 18491799
- redirect-loop | 1968530
- no-capture | 1373657
- no-pdf-link | 1311842
- link-loop | 1296439
- terminal-bad-status | 627577
- cdx-error | 418278
- wrong-mimetype | 50141
- wayback-error | 37159
- petabox-error | 11249
- null-body | 6295
- gateway-timeout | 3051
- spn2-cdx-lookup-failure | 328
- spn2-error:invalid-url-syntax | 93
- bad-redirect | 75
- | 47
- invalid-host-resolution | 28
- spn2-error | 10
- bad-gzip-encoding | 7
- redirects-exceeded | 2
- (20 rows)
-
-Lots of cdx-error to retry.
-
-The no-capture links are probably a mix of domain-blocklist and things that
-failed in bulk mode. Will dump and re-attempt them:
-
-
- COPY (
- SELECT row_to_json(ingest_request.*) FROM ingest_request
- LEFT JOIN ingest_file_result
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- AND ingest_file_result.status = 'no-capture'
- AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
- AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
- AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
- AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
- AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
- AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
- AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
- ) TO '/grande/snapshots/mag_nocapture_20200420.rows.json';
- => 859849
-
-What domains are these?
-
- cat mag_nocapture_20200420.rows.json | jq .base_url -r | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n30
-
-Let's filter down more:
-
- cat mag_nocapture_20200420.rows.json | rg -v 'www.researchgate.net' | rg -v 'muse.jhu.edu' | rg -v 'www.omicsonline.org' | rg -v 'link.springer.com' | rg -v 'iopscience.iop.org' | rg -v 'ieeexplore.ieee.org' | shuf > mag_nocapture_20200420.rows.filtered.json
-
- wc -l mag_nocapture_20200420.rows.filtered.json
- 423085 mag_nocapture_20200420.rows.filtered.json
-
-Ok, enqueue!
-
- cat mag_nocapture_20200420.rows.filtered.json | shuf | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1
-
-## Final Stats
-
-... for this round of ingest:
-
- SELECT ingest_file_result.status, COUNT(*)
- FROM ingest_request
- LEFT JOIN ingest_file_result
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_request.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- GROUP BY status
- ORDER BY COUNT DESC
- LIMIT 20;
-
-
- status | count
- -------------------------------------+----------
- success | 18712849
- redirect-loop | 2008110
- no-pdf-link | 1337012
- link-loop | 1326761
- no-capture | 1030693
- terminal-bad-status | 637143
- gateway-timeout | 193194
- cdx-error | 125907
- spn2-cdx-lookup-failure | 77842
- wrong-mimetype | 50882
- wayback-error | 40278
- invalid-host-resolution | 35201
- petabox-error | 11254
- null-body | 6485
- spn2-error | 1643
- spn2-error:job-failed | 747
- spn2-error:invalid-url-syntax | 325
- spn2-error:soft-time-limit-exceeded | 190
- bad-redirect | 77
- | 47
- (20 rows)
-
-Failures by domain:
-
- SELECT domain, status, COUNT((domain, status))
- FROM (
- SELECT
- ingest_file_result.ingest_type,
- ingest_file_result.status,
- substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
- FROM ingest_file_result
- LEFT JOIN ingest_request
- ON ingest_file_result.ingest_type = ingest_request.ingest_type
- AND ingest_file_result.base_url = ingest_request.base_url
- WHERE
- ingest_file_result.ingest_type = 'pdf'
- AND ingest_request.link_source = 'mag'
- ) t1
- WHERE t1.domain != ''
- AND t1.status != 'success'
- GROUP BY domain, status
- ORDER BY COUNT DESC
- LIMIT 30;
-
-
- domain | status | count
- ---------------------------------+---------------------+--------
- ieeexplore.ieee.org | redirect-loop | 677712
- cyberleninka.ru | link-loop | 308390
- papers.ssrn.com | link-loop | 281804
- ieeexplore.ieee.org | link-loop | 273559
- dialnet.unirioja.es | redirect-loop | 240504
- dialnet.unirioja.es | terminal-bad-status | 232481
- onlinelibrary.wiley.com | no-pdf-link | 220932
- iopscience.iop.org | terminal-bad-status | 172480
- validate.perfdrive.com | no-pdf-link | 172312
- link.springer.com | redirect-loop | 130398
- agupubs.onlinelibrary.wiley.com | no-pdf-link | 113382
- iopscience.iop.org | redirect-loop | 105234
- www.bmj.com | link-loop | 100354
- www.researchgate.net | redirect-loop | 84366
- www.cambridge.org | link-loop | 83171
- jamanetwork.com | no-pdf-link | 75053
- febs.onlinelibrary.wiley.com | no-pdf-link | 74872
- www.jstor.org | redirect-loop | 72059
- journals.sagepub.com | no-pdf-link | 63028
- science.sciencemag.org | redirect-loop | 62927
- profile.thieme.de | no-pdf-link | 62406
- cyberleninka.ru | redirect-loop | 56733
- link.springer.com | link-loop | 47608
- physoc.onlinelibrary.wiley.com | no-pdf-link | 30180
- science.sciencemag.org | link-loop | 29908
- papers.ssrn.com | redirect-loop | 27255
- obgyn.onlinelibrary.wiley.com | no-pdf-link | 26789
- www.computer.org | no-pdf-link | 26444
- watermark.silverchair.com | terminal-bad-status | 25934
- www.nature.com | redirect-loop | 25306
- (30 rows)