path: root/notes/ingest/2020-11-04_arxiv.md

Ran a bulk dump using the fatcat ingest tool several months ago, and had Martin
run a crawl.

Crawl is now done, so going to ingest, hoping to get the majority of the
remaining arxiv.org PDFs (about 1.29 million ingest requests, per the count
below).

    zcat /grande/snapshots/fatcat_missing_arxiv_ingest_request.2020-08-21.json.gz | wc -l
    => 1,288,559

    zcat /grande/snapshots/fatcat_missing_arxiv_ingest_request.2020-08-21.json.gz | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
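The `rg -v "\\\\"` stage above drops any request line containing a literal backslash before it reaches Kafka. A rough sketch for checking how many lines that filter would discard, before publishing anything. Assumptions: `count_backslash_lines` is a hypothetical helper name, and `grep -F` with `gzip -dc` stand in for `rg` and `zcat` so it only needs standard tools.

```shell
# Hypothetical helper (not part of sandcrawler): report how many lines of a
# gzip-compressed, newline-delimited JSON file contain a literal backslash.
count_backslash_lines() {
    # $1: path to the .json.gz request file
    total=$(gzip -dc "$1" | wc -l)
    # -F: fixed-string match, -v: invert, -c: count lines without a backslash.
    # grep exits non-zero when the count is 0, so tolerate that.
    kept=$(gzip -dc "$1" | grep -cvF '\' || true)
    # Arithmetic expansion also strips any padding wc adds to its output.
    echo "total=$((total)) kept=$((kept)) dropped=$((total - kept))"
}
```

Running it against the snapshot file first gives the number of requests that will actually be published, which may differ from the `wc -l` count if any lines are filtered.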