author: Bryan Newbold <bnewbold@archive.org> 2020-05-26 14:47:17 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-05-26 14:47:17 -0700
commit: 5dd8785d710cf7d067afdc691069bfa74406e06a
tree: 8ff16b25cee10f38127caf7fdb266d41fea12d83 /notes/ingest/2020-03_s2_ingest.md
parent: 4598ea9242d1001e473e6340342afea854868577
ingests: normalize file names; commit updates
Diffstat (limited to 'notes/ingest/2020-03_s2_ingest.md')
-rw-r--r-- | notes/ingest/2020-03_s2_ingest.md | 35
1 file changed, 0 insertions(+), 35 deletions(-)
    diff --git a/notes/ingest/2020-03_s2_ingest.md b/notes/ingest/2020-03_s2_ingest.md
    deleted file mode 100644
    index fedaba0..0000000
    --- a/notes/ingest/2020-03_s2_ingest.md
    +++ /dev/null
    @@ -1,35 +0,0 @@

Crawled some 6 million new PDFs from pdfs.semanticscholar.org. Should get these
ingested, as well as any previously existing content.

Also, there are a bunch of PDF outlinks to the web; should do S2-specific
matching and ingest of those.

There are a few categories of paper from pdfs.s.o:

1. we had a previous GWB crawl, didn't re-crawl
2. we had the PDF from elsewhere on the web, didn't re-crawl
3. crawled successfully
4. crawl failed

In this ingest, we want to get all of categories 1 and 3. Could try to do this
by dumping the sandcrawler CDX table matching pdfs.s.o (which includes the
recent crawl) and joining that against the ingest request list; a shell sketch
of this join follows these notes.

For other random web URLs, we can do the usual persist/backfill/recrawl
pipeline.

## Create Seedlist

    zcat s2-corpus-pdfUrls.json.gz | parallel -j5 --linebuffer --round-robin --pipe ./s2_ingestrequest.py - | pv -l | gzip > s2-corpus-pdfUrls.2019.ingest_request.json.gz
    zcat s2-corpus-s2PdfUrl.json.gz | parallel -j5 --linebuffer --round-robin --pipe ./s2_ingestrequest.py - | pv -l | gzip > s2-corpus-s2PdfUrl.2019.ingest_request.json.gz

    zcat s2-corpus-s2PdfUrl.json.gz | jq .id -r | sort -u -S 2G > s2-corpus-s2PdfUrl.id_list
    zcat s2-corpus-pdfUrls.json.gz | jq .id -r | sort -u -S 2G > s2-corpus-pdfUrls.id_list

    zcat s2-corpus-pdfUrls.2019.ingest_request.json.gz s2-corpus-s2PdfUrl.2019.ingest_request.json.gz | rg pdfs.semanticscholar.org | sort -u -S 3G | gzip > s2_hosted_ingestrequest.json.gz
    zcat s2-corpus-pdfUrls.2019.ingest_request.json.gz s2-corpus-s2PdfUrl.2019.ingest_request.json.gz | rg -v pdfs.semanticscholar.org | sort -u -S 3G | gzip > s2_external_ingestrequest.json.gz

    zcat s2_external_ingestrequest.json.gz | wc -l
    41201427
    zcat s2_hosted_ingestrequest.json.gz | wc -l
    23345761
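For reference, the `s2_ingestrequest.py` step above maps each S2 corpus JSON
line to one ingest request per PDF URL. A minimal jq sketch of that kind of
transform, assuming the corpus records carry `id` and `pdfUrls` fields and
that requests use a `base_url`/`ingest_type`/`link_source`/`link_source_id`
shape (field names here are assumptions, not the script's confirmed schema):

    # hedged sketch only: corpus field names and request schema are assumed
    zcat s2-corpus-pdfUrls.json.gz \
        | jq -c '{base_url: (.pdfUrls // [])[], ingest_type: "pdf", link_source: "s2", link_source_id: .id}' \
        | head

The `parallel --pipe` invocation above fans the same per-line transform across
five workers; the transform itself stays line-oriented either way.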
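The "dump the CDX table and join" idea from earlier could be prototyped with
sorted URL lists and `comm`. A sketch, where `pdfs_s2_cdx.url_list` stands in
for a sorted, unique URL dump of sandcrawler CDX rows matching
pdfs.semanticscholar.org, and `base_url` is the assumed request field (both
the dump file and the field name are assumptions):

    # hedged sketch: pdfs_s2_cdx.url_list is an assumed, pre-sorted (LC_ALL=C)
    # URL dump of the CDX table; both comm inputs must share the same sort order
    zcat s2_hosted_ingestrequest.json.gz | jq -r .base_url | LC_ALL=C sort -u -S 2G > s2_hosted.url_list
    # URLs in both lists already have captures (categories 1 and 3)
    comm -12 s2_hosted.url_list pdfs_s2_cdx.url_list > s2_hosted.captured.url_list
    # URLs only in the request list have no pdfs.s.o capture (categories 2 and 4)
    comm -23 s2_hosted.url_list pdfs_s2_cdx.url_list > s2_hosted.missing.url_list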