
As a follow-up to `SCIELO-CRAWL-2020-07`, the plan is to bulk ingest all
existing fatcat releases that have no IA copy and match
`publisher_type:scielo`. There are 200k+ such releases.

Some of these seem to be HTML or XML, e.g. https://doi.org/10.4321/s1132-12962011000300008

Could try XML ingest of these!

## Bulk Ingest

Dump ingest requests:

    ./fatcat_ingest.py --allow-non-oa query "publisher_type:scielo" | pv -l > /srv/fatcat/snapshots/scielo_papers_20200914.ingest_request.json
    Expecting 212529 release objects in search queries
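
A quick sanity check on the dump is to compare the number of request lines written against the count the tool reported. The sketch below is only an illustration: `count_requests` and `check_dump` are hypothetical helpers, not part of `fatcat_ingest.py`, and a shortfall is not necessarily an error, since the tool may skip some releases.

```python
from typing import Iterable, Tuple


def count_requests(lines: Iterable[str]) -> int:
    """Count non-blank lines, assuming one JSON ingest request per line."""
    return sum(1 for line in lines if line.strip())


def check_dump(lines: Iterable[str], expected: int) -> Tuple[int, int]:
    """Return (actual, shortfall), where shortfall = expected - actual."""
    actual = count_requests(lines)
    return actual, expected - actual
```

For the dump above this would be called as `check_dump(open(path), 212529)`, with `path` pointing at the `.ingest_request.json` file.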

Enqueue:

    cat /srv/fatcat/snapshots/scielo_papers_20200914.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    => done 2020-09-14
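
The `rg -v "\\\\" | jq . -c` stage of that pipeline drops any line containing a literal backslash and re-serializes the remaining requests as compact one-line JSON. A rough Python equivalent, as a sketch only (`filter_requests` is a hypothetical name, and the output may differ from `jq` in minor details such as number formatting):

```python
import json
from typing import Iterable, Iterator


def filter_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for each request line that has no backslash."""
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines carry no request; skip them.
            continue
        if "\\" in line:
            # Mirrors `rg -v "\\\\"`: drop lines with any literal backslash.
            continue
        # Mirrors `jq . -c`: parse, then emit compact single-line JSON.
        # Raises json.JSONDecodeError on malformed input.
        yield json.dumps(json.loads(line), separators=(",", ":"), ensure_ascii=False)
```

Feeding it `sys.stdin` and printing each result would give output suitable for piping into `kafkacat`, in place of the `rg`/`jq` stage.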