author | Bryan Newbold <bnewbold@robocracy.org> | 2020-03-20 16:31:22 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-03-20 16:31:22 -0700 |
commit | 4bcef62ecd98f2719fc4d1cef35394b0bad5cb2b (patch) | |
tree | 422aebdba5bdbcea0e71afa0ebf6997808361c79 /notes/bulk_edits | |
parent | a6f74183dd1cf1eaa44f7edeb98dbc5dc737dabb (diff) | |
notes on arxiv+pubmed backfill
Diffstat (limited to 'notes/bulk_edits')
-rw-r--r-- | notes/bulk_edits/2020-03-19_arxiv_pubmed.md | 37 |
1 files changed, 37 insertions, 0 deletions
diff --git a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
new file mode 100644
index 00000000..25220ad3
--- /dev/null
+++ b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
@@ -0,0 +1,37 @@
+
+On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
+metadata started. In the case of pubmed, updates are enabled, so that recently
+created DOI releases get updated with PMID and extra metadata.
+
+We also want to do last backfills of metadata since the last import up through
+the first day updated by the continuous harvester.
+
+
+## arxiv
+
+The previous date span was 2019-05-22 through 2019-12-20. This time we should
+do 2019-12-20 through today.
+
+First do metha update from last harvest through today, and grab the new daily files:
+
+    metha-sync -format arXivRaw http://export.arxiv.org/oai2
+
+    mkdir arxiv_20191220_20200319
+    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
+    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
+    gzip arxiv_20191220_20200319.tar
+
+Then copy to fatcat server and run import:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
+    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Ran fairly quickly; only some ~80-90k entities to process.
+
+## PubMed
+
+TODO: martin will import daily update files from the 2020 baseline through XYZ date.
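Note on the `cp` step in the arxiv section: the three shell globs pick up every daily file from 2019-12-20 onward, which only works while the harvest directory holds nothing dated later than intended. A minimal sketch of an explicit date-range filter follows, assuming the daily files are named with a `YYYY-MM-DD` prefix as in `2019-12-31-00000000.xml`. The function name `select_daily_files` and all file names other than that one are hypothetical, chosen for illustration; this is not part of fatcat or metha.

```python
from datetime import date


def select_daily_files(names, start, end):
    """Return the names whose YYYY-MM-DD prefix falls within [start, end]."""
    selected = []
    for name in sorted(names):
        try:
            day = date.fromisoformat(name[:10])
        except ValueError:
            continue  # no date prefix, so not a daily harvest file
        if start <= day <= end:
            selected.append(name)
    return selected


# Hypothetical directory listing; only the 2019-12-31 name appears in the notes.
names = [
    "2019-12-19-00000000.xml",
    "2019-12-20-00000000.xml",
    "2019-12-31-00000000.xml",
    "2020-03-19-00000000.xml",
    "2020-03-20-00000000.xml",
    "README",
]

# Same span as the backfill directory name: 2019-12-20 through 2020-03-19.
backfill = select_daily_files(names, date(2019, 12, 20), date(2020, 3, 19))
for name in backfill:
    print(name)
```

Both endpoints are inclusive, which matches the notes' choice to start again at 2019-12-20, the last day of the previous span; re-importing that day appears to be harmless, since already-present records are only counted under `exists`.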