author Bryan Newbold <bnewbold@robocracy.org> 2020-12-29 19:22:33 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2020-12-29 19:22:33 -0800
commit 3b0a8b8f9d94fbdbbb0034e46e725b138e7bd712
tree 6ea56165f6dbeebaceaaa0054756756ee3ad06aa /notes/bulk_edits
parent 85732f776c38db7c181f628993a29dcd6776ffde
dblp import notes; bulk edit changelog update
Diffstat (limited to 'notes/bulk_edits')
-rw-r--r-- notes/bulk_edits/2020-12-23_dblp.md | 55
-rw-r--r-- notes/bulk_edits/CHANGELOG.md | 9
2 files changed, 63 insertions, 1 deletions
diff --git a/notes/bulk_edits/2020-12-23_dblp.md b/notes/bulk_edits/2020-12-23_dblp.md
new file mode 100644
index 00000000..c3ad0587
--- /dev/null
+++ b/notes/bulk_edits/2020-12-23_dblp.md
@@ -0,0 +1,55 @@
+
+## Prod Container Import
+
+Using 2020-11-30 XML dump, then scrape and transform tooling from
+`extra/dblp/`.
+
+ wget https://archive.org/download/dblp-xml-2020-11-30/dblp_container_meta.json
+
+ # updated ISSN-to-ISSN-L.txt symlink to 20201207.ISSN-to-ISSN-L.txt
+
+ touch /srv/fatcat/datasets/blank_dblp_containers.tsv
+
+Create new `dblp-bot` user:
+
+ ./target/release/fatcat-auth create-editor --admin --bot dblp-bot
+ => gwbheb5jfngrxkcad5qgth5cra
+
+ ./target/release/fatcat-auth create-token gwbheb5jfngrxkcad5qgth5cra
+
+Run import:
+
+ # git commit: ec6b366af8df1956e1287cba2e0818b80ce1c518
+
+ export FATCAT_AUTH_WORKER_DBLP=...
+
+ ./fatcat_import.py dblp-container --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --dblp-container-map-file /srv/fatcat/datasets/blank_dblp_containers.tsv --dblp-container-map-output /srv/fatcat/datasets/all_dblp_containers.tsv /srv/fatcat/datasets/dblp_container_meta.json
+ => Got 0 existing dblp container mappings.
+ => Counter({'total': 6954, 'insert': 5202, 'exists': 1752, 'skip': 0, 'update': 0})
+
+ wc -l /srv/fatcat/datasets/all_dblp_containers.tsv
+ 6955 /srv/fatcat/datasets/all_dblp_containers.tsv
+
+## Prod Release Import
+
+Using same 2020-11-30 XML dump. Download to /srv/fatcat/datasets:
+
+ wget https://archive.org/download/dblp-xml-2020-11-30/dblp.dtd
+ wget https://archive.org/download/dblp-xml-2020-11-30/dblp.xml
+
+Run import:
+
+ export FATCAT_AUTH_WORKER_DBLP=...
+
+ ./fatcat_import.py dblp-release --dblp-container-map-file /srv/fatcat/datasets/all_dblp_containers.tsv /srv/fatcat/datasets/dblp.xml --do-updates
+
+ # started 2020-12-23 11:51 (Pacific)
+
+ # restarted/tweaked at least twice
+
+ # finally ended around 2020-12-27 after about... 48 hours?
+
+ => Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 3097418, 'skip-key-type': 2640968, 'skip-update': 2480449, 'exists': 943800, 'update': 889700, 'insert': 338842, 'skip-arxiv-corr': 312872, 'exists-fuzzy': 203103, 'skip-dblp-container-missing': 143578, 'skip-arxiv': 53, 'skip-title': 1})
+
+Starting database size (roughly): Size: 684.08G
+Ending database size: Size: 690.22G
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 5f25d769..c5f133f8 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -13,7 +13,14 @@ This file should not turn in to a TODO list!
Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
-Imported DOAJ article metadata from a 2020-11 dump.
+Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
+several hundred thousand file entities matched by DOAJ identifier. Updated
+journal metadata using chocula tool (before the release ingest). Filtered out
+fuzzy-matching papers before importing.
+
+Imported dblp from a 2020 snapshot, both containers (primarily for conferences
+lacking an ISSN) and release entities (primarily conference papers). Filtered
+out fuzzy-matching papers before importing.
## 2020-03
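
The container import log in the notes above can be sanity-checked by confirming that the per-outcome counts add up to `total`. The following is a minimal sketch, not part of the fatcat tooling: `parse_counter` is a hypothetical helper introduced here for illustration, and the log line and database sizes are copied from the notes.

```python
import ast
import re
from collections import Counter


def parse_counter(line: str) -> Counter:
    """Parse a logged "Counter({...})" repr back into a Counter.

    Hypothetical helper for illustration; not part of fatcat.
    """
    match = re.search(r"Counter\((\{.*\})\)", line)
    if match is None:
        raise ValueError("no Counter(...) repr found in line")
    # literal_eval only accepts Python literals, so it is safe on log text.
    return Counter(ast.literal_eval(match.group(1)))


# Log line copied from the container import notes.
container_line = (
    "=> Counter({'total': 6954, 'insert': 5202, 'exists': 1752, "
    "'skip': 0, 'update': 0})"
)
counts = parse_counter(container_line)

# Every container record should land in exactly one outcome bucket.
outcomes = sum(counts[key] for key in ("insert", "exists", "skip", "update"))
if outcomes != counts["total"]:
    raise SystemExit(f"mismatch: {outcomes} outcomes vs {counts['total']} total")

# Database growth over the release import, from the sizes quoted in the notes.
growth_gb = round(690.22 - 684.08, 2)
print(f"container outcomes sum to total ({outcomes}); database grew {growth_gb}G")
```

The release import counter is not checked the same way here: its `skip-*` keys appear to be sub-counts rather than separate buckets, and the run was restarted at least twice, so a plain sum of all keys would not be expected to match `total`.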