Diffstat (limited to 'notes')
-rw-r--r-- notes/bulk_edits/2020-12-14_doaj.md | 15
-rw-r--r-- notes/bulk_edits/2020-12-23_dblp.md | 55
-rw-r--r-- notes/bulk_edits/CHANGELOG.md | 9
3 files changed, 78 insertions, 1 deletions
diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
index 64a80fda..5e897183 100644
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -122,3 +122,18 @@ ahead with the full import; note that other ingest is happening in parallel
     zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
 
     # started 2020-12-17 22:01 (Pacific)
+
+    => 5.45M 52:38:45 [28.8 /s]
+    => Counter({'total': 1366458, 'exists': 1020295, 'insert': 200249, 'exists-fuzzy': 144334, 'skip': 1563, 'skip-title': 1563, 'skip-doaj-id-mismatch': 17, 'update': 0})
+
+As total estimates:
+
+- total: 5,465,832
+- exists: 4,081,180
+- exists-fuzzy: 577,336
+- insert: 800,996
+
+Ending database size: Size: 684.08G
+
+(note that regular imports were running during same period)
+
diff --git a/notes/bulk_edits/2020-12-23_dblp.md b/notes/bulk_edits/2020-12-23_dblp.md
new file mode 100644
index 00000000..c3ad0587
--- /dev/null
+++ b/notes/bulk_edits/2020-12-23_dblp.md
@@ -0,0 +1,55 @@
+
+## Prod Container Import
+
+Using 2020-11-30 XML dump, then scrape and transform tooling from
+`extra/dblp/`.
+
+    wget https://archive.org/download/dblp-xml-2020-11-30/dblp_container_meta.json
+
+    # updated ISSN-to-ISSN-L.txt symlink to 20201207.ISSN-to-ISSN-L.txt
+
+    touch /srv/fatcat/datasets/blank_dblp_containers.tsv
+
+Create new `dblp-bot` user:
+
+    ./target/release/fatcat-auth create-editor --admin --bot dblp-bot
+    => gwbheb5jfngrxkcad5qgth5cra
+
+    ./target/release/fatcat-auth create-token gwbheb5jfngrxkcad5qgth5cra
+
+Run import:
+
+    # git commit: ec6b366af8df1956e1287cba2e0818b80ce1c518
+
+    export FATCAT_AUTH_WORKER_DBLP=...
+
+    ./fatcat_import.py dblp-container --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --dblp-container-map-file /srv/fatcat/datasets/blank_dblp_containers.tsv --dblp-container-map-output /srv/fatcat/datasets/all_dblp_containers.tsv /srv/fatcat/datasets/dblp_container_meta.json
+    => Got 0 existing dblp container mappings.
+    => Counter({'total': 6954, 'insert': 5202, 'exists': 1752, 'skip': 0, 'update': 0})
+
+    wc -l /srv/fatcat/datasets/all_dblp_containers.tsv
+    6955 /srv/fatcat/datasets/all_dblp_containers.tsv
+
+## Prod Release Import
+
+Using same 2020-11-30 XML dump. Download to /srv/fatcat/datasets:
+
+    wget https://archive.org/download/dblp-xml-2020-11-30/dblp.dtd
+    wget https://archive.org/download/dblp-xml-2020-11-30/dblp.xml
+
+Run import:
+
+    export FATCAT_AUTH_WORKER_DBLP=...
+
+    ./fatcat_import.py dblp-release --dblp-container-map-file /srv/fatcat/datasets/all_dblp_containers.tsv /srv/fatcat/datasets/dblp.xml --do-updates
+
+    # started 2020-12-23 11:51 (Pacific)
+
+    # restarted/tweaked at least twice
+
+    # finally ended around 2020-12-27 after about... 48 hours?
+
+    => Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 3097418, 'skip-key-type': 2640968, 'skip-update': 2480449, 'exists': 943800, 'update': 889700, 'insert': 338842, 'skip-arxiv-corr': 312872, 'exists-fuzzy': 203103, 'skip-dblp-container-missing': 143578, 'skip-arxiv': 53, 'skip-title': 1})
+
+Starting database size (roughly): Size: 684.08G
+Ending database size: Size: 690.22G
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 5f25d769..c5f133f8 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -13,7 +13,14 @@ This file should not turn in to a TODO list!
 
 Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
 
-Imported DOAJ article metadata from a 2020-11 dump.
+Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
+several hundred thousand file entities matched by DOAJ identifier. Updated
+journal metadata using chocula tool (before the release ingest). Filtered out
+fuzzy-matching papers before importing.
+
+Imported dblp from a 2020 snapshot, both containers (primarily for conferences
+lacking an ISSN) and release entities (primarily conference papers). Filtered
+out fuzzy-matching papers before importing.
 
 ## 2020-03
 
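
Note on the "As total estimates" figures in the DOAJ notes: each one is exactly four times the matching entry in the printed `Counter`. The notes do not say why; the factor of 4 is inferred from the numbers alone, and the name `SCALE` below is only illustrative. A minimal sketch of that arithmetic, under that assumption:

```python
from collections import Counter

# Counter values copied from the DOAJ import output in the notes.
observed = Counter({
    "total": 1366458,
    "exists": 1020295,
    "exists-fuzzy": 144334,
    "insert": 200249,
})

# Assumption: the printed counter covers about a quarter of the run.
# This factor is inferred from the estimates, not stated in the notes.
SCALE = 4

# Scale every observed count up to a whole-run estimate.
estimated = {key: count * SCALE for key, count in observed.items()}

# Print in the same bullet style the notes use.
for key, value in estimated.items():
    print(f"- {key}: {value:,}")
```

The scaled total (about 5.47M) is also close to the 5.45M lines reported by `pv -l`, which is consistent with the estimate being a simple extrapolation.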