Diffstat (limited to 'notes/bulk_edits')
-rw-r--r-- notes/bulk_edits/2019-12-20_updates.md | 82
-rw-r--r-- notes/bulk_edits/CHANGELOG.md | 15
2 files changed, 95 insertions, 2 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..a8f62ea9
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,82 @@
+
+## Arxiv
+
+Used the `metha-sync` tool to update. Then went into the raw storage directory
+(as opposed to using `metha-cat`) and plucked out the weekly files updated since
+the last import. Created a tarball and uploaded to:
+
+    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+    gunzip *.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, no duplication obvious.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In prod elasticsearch, 2,377,645 arxiv releases before this
+updated import, 741,033 with files attached. Guessing about 150k new releases,
+but will check.
+
+Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
+781,122 with fulltext.
+
+## Pubmed
+
+Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
+
+    gunzip *.xml.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_PUBMED=...
+
+    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
+
+Noticed that `release_year` was not getting set for many releases. Made a small
+code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and trying another:
+
+    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
+
+    real 30m45.044s
+    user 16m43.672s
+    sys 0m10.792s
+
+    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+More errors:
+
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
+
+
+    BOGUS DOI: 10.1037//0002-9432.72.1.50
+    BOGUS DOI: 10.1037//0021-843x.106.2.266
+    BOGUS DOI: 10.1037//0021-843x.106.2.280
+    => actual ok? at least redirect ok
+
+    unparsable medline date, skipping: Summer 2018
+
+TODO:
+x fix bad DOI error (real error, skip these)
+x remove newline after "unparsable medline date" error
+x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+
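The first TODO item above (skip DOIs the API rejects as `MalformedExternalId`) could be handled with a client-side check before submitting. The following is only a minimal sketch under stated assumptions: `DOI_RE`, `is_plausible_doi`, and `partition_dois` are hypothetical names, and the pattern is an illustrative guess, not the pattern the fatcat API actually enforces. It catches the DOIs with embedded whitespace from the log above and, like the note on the `10.1037//...` cases, does not reject double slashes. It makes no claim about why `10.13201/j.issn.10011781.2016.06.002` was rejected.

```python
import re

# Assumed, illustrative pattern (NOT the server's): "10.", a 4-9 digit
# registrant code, a slash, then a suffix with no whitespace anywhere.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def is_plausible_doi(raw: str) -> bool:
    """Return True if `raw` matches the assumed DOI shape in full."""
    return DOI_RE.fullmatch(raw) is not None


def partition_dois(dois):
    """Split an iterable of DOI strings into (plausible, malformed) lists."""
    plausible, malformed = [], []
    for doi in dois:
        (plausible if is_plausible_doi(doi) else malformed).append(doi)
    return plausible, malformed


if __name__ == "__main__":
    sample = [
        "10.1234/aksjdfh",
        "10.3760/cma. j. issn.2095-4352. 2014. 07.014",
        "10.23750/abm.v88i2 -s.6506",
        "10.1037//0021-843x.106.2.266",
    ]
    ok, bad = partition_dois(sample)
    print("would submit:", ok)
    print("would skip:", bad)
```

An importer could count the `bad` entries under a skip counter instead of letting the API call fail, which matches the "real error, skip these" intent of the TODO.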
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 3aa89b87..80760938 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,13 +9,24 @@ this file should probably get merged into the guide at some point.
This file should not turn in to a TODO list!
+## 2019-12
+
+Inserted about 154k new arxiv release entities. Still no automatic daily
+harvesting.
+
+"Save Paper Now" importer running. This bot only *submits* editgroups for
+review, doesn't auto-accept them.
+
+## 2019-11
+
+Daily ingest of fulltext for OA releases now enabled. New file entities created
+and merged automatically.
+
## 2019-10
Inserted 1.45m new release entities from Crossref which had been missed during
a previous gap in continuous metadata harvesting.
-## 2019-10
-
Updated 304,308 file entities to remove broken
"https://web.archive.org/web/None/*" URLs.