diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2019-05-07 17:26:58 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-05-07 17:26:58 -0700 |
commit | 308a627a8102b06e23fdd455fd3e3412ec8d0668 (patch) | |
tree | a7944eec6180af263879ca518048e932d28ecb71 /notes/bootstrap/20190419_arabesque_s2_doi.txt | |
parent | bfed865555b98f987652d8d43ebdf3e6d9b6cedb (diff) | |
download | fatcat-308a627a8102b06e23fdd455fd3e3412ec8d0668.tar.gz fatcat-308a627a8102b06e23fdd455fd3e3412ec8d0668.zip |
rough notes on arabesque import from a couple weeks back
Diffstat (limited to 'notes/bootstrap/20190419_arabesque_s2_doi.txt')
-rw-r--r-- | notes/bootstrap/20190419_arabesque_s2_doi.txt | 36 |
1 files changed, 36 insertions, 0 deletions
diff --git a/notes/bootstrap/20190419_arabesque_s2_doi.txt b/notes/bootstrap/20190419_arabesque_s2_doi.txt new file mode 100644 index 00000000..8c86be90 --- /dev/null +++ b/notes/bootstrap/20190419_arabesque_s2_doi.txt @@ -0,0 +1,36 @@ + +QA before s2_doi insert, changelog #2307409 +seeing about 4500 TPS in QA postgres; low CPU load: fair amount of disk I/O +should bump to ~16 or more parallelism in prod + + 1.24M 0:18:07 [1.14k/s] + [...] + NOT checking GROBID status column + Counter({'total': 41786, 'insert': 37519, 'skip-extid-not-found': 3167, 'skip': 3167, 'exists': 1100, 'skip-update-disabled': 20, 'update': 0}) + +That's a pretty high DOI not found rate! may want to revisit these imports again in the future + + 1771.00user 72.13system 18:59.09elapsed 161%CPU (0avgtext+0avgdata 53044maxresident)k + 1528inputs+365496outputs (1major+270771minor)pagefaults 0swaps + + bnewbold@wbgrp-svc500$ zcat s2_doi.json.gz | wc -l + 1240389 + +got a single 400 error in a single thread at: + + re = self.api.lookup_release(**{self.extid_type: extid}) + +doi lookup was: 10.4244/ + +final changelog: #2316439 +would expect about 451500 new files, a fraction of the 1.2 million +did GNU parallel bail early? + +## Tried again with silly patch + + worked that time + NOT checking GROBID status column + Counter({'total': 103489, 'insert': 55521, 'exists': 40023, 'skip': 7945, 'skip-extid-not-found': 7944, 'skip-update-disabled': 40, 'skip-extid-invalid': 1, 'update': 0}) + 4322.79user 166.90system 40:21.69elapsed 185%CPU (0avgtext+0avgdata 48576maxresident)k + 21432inputs+883888outputs (136major+283424minor)pagefaults 0swaps + |