aboutsummaryrefslogtreecommitdiffstats
path: root/notes
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2021-11-29 14:34:02 -0800
committerBryan Newbold <bnewbold@robocracy.org>2021-11-29 14:34:02 -0800
commitc32154f2875a7fb9aac727013e1475cdd811e180 (patch)
treef0e061498a101fa824995fb6ec9f91e7e44257e1 /notes
parentc5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff)
downloadfatcat-c32154f2875a7fb9aac727013e1475cdd811e180.tar.gz
fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.zip
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes')
-rw-r--r--notes/bulk_edits/2019-06-24_unpaywall_archiveorg.md262
-rw-r--r--notes/bulk_edits/2019-09-03_chocula.md23
-rw-r--r--notes/bulk_edits/2019-10-08_file_cleanups.md59
-rw-r--r--notes/bulk_edits/2019-11-05_crossref_patch.md58
-rw-r--r--notes/bulk_edits/2019-12-20_orcid.md43
-rw-r--r--notes/bulk_edits/2019-12-20_updates.md137
-rw-r--r--notes/bulk_edits/2020-03-19_arxiv_pubmed.md57
-rw-r--r--notes/bulk_edits/2020-03-23_jalc.md23
-rw-r--r--notes/bulk_edits/2020-08-05_chocula.md17
-rw-r--r--notes/bulk_edits/2020-09-02_file_meta.md75
-rw-r--r--notes/bulk_edits/2020-10-08_chocula.md44
-rw-r--r--notes/bulk_edits/2020-12-01_orcid.md55
-rw-r--r--notes/bulk_edits/2020-12-14_doaj.md139
-rw-r--r--notes/bulk_edits/2020-12-23_dblp.md55
-rw-r--r--notes/bulk_edits/2020_datacite.md152
-rw-r--r--notes/bulk_edits/2021-05-28_dblp.md44
-rw-r--r--notes/bulk_edits/2021-05-28_doaj.md80
-rw-r--r--notes/bulk_edits/2021-11-10_case_sensitive_dois.md53
-rw-r--r--notes/bulk_edits/2021-11-10_file_release_ingest_bugfix.md108
-rw-r--r--notes/bulk_edits/2021-11-11_wayback_short_ts.md52
-rw-r--r--notes/bulk_edits/2021-11-24_file_meta.md41
-rw-r--r--notes/bulk_edits/2021-11-24_file_sha1_dedupe.md35
-rw-r--r--notes/bulk_edits/CHANGELOG.md131
23 files changed, 0 insertions, 1743 deletions
diff --git a/notes/bulk_edits/2019-06-24_unpaywall_archiveorg.md b/notes/bulk_edits/2019-06-24_unpaywall_archiveorg.md
deleted file mode 100644
index 2f3bbb98..00000000
--- a/notes/bulk_edits/2019-06-24_unpaywall_archiveorg.md
+++ /dev/null
@@ -1,262 +0,0 @@
-
-Goal is to import:
-
-- UNPAYWALL-PDF-CRAWL-2019-04.published dataset; about 6 million lines, expect
- about half (3 million) new release fulltext matches
-- archive.org fulltext, about 1.8 million files
-
-## QA UNPAYWALL-PDF-CRAWL-2019-04
-
- export FATCAT_AUTH_WORKER_CRAWL=...
-
- # this wasn't a random sample
- zcat /srv/fatcat/datasets/UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | head -n200 | ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id UNPAYWALL-PDF-CRAWL-2019-04
-
- # this was!
- zcat /srv/fatcat/datasets/UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | shuf -n200 | ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id UNPAYWALL-PDF-CRAWL-2019-04
- [...]
- Counter({'total': 199, 'insert': 106, 'exists': 62, 'skip': 31, 'skip-extid-not-found': 20, 'skip-update-disabled': 1, 'update': 0})
-
- # ok, big import
- zcat /srv/fatcat/datasets/UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id UNPAYWALL-PDF-CRAWL-2019-04
- # ran a few hundred thousand and looked good
-
-## prod UNPAYWALL-PDF-CRAWL-2019-04
-
- export FATCAT_AUTH_WORKER_CRAWL=...
-
- zcat /srv/fatcat/datasets/UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | shuf -n200 | ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id UNPAYWALL-PDF-CRAWL-2019-04
- Counter({'total': 198, 'insert': 115, 'exists': 56, 'skip': 27, 'skip-extid-not-found': 13, 'skip-update-disabled': 2, 'update': 0})
-
- zcat /srv/fatcat/datasets/UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id UNPAYWALL-PDF-CRAWL-2019-04
- [...]
- Requiring GROBID status == 200
- Counter({'total': 520139, 'insert': 282524, 'exists': 155371, 'skip': 82244, 'skip-extid-not-found': 40000, 'skip-update-disabled': 6319, 'update': 0})
- 19008.63user 729.06system 3:01:57elapsed 180%CPU (0avgtext+0avgdata 51440maxresident)k
- 8552inputs+4335032outputs (54major+370893minor)pagefaults 0swaps
-
- (loosely repeated 12x times, of course)
-
- oh no, lots of duplicate inserts... ugh. needed a uniq in there, but really
- only "one hit per file" export. or a shuf? blech.
-
- (python)fatcat@wbgrp-svc502:/srv/fatcat/datasets$ zcat UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | jq .final_sha1 -r | sort -S 4G -u | wc -l
- 5621882
- (python)fatcat@wbgrp-svc502:/srv/fatcat/datasets$ zcat UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | wc -l
- 6181191
-
- ugh. how did this get missed in QA? sloppy.
-
-fixup is going to be
-- filter input list for duplicated sha1hex
-- for each duplicated sha1hex:
- - fetch file entity. if not single release_id, bail
- - fetch release expanded with files
- - find all files with same sha1 that *aren't* the fetched file
- - print file entity id
-- iterate over file entity ids, batches of 100x
- - create editgroup
- - delete files
- - accept editgroup
-
-fetch_dupes.py:
-
-
- #!/usr/bin/env python3
-
- import sys
- import fatcat_client
- from fatcat_tools import public_api
-
- def do_sha1(api, sha1hex):
- try:
- fe = api.lookup_file(sha1=sha1hex)
- except:
- return
- if len(fe.release_ids) != 1:
- return
- try:
- re = api.get_release(fe.release_ids[0], expand='files', hide='refs,contribs,abstracts')
- except:
- return
- for f in re.files:
- if f.sha1 == fe.sha1 and f.ident != fe.ident and f.release_ids == [re.ident]:
- print(f.ident)
-
- def run():
- api = public_api('https://api.qa.fatcat.wiki/v0')
- for l in sys.stdin:
- if l:
- do_sha1(api, l.strip())
-
- if __name__ == '__main__':
- run()
-
-delete_dupes.py:
-
-
- #!/usr/bin/env python3
-
- import sys
- import fatcat_client
- from fatcat_tools import authenticated_api
-
- #API_ENDPOINT = 'https://api.qa.fatcat.wiki/v0'
- API_ENDPOINT = 'https://api.fatcat.wiki/v0'
-
- def do_batch(api, batch):
- eg = api.create_editgroup(
- fatcat_client.Editgroup(description="Cleaning up duplicated file insertions from UNPAYWALL-CRAWL-2019-04 insert"))
- for ident in batch:
- api.delete_file(eg.editgroup_id, ident)
- api.accept_editgroup(eg.editgroup_id)
- print("deleted {} - {}...".format(eg.editgroup_id, len(batch)))
-
- def run():
- api = authenticated_api(API_ENDPOINT)
- batch = []
- for l in sys.stdin:
- l = l.strip()
- if not l:
- continue
- try:
- fe = api.get_file(l)
- except:
- continue
- if fe.state == 'active' and fe.release_ids:
- batch.append(l)
- if len(batch) >= 100:
- do_batch(api, batch)
- batch = []
- if batch:
- do_batch(api, batch)
-
- if __name__ == '__main__':
- run()
-
-commands:
-
- zcat UNPAYWALL-PDF-CRAWL-2019-04.published.json.gz | jq .final_sha1 -r | b32_hex.py | sort -S 4G | uniq -d > repeated_sha1.tsv
-
- cat repeated_sha1.tsv | pv -l | ./fetch_dupes.py > repeated_file_idents.tsv
-
- export FATCAT_API_AUTH_TOKEN=... (crawl bot)
- cat repeated_file_idents.tsv | ./delete_dupes.py
-
-## QA archive.org files
-
-Start with arxiv:
-
- # FATCAT_AUTH_WORKER_ARCHIVE_ORG
- export FATCAT_API_AUTH_TOKEN=...
-
- # had a 500 "unexpected internal error: invalid length at 196", which was
- due to syntax error in API token. should have a better error response
-
- # try sample of arxiv_id
- zcat /srv/fatcat/datasets/arxiv.match.json.gz | head -n100 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- Counter({'skip': 100, 'total': 100, 'skip-no-releases': 72, 'skip-no-urls': 28, 'update': 0, 'insert': 0, 'exists': 0})
-
- # TODO: shouldn't re-insert if URL already in there under a different reltyp
-
- # Ok, made a bunch of code changes to "clean up" at least arxiv URLs. All
- # arxiv.org files should be 1-to-1 with releases that have full arxiv_ids
-
-Ok, try JSTOR:
-
- zcat /srv/fatcat/datasets/jstor.match.json.gz | shuf -n1000 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- Counter({'total': 1000, 'skip': 763, 'skip-no-releases': 763, 'insert': 162, 'exists': 74, 'update': 1})
-
-larger import got:
-
- HTTP response body: {"success":false,"error":"ConstraintViolation","message":"unexpected database error: duplicate key value violates unique constraint \"file_edit_editgroup_id_ident_id_key\""}
-
-could try getting around this with shuf?
-
- zcat /srv/fatcat/datasets/jstor.match.json.gz | shuf | pv -l | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
-
-got the same errors so added "inflight" edit protection and rolled back to earlier command:
-
- zcat /srv/fatcat/datasets/jstor.match.json.gz | shuf -n1000 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- 451k 0:25:32 [ 294 /s]
- Counter({'total': 451178, 'skip-no-releases': 351287, 'skip': 351287, 'insert': 59644, 'exists': 39198, 'update': 1049, 'skip-update-inflight': 26})
-
-many/most of these files were already in fatcat due to earlier "paper-manifest"
-work... keep forgetting that!
-
-ok, next pmc:
-
- zcat /srv/fatcat/datasets/pmc.match.json.gz | shuf -n1000 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- Counter({'total': 1000, 'exists': 895, 'insert': 77, 'skip-no-releases': 22, 'skip': 22, 'update': 6})
-
- that's a surprisingly large fraction (2.2%) with `skip-no-releases`. some
- because pubmed import failed, some because multiple PMCID identifiers? hrm.
-
-ok, an finally paper-doi:
-
- zcat /srv/fatcat/datasets/paper-doi.match.json.gz | shuf -n1000 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- Counter({'total': 1000, 'exists': 720, 'insert': 280, 'update': 0, 'skip': 0})
-
- lots exist! probably from the pre-1923 stuff? yup.
-
-## prod archive.org files
-
- # try sample of arxiv_id
- zcat /srv/fatcat/datasets/arxiv.match.json.gz | shuf -n100 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- Counter({'total': 100, 'insert': 80, 'update': 20, 'exists': 0, 'skip': 0})
-
- # all arxiv_id
- zcat /srv/fatcat/datasets/arxiv.match.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --editgroup-description-override '"Import fulltext from archive.org journals collection"' matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- Counter({'total': 62296, 'insert': 49503, 'update': 12413, 'exists': 269, 'skip': 111, 'skip-no-releases': 111, 'skip-update-inflight': 10})
- 2497.78user 98.87system 27:31.22elapsed 157%CPU (0avgtext+0avgdata 47604maxresident)k
- 360inputs+266104outputs (3major+265297minor)pagefaults 0swaps
-
- # derp, some of those were crawl-bot but should have been archive-org-bot. ctrl-c and re-ran
-
- # sample jstor
- zcat /srv/fatcat/datasets/jstor.match.json.gz | shuf -n100 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- Counter({'total': 100, 'insert': 69, 'exists': 29, 'update': 1, 'skip': 1, 'skip-no-releases': 1})
-
- # all jstor
- zcat /srv/fatcat/datasets/jstor.match.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --editgroup-description-override '"Import fulltext from archive.org journals collection"' matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- Counter({'total': 41072, 'insert': 27783, 'exists': 11926, 'update': 1307, 'skip-update-inflight': 117, 'skip': 56, 'skip-no-releases': 56})
- 1257.93user 54.42system 12:45.96elapsed 171%CPU (0avgtext+0avgdata 45248maxresident)k
- 5384inputs+157016outputs (38major+259749minor)pagefaults 0swaps
-
- good, pretty low `skip-no-releases` for JSTOR imports
-
- # sample pmc
- zcat /srv/fatcat/datasets/pmc.match.json.gz | shuf -n100 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- Counter({'total': 100, 'exists': 92, 'insert': 5, 'skip-no-releases': 2, 'skip': 2, 'update': 1})
-
- interesting, at least one longtail file which is actually known: https://fatcat.wiki/file/xnc3sarc3jfsnceeagn34zi5la
- almost all known!
-
- # all pmc
- zcat /srv/fatcat/datasets/pmc.match.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --editgroup-description-override '"Import fulltext from archive.org journals collection"' matched --default-mimetype application/pdf --default-link-rel archive -
- [...]
- Counter({'total': 18720, 'exists': 16701, 'insert': 1461, 'skip': 357, 'skip-no-releases': 357, 'update': 201, 'skip-update-inflight': 1})
-
- # sample paper-doi
- zcat /srv/fatcat/datasets/paper-doi.match.json.gz | shuf -n100 | ./fatcat_import.py --editgroup-description-override "Import fulltext from archive.org journals collection" matched --default-mimetype application/pdf --default-link-rel archive -
- Counter({'total': 100, 'exists': 73, 'insert': 27, 'skip': 0, 'update': 0})
-
- # all paper-doi
- zcat /srv/fatcat/datasets/paper-doi.match.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --editgroup-description-override '"Import fulltext from archive.org journals collection"' matched --default-mimetype application/pdf --default-link-rel archive -
- Counter({'total': 3014, 'exists': 2280, 'insert': 734, 'update': 0, 'skip': 0})
- Counter({'total': 3464, 'exists': 2483, 'insert': 981, 'update': 0, 'skip': 0})
- Counter({'total': 3437, 'exists': 2303, 'insert': 1134, 'update': 0, 'skip': 0})
- Counter({'total': 3450, 'exists': 2379, 'insert': 1071, 'update': 0, 'skip': 0})
- Counter({'total': 3467, 'exists': 2486, 'insert': 981, 'skip': 0, 'update': 0})
- Counter({'total': 3481, 'exists': 2583, 'insert': 898, 'skip': 0, 'update': 0})
- Counter({'total': 3423, 'exists': 2178, 'insert': 1245, 'update': 0, 'skip': 0})
- 62.41user 3.17system 0:31.18elapsed 210%CPU (0avgtext+0avgdata 49852maxresident)k
- 96inputs+13184outputs (7major+159215minor)pagefaults 0swaps
-
-All done!
diff --git a/notes/bulk_edits/2019-09-03_chocula.md b/notes/bulk_edits/2019-09-03_chocula.md
deleted file mode 100644
index 03311c8c..00000000
--- a/notes/bulk_edits/2019-09-03_chocula.md
+++ /dev/null
@@ -1,23 +0,0 @@
-
-Ran on my laptop, from master branch (`00e4ec6865782e4453d4dcec24ee205bf7c13ecf`).
-
-In QA:
-
- time ./fatcat_import.py --host-url https://api.qa.fatcat.wiki/v0 chocula ~/code/chocula/export_fatcat.json
- [...]
- Counter({'total': 139032, 'exists': 76952, 'exists-skip-update': 65528, 'update': 46157, 'insert': 13605, 'exists-by-issnl': 8879, 'exists-not-found': 2538, 'skip': 2318, 'skip-unknown-new-issnl': 2318, 'exists-inactive': 7})
- real 32m30.009s
- user 4m44.176s
- sys 0m12.544s
-
-In prod:
-
- time ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 chocula ~/code/chocula/export_fatcat.2019-09-03.json
-
- Counter({'total': 139032, 'update': 66802, 'exists': 47586, 'exists-skip-update': 47428, 'insert': 22326, 'skip-unknown-new-issnl': 2318, 'skip': 2318, 'exists-by-issnl': 158})
-
- real 34m16.930s
- user 5m5.116s
- sys 0m13.560s
-
-The container auto-updater is currently broken, so needed to manually dump/load into elastic.
diff --git a/notes/bulk_edits/2019-10-08_file_cleanups.md b/notes/bulk_edits/2019-10-08_file_cleanups.md
deleted file mode 100644
index 2eebb363..00000000
--- a/notes/bulk_edits/2019-10-08_file_cleanups.md
+++ /dev/null
@@ -1,59 +0,0 @@
-
-These cleanups are primarily intended to fix bogus 'None' datetime links to
-wayback for files that are actually in petabox (archive.org not
-web.archive.org). These URLs were created accidentally during fatcat
-boostrapping; there are about 300k such file enties to fix.
-
-Will also update archive.org link reltype to 'archive' (instead of
-'repository'), which is the new preferred style.
-
-Generated the set of files to update like:
-
- zcat file_export.2019-07-07.json.gz | rg 'web.archive.org/web/None' | gzip > file_export.2019-07-07.None.json.gz
-
- zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | wc -l
- 304308
-
-## QA
-
-Running at git rev:
-
- 984a1b157990f42f8c57815f4b3c00f6455a114f
-
-Created a new 'cleanup-bot' account and credentials. Put token in local env.
-
-Ran with a couple hundred entities first; edits look good.
-
- zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
-
-Then the full command, with batchsize=100:
-
- time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
-
-Should finish in a couple hours.
-
- 304k 1:05:19 [77.6 /s]
-
- Counter({'cleaned': 304308, 'lines': 304308, 'updated': 297308, 'skip-revision': 7000})
-
- real 65m20.613s
- user 20m40.828s
- sys 0m34.492s
-
-## Production
-
-Again ran with a couple hundred entities first; edits look good.
-
- zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
-
-Then the full command, with batchsize=100:
-
- time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
- [...]
- 304k 1:03:10 [80.3 /s]
- Counter({'cleaned': 304308, 'lines': 304308, 'updated': 304107, 'skip-revision': 201})
-
- real 63m11.631s
- user 21m8.504s
- sys 0m31.888s
-
diff --git a/notes/bulk_edits/2019-11-05_crossref_patch.md b/notes/bulk_edits/2019-11-05_crossref_patch.md
deleted file mode 100644
index 1765fc36..00000000
--- a/notes/bulk_edits/2019-11-05_crossref_patch.md
+++ /dev/null
@@ -1,58 +0,0 @@
-
-Goal is to make sure we have imported all in-scope crossref DOI objects. There
-were a few months gap between the snapshot used as initial bootstrap and the
-start of continuous ingest; any DOIs registered during that gap and not updated
-since are not in fatcat. Expectation is that this will be a relatively small
-import.
-
-## QA Run
-
-Started Thu 31 Oct 2019 08:07:20 PM PDT
-
- export FATCAT_AUTH_WORKER_CROSSREF="..."
- time xzcat /srv/fatcat/datasets/crossref-works.2019-09-09.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
- # postgresql DB at start: fresh 2019-10 dump imported, 357 GB
- # over 15k TPS against postgres
-
- 20x theads of:
- Counter({'total': 5397349, 'exists': 4961058, 'skip': 360156, 'insert': 76135, 'inserted.container': 113, 'update': 0})
-
- real 1173m52.497s => 20hr
- user 13058m24.460s
- sys 319m27.716s
-
- 1.5 million new releases
- 7.2 million skips (total)
-
-Ran again with null subtitle fix and granular stats:
-
- 20x threads of:
- Counter({'total': 5368366, 'exists': 5122104, 'skip': 244072, 'skip-blank-title': 38399, 'skip-release-type': 5296, 'insert': 2190, 'skip-huge-contribs': 70, 'skip-huge-refs': 7, 'update': 0})
-
- 43k additional insets (still about 1.5m total)
- of 4.8 million skipped (why not closer to 7.2 million?), most seem to be blank title
-
-## Production Run
-
-Git: 44c23290c72ec67db38f1e1d40b76ba795b40d9d
-
-started around Tue 05 Nov 2019 02:51:19 PM PST
-
- export FATCAT_AUTH_WORKER_CROSSREF="..."
- time xzcat /srv/fatcat/datasets/crossref-works.2019-09-09.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20190730.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
- # postgresql DB at start: 399.03G
-
- # 20x of:
- Counter({'total': 5347938, 'exists': 5023305, 'skip': 251747, 'skip-blank-title': 247969, 'insert': 72886, 'skip-release-type': 3686, 'inserted.container': 103, 'skip-huge-contribs': 88, 'skip-huge-refs': 4, 'update': 0})
- # 1.45m new releases
- # 2k more new containers
- # 4.96m blank titles
-
- real 1139m42.231s
- user 13307m10.124s
- sys 355m18.904s
-
- # postgresql DB: 402.76G
-
diff --git a/notes/bulk_edits/2019-12-20_orcid.md b/notes/bulk_edits/2019-12-20_orcid.md
deleted file mode 100644
index 33dde32f..00000000
--- a/notes/bulk_edits/2019-12-20_orcid.md
+++ /dev/null
@@ -1,43 +0,0 @@
-
-Newer ORCID dumps are XML, not JSON. But there is a conversion tool!
-
- https://github.com/ORCID/orcid-conversion-lib
-
-Commands:
-
- wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
- java -jar orcid-conversion-lib-0.0.2-full.jar OPTIONS
-
- java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2019_summaries.tar.gz -v v3_0rc1 -o ORCID_2019_summaries_json.tar.gz
-
- # [...]
- # Sat Dec 21 04:43:50 UTC 2019 done 7300000
- # Sat Dec 21 04:44:08 UTC 2019 done 7310000
- # Sat Dec 21 04:44:17 UTC 2019 finished errors 0
-
-Importing in QA, ran in to some lines like:
-
- {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-0014-6598","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
- {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-3750-5654","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
- {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-1424-4826","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
- {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0002-5340-9665","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
-
-Needed to patch to filter those out. Then ran ok like:
-
- zcat /srv/fatcat/datasets/ORCID_2019_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
- Counter({'total': 10000, 'exists': 5323, 'insert': 4493, 'skip': 184, 'skip-no-person': 160, 'update': 0})
-
-New dump is about 7.3 million rows, so expecting about 3.2 million new
-entities, 250k skips.
-
-Doing bulk run like:
-
- time zcat /srv/fatcat/datasets/ORCID_2019_summaries.json.gz | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
-
-Prod timing:
-
- Counter({'total': 910643, 'exists': 476812, 'insert': 416583, 'skip': 17248, 'update': 0})
-
- real 47m27.658s
- user 245m44.272s
- sys 14m50.836s
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
deleted file mode 100644
index bd069a7a..00000000
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ /dev/null
@@ -1,137 +0,0 @@
-
-## Arxiv
-
-Used metha-sync tool to update. Then went in raw storage directory (as opposed
-to using `metha-cat`) and plucked out weekly files updated since last import.
-Created a tarball and uploaded to:
-
- https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
-
-Downloaded, extracted, then unzipped:
-
- gunzip *.gz
-
-Run importer:
-
- export FATCAT_AUTH_WORKER_ARXIV=...
-
- ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
- # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
-
- fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
-
-Things seem to run smoothly in QA. New releases get grouped with old works
-correctly, no duplication obvious.
-
-In prod, loaded just the first file as a start, waiting to see if auto-ingest
-happens. Looks like yes! Great that everything is so smooth. All seem to be new
-captures.
-
-In production prod elasticsearch, 2,377,645 arxiv releases before this
-updated import, 741,033 with files attached. Guessing about 150k new releases,
-but will check.
-
-Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
-781,122 with fulltext.
-
-## Pubmed QA
-
-Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
-
- gunzip *.xml.gz
-
-Run importer:
-
- export FATCAT_AUTH_WORKER_PUBMED=...
-
- ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
- # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
-
-Noticed that `release_year` was not getting set for many releases. Made a small
-code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and trying another:
-
- time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
- # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
-
- real 30m45.044s
- user 16m43.672s
- sys 0m10.792s
-
- time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
-More errors:
-
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
-
-
- 10.1037//0002-9432.72.1.50
- BOGUS DOI: 10.1037//0021-843x.106.2.266
- BOGUS DOI: 10.1037//0021-843x.106.2.280
- => actual ok? at least redirect ok
-
- unparsable medline date, skipping: Summer 2018
-
-TODO:
-x fix bad DOI error (real error, skip these)
-x remove newline after "unparsable medline date" error
-x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
-
-NOTE: Remember having run through the entire baseline in QA, but didn't save the command or output.
-
-## Pubmed Prod (2020-01-17)
-
-This is after adding a flag to enforce no updates at all, only new releases.
-Will likely revisit and run through with updates that add important metadata
-like exact references matches for older releases, after doing release
-merge/group cleanups.
-
-
- # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
- # had a trivial typo in fatcat_import.py, will push a fix
- export FATCAT_AUTH_WORKER_PUBMED=...
- time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
-Full run:
-
- fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
- [...]
- Command exited with non-zero status 2
- 1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
- 486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps
-
- => so apparently 2x tasks failed
- => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU
-
-Only received a single exception at:
-
- Jan 18, 2020 8:33:09 AM UTC
- /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
- MalformedExternalId: 10.4149/gpb¬_2017042
-
-Not sure what the other failure was... maybe an invalid filename or argument,
-before processing actually started? Or some failure (OOM) that prevented sentry
-reporting?
-
-Patch normal.py and re-run that single file:
-
- ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- [...]
- Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})
-
-Done!
-
-## Chocula
-
-Command:
-
- export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
- ./fatcat_import.py chocula /srv/fatcat/datasets/export_fatcat.2019-12-26.json
-
-Result:
-
- Counter({'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264, 'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0})
diff --git a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
deleted file mode 100644
index 56e88880..00000000
--- a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
+++ /dev/null
@@ -1,57 +0,0 @@
-
-On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
-metadata started. In the case of pubmed, updates are enabled, so that recently
-created DOI releases get updated with PMID and extra metadata.
-
-We also want to do last backfills of metadata since the last import up through
-the first day updated by the continuous harvester.
-
-
-## arxiv
-
-The previous date span was 2019-05-22 through 2019-12-20. This time we should
-do 2019-12-20 through today.
-
-First do metha update from last harvest through today, and grab the new daily files:
-
- metha-sync -format arXivRaw http://export.arxiv.org/oai2
-
- mkdir arxiv_20191220_20200319
- cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
- tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
- gzip arxiv_20191220_20200319.tar
-
-Then copy to fatcat server and run import:
-
- export FATCAT_AUTH_WORKER_ARXIV=...
-
- ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
- => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})
-
- fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
-
-Ran fairly quickly only some ~80-90k entities to process.
-
-## PubMed
-
-First, mirror update files from FTP, e.g. via lftp:
-
- mkdir -p /srv/fatcat/datasets/pubmed_updates
- lftp -e 'mirror -c /pubmed/updatefiles /srv/fatcat/datasets/pubmed_updates; bye' ftp://ftp.ncbi.nlm.nih.gov
-
-Inspect completed dates from kafka:
-
- kafkacat -b $KAFKA_BROKER -t fatcat-prod.ftp-pubmed-state -C
-
-Show dates and corresponding files:
-
- find /srv/fatcat/datasets/pubmed_updates -name "*html" | xargs cat | grep "Created" | sort
-
-For this bulk import, we used files pubmed20n1016.xml.gz (2019-12-16) up to pubmed20n1110.xml.gz (2020-03-06).
-
-To import the corresponding files, run:
-
- printf "%s\n" /srv/fatcat/datasets/pubmed_updates/pubmed20n{1016..1110}.xml.gz | shuf | \
- parallel -j16 'gunzip -c {} | ./fatcat_import.py pubmed --do-updates - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt'
-
-Import took 254 min, there were 1715427 PubmedArticle docs in these update files.
diff --git a/notes/bulk_edits/2020-03-23_jalc.md b/notes/bulk_edits/2020-03-23_jalc.md
deleted file mode 100644
index d63c3759..00000000
--- a/notes/bulk_edits/2020-03-23_jalc.md
+++ /dev/null
@@ -1,23 +0,0 @@
-
-2019-10-01 JaLC metadata snapshot: <https://archive.org/download/jalc-bulk-metadata-2019>
-
-Extracted .rdf file instead of piping it through zcat.
-
-Use correct bot:
-
- export FATCAT_AUTH_WORKER_JALC=blah
-
-Start small; do a random bunch (10k) single-threaded to pre-create containers:
-
- head -n100 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- shuf -n100 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- shuf -n10000 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
-Seemed like lots of individual containers getting added after repeating, so
-just going to import single-threaded to avoid duplicate container creation:
-
- cat /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- => Counter({'total': 8419745, 'exists': 6480683, 'insert': 1934082, 'skip': 4980, 'inserted.container': 134, 'update': 0})
-
-Had a bit fewer than 4,568,120 "doi_registrar:jalc" releases before this
-import, 6,502,202 after (based on `doi_registrar:jalc` query).
diff --git a/notes/bulk_edits/2020-08-05_chocula.md b/notes/bulk_edits/2020-08-05_chocula.md
deleted file mode 100644
index 78d2c9c3..00000000
--- a/notes/bulk_edits/2020-08-05_chocula.md
+++ /dev/null
@@ -1,17 +0,0 @@
-
-Using `journal-metadata-bot` and `chocula_fatcat_export.2020-07-31.json` export.
-
-Start small:
-
- head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2020-07-31.json | ./fatcat_import.py chocula --do-updates -
- => Counter({'total': 100, 'update': 67, 'exists': 25, 'exists-skip-update': 24, 'insert': 8, 'exists-by-issnl': 1, 'skip': 0})
-
-Full batch:
-
- time cat /srv/fatcat/datasets/chocula_fatcat_export.2020-07-31.json | ./fatcat_import.py chocula --do-updates -
-
- Counter({'total': 164950, 'update': 112074, 'exists': 37243, 'exists-skip-update': 35862, 'insert': 15633, 'exists-by-issnl': 1381, 'skip': 0})
-
- real 16m9.779s
- user 6m26.324s
- sys 0m16.088s
diff --git a/notes/bulk_edits/2020-09-02_file_meta.md b/notes/bulk_edits/2020-09-02_file_meta.md
deleted file mode 100644
index b0606f2d..00000000
--- a/notes/bulk_edits/2020-09-02_file_meta.md
+++ /dev/null
@@ -1,75 +0,0 @@
-
-Approximately 18 million file entities have only partial file metadata. All
-have a sha1 (hex), but many are missing file size, md5, mimetype, etc.
-
-At least a few thousand of these are additionally *not* `application/pdf`
-mimetype based on actually retrieving the file and sniffing the file type.
-These are added earlier to the catalog likely based on CDX mimetype, which is
-server-reported and can be incorrect.
-
-## QA Testing
-
- ./fatcat_import.py --editgroup-description-override "backfill of full file-level metadata for early-imported papers" file-meta -
- => Counter({'total': 1000, 'update': 1000, 'skip': 0, 'insert': 0, 'exists': 0})
-
- # Identical command, verifying that don't double-insert:
- => Counter({'total': 1000, 'skip-existing-complete': 1000, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})
-
-Two additional file-level cleanups should probably be done at the same time:
-
-Partial wayback URL timestamps, for cases where we have the full timestamped URL. Eg:
-
- https://web.archive.org/web/2017/http://www.geoamazonia.net/index.php/revista/article/download/51/pdf_38
- https://web.archive.org/web/20170922010835/http://www.geoamazonia.net/index.php/revista/article/download/51/pdf_38
- https://qa.fatcat.wiki/file/4udmm4zd4bgfhnaaycqoztgfgm
- https://qa.fatcat.wiki/file/k73il3k5hzemtnkqa5qyorg6ci
- https://qa.fatcat.wiki/file/7hstlrabfjb6vgyph7ntqtpkne
-
-Live-web URLs identical except for http/https flip or other trivial things (much less frequent case):
-
- http://eo1.gsfc.nasa.gov/new/validationReport/Technology/JoeCD/asner_etal_PNAS_20041.pdf
- https://eo1.gsfc.nasa.gov/new/validationReport/Technology/JoeCD/asner_etal_PNAS_20041.pdf
-
- http://homepages.math.uic.edu/~rosendal/PapersWebsite/BanachMinimalExamples.pdf
- http://homepages.math.uic.edu:80/~rosendal/PapersWebsite/BanachMinimalExamples.pdf
- https://qa.fatcat.wiki/file/h2wx6re5fjhx7c6duifzskeo6u
- https://qa.fatcat.wiki/file/vw7divmjwveftn4djj2cp32n4i
-
-Which bot to use? Let's do `sandcrawler-bot`.
-
-Trying a larger batch to see what database size increase is going to look like,
-and whether single-threaded is going to be too slow:
-
- # before: Size: 517.87G
-
- time zcat /srv/fatcat/datasets/fatcat_file_partial.file_meta.json.gz | head -n500000 | pv -l | ./fatcat_import.py --editgroup-description-override "backfill of full file-level metadata for early-imported papers" file-meta -
- => 145m18.615s
-
- # after: 518.47G
- # delta: 600 MB
-
-A million records would take about 5 hours, so 100 hours total, or 4 days. Let's do parallelism.
-
-Total size increase estimated as 24 GBytes. It all adds up!
-
- time zcat /srv/fatcat/datasets/fatcat_file_partial.file_meta.json.gz | tail -n500000 | pv -l | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
- => real 32m53.935s
-
-## Production Import
-
-Before Size: 624.63G
-
- export FATCAT_API_AUTH_TOKEN... # sandcrawler-bot
-
- # start small
- time zcat /srv/fatcat/datasets/fatcat_file_partial.file_meta.json.gz | pv -l | head -n1000 | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
-
- # full batch
- time zcat /srv/fatcat/datasets/fatcat_file_partial.file_meta.json.gz | pv -l | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
-
- => 18.1M 20:53:32 [ 241 /s]
-
- Counter({'total': 2234159, 'update': 2234111, 'skip-existing-complete': 48, 'skip': 0, 'insert': 0, 'exists': 0})
- (etc, 8x)
-
-After Size: 653.69G (+29GB or so)
diff --git a/notes/bulk_edits/2020-10-08_chocula.md b/notes/bulk_edits/2020-10-08_chocula.md
deleted file mode 100644
index d60b6842..00000000
--- a/notes/bulk_edits/2020-10-08_chocula.md
+++ /dev/null
@@ -1,44 +0,0 @@
-
-Another update of journal metadata. In this case due to expanding "Keepers"
-coverage to PKP PLN, Hathitrust, Scholar's Portal, and Carniniana.
-
-Using `journal-metadata-bot` and `chocula.2020-10-08.json` export.
-
-## QA Testing
-
- shuf -n1000 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
- Counter({'total': 1000, 'exists': 640, 'exists-skip-update': 532, 'update': 348, 'exists-not-found': 108, 'insert': 12, 'skip': 0})
-
-Expecting roughly a 1/3 update rate. Most of these seem to be true updates (eg,
-adding kbart metadata). A smaller fraction are just updating DOAJ timestamp or
-not updating any metadata at all.
-
- head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
- Counter({'total': 500, 'exists': 372, 'exists-skip-update': 328, 'update': 121, 'exists-not-found': 44, 'insert': 7, 'skip': 0})
-
- head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
- Counter({'total': 500, 'exists': 481, 'exists-skip-update': 430, 'exists-not-found': 44, 'update': 19, 'exists-by-issnl': 7, 'skip': 0, 'insert': 0})
-
-Made some changes in `27fe31d5ffcac700c30b2b10d56685ef0fa4f3a8` which seem to
-have removed the spurious null updates, while retaining DOAJ date-only updates.
-
-Also as a small nit notice that occasionally `kbart` metadata gets added with
-no year spans. This seems to be common with cariniana. Presumably this happens
-when there is no year span info available, only volumes. Seems like a valuable
-thing to include as a flag anyways.
-
-## Prod Import
-
-Start small:
-
- head -n100 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
- => Counter({'total': 100, 'exists': 69, 'exists-skip-update': 68, 'update': 30, 'insert': 1, 'exists-by-issnl': 1, 'skip': 0})
-
-Full batch:
-
- time cat /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
- => Counter({'total': 167092, 'exists': 110594, 'exists-skip-update': 109852, 'update': 55274, 'insert': 1224, 'exists-by-issnl': 742, 'skip': 0})
-
- real 10m45.714s
- user 4m51.680s
- sys 0m12.236s
diff --git a/notes/bulk_edits/2020-12-01_orcid.md b/notes/bulk_edits/2020-12-01_orcid.md
deleted file mode 100644
index b6883b17..00000000
--- a/notes/bulk_edits/2020-12-01_orcid.md
+++ /dev/null
@@ -1,55 +0,0 @@
-
-Another annual ORCID dump, basically the same as last year (2019). Expecting
-around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5
-million new creator entities.
-
-In particular motivated to run this import before a potential dblp import
-and/or creator creation run.
-
-Files download from:
-
-- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
-- <https://archive.org/details/orcid-dump-2020>
-
-## Prep
-
- wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
-
- java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz
-
- tar xvf ORCID_2020_10_summaries_json.tar.gz
-
- fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz
-
- zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz
-
- ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz
-
-## Import
-
-Fetch to prod machine:
-
- wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
- wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz
-
-Sample:
-
- export FATCAT_AUTH_WORKER_ORCID=[...]
- zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
- => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
-
-Bulk import:
-
- export FATCAT_AUTH_WORKER_ORCID=[...]
- time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
- => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
- => (8x of the above, roughly)
-
- real 88m40.960s
- user 389m35.344s
- sys 23m18.396s
-
-
- Before: Size: 673.36G
- After: Size: 675.55G
-
diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
deleted file mode 100644
index 5e897183..00000000
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ /dev/null
@@ -1,139 +0,0 @@
-
-## Earlier QA Testing (November 2020)
-
- export FATCAT_API_AUTH_TOKEN=... (FATCAT_AUTH_WORKER_DOAJ)
-
- # small test:
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | head | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
-
- # full run
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
-
- before: 519.17G
- after: 542.08G
-
-
- 5.45M 6:29:17 [ 233 /s]
-
- 12x of:
- Counter({'total': 455504, 'insert': 394437, 'exists': 60615, 'skip': 452, 'skip-title': 452, 'update': 0})
-
- total: ~5,466,048
- insert: ~4,733,244
- exists: ~727,380
-
-Initial imports (before crash) were like:
-
- Counter({'total': 9339, 'insert': 9330, 'skip': 9, 'skip-title': 9, 'update': 0, 'exists': 0})
-
-Seems like there is a bug, not finding existing by DOI?
-
-## Prod Container Metadata Update (chocula)
-
-Generic update of container metadata using chocula pipeline. Need to run this
-before DOAJ import to ensure we have all the containers already updated.
-
-Also updating ISSN-L index at the same time. Using a 2020-11-19 metadata
-snapshot, which was generated on 2020-12-07; more recent snapshots had small
-upstream changes in some formats so it wasn't trivial to run with a newer
-snapshot.
-
- # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485
-
- # from laptop, then unzip on prod machine
- scp chocula_fatcat_export.2020-11-19.json.gz fatcat-prod1-vm:/srv/fatcat/datasets/
-
- # check ISSN-L symlink
- # ISSN-to-ISSN-L.txt -> 20201119.ISSN-to-ISSN-L.txt
-
- export FATCAT_AUTH_WORKER_JOURNAL_METADATA=...
- head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula -
- Counter({'total': 200, 'exists': 200, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0, 'update': 0})
-
- head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula - --do-updates
- Counter({'total': 200, 'exists': 157, 'exists-skip-update': 151, 'update': 43, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0})
-
-Some of these are very minor updates, so going to do just creation (no
-`--do-updates`) to start.
-
- time ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json
- Counter({'total': 168165, 'exists': 167497, 'exists-by-issnl': 2371, 'insert': 668, 'skip': 0, 'update': 0})
-
- real 5m37.081s
- user 3m1.648s
- sys 0m9.488s
-
-TODO: tweak chocula import script to not update on `extra.state` metadata.
-
-
-## Release Metadata Bulk Import
-
-This is the first production bulk import of DOAJ metadata!
-
- # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485
- # DB before: Size: 678.15G
-
- # ensure fatcatd is updated to have support for DOAJ identifier
-
- # create new bot user
- ./target/release/fatcat-auth create-editor --admin --bot doaj-bot
- => mir5imb3v5ctxcaqnbstvmri2a
-
- ./target/release/fatcat-auth create-token mir5imb3v5ctxcaqnbstvmri2a
- => ...
-
- # download dataset
- wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13.sample_10k.json.gz
- wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13_all.json.gz
-
- export FATCAT_AUTH_WORKER_DOAJ=...
-
- # start small
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
- => Counter({'total': 100, 'exists': 70, 'insert': 30, 'skip': 0, 'update': 0})
-
-That is about expected, in terms of fraction without DOI. However, 6 out of 10
-(randomly checked) of the inserted releases seem to be dupes, which feels too
-high. So going to pause this import until basic fuzzy matching ready from
-Martin's fuzzycat work, and will check against elasticsearch before import.
-Will shuffle the entire file, import in a single thread, and just skip
-importing if there is any fuzzy match (not try to merge/update). Expecting
-about 500k new releases after such filtering.
-
-Ok, on 2020-12-17, back with patches to use fuzzycat in filtering. Trying
-another batch:
-
- # git rev: 60e022609cd3fbbf9634577149018592e680858d
- # DB before: Size: 678.47G
-
- export FATCAT_AUTH_WORKER_DOAJ=...
-
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n1000 | tail -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
- => Counter({'total': 100, 'exists': 71, 'insert': 19, 'exists-fuzzy': 10, 'skip': 0, 'update': 0})
-
- # https://fatcat.wiki/changelog/5033496
-
-Sampled 10x of these and they look much better: no obvious duplication. Going
-ahead with the full import; note that other ingest is happening in parallel
-(many crossref, datacite, and pubmed imports which backed up).
-
- # full run
- # note the shuf command added, in an attempt to reduce duplicates within this corpus
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
-
- # started 2020-12-17 22:01 (Pacific)
-
- => 5.45M 52:38:45 [28.8 /s]
- => Counter({'total': 1366458, 'exists': 1020295, 'insert': 200249, 'exists-fuzzy': 144334, 'skip': 1563, 'skip-title': 1563, 'skip-doaj-id-mismatch': 17, 'update': 0})
-
-As total estimates:
-
-- total: 5,465,832
-- exists: 4,081,180
-- exists-fuzzy: 577,336
-- insert: 800,996
-
-Ending database size: Size: 684.08G
-
-(note that regular imports were running during same period)
-
diff --git a/notes/bulk_edits/2020-12-23_dblp.md b/notes/bulk_edits/2020-12-23_dblp.md
deleted file mode 100644
index a33411cb..00000000
--- a/notes/bulk_edits/2020-12-23_dblp.md
+++ /dev/null
@@ -1,55 +0,0 @@
-
-## Prod Container Import
-
-Using 2020-11-30 XML dump, then scrape and transform tooling from
-`extra/dblp/`.
-
- wget https://archive.org/download/dblp-xml-2020-11-30/dblp_container_meta.json
-
- # updated ISSN-to-ISSN-L.txt symlink to 20201207.ISSN-to-ISSN-L.txt
-
- touch /srv/fatcat/datasets/blank_dblp_containers.tsv
-
-Create new `dblp-bot` user:
-
- ./target/release/fatcat-auth create-editor --admin --bot dblp-bot
- => gwbheb5jfngrxkcad5qgth5cra
-
- ./target/release/fatcat-auth create-token gwbheb5jfngrxkcad5qgth5cra
-
-Run import:
-
- # git commit: ec6b366af8df1956e1287cba2e0818b80ce1c518
-
- export FATCAT_AUTH_WORKER_DBLP=...
-
- ./fatcat_import.py dblp-container --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --dblp-container-map-file /srv/fatcat/datasets/blank_dblp_containers.tsv --dblp-container-map-output /srv/fatcat/datasets/all_dblp_containers.tsv /srv/fatcat/datasets/dblp_container_meta.json
- => Got 0 existing dblp container mappings.
- => Counter({'total': 6954, 'insert': 5202, 'exists': 1752, 'skip': 0, 'update': 0})
-
- wc -l /srv/fatcat/datasets/all_dblp_containers.tsv
- 6955 /srv/fatcat/datasets/all_dblp_containers.tsv
-
-## Prod Release Import
-
-Using same 2020-11-30 XML dump. Download to /srv/fatcat/datasets:
-
- wget https://archive.org/download/dblp-xml-2020-11-30/dblp.dtd
- wget https://archive.org/download/dblp-xml-2020-11-30/dblp.xml
-
-Run import:
-
- export FATCAT_AUTH_WORKER_DBLP=...
-
- ./fatcat_import.py dblp-release --dblp-container-map-file /srv/fatcat/datasets/all_dblp_containers.tsv /srv/fatcat/datasets/dblp.xml --do-updates
-
- # started 2020-12-23 11:51 (Pacific)
-
- # restarted/tweaked at least twice
-
- # finally ended around 2020-12-27 after about... 48 hours?
-
- => Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 3097418, 'skip-key-type': 2640968, 'skip-update': 2480449, 'exists': 943800, 'update': 889700, 'insert': 338842, 'skip-arxiv-corr': 312872, 'exists-fuzzy': 203103, 'skip-dblp-container-missing': 143578, 'skip-arxiv': 53, 'skip-title': 1})
-
-Starting database size (roughly): Size: 684.08G
-Ending database size: Size: 690.22G
diff --git a/notes/bulk_edits/2020_datacite.md b/notes/bulk_edits/2020_datacite.md
deleted file mode 100644
index 05d09517..00000000
--- a/notes/bulk_edits/2020_datacite.md
+++ /dev/null
@@ -1,152 +0,0 @@
-
-
-## QA Runs
-
-Trying on 2019-12-22, using Martin commit 18d411087007a30fbf027b87e30de42344119f0c from 2019-12-20.
-
-Quick test:
-
- # this branch adds some new deps, so make sure to install them
- pipenv install --deploy --dev
- pipenv shell
- export FATCAT_AUTH_WORKER_DATACITE="..."
- xzcat /srv/fatcat/datasets/datacite.ndjson.xz | head -n100 | ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
-ISSUE: `--extid-map-file` not passed through, so drop the:
-
- --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
-ISSUE: auth_var should be FATCAT_AUTH_WORKER_DATACITE
-
-Test full parallel command:
-
- export FATCAT_AUTH_WORKER_DATACITE="..."
- time xzcat /srv/fatcat/datasets/datacite.ndjson.xz | head -n10000 | parallel -j20 --round-robin --pipe ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
- real 0m30.017s
- user 3m5.576s
- sys 0m19.640s
-
-Whole lot of:
-
- invalid literal for int() with base 10: '10,495'
- invalid literal for int() with base 10: '11,129'
-
- invalid literal for int() with base 10: 'n/a'
- invalid literal for int() with base 10: 'n/a'
-
- invalid literal for int() with base 10: 'OP98'
- invalid literal for int() with base 10: 'OP208'
-
- no mapped type: None
- no mapped type: None
- no mapped type: None
-
-Re-ran above:
-
- real 0m27.764s
- user 3m2.448s
- sys 0m12.908s
-
-Compare with `--lang-detect`:
-
- real 0m27.395s
- user 3m5.620s
- sys 0m13.344s
-
-Not noticeable?
-
-Whole run:
-
- export FATCAT_AUTH_WORKER_DATACITE="..."
- time xzcat /srv/fatcat/datasets/datacite.ndjson.xz | parallel -j20 --round-robin --pipe ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
- real 35m21.051s
- user 98m57.448s
- sys 7m9.416s
-
-Huh. Kind of suspiciously fast.
-
- select count(*) from editgroup where editor_id='07445cd2-cab2-4da5-9f84-34588b7296aa';
- => 9952 editgroups
-
- select count(*) from release_edit inner join editgroup on release_edit.editgroup_id = editgroup.id where editgroup.editor_id='07445cd2-cab2-4da5-9f84-34588b7296aa';
- => 496,342 edits
-
-While running:
-
- starting around 5k TPS in pg_activity
- starting size: 367.58G
- (this is after arxiv and some other changes on top of 2019-12-13 dump)
- host doing a load average of about 5.5; fatcatd at 115% CPU
-
- ending size: 371.43G
-
-Actually seems like extremely few DOIs getting inserted? Hrm.
-
- xzcat /srv/fatcat/datasets/datacite.ndjson.xz | wc -l
- => 18,210,075
-
-Last DOIs inserted were around: 10.7916/d81v6rqr
-
-Suspect a bunch of errors or something and output getting mangled by all the
-logging? Squelched logging and running again (using same DB/config), except
-with `pv -l` inserted after `xzcat`.
-
-Seem to run at a couple hundred records a second (very volatile).
-
- Counter({'total': 42919, 'insert': 21579, 'exists': 21334, 'skip': 6, 'skip-blank-title': 6, 'inserted.container': 1, 'update': 0})
- Counter({'total': 43396, 'insert': 23274, 'exists': 20120, 'skip-blank-title': 2, 'skip': 2, 'update': 0})
-
-Ok! The actual errors:
-
-
- Traceback (most recent call last):
- File "./fatcat_import.py", line 507, in <module>
- main()
- File "./fatcat_import.py", line 504, in main
- args.func(args)
- File "./fatcat_import.py", line 182, in run_datacite
- JsonLinePusher(dci, args.json_file).run()
- File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 559, in run
- self.importer.push_record(record)
- File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 318, in push_record
- entity = self.parse_record(raw_record)
- File "/srv/fatcat/src/python/fatcat_tools/importers/datacite.py", line 447, in parse_record
- sha1 = hashlib.sha1(text.encode('utf-8')).hexdigest()
- AttributeError: 'list' object has no attribute 'encode'
-
- fatcat_openapi_client.exceptions.ApiException: (400)
- Reason: Bad Request
- HTTP response headers: HTTPHeaderDict({'Content-Length': '186', 'Content-Type': 'application/json', 'Date': 'Mon, 23 Dec 2019 08:12:16 GMT', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'X-Span-ID': '73b0b698-bf88-4721-b869-b322dbe90cbe'})
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.17167/mksz.2017.2.129–155"}
-
-
- Traceback (most recent call last):
- File "./fatcat_import.py", line 507, in <module>
- main()
- File "./fatcat_import.py", line 504, in main
- args.func(args)
- File "./fatcat_import.py", line 182, in run_datacite
- JsonLinePusher(dci, args.json_file).run()
- File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 559, in run
- self.importer.push_record(record)
- File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 318, in push_record
- entity = self.parse_record(raw_record)
- File "/srv/fatcat/src/python/fatcat_tools/importers/datacite.py", line 447, in parse_record
- sha1 = hashlib.sha1(text.encode('utf-8')).hexdigest()
- AttributeError: 'list' object has no attribute 'encode'
-
-
- fatcat_openapi_client.exceptions.ApiException: (400)
- Reason: Bad Request
- HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Span-ID': 'ca141ff4-83f7-4ee5-9256-91b23ec09e94', 'Content-Length': '188', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'Date': 'Mon, 23 Dec 2019 08:11:25 GMT'})
- HTTP response body: {"success":false,"error":"ConstraintViolation","message":"unexpected database error: new row for relation \"release_contrib\" violates check constraint \"release_contrib_raw_name_check\""}
-
-## Prod Import
-
-Around first/second week of january. Needed to restart at least once due to
-database deadlock on abstract inserts, which seems to be due to parallelism and
-duplicated records in the bulk datacite dump.
-
-TODO: specific command used by martin
diff --git a/notes/bulk_edits/2021-05-28_dblp.md b/notes/bulk_edits/2021-05-28_dblp.md
deleted file mode 100644
index 061f4f45..00000000
--- a/notes/bulk_edits/2021-05-28_dblp.md
+++ /dev/null
@@ -1,44 +0,0 @@
-
-## Container Import
-
-Following dblp README directions:
-
- export DBLP_DIR=/srv/fatcat/tasks/202105_dblp
-
- ./fatcat_import.py dblp-release $DBLP_DIR/dblp.xml --dump-json-mode | pv -l > $DBLP_DIR/dblp_releases.json
- => Counter({'total': 8328073, 'skip': 8328073, 'has-doi': 4478439, 'skip-key-type': 2764750, 'skip-arxiv-corr': 348766, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})
- => 5.21M 3:38:35 [ 397 /s]
-
- cat $DBLP_DIR/dblp_releases.json | jq ._dblp_prefix -r | grep -v ^null | sort -u > $DBLP_DIR/prefix_list.txt
-
- wc -l $DBLP_DIR/prefix_list.txt
- => 7603 /srv/fatcat/tasks/202105_dblp/prefix_list.txt
-
- mkdir -p journals
- mkdir -p conf
- mkdir -p series
-
- shuf $DBLP_DIR/prefix_list.txt | pv -l | parallel -j1 wget -nc -q "https://dblp.org/db/{}/index.html" -O {}.html
-
- # switch to temporary selectolax pipenv
- fd html conf/ journals/ series/ | /srv/fatcat/src/extra/dblp/dblp_html_extract.py | pv -l > dblp_container_meta.json
- => 7.08k 0:00:15 [ 449 /s]
-
- fatcat-cli search containers dblp_prefix:* -n 0 --index-json | jq "[.dblp_prefix, .ident] | @tsv" -r | pv -l > existing_dblp_containers.tsv
- => Got 5202 hits in 47ms
- => 5.20k 0:00:13 [ 375 /s]
-
- ./fatcat_import.py dblp-container --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --dblp-container-map-file $DBLP_DIR/existing_dblp_containers.tsv --dblp-container-map-output $DBLP_DIR/all_dblp_containers.tsv $DBLP_DIR/dblp_container_meta.json
- => Counter({'total': 7083, 'exists': 7025, 'insert': 58, 'skip': 0, 'update': 0})
- => actually 108 inserted
-
-Actually imported 50 more before this, then Ctrl-C to check. Then re-did
-fatcat-cli query, upload, and re-ran all. So 108 new containers inserted.
-
-## Release Import
-
-With same exports as above:
-
- ./fatcat_import.py dblp-release --dblp-container-map-file $DBLP_DIR/all_dblp_containers.tsv $DBLP_DIR/dblp.xml
- => Counter({'total': 8328073, 'exists': 4847353, 'has-doi': 4478439, 'skip': 3259925, 'skip-key-type': 2764750, 'skip-arxiv-corr': 348766, 'exists-fuzzy': 202880, 'skip-dblp-container-missing': 146408, 'insert': 17862, 'skip-arxiv': 53, 'skip-title': 1, 'update': 0})
-
diff --git a/notes/bulk_edits/2021-05-28_doaj.md b/notes/bulk_edits/2021-05-28_doaj.md
deleted file mode 100644
index e5925eeb..00000000
--- a/notes/bulk_edits/2021-05-28_doaj.md
+++ /dev/null
@@ -1,80 +0,0 @@
-
-Note: running 2021-05-28 pacific time, but 2021-05-29 UTC.
-
-First downloaded bulk metadata from doaj.org and uploaded into archive.org item
-as a snapshot.
-
-## Journal Metadata Import
-
-Before doing article import, want to ensure journals all exist.
-
-Use chocula pipeline, and to be simple/conservative don't update any
-containers, just create if they don't already exist.
-
-Run the usual chocula source update, copy to data dir, update sources.toml,
-etc. Didn't bother with updating container counts or homepage status. Then
-export container schema:
-
- python -m chocula export_fatcat | gzip > chocula_fatcat_export.2021-05-28.json.gz
-
-Upload to fatcat prod machine, unzip, and then import to prod:
-
- export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
- ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2021-05-28.json
- => Counter({'total': 175837, 'exists': 170358, 'insert': 5479, 'exists-by-issnl': 5232, 'skip': 0, 'update': 0})
-
-That is a healthy batch of new records!
-
-## Article Import
-
-Transform all the articles into a single JSON file:
-
- cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2021-05-25_all.json.gz
- => 6.1M 0:18:45 [5.42k/s]
-
- zcat doaj_article_data_2021-05-25_all.json.gz | shuf -n10000 > doaj_article_data_2021-05-25_sample_10k.json
-
-Also upload this `_all` file to archive.org item.
-
-Ready to import! Start with sample:
-
- export FATCAT_AUTH_WORKER_DOAJ=...
- zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_sample_10k.json.gz | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
- => Counter({'total': 10000, 'exists': 8743, 'exists-fuzzy': 1044, 'insert': 197, 'skip': 14, 'skip-title': 14, 'skip-doaj-id-mismatch': 2, 'update': 0})
-
-Then the full import, in parallel, shuffled (because we shuffled last time):
-
- zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
- => Counter({'total': 512351, 'exists': 449996, 'exists-fuzzy': 50858, 'insert': 10826, 'skip': 551, 'skip-title': 551, 'skip-doaj-id-mismatch': 120, 'update': 0})
- => extrapolating, about 129,912 new release entities. 2.1% insert rate
-
-NOTE: large number of warnings like:
-
- UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=5d2ebb760ad24ce68ec8079bc82c8d78 ident=dvtl2xpn4nespfrj6gad6mrk44
-
-## Extra Article Imports
-
-Manually disabled fuzzy matching with patch:
-
- diff --git a/python/fatcat_import.py b/python/fatcat_import.py
- index 1dcfec2..cb787cb 100755
- --- a/python/fatcat_import.py
- +++ b/python/fatcat_import.py
- @@ -260,6 +260,7 @@ def run_doaj_article(args):
- args.issn_map_file,
- edit_batch_size=args.batch_size,
- do_updates=args.do_updates,
- + do_fuzzy_match=False,
- )
- if args.kafka_mode:
- KafkaJsonPusher(
-
-Filtered out some specific articles:
-
- zcat doaj_article_data_2021-05-25_all.json.gz | rg 1665-1596 | pv -l > doaj_article_data_2021-05-25_voces.json
- => 154 0:02:05 [1.22 /s]
-
-And used this for some imports:
-
- cat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_voces.json | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
-
diff --git a/notes/bulk_edits/2021-11-10_case_sensitive_dois.md b/notes/bulk_edits/2021-11-10_case_sensitive_dois.md
deleted file mode 100644
index 20772f56..00000000
--- a/notes/bulk_edits/2021-11-10_case_sensitive_dois.md
+++ /dev/null
@@ -1,53 +0,0 @@
-
-## Production Run
-
-Start small:
-
- export FATCAT_AUTH_WORKER_CLEANUP=[...]
-
- wc -l /srv/fatcat/datasets/nonlowercase_doi_releases.tsv
- # 140530
-
- head -n100 /srv/fatcat/datasets/nonlowercase_doi_releases.tsv \
- | python -m fatcat_tools.cleanups.release_lowercase_doi -
- # Counter({'total': 100, 'update': 100, 'skip': 0, 'insert': 0, 'exists': 0})
-
- # same command again to test not duping updates
- Counter({'total': 100, 'skip-existing-doi-fine': 100, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})
-
- # example editgroup_cld5qe34bzg7xg7g4cz5skgaw4
-
-Database size just before, while some other edits happening, PostgreSQL 11.6: 762.66G
-
-Ok, run a bunch in parallel:
-
- cat /srv/fatcat/datasets/nonlowercase_doi_releases.tsv \
- | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.release_lowercase_doi -
- # Counter({'total': 24022, 'update': 24022, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 38836, 'update': 38836, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 38836, 'update': 38836, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 38836, 'update': 38736, 'skip-existing-doi-fine': 100, 'skip': 0, 'insert': 0, 'exists': 0})
-
-Over 3k TPS in `pg_activity`.
-
-Should have included `pv -l` in the pipeline.
-
-Final database size 763.14G, so only a couple hundred MByte of growth, totally
-fine.
-
-
-## Verification
-
-Re-dump release extids, in production:
-
- sudo -u postgres psql fatcat_prod < dump_release_extid.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz > /srv/fatcat/snapshots/release_extid.tsv.gz
-
-Filter to non-lowercase DOIs:
-
- zcat release_extid.tsv.gz \
- | cut -f1,3 \
- | rg '[A-Z]' \
- | pv -l \
- > nonlowercase_doi.tsv
-
-Zero returned, hurray!
diff --git a/notes/bulk_edits/2021-11-10_file_release_ingest_bugfix.md b/notes/bulk_edits/2021-11-10_file_release_ingest_bugfix.md
deleted file mode 100644
index 6b5deb63..00000000
--- a/notes/bulk_edits/2021-11-10_file_release_ingest_bugfix.md
+++ /dev/null
@@ -1,108 +0,0 @@
-
-## Production Run
-
-Start small:
-
- export FATCAT_AUTH_WORKER_CLEANUP=[...]
-
- wc -l /srv/fatcat/datasets/file_release_bugfix_20211105.json
- 228826
-
- head -n100 /srv/fatcat/datasets/file_release_bugfix_20211105.json \
- | python -m fatcat_tools.cleanups.file_release_bugfix -
- # Counter({'total': 100, 'update': 100, 'skip': 0, 'insert': 0, 'exists': 0})
-
- # example editgroup_keae3rfekffuriiy77f26rf6uq
-
-These are all now stubs (no release associated), which isn't the ratio seen in QA. Going to do a random sample:
-
- shuf -n100 /srv/fatcat/datasets/file_release_bugfix_20211105.json \
- | python -m fatcat_tools.cleanups.file_release_bugfix -
- # Counter({'total': 100, 'update': 100, 'skip': 0, 'insert': 0, 'exists': 0})
-
- # example editgroup_34mk525kxvdu3hak7g7fr7awru
-
-Looked at a few and all looked more like what would be expected, correct matches.
-
-Comparing before and after counts is going to be tricky, and will require a
-full re-index for an accurate count. But did do a snapshot just before this run
-(2021-11-10-prod-stats.json), and got 31,110,184 `in_web`.
-
-Full edit, in parallel:
-
- cat /srv/fatcat/datasets/file_release_bugfix_20211105.json \
- | pv -l \
- | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_release_bugfix -
- # 228k 0:26:34 [ 143 /s]
- # Counter({'total': 26090, 'update': 26071, 'skip-existing-fixed': 15, 'skip': 4, 'skip-wrong-release-is-ok': 4, 'insert': 0, 'exists': 0})
- # Counter({'total': 26080, 'update': 26061, 'skip-existing-fixed': 12, 'skip': 7, 'skip-wrong-release-is-ok': 7, 'insert': 0, 'exists': 0})
- # Counter({'total': 27517, 'update': 27497, 'skip-existing-fixed': 15, 'skip': 5, 'skip-wrong-release-is-ok': 5, 'insert': 0, 'exists': 0})
- # Counter({'total': 29534, 'update': 29420, 'skip-existing-fixed': 110, 'skip': 4, 'skip-wrong-release-is-ok': 4, 'insert': 0, 'exists': 0})
- # Counter({'total': 29544, 'update': 29517, 'skip-existing-fixed': 16, 'skip': 11, 'skip-wrong-release-is-ok': 11, 'insert': 0, 'exists': 0})
- # Counter({'total': 29535, 'update': 29518, 'skip-existing-fixed': 10, 'skip': 7, 'skip-wrong-release-is-ok': 7, 'insert': 0, 'exists': 0})
- # Counter({'total': 30082, 'update': 30065, 'skip-existing-fixed': 13, 'skip': 4, 'skip-wrong-release-is-ok': 4, 'insert': 0, 'exists': 0})
- # Counter({'total': 30444, 'update': 30420, 'skip-existing-fixed': 21, 'skip': 3, 'skip-wrong-release-is-ok': 3, 'insert': 0, 'exists': 0})
-
-## Verification
-
-Counts:
-
- SELECT file_edit.extra_json->>'ingest_request_source' as source, COUNT(*) as broken_files
- FROM file_edit
- LEFT JOIN file_ident ON file_edit.ident_id = file_ident.id
- LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
- LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
- LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
- WHERE
- file_edit.extra_json->>'link_source_id' IS NOT NULL
- AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
- AND lower(release_rev.doi) != lower(file_edit.extra_json->>'link_source_id')
- AND file_ident.rev_id = file_edit.rev_id
- GROUP BY file_edit.extra_json->>'ingest_request_source';
-
- source | broken_files
- -----------+--------------
- unpaywall | 233
- (1 row)
-
-Examples:
-
- SELECT file_edit.ident_id as file_ident, release_ident.id as release_ident, file_edit.extra_json->>'link_source_id' as file_edit_doi, release_rev.doi as release_doi
- FROM file_edit
- LEFT JOIN file_ident ON file_edit.ident_id = file_ident.id
- LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
- LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
- LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
- WHERE
- file_edit.extra_json->>'link_source_id' IS NOT NULL
- AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
- AND lower(release_rev.doi) != lower(file_edit.extra_json->>'link_source_id')
- AND file_ident.rev_id = file_edit.rev_id
- LIMIT 20;
-
-
-Looks like many of the remaining mismatches are from "double-slash" normalization, with doi prefix 10.1037:
-
- file_ident | release_ident | file_edit_doi | release_doi
- --------------------------------------+--------------------------------------+------------------------------+------------------------------
- ae2f7864-66a6-4a82-a0e6-153cb4d0b03a | 0f436ae6-d7b4-4a45-a434-d158bc4a3437 | 10.1037/0096-1523.25.6.1568 | 10.1037//0096-1523.25.6.1568
- d02ff5ab-a882-4a86-8a94-ce6222708323 | 2d5ebbca-e4ba-4bb7-bb19-f1e081479eab | 10.1037//0021-9010.63.4.467 | 10.1037/0021-9010.63.4.467
- 2c107387-f57f-4855-bc1a-e40704f1e9b4 | 7654b956-4776-4f6f-bc35-ccf7e6bfe99c | 10.1037/0022-0663.75.4.500 | 10.1037//0022-0663.75.4.500
- 15e3636a-4bcf-4595-8a2a-b6b06a299a2f | c09e3531-1ac4-4bfa-9fcf-8acb9f0d845e | 10.1037//1064-1297.8.2.225 | 10.1037/1064-1297.8.2.225
- dc8b86c8-9b8e-4333-abbb-8811010d9c71 | bd91e7be-c360-47af-a634-f048e2c85b73 | 10.1037//0021-843x.105.4.637 | 10.1037/0021-843x.105.4.637
- 35a06e0a-6f72-4624-87ca-fbb74bc9d77d | 96befa26-6eb0-47c0-a0ec-e00282e33bff | 10.1037//0735-7044.99.5.964 | 10.1037/0735-7044.99.5.964
- 707bfaa1-65de-4dbb-9786-51b99d03d91d | 2d58524b-4216-4092-8ddf-336ac42d5955 | 10.1037/0096-1523.28.3.515 | 10.1037//0096-1523.28.3.515
- de9ea98f-672e-44ec-9d12-e11acd8990d0 | 20f1a857-ad51-4b80-9ce5-bc3a44df96b1 | 10.1037//0002-9432.71.1.72 | 10.1037/0002-9432.71.1.72
- 4275306c-11ef-4fce-bc03-3f1efe99f9a6 | c69bc740-4da1-4f96-acc9-151a0cef5c3f | 10.1037//1064-1297.6.1.107 | 10.1037/1064-1297.6.1.107
- 6a63d2ae-b953-48ba-a68a-061543d82ad4 | e3c8b8c1-defc-44ac-8c73-e21e8cf93f5c | 10.1037//0022-0167.23.6.557 | 10.1037/0022-0167.23.6.557
- 2fcbb54e-8fa8-4bbc-a2ae-4b6b6eaff412 | 8500b4a5-a693-4415-b4a4-4dcfb3403d82 | 10.1037/0021-9010.73.1.68 | 10.1037//0021-9010.73.1.68
- b9aa4601-4a1b-4146-aa6b-a410d0fc3dce | 954f2072-8c53-41c7-82b0-8c6fe9ef4d0c | 10.1037//0278-6133.13.4.319 | 10.1037/0278-6133.13.4.319
- b528b924-0680-43f3-81ad-d822e51b3373 | 69387969-40bc-451d-b567-8713296f60b0 | 10.1037//0002-9432.71.1.38 | 10.1037/0002-9432.71.1.38
- f64f1ee2-b787-4a06-87ab-46b94f9d5454 | c082d47f-175d-456a-a741-650b5eaa5173 | 10.1037//0021-843x.98.4.487 | 10.1037/0021-843x.98.4.487
- 86e8b655-963a-4c11-ae70-a8d528400682 | 6381254c-e339-4354-b0d8-711b5b5e4fcc | 10.1037/0022-0663.89.1.183 | 10.1037//0022-0663.89.1.183
- 716e1761-8120-480b-b096-e7698c65456a | 7a4d2c7d-32b7-4292-adbf-b791387a3ac5 | 10.1037//0278-7393.21.5.1209 | 10.1037/0278-7393.21.5.1209
- bb7aa131-d5e5-497e-8040-b1729850b94c | 1ce29f33-d020-4d5f-a8b3-2b8bef53ccb8 | 10.1037//1040-3590.7.4.533 | 10.1037/1040-3590.7.4.533
- 510ad392-43aa-42dc-9644-9697f425efd5 | 796c92ca-767a-495b-ad0c-f458381c071c | 10.1037//0278-7393.20.4.824 | 10.1037/0278-7393.20.4.824
- 32c57e68-0793-4ded-bca2-d05f3532ff3e | 1567a003-0b5a-48bb-bb7c-7dff2c44b90b | 10.1037//1040-3590.13.1.59 | 10.1037/1040-3590.13.1.59
- e0d1fd38-17d8-42ac-b9df-60312829ddd4 | 0cefaf5d-fcdf-4049-a9d7-0c569096478e | 10.1037//0022-006x.56.4.621 | 10.1037/0022-006x.56.4.621
- (20 rows)
diff --git a/notes/bulk_edits/2021-11-11_wayback_short_ts.md b/notes/bulk_edits/2021-11-11_wayback_short_ts.md
deleted file mode 100644
index 20349f0c..00000000
--- a/notes/bulk_edits/2021-11-11_wayback_short_ts.md
+++ /dev/null
@@ -1,52 +0,0 @@
-
-## Production Run
-
-At git commit `6ad9d24e4d7d901d6fc394e6e91575f6acba7ff4`.
-
-Start small:
-
- export FATCAT_AUTH_WORKER_CLEANUP=[...]
-
- zcat /srv/fatcat/datasets/files_20211105_moreshortts.fetched.json.gz \
- | head -n100 \
- | python -m fatcat_tools.cleanups.file_short_wayback_ts -
- # Counter({'total': 100, 'update': 99, 'skip-bad-wayback-timestamp': 1, 'skip': 0, 'insert': 0, 'exists': 0})
-
-Looks good! Run the full batch.
-
- zcat /srv/fatcat/datasets/files_20211105_moreshortts.fetched.json.gz \
- | pv -l \
- | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_short_wayback_ts -
-
- [...]
- bad replacement URL: partial_ts=2017 original=https://www.hydrol-earth-syst-sci.net/21/4959/2017/hess-21-4959-2017.pdf fix_url=https://web.archive.org/web/20180721004954/https://www.hydrol-earth-syst-sci.net/21/4959/2017/hess-21-4959-2017.pdf
- bad replacement URL: partial_ts=2017 original=https://www.the-cryosphere.net/11/1537/2017/tc-11-1537-2017.pdf fix_url=https://web.archive.org/web/20180719235703/https://www.the-cryosphere.net/11/1537/2017/tc-11-1537-2017.pdf
- bad replacement URL: partial_ts=2017 original=http://www.growingscience.com/msl/Vol7/msl_2017_26.pdf fix_url=https://web.archive.org/web/20180601235059/http://www.growingscience.com/msl/Vol7/msl_2017_26.pdf
- bad replacement URL: partial_ts=2017 original=https://www.hydrol-earth-syst-sci.net/21/4115/2017/hess-21-4115-2017.pdf fix_url=https://web.archive.org/web/20180719162956/https://www.hydrol-earth-syst-sci.net/21/4115/2017/hess-21-4115-2017.pdf
- bad replacement URL: partial_ts=2017 original=https://www.biogeosciences.net/14/4279/2017/bg-14-4279-2017.pdf fix_url=https://web.archive.org/web/20180720220056/https://www.biogeosciences.net/14/4279/2017/bg-14-4279-2017.pdf
- bad replacement URL: partial_ts=2017 original=https://www.biogeosciences.net/14/3669/2017/bg-14-3669-2017.pdf fix_url=https://web.archive.org/web/20180720222828/https://www.biogeosciences.net/14/3669/2017/bg-14-3669-2017.pdf
- [...]
- bad replacement URL: partial_ts=2017 original=http://www.growingscience.com/msl/Vol7/msl_2017_28.pdf fix_url=https://web.archive.org/web/20180602071632/http://www.growingscience.com/msl/Vol7/msl_2017_28.pdf
- bad replacement URL: partial_ts=2017 original=https://www.biogeosciences.net/14/4161/2017/bg-14-4161-2017.pdf fix_url=https://web.archive.org/web/20180720004438/https://www.biogeosciences.net/14/4161/2017/bg-14-4161-2017.pdf
- bad replacement URL: partial_ts=2017 original=https://core.ac.uk/download/pdf/10915563.pdf fix_url=https://web.archive.org/web/20190220174144/https://core.ac.uk/download/pdf/10915563.pdf
- bad replacement URL: partial_ts=2017 original=http://www.growingscience.com/ijiec/Vol9/IJIEC_2017_24.pdf fix_url=https://web.archive.org/web/20180602094300/http://www.growingscience.com/ijiec/Vol9/IJIEC_2017_24.pdf
- bad replacement URL: partial_ts=2017 original=https://core.ac.uk/download/pdf/36046645.pdf fix_url=https://web.archive.org/web/20190220175351/https://core.ac.uk/download/pdf/36046645.pdf
- bad replacement URL: partial_ts=2017 original=https://core.ac.uk/download/pdf/35085886.pdf fix_url=https://web.archive.org/web/20190220175410/https://core.ac.uk/download/pdf/35085886.pdf
- bad replacement URL: partial_ts=2017 original=https://www.atmos-chem-phys.net/17/10349/2017/acp-17-10349-2017.pdf fix_url=https://web.archive.org/web/20181102190649/https://www.atmos-chem-phys.net/17/10349/2017/acp-17-10349-2017.pdf
- bad replacement URL: partial_ts=2017 original=https://www.atmos-chem-phys.net/17/7775/2017/acp-17-7775-2017.pdf fix_url=https://web.archive.org/web/20181101041355/https://www.atmos-chem-phys.net/17/7775/2017/acp-17-7775-2017.pdf
- bad replacement URL: partial_ts=2017 original=http://www.veterinaryworld.org/Vol.10/March-2017/5.pdf fix_url=https://web.archive.org/web/20180721074940/http://www.veterinaryworld.org/Vol.10/March-2017/5.pdf
- bad replacement URL: partial_ts=2017 original=https://www.ann-geophys.net/35/189/2017/angeo-35-189-2017.pdf fix_url=https://web.archive.org/web/20180625214916/https://www.ann-geophys.net/35/189/2017/angeo-35-189-2017.pdf
- [...]
-
- # 9.96M 12:57:06 [ 213 /s]
-
- Counter({'total': 1272301, 'update': 1268466, 'skip-bad-wayback-timestamp': 2808, 'skip': 1026, 'skip-status': 981, 'skip-bad-replacement': 45, 'skip-bad-wayback': 1, 'insert': 0, 'exists': 0})
- Counter({'total': 1242814, 'update': 1239042, 'skip-bad-wayback-timestamp': 2734, 'skip': 1036, 'skip-status': 974, 'skip-bad-replacement': 62, 'skip-bad-wayback': 2, 'insert': 0, 'exists': 0})
- Counter({'total': 1264351, 'update': 1260695, 'skip-bad-wayback-timestamp': 2626, 'skip': 1030, 'skip-status': 977, 'skip-bad-replacement': 53, 'insert': 0, 'exists': 0})
- Counter({'total': 1244480, 'update': 1240779, 'skip-bad-wayback-timestamp': 2680, 'skip': 1020, 'skip-status': 962, 'skip-bad-replacement': 58, 'skip-bad-wayback': 1, 'insert': 0, 'exists': 0})
- Counter({'total': 1222678, 'update': 1219022, 'skip-bad-wayback-timestamp': 2698, 'skip': 956, 'skip-status': 892, 'skip-bad-replacement': 64, 'skip-bad-wayback': 2, 'insert': 0, 'exists': 0})
- Counter({'total': 1225078, 'update': 1221459, 'skip-bad-wayback-timestamp': 2597, 'skip': 1020, 'skip-status': 964, 'skip-bad-replacement': 56, 'skip-bad-wayback': 2, 'insert': 0, 'exists': 0})
- Counter({'total': 1283843, 'update': 1280014, 'skip-bad-wayback-timestamp': 2670, 'skip': 1059, 'skip-status': 997, 'skip-revision-changed': 99, 'skip-bad-replacement': 62, 'skip-bad-wayback': 1, 'insert': 0, 'exists': 0})
- Counter({'total': 1203309, 'update': 1199782, 'skip-bad-wayback-timestamp': 2556, 'skip': 971, 'skip-status': 923, 'skip-bad-replacement': 48, 'insert': 0, 'exists': 0})
-
-On the order of 99.7% were updated/fixed, over 9.5 million file entities, taking almost 13 hours.
diff --git a/notes/bulk_edits/2021-11-24_file_meta.md b/notes/bulk_edits/2021-11-24_file_meta.md
deleted file mode 100644
index 1ec1698b..00000000
--- a/notes/bulk_edits/2021-11-24_file_meta.md
+++ /dev/null
@@ -1,41 +0,0 @@
-
-Another partial batch of pure `file_meta` updates to file entities. These came
-from re-attempting ingest by URL of existing file entities.
-
-Not all ran as expected, partially because of GROBID issues, and partially
-because we had alternate captures for the same URLs.
-
-Still, about half the attempts worked, so we are going to update a fraction of
-the ~520k outstanding file entities with partial metadata (eg, missing sha256).
-
-See cleanups `file_meta` document for prep and QA testing notes.
-
-
-## Production Commands
-
- git log | head -n1
- commit 75bde4ad3970e8e63b04009cfd16ed4b9a924ce7
-
- export export FATCAT_AUTH_API_TOKEN=[...] # sandcrawler-bot
-
-Start with a small sample:
-
- cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \
- | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
- # Counter({'total': 100, 'skip-existing-complete': 45, 'update': 43, 'skip-no-match': 12, 'skip': 0, 'insert': 0, 'exists': 0})
-
-Then run in parallel with full batch:
-
- cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \
- | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
- # Counter({'total': 41846, 'update': 19737, 'skip-existing-complete': 18788, 'skip-no-match': 3321, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41522, 'update': 19678, 'skip-existing-complete': 18607, 'skip-no-match': 3237, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41537, 'update': 20517, 'skip-existing-complete': 17895, 'skip-no-match': 3125, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41529, 'update': 19684, 'skip-existing-complete': 18501, 'skip-no-match': 3344, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41530, 'update': 19595, 'skip-existing-complete': 18637, 'skip-no-match': 3298, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41542, 'update': 21359, 'skip-existing-complete': 17033, 'skip-no-match': 3150, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41534, 'update': 19758, 'skip-existing-complete': 18516, 'skip-no-match': 3260, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41537, 'update': 20507, 'skip-existing-complete': 15543, 'skip-no-match': 5487, 'skip': 0, 'insert': 0, 'exists': 0})
-
-Import ran pretty fast! Updated about 160k file entities. More like 1/3 than
-1/2 of the 520k that were missing SHA-256.
diff --git a/notes/bulk_edits/2021-11-24_file_sha1_dedupe.md b/notes/bulk_edits/2021-11-24_file_sha1_dedupe.md
deleted file mode 100644
index 012bcf62..00000000
--- a/notes/bulk_edits/2021-11-24_file_sha1_dedupe.md
+++ /dev/null
@@ -1,35 +0,0 @@
-
-See notes and scripts about `file_sha1_dedupe` cleanup for prep details.
-
-## Prod Run
-
-Run as `cleanup-bot`:
-
- export FATCAT_AUTH_API_TOKEN=[...]
-
- git log | head -n1
- # commit 5bc5eeed5e3ba54c2129c4233b881291c5fa7449
-
-First do a sample in dry-run mode:
-
- head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
- | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -
- # Counter({'updated-entities': 59, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})
-
-Gah, the dry-run mode still creates (empty) editgroups:
-
- https://fatcat.wiki/editgroup/iqzjg3vxu5elvotknmmjln3gv4
- https://fatcat.wiki/editgroup/2mxsl7lxo5dezem42whnr7zxxe
-
-Actually run (merge) the sample:
-
- head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
- | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
- # Counter({'updated-entities': 59, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})
-
-
-Run the full batch:
-
- cat /srv/fatcat/datasets/file_sha1_dupes.json \
- | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
- # Counter({'updated-entities': 6197, 'lines': 2039, 'merged': 2014, 'skip': 25, 'skip-not-active-entity': 25, 'updated-total': 0})
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
deleted file mode 100644
index 6156721c..00000000
--- a/notes/bulk_edits/CHANGELOG.md
+++ /dev/null
@@ -1,131 +0,0 @@
-
-# Fatcat Production Import CHANGELOG
-
-This file tracks major content (metadata) imports to the Fatcat production
-database (at https://fatcat.wiki). It complements the code CHANGELOG file.
-
-In general, changes that impact more than 50k entities will get logged here;
-this file should probably get merged into the guide at some point.
-
-This file should not turn in to a TODO list!
-
-
-## 2021-11
-
-Ran a series of cleanups. See background and prep notes in `notes/cleanups/`
-and specific final commands in this directory. Quick summary:
-
-- more than 9.5 million file entities had truncated timestamps wayback URLs,
- and were fixed with the full timestamps. there are still a small fraction
- (0.5%) which were identified but not corrected in this first pass
-- over 140k release entities with non-lowercase DOIs were updated with
- lowercase DOI. all DOIs in current release entities now lowercase (at least,
- no ASCII uppercase characters found)
-- over 220k file entities with incorrect release relation, due to an
- import-time code bug, were fixed. a couple hundred questionable cases remain,
- but are all mismatched due to DOI slash/double-slash issues and will not be
- fixed in an automated way.
-- de-uplicated a few thousand file entities, on the basis of SHA-1 hash
-- updated file metadata for around 160k file entities (a couple hundred
- thousand remain with partial metadata)
-
-
-## 2021-06
-
-Created new containers via chocula pipeline. Did not update any existing
-chocula entities.
-
-Ran DOAJ import manually, yielding almost 130k new release entities.
-
-Ran dblp import manually, resulting in about 17k new release entities, as well
-as 108 new containers. Note that 146k releases were not inserted due to
-`skip-dblp-container-missing` and 203k due to `exists-fuzzy`.
-
-## 2020-12
-
-Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
-
-Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
-several hundred thousand file entities matched by DOAJ identifier. Updated
-journal metadata using chocula took (before the release ingest). Filtered out
-fuzzy-matching papers before importing.
-
-Imported dblp from a 2020 snapshot, both containers (primarily for conferences
-lacking an ISSN) and release entities (primarily conference papers). Filtered
-out fuzzy-matching papers before importing.
-
-## 2020-03
-
-Started harvesting both Arxiv and Pubmed metadata daily and importing to
-fatcat. Did backfill imports for both sources.
-
-JALC DOI registry update from 2019 dump.
-
-## 2020-01
-
-Imported around 2,500 new containers (journals, by ISSN-L) from chocula
-analysis script.
-
-Imported DOIs from Datacite (around 16 million, plus or minus a couple
-million).
-
-Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import
-included only new Pubmed works cataloged in 2019 (up until December or so).
-Only a few hundred thousand new release entities.
-
-Daily "ingest" (crawling) pipeline running.
-
-## 2019-12
-
-Started continuous harvesting Datacite DOI metadata; first date harvested was
-`2019-12-13`. No importer running yet.
-
-Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting
-from XML to JSON): <https://archive.org/details/orcid-dump-2019>
-
-Inserted about 154k new arxiv release entities. Still no automatic daily
-harvesting.
-
-"Save Paper Now" importer running. This bot only *submits* editgroups for
-review, doesn't auto-accept them.
-
-## 2019-11
-
-Daily ingest of fulltext for OA releases now enabled. New file entities created
-and merged automatically.
-
-## 2019-10
-
-Inserted 1.45m new release entities from Crossref which had been missed during
-a previous gap in continuous metadata harvesting.
-
-Updated 304,308 file entities to remove broken
-"https://web.archive.org/web/None/*" URLs.
-
-## 2019-09
-
-Created and updated metadata for tens of thousands of containers, using
-"chocula" pipeline.
-
-## 2019-08
-
-Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (eg,
-invalid ISSN checksum).
-
-## 2019-04
-
-Imported files (matched to releases by DOI) from Semantic Scholar
-(`DIRECT-OA-CRAWL-2019` crawl).
-
-Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
-by a user to archive.org.
-
-Imported files (matched to releases by DOI) from CORE.ac.uk
-(`DIRECT-OA-CRAWL-2019` crawl).
-
-Imported files (matched to releases by DOI) from the public web (including many
-repositories) from the `UNPAYWALL` 2018 crawl.
-
-## 2019-02
-
-Bootstrapped!