summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2020-11-19 14:54:03 -0800
committerBryan Newbold <bnewbold@robocracy.org>2020-11-19 14:54:03 -0800
commitfcfcd3224a113fa90da2045a3c7fe90127088ebe (patch)
tree24300e0b2b7548a58f0348e589a5b3d59bc44337
parent03eadfc7e2bee4213345f6464378e87b8f741d20 (diff)
downloadfatcat-fcfcd3224a113fa90da2045a3c7fe90127088ebe.tar.gz
fatcat-fcfcd3224a113fa90da2045a3c7fe90127088ebe.zip
ingest and proposal updates
-rw-r--r--notes/bulk_edits/2020-10-08_chocula.md44
-rw-r--r--proposals/20190911_v04_schema_tweaks.md1
2 files changed, 45 insertions, 0 deletions
diff --git a/notes/bulk_edits/2020-10-08_chocula.md b/notes/bulk_edits/2020-10-08_chocula.md
new file mode 100644
index 00000000..d60b6842
--- /dev/null
+++ b/notes/bulk_edits/2020-10-08_chocula.md
@@ -0,0 +1,44 @@
+
+Another update of journal metadata. In this case due to expanding "Keepers"
+coverage to PKP PLN, Hathitrust, Scholar's Portal, and Carniniana.
+
+Using `journal-metadata-bot` and `chocula.2020-10-08.json` export.
+
+## QA Testing
+
+ shuf -n1000 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
+ Counter({'total': 1000, 'exists': 640, 'exists-skip-update': 532, 'update': 348, 'exists-not-found': 108, 'insert': 12, 'skip': 0})
+
+Expecting roughly a 1/3 update rate. Most of these seem to be true updates (eg,
+adding kbart metadata). A smaller fraction are just updating DOAJ timestamp or
+not updating any metadata at all.
+
+ head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
+ Counter({'total': 500, 'exists': 372, 'exists-skip-update': 328, 'update': 121, 'exists-not-found': 44, 'insert': 7, 'skip': 0})
+
+ head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
+ Counter({'total': 500, 'exists': 481, 'exists-skip-update': 430, 'exists-not-found': 44, 'update': 19, 'exists-by-issnl': 7, 'skip': 0, 'insert': 0})
+
+Made some changes in `27fe31d5ffcac700c30b2b10d56685ef0fa4f3a8` which seem to
+have removed the spurious null updates, while retaining DOAJ date-only updates.
+
+Also as a small nit notice that occasionally `kbart` metadata gets added with
+no year spans. This seems to be common with cariniana. Presumably this happens
+when there is no year span info available, only volumes. Seems like a valuable
+thing to include as a flag anyways.
+
+## Prod Import
+
+Start small:
+
+ head -n100 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
+ => Counter({'total': 100, 'exists': 69, 'exists-skip-update': 68, 'update': 30, 'insert': 1, 'exists-by-issnl': 1, 'skip': 0})
+
+Full batch:
+
+ time cat /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
+ => Counter({'total': 167092, 'exists': 110594, 'exists-skip-update': 109852, 'update': 55274, 'insert': 1224, 'exists-by-issnl': 742, 'skip': 0})
+
+ real 10m45.714s
+ user 4m51.680s
+ sys 0m12.236s
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index f5253519..8e61bf36 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -16,6 +16,7 @@ SQL (and API, and elasticsearch):
includes. Eg, `book`, `chapter`, `article`/`work`, `issue`, `volume`,
`abstract`, `component`. Unclear how to initialize this field; default to
`article`/`work`?
+- TODO: webcapture: lookup by primary URL sha1?
- TODO: release: switch how pages work? first/last?
- TODO: indication of peer-review process? at release or container level?
- TODO: container: separate canonical and disambiguating titles (?)