path: root/notes/bulk_edits/2021-11-24_file_meta.md
author Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit c32154f2875a7fb9aac727013e1475cdd811e180
tree f0e061498a101fa824995fb6ec9f91e7e44257e1 /notes/bulk_edits/2021-11-24_file_meta.md
parent c5ea2dba358624f4c14da0a1a988ae14d0edfd59
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes/bulk_edits/2021-11-24_file_meta.md')
-rw-r--r-- notes/bulk_edits/2021-11-24_file_meta.md 41
1 files changed, 0 insertions, 41 deletions
diff --git a/notes/bulk_edits/2021-11-24_file_meta.md b/notes/bulk_edits/2021-11-24_file_meta.md
deleted file mode 100644
index 1ec1698b..00000000
--- a/notes/bulk_edits/2021-11-24_file_meta.md
+++ /dev/null
@@ -1,41 +0,0 @@
-
-Another partial batch of pure `file_meta` updates to file entities. These came
-from re-attempting ingest by URL of existing file entities.
-
-Not all ran as expected, partially because of GROBID issues, and partially
-because we had alternate captures for the same URLs.
-
-Still, about half the attempts worked, so we are going to update a fraction of
-the ~520k outstanding file entities with partial metadata (e.g., missing sha256).
-
-See the cleanups `file_meta` document for prep and QA testing notes.
-
-
-## Production Commands
-
- git log | head -n1
- commit 75bde4ad3970e8e63b04009cfd16ed4b9a924ce7
-
- export FATCAT_AUTH_API_TOKEN=[...] # sandcrawler-bot
-
-Start with a small sample:
-
- cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \
- | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
- # Counter({'total': 100, 'skip-existing-complete': 45, 'update': 43, 'skip-no-match': 12, 'skip': 0, 'insert': 0, 'exists': 0})
-
-Then run in parallel with full batch:
-
- cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \
- | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
- # Counter({'total': 41846, 'update': 19737, 'skip-existing-complete': 18788, 'skip-no-match': 3321, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41522, 'update': 19678, 'skip-existing-complete': 18607, 'skip-no-match': 3237, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41537, 'update': 20517, 'skip-existing-complete': 17895, 'skip-no-match': 3125, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41529, 'update': 19684, 'skip-existing-complete': 18501, 'skip-no-match': 3344, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41530, 'update': 19595, 'skip-existing-complete': 18637, 'skip-no-match': 3298, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41542, 'update': 21359, 'skip-existing-complete': 17033, 'skip-no-match': 3150, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41534, 'update': 19758, 'skip-existing-complete': 18516, 'skip-no-match': 3260, 'skip': 0, 'insert': 0, 'exists': 0})
- # Counter({'total': 41537, 'update': 20507, 'skip-existing-complete': 15543, 'skip-no-match': 5487, 'skip': 0, 'insert': 0, 'exists': 0})
-
-Import ran pretty fast! Updated about 160k file entities (160,835 summed across
-the eight workers, out of 332,577 rows processed). That is closer to 1/3 than
-1/2 of the ~520k that were missing SHA-256.
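
The closing estimate can be checked by summing the eight per-worker `Counter` lines from the parallel run. The following is a minimal sketch, not part of the original notes: the counts are copied from the output above (zero-valued keys omitted), while `OUTSTANDING_ESTIMATE` and the `summarize` helper are hypothetical names, and the 520,000 figure is only the approximate "~520k" quoted in the prose.

```python
from collections import Counter

# Per-worker results copied from the parallel run recorded in the notes.
WORKER_COUNTS = [
    Counter({'total': 41846, 'update': 19737, 'skip-existing-complete': 18788, 'skip-no-match': 3321}),
    Counter({'total': 41522, 'update': 19678, 'skip-existing-complete': 18607, 'skip-no-match': 3237}),
    Counter({'total': 41537, 'update': 20517, 'skip-existing-complete': 17895, 'skip-no-match': 3125}),
    Counter({'total': 41529, 'update': 19684, 'skip-existing-complete': 18501, 'skip-no-match': 3344}),
    Counter({'total': 41530, 'update': 19595, 'skip-existing-complete': 18637, 'skip-no-match': 3298}),
    Counter({'total': 41542, 'update': 21359, 'skip-existing-complete': 17033, 'skip-no-match': 3150}),
    Counter({'total': 41534, 'update': 19758, 'skip-existing-complete': 18516, 'skip-no-match': 3260}),
    Counter({'total': 41537, 'update': 20507, 'skip-existing-complete': 15543, 'skip-no-match': 5487}),
]

# Assumption: the "~520k" from the notes, treated as a round number.
OUTSTANDING_ESTIMATE = 520_000


def summarize(counts):
    """Combine per-worker counters and compute the two update ratios."""
    combined = sum(counts, Counter())
    return {
        'combined': combined,
        # Share of processed rows that resulted in an entity update.
        'update_share_of_attempts': combined['update'] / combined['total'],
        # Share of the estimated outstanding entities that were updated.
        'update_share_of_outstanding': combined['update'] / OUTSTANDING_ESTIMATE,
    }


if __name__ == '__main__':
    summary = summarize(WORKER_COUNTS)
    print(dict(summary['combined']))
    print(f"updated / attempted:   {summary['update_share_of_attempts']:.1%}")
    print(f"updated / outstanding: {summary['update_share_of_outstanding']:.1%}")
```

Under these assumptions the combined update count is 160,835 out of 332,577 rows processed, which matches both "about half the attempts worked" and "more like 1/3 than 1/2" of the outstanding entities.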