# 2021-11-24 `file_meta` Updates
Another partial batch of pure `file_meta` updates to file entities. These came
from re-attempting ingest by URL of existing file entities.

Not all ran as expected, partly because of GROBID issues, and partly because
we had alternate captures for the same URLs.

Still, about half the attempts worked, so we are going to update a fraction of
the ~520k outstanding file entities that have only partial metadata (eg,
missing sha256).

See the cleanups `file_meta` document for prep and QA testing notes.


## Production Commands

    git log | head -n1
    commit 75bde4ad3970e8e63b04009cfd16ed4b9a924ce7

    export FATCAT_AUTH_API_TOKEN=[...]  # sandcrawler-bot

Start with a small sample:

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \
        | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta

    # Counter({'total': 100, 'skip-existing-complete': 45, 'update': 43, 'skip-no-match': 12, 'skip': 0, 'insert': 0, 'exists': 0})

Then run the full batch in parallel:

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \
        | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta

    # Counter({'total': 41846, 'update': 19737, 'skip-existing-complete': 18788, 'skip-no-match': 3321, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41522, 'update': 19678, 'skip-existing-complete': 18607, 'skip-no-match': 3237, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41537, 'update': 20517, 'skip-existing-complete': 17895, 'skip-no-match': 3125, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41529, 'update': 19684, 'skip-existing-complete': 18501, 'skip-no-match': 3344, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41530, 'update': 19595, 'skip-existing-complete': 18637, 'skip-no-match': 3298, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41542, 'update': 21359, 'skip-existing-complete': 17033, 'skip-no-match': 3150, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41534, 'update': 19758, 'skip-existing-complete': 18516, 'skip-no-match': 3260, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41537, 'update': 20507, 'skip-existing-complete': 15543, 'skip-no-match': 5487, 'skip': 0, 'insert': 0, 'exists': 0})

Import ran pretty fast! Updated about 160k file entities: more like 1/3 than
1/2 of the ~520k that were missing SHA-256.
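As a quick sanity check (a sketch added here, not part of the original run), the per-worker `update` counts from the Counter lines above can be summed to confirm the rough totals quoted in the note:

```python
# Per-worker 'update' counts, copied from the eight Counter lines above.
updates = [19737, 19678, 20517, 19684, 19595, 21359, 19758, 20507]

total = sum(updates)
print(total)                      # 160835 -- "about 160k"
print(round(total / 520_000, 2))  # 0.31 -- closer to 1/3 than 1/2
```

This matches the closing observation: roughly a third, not half, of the ~520k entities missing SHA-256 were updated.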