Another partial batch of pure `file_meta` updates to file entities. These came from re-attempting ingest by URL of existing file entities. Not all of these re-attempts ran as expected, partly because of GROBID issues and partly because we had alternate captures for the same URLs. Still, about half the attempts worked, so we are going to update a fraction of the ~520k outstanding file entities that have only partial metadata (eg, missing sha256). See the `file_meta` cleanups document for prep and QA testing notes.

## Production Commands

    git log | head -n1
    # commit 75bde4ad3970e8e63b04009cfd16ed4b9a924ce7

    export FATCAT_AUTH_API_TOKEN=[...]  # sandcrawler-bot

Start with a small sample:

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \
        | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
    # Counter({'total': 100, 'skip-existing-complete': 45, 'update': 43, 'skip-no-match': 12, 'skip': 0, 'insert': 0, 'exists': 0})

Then run in parallel with the full batch:

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \
        | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
    # Counter({'total': 41846, 'update': 19737, 'skip-existing-complete': 18788, 'skip-no-match': 3321, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41522, 'update': 19678, 'skip-existing-complete': 18607, 'skip-no-match': 3237, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41537, 'update': 20517, 'skip-existing-complete': 17895, 'skip-no-match': 3125, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41529, 'update': 19684, 'skip-existing-complete': 18501, 'skip-no-match': 3344, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41530, 'update': 19595, 'skip-existing-complete': 18637, 'skip-no-match': 3298, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41542, 'update': 21359, 'skip-existing-complete': 17033, 'skip-no-match': 3150, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41534, 'update': 19758, 'skip-existing-complete': 18516, 'skip-no-match': 3260, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41537, 'update': 20507, 'skip-existing-complete': 15543, 'skip-no-match': 5487, 'skip': 0, 'insert': 0, 'exists': 0})

Import ran pretty fast! Updated about 160k file entities, which is more like 1/3 than 1/2 of the ~520k that were missing SHA-256.
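
As a quick sanity check on that ~160k figure (not part of the original run, just summing the `update` counts from the eight Counter lines above and comparing against the ~520k outstanding):

    # sum of 'update' counts across the 8 parallel workers
    echo $(( 19737 + 19678 + 20517 + 19684 + 19595 + 21359 + 19758 + 20507 ))
    # 160835

    # as a fraction of the ~520k file entities missing sha256
    python3 -c 'print(160835 / 520000)'
    # ~0.31, ie closer to 1/3 than 1/2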