path: root/notes/cleanups
Commit message                                            Author         Age         Files  Lines
* document cleanups run this week                         Bryan Newbold  2021-11-12  1      -0/+13
|
* Merge branch 'bnewbold-import-refactors' into 'master'  bnewbold       2021-11-11  1      -0/+46
|\
| |   import refactors and deprecations
| |
| |   Some of these are from old stale branches (the datacite subject
| |   metadata patch), but most are from yesterday and today. Sort of a
| |   hodge-podge, but the general theme is getting around to deferred
| |   cleanups and refactors specific to importer code before making some
| |   behavioral changes.
| |
| |   The Datacite-specific stuff could use review here.
| |
| |   Remove unused/deprecated/dead code:
| |
| |   - cdl_dash_dat and wayback_static importers, which were for specific
| |     early example entities and have been superseded by other importers
| |   - "extid map" sqlite3 feature from several importers, was only used
| |     for initial bulk imports (and maybe should not have been used)
| |
| |   Refactors:
| |
| |   - moved a number of large datastructures out of importer code and
| |     into a dedicated static file (`biblio_lookup_tables.py`). Didn't
| |     move all, just the ones that were either generic or very large
| |     (making it hard to read code)
| |   - shuffled around relative imports and some function names
| |     ("clean_str" vs. "clean")
| |
| |   Some actual behavioral changes:
| |
| |   - remove some Datacite-specific license slugs
| |   - stop trying to fix double-slashes in DOIs, that was causing more
| |     harm than help (some DOIs do actually have double-slashes!)
| |   - remove some excess metadata from datacite 'extra' fields
| |
| * add notes about 'double slash in DOI' issue           Bryan Newbold  2021-11-09  1      -0/+46
|
* wayback ts cleanup: one more filter tweak               Bryan Newbold  2021-11-09  1      -1/+2
|
* update cleanups notes                                   Bryan Newbold  2021-11-09  2      -0/+72
|
* initial file/release bugfix cleanup worker and notes    Bryan Newbold  2021-11-09  1      -0/+144
|
* updates to lowercase DOI cleanup                        Bryan Newbold  2021-11-09  1      -0/+71
|
* more iteration on short wayback timestamp cleanup       Bryan Newbold  2021-11-09  2      -3/+128
|
* cleanups: tweaks to wayback CDX cleanup scripts         Bryan Newbold  2021-11-09  1      -1/+8
|
* wayback timestamps: updates to handle 4-digit case      Bryan Newbold  2021-11-09  2      -11/+108
|
* start work on wayback short-timestamp cleanup           Bryan Newbold  2021-11-09  2      -0/+238