path: root/python
Commit message [Author, Date, Files, Lines]
* file elasticsearch index worker [Bryan Newbold, 2021-12-15, 3 files, -1/+63]
|
* mergers: fix typo in env var name [Bryan Newbold, 2021-12-07, 3 files, -3/+3]
|
* ES container schema: add 'sim_pubid' and `ia_sim_collection` fields [Bryan Newbold, 2021-12-03, 1 file, -0/+2]
|
* ES transform: remove prototype microfilm links [Bryan Newbold, 2021-12-03, 1 file, -20/+0]
|     This ended up being a feature in scholar.archive.org, not fatcat.
* chocula importer: handle not-upper-case ISSNs [Bryan Newbold, 2021-11-30, 1 file, -2/+6]
|
* chocula importer: handle broken ISSNs in extra metadata [Bryan Newbold, 2021-11-30, 1 file, -2/+7]
|
* chocula importer: tweak counting, conditions for doing updates [Bryan Newbold, 2021-11-30, 1 file, -15/+7]
|
* chocula importer: move issne/issnp 'extra' to top-level fields if doing updates [Bryan Newbold, 2021-11-30, 1 file, -0/+6]
|
* chocula: don't do name cleanups in importer [Bryan Newbold, 2021-11-30, 1 file, -8/+2]
|     This kind of cleanup should be done in 'chocula' instead.
* container merger: fix bug with filtering by release count [Bryan Newbold, 2021-11-30, 1 file, -13/+15]
|     Also apply the "human edit" and "release count" checks only to the dupe (to-be-redirected) idents.
* release merger: same editgroup_id fixes as for file and container mergers [Bryan Newbold, 2021-11-24, 1 file, -1/+5]
|
* container merger: fixes from QA testing [Bryan Newbold, 2021-11-24, 1 file, -8/+13]
|
* mergers: don't try to accept empty editgroups in dry-run-mode [Bryan Newbold, 2021-11-24, 1 file, -2/+4]
|
* ES release transform: handle redirected containers better [Bryan Newbold, 2021-11-24, 1 file, -1/+1]
|     Despite the inline comment, we were not actually grabbing the "redirected" ident correctly, meaning some counts would not be accurate.
* container merger: defer allocation of editgroup_id; and dummy code path [Bryan Newbold, 2021-11-24, 1 file, -1/+5]
|
* initial implementation of container merger [Bryan Newbold, 2021-11-24, 2 files, -0/+353]
|
* file merger: allocate editgroup id later in 'merge' process [Bryan Newbold, 2021-11-24, 1 file, -1/+5]
|     The motivation is to avoid creating empty editgroups in dry-run mode, and when all entities are "skipped"
* Merge branch 'bnewbold-mergers' into 'master' [bnewbold, 2021-11-25, 5 files, -0/+800]
|\
| |     entity mergers framework
| |
| |     See merge request webgroup/fatcat!133
| * mergers common: remove inaccurate comment [Bryan Newbold, 2021-11-24, 1 file, -2/+0]
| |     Caught in review, thanks miku
| * file merger: add content_scope to list of merged fields [Bryan Newbold, 2021-11-24, 1 file, -1/+1]
| |
| * release merger: some progress, but also disable (not complete) [Bryan Newbold, 2021-11-23, 1 file, -12/+72]
| |
| * file merges: fixes from testing in QA [Bryan Newbold, 2021-11-23, 1 file, -14/+23]
| |
| * mergers: small tweaks [Bryan Newbold, 2021-11-23, 2 files, -3/+3]
| |
| * mergers: remove entity mergers from __init__ (to work around warning) [Bryan Newbold, 2021-11-23, 1 file, -2/+0]
| |
| * initial file merger, with tests [Bryan Newbold, 2021-11-23, 2 files, -0/+388]
| |
| * mergers: fmt, lint, refactors [Bryan Newbold, 2021-11-23, 3 files, -96/+200]
| |     This old merger code is from an old branch and needed significant cleanup.
| * remove top-level fatcat_merge.py; going to call module __main__ going forward [Bryan Newbold, 2021-11-23, 1 file, -112/+0]
| |
| * first iteration of mergers [Bryan Newbold, 2021-11-23, 4 files, -0/+355]
| |
* | codespell fixes to various other docs [Bryan Newbold, 2021-11-24, 1 file, -1/+1]
| |
* | codespell fixes in python code (comments) [Bryan Newbold, 2021-11-24, 4 files, -6/+6]
| |
* | codespell fixes in web interface templates [Bryan Newbold, 2021-11-24, 14 files, -19/+19]
|/
* Merge branch 'bnewbold-content-scope' [Bryan Newbold, 2021-11-22, 5 files, -1/+8]
|\
| * bump python client to 0.5.0 [Bryan Newbold, 2021-11-17, 1 file, -1/+1]
| |
| * content_scope: include in file ES schema and transform [Bryan Newbold, 2021-11-17, 1 file, -0/+1]
| |
| * minimal python test coverage of content_scope fields [Bryan Newbold, 2021-11-17, 3 files, -0/+6]
| |
| * python code: update python_openapi_client in lockfile [Bryan Newbold, 2021-11-17, 1 file, -1/+1]
| |
* | typo: don't expand containers for release revs (TOML) [Bryan Newbold, 2021-11-19, 1 file, -1/+1]
| |
* | web editgroup diff: don't enrich in TOML diff; fix overlapping break [Bryan Newbold, 2021-11-19, 2 files, -5/+8]
| |
* | web generic entity helpers: make enrichment optional [Bryan Newbold, 2021-11-19, 1 file, -18/+49]
| |
* | polish editgroup diff view [Bryan Newbold, 2021-11-18, 4 files, -92/+83]
| |     Still not as great as it could be, but useful in this state.
* | initial implementation of editgroup 'diff' for review [Bryan Newbold, 2021-11-17, 4 files, -6/+183]
| |
* | web: fix API URL link for review pages of entities [Bryan Newbold, 2021-11-17, 1 file, -2/+2]
|/
* web: handle ES non-int error codes better [Bryan Newbold, 2021-11-12, 1 file, -9/+12]
|
* Merge branch 'bnewbold-import-refactors' into 'master' [bnewbold, 2021-11-11, 26 files, -1599/+828]
|\
| |     import refactors and deprecations
| |
| |     Some of these are from old stale branches (the datacite subject metadata
| |     patch), but most are from yesterday and today. Sort of a hodge-podge, but
| |     the general theme is getting around to deferred cleanups and refactors
| |     specific to importer code before making some behavioral changes.
| |
| |     The Datacite-specific stuff could use review here.
| |
| |     Remove unused/deprecated/dead code:
| |
| |     - cdl_dash_dat and wayback_static importers, which were for specific early
| |       example entities and have been superseded by other importers
| |     - "extid map" sqlite3 feature from several importers, was only used for
| |       initial bulk imports (and maybe should not have been used)
| |
| |     Refactors:
| |
| |     - moved a number of large data structures out of importer code and into a
| |       dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just
| |       the ones that were either generic or very large (making it hard to read
| |       code)
| |     - shuffled around relative imports and some function names ("clean_str"
| |       vs. "clean")
| |
| |     Some actual behavioral changes:
| |
| |     - remove some Datacite-specific license slugs
| |     - stop trying to fix double-slashes in DOIs, that was causing more harm
| |       than help (some DOIs do actually have double-slashes!)
| |     - remove some excess metadata from datacite 'extra' fields
| * update datacite tests for license slug changes [Bryan Newbold, 2021-11-10, 2 files, -8/+7]
| |     Use datacite-specific wrapper function, and remove a couple non-OA/TDM-limited licenses.
| * improve lookup_license_slug helper and lookup table [Bryan Newbold, 2021-11-10, 2 files, -56/+62]
| |
| * refactor importer metadata tables into separate file; move some helpers around [Bryan Newbold, 2021-11-10, 10 files, -702/+682]
| |     - MAX_ABSTRACT_LENGTH set in a single place (importer common)
| |     - merge datacite license slug table into common table, removing some
| |       TDM-specific licenses (which do not apply in the context of preserving
| |       the full work)
| * importers: refactor imports of clean() and other normalization helpers [Bryan Newbold, 2021-11-10, 12 files, -95/+104]
| |
| * remove cdl_dash_dat and wayback_static importers [Bryan Newbold, 2021-11-10, 4 files, -596/+0]
| |     Cleaning out dead code.
| |
| |     These importers were used to create demonstration fileset and webcapture
| |     entities early in development. They have been replaced by the fileset and
| |     webcapture ingest importers.
| * datacite import: store less subject metadata [Bryan Newbold, 2021-11-10, 1 file, -1/+7]
| |     Many of these 'subject' objects have the equivalent of several lines of
| |     text, with complex URLs that don't compress well. I think it is fine we
| |     have included these thus far instead of parsing more deeply, but going
| |     forward I don't think this nested 'extra' metadata is worth the database
| |     space.