Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | fileset ingest: remove a TODO | Bryan Newbold | 2022-04-04 | 1 | -1/+0 |
| | |||||
* | filesets: typo bugfix, and test 'mimetype' on entity, not extra | Bryan Newbold | 2022-04-04 | 1 | -1/+1 |
| | |||||
* | fileset ingest: fix mimetype handling | Bryan Newbold | 2022-03-31 | 1 | -4/+5 |
| | |||||
* | bugfix: logic flow in fileset release checking | Bryan Newbold | 2022-03-23 | 1 | -3/+6 |
| | |||||
* | single-file variant of fileset importer for dataset attempts | Bryan Newbold | 2022-03-23 | 2 | -0/+202 |
| | |||||
* | fix typo in fileset comparison helper | Bryan Newbold | 2022-03-23 | 1 | -1/+1 |
| | |||||
* | ingest fileset fixes, and some test coverage | Bryan Newbold | 2022-03-23 | 2 | -13/+30 |
| | |||||
* | dataset ingest: JSON object fixes | Bryan Newbold | 2022-03-22 | 1 | -5/+5 |
| | |||||
* | Merge branch 'bnewbold-container-web' into 'master' | bnewbold | 2022-03-10 | 3 | -2/+185 |
|\ | container web interface improvements. See merge request webgroup/fatcat!140 | ||||
| * | move container_status ES query code from fatcat_web to fatcat_tools | Bryan Newbold | 2022-02-09 | 3 | -2/+185 |
| | | | | | | | | | | | | The main motivation is to never have fatcat_tools import from fatcat_web, only vice versa. Some code in fatcat_tools needs container stats, so starting with that code path (plus some generic helpers). | ||||
* | | entity updates: don't try to ingest arxiv DOIs (for now) | Bryan Newbold | 2022-02-28 | 1 | -0/+2 |
| | | |||||
* | | datacite importer: skip container_id for some repository sources | Bryan Newbold | 2022-02-09 | 1 | -0/+34 |
|/ | |||||
* | doaj importer: TODO note to skip some larger publishers | Bryan Newbold | 2022-02-09 | 1 | -0/+4 |
| | |||||
* | container ES transform: include old extra.issne/p fields | Bryan Newbold | 2022-02-03 | 1 | -1/+4 |
| | | | | | These were removed prematurely. Not all containers have been updated to use these fields yet. | ||||
* | Merge branch 'bnewbold-file-es' into 'master' | bnewbold | 2022-01-21 | 3 | -4/+38 |
|\ | File entity elasticsearch index worker. See merge request webgroup/fatcat!136 | ||||
| * | entity worker: expand creators in release entities | Bryan Newbold | 2021-12-15 | 1 | -1/+1 |
| | | |||||
| * | small default config typo fixes for elasticsearch workers | Bryan Newbold | 2021-12-15 | 1 | -2/+2 |
| | | |||||
| * | file elasticsearch index worker | Bryan Newbold | 2021-12-15 | 2 | -1/+35 |
| | | |||||
* | | crossref importer: skip affiliations lacking 'name' | Bryan Newbold | 2021-12-15 | 1 | -0/+3 |
|/ | | | | Relatedly, we should start handling ROR affiliations in contribs soon. | ||||
* | mergers: fix typo in env var name | Bryan Newbold | 2021-12-07 | 3 | -3/+3 |
| | |||||
* | ES container schema: add 'sim_pubid' and `ia_sim_collection` fields | Bryan Newbold | 2021-12-03 | 1 | -0/+2 |
| | |||||
* | ES transform: remove prototype microfilm links | Bryan Newbold | 2021-12-03 | 1 | -20/+0 |
| | | | | This ended up being a feature in scholar.archive.org, not fatcat. | ||||
* | chocula importer: handle not-upper-case ISSNs | Bryan Newbold | 2021-11-30 | 1 | -2/+6 |
| | |||||
* | chocula importer: handle broken ISSNs in extra metadata | Bryan Newbold | 2021-11-30 | 1 | -2/+7 |
| | |||||
* | chocula importer: tweak counting, conditions for doing updates | Bryan Newbold | 2021-11-30 | 1 | -15/+7 |
| | |||||
* | chocula importer: move issne/issnp 'extra' to top-level fields if doing updates | Bryan Newbold | 2021-11-30 | 1 | -0/+6 |
| | |||||
* | chocula: don't do name cleanups in importer | Bryan Newbold | 2021-11-30 | 1 | -8/+2 |
| | | | | This kind of cleanup should be done in 'chocula' instead. | ||||
* | container merger: fix bug with filtering by release count | Bryan Newbold | 2021-11-30 | 1 | -13/+15 |
| | | | | | Also apply the "human edit" and "release count" checks only to the dupe (to-be-redirected) idents. | ||||
* | release merger: same editgroup_id fixes as for file and container mergers | Bryan Newbold | 2021-11-24 | 1 | -1/+5 |
| | |||||
* | container merger: fixes from QA testing | Bryan Newbold | 2021-11-24 | 1 | -8/+13 |
| | |||||
* | mergers: don't try to accept empty editgroups in dry-run-mode | Bryan Newbold | 2021-11-24 | 1 | -2/+4 |
| | |||||
* | ES release transform: handle redirected containers better | Bryan Newbold | 2021-11-24 | 1 | -1/+1 |
| | | | | | Despite the inline comment, we were not actually grabbing the "redirected" ident correctly, meaning some counts would not be accurate. | ||||
* | container merger: defer allocation of editgroup_id; and dummy code path | Bryan Newbold | 2021-11-24 | 1 | -1/+5 |
| | |||||
* | initial implementation of container merger | Bryan Newbold | 2021-11-24 | 1 | -0/+237 |
| | |||||
* | file merger: allocate editgroup id later in 'merge' process | Bryan Newbold | 2021-11-24 | 1 | -1/+5 |
| | | | | | The motivation is to avoid creating empty editgroups in dry-run mode, and when all entities are "skipped" | ||||
* | Merge branch 'bnewbold-mergers' into 'master' | bnewbold | 2021-11-25 | 4 | -0/+640 |
|\ | entity mergers framework. See merge request webgroup/fatcat!133 | ||||
| * | mergers common: remove inaccurate comment | Bryan Newbold | 2021-11-24 | 1 | -2/+0 |
| | | | | | | | | Caught in review, thanks miku | ||||
| * | file merger: add content_scope to list of merged fields | Bryan Newbold | 2021-11-24 | 1 | -1/+1 |
| | | |||||
| * | release merger: some progress, but also disable (not complete) | Bryan Newbold | 2021-11-23 | 1 | -12/+72 |
| | | |||||
| * | file merges: fixes from testing in QA | Bryan Newbold | 2021-11-23 | 1 | -14/+23 |
| | | |||||
| * | mergers: small tweaks | Bryan Newbold | 2021-11-23 | 2 | -3/+3 |
| | | |||||
| * | mergers: remove entity mergers from __init__ (to work around warning) | Bryan Newbold | 2021-11-23 | 1 | -2/+0 |
| | | |||||
| * | initial file merger, with tests | Bryan Newbold | 2021-11-23 | 1 | -0/+228 |
| | | |||||
| * | mergers: fmt, lint, refactors | Bryan Newbold | 2021-11-23 | 3 | -96/+200 |
| | | | | | | | | | | This old merger code is from an old branch and needed significant cleanup | ||||
| * | first iteration of mergers | Bryan Newbold | 2021-11-23 | 3 | -0/+243 |
| | | |||||
* | | codespell fixes in python code (comments) | Bryan Newbold | 2021-11-24 | 2 | -3/+3 |
|/ | |||||
* | content_scope: include in file ES schema and transform | Bryan Newbold | 2021-11-17 | 1 | -0/+1 |
| | |||||
* | Merge branch 'bnewbold-import-refactors' into 'master' | bnewbold | 2021-11-11 | 18 | -1462/+811 |
|\ | import refactors and deprecations. Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes. The Datacite-specific stuff could use review here. Remove unused/deprecated/dead code: (1) cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers; (2) "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used). Refactors: (1) moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code); (2) shuffled around relative imports and some function names ("clean_str" vs. "clean"). Some actual behavioral changes: (1) remove some Datacite-specific license slugs; (2) stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!); (3) remove some excess metadata from datacite 'extra' fields. | ||||
| * | improve lookup_license_slug helper and lookup table | Bryan Newbold | 2021-11-10 | 2 | -56/+62 |
| | | |||||
| * | refactor importer metadata tables into separate file; move some helpers around | Bryan Newbold | 2021-11-10 | 10 | -702/+682 |
| | | | | | | | | | | | | | | - MAX_ABSTRACT_LENGTH set in a single place (importer common); - merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work) |