path: root/python/fatcat_tools
Commit message [Author, Age, Files, Lines]
* pubmed: ignore empty map during baseline update [Martin Czygan, 2022-12-12, 1 file, -3/+13]
|
|   > NLM produces a baseline set of PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year.
|   -- https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt
|
|   Last occurrence: Dec 8, 2022. Since we do not know the exact date, but the PubMed docs explicitly state "December", we ignore the empty map error during this month.
* Merge branch 'bnewbold-dblp-iteration' into 'master' [bnewbold, 2022-07-25, 2 files, -2/+9]
|\
| |   dblp import iteration
| |
| |   See merge request webgroup/fatcat!141
| * ingest: generate URLs for hdl (handle.net) [Bryan Newbold, 2022-07-19, 1 file, -0/+4]
| |
| * dblp: more skip patterns, and rename variable [Bryan Newbold, 2022-07-19, 1 file, -2/+5]
| |
* | chocula importer: do update if publisher_type was null [Bryan Newbold, 2022-07-21, 1 file, -0/+3]
| |
* | doaj: fix tests now that container_id is required [Bryan Newbold, 2022-07-19, 1 file, -1/+1]
| |
* | doaj: require container linkage for release import [Bryan Newbold, 2022-07-19, 1 file, -0/+4]
|/
* ingest: DOAJ article URLs [Bryan Newbold, 2022-07-12, 1 file, -0/+4]
|
* arxiv: work-around hack for strange title [Bryan Newbold, 2022-07-07, 1 file, -0/+8]
|
* fileset ingest: handle missing/partial file-level metadata [Bryan Newbold, 2022-04-05, 1 file, -3/+3]
|
* ingest importer: improved extra/edit_extra code flow [Bryan Newbold, 2022-04-05, 1 file, -20/+13]
|
* fileset ingest: remove a TODO [Bryan Newbold, 2022-04-04, 1 file, -1/+0]
|
* filesets: typo bugfix, and test 'mimetype' on entity, not extra [Bryan Newbold, 2022-04-04, 1 file, -1/+1]
|
* fileset ingest: fix mimetype handling [Bryan Newbold, 2022-03-31, 1 file, -4/+5]
|
* bugfix: logic flow in fileset release checking [Bryan Newbold, 2022-03-23, 1 file, -3/+6]
|
* single-file variant of fileset importer for dataset attempts [Bryan Newbold, 2022-03-23, 2 files, -0/+202]
|
* fix typo in fileset comparison helper [Bryan Newbold, 2022-03-23, 1 file, -1/+1]
|
* ingest fileset fixes, and some test coverage [Bryan Newbold, 2022-03-23, 2 files, -13/+30]
|
* dataset ingest: JSON object fixes [Bryan Newbold, 2022-03-22, 1 file, -5/+5]
|
* Merge branch 'bnewbold-container-web' into 'master' [bnewbold, 2022-03-10, 3 files, -2/+185]
|\
| |   container web interface improvements
| |
| |   See merge request webgroup/fatcat!140
| * move container_status ES query code from fatcat_web to fatcat_tools [Bryan Newbold, 2022-02-09, 3 files, -2/+185]
| |
| |   The main motivation is to never have fatcat_tools import from fatcat_web, only vice versa. Some code in fatcat_tools needs container stats, so starting with that code path (plus some generic helpers).
* | entity updates: don't try to ingest arxiv DOIs (for now) [Bryan Newbold, 2022-02-28, 1 file, -0/+2]
| |
* | datacite importer: skip container_id for some repository sources [Bryan Newbold, 2022-02-09, 1 file, -0/+34]
|/
* doaj importer: TODO note to skip some larger publishers [Bryan Newbold, 2022-02-09, 1 file, -0/+4]
|
* container ES transform: include old extra.issne/p fields [Bryan Newbold, 2022-02-03, 1 file, -1/+4]
|
|   These were removed prematurely. Not all containers have been updated to use these fields yet.
* Merge branch 'bnewbold-file-es' into 'master' [bnewbold, 2022-01-21, 3 files, -4/+38]
|\
| |   File entity elasticsearch index worker
| |
| |   See merge request webgroup/fatcat!136
| * entity worker: expand creators in release entities [Bryan Newbold, 2021-12-15, 1 file, -1/+1]
| |
| * small default config typo fixes for elasticsearch workers [Bryan Newbold, 2021-12-15, 1 file, -2/+2]
| |
| * file elasticsearch index worker [Bryan Newbold, 2021-12-15, 2 files, -1/+35]
| |
* | crossref importer: skip affiliations lacking 'name' [Bryan Newbold, 2021-12-15, 1 file, -0/+3]
|/
|   Relatedly, we should start handling ROR affiliations in contribs soon.
* mergers: fix typo in env var name [Bryan Newbold, 2021-12-07, 3 files, -3/+3]
|
* ES container schema: add 'sim_pubid' and `ia_sim_collection` fields [Bryan Newbold, 2021-12-03, 1 file, -0/+2]
|
* ES transform: remove prototype microfilm links [Bryan Newbold, 2021-12-03, 1 file, -20/+0]
|
|   This ended up being a feature in scholar.archive.org, not fatcat.
* chocula importer: handle not-upper-case ISSNs [Bryan Newbold, 2021-11-30, 1 file, -2/+6]
|
* chocula importer: handle broken ISSNs in extra metadata [Bryan Newbold, 2021-11-30, 1 file, -2/+7]
|
* chocula importer: tweak counting, conditions for doing updates [Bryan Newbold, 2021-11-30, 1 file, -15/+7]
|
* chocula importer: move issne/issnp 'extra' to top-level fields if doing updates [Bryan Newbold, 2021-11-30, 1 file, -0/+6]
|
* chocula: don't do name cleanups in importer [Bryan Newbold, 2021-11-30, 1 file, -8/+2]
|
|   This kind of cleanup should be done in 'chocula' instead.
* container merger: fix bug with filtering by release count [Bryan Newbold, 2021-11-30, 1 file, -13/+15]
|
|   Also apply the "human edit" and "release count" checks only to the dupe (to-be-redirected) idents.
* release merger: same editgroup_id fixes as for file and container mergers [Bryan Newbold, 2021-11-24, 1 file, -1/+5]
|
* container merger: fixes from QA testing [Bryan Newbold, 2021-11-24, 1 file, -8/+13]
|
* mergers: don't try to accept empty editgroups in dry-run-mode [Bryan Newbold, 2021-11-24, 1 file, -2/+4]
|
* ES release transform: handle redirected containers better [Bryan Newbold, 2021-11-24, 1 file, -1/+1]
|
|   Despite the inline comment, we were not actually grabbing the "redirected" ident correctly, meaning some counts would not be accurate.
* container merger: defer allocation of editgroup_id; and dummy code path [Bryan Newbold, 2021-11-24, 1 file, -1/+5]
|
* initial implementation of container merger [Bryan Newbold, 2021-11-24, 1 file, -0/+237]
|
* file merger: allocate editgroup id later in 'merge' process [Bryan Newbold, 2021-11-24, 1 file, -1/+5]
|
|   The motivation is to avoid creating empty editgroups in dry-run mode, and when all entities are "skipped".
* Merge branch 'bnewbold-mergers' into 'master' [bnewbold, 2021-11-25, 4 files, -0/+640]
|\
| |   entity mergers framework
| |
| |   See merge request webgroup/fatcat!133
| * mergers common: remove inaccurate comment [Bryan Newbold, 2021-11-24, 1 file, -2/+0]
| |
| |   Caught in review, thanks miku
| * file merger: add content_scope to list of merged fields [Bryan Newbold, 2021-11-24, 1 file, -1/+1]
| |
| * release merger: some progress, but also disable (not complete) [Bryan Newbold, 2021-11-23, 1 file, -12/+72]
| |