path: root/python/fatcat_tools
Commit message [Author, Age, Files, Lines]
...
* more consistent and defensive lower-casing of DOIs [Bryan Newbold, 2021-06-23, 3, -3/+8]
* datacite: more careful title string access; fixes sentry #88350 [Martin Czygan, 2021-06-11, 1, -1/+1]
* clean_doi() should lower-case returned DOI [Bryan Newbold, 2021-06-07, 1, -1/+4]
* ingest: swap ingest and file checks, to result in clearer stats/counts of ski... [Bryan Newbold, 2021-06-03, 1, -2/+2]
* ingest: don't accept mag and s2 URLs [Bryan Newbold, 2021-06-03, 1, -4/+4]
* changelog worker: fix file/fileset typo, caught by lint [Bryan Newbold, 2021-05-25, 1, -1/+1]
* small python lint fixes (no behavior change) [Bryan Newbold, 2021-05-25, 3, -4/+2]
* ingest: add per-container ingest type overrides [Bryan Newbold, 2021-05-21, 1, -1/+17]
* arabesque importer: ensure full 14-digit timestamps [Bryan Newbold, 2021-05-21, 1, -1/+3]
* transforms: fix 'display_ame' typo [Bryan Newbold, 2021-04-19, 1, -2/+2]
* prefer contrib.creator.display_name over contrib.raw_name [Bryan Newbold, 2021-04-12, 2, -4/+7]
* es worker: ensure kafka messages get cleared [Bryan Newbold, 2021-04-12, 1, -0/+2]
* es indexing: more 'wip' fixes [Bryan Newbold, 2021-04-12, 1, -1/+5]
* ES indexing: skip 'wip' entities with a warning [Bryan Newbold, 2021-04-12, 1, -11/+16]
* container ES index worker: support for querying status [Bryan Newbold, 2021-04-06, 1, -5/+32]
* ES schema updates: doc_index_ts as a str, not datetime [Bryan Newbold, 2021-04-06, 1, -4/+4]
* container search schema: preservation stats, new fields [Bryan Newbold, 2021-04-06, 1, -2/+18]
* release ES: add discipline field [Bryan Newbold, 2021-04-06, 1, -0/+2]
* ES schemas: add doc_index_ts to all mappings [Bryan Newbold, 2021-04-06, 1, -0/+4]
* indexing: don't use document names [Bryan Newbold, 2021-04-06, 1, -14/+4]
* datacite: a missing surname should be None, not the empty string [Martin Czygan, 2021-04-02, 1, -2/+1]
* elasticsearch: simple new dblp and doaj fields [Bryan Newbold, 2021-01-20, 1, -0/+4]
* web ingest: terminal URL mismatch as skip, not assert [Bryan Newbold, 2020-12-30, 1, -1/+3]
* dblp release import: skip arxiv_id releases [Bryan Newbold, 2020-12-24, 1, -0/+9]
* normalizer: test for un-versioned arxiv_id [Bryan Newbold, 2020-12-24, 1, -0/+4]
* dblp import: fix arxiv_id typo [Bryan Newbold, 2020-12-23, 1, -1/+1]
* ingest: allow dblp imports [Bryan Newbold, 2020-12-23, 1, -1/+1]
* fuzzy: set 120 second timeout on ES lookups [Bryan Newbold, 2020-12-23, 1, -1/+1]
* dblp: polish HTML scrape/extract pipeline [Bryan Newbold, 2020-12-17, 1, -0/+14]
* dblp: flesh out update code path (especially to add container_id linkage) [Bryan Newbold, 2020-12-17, 1, -2/+6]
* dblp: run fuzzy matching at try_update time (same as DOAJ) [Bryan Newbold, 2020-12-17, 1, -1/+8]
* improve dblp release import [Bryan Newbold, 2020-12-17, 1, -1/+2]
* very simple dblp container importer [Bryan Newbold, 2020-12-17, 2, -0/+145]
* dblp release importer: container_id lookup TSV, and dump JSON mode [Bryan Newbold, 2020-12-17, 1, -10/+66]
* wikidata QID normalize helper [Bryan Newbold, 2020-12-17, 1, -2/+24]
* initial implementation of dblp release importer (in progress) [Bryan Newbold, 2020-12-17, 2, -0/+445]
* add 'lxml' mode for large XML file import, and multi-tags [Bryan Newbold, 2020-12-17, 1, -15/+28]
* add dblp as an ingest source and identifier [Bryan Newbold, 2020-12-17, 1, -1/+2]
* ingest: allow doaj ingest responses [Bryan Newbold, 2020-12-17, 1, -1/+2]
* bug fix: is_preserved should always be bool [Bryan Newbold, 2020-12-17, 1, -2/+2]
* Merge branch 'bnewbold-doaj-fuzzy' into 'master' [bnewbold, 2020-12-18, 2, -4/+71]
|\
| * update fuzzy helper to pass 'reason' through to import code [Bryan Newbold, 2020-12-17, 1, -3/+3]
| * add fuzzy match filtering to DOAJ importer [Bryan Newbold, 2020-12-16, 1, -2/+9]
| * add fuzzy matching helper to importer base class [Bryan Newbold, 2020-12-16, 1, -2/+62]
* | entity update worker: treat fileset and webcapture updates like file updates [Bryan Newbold, 2020-12-16, 1, -3/+25]
* | fix indentation [Bryan Newbold, 2020-12-16, 1, -2/+2]
* | have release elasticsearch transform count webcaptures and filesets towards p... [Bryan Newbold, 2020-12-16, 1, -26/+57]
* | small release_to_elasticsearch refactors [Bryan Newbold, 2020-12-16, 1, -7/+12]
* | refactor release_to_elasticsearch transform [Bryan Newbold, 2020-12-16, 1, -131/+148]
|/
* html ingest: small fixes to try_update() code path [Bryan Newbold, 2020-12-15, 1, -5/+5]