path: root/python
Commit message [author, date, files changed, lines -removed/+added]
* web ingest: terminal URL mismatch as skip, not assert [Bryan Newbold, 2020-12-30, files: 1, lines: -1/+3]
* dblp release import: skip arxiv_id releases [Bryan Newbold, 2020-12-24, files: 1, lines: -0/+9]
* normalizer: test for un-versioned arxiv_id [Bryan Newbold, 2020-12-24, files: 1, lines: -0/+4]
* dblp import: fix arxiv_id typo [Bryan Newbold, 2020-12-23, files: 1, lines: -1/+1]
* ingest: allow dblp imports [Bryan Newbold, 2020-12-23, files: 1, lines: -1/+1]
* fuzzy: set 120 second timeout on ES lookups [Bryan Newbold, 2020-12-23, files: 1, lines: -1/+1]
* dblp: polish HTML scrape/extract pipeline [Bryan Newbold, 2020-12-17, files: 1, lines: -0/+14]
* dblp: flesh out update code path (especially to add container_id linkage) [Bryan Newbold, 2020-12-17, files: 1, lines: -2/+6]
* dblp: run fuzzy matching at try_update time (same as DOAJ) [Bryan Newbold, 2020-12-17, files: 1, lines: -1/+8]
* improve dblp release import [Bryan Newbold, 2020-12-17, files: 3, lines: -4/+17]
* very simple dblp container importer [Bryan Newbold, 2020-12-17, files: 7, lines: -7/+256]
* dblp release importer: container_id lookup TSV, and dump JSON mode [Bryan Newbold, 2020-12-17, files: 2, lines: -13/+73]
* basic test coverage of dblp release importer [Bryan Newbold, 2020-12-17, files: 4, lines: -0/+503]
* wikidata QID normalize helper [Bryan Newbold, 2020-12-17, files: 1, lines: -2/+24]
* initial implementation of dblp release importer (in progress) [Bryan Newbold, 2020-12-17, files: 3, lines: -0/+474]
* add 'lxml' mode for large XML file import, and multi-tags [Bryan Newbold, 2020-12-17, files: 3, lines: -19/+31]
* fix sloppy is_preserved ES transfom test failure [Bryan Newbold, 2020-12-17, files: 1, lines: -1/+1]
* add dblp as an ingest source and identifier [Bryan Newbold, 2020-12-17, files: 1, lines: -1/+2]
* ingest: allow doaj ingest responses [Bryan Newbold, 2020-12-17, files: 1, lines: -1/+2]
* bug fix: is_preserved should always be bool [Bryan Newbold, 2020-12-17, files: 1, lines: -2/+2]
* Merge branch 'bnewbold-doaj-fuzzy' into 'master' [bnewbold, 2020-12-18, files: 7, lines: -267/+544]
|\
| * update fuzzy helper to pass 'reason' through to import code [Bryan Newbold, 2020-12-17, files: 2, lines: -5/+5]
| * pipenv: bump fuzzycat to 0.1.9 [Bryan Newbold, 2020-12-17, files: 2, lines: -5/+5]
| * add fuzzy match filtering to DOAJ importer [Bryan Newbold, 2020-12-16, files: 2, lines: -4/+23]
| * add fuzzy matching helper to importer base class [Bryan Newbold, 2020-12-16, files: 3, lines: -2/+147]
| * pipenv: add fuzzycat dependency [Bryan Newbold, 2020-12-16, files: 2, lines: -261/+374]
* | entity update worker: treat fileset and webcapture updates like file updates [Bryan Newbold, 2020-12-16, files: 1, lines: -3/+25]
* | fix indentation [Bryan Newbold, 2020-12-16, files: 1, lines: -2/+2]
* | have release elasticsearch transform count webcaptures and filesets towards p... [Bryan Newbold, 2020-12-16, files: 1, lines: -26/+57]
* | improve release elasticsearch transform test coverage [Bryan Newbold, 2020-12-16, files: 3, lines: -11/+86]
* | small release_to_elasticsearch refactors [Bryan Newbold, 2020-12-16, files: 1, lines: -7/+12]
* | refactor release_to_elasticsearch transform [Bryan Newbold, 2020-12-16, files: 1, lines: -131/+148]
|/
* html ingest: small fixes to try_update() code path [Bryan Newbold, 2020-12-15, files: 1, lines: -5/+5]
* HACK: squash intermitent failure of detect_text_lang() test [Bryan Newbold, 2020-12-11, files: 1, lines: -1/+2]
* DOAJ: remove accidentally commited 'skip' of a test [Bryan Newbold, 2020-11-20, files: 1, lines: -1/+0]
* langdetect: more text for 'zh' test case [Bryan Newbold, 2020-11-20, files: 1, lines: -1/+1]
* DOAJ: update importer README with example invocation [Bryan Newbold, 2020-11-20, files: 1, lines: -0/+7]
* crossref+datacite: remove confusing early update bail [Bryan Newbold, 2020-11-20, files: 2, lines: -4/+0]
* doaj: fix update code path (getattr not __dict__) [Bryan Newbold, 2020-11-20, files: 3, lines: -15/+70]
* DOAJ: handle empty identifier 'id' case [Bryan Newbold, 2020-11-20, files: 1, lines: -0/+2]
* clean DOI: ban all non-ASCII characters [Bryan Newbold, 2020-11-19, files: 1, lines: -1/+4]
* normal: handle langdetect of 'zh-cn' (not len=2) [Bryan Newbold, 2020-11-19, files: 1, lines: -0/+3]
* tweak DOAJ importer class args and default for do_updates [Bryan Newbold, 2020-11-19, files: 1, lines: -2/+2]
* show DOAJ (and dblp) identifiers in release view [Bryan Newbold, 2020-11-19, files: 1, lines: -1/+7]
* if a release has DOAJ article id, count as OA [Bryan Newbold, 2020-11-19, files: 1, lines: -0/+3]
* implement remainder of DOAJ article importer [Bryan Newbold, 2020-11-19, files: 3, lines: -68/+168]
* handle more non-ASCII DOI cases [Bryan Newbold, 2020-11-19, files: 1, lines: -1/+3]
* more python normalizers, and move from importer common [Bryan Newbold, 2020-11-19, files: 2, lines: -154/+326]
* initial implementation of DOAJ importer [Bryan Newbold, 2020-11-19, files: 4, lines: -0/+387]
* html ingest: actual xhtml mimetype [Bryan Newbold, 2020-11-16, files: 1, lines: -2/+2]