path: root/python/fatcat_tools
Commit message  Author  Age  Files  Lines
* entity update worker: treat fileset and webcapture updates like file updates  Bryan Newbold  2020-12-16  1  -3/+25
* fix indentation  Bryan Newbold  2020-12-16  1  -2/+2
* have release elasticsearch transform count webcaptures and filesets towards p...  Bryan Newbold  2020-12-16  1  -26/+57
* small release_to_elasticsearch refactors  Bryan Newbold  2020-12-16  1  -7/+12
* refactor release_to_elasticsearch transform  Bryan Newbold  2020-12-16  1  -131/+148
* html ingest: small fixes to try_update() code path  Bryan Newbold  2020-12-15  1  -5/+5
* HACK: squash intermitent failure of detect_text_lang() test  Bryan Newbold  2020-12-11  1  -1/+2
* langdetect: more text for 'zh' test case  Bryan Newbold  2020-11-20  1  -1/+1
* crossref+datacite: remove confusing early update bail  Bryan Newbold  2020-11-20  2  -4/+0
* doaj: fix update code path (getattr not __dict__)  Bryan Newbold  2020-11-20  1  -4/+3
* DOAJ: handle empty identifier 'id' case  Bryan Newbold  2020-11-20  1  -0/+2
* clean DOI: ban all non-ASCII characters  Bryan Newbold  2020-11-19  1  -1/+4
* normal: handle langdetect of 'zh-cn' (not len=2)  Bryan Newbold  2020-11-19  1  -0/+3
* tweak DOAJ importer class args and default for do_updates  Bryan Newbold  2020-11-19  1  -2/+2
* if a release has DOAJ article id, count as OA  Bryan Newbold  2020-11-19  1  -0/+3
* implement remainder of DOAJ article importer  Bryan Newbold  2020-11-19  1  -57/+125
* handle more non-ASCII DOI cases  Bryan Newbold  2020-11-19  1  -1/+3
* more python normalizers, and move from importer common  Bryan Newbold  2020-11-19  2  -154/+326
* initial implementation of DOAJ importer  Bryan Newbold  2020-11-19  2  -0/+290
* html ingest: actual xhtml mimetype  Bryan Newbold  2020-11-16  1  -2/+2
* ingest tool: support for setting ingest type  Bryan Newbold  2020-11-06  1  -6/+6
* html ingest: remaining implementation  Bryan Newbold  2020-11-06  1  -22/+19
* ingest: progress on HTML ingest  Bryan Newbold  2020-11-05  1  -14/+30
* ingest: initial 'web' worker implementation  Bryan Newbold  2020-11-05  2  -67/+259
* refactor: white/black -> allow/block  Bryan Newbold  2020-11-05  1  -4/+4
* ingest: whitelist -> allowlist  Bryan Newbold  2020-11-05  1  -3/+3
* ingest: basic checks for ingest_type  Bryan Newbold  2020-11-05  1  -3/+29
* normalizer: filter out a specific non-ASCII character in DOI  Bryan Newbold  2020-11-04  1  -1/+3
* entity updates: don't ingest JSTOR DOI prefixes  Bryan Newbold  2020-10-23  1  -0/+2
* entity updater: new work update feed (ident and changelog metadata only)  Bryan Newbold  2020-10-16  1  -2/+24
* chocula importer: small tweaks to update behavior  Bryan Newbold  2020-10-08  1  -8/+6
* elastic transform: more preservation keepers  Bryan Newbold  2020-10-08  1  -1/+2
* address spammy datacite titles  Martin Czygan  2020-09-23  1  -0/+19
* ingest: default to crawl protocols.io DOIs  Bryan Newbold  2020-09-10  1  -0/+2
* datacite: handle case of empty-string version  Bryan Newbold  2020-09-10  1  -1/+1
* remove spurious print statement  Bryan Newbold  2020-09-03  1  -1/+0
* generic file entity clean-ups as part of file_meta importer  Bryan Newbold  2020-09-02  2  -0/+50
* fix comment typo (thanks martin)  Bryan Newbold  2020-08-27  1  -1/+1
* fixes and test coverage for file_meta importer  Bryan Newbold  2020-08-21  1  -5/+10
* initial implementation of file_meta importer  Bryan Newbold  2020-08-21  2  -0/+71
* entity updater: handle doi=None case better  Bryan Newbold  2020-08-14  1  -1/+1
* entity updater: es['publisher_type'] not always set  Bryan Newbold  2020-08-14  1  -1/+1
* Merge branch 'bnewbold-ingest-improvements' into 'master'  Martin Czygan  2020-08-13  2  -33/+114
|\
| * entity update: change big5 ingest behavior  Bryan Newbold  2020-08-11  1  -9/+15
| * entity update: default to ingest non-OA works  Bryan Newbold  2020-08-11  1  -9/+10
| * entity update: skip ingest of figshare+zenodo 'group' DOIs  Bryan Newbold  2020-08-11  1  -0/+15
| * datacite import: figshare-specific hacks  Bryan Newbold  2020-08-11  1  -3/+3
| * datacite import: refactor release_type detection into static method  Bryan Newbold  2020-08-11  1  -14/+51
| * datacite import: refactor publisher-specific hacks into static method  Bryan Newbold  2020-08-11  1  -15/+29
| * update crawl blocklist for SPNv2 requests which mostly fail  Bryan Newbold  2020-08-10  1  -2/+10