Commit message [Author, Age, Files, Lines]
...
* | dblp import: fix arxiv_id typo [Bryan Newbold, 2020-12-23, 1, -1/+1]
| |
| |     Would have been caught by mypy!
| |
* | ingest: allow dblp imports [Bryan Newbold, 2020-12-23, 1, -1/+1]
| |
* | fuzzy: set 120 second timeout on ES lookups [Bryan Newbold, 2020-12-23, 1, -1/+1]
| |
* | DOAJ import notes, and SQL/stats update [Bryan Newbold, 2020-12-23, 5, -0/+109]
|/
* dblp: polish HTML scrape/extract pipeline [Bryan Newbold, 2020-12-17, 4, -3/+30]
|
* dblp: flesh out update code path (especially to add container_id linkage) [Bryan Newbold, 2020-12-17, 1, -2/+6]
|
* dblp: run fuzzy matching at try_update time (same as DOAJ) [Bryan Newbold, 2020-12-17, 1, -1/+8]
|
* small dblp proposal updates [Bryan Newbold, 2020-12-17, 1, -5/+2]
|
* dblp: script and notes on container metadata generation [Bryan Newbold, 2020-12-17, 4, -0/+134]
|
* improve dblp release import [Bryan Newbold, 2020-12-17, 3, -4/+17]
|
* very simple dblp container importer [Bryan Newbold, 2020-12-17, 7, -7/+256]
|
* dblp release importer: container_id lookup TSV, and dump JSON mode [Bryan Newbold, 2020-12-17, 2, -13/+73]
|
* commit DBLP proposal progress [Bryan Newbold, 2020-12-17, 1, -7/+10]
|
* dblp import proposal [Bryan Newbold, 2020-12-17, 1, -0/+159]
|
|     Had notes on this floating around since August (not in git), but mostly
|     rewrote these in the past couple of days.
|
* basic test coverage of dblp release importer [Bryan Newbold, 2020-12-17, 4, -0/+503]
|
* wikidata QID normalize helper [Bryan Newbold, 2020-12-17, 1, -2/+24]
|
* initial implementation of dblp release importer (in progress) [Bryan Newbold, 2020-12-17, 3, -0/+474]
|
* add 'lxml' mode for large XML file import, and multi-tags [Bryan Newbold, 2020-12-17, 3, -19/+31]
|
* rust: fix malformed ext id error type [Bryan Newbold, 2020-12-17, 1, -2/+2]
|
|     This bug was due to copy/paste of SHA-1 check
|
* rust: rename and improve dblp key (id) syntax check [Bryan Newbold, 2020-12-17, 2, -9/+17]
|
* fix sloppy is_preserved ES transform test failure [Bryan Newbold, 2020-12-17, 1, -1/+1]
|
* DOAJ import notes [Bryan Newbold, 2020-12-17, 2, -2/+23]
|
* add dblp as an ingest source and identifier [Bryan Newbold, 2020-12-17, 1, -1/+2]
|
* ingest: allow doaj ingest responses [Bryan Newbold, 2020-12-17, 1, -1/+2]
|
* bug fix: is_preserved should always be bool [Bryan Newbold, 2020-12-17, 1, -2/+2]
|
* Merge branch 'bnewbold-doaj-fuzzy' into 'master' [bnewbold, 2020-12-18, 7, -267/+544]
|\
| |     DOAJ import fuzzy match filter
| |
| |     See merge request webgroup/fatcat!92
| |
| * update fuzzy helper to pass 'reason' through to import code [Bryan Newbold, 2020-12-17, 2, -5/+5]
| |
| |     The motivation for this change is to enable passing the 'reason' through
| |     to edit extra metadata, in cases where we merge or cluster releases.
| |
| * pipenv: bump fuzzycat to 0.1.9 [Bryan Newbold, 2020-12-17, 2, -5/+5]
| |
| * add fuzzy match filtering to DOAJ importer [Bryan Newbold, 2020-12-16, 2, -4/+23]
| |
| |     In this default configuration, any entities with a fuzzy match (even
| |     "ambiguous") will be skipped at import time, to prevent creating
| |     duplicates.
| |
| |     This is conservative towards not creating new/duplicate entities. In the
| |     future, as we get more confidence in fuzzy match/verification, we can
| |     start to ignore AMBIGUOUS, handle EXACT as same release, and merge STRONG
| |     (and WEAK?) matches under the same work entity.
| |
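The skip rule described in this commit message can be sketched roughly as follows. This is an illustration only, not the importer's actual code: `MatchStatus`, `SKIP_ON`, and `should_skip_import` are hypothetical names, and the enum lists only the four statuses named in the message rather than whatever fuzzycat itself defines.

```python
from enum import Enum
from typing import Iterable


class MatchStatus(Enum):
    # Hypothetical stand-in for the match statuses named in the commit message.
    EXACT = "exact"
    STRONG = "strong"
    WEAK = "weak"
    AMBIGUOUS = "ambiguous"


# Conservative default: every status, including AMBIGUOUS, blocks the import.
SKIP_ON = frozenset(MatchStatus)


def should_skip_import(statuses: Iterable[MatchStatus]) -> bool:
    """Return True if any fuzzy match result means the entity should be skipped."""
    return any(status in SKIP_ON for status in statuses)
```

Under these assumptions, loosening the policy later (for example, ignoring AMBIGUOUS) would amount to removing members from `SKIP_ON` and handling them separately.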
| * add fuzzy matching helper to importer base class [Bryan Newbold, 2020-12-16, 3, -2/+147]
| |
| |     Using fuzzycat. Add basic test coverage.
| |
| * pipenv: add fuzzycat dependency [Bryan Newbold, 2020-12-16, 2, -261/+374]
| |
* | Merge pull request #65 from ibnesayeed/patch-1 [bnewbold, 2020-12-17, 1, -1/+1]
|\ \
| | |     Improve status counting efficiency
| | |
| * | Improve status counting efficiency [Sawood Alam, 2020-12-17, 1, -1/+1]
| | |
| | |     When the input is large with a small number of unique items to be
| | |     counted, then counting as we go would be a linear and more efficient
| | |     approach than sorting and unique counting.
| | |
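The trade-off in this message can be illustrated with a small Python sketch. The one-line change in the commit itself is not shown in this log, so the function names and the sample status values below are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, Iterable


def count_by_sorting(statuses: Iterable[str]) -> Dict[str, int]:
    """Sort first, then count runs of equal items: O(n log n) in the input size."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(statuses))}


def count_as_we_go(statuses: Iterable[str]) -> Dict[str, int]:
    """Single pass with a hash map: O(n) time, memory bounded by the unique items."""
    counts: Counter = Counter()
    for status in statuses:
        counts[status] += 1
    return dict(counts)


# Hypothetical status values, for illustration only.
sample = ["success", "no-capture", "success", "wrong-mimetype", "success"]
```

Both functions return the same counts; the single-pass version avoids sorting the whole input and also works on a stream that does not fit in memory.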
* | | Merge branch 'bnewbold-es-transform-html' into 'master' [Martin Czygan, 2020-12-17, 5, -146/+296]
|\ \ \
| |_|/
|/| |
| | |     Elasticsearch release transform updates: handle webcaptures better, and
| | |     refactoring
| | |
| | |     See merge request webgroup/fatcat!91
| | |
| * | entity update worker: treat fileset and webcapture updates like file updates [Bryan Newbold, 2020-12-16, 1, -3/+25]
| | |
| | |     When webcapture or fileset entities are updated, then the release
| | |     entities associated with them also need to be updated (and work
| | |     entities, recursively).
| | |
| | |     A TODO is to handle the case where a release_id is *removed* as well as
| | |     *added*, and reprocess the releases in that case as well.
| | |
| * | fix indentation [Bryan Newbold, 2020-12-16, 1, -2/+2]
| | |
| * | have release elasticsearch transform count webcaptures and filesets towards preservation [Bryan Newbold, 2020-12-16, 1, -26/+57]
| | |
| | |     These are simple/partial changes to have webcaptures and filesets show
| | |     up in 'preservation', 'in_ia', and 'in_web' ES schema flags.
| | |
| | |     A longer-term TODO is to update the ES schema to have more granular
| | |     analytic flags.
| | |
| | |     Also includes a small generalization refactor for URL object parsing
| | |     into preservation status, shared across file+fileset+webcapture entity
| | |     types (all have similar URL objects with url+rel fields).
| | |
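A shared helper of the kind described here might look like the sketch below. The classification rules are assumptions, not the project's actual logic: `url_flags` is a hypothetical name, treating any archive.org host as `in_ia` and any other http/https/ftp URL as `in_web` is a guess, and the `rel` field is carried in the input but ignored.

```python
from typing import Dict, List
from urllib.parse import urlparse


def url_flags(urls: List[Dict[str, str]]) -> Dict[str, bool]:
    """Derive coarse flags from a list of {"url": ..., "rel": ...} objects."""
    flags = {"in_ia": False, "in_web": False}
    for entry in urls:
        parsed = urlparse(entry["url"])
        host = parsed.hostname or ""
        # ASSUMPTION: any archive.org host counts towards "in_ia".
        if host == "archive.org" or host.endswith(".archive.org"):
            flags["in_ia"] = True
        # ASSUMPTION: any other http/https/ftp URL counts towards "in_web".
        elif parsed.scheme in ("http", "https", "ftp"):
            flags["in_web"] = True
    return flags


# The same helper can serve file, fileset, and webcapture URL lists; the result
# is merged into the output document with a plain dict update().
doc: Dict[str, object] = {"release_id": "example"}
doc.update(url_flags([{"url": "ftp://example.org/paper.pdf", "rel": "web"}]))
```

Returning a plain dict keeps the helper compatible with `update()` calls on the output document.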
| * | improve release elasticsearch transform test coverage [Bryan Newbold, 2020-12-16, 3, -11/+86]
| | |
| * | small release_to_elasticsearch refactors [Bryan Newbold, 2020-12-16, 1, -7/+12]
| | |
| | |     These should have almost no change in behavior, but improve code
| | |     quality.
| | |
| | |     The one behavior change is counting ftp URLs as "in_web"
| | |
| * | refactor release_to_elasticsearch transform [Bryan Newbold, 2020-12-16, 1, -131/+148]
|/ /
| |
| |     This method was huge and monolithic. This commit splits out the content
| |     and container specific sections into helper functions to make it more
| |     manageable. This involved refactoring to make many flags ("is_*" and
| |     "in_*") part of the output dict through the entire code path, allowing
| |     simple update() calls on the dict.
| |
| |     Note that in the future we should refactor to use a type-annotated class
| |     for the elasticsearch output object. Perhaps something auto-generated
| |     from the ES schema itself (JSON files).
| |
* | html ingest: small fixes to try_update() code path [Bryan Newbold, 2020-12-15, 1, -5/+5]
| |
| |     Don't currently have test coverage for most try_update() code; run the
| |     inserts manually in testing.
| |
* | notes on partial-progress DOAJ release metadata import [Bryan Newbold, 2020-12-14, 1, -0/+105]
| |
* | bulk import notes on ORCID [Bryan Newbold, 2020-12-14, 1, -0/+55]
| |
* | Revert "gitlab CI: explicitly use xenial tag of image" [Bryan Newbold, 2020-12-11, 1, -1/+1]
| |
| |     This reverts commit dbfc6e9bacaab4960e814192d66eefea87ef8930.
| |
* | Revert "docker xenial base image: include python3.8" [Bryan Newbold, 2020-12-11, 1, -6/+1]
| |
| |     This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
| |
* | gitlab CI: explicitly use xenial tag of image [Bryan Newbold, 2020-12-11, 1, -1/+1]
| |
* | docker xenial base image: include python3.8 [Bryan Newbold, 2020-12-11, 1, -1/+6]
| |
* | HACK: squash intermittent failure of detect_text_lang() test [Bryan Newbold, 2020-12-11, 1, -1/+2]
| |
| |     This is an open bug; however, it is important that tests pass on the
| |     master branch.
| |
* | guide: small updates to container extra schema notes (from dblp work) [Bryan Newbold, 2020-12-11, 1, -2/+7]
| |
* | bulk edits: note ORCID update [Bryan Newbold, 2020-12-11, 1, -1/+5]
| |