Commit message (Author, Age, Files, Lines)
...
| * docker xenial: use get-pipenv.py to install pipenv et al (Bryan Newbold, 2020-12-22, 1 file, -5/+6)
| |
| * docker xenial: switch to rust 1.43.0 (Bryan Newbold, 2020-12-22, 1 file, -1/+1)
| |
| * docker xenial: include python3.8 (Bryan Newbold, 2020-12-22, 1 file, -1/+6)
| |
* | web ingest: terminal URL mismatch as skip, not assert (Bryan Newbold, 2020-12-30, 1 file, -1/+3)
| |
* | update stats (post DOAJ and dblp imports) (Bryan Newbold, 2020-12-29, 2 files, -0/+47)
| |
* | dblp import notes; bulk edit changelog update (Bryan Newbold, 2020-12-29, 2 files, -1/+63)
| |
* | finally update CHANGELOG for actual v0.3.3 tag/release [tag: v0.3.3] (Bryan Newbold, 2020-12-24, 1 file, -15/+16)
| |
* | rust openapi lib: bump version to v0.3.3 (Bryan Newbold, 2020-12-24, 1 file, -1/+1)
| |
* | rust: update lazy_static dependency (Bryan Newbold, 2020-12-24, 3 files, -35/+26)
| |
| |     The motivation for this is to quiet very verbose warnings about some
| |     deprecated use of std::sync. Expect no actual runtime/behavior change.
| |
* | dblp release import: skip arxiv_id releases (Bryan Newbold, 2020-12-24, 1 file, -0/+9)
| |
* | normalizer: test for un-versioned arxiv_id (Bryan Newbold, 2020-12-24, 1 file, -0/+4)
| |
* | dblp import: fix arxiv_id typo (Bryan Newbold, 2020-12-23, 1 file, -1/+1)
| |
| |     Would have been caught by mypy!
| |
* | ingest: allow dblp imports (Bryan Newbold, 2020-12-23, 1 file, -1/+1)
| |
* | fuzzy: set 120 second timeout on ES lookups (Bryan Newbold, 2020-12-23, 1 file, -1/+1)
| |
* | DOAJ import notes, and SQL/stats update (Bryan Newbold, 2020-12-23, 5 files, -0/+109)
|/
* dblp: polish HTML scrape/extract pipeline (Bryan Newbold, 2020-12-17, 4 files, -3/+30)
|
* dblp: flesh out update code path (especially to add container_id linkage) (Bryan Newbold, 2020-12-17, 1 file, -2/+6)
|
* dblp: run fuzzy matching at try_update time (same as DOAJ) (Bryan Newbold, 2020-12-17, 1 file, -1/+8)
|
* small dblp proposal updates (Bryan Newbold, 2020-12-17, 1 file, -5/+2)
|
* dblp: script and notes on container metadata generation (Bryan Newbold, 2020-12-17, 4 files, -0/+134)
|
* improve dblp release import (Bryan Newbold, 2020-12-17, 3 files, -4/+17)
|
* very simple dblp container importer (Bryan Newbold, 2020-12-17, 7 files, -7/+256)
|
* dblp release importer: container_id lookup TSV, and dump JSON mode (Bryan Newbold, 2020-12-17, 2 files, -13/+73)
|
* commit DBLP proposal progress (Bryan Newbold, 2020-12-17, 1 file, -7/+10)
|
* dblp import proposal (Bryan Newbold, 2020-12-17, 1 file, -0/+159)
|
|     Had notes on this floating around since August (not in git), but mostly
|     rewrote these in past couple days.
|
* basic test coverage of dblp release importer (Bryan Newbold, 2020-12-17, 4 files, -0/+503)
|
* wikidata QID normalize helper (Bryan Newbold, 2020-12-17, 1 file, -2/+24)
|
* initial implementation of dblp release importer (in progress) (Bryan Newbold, 2020-12-17, 3 files, -0/+474)
|
* add 'lxml' mode for large XML file import, and multi-tags (Bryan Newbold, 2020-12-17, 3 files, -19/+31)
|
* rust: fix malformed ext id error type (Bryan Newbold, 2020-12-17, 1 file, -2/+2)
|
|     This bug was due to copy/paste of SHA-1 check
|
* rust: rename and improve dblp key (id) syntax check (Bryan Newbold, 2020-12-17, 2 files, -9/+17)
|
* fix sloppy is_preserved ES transform test failure (Bryan Newbold, 2020-12-17, 1 file, -1/+1)
|
* DOAJ import notes (Bryan Newbold, 2020-12-17, 2 files, -2/+23)
|
* add dblp as an ingest source and identifier (Bryan Newbold, 2020-12-17, 1 file, -1/+2)
|
* ingest: allow doaj ingest responses (Bryan Newbold, 2020-12-17, 1 file, -1/+2)
|
* bug fix: is_preserved should always be bool (Bryan Newbold, 2020-12-17, 1 file, -2/+2)
|
* Merge branch 'bnewbold-doaj-fuzzy' into 'master' (bnewbold, 2020-12-18, 7 files, -267/+544)
|\
| |     DOAJ import fuzzy match filter
| |
| |     See merge request webgroup/fatcat!92
| |
| * update fuzzy helper to pass 'reason' through to import code (Bryan Newbold, 2020-12-17, 2 files, -5/+5)
| |
| |     The motivation for this change is to enable passing the 'reason' through
| |     to edit extra metadata, in cases where we merge or cluster releases.
| |
| * pipenv: bump fuzzycat to 0.1.9 (Bryan Newbold, 2020-12-17, 2 files, -5/+5)
| |
| * add fuzzy match filtering to DOAJ importer (Bryan Newbold, 2020-12-16, 2 files, -4/+23)
| |
| |     In this default configuration, any entities with a fuzzy match (even
| |     "ambiguous") will be skipped at import time, to prevent creating
| |     duplicates. This is conservative towards not creating new/duplicate
| |     entities.
| |
| |     In the future, as we get more confidence in fuzzy match/verification, we
| |     can start to ignore AMBIGUOUS, handle EXACT as same release, and merge
| |     STRONG (and WEAK?) matches under the same work entity.
| |
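The conservative default described in this commit body can be sketched in Python as follows. This is only an illustration under stated assumptions: `MatchStatus`, `BLOCKING`, and `should_skip_import` are hypothetical names, not the importer's or fuzzycat's actual API, and `DIFFERENT` is an assumed "not a match" verdict that the commit message does not name.

```python
from enum import Enum


class MatchStatus(Enum):
    # Hypothetical stand-in for the fuzzy matcher's verdicts.
    EXACT = "exact"
    STRONG = "strong"
    WEAK = "weak"
    AMBIGUOUS = "ambiguous"
    DIFFERENT = "different"  # assumed value, not named in the commit message


# Conservative default: any kind of match, even an ambiguous one, blocks import.
BLOCKING = {
    MatchStatus.EXACT,
    MatchStatus.STRONG,
    MatchStatus.WEAK,
    MatchStatus.AMBIGUOUS,
}


def should_skip_import(verdicts):
    """Return True if any candidate verdict suggests a possible duplicate."""
    return any(verdict in BLOCKING for verdict in verdicts)
```

Relaxing the policy later (for example, ignoring AMBIGUOUS) would then amount to removing members from the hypothetical `BLOCKING` set.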
| * add fuzzy matching helper to importer base class (Bryan Newbold, 2020-12-16, 3 files, -2/+147)
| |
| |     Using fuzzycat. Add basic test coverage.
| |
| * pipenv: add fuzzycat dependency (Bryan Newbold, 2020-12-16, 2 files, -261/+374)
| |
* | Merge pull request #65 from ibnesayeed/patch-1 (bnewbold, 2020-12-17, 1 file, -1/+1)
|\ \
| | |     Improve status counting efficiency
| | |
| * | Improve status counting efficiency (Sawood Alam, 2020-12-17, 1 file, -1/+1)
| | |
| | |     When the input is large with a small number of unique items to be
| | |     counted, counting as we go is linear and more efficient than sorting
| | |     and unique counting.
| | |
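The trade-off in this commit body can be illustrated with a short Python sketch. The patched script itself is not shown in this log, so the function names and the sample statuses below are hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_by_sorting(statuses):
    """Sort first, then count runs of equal items: O(n log n) time."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(statuses))}


def count_as_we_go(statuses):
    """Single pass with a hash map: O(n) time, memory grows only with unique items."""
    return dict(Counter(statuses))


# Hypothetical sample input: many rows, few distinct values.
sample = ["success", "no-capture", "success", "wrong-mimetype", "success"]
```

Both functions return the same counts; the single-pass version avoids holding and sorting the full input, which matters most when the input is large and the number of distinct statuses is small.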
* | | Merge branch 'bnewbold-es-transform-html' into 'master' (Martin Czygan, 2020-12-17, 5 files, -146/+296)
|\ \ \
| |_|/
|/| |
| | |     Elasticsearch release transform updates: handle webcaptures better,
| | |     and refactoring
| | |
| | |     See merge request webgroup/fatcat!91
| | |
| * | entity update worker: treat fileset and webcapture updates like file updates (Bryan Newbold, 2020-12-16, 1 file, -3/+25)
| | |
| | |     When webcapture or fileset entities are updated, then the release
| | |     entities associated with them also need to be updated (and work
| | |     entities, recursively).
| | |
| | |     A TODO is to handle the case where a release_id is *removed* as well as
| | |     *added*, and reprocess the releases in that case as well.
| | |
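One possible way to approach the TODO in this commit body is to compare the release ids before and after the edit and reprocess the union. The sketch below is an assumption for illustration only; `releases_to_reprocess` is a hypothetical helper, not code from the worker.

```python
def releases_to_reprocess(previous_release_ids, current_release_ids):
    """Return release ids touched by an edit, covering both additions and removals.

    Hypothetical helper: a release that was unlinked from the file-like entity
    needs reprocessing just as much as one that was newly linked.
    """
    previous = set(previous_release_ids)
    current = set(current_release_ids)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
        "all": sorted(previous | current),
    }
```

Reprocessing the "all" list would cover the removed case at the cost of also re-indexing releases whose linkage did not change.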
| * | fix indentation (Bryan Newbold, 2020-12-16, 1 file, -2/+2)
| | |
| * | have release elasticsearch transform count webcaptures and filesets towards preservation (Bryan Newbold, 2020-12-16, 1 file, -26/+57)
| | |
| | |     These are simple/partial changes to have webcaptures and filesets show
| | |     up in 'preservation', 'in_ia', and 'in_web' ES schema flags.
| | |
| | |     A longer-term TODO is to update the ES schema to have more granular
| | |     analytic flags.
| | |
| | |     Also includes a small generalization refactor for URL object parsing
| | |     into preservation status, shared across file+fileset+webcapture entity
| | |     types (all have similar URL objects with url+rel fields).
| | |
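The shared URL-object parsing mentioned in this commit body might look roughly like the Python sketch below. The classification rules are assumptions made for illustration (archive.org hosts count as `in_ia`; other http, https, or ftp URLs count as `in_web`), `url_flags` is a hypothetical name, and the `rel` field is carried on each object but ignored here. The actual transform may use different rules.

```python
from urllib.parse import urlparse


def url_flags(url_objects):
    """Derive coarse flags from a list of {"url": ..., "rel": ...} objects.

    The same helper could serve file, fileset, and webcapture entities,
    since all three carry URL objects of this shape.
    """
    flags = {"in_ia": False, "in_web": False}
    for obj in url_objects:
        parsed = urlparse(obj["url"])
        host = parsed.hostname or ""
        # Assumed rule: any archive.org host counts as held by the Internet Archive.
        if host == "archive.org" or host.endswith(".archive.org"):
            flags["in_ia"] = True
        # Assumed rule: any other http/https/ftp URL counts as on the web.
        elif parsed.scheme in ("http", "https", "ftp"):
            flags["in_web"] = True
    return flags
```

For example, `url_flags([{"url": "ftp://example.com/a.pdf", "rel": "web"}])` sets only `in_web` under these assumed rules.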
| * | improve release elasticsearch transform test coverage (Bryan Newbold, 2020-12-16, 3 files, -11/+86)
| | |
| * | small release_to_elasticsearch refactors (Bryan Newbold, 2020-12-16, 1 file, -7/+12)
| | |
| | |     These should have almost no change in behavior, but improve code
| | |     quality. The one behavior change is counting ftp URLs as "in_web"
| | |