Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | small python 3.7 -> 3.8 tweaks | Bryan Newbold | 2021-01-05 | 2 | -3/+3 | |
| | ||||||
* | Merge branch 'bnewbold-ci-cleanups' into 'master' | bnewbold | 2021-01-05 | 2 | -10/+32 | |
|\ | | | | | | | | | Gitlab CI and docker base image cleanups. See merge request webgroup/fatcat!94 | |||||
| * | gitlab CI: cleanups | Bryan Newbold | 2020-12-22 | 1 | -4/+20 | |
| | | ||||||
| * | gitlab CI: explicitly use xenial tag of image | Bryan Newbold | 2020-12-22 | 1 | -1/+1 | |
| | | ||||||
| * | docker xenial: use get-pipenv.py to install pipenv et al | Bryan Newbold | 2020-12-22 | 1 | -5/+6 | |
| | | ||||||
| * | docker xenial: switch to rust 1.43.0 | Bryan Newbold | 2020-12-22 | 1 | -1/+1 | |
| | | ||||||
| * | docker xenial: include python3.8 | Bryan Newbold | 2020-12-22 | 1 | -1/+6 | |
| | | ||||||
* | | web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3 | |
| | | ||||||
* | | update stats (post DOAJ and dblp imports) | Bryan Newbold | 2020-12-29 | 2 | -0/+47 | |
| | | ||||||
* | | dblp import notes; bulk edit changelog update | Bryan Newbold | 2020-12-29 | 2 | -1/+63 | |
| | | ||||||
* | | finally update CHANGELOG for actual v0.3.3 tag/release (tag: v0.3.3) | Bryan Newbold | 2020-12-24 | 1 | -15/+16 | |
| | | ||||||
* | | rust openapi lib: bump version to v0.3.3 | Bryan Newbold | 2020-12-24 | 1 | -1/+1 | |
| | | ||||||
* | | rust: update lazy_static dependency | Bryan Newbold | 2020-12-24 | 3 | -35/+26 | |
| | | | | | | | | | | The motivation for this is to quiet very verbose warnings about some deprecated use of std::sync. Expect no actual runtime/behavior change. | |||||
* | | dblp release import: skip arxiv_id releases | Bryan Newbold | 2020-12-24 | 1 | -0/+9 | |
| | | ||||||
* | | normalizer: test for un-versioned arxiv_id | Bryan Newbold | 2020-12-24 | 1 | -0/+4 | |
| | | ||||||
* | | dblp import: fix arxiv_id typo | Bryan Newbold | 2020-12-23 | 1 | -1/+1 | |
| | | | | | | | | Would have been caught by mypy! | |||||
* | | ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1 | |
| | | ||||||
* | | fuzzy: set 120 second timeout on ES lookups | Bryan Newbold | 2020-12-23 | 1 | -1/+1 | |
| | | ||||||
* | | DOAJ import notes, and SQL/stats update | Bryan Newbold | 2020-12-23 | 5 | -0/+109 | |
|/ | ||||||
* | dblp: polish HTML scrape/extract pipeline | Bryan Newbold | 2020-12-17 | 4 | -3/+30 | |
| | ||||||
* | dblp: flesh out update code path (especially to add container_id linkage) | Bryan Newbold | 2020-12-17 | 1 | -2/+6 | |
| | ||||||
* | dblp: run fuzzy matching at try_update time (same as DOAJ) | Bryan Newbold | 2020-12-17 | 1 | -1/+8 | |
| | ||||||
* | small dblp proposal updates | Bryan Newbold | 2020-12-17 | 1 | -5/+2 | |
| | ||||||
* | dblp: script and notes on container metadata generation | Bryan Newbold | 2020-12-17 | 4 | -0/+134 | |
| | ||||||
* | improve dblp release import | Bryan Newbold | 2020-12-17 | 3 | -4/+17 | |
| | ||||||
* | very simple dblp container importer | Bryan Newbold | 2020-12-17 | 7 | -7/+256 | |
| | ||||||
* | dblp release importer: container_id lookup TSV, and dump JSON mode | Bryan Newbold | 2020-12-17 | 2 | -13/+73 | |
| | ||||||
* | commit DBLP proposal progress | Bryan Newbold | 2020-12-17 | 1 | -7/+10 | |
| | ||||||
* | dblp import proposal | Bryan Newbold | 2020-12-17 | 1 | -0/+159 | |
| | | | | | Had notes on this floating around since August (not in git), but mostly rewrote these in the past couple of days. | |||||
* | basic test coverage of dblp release importer | Bryan Newbold | 2020-12-17 | 4 | -0/+503 | |
| | ||||||
* | wikidata QID normalize helper | Bryan Newbold | 2020-12-17 | 1 | -2/+24 | |
| | ||||||
* | initial implementation of dblp release importer (in progress) | Bryan Newbold | 2020-12-17 | 3 | -0/+474 | |
| | ||||||
* | add 'lxml' mode for large XML file import, and multi-tags | Bryan Newbold | 2020-12-17 | 3 | -19/+31 | |
| | ||||||
* | rust: fix malformed ext id error type | Bryan Newbold | 2020-12-17 | 1 | -2/+2 | |
| | | | | This bug was due to copy/paste of SHA-1 check | |||||
* | rust: rename and improve dblp key (id) syntax check | Bryan Newbold | 2020-12-17 | 2 | -9/+17 | |
| | ||||||
* | fix sloppy is_preserved ES transform test failure | Bryan Newbold | 2020-12-17 | 1 | -1/+1 | |
| | ||||||
* | DOAJ import notes | Bryan Newbold | 2020-12-17 | 2 | -2/+23 | |
| | ||||||
* | add dblp as an ingest source and identifier | Bryan Newbold | 2020-12-17 | 1 | -1/+2 | |
| | ||||||
* | ingest: allow doaj ingest responses | Bryan Newbold | 2020-12-17 | 1 | -1/+2 | |
| | ||||||
* | bug fix: is_preserved should always be bool | Bryan Newbold | 2020-12-17 | 1 | -2/+2 | |
| | ||||||
* | Merge branch 'bnewbold-doaj-fuzzy' into 'master' | bnewbold | 2020-12-18 | 7 | -267/+544 | |
|\ | | | | | | | | | DOAJ import fuzzy match filter. See merge request webgroup/fatcat!92 | |||||
| * | update fuzzy helper to pass 'reason' through to import code | Bryan Newbold | 2020-12-17 | 2 | -5/+5 | |
| | | | | | | | | | | The motivation for this change is to enable passing the 'reason' through to edit extra metadata, in cases where we merge or cluster releases. | |||||
| * | pipenv: bump fuzzycat to 0.1.9 | Bryan Newbold | 2020-12-17 | 2 | -5/+5 | |
| | | ||||||
| * | add fuzzy match filtering to DOAJ importer | Bryan Newbold | 2020-12-16 | 2 | -4/+23 | |
| | | | | | | | | | | | | | | | | | | | | | | In this default configuration, any entities with a fuzzy match (even "ambiguous") will be skipped at import time, to prevent creating duplicates. This is conservative towards not creating new/duplicate entities. In the future, as we get more confidence in fuzzy match/verification, we can start to ignore AMBIGUOUS, handle EXACT as same release, and merge STRONG (and WEAK?) matches under the same work entity. | |||||
| * | add fuzzy matching helper to importer base class | Bryan Newbold | 2020-12-16 | 3 | -2/+147 | |
| | | | | | | | | Using fuzzycat. Add basic test coverage. | |||||
| * | pipenv: add fuzzycat dependency | Bryan Newbold | 2020-12-16 | 2 | -261/+374 | |
| | | ||||||
* | | Merge pull request #65 from ibnesayeed/patch-1 | bnewbold | 2020-12-17 | 1 | -1/+1 | |
|\ \ | | | | | | | Improve status counting efficiency | |||||
| * | | Improve status counting efficiency | Sawood Alam | 2020-12-17 | 1 | -1/+1 | |
| | | | | | | | | When the input is large but has only a small number of unique items to count, counting as we go is linear and more efficient than sorting followed by unique counting. | |||||
* | | | Merge branch 'bnewbold-es-transform-html' into 'master' | Martin Czygan | 2020-12-17 | 5 | -146/+296 | |
|\ \ \ | |_|/ |/| | | | | | | | | Elasticsearch release transform updates: handle webcaptures better, and refactoring. See merge request webgroup/fatcat!91 | |||||
| * | | entity update worker: treat fileset and webcapture updates like file updates | Bryan Newbold | 2020-12-16 | 1 | -3/+25 | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When webcapture or fileset entities are updated, then the release entities associated with them also need to be updated (and work entities, recursively). A TODO is to handle the case where a release_id is *removed* as well as *added*, and reprocess the releases in that case as well. |