| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
Had notes on this floating around since August (not in git), but mostly
rewrote these in the past couple of days.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
This bug was due to copy/paste of SHA-1 check
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|\
| |
| |
| |
| | |
DOAJ import fuzzy match filter
See merge request webgroup/fatcat!92
|
| |
| |
| |
| |
| | |
The motivation for this change is to enable passing the 'reason' through
to edit extra metadata, in cases where we merge or cluster releases.
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In this default configuration, any entities with a fuzzy match (even
"ambiguous") will be skipped at import time, to prevent creating
duplicates. This is conservative towards not creating new/duplicate
entities.
In the future, as we get more confidence in fuzzy match/verification, we
can start to ignore AMBIGUOUS, handle EXACT as same release, and merge
STRONG (and WEAK?) matches under the same work entity.
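
The conservative default described above could be sketched roughly as follows. This is only an illustration: the `Status` enum here is a local stand-in for the match statuses named in the message (`DIFFERENT` is an assumed "no match" value), and `should_import` and `SKIP_ON` are hypothetical names, not the importer's actual API.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical stand-in for the fuzzy match statuses named above
    EXACT = "exact"
    STRONG = "strong"
    WEAK = "weak"
    AMBIGUOUS = "ambiguous"
    DIFFERENT = "different"  # assumed "not a match" status


# Conservative default: any kind of match, even AMBIGUOUS, blocks the import
SKIP_ON = frozenset({Status.EXACT, Status.STRONG, Status.WEAK, Status.AMBIGUOUS})


def should_import(match_statuses, skip_on=SKIP_ON):
    """Return True only if none of the fuzzy match results is in skip_on."""
    return not any(status in skip_on for status in match_statuses)
```

Relaxing the policy later (for example, ignoring AMBIGUOUS) would then only mean passing a smaller `skip_on` set, such as `SKIP_ON - {Status.AMBIGUOUS}`.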
|
| |
| |
| |
| | |
Using fuzzycat. Add basic test coverage.
|
| | |
|
|\ \
| | |
| | | |
Improve status counting efficiency
|
| | |
| | |
| | | |
When the input is large but contains only a small number of unique items, counting as we go is linear and more efficient than sorting and then counting unique items.
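
A minimal sketch of the two approaches, for comparison. The function names and the example status strings are made up for illustration and are not taken from the merged code.

```python
from collections import Counter
from itertools import groupby


def count_by_sorting(statuses):
    """Sort, then count runs of equal items: O(n log n) time."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(statuses))}


def count_as_we_go(statuses):
    """Single pass with a hash map: O(n) time, memory grows with unique items only."""
    counts = Counter()
    for status in statuses:
        counts[status] += 1
    return dict(counts)


# Hypothetical example input
statuses = ["insert", "skip", "insert", "exists", "skip", "insert"]
```

Both functions return the same mapping; the single-pass version also works on a stream, since it never needs the whole input in memory at once.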
|
|\ \ \
| |_|/
|/| |
| | |
| | | |
Elasticsearch release transform updates: handle webcaptures better, and refactoring
See merge request webgroup/fatcat!91
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When webcapture or fileset entities are updated, then the release
entities associated with them also need to be updated (and work
entities, recursively).
A TODO is to handle the case where a release_id is *removed* as well as
*added*, and reprocess the releases in that case as well.
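
One possible way to address the TODO, sketched under the assumption that both the previous and the current revision of the entity expose a list of release ids. `releases_to_reprocess` is a hypothetical helper, not existing code.

```python
def releases_to_reprocess(old_release_ids, new_release_ids):
    """Return every release touched by the update.

    Taking the union covers releases that were added, removed, or kept,
    so a release that lost its webcapture/fileset is re-indexed too.
    """
    return set(old_release_ids or []) | set(new_release_ids or [])
```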
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
preservation
These are simple/partial changes to have webcaptures and filesets show
up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A
longer-term TODO is to update the ES schema to have more granular
analytic flags.
Also includes a small generalization refactor for URL object parsing
into preservation status, shared across file+fileset+webcapture entity
types (all have similar URL objects with url+rel fields).
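
A rough sketch of what such a shared helper might look like. The rules below (an archive.org host counts as "in_ia", any other http/https/ftp URL counts as "in_web") are illustrative assumptions, and `url_flags` is a hypothetical name. Only the `url` field is inspected here; `rel` is carried on the objects but unused.

```python
from urllib.parse import urlparse


def url_flags(urls):
    """Derive preservation flags from a list of {"url": ..., "rel": ...} objects.

    The same function can be fed URL lists from file, fileset, or
    webcapture entities, since they share this shape.
    """
    flags = {"in_ia": False, "in_web": False}
    for obj in urls or []:
        parsed = urlparse(obj.get("url") or "")
        host = parsed.hostname or ""
        if host == "archive.org" or host.endswith(".archive.org"):
            # Assumed rule: anything hosted under archive.org counts as "in_ia"
            flags["in_ia"] = True
        elif parsed.scheme in ("http", "https", "ftp"):
            # Assumed rule: any other web-reachable URL counts as "in_web"
            flags["in_web"] = True
    return flags
```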
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
These should have almost no change in behavior, but improve code
quality.
The one behavior change is counting ftp URLs as "in_web"
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This method was huge and monolithic. This commit splits out the content
and container specific sections into helper functions to make it more
manageable. This involved refactoring to make many flags ("is_*" and
"in_*") part of the output dict through the entire code path, allowing
simple update() calls on the dict.
Note that in the future we should refactor to use a type-annotated class
for the elasticsearch output object. Perhaps something auto-generated
from the ES schema itself (JSON files).
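
The pattern described above, with flags living in the output dict from the start and helpers merged in via update(), might look roughly like this. All function and field names here are hypothetical and are not the actual transform code or ES schema.

```python
from typing import Any, Dict, Optional


def content_fields(release: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: flags derived from attached file entities."""
    files = release.get("files") or []
    return {
        "in_web": any(f.get("urls") for f in files),
        "file_count": len(files),
    }


def container_fields(container: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: flags derived from the container, if any."""
    if not container:
        return {"is_oa": False}
    return {"is_oa": bool(container.get("is_oa")), "container_name": container.get("name")}


def release_to_doc(release: Dict[str, Any]) -> Dict[str, Any]:
    # Flags start out in the output dict with defaults, so each helper
    # can simply override them with a plain update() call.
    doc: Dict[str, Any] = {"title": release.get("title"), "in_web": False, "is_oa": False}
    doc.update(content_fields(release))
    doc.update(container_fields(release.get("container")))
    return doc
```

A type-annotated class for the output object, as suggested above, would replace the plain dict while keeping the same helper structure.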
|
| |
| |
| |
| |
| | |
Don't currently have test coverage for most try_update() code; run the
inserts manually in testing.
|
| | |
|
| | |
|
| |
| |
| |
| | |
This reverts commit dbfc6e9bacaab4960e814192d66eefea87ef8930.
|
| |
| |
| |
| | |
This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
|
| | |
|
| | |
|
| |
| |
| |
| |
| | |
This is an open bug; however, it is important that tests pass on the
master branch.
|
| | |
|
| | |
|
| | |
|
|\ \
| | |
| | |
| | |
| | | |
DOAJ article metadata import
See merge request webgroup/fatcat!89
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Older sentry had an unsafe memory initialization error, which wasn't
caught by older compilers. Rust 1.48 catches the problem at runtime and
raises a panic. This meant that new builds (eg, on QA machine after
update) were panicking.
Newest versions of sentry have modern dependencies, which breaks our
crufty old 'iron' dependency tree. Work-around is to only partially
update (v0.12 to v0.15).
This is a fairly frustrating situation. I'm hopeful that when we update
to a different web framework and openapi generator 5.0 (not yet
released), many of these dependency issues will be resolved, but I'm not
certain. I did notice that if we entirely remove Sentry, which has not
really been used much (only a small handful of issues reported over
several years), we might be able to resolve openssl dependency issues.
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | | |
This is an attempt to fix spurious test failures, in which this text
block was getting detected as 'kr' on occasion. Apparently there is
non-determinism in the langdetect package.
|
| | | |
|
| | |
| | |
| | |
| | |
| | | |
Easy to miss that we skip updates *twice*, and with this early bailout
we were not updating counts correctly.
|
| | |
| | |
| | |
| | | |
Also add missing code coverage for update path (disabled by default).
|