| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|\ \ \
| |_|/
|/| |
| | |
| | | |
Elasticsearch release transform updates: handle webcaptures better, and refactoring
See merge request webgroup/fatcat!91
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When webcapture or fileset entities are updated, the release entities
associated with them also need to be updated (as do work entities,
recursively).
TODO: handle the case where a release_id is *removed* as well as
*added*, and reprocess the affected releases in that case too.
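A minimal sketch of the reprocessing set described above, covering both
added and removed release ids. The function name and the plain-list
arguments are hypothetical and not taken from the fatcat code; the only
idea shown is taking the union of the old and new `release_ids`.

```python
from typing import Iterable, Optional, Set


def release_ids_to_reprocess(
    old_release_ids: Optional[Iterable[str]],
    new_release_ids: Optional[Iterable[str]],
) -> Set[str]:
    """Return every release id touched by a webcapture/fileset update.

    Hypothetical helper: the union of the previous and current
    release_ids covers ids that were added, removed, or left unchanged.
    """
    old = set(old_release_ids or [])
    new = set(new_release_ids or [])
    return old | new
```

Using the union rather than only the new list means a release that lost
its webcapture or fileset is reindexed too, instead of keeping stale
flags.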
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
preservation
These are simple/partial changes to have webcaptures and filesets show
up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A
longer-term TODO is to update the ES schema to have more granular
analytic flags.
Also includes a small generalization refactor for URL object parsing
into preservation status, shared across file+fileset+webcapture entity
types (all have similar URL objects with url+rel fields).
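As an illustration of the shared URL parsing, here is a minimal sketch
that turns a list of URL objects (dicts with `url` and `rel` keys) into
`in_ia` and `in_web` flags. The function name, the host check, and the
list of schemes are assumptions made for this example; the real
transform may apply different rules and also consults `rel`, which this
sketch ignores.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urlparse

# Assumed for illustration: schemes counted as "on the web"
WEB_SCHEMES = ("http", "https", "ftp")


def url_flags(urls: Optional[Iterable[Dict[str, str]]]) -> Dict[str, bool]:
    """Derive coarse availability flags from url+rel objects (sketch)."""
    flags = {"in_ia": False, "in_web": False}
    for obj in urls or []:
        raw = obj.get("url")
        if not raw:
            continue
        parsed = urlparse(raw)
        host = (parsed.hostname or "").lower()
        # Assumption: any archive.org host counts as Internet Archive
        if host == "archive.org" or host.endswith(".archive.org"):
            flags["in_ia"] = True
        elif parsed.scheme in WEB_SCHEMES:
            flags["in_web"] = True
    return flags
```

Because file, fileset, and webcapture entities carry the same kind of
URL object, one helper like this can be called for all three and the
results merged.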
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
These should cause almost no change in behavior, but improve code
quality.
The one behavior change is counting ftp URLs as "in_web".
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This method was huge and monolithic. This commit splits out the content
and container specific sections into helper functions to make it more
manageable. This involved refactoring to make many flags ("is_*" and
"in_*") part of the output dict through the entire code path, allowing
simple update() calls on the dict.
In the future, this should be refactored to use a type-annotated class
for the elasticsearch output object. Perhaps something auto-generated
from the ES schema itself (JSON files).
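The pattern described above can be sketched roughly as follows: each
helper returns a partial dict, and the main function merges them with
update(). All function names and most field names here are hypothetical
stand-ins, not the actual transform code; only the "flags live in the
output dict from the start" idea is taken from the commit message.

```python
from typing import Any, Dict, List, Optional


def default_flags() -> Dict[str, Any]:
    """Flags are present in the output dict from the very beginning."""
    return {"in_ia": False, "in_web": False, "is_oa": None}


def content_fields(files: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper for the content-specific section."""
    return {
        "in_ia": any(f.get("in_ia") for f in files),
        "in_web": any(f.get("in_web") for f in files),
    }


def container_fields(container: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper for the container-specific section."""
    if not container:
        return {}
    return {
        "container_name": container.get("name"),
        "is_oa": container.get("is_oa"),
    }


def release_to_doc(release: Dict[str, Any]) -> Dict[str, Any]:
    """Build the output dict by merging the helper results."""
    doc: Dict[str, Any] = {"title": release.get("title")}
    doc.update(default_flags())
    doc.update(content_fields(release.get("files") or []))
    doc.update(container_fields(release.get("container")))
    return doc
```

A type-annotated class for the output object would replace the plain
`Dict[str, Any]` here, so a misspelled flag name is caught by a type
checker instead of silently adding a new key.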
|
| |
| |
| |
| |
| | |
Don't currently have test coverage for most try_update() code; run the
inserts manually in testing.
|
| | |
|
| | |
|
| |
| |
| |
| | |
This reverts commit dbfc6e9bacaab4960e814192d66eefea87ef8930.
|
| |
| |
| |
| | |
This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
|
| | |
|
| | |
|
| |
| |
| |
| |
| | |
This is an open bug; however, it is important that tests pass on the
master branch.
|
| | |
|
| | |
|
| | |
|
|\ \
| | |
| | |
| | |
| | | |
DOAJ article metadata import
See merge request webgroup/fatcat!89
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Older sentry had an unsafe memory initialization error, which wasn't
caught by older compilers. Rust 1.48 catches the problem at runtime and
raises a panic. This meant that new builds (e.g., on the QA machine
after an update) were panicking.
The newest versions of sentry have modern dependencies, which breaks our
crufty old 'iron' dependency tree. The work-around is to only partially
update (v0.12 to v0.15).
This is a fairly frustrating situation. I'm hopeful that when we update
to a different web framework and openapi generator 5.0 (not yet
released), many of these dependency issues will be resolved, but I'm not
certain. I did notice that if we entirely remove Sentry, which has not
really been used much (only a small handful of issues reported over
several years), we might be able to resolve openssl dependency issues.
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | | |
This is an attempt to fix spurious test failures, in which this text
block was getting detected as 'kr' on occasion. Apparently there is
non-determinism in the langdetect package.
|
| | | |
|
| | |
| | |
| | |
| | |
| | | |
It is easy to miss that we skip updates *twice*; with this early bailout
we were not updating counts correctly.
|
| | |
| | |
| | |
| | | |
Also add missing code coverage for update path (disabled by default).
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
I believe this is safe and matches the regex filter in rust (fatcatd).
Keep hitting one-off DOIs that were passing through python check, so
being more strict from here forward.
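A minimal sketch of what a stricter Python-side DOI check might look
like. The pattern, the prefix stripping, the lowercasing, and the name
`clean_doi_strict` are all assumptions for illustration; they are not
confirmed to match the regex filter in fatcatd or the actual Python
helper.

```python
import re
from typing import Optional

# Assumed pattern: "10.", a 3-6 digit registrant code, "/", then a
# non-empty suffix with no whitespace
DOI_REGEX = re.compile(r"^10\.\d{3,6}/\S+$")

# Assumed set of prefixes to strip before validation
DOI_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def clean_doi_strict(raw: Optional[str]) -> Optional[str]:
    """Return a normalized DOI, or None if it fails the strict check."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Reject non-ASCII characters outright (an assumption of this sketch)
    if not doi.isascii():
        return None
    if not DOI_REGEX.match(doi):
        return None
    return doi
```

Returning None instead of attempting to repair a malformed value keeps
one-off bad DOIs out at import time, at the cost of dropping a few that
could have been salvaged.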
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Moved several normalizer helpers out of fatcat_tools.importers.common to
fatcat_tools.normal.
Copied language name and country name parser helpers from chocula
repository (built on existing pycountry helper library).
Have not gone through and refactored other importers to point to these
helpers yet; that should be a separate PR when this branch is merged.
Current changes are backwards compatible via re-imports.
|
| | |
| | |
| | |
| | | |
Several things to finish implementing and polish.
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
|/ / |
|
| | |
|
|\ \
| |/
|/|
| |
| | |
HTML webcapture ingest (and XML file ingest)
See merge request webgroup/fatcat!88
|
| | |
|
| | |
|
| | |
|