| Commit message | Author | Age | Files | Lines |
| |
|
|\
| |
| |
| |
| | |
DOAJ import fuzzy match filter
See merge request webgroup/fatcat!92
|
| |
| |
| |
| |
| | |
The motivation for this change is to enable passing the 'reason' through
to edit extra metadata, in cases where we merge or cluster releases.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In this default configuration, any entity with a fuzzy match (even
"ambiguous") is skipped at import time to avoid creating duplicates.
This errs on the side of not creating new entities.
In the future, as we gain confidence in fuzzy match/verification, we
can start to ignore AMBIGUOUS, handle EXACT as the same release, and
merge STRONG (and WEAK?) matches under the same work entity.
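A minimal sketch of the skip rule described above, not the importer's actual code. The `MatchStatus` enum, `SKIP_STATUSES`, and `should_skip_import` are hypothetical names; the status labels follow the ones in this message, and `DIFFERENT` is an assumed label for "no match".

```python
from enum import Enum


class MatchStatus(Enum):
    # Hypothetical labels; DIFFERENT is an assumed "no match" status.
    EXACT = "exact"
    STRONG = "strong"
    WEAK = "weak"
    AMBIGUOUS = "ambiguous"
    DIFFERENT = "different"


# Conservative default: every status that hints at a possible duplicate,
# including AMBIGUOUS, blocks the import.
SKIP_STATUSES = frozenset(
    {
        MatchStatus.EXACT,
        MatchStatus.STRONG,
        MatchStatus.WEAK,
        MatchStatus.AMBIGUOUS,
    }
)


def should_skip_import(match_statuses, skip_statuses=SKIP_STATUSES):
    """Return True if any candidate match has a status in skip_statuses."""
    return any(status in skip_statuses for status in match_statuses)
```

Relaxing the policy later (for example, ignoring AMBIGUOUS) would then amount to passing a smaller `skip_statuses` set.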
|
| |
| |
| |
| | |
Using fuzzycat. Add basic test coverage.
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When webcapture or fileset entities are updated, then the release
entities associated with them also need to be updated (and work
entities, recursively).
A TODO is to handle the case where a release_id is *removed* as well as
*added*, and reprocess the releases in that case as well.
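The TODO above can be sketched as a set operation. This is an illustration under assumptions: `release_ids_to_reprocess` and its `include_removed` flag are hypothetical and do not exist in the codebase.

```python
def release_ids_to_reprocess(old_release_ids, new_release_ids, include_removed=False):
    """Return the sorted release ids that need reprocessing after an update.

    By default only the releases currently linked to the entity are returned.
    With include_removed=True, releases that were unlinked by the update are
    returned as well (the TODO case).
    """
    old, new = set(old_release_ids), set(new_release_ids)
    affected = set(new)
    if include_removed:
        # Releases present before the edit but missing afterwards.
        affected |= old - new
    return sorted(affected)
```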
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
preservation
These are simple/partial changes to have webcaptures and filesets show
up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A
longer-term TODO is to update the ES schema to have more granular
analytic flags.
Also includes a small generalization refactor for URL object parsing
into preservation status, shared across file+fileset+webcapture entity
types (all have similar URL objects with url+rel fields).
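A rough sketch of what a shared URL-object helper could look like, assuming each URL object is a dict with `url` and `rel` keys. The function name `url_flags`, the archive.org host check, and the scheme list are assumptions for illustration, not the rules used in the actual transform.

```python
from urllib.parse import urlparse


def url_flags(urls):
    """Derive coarse flags from a list of {"url": ..., "rel": ...} dicts.

    The same helper can serve file, fileset, and webcapture entities,
    since all three carry URL objects of this shape. Only the "url" field
    is inspected here; "rel" is accepted but unused in this sketch.
    """
    flags = {"in_ia": False, "in_web": False}
    for obj in urls:
        parsed = urlparse(obj.get("url") or "")
        host = parsed.netloc.lower()
        if host == "archive.org" or host.endswith(".archive.org"):
            # Assumption: any archive.org host counts as "in_ia".
            flags["in_ia"] = True
        elif parsed.scheme in ("http", "https", "ftp"):
            # Assumption: any other fetchable URL counts as "in_web".
            flags["in_web"] = True
    return flags
```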
|
| |
| |
| |
| |
| |
| |
| | |
These should have almost no change in behavior, but improve code
quality.
The one behavior change is counting ftp URLs as "in_web"
|
|/
| |
This method was huge and monolithic. This commit splits out the content
and container specific sections into helper functions to make it more
manageable. This involved refactoring to make many flags ("is_*" and
"in_*") part of the output dict through the entire code path, allowing
simple update() calls on the dict.
Noting that in the future we should refactor to use a type-annotated
class for the elasticsearch output object. Perhaps something
auto-generated from the ES schema itself (JSON files).
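The shape of this refactor can be sketched as follows. All function names, input shapes, and field names here (`container_fields`, `content_fields`, `release_to_doc`, `in_doaj`, `is_oa`, `file_count`) are hypothetical stand-ins chosen for illustration, not the real transform.

```python
def container_fields(container):
    """Container-specific fields; the defaults cover a missing container."""
    fields = {"in_doaj": False, "is_oa": False}
    if container:
        fields["in_doaj"] = bool(container.get("in_doaj"))
        # Assumption for this sketch only: DOAJ listing implies open access.
        fields["is_oa"] = fields["in_doaj"]
    return fields


def content_fields(files):
    """Content-specific fields derived from the attached file entities."""
    return {
        "file_count": len(files),
        "in_web": any(f.get("urls") for f in files),
    }


def release_to_doc(release):
    """Build the output dict by merging each helper's partial dict."""
    doc = {"ident": release["ident"]}
    doc.update(container_fields(release.get("container")))
    doc.update(content_fields(release.get("files") or []))
    return doc
```

Because every helper always returns the full set of flags it owns, the output dict has the same keys on every code path, which is what makes plain `update()` calls safe.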
|
|
|
|
|
| |
Don't currently have test coverage for most try_update() code; run the
inserts manually in testing.
|
|
|
|
|
| |
This is an open bug; however, it is important that tests pass on the
master branch.
|
|
|
|
|
|
| |
This is an attempt to fix spurious test failures, in which this text
block was getting detected as 'kr' on occasion. Apparently there is
non-determinism in the langdetect package.
|
|
|
|
|
| |
Easy to miss that we skip updates *twice*, and with this early bailout
we were not updating counts correctly.
|
|
|
|
| |
Also add missing code coverage for update path (disabled by default).
|
| |
|
|
|
|
|
|
|
| |
I believe this is safe and matches the regex filter in rust (fatcatd).
Keep hitting one-off DOIs that were passing through python check, so
being more strict from here forward.
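For illustration, a strict DOI check in Python might look like the sketch below. The pattern, the lowercasing step, and the name `clean_doi` are assumptions for this example; they are not copied from the Python helper or from the regex in fatcatd.

```python
import re

# Illustrative pattern: "10.", a 3-6 digit registrant prefix, a slash,
# then a non-empty suffix with no whitespace.
DOI_PATTERN = re.compile(r"10\.\d{3,6}/\S+")


def clean_doi(raw):
    """Return a normalized DOI string, or None if it fails the strict check."""
    if not raw:
        return None
    doi = raw.strip().lower()
    # fullmatch rejects prefixes such as "doi:" and any embedded whitespace.
    if not DOI_PATTERN.fullmatch(doi):
        return None
    return doi
```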
|
| |
Moved several normalizer helpers out of fatcat_tools.importers.common to
fatcat_tools.normal.
Copied language name and country name parser helpers from chocula
repository (built on existing pycountry helper library).
Have not gone through and refactored other importers to point to these
helpers yet; that should be a separate PR when this branch is merged.
Current changes are backwards compatible via re-imports.
|
|
|
|
| |
Several things to finish implementing and polish.
|
| |
seemingly from zenodo:
* https://fatcat.wiki/release/rzcpjwukobd4pj36ipla22cnoi
* https://doi.org/10.5281/zenodo.4041777
About 3400 records with "FULL MOVIE" in title, currently.
|
| |
|
|
|
|
|
| |
Includes a tiny tweak to the datacite import sample file to test this
code path.
|
| |
This is a small bugfix for a production issue.
|
|\
| |
| |
| |
| | |
ingest behavior changes; some datacite metadata tweaks
See merge request webgroup/fatcat!78
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In addition to changing the OA default, this was the main intended
behavior change in this group of commits: we want to make fewer ingest
attempts that we *expect* to fail, but default to an ingest/crawl
attempt if we are uncertain. This is because there is a long tail of
journals that register DOIs and are de facto OA (fulltext is available),
but we don't have metadata indicating them as such.
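The "attempt when uncertain" policy can be sketched with a three-valued flag. This is a simplified illustration: `should_attempt_ingest` and the `is_oa` parameter are hypothetical, and the real decision likely weighs more signals than a single flag.

```python
from typing import Optional


def should_attempt_ingest(is_oa: Optional[bool]) -> bool:
    """Decide whether to queue an ingest/crawl attempt.

    is_oa is True (known open access), False (known not open access, so a
    failure is expected), or None (no metadata either way).
    """
    # Only an explicit False counts as an expected failure. None falls
    # through to an attempt, covering journals that are de facto OA but
    # lack metadata saying so.
    return is_oa is not False
```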
|
| | |
|