| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
After noticing more upper/lower ambiguity in production. In particular,
some old ingest requests in the sandcrawler DB, which get
re-submitted/re-tried, have capitalized DOIs in the link source id
field.
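A minimal sketch of the kind of normalization described above, assuming
ingest requests are plain dicts with 'link_source' and 'link_source_id'
keys, and that DOI-sourced requests carry link_source == "doi" (that
value and the function name are assumptions, not taken from the actual
code):

```python
from typing import Any, Dict


def normalize_ingest_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request with DOI identifiers lowercased.

    Hypothetical helper: only touches requests whose link_source is "doi".
    """
    cleaned = dict(request)
    source_id = cleaned.get("link_source_id")
    if cleaned.get("link_source") == "doi" and source_id:
        # DOIs are case-insensitive, so lowercase is used as the canonical form
        cleaned["link_source_id"] = source_id.strip().lower()
    return cleaned
```

Returning a copy keeps the original (possibly re-tried) request untouched,
which makes the before/after easier to compare when debugging.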
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
These will be getting updates from ORCID and are usually more complete
and more correct for display, attribution, and search purposes.
Might need to tweak fuzzycat code to handle multiple names at the
verification stage.
|
|
|
|
|
| |
The schema type is a timestamp, but Python needs to serialize it as
JSON, and JSON serialization doesn't handle datetime automatically.
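One common way to handle this with the standard library, shown as a
sketch only (the row contents and the `json_default` name are
illustrative assumptions, not the code from this commit):

```python
import datetime
import json
from typing import Any


def json_default(value: Any) -> str:
    """Fallback for json.dumps(): encode dates and datetimes as ISO 8601."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    raise TypeError(f"not JSON serializable: {type(value).__name__}")


# Hypothetical row with a timestamp column
row = {"ident": "example", "updated": datetime.datetime(2021, 1, 2, 3, 4, 5)}
encoded = json.dumps(row, default=json_default, sort_keys=True)
```

json.dumps() only calls the `default` hook for objects it cannot encode
itself, so other field types are unaffected.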
|
|
|
|
| |
Includes transform code updates and partial test coverage.
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
preservation
These are simple/partial changes to have webcaptures and filesets show
up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A
longer-term TODO is to update the ES schema to have more granular
analytic flags.
Also includes a small generalization refactor for URL object parsing
into preservation status, shared across file+fileset+webcapture entity
types (all have similar URL objects with url+rel fields).
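A rough sketch of what a shared URL-to-preservation helper could look
like, assuming URL objects are dicts with 'url' and 'rel' keys. The
function name and the exact rules are assumptions; in particular, `rel`
is ignored here, and ftp is counted as "in_web" following a later commit
in this log:

```python
from typing import Dict, List
from urllib.parse import urlparse


def url_preservation_flags(urls: List[Dict[str, str]]) -> Dict[str, bool]:
    """Derive coarse 'in_ia' / 'in_web' flags from a list of URL objects."""
    flags = {"in_ia": False, "in_web": False}
    for obj in urls:
        parsed = urlparse(obj.get("url") or "")
        host = parsed.hostname or ""
        if host == "archive.org" or host.endswith(".archive.org"):
            # any archive.org (or subdomain) URL counts as held by IA
            flags["in_ia"] = True
        elif parsed.scheme in ("http", "https", "ftp"):
            # everything else reachable over http/https/ftp counts as "web"
            flags["in_web"] = True
    return flags
```

Because file, fileset, and webcapture entities all carry the same shape
of URL list, one helper like this can be called from each transform.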
|
|
|
|
|
|
|
| |
These should have almost no change in behavior, but improve code
quality.
The one behavior change is counting ftp URLs as "in_web"
|
|
|
|
|
|
|
|
|
|
|
|
| |
This method was huge and monolithic. This commit splits out the content
and container specific sections into helper functions to make it more
manageable. This involved refactoring to make many flags ("is_*" and
"in_*") part of the output dict through the entire code path, allowing
simple update() calls on the dict.
Note that in the future this should be refactored to use a
type-annotated class for the elasticsearch output object. Perhaps
something auto-generated from the ES schema itself (JSON files).
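The helper-plus-update() pattern described above might look roughly like
this. All function names, input shapes, and the specific flags
("in_kbart", "is_oa") are hypothetical stand-ins, not the real transform
code:

```python
from typing import Any, Dict, List, Optional


def content_flags(files: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Hypothetical helper: flags derived from file URLs."""
    flags = {"in_ia": False, "in_web": False}
    for f in files:
        for u in f.get("urls") or []:
            if "archive.org/" in (u.get("url") or ""):
                flags["in_ia"] = True
            else:
                flags["in_web"] = True
    return flags


def container_flags(container: Optional[Dict[str, Any]]) -> Dict[str, bool]:
    """Hypothetical helper: flags derived from container 'extra' metadata."""
    flags = {"in_kbart": False, "is_oa": False}
    if container:
        extra = container.get("extra") or {}
        flags["in_kbart"] = bool(extra.get("kbart"))
        flags["is_oa"] = bool(extra.get("doaj"))
    return flags


def release_to_doc(release: Dict[str, Any]) -> Dict[str, Any]:
    """Build the output dict, merging each helper's flags via update()."""
    doc: Dict[str, Any] = {"ident": release.get("ident")}
    doc.update(content_flags(release.get("files") or []))
    doc.update(container_flags(release.get("container")))
    return doc
```

Each helper returns a complete set of its flags with defaults, so the
top-level function never has to track which keys have been set.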
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
pass-through publisher_type from container extra metadata (ES field
already existed; this is from newer chocula metadata)
count arxiv and PMCID papers which haven't been crawled (by IA) as
"dark", not "bright"
|
| |
|
|
|
|
| |
Thanks @martin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Frequently when looking at preservation coverage of journals, the
current year shows as "un-preserved" when in fact there is robust KBART
(keepers, eg CLOCKSS/Portico) coverage. This is partially because we
don't update containers with KBART year spans very frequently (which is
on us), and partially because KBART reports are often a bit out of date
(eg, they don't show coverage for the current year). For that matter,
they probably take a few months to update the previous year as well, but
that is a larger time span to fudge over.
With this patch, Portico/LOCKSS/etc coverage for "last year" counts as
coverage of publications dated "this year". Note that for this to be
effective/correct, it is assumed that we will update containers with
coverage year spans at least once a year, and that we will re-index all
releases at least once a year.
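The year fudge can be illustrated with a small sketch. The function
name, the (start, end) tuple representation of year spans, and the
`current_year` parameter are assumptions made for the example:

```python
import datetime
from typing import Optional, Sequence, Tuple


def in_kbart_span(
    release_year: int,
    year_spans: Sequence[Tuple[int, int]],
    current_year: Optional[int] = None,
) -> bool:
    """Check whether a release year falls in any inclusive (start, end) span.

    Coverage that includes last year is also accepted for releases dated
    this year, since KBART reports tend to lag behind.
    """
    if current_year is None:
        current_year = datetime.date.today().year
    for start, end in year_spans:
        if start <= release_year <= end:
            return True
        # fudge: last year's coverage counts for this year's publications
        if release_year == current_year and start <= current_year - 1 <= end:
            return True
    return False
```

For example, with `current_year=2021` a span of (2010, 2020) covers a
release dated 2021, while a span of (2010, 2019) does not. Passing the
year explicitly keeps the check deterministic in tests.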
|
| |
|
|
|
|
|
|
|
|
|
| |
This will increase index size (URLs are often long in our corpus, and we
have many file entities), but seems worth it.
Initially added `ia_url` as a second field, guaranteed to always be an
*.archive.org URL, but `best_url` defaults to that anyways so didn't
seem worthwhile.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This tries to show the citeproc (bibtex, MLA, CSL-JSON) options for
more releases, and not show the links when they would break.
The primary motivation here is to work around two exceptions being
thrown in prod every day (according to sentry):
KeyError: 'role'
ValueError: CLS requries some surname (family name)
I'm guessing these are mostly coming from crawlers following the
citeproc links on release landing pages.
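A sketch of a pre-check that would avoid both exceptions by hiding the
links when the metadata is insufficient. The contributor field names
('role', 'surname'), the function name, and the rule that every
contributor must have both are assumptions for illustration:

```python
from typing import Any, Dict, List


def citeproc_ok(contribs: List[Dict[str, Any]]) -> bool:
    """Return True only if citation rendering is unlikely to raise.

    Guards against a missing 'role' key and against contributors with no
    surname (family name); releases with no contributors are rejected too.
    """
    if not contribs:
        return False
    for contrib in contribs:
        if not contrib.get("role"):
            return False
        if not contrib.get("surname"):
            return False
    return True
```

A template could then render the citeproc links only when this returns
True, so crawlers never find a link that leads to an error page.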
|
|\ |
|
| |
| |
| |
| |
| |
| | |
Particularly, the ezb=green match seems mostly incorrect.
Note that a release with a pmcid assigned could still be in an embargo
window?
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| |
| |
| |
| |
| | |
Includes a trivial test and transform, but not any workers or doc
updates.
|
|/
|
|
|
|
|
| |
For cases where there might be both PMC and DOI urls, prefer the
europepmc.org PMC ones over the DOI option.
May want to turn this into a config or command-line option in the future.
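A sketch of the preference, including the kind of switch the message
suggests as a possible future option. The candidate dict shape, the
"pmc"/"doi" labels, and the `prefer_pmc` parameter are assumptions:

```python
from typing import Any, Dict, List, Optional


def choose_candidate(
    candidates: List[Dict[str, Any]],
    prefer_pmc: bool = True,
) -> Optional[Dict[str, Any]]:
    """Pick one ingest candidate, preferring PMC over DOI by default."""
    order = ["pmc", "doi"] if prefer_pmc else ["doi", "pmc"]
    for source in order:
        for candidate in candidates:
            if candidate.get("link_source") == source:
                return candidate
    # fall back to whatever is first if neither source is present
    return candidates[0] if candidates else None
```

Exposing the ordering as a parameter makes it straightforward to wire up
to a config value or command-line flag later.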
|
|
|
|
| |
Refactoring to move this filter elsewhere
|
| |
|
|
|
|
|
| |
This is mostly changing ingest_type from 'file' to 'pdf', and adding
'link_source'/'link_source_id', plus some small cleanups.
|
| |
|
| |
|
| |
|