| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Code in a number of places (including the Pubmed importer) assumed that this
was already lower-casing DOIs, which resulted in some broken metadata being
created.
See also: https://github.com/internetarchive/fatcat/issues/83
This is just the first step of mitigation.
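As a rough illustration of the lower-casing step described above, here is a minimal sketch. The function name `normalize_doi` and its exact behavior are assumptions made for this example; it is not the helper that this commit changed.

```python
from typing import Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: trim whitespace and lower-case a DOI string.

    Returns None for empty input or input that does not start with "10.".
    """
    if not raw:
        return None
    doi = raw.strip().lower()
    # Every DOI begins with the "10." directory indicator.
    if not doi.startswith("10."):
        return None
    return doi
```

Callers that compare or look up DOIs can then rely on a single canonical case instead of each lower-casing on their own.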
|
This is an open bug; however, it is important that tests pass on the master
branch.
|
This is an attempt to fix spurious test failures, in which this text
block was getting detected as 'kr' on occasion. Apparently there is
non-determinism in the langdetect package.
|
I believe this is safe and matches the regex filter in Rust (fatcatd).
One-off DOIs kept passing through the Python check, so the check is
stricter from here forward.
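A stricter check of this kind could look roughly like the sketch below. The pattern is an assumption chosen for illustration (a numeric prefix, a slash, and a suffix of printable ASCII without spaces); it is not the actual regex used in fatcatd or in the Python code.

```python
import re

# Assumed pattern, for illustration only: "10.", a 3-6 digit registrant
# code, a slash, then one or more printable non-space ASCII characters.
STRICT_DOI_RE = re.compile(r"10\.\d{3,6}/[\x21-\x7e]+")


def is_strict_doi(doi: str) -> bool:
    """Return True only if the whole string matches the assumed pattern."""
    return STRICT_DOI_RE.fullmatch(doi) is not None
```

Using `fullmatch` rather than `match` means trailing whitespace or newlines cause a rejection instead of slipping through.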
|
Moved several normalizer helpers out of fatcat_tools.importers.common to
fatcat_tools.normal.
Copied language name and country name parser helpers from chocula
repository (built on existing pycountry helper library).
The other importers have not yet been refactored to point to these
helpers; that should be a separate PR once this branch is merged.
Current changes are backwards compatible via re-imports.
|
Technically, [...] DOI names may incorporate any printable characters
from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the
character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1).
Mostly for QA reasons, we currently treat a DOI with an "en dash" as
invalid.
|
Replace emdash with regular dash.
Replace double slash after partner ID with single slash. This conversion
seems to be done by crossref automatically on lookup. I tried several
examples, using doi.org resolver and Crossref API lookup.
Note that there are a number of fatcat entities with '//' in the DOI.
|
| |
This makes it possible to, e.g., paste an arXiv identifier or SHA-1 hash in
the general search box and do a quick lookup.
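One way such a quick lookup could recognize identifiers is sketched below. The function name `guess_identifier`, the returned labels, and the patterns are assumptions for illustration; only new-style arXiv identifiers (such as `1706.03762v5`) are handled, and the actual search code may work differently.

```python
import re
from typing import Optional, Tuple

# A SHA-1 digest is 40 hexadecimal characters.
SHA1_RE = re.compile(r"[0-9a-fA-F]{40}")
# New-style arXiv identifier, with optional "arXiv:" label and version.
ARXIV_RE = re.compile(r"(?:arxiv:)?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def guess_identifier(query: str) -> Optional[Tuple[str, str]]:
    """Return (kind, value) if the query looks like a known identifier."""
    q = query.strip()
    if SHA1_RE.fullmatch(q):
        return ("sha1", q.lower())
    m = ARXIV_RE.fullmatch(q)
    if m:
        return ("arxiv", m.group(1))
    # Anything else falls through to the normal full-text search.
    return None
```

A search handler could call this first and redirect to a lookup when it returns a value, falling back to a regular query otherwise.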
|