Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | html ingest: remaining implementation | Bryan Newbold | 2020-11-06 | 1 | -22/+19 |
| | |||||
* | ingest: progress on HTML ingest | Bryan Newbold | 2020-11-05 | 1 | -14/+30 |
| | |||||
* | ingest: initial 'web' worker implementation | Bryan Newbold | 2020-11-05 | 2 | -67/+259 |
| | |||||
* | refactor: white/black -> allow/block | Bryan Newbold | 2020-11-05 | 1 | -4/+4 |
| | |||||
* | ingest: whitelist -> allowlist | Bryan Newbold | 2020-11-05 | 1 | -3/+3 |
| | |||||
* | ingest: basic checks for ingest_type | Bryan Newbold | 2020-11-05 | 1 | -3/+29 |
| | |||||
* | normalizer: filter out a specific non-ASCII character in DOI | Bryan Newbold | 2020-11-04 | 1 | -1/+3 |
| | |||||
* | entity updates: don't ingest JSTOR DOI prefixes | Bryan Newbold | 2020-10-23 | 1 | -0/+2 |
| | |||||
* | entity updater: new work update feed (ident and changelog metadata only) | Bryan Newbold | 2020-10-16 | 1 | -2/+24 |
| | |||||
* | chocula importer: small tweaks to update behavior | Bryan Newbold | 2020-10-08 | 1 | -8/+6 |
| | |||||
* | elastic transform: more preservation keepers | Bryan Newbold | 2020-10-08 | 1 | -1/+2 |
| | |||||
* | address spammy datacite titles | Martin Czygan | 2020-09-23 | 1 | -0/+19 |
| | | | | | | | | | seemingly from zenodo: * https://fatcat.wiki/release/rzcpjwukobd4pj36ipla22cnoi * https://doi.org/10.5281/zenodo.4041777 About 3400 records with "FULL MOVIE" in title, currently. | ||||
* | ingest: default to crawl protocols.io DOIs | Bryan Newbold | 2020-09-10 | 1 | -0/+2 |
| | |||||
* | datacite: handle case of empty-string version | Bryan Newbold | 2020-09-10 | 1 | -1/+1 |
| | | | | | Includes a tiny tweak to the datacite import sample file to test this code path. | ||||
* | remove spurious print statement | Bryan Newbold | 2020-09-03 | 1 | -1/+0 |
| | |||||
* | generic file entity clean-ups as part of file_meta importer | Bryan Newbold | 2020-09-02 | 2 | -0/+50 |
| | |||||
* | fix comment typo (thanks martin) | Bryan Newbold | 2020-08-27 | 1 | -1/+1 |
| | |||||
* | fixes and test coverage for file_meta importer | Bryan Newbold | 2020-08-21 | 1 | -5/+10 |
| | |||||
* | initial implementation of file_meta importer | Bryan Newbold | 2020-08-21 | 2 | -0/+71 |
| | |||||
* | entity updater: handle doi=None case better | Bryan Newbold | 2020-08-14 | 1 | -1/+1 |
| | |||||
* | entity updater: es['publisher_type'] not always set | Bryan Newbold | 2020-08-14 | 1 | -1/+1 |
| | | | | This is a small bugfix for a production issue. | ||||
* | Merge branch 'bnewbold-ingest-improvements' into 'master' | Martin Czygan | 2020-08-13 | 2 | -33/+114 |
|\ | | | | | | | | | ingest behavior changes; some datacite metadata tweaks. See merge request webgroup/fatcat!78 | ||||
| * | entity update: change big5 ingest behavior | Bryan Newbold | 2020-08-11 | 1 | -9/+15 |
| | | | | | | | | | | | | | | | | | | In addition to changing the OA default, this was the main intended behavior change in this group of commits: we want to make fewer ingest attempts that we *expect* to fail, but default to an ingest/crawl attempt when we are uncertain. This is because there is a long tail of journals that register DOIs and are de facto OA (fulltext is available), but we don't have metadata indicating them as such. | ||||
| * | entity update: default to ingest non-OA works | Bryan Newbold | 2020-08-11 | 1 | -9/+10 |
| | | |||||
| * | entity update: skip ingest of figshare+zenodo 'group' DOIs | Bryan Newbold | 2020-08-11 | 1 | -0/+15 |
| | | |||||
| * | datacite import: figshare-specific hacks | Bryan Newbold | 2020-08-11 | 1 | -3/+3 |
| | | |||||
| * | datacite import: refactor release_type detection into static method | Bryan Newbold | 2020-08-11 | 1 | -14/+51 |
| | | |||||
| * | datacite import: refactor publisher-specific hacks into static method | Bryan Newbold | 2020-08-11 | 1 | -15/+29 |
| | | | | | | | | Also tweak title/publisher detection to use DOI prefixes | ||||
| * | update crawl blocklist for SPNv2 requests which mostly fail | Bryan Newbold | 2020-08-10 | 1 | -2/+10 |
| | | |||||
* | | harvest: datacite API yields HTTP 200 with broken JSON | Martin Czygan | 2020-08-10 | 1 | -1/+8 |
|/ | | | | As a first step: log response body for debugging. | ||||
* | release ES transform tweaks | Bryan Newbold | 2020-08-07 | 1 | -3/+5 |
| | | | | | | | | pass-through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata); count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright" | ||||
* | chocula import update tweaks | Bryan Newbold | 2020-08-04 | 1 | -10/+14 |
| | |||||
* | more update keys and cases for chocula importer | Bryan Newbold | 2020-08-04 | 1 | -5/+11 |
| | |||||
* | fix key name mismatch in chocula importer | Bryan Newbold | 2020-08-04 | 1 | -1/+1 |
| | | | | chocula 'export-fatcat' uses 'ident', not 'fatcat_ident' | ||||
* | basic toml transform helper | Bryan Newbold | 2020-07-30 | 2 | -4/+20 |
| | |||||
* | Merge branch 'bnewbold-more-lint-fixes' into 'master' | Martin Czygan | 2020-07-24 | 6 | -25/+18 |
|\ | | | | | | | | | more lint fixes. See merge request webgroup/fatcat!69 | ||||
| * | fix issnl typo in pubmed | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | | | | | | | | | | | | | | | | | | | | Oh no! This bug may actually have had significant negative impact on metadata in fatcat, in terms of missing container_id associations with pubmed entities. There are about 500k release entities with a PMID but no container_id. Of those, 89k have at least a container_name. Unclear how many would have matched to ISSN-L and thus to a container. | ||||
| * | remove isascii() work around definition in importers/datacite.py | Bryan Newbold | 2020-07-23 | 1 | -7/+1 |
| | | | | | | | | We are on python3.7 now, so this isn't needed. | ||||
| * | simple lint (flake8) fixes over python codebase | Bryan Newbold | 2020-07-23 | 5 | -17/+16 |
| | | | | | | | | | | | | These should not have any behavior changes, though a number of exception catches are now more general, and there may be long-tail exceptions getting thrown in these statements. | ||||
* | | simplify in_kbart check statement | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | | | | | | | | Thanks @martin | ||||
* | | make in_kbart transform inclusive of last year | Bryan Newbold | 2020-07-23 | 1 | -0/+9 |
|/ | | | | | | | | | | | | | | | | Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, eg CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (eg, they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over. This patch means Portico/LOCKSS/etc coverage for "last year" will count as coverage of publications dated "this year". Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year. | ||||
* | Merge branch 'martin-datacite-duplicated-author-gh-59' into 'master' | bnewbold | 2020-07-11 | 1 | -6/+60 |
|\ | | | | | | | | | datacite: address duplicated contributor issue. See merge request webgroup/fatcat!65 | ||||
| * | datacite: resolve formatting issues in tests | Martin Czygan | 2020-07-10 | 33 | -133/+51 |
| |\ | |||||
| * | | datacite: there should be no index gaps | Martin Czygan | 2020-07-10 | 1 | -2/+8 |
| | | | |||||
| * | | datacite: document contributor types | Martin Czygan | 2020-07-10 | 1 | -0/+25 |
| | | | |||||
| * | | wip: contrib, GH59 | Martin Czygan | 2020-07-10 | 1 | -16/+22 |
| | | | |||||
| * | | datacite: address duplicated contributor issue | Martin Czygan | 2020-07-07 | 1 | -0/+16 |
| | | | | | | | | | | | | | | | | | | | | | Use string comparison. * https://fatcat.wiki/release/spjysmrnsrgyzgq6ise5o44rlu/contribs * https://api.datacite.org/dois/10.25940/roper-31098406 | ||||
* | | | Merge branch 'martin-datacite-bugfix-sentry-44035' into 'master' | bnewbold | 2020-07-11 | 1 | -0/+4 |
|\ \ \ | |_|/ |/| | | | | | | | | datacite: mitigate sentry #44035. See merge request webgroup/fatcat!66 | ||||
| * | | datacite: mitigate sentry #44035 | Martin Czygan | 2020-07-10 | 1 | -0/+4 |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to sentry, running `c.get('nameIdentifiers', []) or []` on a c with value: ``` {'affiliation': [], 'familyName': 'Guidon', 'givenName': 'Manuel', 'nameIdentifiers': {'nameIdentifier': 'https://orcid.org/0000-0003-3543-6683', 'nameIdentifierScheme': 'ORCID', 'schemeUri': 'https://orcid.org'}, 'nameType': 'Personal'} ``` results in a string, which I cannot reproduce. The document in question at: https://api.datacite.org/dois/10.26275/kuw1-fdls seems fine, too. | ||||
* | | | Merge branch 'martin-arxiv-fix-http-503' into 'master' | bnewbold | 2020-07-10 | 1 | -1/+1 |
|\ \ \ | |/ / |/| | | | | | | | | arxiv: address 503, "Retry after specified interval" error. See merge request webgroup/fatcat!64 |
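
The 2020-07-23 commit listed above, "make in_kbart transform inclusive of last year", describes counting keeper coverage for the previous year as coverage for releases dated in the current year. The rule might look roughly like the sketch below. This is an illustration under stated assumptions, not the code from that commit: the function name `in_kbart_year`, the `year_spans` list of inclusive `(start, end)` tuples, and the `this_year` parameter are all hypothetical.

```python
import datetime
from typing import List, Optional, Tuple


def in_kbart_year(
    release_year: int,
    year_spans: List[Tuple[int, int]],
    this_year: Optional[int] = None,
) -> bool:
    """Return True if release_year falls inside any KBART coverage span.

    Hypothetical helper: spans are assumed to be inclusive (start, end)
    year pairs. For releases dated in the current year, coverage of the
    previous year is also accepted, because KBART reports tend to lag.
    """
    if this_year is None:
        this_year = datetime.date.today().year
    for start, end in year_spans:
        # Direct hit: the release year is inside the reported span.
        if start <= release_year <= end:
            return True
        # Fudge: treat coverage of "last year" as coverage of "this year".
        if release_year == this_year and start <= release_year - 1 <= end:
            return True
    return False
```

Passing `this_year` explicitly keeps the check deterministic in tests. As the commit message notes, the fudge is only reasonable if coverage spans and the release index are refreshed at least once a year.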