Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | more python normalizers, and move from importer common | Bryan Newbold | 2020-11-19 | 2 | -154/+326 |
| | | | | | | | | | | | | Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal. Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library). Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged. Current changes are backwards compatible via re-imports. | ||||
* | initial implementation of DOAJ importer | Bryan Newbold | 2020-11-19 | 2 | -0/+290 |
| | | | | Several things to finish implementing and polish. | ||||
* | html ingest: actual xhtml mimetype | Bryan Newbold | 2020-11-16 | 1 | -2/+2 |
| | |||||
* | ingest tool: support for setting ingest type | Bryan Newbold | 2020-11-06 | 1 | -6/+6 |
| | |||||
* | html ingest: remaining implementation | Bryan Newbold | 2020-11-06 | 1 | -22/+19 |
| | |||||
* | ingest: progress on HTML ingest | Bryan Newbold | 2020-11-05 | 1 | -14/+30 |
| | |||||
* | ingest: initial 'web' worker implementation | Bryan Newbold | 2020-11-05 | 2 | -67/+259 |
| | |||||
* | refactor: white/black -> allow/block | Bryan Newbold | 2020-11-05 | 1 | -4/+4 |
| | |||||
* | ingest: whitelist -> allowlist | Bryan Newbold | 2020-11-05 | 1 | -3/+3 |
| | |||||
* | ingest: basic checks for ingest_type | Bryan Newbold | 2020-11-05 | 1 | -3/+29 |
| | |||||
* | normalizer: filter out a specific non-ASCII character in DOI | Bryan Newbold | 2020-11-04 | 1 | -1/+3 |
| | |||||
* | entity updates: don't ingest JSTOR DOI prefixes | Bryan Newbold | 2020-10-23 | 1 | -0/+2 |
| | |||||
* | entity updater: new work update feed (ident and changelog metadata only) | Bryan Newbold | 2020-10-16 | 1 | -2/+24 |
| | |||||
* | chocula importer: small tweaks to update behavior | Bryan Newbold | 2020-10-08 | 1 | -8/+6 |
| | |||||
* | elastic transform: more preservation keepers | Bryan Newbold | 2020-10-08 | 1 | -1/+2 |
| | |||||
* | address spammy datacite titles | Martin Czygan | 2020-09-23 | 1 | -0/+19 |
| | | | | | | | | | seemingly from zenodo: * https://fatcat.wiki/release/rzcpjwukobd4pj36ipla22cnoi * https://doi.org/10.5281/zenodo.4041777 About 3400 records with "FULL MOVIE" in title, currently. | ||||
* | ingest: default to crawl protocols.io DOIs | Bryan Newbold | 2020-09-10 | 1 | -0/+2 |
| | |||||
* | datacite: handle case of empty-string version | Bryan Newbold | 2020-09-10 | 1 | -1/+1 |
| | | | | | Includes a tiny tweak to the datacite import sample file to test this code path. | ||||
* | remove spurious print statement | Bryan Newbold | 2020-09-03 | 1 | -1/+0 |
| | |||||
* | generic file entity clean-ups as part of file_meta importer | Bryan Newbold | 2020-09-02 | 2 | -0/+50 |
| | |||||
* | fix comment typo (thanks martin) | Bryan Newbold | 2020-08-27 | 1 | -1/+1 |
| | |||||
* | fixes and test coverage for file_meta importer | Bryan Newbold | 2020-08-21 | 1 | -5/+10 |
| | |||||
* | initial implementation of file_meta importer | Bryan Newbold | 2020-08-21 | 2 | -0/+71 |
| | |||||
* | entity updater: handle doi=None case better | Bryan Newbold | 2020-08-14 | 1 | -1/+1 |
| | |||||
* | entity updater: es['publisher_type'] not always set | Bryan Newbold | 2020-08-14 | 1 | -1/+1 |
| | | | | This is a small bugfix for a production issue. | ||||
* | Merge branch 'bnewbold-ingest-improvements' into 'master' | Martin Czygan | 2020-08-13 | 2 | -33/+114 |
|\ | | | | | | | | | ingest behavior changes; some datacite metadata tweaks. See merge request webgroup/fatcat!78 | ||||
| * | entity update: change big5 ingest behavior | Bryan Newbold | 2020-08-11 | 1 | -9/+15 |
| | | | | | | | | | | | | | | | | | | In addition to changing the OA default, this was the main intended behavior change in this group of commits: we want to make fewer ingest attempts that we *expect* to fail, but default to an ingest/crawl attempt if we are uncertain. This is because there is a long tail of journals that register DOIs and are de facto OA (fulltext is available), but we don't have metadata indicating them as such. | ||||
| * | entity update: default to ingest non-OA works | Bryan Newbold | 2020-08-11 | 1 | -9/+10 |
| | | |||||
| * | entity update: skip ingest of figshare+zenodo 'group' DOIs | Bryan Newbold | 2020-08-11 | 1 | -0/+15 |
| | | |||||
| * | datacite import: figshare-specific hacks | Bryan Newbold | 2020-08-11 | 1 | -3/+3 |
| | | |||||
| * | datacite import: refactor release_type detection into static method | Bryan Newbold | 2020-08-11 | 1 | -14/+51 |
| | | |||||
| * | datacite import: refactor publisher-specific hacks into static method | Bryan Newbold | 2020-08-11 | 1 | -15/+29 |
| | | | | | | | | Also tweak title/publisher detection to use DOI prefixes | ||||
| * | update crawl blocklist for SPNv2 requests which mostly fail | Bryan Newbold | 2020-08-10 | 1 | -2/+10 |
| | | |||||
* | | harvest: datacite API yields HTTP 200 with broken JSON | Martin Czygan | 2020-08-10 | 1 | -1/+8 |
|/ | | | | As a first step: log response body for debugging. | ||||
* | release ES transform tweaks | Bryan Newbold | 2020-08-07 | 1 | -3/+5 |
| | | | | | | | | * pass-through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata) * count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright" | ||||
* | chocula import update tweaks | Bryan Newbold | 2020-08-04 | 1 | -10/+14 |
| | |||||
* | more update keys and cases for chocula importer | Bryan Newbold | 2020-08-04 | 1 | -5/+11 |
| | |||||
* | fix key name mismatch in chocula importer | Bryan Newbold | 2020-08-04 | 1 | -1/+1 |
| | | | | chocula 'export-fatcat' uses 'ident', not 'fatcat_ident' | ||||
* | basic toml transform helper | Bryan Newbold | 2020-07-30 | 2 | -4/+20 |
| | |||||
* | Merge branch 'bnewbold-more-lint-fixes' into 'master' | Martin Czygan | 2020-07-24 | 6 | -25/+18 |
|\ | | | | | | | | | more lint fixes. See merge request webgroup/fatcat!69 | ||||
| * | fix issnl typo in pubmed | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | | | | | | | | | | | | | | | | | | | | Oh no! This bug may actually have had significant negative impact on metadata in fatcat, in terms of missing container_id associations with pubmed entities. There are about 500k release entities with a PMID but no container_id. Of those, 89k have at least a container_name. Unclear how many would have matched to ISSN-L and thus to a container. | ||||
| * | remove isascii() work around definition in importers/datacite.py | Bryan Newbold | 2020-07-23 | 1 | -7/+1 |
| | | | | | | | | We are on python3.7 now, so this isn't needed. | ||||
| * | simple lint (flake8) fixes over python codebase | Bryan Newbold | 2020-07-23 | 5 | -17/+16 |
| | | | | | | | | | | | | These should not have any behavior changes, though a number of exception catches are now more general, and there may be long-tail exceptions getting thrown in these statements. | ||||
* | | simplify in_kbart check statement | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | | | | | | | | Thanks @martin | ||||
* | | make in_kbart transform inclusive of last year | Bryan Newbold | 2020-07-23 | 1 | -0/+9 |
|/ | | | | | | | | | | | | | | | | | Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, eg CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (eg, they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over. This patch means that Portico/LOCKSS/etc coverage for "last year" will count as coverage of publications dated "this year". Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year. | ||||
* | Merge branch 'martin-datacite-duplicated-author-gh-59' into 'master' | bnewbold | 2020-07-11 | 1 | -6/+60 |
|\ | | | | | | | | | datacite: address duplicated contributor issue. See merge request webgroup/fatcat!65 | ||||
| * | datacite: resolve formatting issues in tests | Martin Czygan | 2020-07-10 | 33 | -133/+51 |
| |\ | |||||
| * | | datacite: there should be no index gaps | Martin Czygan | 2020-07-10 | 1 | -2/+8 |
| | | | |||||
| * | | datacite: document contributor types | Martin Czygan | 2020-07-10 | 1 | -0/+25 |
| | | | |||||
| * | | wip: contrib, GH59 | Martin Czygan | 2020-07-10 | 1 | -16/+22 |
| | | |
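
The "make in_kbart transform inclusive of last year" entry (2020-07-23) describes a small piece of date logic that may be easier to follow as code. The following is only a minimal sketch of that idea, not the code from the commit: the function name `in_kbart`, the `year_spans` list of `(start, end)` tuples, and the `this_year` parameter are all hypothetical names chosen for illustration. It assumes the fudge applies only to releases dated in the current year.

```python
import datetime
from typing import Iterable, Optional, Tuple


def in_kbart(
    release_year: int,
    year_spans: Iterable[Tuple[int, int]],
    this_year: Optional[int] = None,
) -> bool:
    """Return True if release_year falls inside any (start, end) coverage span.

    Hypothetical sketch: for releases dated in the current year, coverage of
    the previous year is also accepted, since coverage reports tend to lag.
    """
    if this_year is None:
        this_year = datetime.date.today().year

    # Years that count as a match for this release.
    check_years = [release_year]
    if release_year == this_year:
        # Assumption: only "this year" releases get the one-year fudge.
        check_years.append(this_year - 1)

    spans = list(year_spans)
    return any(start <= year <= end for year in check_years for (start, end) in spans)
```

With `this_year=2020`, a release dated 2020 and a span of `(1990, 2019)` would count as covered, while a span of `(1990, 2018)` would not. As the commit message notes, this only stays correct if coverage spans are refreshed and releases re-indexed at least once a year.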