| Commit message | Author | Age | Files | Lines |
|
|
|
| |
Additionally, try the unspecific (%Y) pattern last.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current version successfully imported a random sample of 100000 records
(0.5%) from datacite.
The --debug (write JSON to stdout) and --insert-log-file (log batch
before committing to db) flags are temporarily added to help with debugging.
Add a few unit tests.
Some edge cases:
a) Existing keys without a value require a slightly awkward idiom:
```
titles = attributes.get('titles', []) or []
```
b) There can be 0, 1, or more (first one wins) titles.
c) Date handling is probably not ideal. Datacite has a potentially
fine-grained list of dates.
The test case (tests/files/datacite_sample.jsonl) refers to
https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main
descriptor) 1986. The datacite record contains: 2017 (publicationYear,
probably the year of record creation with reference system), 1978-06-03
(collected, e.g. experimental sample), 1986 ("Accepted"). The online
version of the resource lists yet another date (2019-06-05 10:14:43 by
WIEWS update).
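A minimal sketch of edge cases a) and b), assuming each entry in `titles` is a dict with a `title` key. The helper name `first_title` is hypothetical and not taken from the importer:

```python
def first_title(attributes):
    """Return the first non-empty title, or None (hypothetical helper)."""
    # a) The key may exist with a None value, so the .get() default alone
    #    is not enough; `or []` covers that case.
    titles = attributes.get('titles', []) or []
    # b) Zero, one, or more titles: the first usable one wins.
    for entry in titles:
        title = (entry or {}).get('title')
        if title:
            return title
    return None
```

With this shape, a missing key, a key set to None, and an empty list all return None instead of raising.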
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Currently using two external libraries:
* dateparser
* langcodes
Note: This commit includes lots of WIP docs and field statistics in
comments, which should be removed.
|
|
|
|
|
|
| |
* contributors, title, date, publisher, container, license
Field and value analysis via https://github.com/miku/indigo.
|
| |
|
|
|
|
|
| |
The bracket syntax is inclusive. See also:
https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges
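Because both bounds are included with square brackets, a one-day window can end at the last millisecond of the day. A minimal sketch, assuming a date field named `updated` (the field name and the helper `day_range_query` are illustrative):

```python
import datetime

def day_range_query(day, field="updated"):
    """Build an inclusive query-string range covering one full day."""
    # Square brackets include both bounds, so the window ends at 23:59:59.999.
    start = day.strftime("%Y-%m-%dT00:00:00.000Z")
    end = day.strftime("%Y-%m-%dT23:59:59.999Z")
    return "{}:[{} TO {}]".format(field, start, end)

print(day_range_query(datetime.date(2019, 12, 1)))
```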
|
| |
|
|
|
|
|
|
|
|
|
|
| |
As a first iteration, just mark the daily batch complete and continue.
The occasional HTTP 400 issue has been reported as
https://github.com/datacite/datacite/issues/897.
A possible improvement would be to shrink the window, so losses will be
smaller.
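One way the smaller window could look, as a sketch only: split each day into equal sub-windows, so a failed request loses a fraction of a day rather than the whole batch. The helper `split_day` and the choice of four parts are assumptions, not part of this commit:

```python
import datetime

def split_day(day, parts=4):
    """Split one day into `parts` equal, non-overlapping windows."""
    start = datetime.datetime.combine(day, datetime.time.min)
    step = datetime.timedelta(days=1) / parts
    windows = []
    for i in range(parts):
        lo = start + i * step
        # Subtract 1 ms so adjacent windows do not overlap when the range
        # bounds are inclusive.
        hi = start + (i + 1) * step - datetime.timedelta(milliseconds=1)
        windows.append((lo, hi))
    return windows
```

Each window can then be harvested and marked complete on its own.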
|
| |
|
|
|
|
|
|
|
|
|
| |
Update parameters for datacite API v2. Works fine, but there are
occasional HTTP 400 responses when using the cursor API (daily updates
can exceed the 10000 record limit for search queries).
The HTTP 400 issue is not solved yet, but reported to datacite as
https://github.com/datacite/datacite/issues/897.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Replace em dash with regular dash.
Replace double slash after partner ID with single slash. This conversion
seems to be done by crossref automatically on lookup. I tried several
examples, using doi.org resolver and Crossref API lookup.
Note that there are a number of fatcat entities with '//' in the DOI.
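The two replacements can be sketched as follows. This assumes the "partner ID" is the DOI prefix (`10.` followed by digits); the function name `clean_doi` is chosen for illustration and may not match the actual code:

```python
import re

def clean_doi(raw):
    """Apply the two DOI cleanups described above (illustrative sketch)."""
    # Replace the em dash (U+2014) with a regular hyphen-minus.
    doi = raw.replace("\u2014", "-")
    # Collapse a double slash directly after the prefix into a single slash.
    doi = re.sub(r"^(10\.\d+)//", r"\1/", doi)
    return doi
```

A double slash later in the suffix is left alone, since only the slash pair after the prefix is targeted.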
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
Check was happening after the `return True` by mistake, allowing
duplicates in SPN editgroups, and potentially in ingest request
editgroups as well.
|
|
|
|
|
| |
During debugging, it can be helpful to keep stdout (e.g. processing
results) and diagnostic messages separate.
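A minimal sketch of the separation, with a hypothetical `emit` helper: results go to stdout as JSON, diagnostics go to stderr, so the results can be piped onward while messages stay on the terminal.

```python
import json
import sys

def emit(result, message):
    """Write the result to stdout and the diagnostic message to stderr."""
    print(json.dumps(result), file=sys.stdout)
    print(message, file=sys.stderr)

emit({"status": "ok"}, "processed 1 record")
```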
|
|\
| |
| |
| |
| | |
Update EntityImporter docstring.
See merge request webgroup/fatcat!9
|
| | |
|
| |
| |
| |
| | |
I believe the required method is `parse_record`, not `parse`.
|
| |
| |
| |
| |
| |
| |
| |
| | |
The common case is the same URL being submitted repeatedly during
testing.
This is only within-editgroup, and per importer (eg, won't work across
spn importer "submitted" editgroups), but is better than nothing.
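A rough sketch of within-editgroup de-duplication, assuming rows carry a `url` key. The class and attribute names are hypothetical and do not reflect the importer's real structure:

```python
class SketchImporter:
    """Illustrative only: skips URLs already seen in the current editgroup."""

    def __init__(self):
        self._seen_urls = set()

    def want(self, row):
        url = row.get("url")
        if not url:
            return False
        # The duplicate check must run before any `return True`.
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        return True

    def start_new_editgroup(self):
        # The scope is a single editgroup, so the set is reset here.
        self._seen_urls.clear()
```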
|
| |
| |
| |
| |
| | |
This is mostly changing ingest_type from 'file' to 'pdf', and adding
'link_source'/'link_source_id', plus some small cleanups.
|
| |
| |
| |
| | |
We really should just use file_meta result or nothing.
|
| |
| |
| |
| | |
Also fix a spurious typo.
|
| | |
|
| | |
|
| |
| |
| |
| | |
Based on ingest-file-results importer
|
| | |
|
|/
|
|
|
| |
For use with bots that don't have admin privileges, or where human
follow-up review is desired.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
Producer creation/configuration should happen at __init__() time,
not in the 'daily' call.
This specific refactor was motivated by mocking out the producer in unit
tests.
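A sketch of why this helps testing, under assumed names (`SketchHarvestWorker`, `producer_factory`, and `produce` are illustrative, not the real API): when the producer is created once in `__init__()`, a test can swap in a mock before `daily()` ever runs.

```python
from unittest import mock

class SketchHarvestWorker:
    """Illustrative worker: the producer is built once, at construction."""

    def __init__(self, producer_factory):
        # Creation/configuration happens here, not inside daily().
        self.producer = producer_factory()

    def daily(self, records):
        for record in records:
            self.producer.produce(record)

# In a unit test, the factory simply returns a mock.
fake_producer = mock.Mock()
worker = SketchHarvestWorker(producer_factory=lambda: fake_producer)
worker.daily(["a", "b"])
```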
|
|
|
|
|
| |
- allow overriding source filter whitelist (common case for CLI use)
- fix editgroup description env variable pass-through
|
|
|
|
|
|
|
|
| |
I thought this would filter for metadata updates to an existing DOI, but
actually "updates" are a type of DOI (eg, a retraction).
TODO: handle 'updates' field. Should both do a lookup and set work_ident
appropriately, and store in crossref-specific metadata.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Initially I was going to create a new worker to consume from the release
update channel, but couldn't get the edit context ("is this a new
release, or an update to an existing one") from that context.
Currently there is a flag in source code to control whether we only do
OA releases or all releases. Starting with OA only to start slow, but
should probably default to all, and make this a config flag. Should
probably also have a config flag to control this entire feature.
Tested locally in dev.
|