| Commit message | Author | Age | Files | Lines |
| |
Skip partial documents (e.g. abstracts), overly generic components and
entries, and HTML blog posts.
| |
According to release_rev.release_type, there are 29 values (28 named
types plus one empty/NULL row):
fatcat_prod=# select release_type, count(release_type) from release_rev group by release_type;
   release_type    |   count
-------------------+-----------
 abstract          |      2264
 article           |   6371076
 article-journal   | 101083841
 article-newspaper |     17062
 book              |   1676941
 chapter           |  13914854
 component         |     58990
 dataset           |   6860325
 editorial         |    133573
 entry             |   1628487
 graphic           |   1809471
 interview         |     19898
 legal_case        |      3581
 legislation       |      1626
 letter            |    275119
 paper-conference  |   6074669
 peer_review       |     30581
 post              |    245807
 post-weblog       |       135
 report            |   1010699
 retraction        |      1292
 review-book       |     96219
 software          |       316
 song              |     24027
 speech            |      4263
 standard          |    312364
 stub              |   1036813
 thesis            |    414397
                   |         0
(29 rows)
| |
These are journal/publisher patterns which we suspect to actually be OA
based on the large quantity of papers that crawl successfully. The
better long-term solution will be to flag containers in some way as OA
(or "should crawl"), but this is a good short-term solution.
| |
If release is a dataset or image, don't do a pdf ingest request.
If release is a datacite DOI, and release_type is a "document", crawl
regardless of is_oa detection. This is mostly to crawl repositories
(institutional or subject).
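
The rules above could be sketched roughly as follows. This is an
illustrative sketch, not the actual worker code: the function name
`want_pdf_ingest`, the dict-shaped release, the `is_datacite` and `is_oa`
keys, and the contents of both type sets are assumptions made for the
example.

```python
# Hypothetical sketch of the filter described in the commit message.
# All names and both type sets below are assumptions for illustration.

# "dataset or image": the release_type table lists "graphic" rather than
# "image", so both spellings are included here as a guess.
NON_PDF_TYPES = {"dataset", "image", "graphic"}

# Which release types count as a "document" is assumed, not confirmed.
DOCUMENT_TYPES = {
    "article-journal",
    "paper-conference",
    "report",
    "thesis",
    "book",
    "chapter",
}


def want_pdf_ingest(release: dict, ingest_oa_only: bool = True) -> bool:
    """Decide whether to emit a PDF ingest request for a release."""
    release_type = release.get("release_type")
    # Datasets and images are never requested as PDFs.
    if release_type in NON_PDF_TYPES:
        return False
    # Datacite "documents" are crawled regardless of OA detection.
    if release.get("is_datacite") and release_type in DOCUMENT_TYPES:
        return True
    # Otherwise fall back to the OA-only policy, if enabled.
    if ingest_oa_only and not release.get("is_oa"):
        return False
    return True
```

Checking the dataset/image rule first means a datacite dataset is still
skipped, which matches the order in which the two rules are stated.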
| |
Now that ingest is fixed
| |
From martin, thanks.
| |
`ingest_oa_only` behavior, and other filters, now handled in the entity
update worker, instead of in the transform function.
Also add a DOI prefix blocklist feature.
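
A DOI prefix blocklist check might look something like the sketch below.
The function name `doi_is_blocked` and the example prefixes are
hypothetical; the commit message does not say how the blocklist is stored
or matched.

```python
# Hypothetical sketch of a DOI prefix blocklist check; the function name
# and the prefixes used in examples are made up for illustration.
from typing import Iterable


def doi_is_blocked(doi: str, blocklist_prefixes: Iterable[str]) -> bool:
    """Return True if the DOI starts with any blocklisted prefix.

    DOIs are case-insensitive, so both sides are lowercased first.
    """
    normalized = doi.strip().lower()
    return any(normalized.startswith(p.lower()) for p in blocklist_prefixes)
```

Including the trailing slash in each prefix (e.g. `10.1234/`) avoids
accidentally matching a longer registrant prefix such as `10.12345/`.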
| |
This is mostly changing ingest_type from 'file' to 'pdf', and adding
'link_source'/'link_source_id', plus some small cleanups.
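
For orientation, an ingest request after this change might be shaped
roughly like the dict built below. Only `ingest_type`, `link_source`, and
`link_source_id` come from the commit message; the helper name
`release_ingest_request`, the `base_url` and `release_ident` keys, and the
use of `"doi"` as a link source value are assumptions.

```python
# Hypothetical sketch of building an ingest request. Only the keys
# ingest_type, link_source, and link_source_id are named in the commit
# message; everything else here is assumed for illustration.


def release_ingest_request(release_ident: str, doi: str, ingest_type: str = "pdf") -> dict:
    """Build a minimal ingest request for a release that has a DOI."""
    return {
        "ingest_type": ingest_type,          # previously 'file', now 'pdf'
        "base_url": f"https://doi.org/{doi}",  # assumed key: URL to start crawling from
        "link_source": "doi",                # assumed value: where the URL came from
        "link_source_id": doi,               # identifier within that source
        "release_ident": release_ident,      # assumed key
    }
```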
| |
Initially was going to create a new worker to consume from the release
update channel, but couldn't get the edit context ("is this a new
release, or an update to an existing one?") from that channel.
Currently there is a flag in the source code to control whether we only
ingest OA releases or all releases. Starting with OA only to ramp up
slowly, but should probably default to all, and make this a config flag.
Should probably also have a config flag to control this entire feature.
Tested locally in dev.
| |
- decrease default changelog pipeline to 5.0sec
- fix missing KafkaException harvester imports
- more confluent-kafka tweaks
- updates to kafka consumer configs
- bump elastic updates consumergroup (again)
| |
The previous group seems to have gotten corrupted; my hypothesis is that
this is due to pykafka being somewhat flaky, and I am planning to move to
librdkafka anyway. Re-indexing all the containers is pretty small/easy,
so starting a new consumer group works fine in this case; the release
indexer would be a bigger problem.
| |
Matching produce sizes. May want to tweak this config in the future for
throughput.
| |
A bunch of remaining TODOs here
| |
Forgot that this worker really doesn't want/need any API connection at
all; just an ApiClient to deserialize objects from Kafka.
| | |
Fixed a conflict in:
python/fatcat_export.py
| |
- Add __init__.py files for fatcat_tools submodules, and use them in
imports
- Add a bunch of comments to files.
- Rename a number of classes and functions to be less verbose
| |
This is the classic/correct way to do consumer group updates for higher
throughput, when "at least once" semantics are acceptable (as they are
here; double processing should be safe/fine).
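
The commit subject is not shown here, so the exact change is not known.
One common reading of "at least once" with higher throughput is to commit
offsets once per processed batch rather than once per message. The sketch
below illustrates that idea with an in-memory stand-in for a Kafka
partition; `FakeLog` and `run_consumer` are hypothetical names and no real
Kafka client is involved.

```python
# Hypothetical in-memory illustration of at-least-once batch consumption.
# FakeLog stands in for a single Kafka partition plus its committed
# offset; it is not part of any real client library.


class FakeLog:
    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset a restarted consumer resumes from

    def poll_batch(self, start, max_messages):
        return self.messages[start:start + max_messages]


def run_consumer(log, handle, batch_size=3, crash_after_batches=None):
    """Handle messages in batches, committing only after each full batch.

    If the consumer dies after handling a batch but before committing,
    that batch is handled again on restart: duplicates, but no losses.
    """
    position = log.committed
    batches_done = 0
    while True:
        batch = log.poll_batch(position, batch_size)
        if not batch:
            return
        for msg in batch:
            handle(msg)
        if crash_after_batches is not None and batches_done == crash_after_batches:
            raise RuntimeError("simulated crash before commit")
        position += len(batch)
        log.committed = position  # one commit per batch, not per message
        batches_done += 1
```

Because a crash can replay the last uncommitted batch, this pattern is
only appropriate when the handler is safe to run twice on the same
message, as the commit message says is the case here.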