Commit log:
- When webcapture or fileset entities are updated, the release entities
  associated with them also need to be updated (and work entities,
  recursively). TODO: handle the case where a release_id is *removed* as
  well as *added*, and reprocess those releases as well.
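This propagation can be sketched roughly as follows. This is an illustrative simplification, not the actual fatcat worker code: the entity classes, the `reindex` callback, and the field names are all assumptions for the sketch.

```python
# Hypothetical sketch of the propagation described above: when a webcapture or
# fileset entity is updated, every release pointing at it gets re-processed,
# and (recursively) the work each release belongs to. Names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Release:
    ident: str
    work_id: Optional[str] = None

@dataclass
class Fileset:
    ident: str
    release_ids: list = field(default_factory=list)

def propagate_update(entity, releases_by_id, reindex):
    """Re-process releases linked from `entity`, then their works (deduped)."""
    touched_works = set()
    for release_id in entity.release_ids:
        release = releases_by_id[release_id]
        reindex("release", release.ident)
        if release.work_id and release.work_id not in touched_works:
            touched_works.add(release.work_id)
            reindex("work", release.work_id)
    return touched_works
```

Note that two releases sharing a work only trigger one work re-index, which matters when a fileset points at many releases of the same work.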
- This is a small bugfix for a production issue.
- In addition to changing the OA default, this was the main intended
  behavior change in this group of commits: we want to attempt ingest for
  fewer releases that we *expect* to fail, but default to an ingest/crawl
  attempt when we are uncertain. This is because there is a long tail of
  journals that register DOIs and are de facto OA (fulltext is available),
  but we don't have metadata indicating them as such.
- The API fetch update may be needed for old changelog entries in the kafka
  feed.
- Derive changelog worker from release worker. See merge request
  webgroup/fatcat!43.
- Early versions of changelog entries may not have all the fields required
  for the current transform.
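A defensive pattern for tolerating such old entries is to use `.get()` with fallbacks instead of direct key access. This is only a sketch; the field names below are hypothetical, not the actual fatcat changelog schema.

```python
# Sketch: transform that tolerates old changelog entries missing newer fields.
# Field names are hypothetical, not the actual fatcat changelog schema.
def transform_changelog(entry: dict) -> dict:
    editgroup = entry.get("editgroup") or {}
    return {
        "index": entry["index"],                       # assumed always present
        "timestamp": entry.get("timestamp"),           # may be absent in old entries
        "editgroup_id": editgroup.get("editgroup_id"),
        "agent": (editgroup.get("extra") or {}).get("agent"),
    }
```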
- No partial docs (e.g. abstract-only), no overly generic components and
  entries, and no HTML blogs.
- According to release_rev.release_type, we have 29 values:
    fatcat_prod=# select release_type, count(release_type) from release_rev group by release_type;
       release_type    |   count
    -------------------+-----------
     abstract          |      2264
     article           |   6371076
     article-journal   | 101083841
     article-newspaper |     17062
     book              |   1676941
     chapter           |  13914854
     component         |     58990
     dataset           |   6860325
     editorial         |    133573
     entry             |   1628487
     graphic           |   1809471
     interview         |     19898
     legal_case        |      3581
     legislation       |      1626
     letter            |    275119
     paper-conference  |   6074669
     peer_review       |     30581
     post              |    245807
     post-weblog       |       135
     report            |   1010699
     retraction        |      1292
     review-book       |     96219
     software          |       316
     song              |     24027
     speech            |      4263
     standard          |    312364
     stub              |   1036813
     thesis            |    414397
                       |         0
    (29 rows)
- These are journal/publisher patterns which we suspect are actually OA,
  based on the large number of papers that crawl successfully. The better
  long-term solution will be to flag containers in some way as OA (or
  "should crawl"), but this is a good short-term fix.
- If the release is a dataset or image, don't make a pdf ingest request. If
  the release has a Datacite DOI and its release_type is a "document" type,
  crawl regardless of is_oa detection. This is mostly to crawl repositories
  (institutional or subject).
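The filtering rules described in the entries above could be sketched roughly like this. This is a simplification under stated assumptions: the type sets, field names, and the way Datacite provenance is detected are illustrative, and the real entity-update worker logic may differ.

```python
# Rough sketch of the ingest-request filtering described above; the actual
# fatcat worker logic and field names may differ.
DOCUMENT_TYPES = {"article-journal", "paper-conference", "report", "thesis",
                  "chapter", "book"}
SKIP_TYPES = {"dataset", "graphic", "software", "component"}

def want_pdf_ingest(release: dict) -> bool:
    release_type = release.get("release_type")
    if release_type in SKIP_TYPES:
        # datasets, images, etc. never get a pdf ingest request
        return False
    is_datacite = release.get("extra", {}).get("datacite") is not None
    if is_datacite and release_type in DOCUMENT_TYPES:
        # crawl repository content regardless of is_oa detection
        return True
    # default to attempting a crawl when OA status is uncertain (is_oa None)
    return release.get("is_oa") is not False
```

The last line encodes the "default to crawl if uncertain" behavior: only an explicit `is_oa: False` suppresses the attempt.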
- Now that ingest is fixed.
- From Martin, thanks.
- `ingest_oa_only` behavior, and other filters, are now handled in the
  entity update worker instead of in the transform function. Also adds a DOI
  prefix blocklist feature.
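A DOI prefix blocklist is a simple check against the registrant prefix (the part before the first slash). A minimal sketch; the prefix values below are placeholders, not the project's actual blocklist:

```python
# Minimal sketch of a DOI prefix blocklist check; the prefixes below are
# placeholders, not the project's actual blocklist.
DOI_PREFIX_BLOCKLIST = {"10.1234", "10.5555"}

def doi_blocked(doi: str) -> bool:
    # DOI names are "<prefix>/<suffix>"; block on the registrant prefix
    prefix = doi.split("/", 1)[0]
    return prefix in DOI_PREFIX_BLOCKLIST
```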
- This is mostly changing ingest_type from 'file' to 'pdf' and adding
  'link_source'/'link_source_id', plus some small cleanups.
- Initially I was going to create a new worker to consume from the release
  update channel, but couldn't get the edit context ("is this a new release,
  or an update to an existing one?") from that channel. Currently there is a
  flag in the source code to control whether we process only OA releases or
  all releases. Starting with OA-only to start slow, but it should probably
  default to all, and this should be a config flag; there should probably
  also be a config flag to control the entire feature. Tested locally in dev.
- Kafka/pipeline fixes:
    - decrease default changelog pipeline to 5.0sec
    - fix missing KafkaException harvester imports
    - more confluent-kafka tweaks
    - updates to kafka consumer configs
    - bump elastic updates consumer group (again)
- The previous consumer group seems to have gotten corrupted; my hypothesis
  is that this is due to pykafka being somewhat flaky, and I am planning to
  move to librdkafka anyway. Re-indexing all the containers is pretty
  small/easy, so starting a new consumer group works fine in this case; the
  release indexer would be a bigger problem.
- Matching produce sizes; we may want to tweak this config in the future for
  throughput.
- A bunch of remaining TODOs here.
- Forgot that this worker really doesn't want/need any API connection at
  all; it just needs an ApiClient to deserialize objects from Kafka.
- Fixed a conflict in: python/fatcat_export.py