path: root/python/fatcat_tools/transforms
Commit message  Author  Age  Files  Lines
* ES schema: add best_url to file schema  Bryan Newbold  2020-06-04  1  -0/+12
| This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it.
| Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile.
* improve citeproc/CSL web interface  Bryan Newbold  2020-03-25  1  -6/+12
| This tries to show the citeproc (bibtex, MLA, CSL-JSON) options for more releases, and not show the links when they would break.
| The primary motivation here is to work around two exceptions being thrown in prod every day (according to sentry):
|     KeyError: 'role'
|     ValueError: CLS requries some surname (family name)
| I'm guessing these are mostly coming from crawlers following the citeproc links on release landing pages.
* Merge branch 'bnewbold-elastic-v03b'  Bryan Newbold  2020-02-26  2  -46/+198
|\
| * improve is_oa flag accuracy  Bryan Newbold  2020-02-26  1  -8/+4
| | Particularly, the ezb=green match seems mostly incorrect.
| | Note that pmcid being assigned could still be in an embargo window?
| * ES container last tweaks  Bryan Newbold  2020-02-26  1  -0/+3
| |
| * ES release: last minor tweaks  Bryan Newbold  2020-02-26  1  -2/+2
| |
| * ES files: don't remove archive.org domains/hosts  Bryan Newbold  2020-02-07  1  -5/+0
| |
| * ES releases: host/domain fixes  Bryan Newbold  2020-01-31  1  -2/+2
| |
| * fix release es transform missing 'issue'  Bryan Newbold  2020-01-30  1  -0/+1
| |
| * add upper-case work-around from kibana map join  Bryan Newbold  2020-01-30  1  -0/+1
| |
| * tweak file ES archive.org domain tracking  Bryan Newbold  2020-01-30  1  -0/+6
| |
| * implement host+domain parsing for file ES transform  Bryan Newbold  2020-01-30  1  -9/+5
| |
| * fix ES file schema plural field names  Bryan Newbold  2020-01-29  1  -4/+3
| |
| * elastic schema fixes  Bryan Newbold  2020-01-29  1  -0/+5
| |
| * add country to v03b release schema  Bryan Newbold  2020-01-29  1  -0/+2
| |
| * actually implement changelog transform  Bryan Newbold  2020-01-29  1  -17/+45
| |
| * fix some transform bugs, add some tests  Bryan Newbold  2020-01-29  1  -6/+8
| |
| * ES release schema updates  Bryan Newbold  2020-01-29  1  -5/+76
| |
| * container ES schema changes  Bryan Newbold  2020-01-29  1  -16/+18
| |
| * first implementation of ES file schema  Bryan Newbold  2020-01-29  2  -1/+46
| | Includes a trivial test and transform, but not any workers or doc updates.
* | default to PMC ingest URLs over DOI  Bryan Newbold  2020-02-04  1  -4/+4
|/  For cases where there might be both PMC and DOI URLs, prefer the europepmc.org PMC ones over the DOI option. May want to turn this into a config or command-line option in the future.
* remove 'oa_only' feature from ingest transform  Bryan Newbold  2020-01-28  1  -14/+1
| Refactoring to move this filter elsewhere
* transform ingests via pmc/pmcid, not pubmed/pmid  Bryan Newbold  2019-12-24  1  -4/+4
|
* update ingest request schema  Bryan Newbold  2019-12-13  1  -5/+22
| This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* tweaks to ingest-file transform  Bryan Newbold  2019-12-12  1  -13/+7
|
* project -> ingest_request_source  Bryan Newbold  2019-11-15  1  -2/+2
|
* fix release.pmcid typo  Bryan Newbold  2019-11-15  1  -2/+2
|
* more ingest importer comments and counts  Bryan Newbold  2019-11-15  1  -1/+1
|
* add ingest request transform (and test)  Bryan Newbold  2019-11-15  2  -0/+67
|
* dict wrapper for entity_from_json()  Bryan Newbold  2019-10-08  2  -3/+7
|
* refactor all python source for client lib name  Bryan Newbold  2019-09-05  3  -3/+3
|
* comment clarifying container.ident in ES release transform  Bryan Newbold  2019-09-03  1  -0/+2
|
* fix previous fix (need tests)  Bryan Newbold  2019-09-03  1  -2/+2
|
* fix typo bug in container ES transform  Bryan Newbold  2019-09-03  1  -2/+2
|
* use EZB and szczepanski as OA signals (ES)  Bryan Newbold  2019-09-03  1  -0/+12
|
* elasticsearch transform: fix url.url bug  Bryan Newbold  2019-05-24  1  -11/+11
|
* add 'superceded' release extra flag to elastic schema  Bryan Newbold  2019-05-23  1  -0/+1
|
* also track work_id in release elasticsearch table  Bryan Newbold  2019-05-22  1  -0/+1
|
* count linked refs (not just raw refs) in elasticsearch  Bryan Newbold  2019-05-22  1  -0/+3
|
* include creator_ids in release elastic schema  Bryan Newbold  2019-05-20  1  -0/+6
| Intent is to allow fast creator search/lookup
* elastic release schema update  Bryan Newbold  2019-05-20  1  -2/+5
|
* improved CSL transform (structured author names)  Bryan Newbold  2019-05-20  1  -12/+11
|
* make some XXX into TODO  Bryan Newbold  2019-05-20  1  -2/+2
|
* fix elastic file pdf check  Bryan Newbold  2019-05-16  1  -1/+3
|
* elastic transforms: work around missing pdf mimetypes  Bryan Newbold  2019-05-15  1  -1/+1
|
* partial python impl of ext_id and release_stage refactors  Bryan Newbold  2019-05-13  2  -14/+15
|
* handle null abstracts for release  Bryan Newbold  2019-05-07  1  -1/+1
|
* improve test coverage  Bryan Newbold  2019-04-04  1  -0/+1
|
* expose bibtex and citeproc; revert /unstable/ prefixes  Bryan Newbold  2019-03-18  1  -1/+1
|
* refactor and test citeproc code  Bryan Newbold  2019-03-18  2  -3/+55
|
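
Two of the commits above describe the ingest request transform in enough detail to sketch: "update ingest request schema" names the fields ingest_type ('pdf'), link_source and link_source_id, and "default to PMC ingest URLs over DOI" says a PMC URL is preferred when both identifiers exist. The following is only a rough illustration of that preference, not the code in this directory. The function name pick_ingest_request, the base_url key, and the PMC URL pattern are all assumptions made for the example.

```python
from typing import Optional


def pick_ingest_request(
    doi: Optional[str] = None, pmcid: Optional[str] = None
) -> Optional[dict]:
    """Build a minimal ingest-request dict, preferring PMC over DOI.

    Hypothetical helper: only 'ingest_type', 'link_source' and
    'link_source_id' come from the commit messages; 'base_url' and the
    PMC URL pattern below are assumptions for illustration.
    """
    if pmcid:
        # PMC wins when both identifiers are present
        return {
            "ingest_type": "pdf",
            "base_url": f"https://europepmc.org/articles/{pmcid}",  # assumed pattern
            "link_source": "pmc",
            "link_source_id": pmcid,
        }
    if doi:
        # Fall back to the DOI resolver URL
        return {
            "ingest_type": "pdf",
            "base_url": f"https://doi.org/{doi}",
            "link_source": "doi",
            "link_source_id": doi,
        }
    # Nothing to ingest from
    return None
```

As the commit message itself suggests, the fixed PMC-first ordering could later become a config or command-line option; in this sketch that would amount to passing the preferred source as a parameter.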