path: root/fatcat_scholar/sandcrawler.py
Commit message | Author | Age | Files | Lines
* add requests session around postgrest fetches | Bryan Newbold | 2021-12-07 | 1 | -5/+32
    This is expected to drastically improve throughput of intermediate bundle generation, and reduce load on postgrest itself.
* fetch GROBID-parsed refs along with crossref metadata | Bryan Newbold | 2021-12-06 | 1 | -2/+4
* Revert "pull GROBID refs along with crossref records into bundles" | Bryan Newbold | 2021-11-10 | 1 | -3/+1
    This reverts commit c164970449a392b5165d903d213c2bb51f2a187f. Didn't mean to merge this to master just yet.
* pull GROBID refs along with crossref records into bundles | Bryan Newbold | 2021-11-10 | 1 | -1/+3
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -3/+14
* re-style imports (isort) on all core python files | Bryan Newbold | 2021-10-27 | 1 | -1/+2
* lint fixes, and run fmt | Bryan Newbold | 2021-06-02 | 1 | -3/+1
* add 'crossref' hydration to work pipeline | Bryan Newbold | 2021-06-02 | 1 | -0/+11
    The immediate motivation is to include recent crossref refs in citation graph transforms. May also be valuable for researchers to have authoritative/publisher metadata in the bundle dumps.
* Modernize Python syntax with pyupgrade --py38-plus **/*.py | Christian Clauss | 2021-02-23 | 1 | -1/+1
* add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -0/+11
* make fmt | Bryan Newbold | 2020-06-29 | 1 | -1/+3
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -0/+9
    This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs.
* fmt | Bryan Newbold | 2020-06-04 | 1 | -1/+8
* more type annotations and fixes | Bryan Newbold | 2020-06-04 | 1 | -2/+2
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -1/+0
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -21/+14
* WIP on release-to-sim fetching | Bryan Newbold | 2020-05-19 | 1 | -0/+75
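
The 2021-12-07 commit in this log wraps postgrest fetches in a requests session. The general idea can be sketched as below. This is not the code from sandcrawler.py: the names make_session, PostgrestClient and get_row, the retry settings, and the table and column names in the usage note are all assumptions made for illustration. The sketch relies only on requests.Session, HTTPAdapter, urllib3's Retry, and PostgREST's `column=eq.value` filter syntax.

```python
from typing import Any, Dict, Optional


def make_session(retries: int = 3) -> Any:
    """Build a requests.Session that reuses connections and retries on 5xx.

    Hypothetical helper; the retry settings are illustrative defaults.
    """
    # Imported here so PostgrestClient can be exercised with a stub session.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1.0,
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


class PostgrestClient:
    """Hypothetical client that sends every fetch through one shared session."""

    def __init__(self, api_url: str, session: Any = None) -> None:
        self.api_url = api_url.rstrip("/")
        # One session per client, so repeated fetches can reuse pooled
        # connections instead of opening a new one for each request.
        self.session = session if session is not None else make_session()

    def get_row(self, table: str, column: str, value: str) -> Optional[Dict[str, Any]]:
        """Fetch the first row where `column` equals `value`, or None."""
        resp = self.session.get(
            f"{self.api_url}/{table}",
            params={column: f"eq.{value}"},  # PostgREST equality filter
            timeout=10.0,
        )
        if resp.status_code == 404:
            return None
        resp.raise_for_status()
        rows = resp.json()  # PostgREST returns a JSON array of rows
        return rows[0] if rows else None
```

A caller would construct one client and reuse it for every lookup, for example `client = PostgrestClient("http://localhost:3030")` followed by repeated `client.get_row("grobid", "sha1hex", some_hash)` calls. The URL, table, and column here are placeholders. Because the session keeps connections open between requests, a batch of lookups avoids repeated connection setup, which is the kind of saving the commit message anticipates.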