path: root/chocula.py
Commit message                                            Author         Age         Files  Lines
* start refactoring files into module                     Bryan Newbold  2020-05-06  1      -1469/+0
* update to new(er) ISSN-L mapping file                   Bryan Newbold  2020-05-01  1      -1/+1
* update URL crawl status snapshot                        Bryan Newbold  2019-12-26  1      -1/+1
* add stats and URL crawl status files                    Bryan Newbold  2019-12-24  1      -2/+3
* update chocula usage of argparse                        Bryan Newbold  2019-12-23  1      -14/+22
* update norwegian CSV importer schema                    Bryan Newbold  2019-12-23  1      -2/+4
* update chocula input data files                         Bryan Newbold  2019-12-23  1      -10/+10
    Including updating fetch script, README links, and chocula.py path references.
* use newer fatcat contianer dump                         Bryan Newbold  2019-09-06  1      -1/+1
* filter out bad ISSN{e,p}                                Bryan Newbold  2019-09-06  1      -0/+5
    Unfortunately a few hundred of these got pushed into fatcat already; will probably fix with a new fixer bot tool.
* last name/publisher cleanups                            Bryan Newbold  2019-09-03  1      -2/+6
* don't include doaj.org or NCBI homepage URLs            Bryan Newbold  2019-09-03  1      -0/+4
* improve fatcat_export metadata quality                  Bryan Newbold  2019-09-03  1      -3/+12
* fix SZCEPANSKI typo                                     Bryan Newbold  2019-09-03  1      -2/+2
* improve export_fatcat                                   Bryan Newbold  2019-08-28  1      -5/+22
* only fatcat_export 'valid' (syntax) ISSN-Ls             Bryan Newbold  2019-08-27  1      -1/+1
* include Szczepanski in everything command (oops)        Bryan Newbold  2019-08-27  1      -0/+1
* updated crossref title file; ISSN-L file link           Bryan Newbold  2019-08-27  1      -1/+1
* update IA_CRAWL_FILE                                    Bryan Newbold  2019-07-31  1      -1/+1
* webarchive_urls separate from regular URLs              Bryan Newbold  2019-07-31  1      -1/+21
* add 'export_fatcat'                                     Bryan Newbold  2019-07-31  1      -1/+51
* handle 'ttp://' URL prefix corner case                  Bryan Newbold  2019-07-31  1      -0/+2
* iterate on homepage url import/stats                    Bryan Newbold  2019-07-31  1      -18/+40
* chocula: sherpa_color in summary; cleanups              Bryan Newbold  2019-07-30  1      -5/+9
* chocula: openapc                                        Bryan Newbold  2019-07-30  1      -1/+31
* chocula: json export                                    Bryan Newbold  2019-07-30  1      -0/+17
* chocula: fix wikidata_qid inclusion                     Bryan Newbold  2019-07-30  1      -2/+3
* chocula: fix wikidata_qid inclusion                     Bryan Newbold  2019-07-30  1      -0/+2
* chocula: better ISSN-L handling                         Bryan Newbold  2019-07-30  1      -11/+16
* chocula: updated fetches, new ISSN-L and DOAJ files     Bryan Newbold  2019-07-30  1      -3/+3
* chocula: wikidata indexing                              Bryan Newbold  2019-07-30  1      -4/+48
* chocula: crude publisher type bucketing; field cleanup  Bryan Newbold  2019-07-30  1      -20/+164
* shorter/simpler table names                             Bryan Newbold  2019-07-26  1      -7/+15
* chocula: more host/domain fixes                         Bryan Newbold  2019-07-26  1      -3/+8
* GOLD OA parsing                                         Bryan Newbold  2019-07-26  1      -40/+54
* chocula: fix domain parsing                             Bryan Newbold  2019-07-26  1      -10/+47
* more chocula progress                                   Bryan Newbold  2019-07-14  1      -57/+171
* EZB and szczepanski indexers                            Bryan Newbold  2019-07-11  1      -45/+146
* chocula early work                                      Bryan Newbold  2019-07-10  1      -0/+798
    (non-functional)