path: root/skate/reduce.go
Commit message | Author | Date | Files | Lines
* reduce: use mockable time | Martin Czygan | 2021-11-23 | 1 | -6/+7
    While basically the same, we save a bit with a default mock and prepare a bit better for some future encapsulation.
* rename module to gitlab.com/internetarchive/refcat | Martin Czygan | 2021-10-20 | 1 | -3/+3
    This changes all the import paths to the current canonical location on http://gitlab.com/internetarchive/refcat.
* misc: fix and improve comments | Martin Czygan | 2021-09-23 | 1 | -0/+15
* reduce: remove log statements | Martin Czygan | 2021-07-28 | 1 | -4/+0
* leave ref.index unchanged | Martin Czygan | 2021-07-28 | 1 | -6/+6
    previously, we started with 0-indexed input but wanted 1-indexed values, so we added increments at various points, which probably led to a bug (missing refs), since at one point we would fuse the original ref data (w/o increments) with the matched data (w/ increments); with scholar:528804ad2e55983cf3e5e6659d8f46db0cab02b7 we can now leave indices as is
* reduce: add case | Martin Czygan | 2021-07-28 | 1 | -0/+1
* reduce: add more logging, temporarily | Martin Czygan | 2021-07-27 | 1 | -1/+6
* update docs | Martin Czygan | 2021-07-27 | 1 | -4/+1
* reuse timestamps | Martin Czygan | 2021-07-27 | 1 | -6/+14
    while time.Now is not really slow, thanks to vDSO (cf. https://git.io/J4SOH), it will be even faster to just call it once at the start of the processing; also: https://twitter.com/davidcrawshaw/status/1414243408936280073
    > Turns out time.Now was taking its usual amount of time on linux, ~50 nanoseconds [...]
* reduce: explicitly name magic numbers | Martin Czygan | 2021-07-27 | 1 | -3/+8
* reduce: use pascal case | Martin Czygan | 2021-07-26 | 1 | -2/+2
* reduce: mention upcoming change to indexing | Martin Czygan | 2021-07-26 | 1 | -1/+1
    see: scholar:528804ad2e55983cf3e5e6659d8f46db0cab02b7
* skate: pass-through match_provenance in more situations | Bryan Newbold | 2021-07-25 | 1 | -0/+2
* schema: switch from '.name' to '.raw_name' for un-parsed CSL name field | Bryan Newbold | 2021-07-25 | 1 | -2/+2
* skate: use date-parts for year, not 'raw' | Bryan Newbold | 2021-07-25 | 1 | -6/+7
* schema: have issued+accessed (CSLDate) actually omitempty | Bryan Newbold | 2021-07-24 | 1 | -1/+1
    Similar to TargetCSL, these should be pointer types so they don't get encoded as empty objects when not set.
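The reason pointer types matter here is that encoding/json's `omitempty` never omits a struct value, only a nil pointer. A minimal sketch, using hypothetical `date`, `byValue` and `byPointer` types rather than the real schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// date is a hypothetical stand-in for a CSL date type.
type date struct {
	Raw string `json:"raw,omitempty"`
}

// byValue embeds the date as a struct value; omitempty has no effect on it.
type byValue struct {
	Title  string `json:"title"`
	Issued date   `json:"issued,omitempty"`
}

// byPointer uses a pointer; a nil pointer is dropped by omitempty.
type byPointer struct {
	Title  string `json:"title"`
	Issued *date  `json:"issued,omitempty"`
}

// encode marshals v to a JSON string, or returns the error text.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(byValue{Title: "t"}))   // {"title":"t","issued":{}}
	fmt.Println(encode(byPointer{Title: "t"})) // {"title":"t"}
}
```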
* xio: improve naming | Martin Czygan | 2021-07-21 | 1 | -7/+7
* reduce: use fixed length sha1 for url id part | Martin Czygan | 2021-07-20 | 1 | -3/+5
    base32 would occasionally exceed the elasticsearch id field limit ("must be no longer than 512 bytes but was: 649")
* reduce: fix wb id | Martin Czygan | 2021-07-20 | 1 | -1/+1
* reduce: a preliminary id for wb links | Martin Czygan | 2021-07-20 | 1 | -0/+5
* reduce: temp fix 0 source release year | Martin Czygan | 2021-07-19 | 1 | -1/+4
* add ZippyWayback reducer | Martin Czygan | 2021-07-15 | 1 | -1/+59
* update docs | Martin Czygan | 2021-07-14 | 1 | -8/+7
* reduce: add test | Martin Czygan | 2021-07-14 | 1 | -18/+21
* reduce: add todo | Martin Czygan | 2021-07-14 | 1 | -0/+2
* reduce: add csl field | Martin Czygan | 2021-07-14 | 1 | -3/+32
* reduce: fix off-by-one error | Martin Czygan | 2021-07-14 | 1 | -1/+1
    duplication detection required a +1 on the index in the ref document
* reduce: temp bug fix for line cutter | Martin Czygan | 2021-07-13 | 1 | -1/+5
    we wanted to trim whitespace at one point, because values contained separator characters; however, this breaks with empty values; move back to not trimming values, except for the newline when requesting the last value; moving forward, we need to clean or reject dirty values or use a different delimiter
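One way trimming can break on empty values, sketched with tab-separated input. The functions `cut` and `cutTrimmed` are hypothetical and only approximate what a line cutter might do; the exact failure in the original code is not visible in this log.

```go
package main

import (
	"fmt"
	"strings"
)

// cut returns the 1-indexed column of a separated line, stripping only the
// trailing newline so that empty leading or trailing values keep their place.
func cut(line, sep string, column int) string {
	parts := strings.Split(strings.TrimRight(line, "\r\n"), sep)
	if column < 1 || column > len(parts) {
		return ""
	}
	return parts[column-1]
}

// cutTrimmed trims all surrounding whitespace first. With a whitespace
// separator such as tab, an empty first or last value disappears and the
// remaining columns shift.
func cutTrimmed(line, sep string, column int) string {
	return cut(strings.TrimSpace(line), sep, column)
}

func main() {
	line := "\tb\tc\n" // first value is empty
	fmt.Printf("cut:        %q\n", cut(line, "\t", 2))
	fmt.Printf("cutTrimmed: %q\n", cutTrimmed(line, "\t", 2))
}
```

Here the untrimmed variant returns "b" for column 2, while the trimmed variant returns "c", because the leading tab (and with it the empty first column) was removed.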
* reduce: small tweaks | Martin Czygan | 2021-07-13 | 1 | -3/+4
* wip: csl logging | Martin Czygan | 2021-07-13 | 1 | -1/+1
* update docs | Martin Czygan | 2021-07-13 | 1 | -1/+7
* reduce/schema: add csl | Martin Czygan | 2021-07-13 | 1 | -1/+7
* wiki: include lang in encoded page title | Martin Czygan | 2021-07-13 | 1 | -7/+12
* reduce: add todo | Martin Czygan | 2021-07-13 | 1 | -1/+3
* mock out time.Now for tests | Martin Czygan | 2021-07-13 | 1 | -3/+6
* reduce: log broken line only | Martin Czygan | 2021-07-10 | 1 | -1/+1
* reduce: add key and indexed ts for exact matches | Martin Czygan | 2021-07-10 | 1 | -0/+2
* reduce: ol, fuzzy, w/ unstructured | Martin Czygan | 2021-07-10 | 1 | -1/+1
* release to unstructured stub | Martin Czygan | 2021-07-10 | 1 | -2/+2
* reduce: open library id tweaks | Martin Czygan | 2021-07-10 | 1 | -5/+27
* reduce: tweak wiki bref | Martin Czygan | 2021-07-10 | 1 | -4/+5
* reduce: filter out duplicate wiki links | Martin Czygan | 2021-07-10 | 1 | -0/+8
* wiki: use lowercase base32 of page title | Martin Czygan | 2021-07-09 | 1 | -2/+3
    * mostly case insensitive, same case as ident
* reduce: use a base64 encoded title as key | Martin Czygan | 2021-07-09 | 1 | -1/+7
* reduce: wiki doc in column 3 | Martin Czygan | 2021-07-09 | 1 | -1/+1
* reduce: move batch size | Martin Czygan | 2021-07-09 | 1 | -8/+6
* reduce: set default batch size | Martin Czygan | 2021-07-08 | 1 | -6/+8
* simplify imports | Martin Czygan | 2021-07-08 | 1 | -1/+1
* reduce: separate batch calls | Martin Czygan | 2021-07-08 | 1 | -18/+18
* reduce: remove log line | Martin Czygan | 2021-07-06 | 1 | -1/+0