sandcrawler
path: root/scalding/src/test

| Commit message | Author | Date | Files | Lines (-/+) |
|---|---|---|---|---|
| Merge branch 'bnewbold-backfill' into 'master' | bnewbold | 2021-10-04 | 1 | -0/+175 |
| do sha1 pattern match correctly | Bryan Newbold | 2018-07-24 | 1 | -0/+11 |
| fix CdxBackfillJob tests | Bryan Newbold | 2018-07-24 | 1 | -2/+2 |
| CdxBackfillJob: implement other fields | Bryan Newbold | 2018-07-24 | 1 | -9/+60 |
| CdxBackfillJob back to HBase; tests work | Bryan Newbold | 2018-07-24 | 1 | -8/+8 |
| variant of CdxBackfillJob that writes to TSV | Bryan Newbold | 2018-07-24 | 1 | -0/+113 |
| FatcatScorable and ScoreSelfFatcat job | Bryan Newbold | 2019-08-10 | 1 | -0/+160 |
| hack scorejob variant with extra context joined in | Bryan Newbold | 2018-09-12 | 1 | -0/+262 |
| blacklist -> denylist | Bryan Newbold | 2018-09-05 | 1 | -1/+1 |
| changed style of ScoreJobTest.bundle | Ellen Spertus | 2018-09-04 | 1 | -14/+10 |
| minor style improvement | Ellen Spertus | 2018-09-04 | 1 | -2/+2 |
| fixed tests after replacing NoSlug with None | Ellen Spertus | 2018-09-04 | 4 | -77/+85 |
| make similarity score case-insensitive | Bryan Newbold | 2018-08-27 | 1 | -0/+8 |
| basic crossref subtitle concatination support | Bryan Newbold | 2018-08-27 | 1 | -0/+18 |
| more special characters to strip | Bryan Newbold | 2018-08-27 | 1 | -1/+1 |
| rename DumpUnGrobidedJob | Bryan Newbold | 2018-08-24 | 1 | -4/+4 |
| scalding: UnGrobidedDumpJob | Bryan Newbold | 2018-08-24 | 1 | -0/+72 |
| add a content-type filter for crossref works | Bryan Newbold | 2018-08-23 | 1 | -0/+9 |
| require crossref works to have at least one author (for matching) | Bryan Newbold | 2018-08-23 | 1 | -0/+6 |
| author parsing (and year, for crossref) | Bryan Newbold | 2018-08-23 | 2 | -1/+6 |
| set a minimum slug size (8 chars) | Bryan Newbold | 2018-08-23 | 3 | -14/+26 |
| Fixed style violations. | Ellen Spertus | 2018-08-22 | 1 | -2/+1 |
| Added ScoreJob test for title-length filtering. | Ellen Spertus | 2018-08-22 | 1 | -5/+13 |
| Merge branch 'master' into ellen-length-filtering | Ellen Spertus | 2018-08-22 | 1 | -1/+1 |
| add more punctuation characters to slug filter | Bryan Newbold | 2018-08-22 | 1 | -1/+1 |
| Added title-length filtering to CrossrefScorable. | Ellen Spertus | 2018-08-22 | 1 | -2/+34 |
| Added more tests of GrobidScorable.keepRecord | Ellen Spertus | 2018-08-22 | 1 | -0/+5 |
| Added title length filtering to GrobidScorable | Ellen Spertus | 2018-08-22 | 1 | -2/+29 |
| remove slug-blacklist conservative test | Bryan Newbold | 2018-08-21 | 1 | -16/+0 |
| Merge branch 'bnewbold-match-scale' | Bryan Newbold | 2018-08-21 | 1 | -0/+2 |
| add a trap to ScoreJob | Bryan Newbold | 2018-08-20 | 1 | -0/+2 |
| fix bugs/typos in HBaseColCountJob and HBaseStatusCountJob | Bryan Newbold | 2018-08-21 | 2 | -14/+7 |
| distinction between status_code and status counting | Bryan Newbold | 2018-08-21 | 2 | -6/+75 |
| add GrobidScorableDumpJob and basic test | Bryan Newbold | 2018-08-21 | 1 | -0/+124 |
| Merge branch 'strings' | Bryan Newbold | 2018-08-21 | 2 | -0/+22 |
| Reads blacklist from file. | Ellen Spertus | 2018-08-20 | 2 | -0/+22 |
| Created static factory method for ScorableCreations to deal with null. | Ellen Spertus | 2018-08-20 | 1 | -3/+3 |
| Disabled scalastyle null checking where we want to test null values. | Ellen Spertus | 2018-08-20 | 1 | -0/+2 |
| Reduced boilerplate code. | Ellen Spertus | 2018-08-20 | 1 | -11/+11 |
| change slugification behavior to not split on colon | Bryan Newbold | 2018-08-15 | 2 | -23/+23 |
| add a stub title blacklist | Bryan Newbold | 2018-08-15 | 1 | -0/+6 |
| handle null status_code lines | Bryan Newbold | 2018-08-15 | 1 | -3/+7 |
| unrelated TODO about testing with null HBase values | Bryan Newbold | 2018-08-15 | 1 | -0/+1 |
| scorable: test for more punctuation removal | Bryan Newbold | 2018-08-15 | 1 | -0/+8 |
| crossref: test for empty-string title | Bryan Newbold | 2018-08-15 | 1 | -0/+6 |
| scorable: test for null strings | Bryan Newbold | 2018-08-15 | 1 | -0/+5 |
| grobid scoring: status_code as signed int, not string | Bryan Newbold | 2018-08-15 | 1 | -2/+3 |
| Fixed style problems (or disabled warning when appropriate) for tests. | Ellen Spertus | 2018-08-14 | 9 | -100/+128 |
| Minor improvements. | Ellen Spertus | 2018-08-14 | 1 | -10/+7 |
| Now ignores grobid entries with status other than 200. | Ellen Spertus | 2018-08-14 | 2 | -17/+32 |