- paper match heuristic: also match URL-escaped DOIs like
  10.1007%2F978-3-319-49304-6_18 (%2F is an escaped slash); sketch below
- catch EOFFail when fetching from wayback
- "author counts match" in scoring
- refactor "scorable" to "matchable"
- look at refactoring to reduce JSON serializations
- QA tool for matches (PDF + Crossref JSON + landing page?)
=> python; talks directly to HBase
- author counts should match (+/- one?); see the sketch below
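
A rough python3 sketch of the DOI-escaping and author-count heuristics above
(function names and the lowercasing step are assumptions, not existing code):

    from urllib.parse import unquote

    def normalize_doi(raw):
        # decode URL-escaped DOIs: 10.1007%2F978-3-319-49304-6_18
        # contains %2F, an escaped slash
        return unquote(raw).lower()

    def author_counts_match(authors_a, authors_b, slop=1):
        # counts should match, give or take `slop` (the "+/- one" above)
        return abs(len(authors_a) - len(authors_b)) <= slop
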
match strategies (hbase columns):
- legacy_doi
- url_doi
- grobid_crossref (doi)
- grobid_fatcat (fatcat ID)
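
A sketch of recording one of these as an HBase column write via happybase
(the table layout, the "match" column family, and keying rows by file sha1
are all assumptions here):

    import happybase

    # one column qualifier per match strategy
    MATCH_COLUMNS = [
        b"match:legacy_doi",
        b"match:url_doi",
        b"match:grobid_crossref",   # value: a DOI
        b"match:grobid_fatcat",     # value: a fatcat ID
    ]

    def write_match(host, table_name, sha1, column, value):
        conn = happybase.Connection(host)
        try:
            conn.table(table_name).put(sha1, {column: value})
        finally:
            conn.close()
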
scalding:
- better JSON library
- less verbose sbt test output (set log level to WARN)
- auto-formatting: addSbtPlugin("com.geirsson" % "sbt-scalafmt" % "1.6.0-RC3")
pig:
- potentially want to *not* de-dupe CDX lines by unique sha1 in all cases;
  maybe run de-duping as a second-stage filter instead (sketch after this
  list). for example, may want many URL links in fatcat for a single file
  (different links, different policies)
- fix pig gitlab-ci tests (JAVA_HOME)
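
What that second-stage filter could look like as a standalone python pass
(assumes the classic 11-field CDX format, where the original URL is the 3rd
field and the sha1 digest the 6th):

    import fileinput
    from collections import defaultdict

    def urls_by_digest(cdx_lines):
        # keep *all* URLs seen for each digest, instead of collapsing
        # to one CDX line per unique sha1
        grouped = defaultdict(list)
        for line in cdx_lines:
            fields = line.split()
            if len(fields) < 6 or fields[0] == "CDX":
                continue
            grouped[fields[5]].append(fields[2])
        return grouped

    if __name__ == "__main__":
        for digest, urls in urls_by_digest(fileinput.input()).items():
            print(digest, *urls)
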
python:
- include input file name (and chunk? and CDX?) in sentry context
- how to get an argument (like --hbase-table) into mrjob.conf, or similar?
  (one possible approach to both items sketched below)
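
A sketch, not tested against this codebase: the job class and table name
are made up, add_passthru_arg needs mrjob 0.6+, and sentry_sdk reads its
DSN from the SENTRY_DSN env var (the older raven client has equivalent
context calls):

    import sentry_sdk
    from mrjob.job import MRJob
    from mrjob.compat import jobconf_from_env

    class ExtractJob(MRJob):

        def configure_args(self):
            # a passthru arg reaches every task as self.options.hbase_table,
            # no mrjob.conf edits needed
            super(ExtractJob, self).configure_args()
            self.add_passthru_arg('--hbase-table', default='journal-extract-qa')

        def mapper_init(self):
            sentry_sdk.init()
            # hadoop exposes the current input split's file via jobconf
            sentry_sdk.set_tag('input_file',
                               jobconf_from_env('mapreduce.map.input.file'))
            sentry_sdk.set_tag('hbase_table', self.options.hbase_table)

        def mapper(self, _, line):
            yield line, 1

    if __name__ == '__main__':
        ExtractJob.run()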