path: root/TODO
author: Bryan Newbold <bnewbold@archive.org> 2018-09-12 15:32:27 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2018-09-12 15:32:27 -0700
commit: 6681ee7d699fc481b3dc0e1e8f905395a0b42a3b
tree: 4e220199556070c1c8c6f300cb68bc2268be70fc /TODO
parent: 31537f21333cda37458cfc88331feaecbd1d72c8
TODO updates
Diffstat (limited to 'TODO')
-rw-r--r-- TODO 11
1 file changed, 11 insertions, 0 deletions
diff --git a/TODO b/TODO
index 5c57a98..1f1c2b9 100644
--- a/TODO
+++ b/TODO
@@ -1,7 +1,18 @@
+- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18 (URL-escaped slash)
+- catch EOFFail fetching from wayback
- "author counts match" in scoring
- refactor "scorable" to "matchable"
- look at refactoring to reduce JSON serializations
+- QA tool for matches (PDF + Crossref JSON + landing page?)
+ => python; talks directly to HBase
+- author counts should match (+/- one?)
+
+match strategies (hbase columns)
+- legacy_doi
+- url_doi
+- grobid_crossref (doi)
+- grobid_fatcat (fatcat ID)
scalding:
- better JSON library
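
Note on the first added TODO item (URL-escaped slash in DOIs): a minimal Python sketch of one way the match heuristic could handle it, by percent-decoding the input before looking for a DOI. The function name `extract_doi`, the `DOI_PATTERN` regex, and the example URL are illustrative assumptions, not code from this repository.

```python
import re
from urllib.parse import unquote

# Assumed, simplified DOI shape: "10.", a 4-9 digit registrant code, "/", suffix.
# Real DOIs can be messier; this is only a heuristic for illustration.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(url_or_text):
    """Return the first DOI-like string found after percent-decoding, else None."""
    decoded = unquote(url_or_text)  # turns "%2F" back into "/"
    match = DOI_PATTERN.search(decoded)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical landing-page URL carrying the escaped DOI from the TODO item.
    url = "https://example.org/chapter/10.1007%2F978-3-319-49304-6_18"
    print(extract_doi(url))
```

Without the `unquote` step, the same pattern finds nothing in the escaped form, because `%2F` sits where the pattern expects a literal slash.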