TODO updates

author: Bryan Newbold <bnewbold@archive.org> 2018-09-12 15:32:27 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2018-09-12 15:32:27 -0700
commit: 6681ee7d699fc481b3dc0e1e8f905395a0b42a3b (patch)
tree: 4e220199556070c1c8c6f300cb68bc2268be70fc
parent: 31537f21333cda37458cfc88331feaecbd1d72c8 (diff)
download: sandcrawler-6681ee7d699fc481b3dc0e1e8f905395a0b42a3b.tar.gz, sandcrawler-6681ee7d699fc481b3dc0e1e8f905395a0b42a3b.zip
1 files changed, 11 insertions, 0 deletions
diff --git a/TODO b/TODO
index 5c57a98..1f1c2b9 100644
--- a/TODO
+++ b/TODO
@@ -1,7 +1,18 @@
 
+- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18 (URL-escaped slash)
+- catch EOFFail fetching from wayback
 - "author counts match" in scoring
 - refactor "scorable" to "matchable"
 - look at refactoring to reduce JSON serializations
+- QA tool for matches (PDF + Crossref JSON + landing page?)
+    => python; talks directly to HBase
+- author counts should match (+/- one?)
+
+match strategies (hbase columns)
+- legacy_doi
+- url_doi
+- grobid_crossref (doi)
+- grobid_fatcat (fatcat ID)
 
 scalding:
 - better JSON library
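
Illustrative sketch (not part of the commit): two of the TODO items added above, the URL-escaped DOI case and the "+/- one" author count check, could look roughly like this in Python. The function names `normalize_doi` and `author_counts_match`, the DOI regex, and the default tolerance of 1 are assumptions made for this example; they are not taken from the sandcrawler code.

```python
import re
from urllib.parse import unquote

# Assumption: a loose DOI shape check ("10.", registrant digits, "/", suffix).
# This is a simplification for illustration, not the project's actual heuristic.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Decode percent-escapes (e.g. %2F -> /) and return a lowercase DOI.

    Returns None when the decoded string does not look like a DOI.
    """
    candidate = unquote(raw.strip()).lower()
    if DOI_PATTERN.match(candidate):
        return candidate
    return None


def author_counts_match(left, right, tolerance=1):
    """Treat two author counts as matching if they differ by at most `tolerance`."""
    return abs(left - right) <= tolerance


if __name__ == "__main__":
    # The escaped DOI from the TODO entry decodes to a form with a plain slash.
    print(normalize_doi("10.1007%2F978-3-319-49304-6_18"))
    print(author_counts_match(3, 4))
```

Decoding before matching means the escaped and unescaped forms of the same DOI compare equal; whether a tolerance of one author is the right threshold is left open, as the TODO's question mark suggests.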