author | Bryan Newbold <bnewbold@archive.org> | 2018-09-12 15:32:27 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-09-12 15:32:27 -0700 |
commit | 6681ee7d699fc481b3dc0e1e8f905395a0b42a3b (patch) | |
tree | 4e220199556070c1c8c6f300cb68bc2268be70fc | |
parent | 31537f21333cda37458cfc88331feaecbd1d72c8 (diff) | |
download | sandcrawler-6681ee7d699fc481b3dc0e1e8f905395a0b42a3b.tar.gz sandcrawler-6681ee7d699fc481b3dc0e1e8f905395a0b42a3b.zip |
TODO updates
-rw-r--r-- | TODO | 11 |
1 files changed, 11 insertions, 0 deletions
@@ -1,7 +1,18 @@

+- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18 (URL-escaped slash)
+- catch EOFFail fetching from wayback
 - "author counts match" in scoring
 - refactor "scorable" to "matchable"
 - look at refactoring to reduce JSON serializations
+- QA tool for matches (PDF + Crossref JSON + landing page?)
+    => python; talks directly to HBase
+- author counts should match (+/- one?)
+
+match strategies (hbase columns)
+- legacy_doi
+- url_doi
+- grobid_crossref (doi)
+- grobid_fatcat (fatcat ID)
 
 scalding:
 - better JSON library
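
Two of the added TODO items describe small heuristics: matching DOIs whose slash is URL-escaped (`10.1007%2F978-3-319-49304-6_18`) and treating author counts as matching within plus or minus one. A minimal Python sketch of both follows. It is not taken from the sandcrawler code: the function names, the regular expression, the lowercasing step, and the default tolerance are all assumptions made for illustration.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Assumed pattern, not from the repository: "10." + 4-9 digit registrant,
# a slash, then a suffix that stops at whitespace or common URL delimiters.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#&\"'<>]+")


def extract_doi_from_url(url: str) -> Optional[str]:
    """Return a DOI found in a URL, decoding escapes such as %2F first."""
    decoded = unquote(url)  # turns "10.1007%2F978-..." into "10.1007/978-..."
    match = DOI_PATTERN.search(decoded)
    if match is None:
        return None
    # Strip trailing punctuation; lowercase because DOIs are case-insensitive
    # (normalizing to lowercase is an assumption of this sketch).
    return match.group(0).rstrip(".,;)").lower()


def author_counts_match(left: int, right: int, tolerance: int = 1) -> bool:
    """Hypothetical check for the "+/- one?" idea in the TODO."""
    return abs(left - right) <= tolerance


if __name__ == "__main__":
    url = "https://link.springer.com/chapter/10.1007%2F978-3-319-49304-6_18"
    print(extract_doi_from_url(url))
    print(author_counts_match(3, 4))
```

Decoding before matching keeps a single pattern for both escaped and unescaped URLs. The tolerance is left as a parameter because the TODO itself marks the plus or minus one rule as an open question.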