From 250181aead188499ce8a567183d1287289a127b5 Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Sat, 31 Oct 2020 00:42:08 +0100
Subject: cleanup dirs

---
 extra/grobid_references/README.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

(limited to 'extra/grobid_references')

diff --git a/extra/grobid_references/README.md b/extra/grobid_references/README.md
index e69de29..c880f3b 100644
--- a/extra/grobid_references/README.md
+++ b/extra/grobid_references/README.md
@@ -0,0 +1,24 @@
+# Grobid refs
+
+References extracted with [grobid](https://grobid.readthedocs.io).
+
+## TODO
+
+* For a given reference string in grobid, find a matching release in fatcat.
+
+## Approach
+
+Two general ways:
+
+* query elasticsearch directly, which would max out at a few hundred queries/s
+* compute a key offline (e.g. title, or title ngrams plus authors), then compare keys
+
+## Misc
+
+Example grobid outputs:
+
+* [grobid.tei.xml](grobid.tei.xml),
+  [pdf](http://dss.in.tum.de/files/brandt-research/me.pdf) -- here grobid does
+  not extract many refs; GS looks ok
+* [pdf](https://ia803202.us.archive.org/21/items/jstor-1064270/1064270.pdf)
+
-- 
cgit v1.2.3