author Martin Czygan <martin.czygan@gmail.com> 2020-08-25 19:17:56 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2020-08-25 19:17:56 +0200
commit ff20a5a9ef621364b45625d0c42ee42fda5bff52
tree f7779cfb75d1dc397ad334c14614aecd6c02bf21
parent 7c5d6a600b4fb620881cd5c32b5947462d9cf6b3
start datasets section
Datasets to run fuzzy matching over, including a way to download all inputs, run with various parameters, etc.
-rw-r--r-- datasets/.gitkeep 0
-rw-r--r-- datasets/README.md 16
2 files changed, 16 insertions, 0 deletions
diff --git a/datasets/.gitkeep b/datasets/.gitkeep
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/datasets/.gitkeep
diff --git a/datasets/README.md b/datasets/README.md
new file mode 100644
index 0000000..cb0f24e
--- /dev/null
+++ b/datasets/README.md
@@ -0,0 +1,16 @@
+# Datasets
+
+These are example datasets to run fuzzy matching over. The data is too large to
+be committed in the repository, but the example inputs are kept in an archive
+item.
+
+## Grobid References (grobid_refs)
+
+## Title list (titlelist)
+
+## Name only containers (name_only_containers)
+
+## OAI harvest metadata
+
+* [https://archive.org/details/oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [oai.ndjson.zst](https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst)
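
The OAI harvest above is distributed as zstd-compressed, newline-delimited JSON. As a minimal sketch for inspecting such a file (not part of this commit), the following reads one JSON document per line from any iterable of text lines. The helper name `iter_ndjson` and the script name `peek.py` are hypothetical, and the record fields are not documented in the README, so the sketch only prints the top-level keys of the first few records. Decompression is assumed to happen outside Python via the `zstd` command-line tool.

```python
import json
import sys


def iter_ndjson(lines):
    """Yield one parsed JSON document per non-blank line (hypothetical helper)."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            # Skip blank lines, which some NDJSON writers emit at the end.
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON: {exc}") from exc


if __name__ == "__main__":
    # Assumed invocation: zstd -dc oai.ndjson.zst | python peek.py
    for i, doc in enumerate(iter_ndjson(sys.stdin)):
        # The record schema is an unknown here, so only show top-level keys.
        print(sorted(doc) if isinstance(doc, dict) else type(doc).__name__)
        if i >= 4:
            break
```

Streaming from standard input keeps memory use flat regardless of archive size and avoids depending on a third-party zstd binding.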