author Martin Czygan <martin.czygan@gmail.com> 2020-08-27 16:18:08 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2020-08-27 16:18:08 +0200
commit ce6a2ee453d29d0521c1dc3672363ec8934d2f2a
tree 6ad7ca90649dc99672cdcfce25a7450eda6eabd3 /projects
parent 190e60c95898e105444a398523c24b7656acd660
move datasets to projects
Diffstat (limited to 'projects')
-rw-r--r-- projects/.gitkeep | 0
-rw-r--r-- projects/README.md | 17
-rw-r--r-- projects/fuzzycat.png | bin 0 -> 28757 bytes
-rw-r--r-- projects/oai_harvest_md/README.md | 10
4 files changed, 27 insertions, 0 deletions
diff --git a/projects/.gitkeep b/projects/.gitkeep
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/projects/.gitkeep
diff --git a/projects/README.md b/projects/README.md
new file mode 100644
index 0000000..bfbbaef
--- /dev/null
+++ b/projects/README.md
@@ -0,0 +1,17 @@
+# Datasets
+
+Example datasets for fuzzycat, fatcat fuzzy matching utilities.
+
+* repo: [fuzzycat](https://github.com/miku/fuzzycat)
+* data: [fuzzycat_samples](https://archive.org/details/fuzzycat_samples)
+
+## Grobid References (grobid_refs)
+
+## Title list (titlelist)
+
+## Name only containers (name_only_containers)
+
+## OAI harvest metadata
+
+* [https://archive.org/details/oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [oai.ndjson.zst](https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst)
diff --git a/projects/fuzzycat.png b/projects/fuzzycat.png
new file mode 100644
index 0000000..27f6ed4
--- /dev/null
+++ b/projects/fuzzycat.png
Binary files differ
diff --git a/projects/oai_harvest_md/README.md b/projects/oai_harvest_md/README.md
new file mode 100644
index 0000000..bbaa915
--- /dev/null
+++ b/projects/oai_harvest_md/README.md
@@ -0,0 +1,10 @@
+# OAI metadata matching
+
+## Plan
+
+* [ ] get JSON version, via [oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [ ] filter out out-of-scope data
+* [ ] (a) for items that have a DOI, figure out whether we already have metadata for this DOI via the API
+* [ ] (b) for items without a DOI, get a list of (id, title)
+* [ ] run fuzzy matching over the title list to find out which ones we already have
+
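
The plan in `projects/oai_harvest_md/README.md` could be sketched roughly as follows. This is a minimal illustration and not part of the commit: the field names `id`, `doi` and `title` are assumptions about the harvested JSON, the helper names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever matching fuzzycat actually performs. The API lookup in step (a) is left out, since the commit does not describe the API.

```python
import json
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title and collapse runs of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def split_by_doi(lines):
    """Partition newline-delimited JSON records into DOIs and (id, title) pairs.

    Records with a DOI go to step (a); records without one, but with a
    title, go to step (b). The field names are assumed, not confirmed.
    """
    dois, titles = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        doi = doc.get("doi")  # hypothetical field name
        if doi:
            dois.append(doi.lower())
        elif doc.get("title"):  # hypothetical field name
            titles.append((doc.get("id"), doc["title"]))
    return dois, titles


def best_match(title, known_titles, threshold=0.9):
    """Return the most similar known title, or None if below the threshold.

    Compares against every known title, so it only suits small lists.
    """
    needle = normalize_title(title)
    best, best_score = None, 0.0
    for candidate in known_titles:
        score = SequenceMatcher(None, needle, normalize_title(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

For the real harvest, `lines` would come from the decompressed `oai.ndjson.zst` stream, and the pairwise comparison in `best_match` would have to be replaced by something that scales to the full title list, for example grouping titles by a normalized key first.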