From ce6a2ee453d29d0521c1dc3672363ec8934d2f2a Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 27 Aug 2020 16:18:08 +0200
Subject: move datasets to projects

---
 datasets/.gitkeep                 |   0
 datasets/README.md                |  17 -----------------
 datasets/fuzzycat.png             | Bin 28757 -> 0 bytes
 projects/.gitkeep                 |   0
 projects/README.md                |  17 +++++++++++++++++
 projects/fuzzycat.png             | Bin 0 -> 28757 bytes
 projects/oai_harvest_md/README.md |  10 ++++++++++
 7 files changed, 27 insertions(+), 17 deletions(-)
 delete mode 100644 datasets/.gitkeep
 delete mode 100644 datasets/README.md
 delete mode 100644 datasets/fuzzycat.png
 create mode 100644 projects/.gitkeep
 create mode 100644 projects/README.md
 create mode 100644 projects/fuzzycat.png
 create mode 100644 projects/oai_harvest_md/README.md

diff --git a/datasets/.gitkeep b/datasets/.gitkeep
deleted file mode 100644
index e69de29..0000000
diff --git a/datasets/README.md b/datasets/README.md
deleted file mode 100644
index bfbbaef..0000000
--- a/datasets/README.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# Datasets
-
-Example datasets for fuzzycat, fatcat fuzzy matching utilities.
-
-* repo: [fuzycat](https://github.com/miku/fuzzycat)
-* data: [fuzzycat_samples](https://archive.org/details/fuzzycat_samples)
-
-## Grobid References (grobid_refs)
-
-## Title list (titlelist)
-
-## Name only containers (name_only_containers)
-
-## OAI harvest metadata
-
-* [https://archive.org/details/oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
-* [oai.ndjson.zst](https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst)
diff --git a/datasets/fuzzycat.png b/datasets/fuzzycat.png
deleted file mode 100644
index 27f6ed4..0000000
Binary files a/datasets/fuzzycat.png and /dev/null differ
diff --git a/projects/.gitkeep b/projects/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/projects/README.md b/projects/README.md
new file mode 100644
index 0000000..bfbbaef
--- /dev/null
+++ b/projects/README.md
@@ -0,0 +1,17 @@
+# Datasets
+
+Example datasets for fuzzycat, fatcat fuzzy matching utilities.
+
+* repo: [fuzycat](https://github.com/miku/fuzzycat)
+* data: [fuzzycat_samples](https://archive.org/details/fuzzycat_samples)
+
+## Grobid References (grobid_refs)
+
+## Title list (titlelist)
+
+## Name only containers (name_only_containers)
+
+## OAI harvest metadata
+
+* [https://archive.org/details/oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [oai.ndjson.zst](https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst)
diff --git a/projects/fuzzycat.png b/projects/fuzzycat.png
new file mode 100644
index 0000000..27f6ed4
Binary files /dev/null and b/projects/fuzzycat.png differ
diff --git a/projects/oai_harvest_md/README.md b/projects/oai_harvest_md/README.md
new file mode 100644
index 0000000..bbaa915
--- /dev/null
+++ b/projects/oai_harvest_md/README.md
@@ -0,0 +1,10 @@
+# OAI metadata matching
+
+## Plan
+
+* [ ] get JSON version, via [oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [ ] filter out out of scope data
+* [ ] (a) for items that have a doi, figure out, whether we already have md for this doi via API
+* [ ] (b) for items w/o doi, get a list of (id, title)
+* [ ] run fuzzy matching over title list to find out which one we have
+
-- 
cgit v1.2.3
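
The plan in `projects/oai_harvest_md/README.md` could be sketched roughly as follows. This is a minimal illustration, not the fuzzycat implementation: the field names `id`, `doi` and `title` are assumptions about the harvested JSON, the helper names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever fuzzy matcher the project ends up using. Decompressing `oai.ndjson.zst` and the DOI lookup via API (step a) are left out.

```python
import difflib
import json
import re


def split_by_doi(lines):
    """Split NDJSON lines into (with_doi, without_doi) lists of records.

    Assumes each record is a JSON object that may carry a "doi" key;
    that field name is an assumption, not taken from the dataset.
    """
    with_doi, without_doi = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("doi"):
            with_doi.append(record)
        else:
            without_doi.append(record)
    return with_doi, without_doi


def normalize_title(title):
    """Crude, ASCII-only normalization: lowercase, collapse non-alphanumerics."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def fuzzy_match_titles(candidates, known_titles, threshold=0.9):
    """Return (id, best_known_title, score) for candidates above the threshold.

    candidates is a list of (id, title) pairs, as in step (b) of the plan.
    This compares every candidate with every known title, so it only
    suits small samples.
    """
    known = [(normalize_title(t), t) for t in known_titles]
    matches = []
    for ident, title in candidates:
        norm = normalize_title(title)
        best_score, best_title = 0.0, None
        for known_norm, known_title in known:
            score = difflib.SequenceMatcher(None, norm, known_norm).ratio()
            if score > best_score:
                best_score, best_title = score, known_title
        if best_title is not None and best_score >= threshold:
            matches.append((ident, best_title, best_score))
    return matches
```

A possible use on a small sample: call `split_by_doi` on the decompressed lines, build `[(r["id"], r["title"]) for r in without_doi]`, and pass that list to `fuzzy_match_titles` together with titles already present in the catalog. The pairwise comparison is quadratic, so a full harvest would need some form of blocking or indexing first.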