author: Martin Czygan <martin.czygan@gmail.com>, 2020-08-27 16:52:13 +0200
committer: Martin Czygan <martin.czygan@gmail.com>, 2020-08-27 16:52:13 +0200
commit: 4ab53ddfeef8fa99f5cf507f582c224a32e4c8b9
tree: 99b9669335bcf1eb282b731eeec19b3b79343763 (/projects)
parent: ce6a2ee453d29d0521c1dc3672363ec8934d2f2a
update project README
Diffstat (limited to 'projects')
-rw-r--r-- projects/oai_harvest_md/.gitignore | 1
-rw-r--r-- projects/oai_harvest_md/Makefile | 5
-rw-r--r-- projects/oai_harvest_md/README.md | 10
-rw-r--r-- projects/titlelist/README.md | 6
4 files changed, 22 insertions, 0 deletions
diff --git a/projects/oai_harvest_md/.gitignore b/projects/oai_harvest_md/.gitignore
new file mode 100644
index 0000000..96077ed
--- /dev/null
+++ b/projects/oai_harvest_md/.gitignore
@@ -0,0 +1 @@
+oai.ndjson.zst
diff --git a/projects/oai_harvest_md/Makefile b/projects/oai_harvest_md/Makefile
new file mode 100644
index 0000000..f9deb7f
--- /dev/null
+++ b/projects/oai_harvest_md/Makefile
@@ -0,0 +1,5 @@
+SHELL := /bin/bash
+
+oai_harvest_20200215.ndjson.zst:
+	wget -c https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst
+
diff --git a/projects/oai_harvest_md/README.md b/projects/oai_harvest_md/README.md
index bbaa915..5f2b655 100644
--- a/projects/oai_harvest_md/README.md
+++ b/projects/oai_harvest_md/README.md
@@ -1,5 +1,7 @@
# OAI metadata matching
+Goal: end-to-end data workflow (acquisition, harvest, matching, new release entities).
+
## Plan
* [ ] get JSON version, via [oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
@@ -8,3 +10,11 @@
* [ ] (b) for items w/o doi, get a list of (id, title)
* [ ] run fuzzy matching over title list to find out which one we have
+## Get data
+
+```
+$ make
+```
+
+* compressed 12G, around 100G uncompressed
+
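Reviewer sketch for plan step (b), "for items w/o doi, get a list of (id, title)": the function below streams over newline-delimited JSON one record at a time, so the roughly 100G of uncompressed data never has to fit in memory. The field names `id`, `title` and `doi` are assumptions made for illustration; the actual schema of `oai.ndjson.zst` is not shown in this commit and may differ.

```python
import json


def titles_without_doi(lines):
    """Yield (id, title) for each NDJSON record that has no DOI.

    The keys "id", "title" and "doi" are hypothetical; adjust them to
    the real record layout of the harvest.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting a long run
        if not isinstance(doc, dict):
            continue
        if doc.get("doi"):
            continue  # record already has a DOI, not of interest here
        title = doc.get("title")
        if isinstance(title, list):
            # Some metadata formats allow repeated titles; take the first.
            title = title[0] if title else None
        if doc.get("id") and isinstance(title, str) and title.strip():
            yield doc["id"], title.strip()
```

One way to feed it would be to decompress on the fly with `zstd -dc oai.ndjson.zst` and pass `sys.stdin` as `lines`, writing each pair out as a tab-separated line.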
diff --git a/projects/titlelist/README.md b/projects/titlelist/README.md
new file mode 100644
index 0000000..a58f50e
--- /dev/null
+++ b/projects/titlelist/README.md
@@ -0,0 +1,6 @@
+# Title list
+
+Given a list of over 20M publication titles, determine matches, to generate a
+seed list.
+
+
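Reviewer sketch for the title list project: one simple way to "determine matches" is to reduce each title to a normalized key and group records that share a key. This is exact matching after normalization, a stand-in chosen for illustration, and not necessarily the approach fuzzycat takes. The function names and the `(id, title)` input shape are assumptions.

```python
import collections
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to a lowercase, accent-folded, alphanumeric key."""
    folded = unicodedata.normalize("NFKD", title)
    # Drop combining marks so that "Über" and "Uber" produce the same key.
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", folded.lower()).strip()


def candidate_groups(pairs):
    """Map normalized key -> list of ids, keeping keys shared by 2+ records."""
    groups = collections.defaultdict(list)
    for ident, title in pairs:
        key = normalize_title(title)
        if key:  # titles with no ASCII letters or digits yield an empty key
            groups[key].append(ident)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

The groups with more than one id could serve as the seed list of candidate matches. Two limits of this sketch: titles in non-Latin scripts normalize to an empty key and are skipped, and for 20M titles the in-memory dictionary may be better replaced by sorting on the key externally.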