Diffstat (limited to 'TODO.md')
-rw-r--r-- TODO.md 44
1 files changed, 44 insertions, 0 deletions
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..9e002a7
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,44 @@
+
+# Grant related tasks
+
+3/4 phases of the grant contain citation-graph-related tasks.
+
+* [x] Link PID or DOI to archived versions
+
+> As of v2, we link fatcat release entities by DOI, PMID, PMCID, and arXiv ID.
+
+* [ ] URLs in corpus linked to best possible timestamp (GWB)
+
+> The CDX API is probably good for sampling; we'll need to tap into `/user/wmdata2/cdx-all-index/` (note: try PySpark).
+
+* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
+
+> A seed list (from refs, not from the full text) is done; we still need to prepare a
+> crawl and lookups in GWB. In 05/2021 we did a test lookup of the GWB index on the
+> cluster. A full lookup failed due to [map
+> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
+
+* [ ] Links between records w/o DOI (fuzzy matching)
+
+> As of v2, we do have a fuzzy matching procedure (yielding about 5-10% of the total results).
+
+* [ ] Publication of augmented citation graph, explore data mining, etc.
+* [ ] Interlinkage with other sources, monographs, commercial publications, etc.
+
+> As of v3, we have a minimal linkage with Wikipedia. In 05/2021 we extended Open Library matching (ISBN, fuzzy matching).
+
+* [ ] Wikipedia (en) references metadata or archived record
+
+> This is ongoing and should be part of v3.
+
+* [ ] Metadata records for often cited non-scholarly web publications
+* [ ] Collaborations: I4OC, WikiCite
+
+We attended an online workshop in 09/2020, organized in part by OCI members;
+recording: [fatcat five minute
+intro](https://archive.org/details/fatcat_workshop_open_citations_open_scholarly_metadata_2020)
+
+# TODO
+
+* [ ] create a first index, ES7 [schema PR](https://git.archive.org/webgroup/fatcat/-/merge_requests/99)
+* [ ] build API, [spec notes](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md)
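
Note on the "URLs in corpus linked to best possible timestamp (GWB)" item: a minimal sketch of how a small sample could be checked against the public Wayback CDX API before touching the full index on the cluster. It assumes that "best possible timestamp" means the capture closest to some reference date (for example the publication date of the citing work); that reading, the helper names, and the chosen field list are illustrative assumptions, not something the TODO specifies. The network call itself is left out, so only query construction and capture selection are shown.

```python
from datetime import datetime
from urllib.parse import urlencode

# Public Wayback CDX server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=50):
    """Build a CDX query URL asking for JSON rows of HTTP 200 captures."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(rows):
    """Turn CDX JSON output (first row is the header) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def closest_capture(captures, target):
    """Return the capture whose 14-digit timestamp is nearest to `target`.

    Assumption: "best" means "closest in time to a reference date".
    """
    def distance(capture):
        ts = datetime.strptime(capture["timestamp"], "%Y%m%d%H%M%S")
        return abs((ts - target).total_seconds())

    return min(captures, key=distance) if captures else None


if __name__ == "__main__":
    # Hand-written sample rows in the shape of a CDX JSON response.
    sample = [
        ["timestamp", "original", "statuscode"],
        ["20190101000000", "http://example.com/", "200"],
        ["20210501120000", "http://example.com/", "200"],
        ["20230101000000", "http://example.com/", "200"],
    ]
    print(build_cdx_query("example.com"))
    print(closest_capture(parse_cdx_json(sample), datetime(2021, 5, 15)))
```

For the full corpus the same selection logic would have to run as a join against `/user/wmdata2/cdx-all-index/` rather than one HTTP request per URL; the sketch is only meant for sampling.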