-rw-r--r-- python/notes/coci_notes.md 43
1 files changed, 43 insertions, 0 deletions
diff --git a/python/notes/coci_notes.md b/python/notes/coci_notes.md
new file mode 100644
index 0000000..6d7a968
--- /dev/null
+++ b/python/notes/coci_notes.md
@@ -0,0 +1,43 @@
+# COCI Notes
+
+* [https://opencitations.net/download](https://opencitations.net/download)
+* [https://figshare.com/articles/dataset/Crossref_Open_Citation_Index_CSV_dataset_of_all_the_citation_data/6741422/9](https://figshare.com/articles/dataset/Crossref_Open_Citation_Index_CSV_dataset_of_all_the_citation_data/6741422/9)
+
+> 6741422v9.zip [19G]
+
+> Dump created on 2020-12-07. This dump includes information on:
+
+* 60,778,357 bibliographic resources;
+* 759,516,507 citation links.
+
+
+```
+extracted/2020-06-13T18_18_05_1-2.zip
+extracted/2020-08-20T18_12_28_1-2.zip
+extracted/2020-04-25T04_48_36_1-5.zip
+extracted/2020-11-22T17_48_01_1-3.zip
+extracted/2020-01-13T19_31_19_1-4.zip
+extracted/2019-10-21T22_41_20_1-63.zip
+```
+
+* extracted to 79 CSV files
+
+Raw data example.
+
+```
+oci,citing,cited,creation,timespan,journal_sc,author_sc
+02003080406360106010101060909370200010237070005020502-02001000106361937231430122422370200000837000737000200,10.3846/16111699.2012.705252,10.1016/j.neucom.2008.07.020,2012-10-04,P3Y0M,no,no
+02003080406360106010101060909370200010237070005020502-0200308040636010601016301060909370200000837093701080963010908,10.3846/16111699.2012.705252,10.3846/1611-1699.2008.9.189-198,2012-10-04,P4Y0M4D,yes,no
+02003080406360106010101060909370200010237070005020502-02001000106361937102818141224370200000737000237000003,10.3846/16111699.2012.705252,10.1016/j.asieco.2007.02.003,2012-10-04,P5Y6M,no,no
+02003080406360106010101060909370200010237070005020502-02003080406360106010101060909370200010137050505030808,10.3846/16111699.2012.705252,10.3846/16111699.2011.555388,2012-10-04,P1Y5M22D,yes,no
+...
+```
+
+For comparison, we also need a DOI-to-DOI matching list.
+
+Example approach:
+
+* extract (source, target) release ident pairs, sort by source ident
+* from the fatcat db dump, extract source ident and ext ids (including the DOI), sort by source ident
+* "zip together" the two sorted lists, i.e. merge them on the shared ident
+
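Note on the "zip together" step: since both lists are sorted by the same ident, the step can be done as a streaming merge join. The following is only a minimal sketch of that idea, not the implementation used in this repository. The function names `zip_sorted` and `doi_pairs`, the tuple layouts, and the sample idents and DOIs are all hypothetical. It assumes at most one DOI per ident and that both inputs are ordered by plain byte-wise string comparison (as `LC_ALL=C sort` would produce). The in-memory `sorted()` calls stand in for an external sort on the real dumps.

```python
import itertools
from operator import itemgetter


def zip_sorted(edges, idents):
    """Merge-join two iterables of 2-tuples, both sorted by their first field.

    edges:  (key_ident, payload) pairs, sorted by key_ident
    idents: (ident, doi) pairs, sorted by ident, one row per ident (assumed)
    Yields (doi, payload) for every edge whose key_ident has a DOI.
    """
    idents = iter(idents)
    current = next(idents, None)
    for key, group in itertools.groupby(edges, key=itemgetter(0)):
        # Advance the ident stream until it catches up with the edge key.
        while current is not None and current[0] < key:
            current = next(idents, None)
        if current is None:
            break  # no idents left, so no further edge can match
        if current[0] == key:
            for _, payload in group:
                yield current[1], payload


def doi_pairs(edges, idents):
    """Resolve both ends of (source_ident, target_ident) edges to DOIs."""
    idents = list(idents)
    # Pass 1: replace the source ident with the source DOI.
    half = zip_sorted(sorted(edges), idents)
    # Pass 2: re-sort by target ident and resolve the target as well.
    swapped = sorted((target, source_doi) for source_doi, target in half)
    for target_doi, source_doi in zip_sorted(swapped, idents):
        yield source_doi, target_doi


if __name__ == "__main__":
    # Made-up sample data; ident "d" has no DOI and is dropped.
    sample_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
    sample_idents = [("a", "10.1/a"), ("b", "10.1/b"), ("c", "10.1/c")]
    for pair in doi_pairs(sample_edges, sample_idents):
        print(pair)
```

Two passes are needed because one sort order can only resolve one side of each edge. The resulting DOI pairs could then be sorted and compared against the `citing,cited` columns of the COCI CSV files.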