author Martin Czygan <martin.czygan@gmail.com> 2021-09-28 14:21:26 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-09-28 14:21:26 +0200
commit 67d163e6e57275a3969e0450b3ea2cb6c3e285b6
tree ee178d6149a38ce1bf93266fe12bf9d86562fc97
parent 8c778bcb7e928ab5183519603e83b2f8bfaebf34
mag: update notes
-rw-r--r-- extra/mag/README.md 15
1 files changed, 15 insertions, 0 deletions
diff --git a/extra/mag/README.md b/extra/mag/README.md
index 9e20c44..cd4ec70 100644
--- a/extra/mag/README.md
+++ b/extra/mag/README.md
@@ -64,3 +64,18 @@ In order to generate a doi-to-doi version, we need to:
```
Finding 1,315,040,677 DOI-to-DOI mappings.
+
+Total edges: 1,832,226,781
+
+```
+$ zstdcat -T0 PaperReferences.txt.zst | pv -l | wc -l
+1832226781
+```
+
+Non-DOI edges: 517,186,104
+
+Creating lowercase, unique sorted version:
+
+```
+$ time zstdcat -T0 doi_refs.tsv.zst | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u -T /sandcrawler-db/tmp-refcat/ -S50% | zstd -c -T0 > doi_refs_lower_sorted.tsv.zst
+```
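
The non-DOI edge count and the normalization step in the notes can be checked on a small scale. The following is a minimal sketch, not part of the commit: the helper name `lower_uniq` and the three sample rows are made up for illustration, and it assumes a POSIX shell with 64-bit arithmetic plus standard `tr` and `sort`.

```shell
# Non-DOI edges are the total edges minus the DOI-to-DOI mappings
# (both figures are taken from the notes above).
total=1832226781
doi_to_doi=1315040677
echo "non-DOI edges: $((total - doi_to_doi))"

# Hypothetical helper: lowercase stdin, then emit unique lines in byte order.
lower_uniq() {
    tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u
}

# Made-up sample: the first two rows differ only by case,
# so they collapse into a single row after normalization.
printf '10.1000/ABC\t10.1000/XYZ\n10.1000/abc\t10.1000/xyz\n10.1000/A\t10.1000/B\n' | lower_uniq
```

Setting `LC_ALL=C` makes `sort` compare raw bytes, which is typically faster than locale-aware collation and gives a deterministic order across machines.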