From 67d163e6e57275a3969e0450b3ea2cb6c3e285b6 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Tue, 28 Sep 2021 14:21:26 +0200
Subject: mag: update notes

---
 extra/mag/README.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/extra/mag/README.md b/extra/mag/README.md
index 9e20c44..cd4ec70 100644
--- a/extra/mag/README.md
+++ b/extra/mag/README.md
@@ -64,3 +64,18 @@ In order to generate a doi-to-doi version, we need to:
 ```
 
 Finding 1,315,040,677 DOI-to-DOI mappings.
+
+Total edges: 1,832,226,781
+
+```
+$ zstdcat -T0 PaperReferences.txt.zst | pv -l | wc -l
+1832226781
+```
+
+Non-DOI edges: 517,186,104
+
+Creating lowercase, unique sorted version:
+
+```
+$ time zstdcat -T0 doi_refs.tsv.zst| tr '[[:upper:]]' '[[:lower:]]' | LC_ALL=C sort -u -T /sandcrawler-db/tmp-refcat/ -S50% > doi_refs_lower_sorted.tsv.zst
+```
-- 
cgit v1.2.3
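
Note on the added lines: the non-DOI edge count is the total edge count minus the DOI-to-DOI mappings, and the last command writes to a `.zst` filename without a compression step in the pipeline. The sketch below checks the arithmetic and shows the lowercase, unique-sort step as a small helper. The function names `lowercase_sort_unique` and `non_doi_edges` are hypothetical and not part of the repository, and the trailing `zstd -T0 -c` in the usage comment is an assumption about the intended output, not something the patch confirms.

```shell
# Sketch only: helper names are hypothetical, not taken from the repository.

# Read lines on stdin, lowercase them, and emit unique lines in byte order.
lowercase_sort_unique() {
    tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u
}

# Non-DOI edges = total edges minus DOI-to-DOI mappings
# (both figures are taken from the notes in the patch).
non_doi_edges() {
    echo $(( 1832226781 - 1315040677 ))
}

# Assumed usage on the real data; compressing the result would make the
# .zst suffix of the output file accurate:
#   zstdcat -T0 doi_refs.tsv.zst | lowercase_sort_unique | zstd -T0 -c > doi_refs_lower_sorted.tsv.zst
```

The helper leaves out the `-T` and `-S50%` options that the patch passes to `sort`; those only control the temporary directory and memory budget, so they matter for the full dataset but not for the result.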