-rw-r--r-- notes/2021_10_27_doaj_subgraph.md 134
-rw-r--r-- notes/doaj_graph.md 20
2 files changed, 134 insertions, 20 deletions
diff --git a/notes/2021_10_27_doaj_subgraph.md b/notes/2021_10_27_doaj_subgraph.md
new file mode 100644
index 0000000..b750d0c
--- /dev/null
+++ b/notes/2021_10_27_doaj_subgraph.md
@@ -0,0 +1,134 @@
+# DOAJ Citation Graph
+
+Based on Refcat (v1), the Internet Archive (IA) Scholar Citation Graph.
+
+> 2021-10-27
+
+## Basic numbers
+
+We started with a set of 4,887,241 DOIs from DOAJ; after normalization, we find
+4,773,245 metadata records in https://fatcat.wiki (catalog).
+
+| | count |
+|--------------------------------------------------------------- |------------- |
+| matched edges | 124,760,397 |
+| matched edges (by identifier) | 118,314,316 |
+| matched edges (by fuzzy matching) | 6,446,081 |
+| citations **from** a DOAJ document | 98,616,033 |
+| citations having a DOAJ document as **target** | 34,910,769 |
+| citations where source and target are in DOAJ (**intra-DOAJ**) | 8,766,405 |
+| unique source documents (all) | 12,730,677 |
+| unique source documents (DOAJ) | 3,471,878 |
+| unique target documents (all) | 24,331,406 |
+| unique target documents (DOAJ) | 2,678,972 |
+
+In words:
+
+For 72% of DOAJ documents, we have recorded at least one reference to a target,
+and for 56% of DOAJ documents, we have recorded at least one citation
+pointing to them.
+
+About 7% of the citations we find are intra-DOAJ, that is, both the citing and
+the cited article are in DOAJ.
+
+## Charts
+
+The most referenced articles in this dataset are:
+
+| Cited By | Fatcat Release Identifier | Title |
+|---------- |---------------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 27043 | pedretid7rd6xknd6gsrrh3wum | A short history of SHELX |
+| 26974 | hzhcy7rsoravrilgyhzohwlmai | Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement |
+| 22543 | fiqrt3cc5jgupls3fvroghzb4y | Fitting Linear Mixed-Effects Models Using lme4 |
+| 19735 | 4dxke54hnjh4nmsjbrrlu2o5zq | Self-efficacy: Toward a unifying theory of behavioral change. |
+| 17670 | 3zmp4orkdff7tk3tc3q7hvyvay | Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2−ΔΔCT Method |
+| 16186 | bdsantixljesjkofonh3oqalzq | The Achromatic Interfero Coronagraph |
+| 8758 | jubvkngt7zflbfkwsff44fxa6q | BEAST: Bayesian evolutionary analysis by sampling trees |
+| 8713 | ztl7z2e3engvtad4l5qhldmd64 | Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries |
+| 8646 | ctdiwqadirftjgu77untvwbpiu | A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding |
+| 8195 | 5dcgafogfvg4tfqqhobidybpna | Basic local alignment search tool |
+| 7741 | 27tkrqbmjrfctnhmodskvwhhqa | RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome |
+| 7488 | fyhpfh5lkjgl7ewr7pcgrzekha | Structure validation in chemical crystallography |
+| 7266 | tdsusrfiuzcqxnnlbmm6uzyh4m | The PRISMA Statement for Reporting Systematic Reviews and Meta-Analyses of Studies That Evaluate Health Care Interventions: Explanation and Elaboration |
+| 7242 | qhqpojpbuvh4zffs4dvqs4beyi | BLAST+: architecture and applications |
+| 7085 | joktmyyu5vdv3kuxm42zzqhn3e | Hallmarks of Cancer: The Next Generation |
+| 6934 | xku5g3hmm5eangsczpzjrctd7e | Gapped BLAST and PSI-BLAST: a new generation of protein database search programs |
+| 6806 | srzvnzj7rvbbhig37uw6vh6m4u | The Sequence Alignment/Map format and SAMtools |
+| 6685 | tgwxkq5jnjfc3eu3zpycilq7xm | Using thematic analysis in psychology |
+| 6554 | 5g42373tjfecxp44yqns7qwzoe | The RAST Server: Rapid Annotations using Subsystems Technology |
+| 6489 | wbkhvqxm2napppgmaxin66upgm | WGCNA: an R package for weighted correlation network analysis |
+| 6215 | atq75qnkkzdadbhaslevbmdlaq | Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 |
+| 6192 | jhoeu43y7rhoxd5eaw3dqzc4tm | Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing |
+| 6124 | pebeuwozure4xiaygfs6om4fya | Arlequin (version 3.0): An integrated software package for population genetics data analysis |
+| 5900 | ym7irtp4dveurpinpuyfjjdyuu | FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments |
+| 5894 | dy3dpacbd5a6dag42nnsjh3pte | Fast gapped-read alignment with Bowtie 2 |
+| 5891 | nm4tov3wxndjjjpnyoqe5lirom | MUSCLE: multiple sequence alignment with high accuracy and high throughput |
+| 5861 | tcwbgpm3kfbnxk3lhwgsaswmrm | Trimmomatic: a flexible trimmer for Illumina sequence data |
+| 5853 | j5bjclahkjfxtm6px3germagpm | MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0 |
+| 5818 | nttk476glncuhbuy4vvskrwfoi | Projections of Global Mortality and Burden of Disease from 2002 to 2030 |
+| 5644 | 7bsqead3n5he3gmbzkmfetdj3e | MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets |
+
+The most referenced articles belonging to DOAJ are:
+
+| Cited By | Fatcat Release Identifier | Title |
+|---------- |---------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 257 | 42lwecjh4nhjbbfx5j6feoy4re | Evidence for large domains of similarly expressed genes in the *Drosophila* genome |
+| 254 | ns4v2jvhgbhh7mbg45bjtpzway | A new natural hybrid of Iris (Iridaceae) from Chongqing, China |
+| 220 | yqhzw62yhbd4xnfm6qkplk5gky | Three new subterranean species of Baezia (Curculionidae, Molytinae) for the Canary Islands |
+| 206 | fbr3cmn7svdyrk2de74p4ibhra | Dwarfs of the fortress: A new cryptic species of dwarf gecko of the genus Cnemaspis Strauch, 1887 (Squamata, Gekkonidae) from Rajgad fort in the northern Western Ghats of Maharashtra, India |
+| 187 | vwbvqztj7zbznhejreb6nmkghq | A role for *cryptochromes* in sleep regulation |
+| 187 | pv7gwyji7nbh7et776cnmdok3a | A new species of day gecko (Reptilia, Gekkonidae, Cnemaspis Strauch, 1887) from Sri Lanka with an updated ND2 gene phylogeny of Sri Lankan and Indian species |
+| 164 | x3ahxq56c5bwrlf6pkmpmohqmm | The laboratory rat: Relating its age with human's |
+| 162 | 75iqitudtbcrxn73dvwv7vka5m | On the Generalized Distance in Statistics |
+| 157 | rdk724wf75ddhc5qszf53jytuy | Immunocytochemical evidence for co-expression of Type III IP<sub>3</sub> receptor with signaling components of bitter taste transduction |
+| 142 | dlnghkvx7bgotfftqvp5rsgeg4 | Reactivation of a silenced *H19* gene in human rhabdomyosarcoma by demethylation of DNA but not by histone hyperacetylation |
+| 126 | evrrqdpegnhvpggmnxtdxjdnou | Frequent Promoter Methylation of *CDH1, DAPK, RARB*, and *HIC1* Genes in Carcinoma of Cervix Uteri: Its Relationship to Clinical Outcome |
+| 122 | p43ke27vpff6lcakjy4zchczhy | A tandem repeats database for bacterial genomes: application to the genotyping of *Yersinia pestis* and *Bacillus anthracis* |
+| 119 | 4vipha52brfmpk5ydwb2tqbxh4 | Dividend Policy Growth and the Valuation of Shares |
+| 117 | oj66fyr4nncipn4rmc77px7q2y | PGC-1alpha Deficiency Causes Multi-System Energy Metabolic Derangements: Muscle Dysfunction, Abnormal Weight Control and Hepatic Steatosis |
+| 114 | fdeqimfgg5ac7e6tqeov3lnkb4 | The molecular genetic linkage map of the model legume *Medicago truncatula*: an essential tool for comparative legume genomics and the isolation of agronomically important genes |
+| 112 | i4rp4yjw3bd6taihp3gkvjln2a | Aprendendo a entrevistar: como fazer entrevistas em Ciências Sociais |
+| 111 | ix3qnhyhovbwxiwycgcqofdrje | Malarone treatment failure and *in vitro* confirmation of resistance of *Plasmodium falciparum* isolate from Lagos, Nigeria |
+| 110 | yd7hojmywvexrpcoyql2bnlhyi | OPERATIONAL EARTHQUAKE FORECASTING. State of Knowledge and Guidelines for Utilization |
+| 101 | uqjudwtgjngbtpr3ey3fog3roa | Italian Privileges and Trade in Byzantium before the Fourth Crusade: A Reconsideration |
+| 101 | 6b6rdxf6fve6rj6pc7a7er4mfe | Speciation and phylogeography in the cosmopolitan marine moon jelly, *Aurelia* sp |
+| 100 | njdobqruvzgabdzbifrbtfnhye | The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs: Correction |
+| 98 | dovwbef4crar5dvpbgbsnmjegu | Antispilina ludwigi Hering, 1941 (Lepidoptera, Heliozelidae) a rare but overlooked European leaf miner of Bistorta officinalis (Polygonaceae): new records, redescription, biology and conservation |
+| 97 | tlewnaq64zdclbzdva7vd3pjy4 | Beyond Empathy. Phenomenological Approaches to Intersubjectivity |
+| 95 | 4xwg6e5qpnfuxmobn3ntj4beyi | Knowledge and attitude toward COVID-19 among healthcare workers at District 2 Hospital, Ho Chi Minh City |
+| 93 | 5ajqrpxqdjdzhahzkpqcqyp754 | HPLC-DAD-ESI-MSn identification of phenolic compounds in cultivated strawberries from Macedonia |
+| 91 | nofyxyrfcjclhayeeymhtyaeia | Biofilm formation by nontypeable *Haemophilus influenzae*: strain variability, outer membrane antigen expression and role of pili |
+| 86 | tgrf2rfdjvhv3h2j55gydzjwiu | Labiobaetis Novikova & Kluge in West Africa (Ephemeroptera, Baetidae), with description of a new species |
+| 79 | z33rdxu3cnh65lbaze44nfi6cm | Molecular phylogeny of Subtribe Artemisiinae (Asteraceae), including *Artemisia* and its allied and segregate genera |
+| 79 | r2acgmnjlfcpjalpsaw6srcq5y | Haplotype analysis of the PPARγ Pro12Ala and C1431T variants reveals opposing associations with body weight |
+
+## Glossary
+
+### Edge
+
+An edge connects a source metadata document with a target metadata document
+(from the fatcat catalog) and records a certain or highly likely citation of
+the target document in the source document.
+
+We also record (and display) unmatched references, that is, reference
+information from a source that has not been matched to a target yet. These are
+sometimes called "unmatched refs".
+
+### Fatcat.wiki
+
+The catalog underlying Internet Archive Scholar.
+
+### Internet Archive Scholar
+
+A search engine over more than 100M metadata records and more than 30M fulltext
+documents, updated in near real-time as new metadata and fulltext documents
+become available in fatcat.
+
+### Internet Archive Scholar Citation Graph
+
+A citation graph derived from scholarly metadata and fulltext documents curated
+at the Internet Archive. Version 1 was released in October 2021. Further information can be found here:
+
+* https://guide.fatcat.wiki/reference_graph.html
+* https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/
+* https://arxiv.org/abs/2110.06595
diff --git a/notes/doaj_graph.md b/notes/doaj_graph.md
deleted file mode 100644
index 449220b..0000000
--- a/notes/doaj_graph.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# DOAJ Citation Graph
-
-This dataset contains a subset of the edges of the Internet Archive (IA)
-Scholar Citation Graph (v1, 2021-07-28, named: refcat) where either the citing
-or the cited work (or both) are part of DOAJ.
-
-Basic numbers:
-
-* DOAJ DOI used for matching edges: 4,886,099
-* Catalog entries via DOI in fatcat: 4,773,245
-* We find 124,760,397 edges, of these; 98,616,033 have a source belonging to
- DOAJ; 34,910,769 have an article in DOAJ as target; intra-DOAJ: 8,766,405
-* How do we find these edges? By id: 118,314,316; via fuzzy matching:
- 6,446,081 (5.17%)
-
-The IA Scholar citation graph is documented in various places:
-
-* https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/
-* https://guide.fatcat.wiki/reference_graph.html
-* https://arxiv.org/abs/2110.06595
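
The percentages quoted under "In words" in the added note, and the 5.17% fuzzy-matching share in the removed note, can be rechecked from the table counts. The following is a minimal sketch, not part of the commit. It assumes that the per-document shares are taken relative to the 4,773,245 DOAJ records found in fatcat, which the note does not state explicitly. The names `share` and `shares` are hypothetical helpers introduced only for this illustration.

```python
# Counts copied from the "Basic numbers" table in the added note.
MATCHED_EDGES = 124_760_397
FUZZY_EDGES = 6_446_081
INTRA_DOAJ_EDGES = 8_766_405
UNIQUE_DOAJ_SOURCES = 3_471_878
UNIQUE_DOAJ_TARGETS = 2_678_972

# Assumption: the per-document shares use the number of DOAJ records found
# in fatcat as the denominator; the note does not name the denominator.
DOAJ_RECORDS = 4_773_245


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


shares = {
    "DOAJ documents with at least one outbound reference": share(
        UNIQUE_DOAJ_SOURCES, DOAJ_RECORDS
    ),
    "DOAJ documents with at least one inbound citation": share(
        UNIQUE_DOAJ_TARGETS, DOAJ_RECORDS
    ),
    "intra-DOAJ edges among all matched edges": share(
        INTRA_DOAJ_EDGES, MATCHED_EDGES
    ),
    "fuzzy-matched edges among all matched edges": share(
        FUZZY_EDGES, MATCHED_EDGES
    ),
}

for label, value in shares.items():
    print(f"{label}: {value:.2f}%")
```

Under that assumption, the first three values should land close to the quoted 72%, 56% and 7%. If the original 4,887,241 DOIs were the intended denominator instead, the per-document shares would come out slightly lower.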