From 53e23f8ead8490359cb5706bfc240b6bcb960349 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Wed, 27 Oct 2021 22:34:39 +0200
Subject: add doaj subgraph notes

---
 notes/2021_10_27_doaj_subgraph.md | 134 ++++++++++++++++++++++++++++++++++++++
 notes/doaj_graph.md | 20 ------
 2 files changed, 134 insertions(+), 20 deletions(-)
 create mode 100644 notes/2021_10_27_doaj_subgraph.md
 delete mode 100644 notes/doaj_graph.md

diff --git a/notes/2021_10_27_doaj_subgraph.md b/notes/2021_10_27_doaj_subgraph.md
new file mode 100644
index 0000000..b750d0c
--- /dev/null
+++ b/notes/2021_10_27_doaj_subgraph.md
@@ -0,0 +1,134 @@
+# DOAJ Citation Graph
+
+Based on Refcat (v1), the Internet Archive (IA) Scholar Citation Graph.
+
+> 2021-10-27
+
+## Basic numbers
+
+We started with a set of 4,887,241 DOIs from DOAJ; after normalization we find
+4,773,245 metadata records in https://fatcat.wiki (catalog).
+
+| | count |
+|--------------------------------------------------------------- |------------- |
+| matched edges | 124,760,397 |
+| matched edges (by identifier) | 118,314,316 |
+| matched edges (by fuzzy matching) | 6,446,081 |
+| citations **from** a DOAJ document | 98,616,033 |
+| citations having a DOAJ document as **target** | 34,910,769 |
+| citations where source and target are in DOAJ (**intra-DOAJ**) | 8,766,405 |
+| unique source documents (all) | 12,730,677 |
+| unique source documents (doaj) | 3,471,878 |
+| unique target documents (all) | 24,331,406 |
+| unique target documents (doaj) | 2,678,972 |
+
+In words:
+
+For 72% of DOAJ documents, we have recorded at least one reference to a target
+and for 56% of the DOAJ documents, we have recorded at least one citation
+pointing to it.
+
+About 7% of the citations we find are intra-DOAJ, that is, both the citing and
+the cited article are in DOAJ.
+
+## Charts
+
+The most referenced articles in this dataset are:
+
+| Cited By | Fatcat Release Identifier | Title |
+|---------- |---------------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 27043 | pedretid7rd6xknd6gsrrh3wum | A short history of SHELX |
+| 26974 | hzhcy7rsoravrilgyhzohwlmai | Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement |
+| 22543 | fiqrt3cc5jgupls3fvroghzb4y | Fitting Linear Mixed-Effects Models Using lme4 |
+| 19735 | 4dxke54hnjh4nmsjbrrlu2o5zq | Self-efficacy: Toward a unifying theory of behavioral change. |
+| 17670 | 3zmp4orkdff7tk3tc3q7hvyvay | Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2−ΔΔCT Method |
+| 16186 | bdsantixljesjkofonh3oqalzq | The Achromatic Interfero Coronagraph |
+| 8758 | jubvkngt7zflbfkwsff44fxa6q | BEAST: Bayesian evolutionary analysis by sampling trees |
+| 8713 | ztl7z2e3engvtad4l5qhldmd64 | Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries |
+| 8646 | ctdiwqadirftjgu77untvwbpiu | A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding |
+| 8195 | 5dcgafogfvg4tfqqhobidybpna | Basic local alignment search tool |
+| 7741 | 27tkrqbmjrfctnhmodskvwhhqa | RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome |
+| 7488 | fyhpfh5lkjgl7ewr7pcgrzekha | Structure validation in chemical crystallography |
+| 7266 | tdsusrfiuzcqxnnlbmm6uzyh4m | The PRISMA Statement for Reporting Systematic Reviews and Meta-Analyses of Studies That Evaluate Health Care Interventions: Explanation and Elaboration |
+| 7242 | qhqpojpbuvh4zffs4dvqs4beyi | BLAST+: architecture and applications |
+| 7085 | joktmyyu5vdv3kuxm42zzqhn3e | Hallmarks of Cancer: The Next Generation |
+| 6934 | xku5g3hmm5eangsczpzjrctd7e | Gapped BLAST and PSI-BLAST: a new generation of protein database search programs |
+| 6806 | srzvnzj7rvbbhig37uw6vh6m4u | The Sequence Alignment/Map format and SAMtools |
+| 6685 | tgwxkq5jnjfc3eu3zpycilq7xm | Using thematic analysis in psychology |
+| 6554 | 5g42373tjfecxp44yqns7qwzoe | The RAST Server: Rapid Annotations using Subsystems Technology |
+| 6489 | wbkhvqxm2napppgmaxin66upgm | WGCNA: an R package for weighted correlation network analysis |
+| 6215 | atq75qnkkzdadbhaslevbmdlaq | Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 |
+| 6192 | jhoeu43y7rhoxd5eaw3dqzc4tm | Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing |
+| 6124 | pebeuwozure4xiaygfs6om4fya | Arlequin (version 3.0): An integrated software package for population genetics data analysis |
+| 5900 | ym7irtp4dveurpinpuyfjjdyuu | FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments |
+| 5894 | dy3dpacbd5a6dag42nnsjh3pte | Fast gapped-read alignment with Bowtie 2 |
+| 5891 | nm4tov3wxndjjjpnyoqe5lirom | MUSCLE: multiple sequence alignment with high accuracy and high throughput |
+| 5861 | tcwbgpm3kfbnxk3lhwgsaswmrm | Trimmomatic: a flexible trimmer for Illumina sequence data |
+| 5853 | j5bjclahkjfxtm6px3germagpm | MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0 |
+| 5818 | nttk476glncuhbuy4vvskrwfoi | Projections of Global Mortality and Burden of Disease from 2002 to 2030 |
+| 5644 | 7bsqead3n5he3gmbzkmfetdj3e | MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets |
+
+The most referenced articles belonging to DOAJ:
+
+| Cited By | Fatcat Release Identifier | Title |
+|---------- |---------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 257 | 42lwecjh4nhjbbfx5j6feoy4re | Evidence for large domains of similarly expressed genes in the Drosophila genome |
+| 254 | ns4v2jvhgbhh7mbg45bjtpzway | A new natural hybrid of Iris (Iridaceae) from Chongqing, China |
+| 220 | yqhzw62yhbd4xnfm6qkplk5gky | Three new subterranean species of Baezia (Curculionidae, Molytinae) for the Canary Islands |
+| 206 | fbr3cmn7svdyrk2de74p4ibhra | Dwarfs of the fortress: A new cryptic species of dwarf gecko of the genus Cnemaspis Strauch, 1887 (Squamata, Gekkonidae) from Rajgad fort in the northern Western Ghats of Maharashtra, India |
+| 187 | vwbvqztj7zbznhejreb6nmkghq | A role for cryptochromes in sleep regulation |
+| 187 | pv7gwyji7nbh7et776cnmdok3a | A new species of day gecko (Reptilia, Gekkonidae, Cnemaspis Strauch, 1887) from Sri Lanka with an updated ND2 gene phylogeny of Sri Lankan and Indian species |
+| 164 | x3ahxq56c5bwrlf6pkmpmohqmm | The laboratory rat: Relating its age with human′s |
+| 162 | 75iqitudtbcrxn73dvwv7vka5m | On the Generalized Distance in Statistics |
+| 157 | rdk724wf75ddhc5qszf53jytuy | Immunocytochemical evidence for co-expression of Type III IP3 receptor with signaling components of bitter taste transduction |
+| 142 | dlnghkvx7bgotfftqvp5rsgeg4 | Reactivation of a silenced H19 gene in human rhabdomyosarcoma by demethylation of DNA but not by histone hyperacetylation |
+| 126 | evrrqdpegnhvpggmnxtdxjdnou | Frequent Promoter Methylation of CDH1, DAPK, RARB, and HIC1 Genes in Carcinoma of Cervix Uteri: Its Relationship to Clinical Outcome |
+| 122 | p43ke27vpff6lcakjy4zchczhy | A tandem repeats database for bacterial genomes: application to the genotyping of Yersinia pestis and Bacillus anthracis |
+| 119 | 4vipha52brfmpk5ydwb2tqbxh4 | Dividend Policy Growth and the Valuation of Shares |
+| 117 | oj66fyr4nncipn4rmc77px7q2y | PGC-1alpha Deficiency Causes Multi-System Energy Metabolic Derangements: Muscle Dysfunction, Abnormal Weight Control and Hepatic Steatosis |
+| 114 | fdeqimfgg5ac7e6tqeov3lnkb4 | The molecular genetic linkage map of the model legume Medicago truncatula: an essential tool for comparative legume genomics and the isolation of agronomically important genes |
+| 112 | i4rp4yjw3bd6taihp3gkvjln2a | Aprendendo a entrevistar: como fazer entrevistas em Ciências Sociais |
+| 111 | ix3qnhyhovbwxiwycgcqofdrje | Malarone treatment failure and in vitro confirmation of resistance of Plasmodium falciparum isolate from Lagos, Nigeria |
+| 110 | yd7hojmywvexrpcoyql2bnlhyi | OPERATIONAL EARTHQUAKE FORECASTING. State of Knowledge and Guidelines for Utilization |
+| 101 | uqjudwtgjngbtpr3ey3fog3roa | Italian Privileges and Trade in Byzantium before the Fourth Crusade: A Reconsideration |
+| 101 | 6b6rdxf6fve6rj6pc7a7er4mfe | Speciation and phylogeography in the cosmopolitan marine moon jelly, Aurelia sp |
+| 100 | njdobqruvzgabdzbifrbtfnhye | The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs: Correction |
+| 98 | dovwbef4crar5dvpbgbsnmjegu | Antispilina ludwigi Hering, 1941 (Lepidoptera, Heliozelidae) a rare but overlooked European leaf miner of Bistorta officinalis (Polygonaceae): new records, redescription, biology and conservation |
+| 97 | tlewnaq64zdclbzdva7vd3pjy4 | Beyond Empathy. Phenomenological Approaches to Intersubjectivity |
+| 95 | 4xwg6e5qpnfuxmobn3ntj4beyi | Knowledge and attitude toward COVID-19 among healthcare workers at District 2 Hospital, Ho Chi Minh City |
+| 93 | 5ajqrpxqdjdzhahzkpqcqyp754 | HPLC-DAD-ESI-MSn identification of phenolic compounds in cultivated strawberries from Macedonia |
+| 91 | nofyxyrfcjclhayeeymhtyaeia | Biofilm formation by nontypeable Haemophilus influenzae: strain variability, outer membrane antigen expression and role of pili |
+| 86 | tgrf2rfdjvhv3h2j55gydzjwiu | Labiobaetis Novikova & Kluge in West Africa (Ephemeroptera, Baetidae), with description of a new species |
+| 79 | z33rdxu3cnh65lbaze44nfi6cm | Molecular phylogeny of Subtribe Artemisiinae (Asteraceae), including Artemisia and its allied and segregate genera |
+| 79 | r2acgmnjlfcpjalpsaw6srcq5y | Haplotype analysis of the PPARγ Pro12Ala and C1431T variants reveals opposing associations with body weight |
+
+## Glossary
+
+### Edge
+
+An edge connects a source metadata document with a target metadata document
+(from the fatcat catalog) and records a certain or highly likely citation of
+the target document in the source document.
+
+We also record (and display) unmatched references, that is, reference
+information from a source that has not been matched to a target yet. These are
+sometimes called "unmatched refs".
+
+### Fatcat.wiki
+
+The catalog underlying Internet Archive Scholar.
+
+### Internet Archive Scholar
+
+Search engine over 100M metadata records and over 30M fulltext documents,
+updated in near real-time as new metadata and fulltext documents become
+available in fatcat.
+
+### Internet Archive Scholar Citation Graph
+
+A citation graph derived from scholarly metadata and fulltext documents curated
+at the Internet Archive. Version 1 was released in 10/2021. Further information can be found here:
+
+* https://guide.fatcat.wiki/reference_graph.html
+* https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/
+* https://arxiv.org/abs/2110.06595
diff --git a/notes/doaj_graph.md b/notes/doaj_graph.md
deleted file mode 100644
index 449220b..0000000
--- a/notes/doaj_graph.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# DOAJ Citation Graph
-
-This dataset contains a subset of the edges of the Internet Archive (IA)
-Scholar Citation Graph (v1, 2021-07-28, named: refcat) where either the citing
-or the cited work (or both) are part of DOAJ.
-
-Basic numbers:
-
-* DOAJ DOI used for matching edges: 4,886,099
-* Catalog entries via DOI in fatcat: 4,773,245
-* We find 124,760,397 edges, of these; 98,616,033 have a source belonging to
-  DOAJ; 34,910,769 have an article in DOAJ as target; intra-DOAJ: 8,766,405
-* How do we find these edges? By id: 118,314,316; via fuzzy matching:
-  6,446,081 (5.17%)
-
-The IA Scholar citation graph is documented in various places:
-
-* https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/
-* https://guide.fatcat.wiki/reference_graph.html
-* https://arxiv.org/abs/2110.06595
-- 
cgit v1.2.3
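
The percentages quoted under "In words" in the added note can be cross-checked against its "Basic numbers" table. The following is a minimal sketch, not part of the patch. It assumes that the denominator for the 72% and 56% figures is the 4,773,245 catalog records (the note does not say so explicitly). The variable names and the `share` helper are made up for illustration.

```python
# Counts copied from the "Basic numbers" table of the note.
doaj_in_catalog = 4_773_245      # DOAJ records found in fatcat (assumed denominator)
edges_total = 124_760_397        # matched edges
edges_by_id = 118_314_316        # matched by identifier
edges_fuzzy = 6_446_081          # matched by fuzzy matching
edges_from_doaj = 98_616_033     # citations from a DOAJ document
edges_to_doaj = 34_910_769       # citations with a DOAJ document as target
edges_intra = 8_766_405          # source and target both in DOAJ
unique_sources_doaj = 3_471_878  # unique DOAJ source documents
unique_targets_doaj = 2_678_972  # unique DOAJ target documents


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Every edge is matched either by identifier or by fuzzy matching.
assert edges_by_id + edges_fuzzy == edges_total

# Every edge touches DOAJ as source, target, or both (inclusion-exclusion).
assert edges_from_doaj + edges_to_doaj - edges_intra == edges_total

source_share = share(unique_sources_doaj, doaj_in_catalog)
target_share = share(unique_targets_doaj, doaj_in_catalog)
intra_share = share(edges_intra, edges_total)
fuzzy_share = share(edges_fuzzy, edges_total)

print(f"DOAJ documents with at least one outgoing reference: {source_share:.1f}%")
print(f"DOAJ documents with at least one incoming citation:  {target_share:.1f}%")
print(f"intra-DOAJ share of all edges:                       {intra_share:.2f}%")
print(f"fuzzy-matched share of all edges:                    {fuzzy_share:.2f}%")
```

Under this assumption the two sums hold exactly, and the shares fall in the ranges the note reports. The first share comes out a little below 73%, which suggests the note truncates its percentages rather than rounding them.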