aboutsummaryrefslogtreecommitdiffstats
path: root/notes/explore.md
blob: a64921717ccab072d857b164654f98e00c757007 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120

## 2020-10-08

How about preservation coverage from Scholar's Portal and Cariniana?


    sqlite> select count(*), sum(journal.release_count), sum(journal.preserved_count) from journal join directory on journal.issnl = directory.issnl where directory.slug = 'cariniana';

    count(*)    sum(journal.release_count)  sum(journal.preserved_count)
    ----------  --------------------------  ----------------------------
    777         165832                      145352                      

Just 20k releases (or less).


    sqlite> select count(*), sum(journal.release_count), sum(journal.preserved_count) from journal join directory on journal.issnl = directory.issnl where directory.slug = 'scholarsportal';

    count(*)    sum(journal.release_count)  sum(journal.preserved_count)
    ----------  --------------------------  ----------------------------
    16973       45111605                    40182992                    

Almost 5 million releases, potentially huge.

## 2020-08-31

How big of a difference in preservation coverage stats does the inclusion of
PKP PLN numbers result in?

    select count(*), sum(journal.release_count), sum(journal.preserved_count) from journal join directory on journal.issnl = directory.issnl where directory.slug = 'pkp_pln';

    count(*)    sum(journal.release_count)  sum(journal.preserved_count)
    ----------  --------------------------  ----------------------------
    1356        343333                      283984                      

So about 60k releases.

How about Hathitrust?

    select count(*), sum(journal.release_count), sum(journal.preserved_count) from journal join directory on journal.issnl = directory.issnl where directory.slug = 'hathitrust';

    count(*)    sum(journal.release_count)  sum(journal.preserved_count)
    ----------  --------------------------  ----------------------------
    26628       48160184                    36905342

Much larger potential impact, of 11+ million releases, though unclear how many
are acutally in the hathitrust archives.

## 2020-06-23

Where do back ISSN-Ls come from? Answer: exiting fatcat metadata.

    select count(*) from journal where valid_issnl = 0;
    => 4

    select count(*) from journal where known_issnl = 0;
    => 2304

    select directory.slug, count(*) from journal join directory on journal.issnl = directory.issnl where journal.known_issnl = 0 group by directory.slug order by count(*) desc limit 20;

    select count(*) from journal join fatcat_container on journal.issnl = fatcat_container.issnl where journal.known_issnl = 0;
    => 2,328
    => note: still a few dupe ISSN-L in fatcat_container


How many journals might be longtail but have just a handful of DOIs? And would
be lost if we filter by `has_doi=0`?

    select count(*) from journal where has_dois = 1 and release_count <= 10;
    9054

    select count(*) from journal where has_dois = 1 and is_longtail = 1;
    15,575

    select count(*) from journal where has_dois = 1 and (is_active is null or is_active = 1) and release_count <= 10 and any_homepage = 1;
    => 5,843

How many *journals* would old query turn up? NOTE: have not verified new homepage URLs

    SELECT COUNT(DISTINCT journal.issnl), COUNT(DISTINCT homepage.url)
    FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        homepage.terminal_status_code = 200
        AND journal.is_longtail = 1
        AND homepage.domain != 'archive.org'
        AND homepage.host NOT LIKE '%scielo%'
        AND homepage.domain != 'jst.go.jp'
        AND homepage.host != 'books.google.com'
        AND homepage.host != 'www.google.com'
        AND journal.has_dois = 0;
    => 16,471 journals
    => 19,460 URLs

New tweaks:

    SELECT COUNT(DISTINCT journal.issnl), COUNT(DISTINCT homepage.url)
    FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        (homepage.terminal_status_code = 200 or homepage.blocked or homepage.terminal_status_code is null)
        AND homepage.domain != 'archive.org'
        AND homepage.host NOT LIKE '%scielo%'
        AND homepage.domain != 'jst.go.jp'
        AND homepage.host != 'books.google.com'
        AND homepage.host != 'www.google.com'
        AND homepage.domain != 'oclc.org'
        AND homepage.host != 'www.ncbi.nlm.nih.gov'
        AND homepage.domain != 'umi.com'
        AND homepage.domain != 'doi.org'
        AND homepage.host != 'www.thefreelibrary.com'
        AND (journal.is_longtail = 1
            OR journal.publisher_type = 'society'
            OR journal.publisher_type = 'unipress' 
            OR journal.publisher_type IS NULL)
        AND (journal.has_dois = 0 or journal.release_count < 20);
    => 57,637 journals
    => 70,770 homepages

This is a significant increase in size over previous crawls!