investigate how many fatcat releases match to SIM:
- dump archive.org SIM collection-level metadata
- dump archive.org issue item-level metadata
- releases with: in_ia_sim, volume, issue, pages, year (month?)
    => 22m   in_ia_sim
    =>  1.1m in_ia_sim preservation:none
    => 20m   in_ia_sim volume
    => 20m   in_ia_sim volume year
    => 19m   in_ia_sim volume pages
    =>  5m   in_ia_sim volume year date
    =>  7m   in_ia_sim volume issue
    =>  7m   in_ia_sim volume issue pages
    =>  6m   in_ia_sim volume issue pages first_page
    =>  5.3m in_ia_sim volume issue pages first_page in_web:false
    =>  0.7m in_ia_sim volume issue pages first_page preservation:none
    =>  2.5m in_ia_sim volume issue pages first_page date
- how many (any?) SIM journals with no fatcat container
    total: 14860
    missing-issn: 2863
    no-match: 554
    -> 3417 / 14860 = 23%
- how many SIM journals/issues/years with ~no fatcat releases?
    as of 2020-07-21: of 212 pubids (with scanned issues so far), 129 have any fatcat releases (61%)
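
The two ratios above can be re-derived from the raw counts. This is a minimal sketch using only numbers already in these notes; the variable names are made up for illustration.

```python
# Counts copied from the notes above.
total_sim_journals = 14860
missing_issn = 2863
no_match = 554

# SIM journals with no fatcat container: missing ISSN plus ISSN without a match.
without_container = missing_issn + no_match
share_without_container = without_container / total_sim_journals

# SIM pubids with scanned issues, and those with any fatcat releases.
pubids_scanned = 212
pubids_with_releases = 129
share_with_releases = pubids_with_releases / pubids_scanned

print(f"{without_container} / {total_sim_journals} = {share_without_container:.1%}")
print(f"{pubids_with_releases} / {pubids_scanned} = {share_with_releases:.1%}")
```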

at least some releases (eg, release_jpruczlec5gsjpbc2cbvwedsdy) have updated
crossref metadata that includes issue numbers


## 2020-07-20

Categories of intersection:

- fatcat catalog record and in SIM corpus: have good metadata, could
  potentially extract just pages (for fulltext search) and link directly to
  access
    => at least 22m records; 18.4m no known public fulltext
    => at least 6m with enough metadata to match; 5.3m no known public fulltext
        => TODO: how much of this 16m metadata gap (22m - 6m) can be closed by finding better metadata?
    => SIM digitized yet?
    => TODO: estimate at issue level

    example:
        "The Savings Gained From Participation in Health Promotion Programs for Medicare Beneficiaries"
        https://fatcat.wiki/release/hp7jsz2cfnc3dgepk7oxj7kvjm
        https://archive.org/details/sim_journal-of-occupational-and-environmental-medicine_2006-11_48_11/page/1125
        https://scholar-qa.archive.org/search?q=%22responses+to+the+hra+or+from+data+gathered%22

- SIM corpus paper, no fatcat catalog record
    => TODO: estimate from issue count/ratio

- fatcat catalog record, and fulltext, no SIM paper

Current scholar.archive.org behavior is to use fatcat metadata to create a
work-level document (with multiple pages) if possible. If not, the entire issue
is split into page-level documents.
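
The fallback described above might look roughly like the sketch below. This is not the actual scholar.archive.org code: `MatchedRelease`, `issue_documents`, and the dict layout are hypothetical, and it assumes that "enough metadata" means a known first and last page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MatchedRelease:
    # Hypothetical, simplified view of a fatcat release matched to a SIM issue.
    ident: str
    first_page: Optional[int]
    last_page: Optional[int]


def issue_documents(page_numbers: List[int], releases: List[MatchedRelease]) -> List[Dict]:
    """Return work-level documents when page ranges are known, else page-level ones."""
    works = []
    for rel in releases:
        if rel.first_page is None or rel.last_page is None:
            continue  # not enough metadata to locate this release in the scan
        pages = [p for p in page_numbers if rel.first_page <= p <= rel.last_page]
        if pages:
            works.append({"type": "work", "ident": rel.ident, "pages": pages})
    if works:
        return works
    # Fallback: no usable catalog metadata, so every page becomes its own document.
    return [{"type": "page", "ident": None, "pages": [p]} for p in page_numbers]
```

For example, `issue_documents([1, 2, 3], [])` yields three page-level documents, while passing a release that covers pages 2-3 yields a single work-level document.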

TODO: only "count" papers/records which have enough metadata to actually link
(eg, volume, issue, pages), not just any `in_ia_sim`.
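
One way to express the "enough metadata to link" filter is as a query string. The sketch below assumes the count shorthand used in these notes (eg, `in_ia_sim volume issue pages`) maps onto Lucene-style `field:*` exists clauses, as accepted by Elasticsearch query strings. The helper name `exists_query` is hypothetical, and the search index is not actually queried here.

```python
from typing import Iterable


def exists_query(required_fields: Iterable[str], filters: Iterable[str] = ()) -> str:
    """Build a query string like 'in_ia_sim:true AND volume:* AND issue:*'."""
    clauses = ["in_ia_sim:true"]
    # Assumption: 'field:*' matches records where the field has any value.
    clauses += [f"{field}:*" for field in required_fields]
    clauses += list(filters)
    return " AND ".join(clauses)


# Fields the TODO above treats as the minimum needed to link into a scan.
LINKABLE_FIELDS = ["volume", "issue", "pages"]

print(exists_query(LINKABLE_FIELDS))
print(exists_query(LINKABLE_FIELDS + ["first_page"], ["in_web:false"]))
```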

#### SQL Queries

    select count(*) from sim_pub;
    => 212

    select count(distinct sim_pubid) from release_counts;
    => 129

    select count(*) from sim_issue;
    => 78301

    select count(*) from sim_issue left join release_counts on sim_issue.year = release_counts.year and sim_issue.sim_pubid = release_counts.sim_pubid and sim_issue.volume = release_counts.volume where release_counts.sim_pubid is not null and release_counts.release_count > 0;
    => 218

    select count(*) from sim_issue left join release_counts on sim_issue.year = release_counts.year and sim_issue.sim_pubid = release_counts.sim_pubid and sim_issue.volume = release_counts.volume where release_counts.sim_pubid is not null and release_counts.release_count >= 3;
    => 179

    select count(*) from sim_issue left join release_counts on sim_issue.year = release_counts.year and sim_issue.sim_pubid = release_counts.sim_pubid and sim_issue.volume = release_counts.volume where release_counts.sim_pubid is not null and release_counts.release_count >= 10;
    => 166

    select sum(release_count) from release_counts;
    => 9968

    select sum(release_count) from release_counts where release_count >= 3;
    => 9940

    select count(*) from (select 1 from sim_issue group by sim_pubid, volume);
    => 6405

    select count(*) from (select sim_pubid, SUM(release_count) as release_count from release_counts group by sim_pubid);
    => 129

    select count(*) from (select sim_pubid, SUM(release_count) as release_count from release_counts group by sim_pubid) where release_count >= 10;
    => 86

    select count(*) from (select sim_pubid, SUM(release_count) as release_count from release_counts group by sim_pubid) where release_count >= 100;
    => 27
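
The join-based counts above can be tried out on toy data. This is a sketch only: the column lists for `sim_issue` and `release_counts` are inferred from the queries (the real tables may have more columns), and the rows are invented. Note that `left join ... where release_counts.sim_pubid is not null` behaves like a plain inner join, which is what the sketch uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schemas, inferred from the queries in these notes.
    CREATE TABLE sim_issue (sim_pubid TEXT, year INTEGER, volume TEXT, issue TEXT);
    CREATE TABLE release_counts (sim_pubid TEXT, year INTEGER, volume TEXT, release_count INTEGER);
    -- Invented toy rows.
    INSERT INTO sim_issue VALUES
        ('pub1', 2001, '10', '1'),
        ('pub1', 2001, '10', '2'),
        ('pub1', 2002, '11', '1'),
        ('pub2', 1999, '5', '1');
    INSERT INTO release_counts VALUES
        ('pub1', 2001, '10', 12),
        ('pub1', 2002, '11', 2);
""")


def issues_with_releases(min_count: int) -> int:
    """Count sim_issue rows whose (pubid, year, volume) has >= min_count releases."""
    row = conn.execute(
        """
        SELECT count(*)
        FROM sim_issue
        JOIN release_counts
          ON sim_issue.sim_pubid = release_counts.sim_pubid
         AND sim_issue.year = release_counts.year
         AND sim_issue.volume = release_counts.volume
        WHERE release_counts.release_count >= ?
        """,
        (min_count,),
    ).fetchone()
    return row[0]


def distinct_volumes() -> int:
    """Count distinct (pubid, volume) pairs among the scanned issues."""
    row = conn.execute(
        "SELECT count(*) FROM (SELECT 1 FROM sim_issue GROUP BY sim_pubid, volume)"
    ).fetchone()
    return row[0]
```

Because the join key stops at the volume level, every issue in a matched volume is counted, so these numbers are an upper bound on the issues that actually contain catalogued releases.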