notes/missing_homepage_task.md




## Goal

For many long-tail journals, we have no known homepage. Many of these metadata
records likely describe journals that were never actually published, or are
otherwise bad metadata, but many others are legitimate journals whose records
are simply missing a homepage.

We want to rapidly skim through thousands of such journals and record homepage
URLs where they exist.

## Instructions

For each row in the spreadsheet, search the web or other sources for a journal
homepage. This should be an official, active site where new papers are
published, as well as historical papers.

The recommended workflow is to search for the ISSN-L and name in Google, skim
the first page for likely hits, then click through to confirm that any hits are
actually journal sites. An easy way to do this is to check for the ISSN (or the
alternate "ISSNe" or "ISSNp") in the webpage itself; we will also check for
these identifiers in an automated manner to verify homepage matches. If there
do not seem to be any hits, mark the row as skipped and move on. You will
notice that many journals are published on platforms or using common software
like OJS (Open Journal Systems), WordPress, or SciELO. If you notice this,
please tag it in the `platform` column.
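The ISSN check described above could be automated roughly as follows. This is a minimal sketch, not the actual verification code: the function names are hypothetical, and it assumes the page HTML has already been fetched and that ISSNs appear in the usual `NNNN-NNNC` form.

```python
import re
from typing import Iterable, Optional

# Four digits, a hyphen, three digits, then a digit or X check character.
ISSN_RE = re.compile(r"\b(\d{4})-(\d{3}[\dXx])\b")


def issn_checksum_ok(issn: str) -> bool:
    """Validate the ISSN check character (weights 8..2, modulo 11)."""
    chars = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", chars):
        return False
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


def page_mentions_issn(html: str, issns: Iterable[Optional[str]]) -> bool:
    """Return True if any of the given ISSNs (ISSN-L, ISSNe, ISSNp) appears in the page."""
    found = {f"{a}-{b}".upper() for a, b in ISSN_RE.findall(html)}
    wanted = {i.upper() for i in issns if i}  # skip blank or missing identifiers
    return bool(found & wanted)
```

A match is only weak evidence, since index sites also print ISSNs, so this would complement rather than replace the manual click-through.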

We are generally not interested in URLs for sites that just index or list
metadata about a journal; these often show up in search results. If it seems
like a journal has been retired, archived, or mirrored elsewhere, with all the
papers available, you can put such a URL in `other_url`. This is relatively
rare.

If the metadata (journal name) is egregiously poor or mangled, and you find
the corrected canonical title, you can put that in the `corrected_title` column
(optional).

We recommend running through 25 random rows first, without recording results,
to get a feel for the process and ask any questions.

Specific platforms we don't want any URLs from (not a complete list):

- issn.org
- sherpa.ac.uk
- wikidata.org
- scimago
- any other lists of journal information

Platforms which are ok to link to in the `other_url` column if no other hits:

- web.archive.org
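As an illustration, the exclusion list above could be applied to candidate URLs with a small host check. This is a sketch under assumptions: `is_excluded` is a hypothetical helper, the set is as incomplete as the list it mirrors, and `scimagojr.com` is assumed to be the domain meant by "scimago".

```python
from urllib.parse import urlsplit

# Domains we never want as homepage URLs (not a complete list).
# "scimagojr.com" is an assumption about which domain "scimago" refers to.
EXCLUDED_DOMAINS = {"issn.org", "sherpa.ac.uk", "wikidata.org", "scimagojr.com"}


def is_excluded(url: str) -> bool:
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)
```

Matching on the host suffix rather than a plain substring avoids rejecting an unrelated site whose name merely contains one of these strings.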

Core columns to fill in for each row:

- `skipped` (yes or blank)
- `homepage_url`
- `platform` (e.g., OJS, scielo, hypothesis, or blank)

Other columns that can be filled in, though we don't expect values for most rows:

- `other_url`
- `corrected_title`
- `original_title` (non-English)
- `corrected_publisher`
- `inactive` (yes/no)
- `comment`
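A returned spreadsheet could be sanity-checked against the column conventions above with something like the following. This is only a sketch: `row_problems` is a hypothetical helper, and the specific rules (for example, that every row needs either `skipped=yes` or some URL) are assumptions rather than stated requirements.

```python
def row_problems(row: dict) -> list:
    """Return a list of human-readable problems for one spreadsheet row."""
    problems = []
    skipped = (row.get("skipped") or "").strip().lower()
    homepage = (row.get("homepage_url") or "").strip()
    other = (row.get("other_url") or "").strip()

    if skipped not in ("", "yes"):
        problems.append("skipped must be 'yes' or blank")
    if skipped == "yes" and homepage:
        problems.append("row is marked skipped but has a homepage_url")
    # Assumption: an unskipped row should carry at least one URL.
    if skipped == "" and not (homepage or other):
        problems.append("row has no URL and is not marked skipped")
    if homepage and not homepage.startswith(("http://", "https://")):
        problems.append("homepage_url is not an http(s) URL")
    return problems
```

Each row from `csv.DictReader(f, delimiter="\t")` can be passed straight to this function, since it only relies on the column names listed above.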

## Export Task List

Dump to TSV:

    .headers on
    .mode tabs
    .output chocula_missing_hompages_longtail.2020-05-05.tsv

    SELECT issnl, issnp, issne, name, publisher, country, lang, release_count
    FROM journal
    WHERE
        any_homepage=0
        AND has_dois=0
        AND is_longtail=1
        AND release_count < 10
        AND valid_issnl=1;

NOTE: this is a partial list; as of 2020-05-05 it has about 4600 rows.

After the first round of manual homepage identification, as of 2020-07-08 there
are only 264 journals remaining selected by the above query.
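For scripted re-runs, for instance to re-count remaining rows after each round, the same long-tail export could be done from Python instead of the `sqlite3` shell. This is a minimal sketch: `export_tsv` is a hypothetical helper, the caller supplies the database connection, and the `csv` module quotes fields containing tabs or quotes, which may differ slightly from `.mode tabs` output.

```python
import csv
import sqlite3

# Same selection as the long-tail query above.
LONGTAIL_QUERY = """
    SELECT issnl, issnp, issne, name, publisher, country, lang, release_count
    FROM journal
    WHERE
        any_homepage=0
        AND has_dois=0
        AND is_longtail=1
        AND release_count < 10
        AND valid_issnl=1
"""


def export_tsv(conn: sqlite3.Connection, out, query: str = LONGTAIL_QUERY) -> int:
    """Write a header row plus all query results as TSV; return the row count."""
    cur = conn.execute(query)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names as header
    count = 0
    for row in cur:
        writer.writerow(row)  # NULL values are written as empty fields
        count += 1
    return count
```

The returned count gives the number of journals still selected by the query without opening the output file.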

## 2020-10-13 Update

    .headers on
    .mode tabs
    .output chocula_missing_hompages.2020-10-13.tsv

    SELECT issnl, issnp, issne, name, publisher, country, lang, release_count, has_dois
    FROM journal
    WHERE
        any_homepage=0
        AND name IS NOT NULL
        AND valid_issnl=1;