# Fatcat Production Import CHANGELOG

This file tracks major content (metadata) imports to the Fatcat production
database (at https://fatcat.wiki). It complements the code CHANGELOG file.

In general, changes that impact more than 50k entities will get logged here;
this file should probably get merged into the guide at some point.

This file should not turn into a TODO list!


## 2021-11

Ran a series of cleanups. See background and prep notes in `notes/cleanups/`
and specific final commands in this directory. Quick summary:

- more than 9.5 million file entities had wayback URLs with truncated
  timestamps, and were fixed with the full timestamps. a small fraction (0.5%)
  were identified but not corrected in this first pass
- over 140k release entities with non-lowercase DOIs were updated to use the
  lowercase DOI. all DOIs in current release entities are now lowercase (at
  least, no ASCII uppercase characters found)
- over 220k file entities with incorrect release relation, due to an
  import-time code bug, were fixed. a couple hundred questionable cases remain,
  but are all mismatched due to DOI slash/double-slash issues and will not be
  fixed in an automated way.
- de-duplicated a few thousand file entities, on the basis of SHA-1 hash
- updated file metadata for around 160k file entities (a couple hundred
  thousand remain with partial metadata)
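
As a rough illustration of the first two checks above, here is a minimal
sketch. It is not the actual cleanup code (see `notes/cleanups/` for that).
The function names are hypothetical, and it assumes that a full wayback
timestamp is 14 digits (`YYYYMMDDhhmmss`) and that "non-lowercase" means
containing any ASCII uppercase character.

```python
import re

# Assumption: wayback URLs look like https://web.archive.org/web/<digits>/<original-url>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/([0-9]+)/")

# Assumption: a full wayback timestamp has 14 digits (YYYYMMDDhhmmss)
FULL_TIMESTAMP_LEN = 14


def has_truncated_timestamp(url: str) -> bool:
    """True if this is a wayback URL whose timestamp is shorter than 14 digits."""
    match = WAYBACK_RE.match(url)
    return bool(match) and len(match.group(1)) < FULL_TIMESTAMP_LEN


def has_ascii_uppercase(doi: str) -> bool:
    """True if the DOI contains any ASCII uppercase character."""
    return any("A" <= char <= "Z" for char in doi)


def lowercase_doi(doi: str) -> str:
    """Return the DOI in lowercase form."""
    return doi.lower()
```

A check like this only flags candidates; finding the correct full timestamp
for a truncated URL needs a separate lookup, which this sketch does not cover.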


## 2021-06

Created new containers via chocula pipeline. Did not update any existing
chocula entities.

Ran DOAJ import manually, yielding almost 130k new release entities.

Ran dblp import manually, resulting in about 17k new release entities, as well
as 108 new containers. Note that 146k releases were not inserted due to
`skip-dblp-container-missing` and 203k due to `exists-fuzzy`.

## 2020-12

Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.

Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
several hundred thousand file entities matched by DOAJ identifier. Updated
journal metadata using the chocula tool (before the release ingest). Filtered out
fuzzy-matching papers before importing.

Imported dblp from a 2020 snapshot, both containers (primarily for conferences
lacking an ISSN) and release entities (primarily conference papers). Filtered
out fuzzy-matching papers before importing.

## 2020-03

Started harvesting both Arxiv and Pubmed metadata daily and importing to
fatcat. Did backfill imports for both sources.

JALC DOI registry update from 2019 dump.

## 2020-01

Imported around 2,500 new containers (journals, by ISSN-L) from chocula
analysis script.

Imported DOIs from Datacite (around 16 million, plus or minus a couple
million).

Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import
included only new Pubmed works cataloged in 2019 (up until December or so).
Only a few hundred thousand new release entities.

Daily "ingest" (crawling) pipeline running.

## 2019-12

Started continuously harvesting Datacite DOI metadata; first date harvested was
`2019-12-13`. No importer running yet.

Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting
from XML to JSON): <https://archive.org/details/orcid-dump-2019>

Inserted about 154k new arxiv release entities. Still no automatic daily
harvesting.

"Save Paper Now" importer running. This bot only *submits* editgroups for
review; it doesn't auto-accept them.

## 2019-11

Daily ingest of fulltext for OA releases now enabled. New file entities created
and merged automatically.

## 2019-10

Inserted 1.45m new release entities from Crossref which had been missed during
a previous gap in continuous metadata harvesting.

Updated 304,308 file entities to remove broken
"https://web.archive.org/web/None/*" URLs.
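
A minimal sketch of the kind of filter this cleanup implies, assuming the URLs
of a file entity are available as a plain list of strings. The function name
and the list-of-strings shape are hypothetical, not the actual entity schema
or cleanup script.

```python
# Prefix of the broken URLs described above (timestamp rendered as "None")
BROKEN_PREFIX = "https://web.archive.org/web/None/"


def drop_broken_wayback_urls(urls):
    """Return the URLs with any broken 'web/None/' wayback entries removed."""
    return [url for url in urls if not url.startswith(BROKEN_PREFIX)]
```

For example, a list holding one broken and one valid URL would come back with
only the valid one, in its original order.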

## 2019-09

Created and updated metadata for tens of thousands of containers, using
"chocula" pipeline.

## 2019-08

Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (eg,
invalid ISSN checksum).
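
For reference, the ISSN check digit can be verified as sketched below. This
follows the standard ISSN scheme (weights 8 down to 2 on the first seven
digits, modulo 11, with `X` standing for 10). The function name is
hypothetical and this is not the code used for the cleanup.

```python
import re

# An ISSN is written as NNNN-NNNC, where C is a digit or "X"
ISSN_RE = re.compile(r"^[0-9]{4}-[0-9]{3}[0-9X]$")


def issn_checksum_ok(issn: str) -> bool:
    """True if the string is a well-formed ISSN with a valid check digit."""
    if not ISSN_RE.match(issn):
        return False
    chars = issn.replace("-", "")
    # Weight the first seven digits by 8, 7, ..., 2
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected
```

A lowercase `x` check character is rejected here; normalize case first if the
input may contain it.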

## 2019-04

Imported files (matched to releases by DOI) from Semantic Scholar
(`DIRECT-OA-CRAWL-2019` crawl).

Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
by a user to archive.org.

Imported files (matched to releases by DOI) from CORE.ac.uk
(`DIRECT-OA-CRAWL-2019` crawl).

Imported files (matched to releases by DOI) from the public web (including many
repositories) from the `UNPAYWALL` 2018 crawl.

## 2019-02

Bootstrapped!