# Fatcat Production Import CHANGELOG
This file tracks major content (metadata) imports to the Fatcat production
database (at https://fatcat.wiki). It complements the code CHANGELOG file.

In general, changes that impact more than 50k entities will get logged here;
this file should probably get merged into the guide at some point.

This file should not turn into a TODO list!
## 2021-06
Created new containers via chocula pipeline. Did not update any existing
chocula entities.

Ran DOAJ import manually, yielding almost 130k new release entities.

Ran dblp import manually, resulting in about 17k new release entities, as well
as 108 new containers. Note that 146k releases were not inserted due to
`skip-dblp-container-missing` and 203k due to `exists-fuzzy`.
## 2020-12
Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.

Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
several hundred thousand file entities matched by DOAJ identifier. Updated
journal metadata using the chocula tool (before the release ingest). Filtered
out fuzzy-matching papers before importing.

Imported dblp from a 2020 snapshot, both containers (primarily for conferences
lacking an ISSN) and release entities (primarily conference papers). Filtered
out fuzzy-matching papers before importing.
## 2020-03
Started harvesting both Arxiv and Pubmed metadata daily and importing to
fatcat. Did backfill imports for both sources.

JALC DOI registry update from 2019 dump.
## 2020-01
Imported around 2,500 new containers (journals, by ISSN-L) from chocula
analysis script.

Imported DOIs from Datacite (around 16 million, plus or minus a couple
million).

Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import
included only new Pubmed works cataloged in 2019 (up until December or so).
Only a few hundred thousand new release entities.

Daily "ingest" (crawling) pipeline running.
## 2019-12
Started continuous harvesting of Datacite DOI metadata; first date harvested
was `2019-12-13`. No importer running yet.

Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting
from XML to JSON): <https://archive.org/details/orcid-dump-2019>

Inserted about 154k new arxiv release entities. Still no automatic daily
harvesting.

"Save Paper Now" importer running. This bot only *submits* editgroups for
review; it doesn't auto-accept them.
## 2019-11
Daily ingest of fulltext for OA releases now enabled. New file entities created
and merged automatically.
## 2019-10
Inserted 1.45m new release entities from Crossref which had been missed during
a previous gap in continuous metadata harvesting.

Updated 304,308 file entities to remove broken
"https://web.archive.org/web/None/*" URLs.
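
For illustration only, a minimal sketch of the kind of filter such a cleanup
implies. The function name `clean_file_urls` and the plain list-of-strings
input are assumptions made for this example; this is not the script that was
actually run against production.

```python
from typing import List

# The broken prefix named in the entry above.
BROKEN_WAYBACK_PREFIX = "https://web.archive.org/web/None/"


def clean_file_urls(urls: List[str]) -> List[str]:
    """Return the URLs with broken wayback entries dropped (hypothetical helper)."""
    return [u for u in urls if not u.startswith(BROKEN_WAYBACK_PREFIX)]
```
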
## 2019-09
Created and updated metadata for tens of thousands of containers, using
"chocula" pipeline.
## 2019-08
Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (e.g.,
invalid ISSN checksum).
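
As a reference for what an "invalid ISSN checksum" means, here is a small
sketch of the standard ISSN check-digit test (weights 8 down to 2 over the
first seven digits, modulo 11, with `X` standing for 10). The function name
`issn_checksum_ok` is made up for this example and is not taken from the
fatcat or chocula code.

```python
import re

# NNNN-NNNC, where C is a digit or X
ISSN_RE = re.compile(r"\d{4}-\d{3}[\dX]")


def issn_checksum_ok(issn: str) -> bool:
    """Check the format and check digit of an ISSN (hypothetical helper)."""
    issn = issn.strip().upper()
    if not ISSN_RE.fullmatch(issn):
        return False
    digits = issn.replace("-", "")
    # Weighted sum of the first seven digits, weights 8, 7, ..., 2
    total = sum(int(c) * w for c, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected
```

For example, `issn_checksum_ok("0378-5955")` passes, while changing the last
digit to `4` fails the check.
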
## 2019-04
Imported files (matched to releases by DOI) from Semantic Scholar
(`DIRECT-OA-CRAWL-2019` crawl).

Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
by a user to archive.org.

Imported files (matched to releases by DOI) from CORE.ac.uk
(`DIRECT-OA-CRAWL-2019` crawl).

Imported files (matched to releases by DOI) from the public web (including many
repositories) from the `UNPAYWALL` 2018 crawl.
## 2019-02
Bootstrapped!