## 2019-03-19
Importing web captures of some works that already have DOIs.
editgroup_id: kpuel5gcgjfrzkowokq54k633q
doi:10.1629/14239 # OOPS, really doi:10.1045/june2001-reich
http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html
https://fatcat.wiki/webcapture/pic2w7vlpnct3hmwvoh3anjpkq
doi:10.31859/20180528.1521
http://web.archive.org/web/20180921041617/https://joi.ito.com/weblog/2018/05/28/citing-blogs.html
https://fatcat.wiki/webcapture/u33en3554bacfanygvb3bhoday
doi:10.31859/20180822.2140
http://web.archive.org/web/20181203180836/https://joi.ito.com/weblog/2018/08/22/blog-doi-enabled.html
https://fatcat.wiki/webcapture/res6q5m3avgstd4dtk4y4jouey
doi:10.1045/november2012-beaudoin1
http://web.archive.org/web/20180726175116/http://www.dlib.org/dlib/november12/beaudoin/11beaudoin1.html
https://fatcat.wiki/webcapture/jskwwf4zvjcm3pkpwafcbgpijq
doi:10.1045/march2008-marshall-pt1
http://web.archive.org/web/20190106185812/http://www.dlib.org/dlib/march08/marshall/03marshall-pt1.html
https://fatcat.wiki/webcapture/z7uaeatyvfgwdpuxtrdu4okqii
First command:
    ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 wayback-static \
        --extid doi:10.1045/june2001-reich \
        'http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html'
Later commands like:
    ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 wayback-static \
        --editgroup-id kpuel5gcgjfrzkowokq54k633q \
        --extid doi:10.31859/20180528.1521 \
        'http://web.archive.org/web/20180921041617/https://joi.ito.com/weblog/2018/05/28/citing-blogs.html'
And then:
    ./fatcat_util.py --host-url https://api.fatcat.wiki/v0 editgroup-accept kpuel5gcgjfrzkowokq54k633q
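The import commands above differ only in the DOI, the wayback URL, and whether an editgroup id is passed. Here is a rough Python sketch of generating them for a batch. It uses only the flags that appear in these notes. The helper name `wayback_static_argv` is made up for illustration and is not part of fatcat.

```python
import shlex

HOST_URL = "https://api.fatcat.wiki/v0"


def wayback_static_argv(extid, wayback_url, editgroup_id=None):
    """Build the argv for one wayback-static import (hypothetical helper)."""
    argv = ["./fatcat_import.py", "--host-url", HOST_URL, "wayback-static"]
    # The first import in the notes omits --editgroup-id; later ones reuse the id.
    if editgroup_id is not None:
        argv += ["--editgroup-id", editgroup_id]
    argv += ["--extid", extid, wayback_url]
    return argv


captures = [
    ("doi:10.31859/20180528.1521",
     "http://web.archive.org/web/20180921041617/https://joi.ito.com/weblog/2018/05/28/citing-blogs.html"),
    ("doi:10.31859/20180822.2140",
     "http://web.archive.org/web/20181203180836/https://joi.ito.com/weblog/2018/08/22/blog-doi-enabled.html"),
]

for extid, url in captures:
    # Print a copy-pasteable shell command rather than running anything.
    print(shlex.join(wayback_static_argv(extid, url, "kpuel5gcgjfrzkowokq54k633q")))
```

Printing the commands instead of executing them keeps a manual review step before anything is submitted.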
## Links/Works
http://worrydream.com/ClimateChange/
https://joi.ito.com/weblog/2018/05/28/citing-blogs.html
=> https://fatcat.wiki/release/sejvdbc4mrh6ja73r5ov64l4vi
http://kcoyle.net/mexico.html
http://www.dlib.org/dlib/june01/reich/06reich.html
=> https://fatcat.wiki/release/z477qzrwfvg2vbx226qwo2gosy
=> http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html
http://www.dlib.org/dlib/november12/beaudoin/11beaudoin1.html
=> https://fatcat.wiki/release/rm4afnxm2jfotbsky2ca5uqlzm
http://www.dlib.org/dlib/march08/marshall/03marshall-pt1.html
=> https://fatcat.wiki/release/mjtqtuyhwfdr7j2c3l36uor7uy
https://web.archive.org/web/20141222133249/http://www.genders.org/g58/g58_doyle.html
=> https://fatcat.wiki/container/nzyvsqxghrhhppt7ruhfsvcnru (?)
=> https://fatcat.wiki/container/47b5x547gvbw3pbjdpqicyne7u (?)
https://blog.dshr.org/2014/03/the-half-empty-archive.html
https://blog.dshr.org/2018/10/brief-talk-at-internet-archive-event.html
https://distill.pub/2017/momentum/
=> https://fatcat.wiki/release/urz24xenybawtlfaflo3yxhcoa
http://people.csail.mit.edu/junyanz/cat/cat_papers.html
## Goals
"static page" script that takes extid (or fatcat id) and wayback link
x=> looks up fatcat release entity
x=> checks for existing webcapture object with same params
x=> fetch wayback base HTML, in re-write mode
x=> extract list of all embeds
x=> hit CDX server for each embed, as well as base URL
x=> create webcapture entity locally
=> write out CDX snippet to local disk
x=> submit to API (controlled by flag) and print editgroup
"add warc file" script; takes CDX snippet and webcapture id
=> CDX-to-WARC locally
=> push to a petabox item
=> update webcapture entity with link
=> print editgroup
webrecorder workflow
=> capture single page on webrecorder
=> download WARC
=> upload to petabox item
=> generate CDX snippet
=> create webcapture entity locally
=> submit to API (controlled by flag) and print editgroup
helpers:
x "submit" and "accept" util functions (for editgroups)
- web view to show submitted/recent/accepted editgroups by editor
- create entity from JSON
other ideas:
- general "add a URL" (for files, filesets, webcaptures) helper command
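For the "hit CDX server for each embed, as well as base URL" step of the static page script, a minimal sketch of the URL handling might look like the following. It assumes the public wayback CDX endpoint at `web.archive.org/cdx/search/cdx` with its `url`, `from`, `to`, `output`, and `limit` parameters. The function names are hypothetical, and no network request is made here.

```python
import re
from urllib.parse import urlencode

# Matches wayback replay URLs, with an optional two-letter modifier such as
# "id_" or "im_" after the 14-digit timestamp.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$"
)

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"  # assumed endpoint


def parse_wayback_url(url):
    """Split a wayback URL into (timestamp, original_url)."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a wayback replay URL: " + url)
    return match.group(1), match.group(2)


def cdx_query_url(timestamp, original_url):
    """Build a CDX query for the capture at exactly this timestamp."""
    params = {
        "url": original_url,
        "from": timestamp,
        "to": timestamp,
        "output": "json",
        "limit": 1,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


ts, orig = parse_wayback_url(
    "http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html"
)
print(cdx_query_url(ts, orig))
```

The same two functions would apply to each embed URL once it is made absolute against `web.archive.org`.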
## Commands
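    # List embedded resources in a wayback replay page: drop <a> links and
    # archive.org's own assets, keep only rewritten /web/ URLs, de-duplicate.
    cat gwb_20050408060956.replay.html | hxwls -l \
        | rg -v '^a\t' \
        | rg -v '\t//archive.org/' \
        | rg '\t/web/' \
        | cut -f3 \
        | sort -u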
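A rough Python approximation of the same pipeline, using only the standard library, for the "extract list of all embeds" step. It looks only at `src` and `href` attributes, so it may miss link types that `hxwls` reports. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Collect rewritten /web/ URLs from every element except <a>."""

    def __init__(self):
        super().__init__()
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Like `rg -v '^a\t'`: hyperlinks are not embeds.
            return
        for name, value in attrs:
            # Like `rg '\t/web/'`: keep only rewritten wayback paths. This also
            # skips "//archive.org/..." assets, which do not start with /web/.
            if name in ("src", "href") and value and value.startswith("/web/"):
                self.urls.add(value)


def extract_embeds(html):
    """Return the sorted, de-duplicated embed URLs found in the HTML."""
    parser = EmbedCollector()
    parser.feed(html)
    parser.close()
    return sorted(parser.urls)


sample = (
    '<link rel="stylesheet" href="/web/20050408060956cs_/http://example.com/s.css">'
    '<script src="//archive.org/includes/analytics.js"></script>'
    '<a href="/web/20050408060956/http://example.com/next">next</a>'
    '<img src="/web/20050408060956im_/http://example.com/a.png">'
)
for url in extract_embeds(sample):
    print(url)
```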