## 2019-03-19

Importing web captures of some works that already have DOIs.

editgroup_id: kpuel5gcgjfrzkowokq54k633q

doi:10.1629/14239 # OOPS, really doi:10.1045/june2001-reich
http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html
https://fatcat.wiki/webcapture/pic2w7vlpnct3hmwvoh3anjpkq

doi:10.31859/20180528.1521
http://web.archive.org/web/20180921041617/https://joi.ito.com/weblog/2018/05/28/citing-blogs.html
https://fatcat.wiki/webcapture/u33en3554bacfanygvb3bhoday

doi:10.31859/20180822.2140
http://web.archive.org/web/20181203180836/https://joi.ito.com/weblog/2018/08/22/blog-doi-enabled.html
https://fatcat.wiki/webcapture/res6q5m3avgstd4dtk4y4jouey

doi:10.1045/november2012-beaudoin1
http://web.archive.org/web/20180726175116/http://www.dlib.org/dlib/november12/beaudoin/11beaudoin1.html
https://fatcat.wiki/webcapture/jskwwf4zvjcm3pkpwafcbgpijq

doi:10.1045/march2008-marshall-pt1
http://web.archive.org/web/20190106185812/http://www.dlib.org/dlib/march08/marshall/03marshall-pt1.html
https://fatcat.wiki/webcapture/z7uaeatyvfgwdpuxtrdu4okqii


First command:

    ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 wayback-static \
        --extid doi:10.1045/june2001-reich \
        'http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html'

Later commands looked like:

    ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 wayback-static \
        --editgroup-id kpuel5gcgjfrzkowokq54k633q \
        --extid doi:10.31859/20180528.1521 \
        'http://web.archive.org/web/20180921041617/https://joi.ito.com/weblog/2018/05/28/citing-blogs.html'

And then:

    ./fatcat_util.py --host-url https://api.fatcat.wiki/v0 editgroup-accept kpuel5gcgjfrzkowokq54k633q


## Links/Works

http://worrydream.com/ClimateChange/

https://joi.ito.com/weblog/2018/05/28/citing-blogs.html
    => https://fatcat.wiki/release/sejvdbc4mrh6ja73r5ov64l4vi

http://kcoyle.net/mexico.html

http://www.dlib.org/dlib/june01/reich/06reich.html
    => https://fatcat.wiki/release/z477qzrwfvg2vbx226qwo2gosy
    => http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html
http://www.dlib.org/dlib/november12/beaudoin/11beaudoin1.html
    => https://fatcat.wiki/release/rm4afnxm2jfotbsky2ca5uqlzm
http://www.dlib.org/dlib/march08/marshall/03marshall-pt1.html
    => https://fatcat.wiki/release/mjtqtuyhwfdr7j2c3l36uor7uy

https://web.archive.org/web/20141222133249/http://www.genders.org/g58/g58_doyle.html
    => https://fatcat.wiki/container/nzyvsqxghrhhppt7ruhfsvcnru (?)
    => https://fatcat.wiki/container/47b5x547gvbw3pbjdpqicyne7u (?)

https://blog.dshr.org/2014/03/the-half-empty-archive.html
https://blog.dshr.org/2018/10/brief-talk-at-internet-archive-event.html

https://distill.pub/2017/momentum/
    => https://fatcat.wiki/release/urz24xenybawtlfaflo3yxhcoa

http://people.csail.mit.edu/junyanz/cat/cat_papers.html

## Goals

"static page" script that takes extid (or fatcat id) and wayback link
   x=> looks up fatcat release entity
   x=> checks for existing webcapture object with same params
   x=> fetch wayback base HTML, in re-write mode
   x=> extract list of all embeds
   x=> hit CDX server for each embed, as well as base URL (sketch below)
   x=> create webcapture entity locally
    => write out CDX snippet to local disk
   x=> submit to API (controlled by flag) and print editgroup
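
A minimal sketch of the CDX-lookup step, hitting the public IA CDX API directly (illustrative only, not the actual wayback-static importer; the closest-capture selection and field handling here are assumptions):

    import json
    import urllib.request
    from urllib.parse import urlencode

    CDX_API = "https://web.archive.org/cdx/search/cdx"

    def cdx_lookup(url, target_ts, limit=25):
        """Return the capture of `url` closest to `target_ts` (a 14-digit
        wayback timestamp like "20010712114837"), or None."""
        params = urlencode({
            "url": url,
            "output": "json",
            "limit": str(limit),
            "filter": "statuscode:200",
        })
        with urllib.request.urlopen(f"{CDX_API}?{params}") as resp:
            body = resp.read()
        rows = json.loads(body) if body.strip() else []
        if len(rows) < 2:
            return None
        # first row is the header: urlkey, timestamp, original, mimetype,
        # statuscode, digest, length
        header, captures = rows[0], rows[1:]
        # crude "closest" metric: numeric distance between 14-digit timestamps
        best = min(captures, key=lambda r: abs(int(r[1]) - int(target_ts)))
        return dict(zip(header, best))

    # e.g. the base page of the first import above:
    print(cdx_lookup("http://www.dlib.org/dlib/june01/reich/06reich.html",
                     "20010712114837"))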

"add warc file" script; takes CDX snippet and webcapture id
    => CDX-to-WARC locally (sketch below)
    => push to a petabox item
    => update webcapture entity with link
    => print editgroup
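
The CDX-to-WARC step could look roughly like this warcio sketch: re-fetch each capture's original bytes via the `id_` replay modifier and write them out as WARC `resource` records. Hedged illustration only; a real script would probably preserve full HTTP `response` records, and the helper names here are assumptions:

    from io import BytesIO
    import urllib.request

    from warcio.warcwriter import WARCWriter

    def fetch_original(timestamp, url):
        # the `id_` modifier returns the capture's original, un-rewritten bytes
        replay_url = f"https://web.archive.org/web/{timestamp}id_/{url}"
        with urllib.request.urlopen(replay_url) as resp:
            return resp.read(), resp.headers.get("Content-Type", "application/octet-stream")

    def cdx_to_warc(cdx_rows, warc_path="webcapture.warc.gz"):
        """cdx_rows: iterable of (timestamp, original_url) pairs from a CDX snippet."""
        with open(warc_path, "wb") as out:
            writer = WARCWriter(out, gzip=True)
            for timestamp, url in cdx_rows:
                body, mimetype = fetch_original(timestamp, url)
                record = writer.create_warc_record(
                    url, "resource",
                    payload=BytesIO(body),
                    warc_content_type=mimetype)
                writer.write_record(record)

    cdx_to_warc([("20010712114837",
                  "http://www.dlib.org/dlib/june01/reich/06reich.html")])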

webrecorder workflow
    => capture single page on webrecorder
    => download WARC
    => upload to petabox item
    => generate CDX snippet (sketch below)
    => create webcapture entity locally
    => submit to API (controlled by flag) and print editgroup
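
For the "generate CDX snippet" step, something like the warcio sketch below could be run against the downloaded WARC (the exact column set a webcapture entity needs is an assumption here; record length and filename columns would be added the same way):

    import hashlib

    from warcio.archiveiterator import ArchiveIterator

    def warc_to_cdx(warc_path):
        """Yield one CDX-ish row per HTTP response record:
        (timestamp, url, mimetype, status, sha1-hex, offset)."""
        with open(warc_path, "rb") as stream:
            it = ArchiveIterator(stream)
            for record in it:
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                # WARC-Date is ISO 8601; squash to the 14-digit wayback form
                date = record.rec_headers.get_header("WARC-Date") or ""
                timestamp = "".join(c for c in date if c.isdigit())[:14]
                status = record.http_headers.get_statuscode()
                mimetype = (record.http_headers.get_header("Content-Type")
                            or "-").split(";")[0]
                payload = record.content_stream().read()
                yield (timestamp, url, mimetype, status,
                       hashlib.sha1(payload).hexdigest(),
                       it.get_record_offset())

    for row in warc_to_cdx("webrecorder-download.warc.gz"):
        print("\t".join(str(x) for x in row))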

helpers:
x "submit" and "accept" util functions (for editgroups)
- web view to show submitted/recent/accepted editgroups by editor
- create entity from JSON

other ideas:
- general "add a URL" (for files, filesets, webcaptures) helper command

## Commands

    # List embedded resource URLs from a replayed wayback page. hxwls -l prints
    # tab-separated columns with the element name first and the URL third; drop
    # <a> links and archive.org replay assets, keep /web/ capture paths, dedupe.
    cat gwb_20050408060956.replay.html | hxwls -l \
        | rg -v '^a\t' \
        | rg -v '\t//archive.org/' \
        | rg '\t/web/' \
        | cut -f3 \
        | sort -u
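
Roughly the same extraction using only the Python standard library (a sketch mirroring the pipeline above, not what any importer actually does):

    from html.parser import HTMLParser

    class EmbedLister(HTMLParser):
        """Collect embedded-resource URLs (not <a> links), keeping only
        wayback-relative /web/ paths, like the hxwls/rg pipeline above."""
        URL_ATTRS = {"src", "href", "data"}

        def __init__(self):
            super().__init__()
            self.urls = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":      # plain hyperlinks are not embeds
                return
            for name, value in attrs:
                if (name in self.URL_ATTRS and value
                        and value.startswith("/web/")
                        and "//archive.org/" not in value):
                    self.urls.add(value)

    parser = EmbedLister()
    with open("gwb_20050408060956.replay.html", errors="replace") as f:
        parser.feed(f.read())
    for url in sorted(parser.urls):
        print(url)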