summaryrefslogtreecommitdiffstats
path: root/extra/demo_entities
diff options
context:
space:
mode:
Diffstat (limited to 'extra/demo_entities')
-rw-r--r--extra/demo_entities/filesets.txt73
-rw-r--r--extra/demo_entities/webcaptures.txt73
2 files changed, 146 insertions, 0 deletions
diff --git a/extra/demo_entities/filesets.txt b/extra/demo_entities/filesets.txt
new file mode 100644
index 00000000..9a3beae3
--- /dev/null
+++ b/extra/demo_entities/filesets.txt
@@ -0,0 +1,73 @@
+
+## Goals
+
+"DASH/CDL/IA/Dat importer"
+ => start with local dat clone w/ discovery key; releases that have DOI
+ => but may need to create release if datacite
+ => enumerate and hash all the files under 'data/'
+ => process metadata from cdl_dash_metadata.json
+ => construct fileset entity
+ => set extra['ark_id']
+ => set extra['related_works'] = [] (?)
+ => or group under the work?
+ => add: rel=dweb url=dat://.../files/
+ => add CDL... repo-bundle?
+ https://merritt.cdlib.org/u/ark%3A%2Fb5068%2Fd1rp49/2
+ => add CDL... repo-dir?
+ https://merritt.cdlib.org/d/ark%3A%2Fb5068%2Fd1rp49/2/021611_H929.txt
+
+## Example Works
+
+https://dash.ucop.edu/stash/dataset/doi:10.7280/D1J37Z
+"Jakobshavn Glacier Bed Elevation"
+< 1MByte
+doi:10.7280/D1J37Z
+ark:/13030/m5rg0r8q
+dat://77e94744aa5f967e6ed7e3990bfc29f141dbf2c0fff572eb1212b3bd706882f4
+NOTE: abstract was unicode-mangled for this one; I fixed by hand
+https://fatcat.wiki/fileset/ho376wmdanckpp66iwfs7g22ne
+
+https://dash.ucop.edu/stash/dataset/doi:10.5068/D1RP49
+"Live cell interferometry cell division tracking data files"
+54 MByte, couple dozen files, no directorie
+doi:10.5068/D1RP49
+ark:/b5068/d1rp49
+dat://7f5f95752650ab2968ec6a0c491fe320937ab928f57bd88692b1086248ee2925
+https://fatcat.wiki/fileset/ltjp7k2nrbes3or5h4na5qgxlu
+
+https://dash.ucop.edu/stash/dataset/doi:10.15146/R3201J
+"Data associated with Britten, Thatcher and Caro (PLOS One, 2016). "Zebras and biting flies: quantitative analysis of reflected light from zebra coats in their natural habitat.""
+CC-0
+783 MByte
+doi:10.15146/R3201J
+ark:/13030/m53r5pzm
+dat://c02c88d3989df551e203089d67b1c2a3ae36e933b229c464d78356935acedfd1
+existing fatcat work:h5cb6baxnragxlg4tamgsgpef4 release:qws4ekug5bgivkxsvsgrtwuybe
+https://fatcat.wiki/fileset/vp2azlpw5zgsrjr7d3w7csej2u
+
+stress test:
+https://dash.ucop.edu/stash/dataset/doi:10.7272/Q66Q1V54
+doi:10.7272/Q66Q1V54
+ark:/b7272/q66q1v54
+dat://f0c1cbc00720ff03c47234c737e3a62088f3ec51c5b911f5e6cc73d4571bd3c0
+16 GByte, many files, in sub-directories (for which the dat is broken)
+
+Unfortunately, looks like these ARKs don't result (get a tombstone, "Object in
+restricted Merritt collection"): http://n2t.net/ark:/13030/m53r5pzm
+
+## Commands
+
+First:
+
+ ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 cdl-dash-dat \
+ 77e94744aa5f967e6ed7e3990bfc29f141dbf2c0fff572eb1212b3bd706882f4
+
+Then:
+
+ ./fatcat_import.py --host-url https://api.fatcat.wiki/v0 cdl-dash-dat \
+ --editgroup-id xl3rz6uxfrb2pgprzxictbkvxi \
+ 7f5f95752650ab2968ec6a0c491fe320937ab928f57bd88692b1086248ee2925
+
+ [etc]
+
+
diff --git a/extra/demo_entities/webcaptures.txt b/extra/demo_entities/webcaptures.txt
index 2d86fcbb..b753b689 100644
--- a/extra/demo_entities/webcaptures.txt
+++ b/extra/demo_entities/webcaptures.txt
@@ -43,3 +43,76 @@ And then:
./fatcat_util.py --host-url https://api.fatcat.wiki/v0 editgroup-accept kpuel5gcgjfrzkowokq54k633q
+
+## Links/Works
+
+http://worrydream.com/ClimateChange/
+
+https://joi.ito.com/weblog/2018/05/28/citing-blogs.html
+ => https://fatcat.wiki/release/sejvdbc4mrh6ja73r5ov64l4vi
+
+http://kcoyle.net/mexico.html
+
+http://www.dlib.org/dlib/june01/reich/06reich.html
+ => https://fatcat.wiki/release/z477qzrwfvg2vbx226qwo2gosy
+ => http://web.archive.org/web/20010712114837/http://www.dlib.org/dlib/june01/reich/06reich.html
+http://www.dlib.org/dlib/november12/beaudoin/11beaudoin1.html
+ => https://fatcat.wiki/release/rm4afnxm2jfotbsky2ca5uqlzm
+http://www.dlib.org/dlib/march08/marshall/03marshall-pt1.html
+ => https://fatcat.wiki/release/mjtqtuyhwfdr7j2c3l36uor7uy
+
+https://web.archive.org/web/20141222133249/http://www.genders.org/g58/g58_doyle.html
+ => https://fatcat.wiki/container/nzyvsqxghrhhppt7ruhfsvcnru (?)
+ => https://fatcat.wiki/container/47b5x547gvbw3pbjdpqicyne7u (?)
+
+https://blog.dshr.org/2014/03/the-half-empty-archive.html
+https://blog.dshr.org/2018/10/brief-talk-at-internet-archive-event.html
+
+https://distill.pub/2017/momentum/
+ => https://fatcat.wiki/release/urz24xenybawtlfaflo3yxhcoa
+
+http://people.csail.mit.edu/junyanz/cat/cat_papers.html
+
+## Goals
+
+"static page" script that takes extid (or fatcat id) and wayback link
+ x=> looks up fatcat release entity
+ x=> checks for existing webcapture object with same params
+ x=> fetch wayback base HTML, in re-write mode
+ x=> extract list of all embeds
+ x=> hit CDX server for each embed, as well as base URL
+ x=> create webcapture entity locally
+ => write out CDX snippet to local disk
+ x=> submit to API (controlled by flag) and print editgroup
+
+"add warc file" script; takes CDX snippet and webcapture id
+ => CDX-to-WARC locally
+ => push to a petabox item
+ => update webcapture entity with link
+ => print editgroup
+
+webrecorder workflow
+ => capture single page on webrecorder
+ => download WARC
+ => upload to petabox item
+ => generate CDX snippet
+ => create webcapture entity locally
+ => submit to API (controlled by flag) and print editgroup
+
+helpers:
+x "submit" and "accept" util functions (for editgroups)
+- web view to show submitted/recent/accepted editgroups by editor
+- create entity from JSON
+
+other ideas:
+- general "add a URL" (for files, filesets, webcaptures) helper command
+
+## Commands
+
+ cat gwb_20050408060956.replay.html | hxwls -l \
+ | rg -v '^a\t' \
+ | rg -v '\t//archive.org/' \
+ | rg '\t/web/' \
+ | cut -f3 \
+ | sort -u
+