<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/scalding/src, branch master</title>
<id>https://git.bnewbold.net/sandcrawler/atom?h=master</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2021-10-04T20:05:21+00:00</updated>
<entry>
<title>Merge branch 'bnewbold-backfill' into 'master'</title>
<updated>2021-10-04T20:05:21+00:00</updated>
<author>
<name>bnewbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2021-10-04T20:05:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=57f879c00b00c6cd4051f54662fea3f96f80ad35'/>
<id>urn:sha1:57f879c00b00c6cd4051f54662fea3f96f80ad35</id>
<content type='text'>
CDX Backfill (scalding version)

See merge request webgroup/sandcrawler!12</content>
</entry>
<entry>
<title>GroupFatcatWorksSubsetJob</title>
<updated>2019-08-26T21:25:34+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-08-26T21:25:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d2d545675b3c85c3e1b41fd4bda23230d995bf47'/>
<id>urn:sha1:d2d545675b3c85c3e1b41fd4bda23230d995bf47</id>
<content type='text'>
This is a hacky variant of GroupFatcatWorksJob which allows
setting different left and right sides of the join. The initial
application is to re-run work merging with only longtail-oa works on the
"left", with the goal of hard-merging these releases into existing
releases with actual identifiers (instead of just grouping into works).

As a refactor, the normal GroupFatcatWorksJob could just be this with
the same file passed as both left and right, though that requires twice
as much JSON parsing/filtering.
</content>
</entry>
<entry>
<title>please command for groupworksfatcat</title>
<updated>2019-08-11T03:20:34+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-08-11T02:50:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=952457f9fae1cc25cdeeefc00e19ae20cf86c659'/>
<id>urn:sha1:952457f9fae1cc25cdeeefc00e19ae20cf86c659</id>
<content type='text'>
</content>
</entry>
<entry>
<title>FatcatScorable and ScoreSelfFatcat job</title>
<updated>2019-08-11T02:50:21+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-08-03T00:11:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=ea9e8990139973d6f5fdf52a470bf6516c7d8c2f'/>
<id>urn:sha1:ea9e8990139973d6f5fdf52a470bf6516c7d8c2f</id>
<content type='text'>
</content>
</entry>
<entry>
<title>add fatcat ident fields in prep for self-scoring job</title>
<updated>2019-08-11T02:50:21+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-08-03T00:11:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=ca725ffd9efe847905afb918ff324b421a4d8859'/>
<id>urn:sha1:ca725ffd9efe847905afb918ff324b421a4d8859</id>
<content type='text'>
</content>
</entry>
<entry>
<title>scalding dump-grobid-status-code job</title>
<updated>2019-04-12T21:19:29+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-04-12T20:48:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d93ebaa691f8b200a5761850b4533a153cb457ee'/>
<id>urn:sha1:d93ebaa691f8b200a5761850b4533a153cb457ee</id>
<content type='text'>
</content>
</entry>
<entry>
<title>fix typos in DumpGrobidXmlJob</title>
<updated>2018-10-31T01:03:24+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-10-31T01:03:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=4cb7c1bdc6710a11c869f3d398ed39762644395c'/>
<id>urn:sha1:4cb7c1bdc6710a11c869f3d398ed39762644395c</id>
<content type='text'>
</content>
</entry>
<entry>
<title>quick and dirty GROBID XML dumper</title>
<updated>2018-10-30T21:43:57+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-10-30T21:40:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=3105e9fc5799063027f3273048eea27f906d4c66'/>
<id>urn:sha1:3105e9fc5799063027f3273048eea27f906d4c66</id>
<content type='text'>
</content>
</entry>
<entry>
<title>new DumpGrobidMetaInsertableJob</title>
<updated>2018-09-23T03:31:39+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-09-23T03:31:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=51ca75189b7c36577f8e80b9db1f66259f0f6178'/>
<id>urn:sha1:51ca75189b7c36577f8e80b9db1f66259f0f6178</id>
<content type='text'>
</content>
</entry>
<entry>
<title>new simple file metadata dump script</title>
<updated>2018-09-14T04:11:57+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-09-14T04:11:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=5521b1b520a550373369da8b9cbd36148e071115'/>
<id>urn:sha1:5521b1b520a550373369da8b9cbd36148e071115</id>
<content type='text'>
</content>
</entry>
</feed>
