<feed xmlns='http://www.w3.org/2005/Atom'>
<title>fatcat/python/fatcat_tools/transforms, branch v0.3.3</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/fatcat/atom?h=v0.3.3</id>
<link rel='self' href='https://git.bnewbold.net/fatcat/atom?h=v0.3.3'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/'/>
<updated>2020-12-18T03:22:26+00:00</updated>
<entry>
<title>bug fix: is_preserved should always be bool</title>
<updated>2020-12-18T03:22:26+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-12-18T03:22:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=28aa515a964421abba8cefd2f43ef2bf75fe47c5'/>
<id>urn:sha1:28aa515a964421abba8cefd2f43ef2bf75fe47c5</id>
<content type='text'>
</content>
</entry>
<entry>
<title>fix indentation</title>
<updated>2020-12-16T22:39:57+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-12-16T22:39:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=0174f7976e4cf5f288539e8a82ba07cc8f45f5c8'/>
<id>urn:sha1:0174f7976e4cf5f288539e8a82ba07cc8f45f5c8</id>
<content type='text'>
</content>
</entry>
<entry>
<title>have release elasticsearch transform count webcaptures and filesets towards preservation</title>
<updated>2020-12-16T22:34:28+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-12-16T22:34:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=486bbd7ea65fa50b3a839e5d371f04b8655a00c8'/>
<id>urn:sha1:486bbd7ea65fa50b3a839e5d371f04b8655a00c8</id>
<content type='text'>
These are simple/partial changes to have webcaptures and filesets show
up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A
longer-term TODO is to update the ES schema to have more granular
analytic flags.

Also includes a small generalization refactor for URL object parsing
into preservation status, shared across file+fileset+webcapture entity
types (all have similar URL objects with url+rel fields).
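
As a rough illustration of the shared URL parsing described above, here is a
minimal sketch. The function names, the specific rel values, and the host
checks are assumptions made for this example, not the project's actual code;
URL objects are modeled as plain dicts with url and rel keys.

```python
# Hypothetical helpers for illustration; the rel values and host checks
# are assumptions, not the project's actual rules.

def url_flags(url_obj):
    """Map one URL object (a dict with 'url' and 'rel') to status flags."""
    url = url_obj["url"].lower()
    rel = url_obj.get("rel")
    flags = {}
    # Assumed set of rel values that indicate an archival copy
    if rel in ("webarchive", "repository", "archive"):
        flags["is_preserved"] = True
    # Assumed host check for Internet Archive copies
    if "//web.archive.org/" in url or "//archive.org/" in url:
        flags["in_ia"] = True
    # ftp URLs count as "in_web" alongside http and https
    if url.startswith("http") or url.startswith("ftp"):
        flags["in_web"] = True
    return flags


def entity_flags(url_objs):
    """Combine flags over all URL objects of a file, fileset, or webcapture."""
    # Every flag starts as an explicit bool so the output is never None
    flags = {"is_preserved": False, "in_ia": False, "in_web": False}
    for url_obj in url_objs:
        flags.update(url_flags(url_obj))
    return flags
```

Because each per-URL result only ever sets flags to True, the same helper can
be reused for any entity type that carries a list of url+rel objects.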
</content>
</entry>
<entry>
<title>small release_to_elasticsearch refactors</title>
<updated>2020-12-16T19:31:22+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-12-16T19:29:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=532a25205f2cd2929c4258dee87bc6c53cd5cdc3'/>
<id>urn:sha1:532a25205f2cd2929c4258dee87bc6c53cd5cdc3</id>
<content type='text'>
These should have almost no change in behavior, but improve code
quality.

The one behavior change is that ftp URLs now count as "in_web".
</content>
</entry>
<entry>
<title>refactor release_to_elasticsearch transform</title>
<updated>2020-12-16T19:24:55+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-12-16T19:24:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=d6ad61c28ddf5bd7dc57f9766ce57d5b48022d3e'/>
<id>urn:sha1:d6ad61c28ddf5bd7dc57f9766ce57d5b48022d3e</id>
<content type='text'>
This method was huge and monolithic. This commit splits out the
content- and container-specific sections into helper functions to make it
more manageable. This involved refactoring to make many flags ("is_*" and
"in_*") part of the output dict through the entire code path, allowing
simple update() calls on the dict.
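
A minimal sketch of that pattern, with hypothetical helper names and fields
(none of these are taken from the actual code): each helper returns its own
partial dict, with flags always present, and the top-level function merges
them with update().

```python
# Hypothetical names and fields, for illustration only.

def container_fields(container):
    """Return the container-specific part of the output dict."""
    fields = {"is_oa": False, "container_name": None}
    if container is None:
        return fields
    fields["container_name"] = container.get("name")
    # Assumed input key; the real OA logic is more involved
    if container.get("oa"):
        fields["is_oa"] = True
    return fields


def content_fields(files):
    """Return the content-specific part of the output dict."""
    fields = {"in_web": False, "file_count": len(files)}
    for f in files:
        if f.get("urls"):
            fields["in_web"] = True
    return fields


def release_to_doc(release):
    """Build the output dict by merging the helpers' partial dicts."""
    doc = {"title": release.get("title")}
    doc.update(container_fields(release.get("container")))
    doc.update(content_fields(release.get("files", [])))
    return doc
```

Since every helper always returns every flag it owns, the merged dict has the
same keys regardless of which branches were taken.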

Note for the future: this should be refactored to use a type-annotated
class for the elasticsearch output object, perhaps something
auto-generated from the ES schema itself (JSON files).
</content>
</entry>
<entry>
<title>if a release has DOAJ article id, count as OA</title>
<updated>2020-11-19T22:55:15+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-11-18T03:41:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=6b948b6b6d480940571de244c664efe420e05d50'/>
<id>urn:sha1:6b948b6b6d480940571de244c664efe420e05d50</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest tool: support for setting ingest type</title>
<updated>2020-11-07T03:16:31+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-11-07T03:16:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=b1b34d44ce1a416ee70be665b71b99ba9f98d9a3'/>
<id>urn:sha1:b1b34d44ce1a416ee70be665b71b99ba9f98d9a3</id>
<content type='text'>
</content>
</entry>
<entry>
<title>elastic transform: more preservation keepers</title>
<updated>2020-10-09T00:41:52+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-10-09T00:41:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=e3320e27999567ec9b687c25cc8040ff600496cd'/>
<id>urn:sha1:e3320e27999567ec9b687c25cc8040ff600496cd</id>
<content type='text'>
</content>
</entry>
<entry>
<title>release ES transform tweaks</title>
<updated>2020-08-08T03:05:31+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-08-08T03:05:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=de0fb59f0e36d8079649feefb7592189d8f7c6ed'/>
<id>urn:sha1:de0fb59f0e36d8079649feefb7592189d8f7c6ed</id>
<content type='text'>
Pass through publisher_type from container extra metadata (the ES field
already existed; the value comes from newer chocula metadata).

Count arxiv and PMCID papers which haven't been crawled (by IA) as
"dark", not "bright".
</content>
</entry>
<entry>
<title>basic toml transform helper</title>
<updated>2020-07-31T06:45:30+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-07-30T02:27:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=69e281deaf601b39e8ef51d603e3e5e16dc71777'/>
<id>urn:sha1:69e281deaf601b39e8ef51d603e3e5e16dc71777</id>
<content type='text'>
</content>
</entry>
</feed>
