fatcat-scholar
path: root/fatcat_scholar/transform.py
| Commit message | Author | Date | Files | Lines |
|---|---|---|---|---|
| replace grobid2json with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -3/+5 |
| lint: small cleanups, mostly E711 and E713 | Bryan Newbold | 2021-10-27 | 1 | -3/+3 |
| lint: remove all 'import *' uses | Bryan Newbold | 2021-10-27 | 1 | -2/+20 |
| make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -3/+10 |
| re-style imports (isort) on all core python files | Bryan Newbold | 2021-10-27 | 1 | -5/+5 |
| better parsing of year as integer in refs pipeline | Bryan Newbold | 2021-07-26 | 1 | -2/+2 |
| make fmt | Bryan Newbold | 2021-07-26 | 1 | -4/+10 |
| ref_key: hotfix for some corner cases | Bryan Newbold | 2021-07-26 | 1 | -8/+25 |
| transform: more clean_doi() calls | Bryan Newbold | 2021-07-26 | 1 | -3/+3 |
| refs transform: consolidate clean_ref_key() hacks | Bryan Newbold | 2021-07-25 | 1 | -17/+35 |
| refs transform: many fixes | Bryan Newbold | 2021-07-25 | 1 | -9/+34 |
| refs transform: 1-index refs.index, not 0-index | Bryan Newbold | 2021-07-25 | 1 | -3/+11 |
| refs: clean up GROBID DOIs and PMCIDs | Bryan Newbold | 2021-07-01 | 1 | -2/+3 |
| HACK: don't parse TEI-XML for a specific paper/file | Bryan Newbold | 2021-06-30 | 1 | -2/+4 |
| refs: include (source) release_stage in output | Bryan Newbold | 2021-06-30 | 1 | -0/+1 |
| bugfix: pass full crossref obj, not just 'record' | Bryan Newbold | 2021-06-02 | 1 | -1/+1 |
| refs: use fatcat prefix for some sources | Bryan Newbold | 2021-06-02 | 1 | -5/+5 |
| integrate crossref references, and iterate on refs output logic | Bryan Newbold | 2021-06-02 | 1 | -7/+115 |
| schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -26/+4 |
| reduce max body size to 0.5M characters | Bryan Newbold | 2021-02-24 | 1 | -1/+1 |
| fix body size limit | Bryan Newbold | 2021-02-24 | 1 | -4/+4 |
| fmt and lint fixes (including one actual bug) | Bryan Newbold | 2021-02-15 | 1 | -2/+3 |
| truncate indexed fulltext body at 1 MByte | Bryan Newbold | 2021-02-15 | 1 | -2/+13 |
| catch TEI-XML parsing exception | Bryan Newbold | 2021-01-30 | 1 | -12/+17 |
| enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+12 |
| bigfix: try resolving lang_code list issue again | Bryan Newbold | 2021-01-30 | 1 | -5/+4 |
| bugfix: lang_code sometimes a list | Bryan Newbold | 2021-01-29 | 1 | -2/+7 |
| make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+4 |
| basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -6/+45 |
| bug fix: more html_fulltext not getting processed | Bryan Newbold | 2021-01-22 | 1 | -0/+2 |
| add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -0/+1 |
| improve 'oa' tag calculation | Bryan Newbold | 2021-01-16 | 1 | -4/+4 |
| small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -2/+4 |
| add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -0/+3 |
| basic HTML transform/index support | Bryan Newbold | 2020-11-18 | 1 | -2/+46 |
| refs: extract fatcat crossref pages metadata | Bryan Newbold | 2020-11-13 | 1 | -1/+1 |
| commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1 |
| more SIM metadata mappings | Bryan Newbold | 2020-10-19 | 1 | -3/+31 |
| SIM pipeline: more language conversions | Bryan Newbold | 2020-10-16 | 1 | -2/+5 |
| transform: refactor tag generation out of transform heavy method | Bryan Newbold | 2020-10-16 | 1 | -28/+37 |
| Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1 |
| refs and grobid2json bugfixes from testing | Bryan Newbold | 2020-09-14 | 1 | -3/+10 |
| bugfix: release_year | Bryan Newbold | 2020-09-13 | 1 | -2/+2 |
| refs transform: both GROBID and fatcat refs | Bryan Newbold | 2020-09-13 | 1 | -5/+12 |
| ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -10/+16 |
| fixes to refs transform (for non-str author fields) | Bryan Newbold | 2020-09-04 | 1 | -2/+6 |
| heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -2/+142 |
| use simple names, not domain names, for some platforms | Bryan Newbold | 2020-08-12 | 1 | -3/+3 |
| biblio metadata hacks at transform time | Bryan Newbold | 2020-08-12 | 1 | -2/+98 |
| don't index sim_page without issue_item and first_page | Bryan Newbold | 2020-08-06 | 1 | -0/+3 |