index : fatcat-scholar
path: root/fatcat_scholar/schema.py
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -4/+13 |
| add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -0/+1 |
| schema: optional 'fetched' field on bundles | Bryan Newbold | 2020-10-16 | 1 | -0/+2 |
| make fmt | Bryan Newbold | 2020-09-13 | 1 | -6/+12 |
| ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -1/+4 |
| URL cleanup helper | Bryan Newbold | 2020-09-13 | 1 | -0/+28 |
| heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -0/+36 |
| handle small ints better (signed/unsigned abs size) | Bryan Newbold | 2020-08-12 | 1 | -1/+2 |
| transform: more string cleaning | Bryan Newbold | 2020-08-12 | 1 | -12/+59 |
| volume_int/issue_int as actual ints | Bryan Newbold | 2020-08-06 | 1 | -2/+2 |
| handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -9/+22 |
| scrub_text: single-token strings skipped | Bryan Newbold | 2020-08-06 | 1 | -0/+4 |
| strip ACKNOWLEDGEMENTS prefix | Bryan Newbold | 2020-08-06 | 1 | -0/+1 |
| transform: catch more cases of null extra | Bryan Newbold | 2020-07-30 | 1 | -10/+10 |
| abstracts: more prefixes to ignore | Bryan Newbold | 2020-07-27 | 1 | -0/+3 |
| strip <em> tags explicitly | Bryan Newbold | 2020-07-21 | 1 | -0/+1 |
| handle large/bad 'first_page' metadata | Bryan Newbold | 2020-06-29 | 1 | -0/+3 |
| more conservative container_original_name | Bryan Newbold | 2020-06-29 | 1 | -0/+2 |
| fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -2/+1 |
| fixes to schema parsing from prod | Bryan Newbold | 2020-06-29 | 1 | -9/+13 |
| include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -0/+8 |
| fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -4/+5 |
| commit production work-around (temporarily) | Bryan Newbold | 2020-06-04 | 1 | -1/+2 |
| collapse pages by SIM issue | Bryan Newbold | 2020-06-04 | 1 | -0/+1 |
| fmt | Bryan Newbold | 2020-06-04 | 1 | -0/+2 |
| start some annotaition fixes for pytype | Bryan Newbold | 2020-06-03 | 1 | -1/+3 |
| more flake8 | Bryan Newbold | 2020-06-03 | 1 | -1/+1 |
| flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -1/+1 |
| reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -38/+64 |
| improve text scrubbing | Bryan Newbold | 2020-06-03 | 1 | -13/+21 |
| add prefix scrubing (esp. for abstracts) | Bryan Newbold | 2020-05-21 | 1 | -0/+18 |
| use beautiful soup for XML scrubing | Bryan Newbold | 2020-05-21 | 1 | -7/+6 |
| be more inclusive of author names | Bryan Newbold | 2020-05-21 | 1 | -4/+4 |
| fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -7/+11 |
| first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -0/+334 |