index : sandcrawler
path: root/notes
| Commit message | Author | Age | Files | Lines (-/+) |
|---|---|---|---|---|
| ingest notes: various in-progress projects | Bryan Newbold | 2022-01-27 | 4 | -3/+800 |
| enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23 |
| document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89 |
| notes on re-GROBID-ing (and re-extracting) some files [trawler] | Bryan Newbold | 2021-12-09 | 1 | -0/+289 |
| commit old patch crawl notes | Bryan Newbold | 2021-12-01 | 1 | -0/+488 |
| wrap up crossref refs backfill notes | Bryan Newbold | 2021-11-10 | 1 | -0/+47 |
| update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96 |
| grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43 |
| start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54 |
| old (2020) notes on pdfextract cleanup | Bryan Newbold | 2021-10-04 | 1 | -0/+74 |
| notes on dumping PDF URL lists for partners | Bryan Newbold | 2021-10-04 | 1 | -0/+66 |
| daily OA crawl improvements/notes | Bryan Newbold | 2021-09-08 | 1 | -0/+1021 |
| OAI-PMH patch and ingest improvement notes | Bryan Newbold | 2021-09-03 | 2 | -204/+1578 |
| commit old patch crawl notes (dec 2020) | Bryan Newbold | 2021-09-03 | 1 | -0/+1 |
| commit old arxiv ingest notes | Bryan Newbold | 2021-09-03 | 1 | -0/+12 |
| commit old patch notes (will rework) | Bryan Newbold | 2021-09-03 | 1 | -0/+110 |
| MAG post-crawl stats (5m+ new PDFs crawled successfully) | Bryan Newbold | 2021-09-02 | 1 | -0/+124 |
| MAG and OAI-PMH crawl/processing notes | Bryan Newbold | 2021-08-13 | 2 | -0/+480 |
| 2021-07 unpaywall crawl wrap-up notes | Bryan Newbold | 2021-07-30 | 1 | -12/+108 |
| unpaywall 2021-07 crawl partial notes | Bryan Newbold | 2021-07-14 | 1 | -0/+224 |
| notes on large-domain ingest tweaks | Bryan Newbold | 2021-05-27 | 1 | -0/+480 |
| 2021-04 unpaywall crawl notes | Bryan Newbold | 2021-05-27 | 1 | -0/+368 |
| late-2020 OA DOI crawl ingest notes | Bryan Newbold | 2021-01-04 | 1 | -3/+46 |
| DOAJ crawl ingest stats | Bryan Newbold | 2020-12-31 | 1 | -0/+295 |
| progress notes on OA DOI ingest (still running) | Bryan Newbold | 2020-12-28 | 1 | -11/+102 |
| HTML ingest deployment notes | Bryan Newbold | 2020-12-16 | 1 | -1/+71 |
| unpaywall crawl/ingest update (from Oct 2020) | Bryan Newbold | 2020-12-08 | 1 | -0/+134 |
| commit sept 2020 scielo ingest notes | Bryan Newbold | 2020-12-08 | 1 | -0/+21 |
| add implementation notes about HTML ingest | Bryan Newbold | 2020-11-10 | 1 | -0/+248 |
| fuzzy matching notes | Bryan Newbold | 2020-11-10 | 1 | -0/+148 |
| unpaywall oct 2020 crawl notes | Bryan Newbold | 2020-11-02 | 1 | -45/+82 |
| more notes on unpaywall ingest from last week | Bryan Newbold | 2020-10-27 | 1 | -0/+73 |
| notes on 2020-09 re-ingest passes | Bryan Newbold | 2020-10-17 | 1 | -0/+197 |
| OA DOIs: partial notes | Bryan Newbold | 2020-10-17 | 1 | -0/+218 |
| notes/status on daily ingest | Bryan Newbold | 2020-10-17 | 1 | -0/+193 |
| start 2020-10 ingest notes | Bryan Newbold | 2020-10-11 | 1 | -0/+42 |
| update unpaywall 2020-04 notes | Bryan Newbold | 2020-10-11 | 1 | -0/+32 |
| OAI-PMH ingest progress timestamps | Bryan Newbold | 2020-10-11 | 1 | -0/+13 |
| notes on file_meta task (from august) | Bryan Newbold | 2020-10-01 | 1 | -0/+66 |
| OAI-PMH ingest notes | Bryan Newbold | 2020-09-03 | 1 | -0/+232 |
| daily ingest notes | Bryan Newbold | 2020-09-02 | 1 | -0/+202 |
| follow-up notes on processing 'holes' | Bryan Newbold | 2020-09-02 | 1 | -0/+19 |
| unpaywall ingest follow-up | Bryan Newbold | 2020-09-02 | 1 | -0/+115 |
| grobid+pdftext missing catch-up commands | Bryan Newbold | 2020-08-05 | 1 | -0/+101 |
| MAG ingest follow-up notes | Bryan Newbold | 2020-08-05 | 1 | -0/+194 |
| MAG 2020-07 ingest notes | Bryan Newbold | 2020-07-08 | 1 | -0/+159 |
| 2020-05_pubmed ingest notes (short) | Bryan Newbold | 2020-06-25 | 1 | -0/+10 |
| commit old notes on a one-off CDX table cleanup | Bryan Newbold | 2020-06-25 | 1 | -0/+34 |
| commit old (2020-02) pdftrio commands | Bryan Newbold | 2020-06-25 | 1 | -0/+162 |
| ingest: OAI-PMH count table | Bryan Newbold | 2020-05-28 | 1 | -0/+24 |