index: sandcrawler
path: root/notes
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| more dataset crawl notes | Bryan Newbold | 2022-04-26 | 1 | -0/+53 |
| .ua crawling follow-up stats | Bryan Newbold | 2022-04-26 | 1 | -2/+2 |
| start notes on unpaywall and targeted/patch crawls | Bryan Newbold | 2022-04-20 | 2 | -0/+277 |
| .ua ingest notes | Bryan Newbold | 2022-04-04 | 1 | -0/+29 |
| various ingest/task notes | Bryan Newbold | 2022-03-22 | 4 | -5/+97 |
| DOAJ ingest/crawl notes | Bryan Newbold | 2022-03-11 | 1 | -0/+266 |
| partial notes on .ua urgent crawling | Bryan Newbold | 2022-03-11 | 1 | -0/+196 |
| 2022 patch crawl bulk ingest notes | Bryan Newbold | 2022-03-02 | 1 | -0/+106 |
| update old OAI-PMH patch crawl notes | Bryan Newbold | 2022-02-28 | 1 | -1/+36 |
| more patch crawling | Bryan Newbold | 2022-02-08 | 2 | -9/+209 |
| OAI-PMH patch crawl more updates | Bryan Newbold | 2022-02-08 | 1 | -2/+71 |
| ingest notes: various in-progress projects | Bryan Newbold | 2022-01-27 | 4 | -3/+800 |
| enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23 |
| document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89 |
| notes on re-GROBID-ing (and re-extracting) some files [trawler] | Bryan Newbold | 2021-12-09 | 1 | -0/+289 |
| commit old patch crawl notes | Bryan Newbold | 2021-12-01 | 1 | -0/+488 |
| wrap up crossref refs backfill notes | Bryan Newbold | 2021-11-10 | 1 | -0/+47 |
| update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96 |
| grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43 |
| start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54 |
| old (2020) notes on pdfextract cleanup | Bryan Newbold | 2021-10-04 | 1 | -0/+74 |
| notes on dumping PDF URL lists for partners | Bryan Newbold | 2021-10-04 | 1 | -0/+66 |
| daily OA crawl improvements/notes | Bryan Newbold | 2021-09-08 | 1 | -0/+1021 |
| OAI-PMH patch and ingest improvement notes | Bryan Newbold | 2021-09-03 | 2 | -204/+1578 |
| commit old patch crawl notes (dec 2020) | Bryan Newbold | 2021-09-03 | 1 | -0/+1 |
| commit old arxiv ingest notes | Bryan Newbold | 2021-09-03 | 1 | -0/+12 |
| commit old patch notes (will rework) | Bryan Newbold | 2021-09-03 | 1 | -0/+110 |
| MAG post-crawl stats (5m+ new PDFs crawled successfully) | Bryan Newbold | 2021-09-02 | 1 | -0/+124 |
| MAG and OAI-PMH crawl/processing notes | Bryan Newbold | 2021-08-13 | 2 | -0/+480 |
| 2021-07 unpaywall crawl wrap-up notes | Bryan Newbold | 2021-07-30 | 1 | -12/+108 |
| unpaywall 2021-07 crawl partial notes | Bryan Newbold | 2021-07-14 | 1 | -0/+224 |
| notes on large-domain ingest tweaks | Bryan Newbold | 2021-05-27 | 1 | -0/+480 |
| 2021-04 unpaywall crawl notes | Bryan Newbold | 2021-05-27 | 1 | -0/+368 |
| late-2020 OA DOI crawl ingest notes | Bryan Newbold | 2021-01-04 | 1 | -3/+46 |
| DOAJ crawl ingest stats | Bryan Newbold | 2020-12-31 | 1 | -0/+295 |
| progress notes on OA DOI ingest (still running) | Bryan Newbold | 2020-12-28 | 1 | -11/+102 |
| HTML ingest deployment notes | Bryan Newbold | 2020-12-16 | 1 | -1/+71 |
| unpaywall crawl/ingest update (from Oct 2020) | Bryan Newbold | 2020-12-08 | 1 | -0/+134 |
| commit sept 2020 scielo ingest notes | Bryan Newbold | 2020-12-08 | 1 | -0/+21 |
| add implementation notes about HTML ingest | Bryan Newbold | 2020-11-10 | 1 | -0/+248 |
| fuzzy matching notes | Bryan Newbold | 2020-11-10 | 1 | -0/+148 |
| unpaywall oct 2020 crawl notes | Bryan Newbold | 2020-11-02 | 1 | -45/+82 |
| more notes on unpaywall ingest from last week | Bryan Newbold | 2020-10-27 | 1 | -0/+73 |
| notes on 2020-09 re-ingest passes | Bryan Newbold | 2020-10-17 | 1 | -0/+197 |
| OA DOIs: partial notes | Bryan Newbold | 2020-10-17 | 1 | -0/+218 |
| notes/status on daily ingest | Bryan Newbold | 2020-10-17 | 1 | -0/+193 |
| start 2020-10 ingest notes | Bryan Newbold | 2020-10-11 | 1 | -0/+42 |
| update unpaywall 2020-04 notes | Bryan Newbold | 2020-10-11 | 1 | -0/+32 |
| OAI-PMH ingest progress timestamps | Bryan Newbold | 2020-10-11 | 1 | -0/+13 |
| notes on file_meta task (from august) | Bryan Newbold | 2020-10-01 | 1 | -0/+66 |