summaryrefslogtreecommitdiffstats
path: root/ps01_grep
diff options
context:
space:
mode:
Diffstat (limited to 'ps01_grep')
-rw-r--r--ps01_grep/ps.txt698
-rw-r--r--ps01_grep/regexp.scm230
-rw-r--r--ps01_grep/tests.txt19
3 files changed, 947 insertions, 0 deletions
diff --git a/ps01_grep/ps.txt b/ps01_grep/ps.txt
new file mode 100644
index 0000000..8f0efd4
--- /dev/null
+++ b/ps01_grep/ps.txt
@@ -0,0 +1,698 @@
+
+ MASSACHVSETTS INSTITVTE OF TECHNOLOGY
+ Department of Electrical Engineering and Computer Science
+
+ 6.945 Spring 2009
+ Problem Set 1
+
+ Issued: Wed. 4 Feb. 2009 Due: Wed. 11 Feb. 2009
+
+
+General instructions for problem sets:
+ A problem set for this class generally asks you to do some
+programming. We usually give you a program to read and extend. You
+should turn in your extensions, in the form of clearly annotated code
+and executions that demonstrate its effectiveness. We may also ask
+for short essays explaining an idea. Your answers to these questions
+must be clear and concise: good logic expressed in good English is
+required. Thank you.
+
+Readings:
+ Review SICP chapter 1 (especially section 1.3)
+ MIT/GNU Scheme documentation, section 5.6 (character sets)
+ Debian GNU/Linux info on regular expressions
+ from the grep man page (attached). This is sane.
+
+Code:
+ regexp.scm, tests.txt (both attached)
+ Windows grep: http://gnuwin32.sourceforge.net/packages/grep.htm
+
+Documentation:
+ The MIT/GNU Scheme installation and documentation can
+ be found online at http://www.gnu.org/software/mit-scheme/
+ The reference manual is in:
+ http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/
+
+ The (Insane) POSIX manual page for regular expressions:
+ http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
+
+
+ Regular Expressions
+
+Regular expressions are ubiquitous. But the syntax of the regular-
+expression language is awful. There are various incompatable forms of
+the language and the quotation conventions are baroquen [sic].
+Nevertheless, there is a great deal of useful software, for example
+grep, that uses regular expressions to specify the desired behavior.
+
+Part of the value of this problem set is to experience how bad things
+can be. Although regular expression systems are derived from a
+perfectly good mathematical formalism, the particular choices made by
+implementers to expand the formalism into useful software systems are
+often disastrous: the quotation conventions adopted are highly
+irregular; the particularly egregious misuse of parenthesis, both for
+grouping and for backward reference, is a miracle to behold. In
+addition, attempts to increase the expressive power and address
+shortcomings of earlier designs have led to a proliferation of
+incompatible derivative languages.
+
+In this problem set we will invent both a Lisp-friendly language for
+specifying regular expressions and a means of translating this
+language to conventional regular-expression syntax. The application
+is to be able to use the capabilities of systems like grep from inside
+the Scheme environment.
+
+As with any language there are primitives, means of combination, and
+means of abstraction. Our language allows the construction of
+patterns that utilities like grep can match against character-string
+data. Because this language is embedded in Scheme we inherit all of
+the power of Scheme: we can use Scheme constructs to combine patterns
+and Scheme procedures to abstract them.
+
+Patterns are built out of primitive patterns.
+The primitive patterns are:
+
+ (r:dot)
+ matches any character except newline
+
+ (r:bol)
+ matches only the beginning of a line
+
+ (r:eol)
+ matches only the end of a line
+
+ (r:quote <string>)
+ matches the given string
+
+ (r:char-from <char-set>)
+ matches one character from the specified
+ MIT/GNU Scheme character set
+
+ (r:char-not-from <char-set>)
+ matches one character not from the specified
+ MIT/GNU Scheme character set
+
+ MIT/GNU Scheme provides a variety of character sets,
+ but you can make your own with the Scheme procedures
+ char-set and string->char-set.
+
+Patterns can be combined to make compound patterns:
+
+ (r:seq <pattern> ...)
+ This pattern matches each of the argument patterns in sequence,
+ from left to right
+
+ (r:alt <pattern> ...)
+ This pattern tries each argument pattern from left to right,
+ until one of these alternatives matches. If none matches then
+ this pattern does not match.
+
+ (r:repeat <min> <max> <pattern>)
+ This pattern tries to match the given argument pattern a minimum
+ of min times but no more than a maximum of max times. If max is
+ given as #f then there is no maximum specified. Note that if
+ max=min the given pattern must be matched exactly that many
+ times.
+
+Because these are all Scheme procedures (in the file regexp.scm) you
+can freely mix these with any Scheme code.
+
+Here are some examples:
+
+ Pattern: (r:seq (r:quote "a") (r:dot) (r:quote "c"))
+ Matches: "abc" and "aac" and "acc"
+
+ Pattern: (r:alt (r:quote "foo") (r:quote "bar") (r:quote "baz"))
+ Matches: "foo" and "bar" and "baz"
+
+ Pattern: (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ Matches: "catdogcat" and "catcatdogdog" and "dogdogcatdogdog"
+ but not "catcatcatdogdogdog"
+
+ Pattern: (let ((digit
+ (r:char-from (string->char-set "0123456789"))))
+ (r:seq (r:bol)
+ (r:quote "[")
+ digit
+ digit
+ (r:quote "]")
+ (r:quote ".")
+ (r:quote " ")
+ (r:char-from (char-set #\a #\b))
+ (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ (r:char-not-from (char-set #\d #\e #\f))
+ (r:eol)))
+ Matches: "[09]. acatdogdogcats" but not
+ "[10]. ifacatdogdogs" nor
+ "[11]. acatdogdogsme"
+
+In the file regexp.scm we define an interface to the grep utility,
+which allows a Scheme program to call grep with a regular expression
+(constructed from the given pattern) on a given file name. The grep
+utility extracts lines that contain a substring that can match the
+given pattern.
+
+So, for example: given the test file test.txt (supplied) we can find
+the lines in the file that contain a match to the given pattern.
+
+(pp
+ (r:grep (r:seq " "
+ (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ (r:eol))
+ "tests.txt"))
+("[09]. catdogcat" "[10]. catcatdogdog" "[11]. dogdogcatdogdog")
+;Unspecified return value
+
+Note that the pretty-printer (pp) returns an unspecified value, after
+printing the list of lines that grep found matching the pattern
+specified.
+
+-------------
+Problem 1.1: Warmup
+
+In the traditional regular expression language the asterisk (*)
+operator following a subpattern means zero-or-more copies of the
+subpattern and the plus-sign (+) operator following a subpattern means
+one-or-more copies of the subpattern. Define Scheme procedures r:*
+and r:+ to take a pattern and iterate it as necessary. This can be
+done in terms of r:repeat.
+
+Demonstrate your procedures on real data in complex patterns.
+-------------
+
+
+
+ A Subtle Bug, One Bad Joke, Two Tweaks, and a Revelation
+
+Ben Bitdiddle has noticed a problem with our implementation of
+(r:repeat <min> <max> <pattern>).
+
+The use of (r:seq expr "\\|") at the end of the r:repeat procedure is
+a bit dodgy. This code fragment compiles to an ERE Alternation
+regular expression of the form (expr|). (See 9.4.7 of the POSIX
+regular expression document.)
+
+This relies on the fact that alternation with something and nothing is
+the equivalent of saying "one or none". That is: (expr|) denotes one
+or no instances of expr. Unfortunately, this depends on an
+undocumented GNU extension to the formal POSIX standard for REs.
+
+Specifically, section 9.4.3 states that a vertical line appearing
+immediately before a close parenthesis (or immediately after an open
+parenthesis) produces undefined behavior. In essence, an RE must not
+be a null sequence.
+
+GNU grep just happens to Do The Right Thing (tm) when presented with
+(x|). Not all grep implementations are as tolerant.
+
+Therefore, Ben asks his team of three code hackers (Louis, Alyssa and
+his niece Bonnie) to propose alternative workarounds. Ultimately, he
+proposes his own patch, which you will implement.
+
+
+ * Louis Reasoner suggests that a simple, elegant fix would be to
+ replace the code fragment (r:seq expr "\\|") with a straight-
+ forward call to (r:repeat 0 1 expr).
+
+
+ * Alyssa P. Hacker proposes that an alternative fix would be to
+ rewrite the else clause of r:repeat to compile (r:repeat 3 5 <x>)
+ into the equivalent of (xxx|xxxx|xxxxx) instead of the naughty
+ xxx(x|)(x|) non-POSIX-compliant undefined regular expression.
+ She refers to section 9.4.7 of the POSIX regular expression doc.
+
+
+ * Bonnie Bitdiddle points to the question mark (?) operator in
+ section 9.4.6.4 and proposes that a better fix would be to
+ implement an r:? operator then replace (r:seq expr "\\|")
+ with (r:? expr).
+
+
+ * Meanwhile, Ben looks closely at the RE spec and has a revelation.
+ He proposes that r:repeat be re-implemented to emit Interval
+ expressions. See section 9.3.6.5 of the POSIX documentation.
+ Please try not to get sick.
+
+-------------
+Problem 1.2: The Proposals
+
+Let's very briefly consider each proposal:
+
+a. Everyone in the room immediately giggles at Louis' silly joke.
+ What's so funny about it? That is, what's wrong with this idea?
+
+ A one-sentence punchline will do.
+
+b. What advantages does Bonnie's proposal have over Alyssa's
+ in terms of both code and data?
+
+ A brief, concise yet convincing few sentences suffices.
+
+c. What advantage does Ben's proposal have over all the others?
+ Specifically, ponder which section of the POSIX document he cites
+ versus which sections the others cite, then take a quick peek at
+ Problem 1.5 below and consider the implications. Also, consider
+ the size of the output strings in this new code as well as the
+ overall clarity of the code.
+
+ Again, a brief sentence or two is sufficient.
+
+d. Following Ben's proposal, re-implement r:repeat to emit Interval
+ expressions. Hint: Scheme's number->string procedure should be
+ handy. Caveat: beware the backslashes.
+
+ Show the output it generates on a few well-chosen sample inputs.
+ Demonstrate your procedure on real data in complex patterns.
+-------------
+
+
+ Too Much Nesting
+
+Our program produces excessively nested regular expressions: it makes
+groups even when they are not necessary. For example, the following
+simple pattern leads to an overly complex regular expression:
+
+ (display (r:seq (r:quote "a") (r:dot) (r:quote "c")))
+ \(\(a\).\(c\)\)
+
+Another problem is that BREs may involve back-references. (See
+9.3.6.3 of the POSIX regular expression documentation.) A
+back-reference refers to a previously parenthesized subexpression. So
+it is important that the parenthesized subexpressions be ones
+explicitly placed by the author of the pattern. (Aargh! This is one
+of the worst ideas I (GJS) have ever heard of -- grouping, which is
+necessary for iteration, was confused with naming for later reference.
+What a crock!)
+
+-------------
+Problem 1.3: Optimization
+
+Edit our program to eliminate as much of the unnecessary nesting as
+you can. Caution: there are subtle cases here that you have to watch
+out for. What is such a case? Demonstrate your better version of our
+program and show how it handles the subtleties.
+
+Hint: Our program uses strings as its intermediate representation as
+well as its result. You might consider using a different intermediate
+representation.
+-------------
+
+-------------
+Problem 1.4: Back-references
+
+Add in a procedure for constructing back-references.
+Have fun getting confused about BREs.
+-------------
+
+ Standards?
+
+ The best thing about standards is that
+ there are so many to choose from.
+ Tom Knight
+
+There are also Extended Regular Expressions (EREs) defined in the
+POSIX regular expression documentation. Some software, such as egrep,
+uses this version of regular expressions. Unfortunately EREs are not
+a conservative extension of BREs: ERE syntax is actually inconsistent
+with BRE syntax! It is an interesting project to extend our Scheme
+pattern language so that the target can be either BREs or EREs.
+
+-------------
+Problem 1.5: Ugh!
+
+ a. What are the significant differences between BREs and EREs that
+ make this a pain? List the differences that must be addressed.
+
+ b. How can the back end be factored so that our language can compile
+ into either kind of regular expression, depending on what is needed?
+ How can we maintain the abstract layer that is independent of the
+ target regular expression language? Explain your strategy.
+
+ c. Extend our implementation to have both back ends.
+
+Demonstrate your work by making sure that you can run egrep as well as
+grep, with equivalent results in cases that test the differences you
+found in part a.
+-------------
+
+ End of Problem Set. Reference Material Follows.
+-----------------------------------------------------------------------
+
+The following is an excerpt from the Debian GNU/Linux man page on grep.
+
+REGULAR EXPRESSIONS
+ A regular expression is a pattern that describes a set of strings.
+ Regular expressions are constructed analogously to arithmetic expres-
+ sions, by using various operators to combine smaller expressions.
+
+ Grep understands three different versions of regular expression syn-
+ tax: "basic," "extended," and "perl." In GNU grep, there is no dif-
+ ference in available functionality using either of the first two syn-
+ taxes. In other implementations, basic regular expressions are less
+ powerful. The following description applies to extended regular
+ expressions; differences for basic regular expressions are summarized
+ afterwards. Perl regular expressions add additional functionality,
+ but the implementation used here is undocumented and is not compati-
+ ble with other grep implementations.
+
+ The fundamental building blocks are the regular expressions that
+ match a single character. Most characters, including all letters and
+ digits, are regular expressions that match themselves. Any metachar-
+ acter with special meaning may be quoted by preceding it with a back-
+ slash.
+
+ A bracket expression is a list of characters enclosed by [ and ]. It
+ matches any single character in that list; if the first character of
+ the list is the caret ^ then it matches any character not in the
+ list. For example, the regular expression [0123456789] matches any
+ single digit.
+
+ Within a bracket expression, a range expression consists of two char-
+ acters separated by a hyphen. It matches any single character that
+ sorts between the two characters, inclusive, using the locale's col-
+ lating sequence and character set. For example, in the default C
+ locale, [a-d] is equivalent to [abcd]. Many locales sort characters
+ in dictionary order, and in these locales [a-d] is typically not
+ equivalent to [abcd]; it might be equivalent to [aBbCcDd], for exam-
+ ple. To obtain the traditional interpretation of bracket expres-
+ sions, you can use the C locale by setting the LC_ALL environment
+ variable to the value C.
+
+ Finally, certain named classes of characters are predefined within
+ bracket expressions, as follows. Their names are self explanatory,
+ and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:],
+ [:lower:], [:print:], [:punct:], [:space:], [:upper:], and
+ [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z], except the
+ latter form depends upon the C locale and the ASCII character encod-
+ ing, whereas the former is independent of locale and character set.
+ (Note that the brackets in these class names are part of the symbolic
+ names, and must be included in addition to the brackets delimiting
+ the bracket list.) Most metacharacters lose their special meaning
+ inside lists. To include a literal ] place it first in the list.
+ Similarly, to include a literal ^ place it anywhere but first.
+ Finally, to include a literal - place it last.
+
+ The period . matches any single character. The symbol \w is a syn-
+ onym for [[:alnum:]] and \W is a synonym for [^[:alnum]].
+
+ The caret ^ and the dollar sign $ are metacharacters that respec-
+ tively match the empty string at the beginning and end of a line.
+ The symbols \< and \> respectively match the empty string at the
+ beginning and end of a word. The symbol \b matches the empty string
+ at the edge of a word, and \B matches the empty string provided it's
+ not at the edge of a word.
+
+ A regular expression may be followed by one of several repetition
+ operators:
+ ? The preceding item is optional and matched at most once.
+ * The preceding item will be matched zero or more times.
+ + The preceding item will be matched one or more times.
+ {n} The preceding item is matched exactly n times.
+ {n,} The preceding item is matched n or more times.
+ {n,m} The preceding item is matched at least n times, but not more
+ than m times.
+
+ Two regular expressions may be concatenated; the resulting regular
+ expression matches any string formed by concatenating two substrings
+ that respectively match the concatenated subexpressions.
+
+ Two regular expressions may be joined by the infix operator |; the
+ resulting regular expression matches any string matching either
+ subexpression.
+
+ Repetition takes precedence over concatenation, which in turn takes
+ precedence over alternation. A whole subexpression may be enclosed
+ in parentheses to override these precedence rules.
+
+ The backreference \n, where n is a single digit, matches the sub-
+ string previously matched by the nth parenthesized subexpression of
+ the regular expression.
+
+ In basic regular expressions the metacharacters ?, +, {, |, (, and )
+ lose their special meaning; instead use the backslashed versions \?,
+ \+, \{, \|, \(, and \).
+
+ Traditional egrep did not support the { metacharacter, and some egrep
+ implementations support \{ instead, so portable scripts should avoid
+ { in egrep patterns and should use [{] to match a literal {.
+
+ GNU egrep attempts to support traditional usage by assuming that { is
+ not special if it would be the start of an invalid interval specifi-
+ cation. For example, the shell command egrep '{1' searches for the
+ two-character string {1 instead of reporting a syntax error in the
+ regular expression. POSIX.2 allows this behavior as an extension,
+ but portable scripts should avoid it.
+
+GNU Project 2002/01/22 GREP(1)
+
+;;;; Scheme Regular Expression Language Implementation -- regexp.scm
+
+(define (r:dot) ".")
+
+(define (r:bol) "^")
+
+(define (r:eol) "$")
+
+(define (r:quote string)
+ (r:seq
+ (call-with-output-string ; see RefMan section 14.3
+ (lambda (port)
+ (let ((end (string-length string)))
+ (do ((i 0 (+ i 1))) ((not (< i end))) ; see RefMan 2.9
+ (let ((c (string-ref string i)))
+ (if (or (char=? c #\.)
+ (char=? c #\[)
+ (char=? c #\\)
+ (char=? c #\^)
+ (char=? c #\$)
+ (char=? c #\*))
+ (write-char #\\ port))
+ (write-char c port))))))))
+
+(define (r:char-from char-set) ; see RefMan section 5.6
+ (let ((members (char-set-members char-set)))
+ (cond ((not (pair? members))
+ (r:seq))
+ ((not (pair? (cdr members)))
+ (r:quote (string (car members))))
+ (else
+ (%char-from #f members)))))
+
+(define (r:char-not-from char-set)
+ (%char-from #t (char-set-members char-set)))
+
+(define (%char-from negate? members)
+ (let ((right? (memv #\] members))
+ (caret? (memv #\^ members))
+ (hyphen? (memv #\- members))
+ (others
+ (delete-matching-items members
+ (lambda (c)
+ (or (char=? c #\])
+ (char=? c #\^)
+ (char=? c #\-))))))
+ (if (and caret?
+ hyphen?
+ (not right?)
+ (not negate?)
+ (null? others))
+ "[-^]"
+ (string-append "["
+ (if negate? "^" "")
+ (if right? "]" "")
+ (list->string others)
+ (if caret? "^" "")
+ (if hyphen? "-" "")
+ "]"))))
+
+;;; Means of combination for patterns
+
+(define (r:seq . exprs)
+ (string-append "\\(" (apply string-append exprs) "\\)"))
+
+(define (r:alt . exprs)
+ (if (pair? exprs)
+ (apply r:seq
+ (cons (car exprs)
+ (append-map (lambda (expr)
+ (list "\\|" expr))
+ (cdr exprs))))
+ (r:seq)))
+
+(define (r:repeat min max expr)
+ (if (not (exact-nonnegative-integer? min))
+ (error "Min must be non-negative integer:" min))
+ (if max
+ (begin
+ (if (not (exact-nonnegative-integer? max))
+ (error "Max must be non-negative integer:" max))
+ (if (not (<= min max))
+ (error "Min not less than max:" min max))))
+ (cond ((not max)
+ (apply r:seq
+ (append (make-list min expr)
+ (list expr "*"))))
+ ((= max min)
+ (apply r:seq (make-list min expr)))
+ (else
+ (apply r:seq
+ (append (make-list min expr)
+ (make-list (- max min)
+ (r:seq expr "\\|")))))))
+
+;;; The following magic allows a program in MIT/GNU Scheme to call the
+;;; grep system utility, returning the list of grep output lines to
+;;; the caller. You can make similar mechanisms to call other system
+;;; utilities.
+
+(load-option 'synchronous-subprocess)
+
+(define (r:grep expr filename)
+ (r:grep-like "grep" '() expr filename))
+
+(define (r:egrep expr filename)
+ (if (eq? microcode-id/operating-system 'nt)
+ (r:grep-like "grep" '("-E") expr filename)
+ (r:grep-like "egrep" '() expr filename)))
+
+(define (r:grep-like program options expr filename)
+ (let ((port (open-output-string)))
+ (and (= (run-synchronous-subprocess program
+ (append options
+ (list "-e" expr (->namestring filename)))
+ 'output port)
+ 0)
+ (r:split-lines (get-output-string port)))))
+
+(define (r:split-lines string)
+ (reverse
+ (let ((end (string-length string)))
+ (let loop ((i 0) (lines '()))
+ (if (< i end)
+ (let ((j
+ (substring-find-next-char string i end #\newline)))
+ (if j
+ (loop (+ j 1)
+ (cons (substring string i j) lines))
+ (cons (substring string i end) lines)))
+ lines)))))
+
+#|
+;;; An alternate implementation using MIT/GNU Scheme's internal
+;;; regular-expression interpreter.
+
+(define (r:grep expr filename)
+ (call-with-input-file filename
+ (lambda (port)
+ (let loop ((lines '()))
+ (let ((line (read-line port)))
+ (if (eof-object? line)
+ (reverse lines)
+ (loop (if (re-string-search-forward expr line #f)
+ (cons line lines)
+ lines))))))))
+|#
+
+#|
+;;; For example...
+
+;;; Note, the result of the next two requests were not in this file
+;;; when the requests were made!
+
+(pp (r:grep (r:quote "r:sex") "regexp.scm"))
+("(pp (r:grep (r:quote \"r:sex\") \"regexp.scm\"))")
+;Unspecified return value
+
+(pp (r:grep (r:quote "r:seq") "regexp.scm"))
+(" (r:seq"
+ "\t (r:seq))"
+ "(define (r:seq . exprs)"
+ " (apply r:seq"
+ " (r:seq)))"
+ "\t (apply r:seq"
+ "\t (apply r:seq (make-list min expr)))"
+ "\t (apply r:seq"
+ "\t\t\t\t (r:seq expr \"\\\\|\")))))))"
+ "(pp (r:grep (r:quote \"r:seq\") \"regexp.scm\"))"
+ "(pp (r:grep (r:seq (r:quote \"a\") (r:dot) (r:quote \"c\")) \"tests.txt\"))"
+ " (r:grep (r:seq \" \""
+ " (r:seq (r:bol)")
+;Unspecified return value
+
+(pp (r:grep (r:seq (r:quote "a") (r:dot) (r:quote "c")) "tests.txt"))
+("[00]. abc"
+ "[01]. aac"
+ "[02]. acc"
+ "[03]. zzzaxcqqq"
+ "[10]. catcatdogdog"
+ "[12]. catcatcatdogdogdog")
+;Unspecified return value
+
+;;; And...
+
+(pp (r:grep (r:alt (r:quote "foo") (r:quote "bar") (r:quote "baz"))
+ "tests.txt"))
+("[05]. foo" "[06]. bar" "[07]. foo bar baz quux")
+;Unspecified return value
+
+
+(pp (r:grep (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ "tests.txt"))
+("[09]. catdogcat"
+ "[10]. catcatdogdog"
+ "[11]. dogdogcatdogdog"
+ "[12]. catcatcatdogdogdog"
+ "[13]. acatdogdogcats"
+ "[14]. ifacatdogdogs"
+ "[15]. acatdogdogsme")
+;Unspecified return value
+
+(pp
+ (r:grep (r:seq " "
+ (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ (r:eol))
+ "tests.txt"))
+("[09]. catdogcat" "[10]. catcatdogdog" "[11]. dogdogcatdogdog")
+;Unspecified return value
+
+(pp
+ (r:grep
+ (let ((digit
+ (r:char-from (string->char-set "0123456789"))))
+ (r:seq (r:bol)
+ (r:quote "[")
+ digit
+ digit
+ (r:quote "]")
+ (r:quote ".")
+ (r:quote " ")
+ (r:char-from (char-set #\a #\b))
+ (r:repeat 3 5 (r:alt "cat" "dog"))
+ (r:char-not-from (char-set #\d #\e #\f))
+ (r:eol)))
+ "tests.txt"))
+("[13]. acatdogdogcats")
+;Unspecified return value
+|#
+
+;;; This is the file tests.txt
+
+[00]. abc
+[01]. aac
+[02]. acc
+[03]. zzzaxcqqq
+[04]. abdabec
+
+[05]. foo
+[06]. bar
+[07]. foo bar baz quux
+[08]. anything containing them
+
+[09]. catdogcat
+[10]. catcatdogdog
+[11]. dogdogcatdogdog
+[12]. catcatcatdogdogdog
+
+[13]. acatdogdogcats
+[14]. ifacatdogdogs
+[15]. acatdogdogsme
diff --git a/ps01_grep/regexp.scm b/ps01_grep/regexp.scm
new file mode 100644
index 0000000..a7f2305
--- /dev/null
+++ b/ps01_grep/regexp.scm
@@ -0,0 +1,230 @@
+;;;; Scheme Regular Expression Language Implementation -- regexp.scm
+
+(define (r:dot) ".")
+
+(define (r:bol) "^")
+
+(define (r:eol) "$")
+
+(define (r:quote string)
+ (r:seq
+ (call-with-output-string ; see RefMan section 14.3
+ (lambda (port)
+ (let ((end (string-length string)))
+ (do ((i 0 (+ i 1))) ((not (< i end))) ; see RefMan 2.9
+ (let ((c (string-ref string i)))
+ (if (or (char=? c #\.)
+ (char=? c #\[)
+ (char=? c #\\)
+ (char=? c #\^)
+ (char=? c #\$)
+ (char=? c #\*))
+ (write-char #\\ port))
+ (write-char c port))))))))
+
+(define (r:char-from char-set) ; see RefMan section 5.6
+ (let ((members (char-set-members char-set)))
+ (cond ((not (pair? members))
+ (r:seq))
+ ((not (pair? (cdr members)))
+ (r:quote (string (car members))))
+ (else
+ (%char-from #f members)))))
+
+(define (r:char-not-from char-set)
+ (%char-from #t (char-set-members char-set)))
+
+(define (%char-from negate? members)
+ (let ((right? (memv #\] members))
+ (caret? (memv #\^ members))
+ (hyphen? (memv #\- members))
+ (others
+ (delete-matching-items members
+ (lambda (c)
+ (or (char=? c #\])
+ (char=? c #\^)
+ (char=? c #\-))))))
+ (if (and caret?
+ hyphen?
+ (not right?)
+ (not negate?)
+ (null? others))
+ "[-^]"
+ (string-append "["
+ (if negate? "^" "")
+ (if right? "]" "")
+ (list->string others)
+ (if caret? "^" "")
+ (if hyphen? "-" "")
+ "]"))))
+
+;;; Means of combination for patterns
+
+(define (r:seq . exprs)
+ (string-append "\\(" (apply string-append exprs) "\\)"))
+
+(define (r:alt . exprs)
+ (if (pair? exprs)
+ (apply r:seq
+ (cons (car exprs)
+ (append-map (lambda (expr)
+ (list "\\|" expr))
+ (cdr exprs))))
+ (r:seq)))
+
+(define (r:repeat min max expr)
+ (if (not (exact-nonnegative-integer? min))
+ (error "Min must be non-negative integer:" min))
+ (if max
+ (begin
+ (if (not (exact-nonnegative-integer? max))
+ (error "Max must be non-negative integer:" max))
+ (if (not (<= min max))
+ (error "Min not less than max:" min max))))
+ (cond ((not max)
+ (apply r:seq
+ (append (make-list min expr)
+ (list expr "*"))))
+ ((= max min)
+ (apply r:seq (make-list min expr)))
+ (else
+ (apply r:seq
+ (append (make-list min expr)
+ (make-list (- max min)
+ (r:seq expr "\\|")))))))
+
+;;; The following magic allows a program in MIT/GNU Scheme to call the
+;;; grep system utility, returning the list of grep output lines to
+;;; the caller. You can make similar mechanisms to call other system
+;;; utilities.
+
+(load-option 'synchronous-subprocess)
+
+
+(define (r:grep expr filename)
+ (r:grep-like "grep" '() expr filename))
+
+(define (r:egrep expr filename)
+ (if (eq? microcode-id/operating-system 'nt)
+ (r:grep-like "grep" '("-E") expr filename)
+ (r:grep-like "egrep" '() expr filename)))
+
+(define (r:grep-like program options expr filename)
+ (let ((port (open-output-string)))
+ (and (= (run-synchronous-subprocess program
+ (append options
+ (list "-e" expr (->namestring filename)))
+ 'output port)
+ 0)
+ (r:split-lines (get-output-string port)))))
+
+(define (r:split-lines string)
+ (reverse
+ (let ((end (string-length string)))
+ (let loop ((i 0) (lines '()))
+ (if (< i end)
+ (let ((j
+ (substring-find-next-char string i end #\newline)))
+ (if j
+ (loop (+ j 1)
+ (cons (substring string i j) lines))
+ (cons (substring string i end) lines)))
+ lines)))))
+
+#|
+;;; An alternate implementation using MIT/GNU Scheme's internal
+;;; regular-expression interpreter.
+
+(define (r:grep expr filename)
+ (call-with-input-file filename
+ (lambda (port)
+ (let loop ((lines '()))
+ (let ((line (read-line port)))
+ (if (eof-object? line)
+ (reverse lines)
+ (loop (if (re-string-search-forward expr line #f)
+ (cons line lines)
+ lines))))))))
+|#
+
+#|
+;;; For example...
+
+;;; Note, the result of the next two requests were not in this file
+;;; when the requests were made!
+
+(pp (r:grep (r:quote "r:sex") "regexp.scm"))
+("(pp (r:grep (r:quote \"r:sex\") \"regexp.scm\"))")
+;Unspecified return value
+
+(pp (r:grep (r:quote "r:seq") "regexp.scm"))
+(" (r:seq"
+ "\t (r:seq))"
+ "(define (r:seq . exprs)"
+ " (apply r:seq"
+ " (r:seq)))"
+ "\t (apply r:seq"
+ "\t (apply r:seq (make-list min expr)))"
+ "\t (apply r:seq"
+ "\t\t\t\t (r:seq expr \"\\\\|\")))))))"
+ "(pp (r:grep (r:quote \"r:seq\") \"regexp.scm\"))"
+ "(pp (r:grep (r:seq (r:quote \"a\") (r:dot) (r:quote \"c\")) \"tests.txt\"))"
+ " (r:grep (r:seq \" \""
+ " (r:seq (r:bol)")
+;Unspecified return value
+
+(pp (r:grep (r:seq (r:quote "a") (r:dot) (r:quote "c")) "tests.txt"))
+("[00]. abc"
+ "[01]. aac"
+ "[02]. acc"
+ "[03]. zzzaxcqqq"
+ "[10]. catcatdogdog"
+ "[12]. catcatcatdogdogdog")
+;Unspecified return value
+
+;;; And...
+
+(pp (r:grep (r:alt (r:quote "foo") (r:quote "bar") (r:quote "baz"))
+ "tests.txt"))
+("[05]. foo" "[06]. bar" "[07]. foo bar baz quux")
+;Unspecified return value
+
+
+(pp (r:grep (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ "tests.txt"))
+("[09]. catdogcat"
+ "[10]. catcatdogdog"
+ "[11]. dogdogcatdogdog"
+ "[12]. catcatcatdogdogdog"
+ "[13]. acatdogdogcats"
+ "[14]. ifacatdogdogs"
+ "[15]. acatdogdogsme")
+;Unspecified return value
+
+(pp
+ (r:grep (r:seq " "
+ (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog")))
+ (r:eol))
+ "tests.txt"))
+("[09]. catdogcat" "[10]. catcatdogdog" "[11]. dogdogcatdogdog")
+;Unspecified return value
+
+(pp
+ (r:grep
+ (let ((digit
+ (r:char-from (string->char-set "0123456789"))))
+ (r:seq (r:bol)
+ (r:quote "[")
+ digit
+ digit
+ (r:quote "]")
+ (r:quote ".")
+ (r:quote " ")
+ (r:char-from (char-set #\a #\b))
+ (r:repeat 3 5 (r:alt "cat" "dog"))
+ (r:char-not-from (char-set #\d #\e #\f))
+ (r:eol)))
+ "tests.txt"))
+("[13]. acatdogdogcats")
+;Unspecified return value
+|# \ No newline at end of file
diff --git a/ps01_grep/tests.txt b/ps01_grep/tests.txt
new file mode 100644
index 0000000..62d2ec9
--- /dev/null
+++ b/ps01_grep/tests.txt
@@ -0,0 +1,19 @@
+[00]. abc
+[01]. aac
+[02]. acc
+[03]. zzzaxcqqq
+[04]. abdabec
+
+[05]. foo
+[06]. bar
+[07]. foo bar baz quux
+[08]. anything containing them
+
+[09]. catdogcat
+[10]. catcatdogdog
+[11]. dogdogcatdogdog
+[12]. catcatcatdogdogdog
+
+[13]. acatdogdogcats
+[14]. ifacatdogdogs
+[15]. acatdogdogsme