# README.extract_license
Source code license extraction scripts


# PURPOSE:
This set of scripts is intended to scan a large repository of source
files for license and copyright information.


# DESIGN:
The license scan is split into two steps:
1. The "extract_comments" tool scans the source files for all comments
  and stores this information in a database-like file (the recorded
  information per file is the "file name", a set of "attributes" and
  the array of comments).
2. The "extract_license" script iterates over all file entries
  in the database and return those comments which match the provided
  filename and license patterns.
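
The two-step split can be sketched with a toy stand-in. This is only an
illustration under stated assumptions, not the real scripts: the function
names "toy_extract_comments" and "toy_extract_license" are hypothetical,
the tab-separated "database" is an assumed format (the real file format
is not described here), and only single-line /* ... */ comments are
handled.

```shell
# Step 1 (toy): read file names on stdin and write one
# "filename<TAB>comment" record per line holding a /* ... */ comment.
toy_extract_comments() {
    while IFS= read -r file; do
        awk 'match($0, /\/\*.*\*\//) {
                 printf "%s\t%s\n", FILENAME, substr($0, RSTART, RLENGTH)
             }' "$file"
    done
}

# Step 2 (toy): $1 is the database file; keep records whose comment
# field mentions "license" or "copyright", ignoring case.
toy_extract_license() {
    awk -F '\t' 'tolower($2) ~ /license|copyright/' "$1"
}

# Demo: build the database once, then filter it as often as needed.
tmp=$(mktemp -d)
printf '%s\n' '/* Copyright 2008 Example */' 'int x; /* counter */' >"$tmp/a.c"
printf '%s\n' "$tmp/a.c" | toy_extract_comments >"$tmp/db.tsv"
toy_extract_license "$tmp/db.tsv"   # prints only the copyright record
rm -rf "$tmp"
```

The slow step runs once per source tree, while the cheap filter step can
be repeated with different patterns against the same database file.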


# USAGE:
1. Run the "extract_comments" script for all *.c and *.h files in the
  directory unpack/ and build the "extraced_comments.cpv" database
  from it:
  $ find unpack -name '*.[ch]' | ksh93 extract_comments.ksh
  Notes:
  - Multiple modules/subdirs can be scanned in one step,
    and one of --acceptfilepattern/--rejectfilepattern may be used
    to restrict the output of "extract_license" later to one
    module/subdir or type of file.
  - The "extract_comments" script is slow as molasses. That is why
    comment extraction and license filtering are split into two steps
    (to avoid the pain of doing both steps over and over again).
    The recommendation is to let "extract_comments" just crawl all
    source files and then use the pattern filters in "extract_license"
    to pick the information you need.
2. Process the database file and extract all comments which contain the
  words "license" or "copyright" (this is the default search pattern
  for "extract_license"'s --acceptcommentpattern option) and send
  the output to the file "report.txt":
  $ ksh93 extract_license.ksh >report.txt

The output of "extract_license.ksh" can be modified using various
patterns (using an extended pattern syntax which supports shell,
regex, perl etc. pattern systems in a unified API; see NOTES below
for syntax and usage). These patterns can narrow the list of files
which are checked for comments, or change the pattern which is used
to determine whether a comment contains license/copyright information
or not.
Currently supported pattern matching options are:
  -l, --acceptfilepattern=pattern
                  Process only files which match pattern.
  -L, --rejectfilepattern=pattern
                  Process only files which do not match pattern.
  -c, --acceptcommentpattern=pattern
                  Match comments which match pattern.
                  Defaults to ~(Ei)(license|copyright)
  -C, --rejectcommentpattern=pattern
                  Discard comments which match pattern.
                  Defaults to ""
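
How an accept pattern and a reject pattern might combine can be
sketched with grep(1). This is a hedged illustration only: the helper
"filter_comments" is hypothetical, it assumes a comment is kept when it
matches the accept pattern and does not match a non-empty reject
pattern, and it always ignores case (the real options take the
modifiers from the pattern prefix instead).

```shell
# Hypothetical helper, not part of the scripts: keep stdin lines that
# match the accept ERE ($1) and, if a non-empty reject ERE ($2) is
# given, do not match it. Matching ignores case.
filter_comments() {
    accept=$1
    reject=${2:-}
    if [ -n "$reject" ]; then
        grep -E -i -e "$accept" | grep -E -i -v -e "$reject"
    else
        grep -E -i -e "$accept"
    fi
}

printf '%s\n' 'Copyright 2008 Foo' 'GPL license text' 'no match here' |
    filter_comments 'license|copyright' 'gpl'
# prints: Copyright 2008 Foo
```

An empty reject pattern is treated as "reject nothing", which mirrors
the "" default of --rejectcommentpattern.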


# NOTES:
- The scripts REQUIRE the ksh93-integration update1 binaries
  (e.g. >= ast-ksh.2008-02-02) or better. Precompiled
  binaries for Solaris 11 >= B72 can be found at
  http://www.opensolaris.org/os/project/ksh93-integration/downloads/2008-02-29/
  On demand I can make binaries for older Solaris releases
  and/or Linux.
- All scripts support the "--man" option to get an online manual page.
- Extended pattern syntax:
  "extract_license" uses shell pattern syntax by default, however
  this can be overridden using a ~(<mode><modifier>) prefix.
  - Supported values for <mode> are (only one can be used at a
    time):
    - 'E': Extended regular expressions (like /usr/xpg4/bin/egrep)
    - 'F': Fixed string literal (like /usr/xpg4/bin/fgrep)
    - 'G': Basic regular expressions (like /usr/xpg4/bin/grep)
    - 'S': Shell pattern
    - 'K': KornShell pattern
    - 'P': Perl5 regular expressions
  - Supported values for <modifier> are (multiple modifiers can be
    specified):
    - 'i': Do case-insensitive matching
    - 'l': left anchor (like '^' in extended regular expression
           pattern)
    - 'r': right anchor (like '$' in extended regular expression
           pattern)
  - Examples:
    1. The pattern ~(Ei)foo is equivalent to
      $ /usr/xpg4/bin/egrep -i "foo" #, i.e. it will match
      all strings which contain the string "foo" in a
      case-insensitive manner (e.g. "foo", "FOO", "FoO", "fOo" etc.)
    2. The pattern ~(Er)test is equivalent to
      $ /usr/xpg4/bin/egrep 'test$' #, i.e. it will
      match all strings which end with "test".
    3. The pattern ~(Si)fish*chicken works like the normal shell
      pattern "fish*chicken" except that it uses case-insensitive
      matching.
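
The regex examples above can be tried without ksh93 by using grep(1)
stand-ins. This is a sketch only: the function names are hypothetical,
and the functions merely mimic the matching rules described in this
section rather than calling the ksh93 pattern engine.

```shell
# Hypothetical grep(1) stand-ins for patterns from this README; they
# only illustrate the matching rules and are not part of the scripts.

# ~(Ei)foo: extended regex, case-insensitive, matches anywhere
match_Ei_foo() { grep -E -i -e 'foo'; }

# ~(Er)test: extended regex, anchored at the right end, case-sensitive
match_Er_test() { grep -E -e 'test$'; }

# ~(Ei)(license|copyright): the default --acceptcommentpattern
match_default_accept() { grep -E -i -e '(license|copyright)'; }

printf '%s\n' 'foo' 'FoO bar' 'baz' | match_Ei_foo    # foo, FoO bar
printf '%s\n' 'mytest' 'testing' | match_Er_test      # mytest
printf '%s\n' 'Copyright 2008' 'no match' | match_default_accept
# prints: Copyright 2008
```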
      
# EOF: