# README.extract_license

Source code license extraction scripts

# PURPOSE:
The set of scripts is intended to scan a large repository of source
files for comments which contain license or copyright information.

# DESIGN:
The license scan is split into two steps:
1. The "extract_comments" tool scans the source files for all comments
   and stores this information in a database-like file (the information
   recorded per file is the "file name", a set of "attributes" and the
   array of comments).
2. The "extract_license" script iterates over all file entries in the
   database and returns those comments which match the provided
   filename and license patterns.

# USAGE:
1. Run the "extract_comments" script for all *.c and *.h files in the
   directory unpack/ and build the "extraced_comments.cpv" database
   from it:
   $ find unpack -name \*.[ch] | ksh93 extract_comments.ksh
   Notes:
   - Multiple modules/subdirs can be scanned in one step, and one of
     --acceptfilepattern/--rejectfilepattern may be used later to
     restrict the output of "extract_license" to one module/subdir or
     type of file.
   - The "extract_comments" script is slow as molasses. That is why
     comment extraction and license filtering are split into two steps
     (to avoid the pain of doing both steps over and over again). The
     recommendation is to let "extract_comments" crawl all source files
     once and then use the pattern filters in "extract_license" to pick
     the information you need.

2. Process the database file and extract all comments which contain the
   words "license" or "copyright" (this is the default search pattern
   for the --acceptcommentpattern option of "extract_license") and send
   the output to the file "report.txt":
   $ ksh93 extract_license.ksh >report.txt

The output of "extract_license.ksh" can be modified using various
patterns (using an extended pattern syntax which supports shell, regex,
perl etc. pattern systems in a unified API; see NOTES below for
syntax+usage). These patterns may be used to reduce the list of files
which are checked for comments, or to change the pattern which is used
to determine whether a comment contains license/copyright information
or not.
Currently supported pattern matching options are:
    -l, --acceptfilepattern=pattern
                    Process only files which match pattern.
    -L, --rejectfilepattern=pattern
                    Process only files which do not match pattern.
    -c, --acceptcommentpattern=pattern
                    Match comments which match pattern.
                    Defaults to ~(Ei)(license|copyright)
    -C, --rejectcommentpattern=pattern
                    Discard comments which match pattern.
                    Defaults to ""

# NOTES:
- The scripts REQUIRE the ksh93-integration update1 (e.g.
  >= ast-ksh.2008-02-02) binaries (or better). Precompiled binaries for
  Solaris 11 >= B72 can be found at
  http://www.opensolaris.org/os/project/ksh93-integration/downloads/2008-02-29/ ,
  on demand I can make binaries for older Solaris releases and/or Linux.

- All scripts support the "--man" option to get an online manual page.

- Extended pattern syntax:
  "extract_license" uses shell pattern syntax by default, however this
  can be overridden using a ~(<type><modifiers>) prefix.
  - Supported values for <type> are (only one can be used at the same
    time):
    - 'E': Extended regular expressions (like /usr/xpg4/bin/egrep)
    - 'F': Fixed string literal (like /usr/xpg4/bin/fgrep)
    - 'G': Basic regular expressions (like /usr/xpg4/bin/grep)
    - 'S': Shell pattern
    - 'K': KornShell pattern
    - 'P': Perl5 regular expressions
  - Supported values for <modifiers> are (multiple modifiers can be
    specified):
    - 'i': Do case-insensitive matching
    - 'l': left anchor (like '^' in extended regular expression pattern)
    - 'r': right anchor (like '$' in extended regular expression pattern)
  - Examples:
    1. The pattern ~(Ei)foo is equivalent to
       $ /usr/xpg4/bin/egrep -i "foo"
       i.e. it will match all strings which contain the string "foo" in
       a case-insensitive manner (e.g. "foo", "FOO", "FoO", "fOo" etc.)
    2. The pattern ~(Er)test is equivalent to
       $ /usr/xpg4/bin/egrep 'test$'
       i.e. it will match all strings which end with "test".
    3. The pattern ~(Si)fish*chicken works like the normal shell pattern
       "fish*chicken" except that it uses case-insensitive matching.

# EOF:
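
The grep equivalences given in the pattern examples can be tried out
without ksh93. The following is only a rough sketch in plain POSIX
shell: "grep -E" stands in for /usr/xpg4/bin/egrep, and the sample
strings and the helper "matches_Si" are made up for illustration (they
are not part of the scripts). The case-insensitive shell pattern is
emulated by lower-casing the input, which is an assumption about how
~(Si) behaves, not how "extract_license" implements it.

```shell
# Hypothetical sample input: one candidate string per line.
samples='foo
FOO
seafood
bar
unit test
testing'

# ~(Ei)foo  ~  egrep -i "foo": case-insensitive match anywhere in the string
match_Ei=$(printf '%s\n' "$samples" | grep -E -i 'foo')

# ~(Er)test ~  egrep 'test$': the match has to end at the end of the string
match_Er=$(printf '%s\n' "$samples" | grep -E 'test$')

# ~(Si)fish*chicken ~ shell pattern "fish*chicken", ignoring case.
# Emulated by lower-casing the input before a plain "case" match.
matches_Si() {
    lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $lowered in
        fish*chicken) return 0 ;;
        *)            return 1 ;;
    esac
}

printf '~(Ei)foo matches:\n%s\n' "$match_Ei"
printf '~(Er)test matches:\n%s\n' "$match_Er"
if matches_Si 'Fish and CHICKEN'; then
    printf '~(Si)fish*chicken matches: Fish and CHICKEN\n'
fi
```

Note that the 'r' modifier only adds the right anchor; case-insensitive
matching still needs the separate 'i' modifier, e.g. ~(Eri)test.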