Journal of Pragmatics
The language of comments in computer software: A sublanguage of English
Letha H. Etzkorn, Carl G. Davis, Lisa L. Bowen
1999; revised version
Abstract
A sublanguage is a subset of a natural language such as the English language. Sublanguages tend to emerge gradually through the use of a language in various fields by specialists in those fields. Some such sublanguages are the ‘language of biophysics’ and the ‘language of naval telegraphic transmissions’. This paper explores whether English-language comments in object-oriented software can be considered to be a sublanguage of English, using standard criteria for sublanguage determination. To make this determination, the article looks at the grammatical content of comments, including sentence-style comments versus non-sentence-style comments, and the use of tense, mood, and voice in sentence-style comments. The telegraphic nature of comments is also examined. Additionally, the subject matter of comments is analyzed in terms of the purpose of comments in describing the operation of computer software. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Computer; needs; Reusability
1. Introduction
A sublanguage is a subset of a natural language. It is not known how many sublanguages exist in a given language. Sublanguages emerge gradually through the use of a language in various fields by specialists in those fields. Some such sublanguages are “the language of biophysics”, “the language of car repair manuals”, and “the language of naval telegraphic transmissions” (Kittredge and Lehrberger, 1982; Grishman and Kittredge, 1986).
L.H. Etzkorn et al. / Journal of Pragmatics 33 (2001) 1731-1756
In many ways, the processing of sublanguages by computer has been more successful than the processing of natural language as a whole. For example, it is only within the domain of sublanguages that automatic translation has been made practical due to the simplified structure of the sublanguage compared with the structure of the language as a whole (Lehrberger, 1982; Lehrberger and Bourbeau, 1988). Comments in computer software are natural language, textual descriptions of the software. Comments are located within the computer software, typically very close to the computer code that they describe. Good software development practice requires that a sufficient number of comments will be included in computer software in order to adequately document the operation of the software. Comments in computer software provide a guide to someone who may not be familiar with the software. They are designed to help and guide someone to determine what a piece of software is doing, how it operates and any other helpful hints that will increase a person’s ability to understand what is going on. Computer software tends to become larger and more complex with time. The common software packages of today are much more complex than those of ten years ago. Also, many people are involved in testing, modifying, and improving the software, with the result that nearly all commercial software is ‘written by a committee’. So software can be very difficult to understand. As modifications and bug fixes are required after the initial software release, the original developers may not be available, and new people must spend considerable effort to analyze the software in order to fix it or update it. This can be a very difficult task, and the time required for this task can sometimes be prohibitive. 
Because of this understanding problem, it can sometimes be easier and faster to design and write new software for a particular task than to spend the time required to understand the existing software that performs that task so that it can be reused. Therefore, the ability to use a computer to analyze existing computer software and automatically determine its characteristics has been a long-term computer science research goal. Most research related to automatically understanding computer software has concentrated on analysis of the computer code itself, since that is a more tractable problem than the analysis of comments (Abd-El-Hafiz and Basili, 1996; Ning et al., 1994; Rich and Wills, 1990). Comment sentences are written in natural language, most commonly the English language. Therefore, the problem of automatically analyzing and understanding comment sentences has been considered in the past to be equivalent to the problem of automatically analyzing and understanding the English language, which is indeed a very difficult problem. Since the comments often include information not provided in the computer code itself - from the computer code what task is being done can be determined, but it is only from the comments that why that task is being done can be understood - it can be more important to understand the comments than to understand the computer code.
The study described in this article examines English-language comments in computer software to determine whether they can be treated as a sublanguage of English. The requirement for this study arose from the desire to allow a computer to use its ability to analyze natural language comments in order to understand what the software is supposed to do. This was an early step in the development of an automated tool intended to determine whether certain object-oriented software components were easily reusable. The ability to analyze comments with a natural language processing system led to the ability to determine whether a particular software component can be useful when reused in a certain domain of interest (Etzkorn and Davis, 1996, 1997). The need to keep the processing as simple as possible raised the question of whether comments can be treated as a sublanguage.
In this paper, Section 2 reviews the linguistic and natural language processing literature on sublanguages, and presents the sublanguage classification criteria which are used for this study. It also briefly discusses the utility of certain categories of sublanguages when analyzed by natural language processing systems. Section 3 describes the experiments conducted for this study, and discusses their results. Section 4 provides conclusions and suggests implications for future research.
2. The character of sublanguages
Sublanguage analyses most often have dealt with texts limited to specific fields, such as technical or scientific fields. Allen (1995) discusses the problems involved with word ambiguities (determining the correct meaning of a word) resulting from unlimited natural language. He discusses the use of selectional restrictions based on domain knowledge that improve the resolution of word ambiguity. Rich and Knight (1991) also discuss problems with word sense disambiguation. They say: “There are two important parts of the process of using knowledge to facilitate understanding: focus on the relevant part(s) of the available knowledge base, and use that knowledge to resolve ambiguities” (1991: 400). However, the term sublanguage has also been used to refer to “the sentences of a language closed under some or all of the operations in the language” (Lehrberger, 1986: 22). This definition restricts the format of the language but does not restrict the semantic domain of the language. Lehrberger refers to an information sublanguage as a sublanguage whose semantic domain is no more restricted than that of natural language itself. He refers to a sublanguage with a restricted semantic domain as a subject-matter sublanguage (Lehrberger, 1986). Lehrberger says that “grammaticality in a subject-matter sublanguage is determined by whatever officially prescribed or implicit norms of usage exist among the specialists in the subject-matter field” (Lehrberger, 1986). Some grammatical restrictions in a sublanguage may involve the use of the imperative, the common sentence length, uniformity of tense, modality, the use of the interrogative, the deletion of articles (telegraphic style), the deletion of the object noun phrase, the use of proper nouns, and many others (Kittredge, 1982; Lehrberger, 1982). In some cases, the norms of usage may conflict with those of standard language.
Sublanguage structures that do not conform to standard language are referred to as ‘deviant’. However, normally deviant sublanguage structures can be paraphrased in standard language, and could often be replaced by standard language structures. One fairly common example of deviant sublanguage structures is the telegraphic style
that is used in weather bulletins or Navy telegraphic messages (Fitzpatrick et al., 1986; Lehrberger, 1986). Sublanguage texts usually contain some material that does not belong to the sublanguage proper. This may occur as whole sentences interspersed among those of the sublanguage proper, or as matrix expressions. Lehrberger (1986) gives the following example drawn from a geometry text: “Pythagoras proved the theorem that now bears his name in the sixth century B.C. He was able to show that the area of the square on the hypotenuse of a right triangle equals the sum of the areas of the squares on the other two sides”. (Lehrberger, 1986: 21) The first sentence is not properly a sentence about geometry, but one of history. Also, the phrase ‘He was able to show that’, which is called a matrix expression, is not part of the sublanguage of geometry. Lehrberger concludes that this is a real phenomenon in real texts from restricted sublanguages, and that a text typically consists of a mixture of discourse from within a particular domain, and metadiscourse about it. In summary, the factors which can help to categorize a sublanguage are (Lehrberger, 1986: 37):
1. limited subject matter
2. lexical, syntactic, and semantic restrictions
3. ‘deviant’ rules of grammar
4. high frequency of certain constructions
5. text structure
6. the use of special symbols.
Most other work in sublanguages has been primarily related to using sublanguages to simplify the difficulties involved in processing various kinds of natural language (Navarro and Baeza-Yates, 1997; Anick, 1994; Krovetz and Croft, 1992; Marsh and Friedman, 1985). Marsh and Friedman transported a natural language processing system developed for medical text to a navy environment where it was used to analyze messages about shipboard equipment failures. They discussed various differences in semantic categories and grammar between the two different environments, and how that affected their porting of their system. Krovetz and Croft discussed the problems related to word ambiguity, when using an information retrieval system to locate relevant documents in response to a user’s query. Anick discussed the advantages of sublanguage usage when tuning an information retrieval system to the particular domain of computer troubleshooting. Navarro and Baeza-Yates developed a formal query model to query document databases that employed sublanguage-type matching as a subpart of the model. This article will show that English language comments in computer software meet all of the criteria, and thus can be considered a sublanguage of the English language. The fact that a sublanguage consists of limited subject matter means that computer-based natural language processing employed in the analysis of the sublanguage
can be simpler in several respects than the analysis of unrestricted natural language. For example, a knowledge base used to store knowledge about the domain of interest can be simpler than one employed to analyze general natural language. Also, a word-based natural language parser, which is often employed in computer-based natural language processing to syntactically tag words in a sentence with their parts of speech, would have a smaller dictionary. The use of a smaller dictionary in a word-based natural language parser would result in a reduction of the parse ambiguity problem - the fact that many sentences have multiple parses. The syntactic restrictions of a sublanguage mean that a natural language parser used in sublanguage analysis need not handle all the possible grammatical constructs of natural language as a whole, but can support a simpler subset of the natural language. However, since sublanguages sometimes employ deviant rules of grammar and certain special symbols, the natural language parser must be modified to handle these additional rules and symbols. Thus, since it is shown in this study that comments can be treated as a sublanguage, the computer-based natural language understanding of comments becomes a practical undertaking.
3. Studies of comments as a sublanguage
An initial examination of a set of comments drawn from mathematical packages written in the C and C++ software languages, and from computer hardware controller packages written in the Intel 8086 assembly language, led to the following observations:
1. Comments are usually written in the present tense, with either indicative mood or imperative mood. For example, ‘This routine reads the data.’ is present tense, indicative mood. ‘Read the data.’ is present tense, imperative mood (Etzkorn and Davis, 1994). This corresponds to the syntactic restrictions category of Lehrberger’s sublanguage criteria.
2. The set of verbs typically used for comments is much more restricted than the set of all English verbs. Verbs often used in comments include: is, uses, provides, implements, accesses, prints, inputs, outputs, reads, writes, supplies, defines, retrieves, gets, etc. Verbs seldom used in comments include: smiles, frowns, laughs, rides, flies, jumps, sings, fights, electrocutes, falls, punishes, hires, fires, pats, throws, pitches, calms, etc. (Etzkorn and Davis, 1994). The set of comment verbs, while still very large, is in general much smaller than the set of natural language verbs. Also, since the domain for analysis will also be restricted, the selection of verbs to support becomes much smaller. This corresponds to the lexical restrictions category of Lehrberger’s sublanguage criteria.
3. Personal pronouns such as I, me, we, us, he, she are seldom used in comments. This also corresponds to the lexical restrictions category of Lehrberger’s sublanguage criteria.
4. Comments tend to be of certain types. Two major types are header block comments and inline comments. Header block comments appear at the top of files or of subroutines (functions) or classes, and provide an overall description of the operation, whereas inline comments are located very near the software that they describe. Header block comments usually come in two formats - operational description and definition. Examples of operational description header block comments are (the ‘/*’ and ‘*/’ symbols indicate, within computer software, that the enclosed sentences are comment sentences):
/* This routine reads the data */
and
/* READ-DATA reads the data */
An example of a definition header block comment is:
/* GeneralMatrix - rectangular matrix class */
Header block comments, however, very often consist of several comment sentences. Inline comments also usually come in operational description and definition formats. An example of an operational description inline comment is:
/* Get matrix row. */
Examples of definition format inline comments are:
/* index variable */
/* counter of incoming [...] */
Inline comments most commonly have only one or a very small number (two or three) of comment sentences. The common division of comments into the categories of header block style comments and inline comments corresponds to the text structure category of Lehrberger’s sublanguage criteria. From a natural language processing standpoint, the fact that inline comments usually consist of only a very small number of comment sentences reduces the need for extensive discourse analysis, although there is more of a need for traditional discourse analysis in the header block comments. One unusual aspect of inline comments that can make them harder to analyze with a natural language processing system is that pronouns in the comments sometimes refer to the associated code. For example, a comment ‘This opens the file.’ might be attached to a subprogram call. In this case, the pronoun ‘This’ refers to the subprogram. This reference also corresponds to Lehrberger’s text structure category.
3.1. Syntactic [...]
A corpus of comments drawn from three independent graphical user interface (GUI) packages (Smart, 1994; Backer et al., 1991; Watson, 1993) written in C++ was analyzed, and seven common syntactic patterns were identified for sentence-style comments, while four common syntactic patterns were identified for non-sentence-style comments. The syntactic patterns found were very similar to those identified in the original brief comment examination. These syntactic patterns are listed here, with examples.
3.1.1. Common syntactic patterns in sentence-form comments
1. Present tense, indicative mood, active voice:
“This depends on the type of the parent.” (Backer et al., 1991)
“This class provides both back propagation training and runtime modules.” (Watson, 1993)
2. Present tense, indicative mood, active voice, missing subject:
“Creates a toggle button for the item.” (Backer et al., 1991)
“Creates an item of class GnToggleSelectionItem.” (Backer et al., 1991)
3. Present tense, imperative mood, active voice:
“Append new ordering object.” (Backer et al., 1991)
“Maintain a valid HDC only during paint operations.” (Watson, 1993)
4. Present tense, indicative mood, passive voice:
“Timeout is removed automatically by Xt.” (Backer et al., 1991)
“The X server is queried for the colour the first time after which it is entered into the database.” (Smart, 1994)
5. Present tense, indicative mood, passive voice, missing subject:
“Is used for mapping callbacks.” (Backer et al., 1991)
6. Past tense, indicative mood, either active or passive voice, sometimes with a missing subject:
“Register failed.” (Watson, 1993)
“Human selected a move.” (Watson, 1993)
“Failed to create a window.” (Watson, 1993)
7. Future tense, indicative mood, either active or passive voice, sometimes with a missing subject:
“This will involve shifting large blocks of memory around, but will make the code much simpler.” (Watson, 1993)
“This will allow a simple command language.” (Smart, 1994)
As can be seen from the above examples, many comments have a telegraphic quality, with articles such as ‘the’ or ‘a’ missing. For example, ‘Append new ordering object.’ in standard English would become either ‘Append a new ordering object.’ or ‘Append the new ordering object.’ Similarly, ‘Human selected a move.’ would become ‘A human selected a move.’ or ‘The human selected a move.’
This telegraphic nature of comments can add some difficulty to the automated parsing of comments, since many or most standard English natural language parsers require the articles to be present. The telegraphic nature of comments corresponds to the ‘deviant’ rules of grammar category of Lehrberger’s sublanguage criteria.
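Several of the patterns listed above can be told apart by the first word of the comment alone. The following Python fragment is a minimal sketch of that idea, offered only as an illustration; the study itself was performed manually. The function name `guess_pattern`, the small verb list, and the naive rule for forming third-person verbs are all assumptions introduced here.

```python
# Illustrative verb list (an assumption), not a real lexicon.
BASE_VERBS = {"append", "maintain", "create", "read", "get", "define"}

# Naive third-person singular forms: 'create' -> 'creates'.
THIRD_PERSON = {verb + "s" for verb in BASE_VERBS}


def guess_pattern(comment):
    """Guess a coarse syntactic pattern from the first word of a comment."""
    words = comment.strip().rstrip(".").split()
    if not words:
        return "empty"
    first = words[0].lower()
    if first in BASE_VERBS:
        # Base-form verb first: 'Append new ordering object.'
        return "imperative"
    if first in THIRD_PERSON:
        # Third-person verb first: 'Creates a toggle button for the item.'
        return "indicative, missing subject"
    if first in {"is", "are"}:
        # Auxiliary first: 'Is used for mapping callbacks.'
        return "possible passive, missing subject"
    return "other"


print(guess_pattern("Append new ordering object."))
print(guess_pattern("Creates a toggle button for the item."))
```

A heuristic of this kind cannot separate active from passive voice or identify tense when a subject is present; those cases would require a parser adapted to the telegraphic style described above.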
3.1.2. Common syntactic patterns in non-sentence-form comments
1. A simple definition comment is included inline close to the software that it describes, or Itemname - definition is included in a header block comment.
“Two way linked lists.” (Backer et al., 1991)
“Pointers to the X coordinate data.” (Watson, 1993)
“Max path length.” (Smart, 1994)
“NumberOfLines - number of segment lines per spline section.” (Smart, 1994)
2. An unattached prepositional phrase is included inline close to the software that it describes.
“From base class.” (Backer et al., 1991)
“To a maximum of MAX-PLOTS.” (Watson, 1993)
“For single selection items.” (Smart, 1994)
3. Allowed values for a software variable are specified close to the software variable that they describe.
“1 = derived by MI, 0 = only SI.” (Backer et al., 1991)
“0 = not selected, 1 = a child node is selected.” (Watson, 1993)
“1 = Preview, 2 = print to file, 3 = send to printer.” (Smart, 1994)
4. Mathematical expressions are included near the software that they describe. Note that this type of construction would most likely be more common in a mathematical package than in a graphical user interface package.
“x-value * x-scale + x_org = x-coordinate in my window.” (Watson, 1993)
3.1.3. Common content of comments
The content of comments in the C++ GUI packages was also examined, and four standard content styles were determined. These content styles are described below, with examples:
1. The comment provides an operational description of the software.
“This class provides both back propagation training and runtime modules.” (Watson, 1993)
“The X Server is queried for the colour the first time after which it is entered into the database.” (Smart, 1994)
“Creates a toggle button for the item.” (Backer et al., 1991)
2. The comment provides a definition of the software. This type of comment occurs both in sentence and non-sentence form.
“For Microsoft Windows (TM), this is a pointer to a locked Handle.” (Watson, 1993)
“Utilities used by spline drawing routines.” (Smart, 1994)
3. The comment provides a description of the definition of the software.
“Defines move to and move from.” (Watson, 1993)
“The function DrawSpline defines the blending functions for a cubic B-spline, which are cubic polynomials in u.” (Smart, 1994)
4. The comment instructs the reader to perform a certain action.
“Be careful not to connect with an empty list.” (Backer et al., 1991)
“See comment in wxitem.cc.” (Smart, 1994)
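The four non-sentence-form patterns of Section 3.1.2 have fairly distinctive surface shapes, so a rough first pass at recognizing them could rely on regular expressions. The sketch below is illustrative only and is not the procedure used in this study; the patterns, the preposition list, and the name `classify_non_sentence` are assumptions modeled on the quoted examples.

```python
import re

# Patterns modeled on the quoted examples; they are assumptions, not a grammar.
ALLOWED_VALUES = re.compile(r"^\s*\d+\s*=\s*\S")    # "1 = Preview, 2 = ..."
ITEM_DEFINITION = re.compile(r"^\s*\w+\s+-\s+\S")   # "NumberOfLines - number of ..."
PREPOSITIONS = ("from", "to", "for", "in", "on", "with")


def classify_non_sentence(comment):
    """Assign a non-sentence comment to one of the four surface patterns."""
    text = comment.strip()
    if ALLOWED_VALUES.match(text):
        return "allowed values"
    if ITEM_DEFINITION.match(text):
        return "itemname - definition"
    if "=" in text and re.search(r"[*+/]", text):
        # An equals sign together with an arithmetic operator.
        return "mathematical expression"
    if text.lower().startswith(tuple(p + " " for p in PREPOSITIONS)):
        return "prepositional phrase"
    # Anything else is treated as a simple definition.
    return "simple definition"


print(classify_non_sentence("From base class."))
print(classify_non_sentence("1 = Preview, 2 = print to file, 3 = send to printer."))
```

The order of the tests matters: allowed-value comments also contain an equals sign, so they are checked before mathematical expressions.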
Instructions to the reader as to how to read or understand a text could be considered metadiscourse about the text, similar to that described by Lehrberger and referred to earlier in the example about Pythagoras.
3.2. An initial study of comments in four C++ packages
A study of comments as a sublanguage was performed on three independent C++ graphical user interface packages (Backer et al., 1991; Smart, 1994; Watson, 1993). First, the comments were stripped from all the files in a particular package, using a utility program, and placed in a separate comment file. The order of the files from which comments were stripped was chosen randomly, although comments were stripped sequentially from each file. The first 108 comments in the comment file for each package were examined, excluding revision and copyright notices. This typically included comments from several different files. Comments that consisted of code were ignored in the study, since it was felt that in this case the comment characters were performing two duties: one duty was to document the code, and the other duty was to remove code from the compilation process (sentences surrounded by the comment characters ‘/*’ and ‘*/’ are ignored within computer software, and are used for comments; but they are sometimes also used to cause the computer to ignore certain computer code instructions). This manually performed study was restricted to comments in C++ software. The results of the study are shown in Figs. 1 through 5.
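The utility program used to strip the comments is not described further. As a hedged illustration of what such a utility might do, the following Python sketch extracts ‘/* ... */’ and ‘//’ comments from C++ source text. The names `strip_comments` and `is_notice`, and the keyword test for notices, are assumptions made for this example.

```python
import re

# Matches /* ... */ block comments and // line comments.
# Caveat: a plain regular expression knows nothing about string literals,
# so comment markers inside strings would be misread; a real tool would
# need a proper tokenizer.
COMMENT_RE = re.compile(r"/\*(.*?)\*/|//([^\n]*)", re.DOTALL)


def strip_comments(source):
    """Return the comment texts found in C/C++ source, in order of appearance."""
    comments = []
    for match in COMMENT_RE.finditer(source):
        body = match.group(1) if match.group(1) is not None else match.group(2)
        # Drop the '*' decoration of block comments and collapse the lines.
        lines = [line.strip().lstrip("*").strip() for line in body.splitlines()]
        text = " ".join(line for line in lines if line)
        if text:
            comments.append(text)
    return comments


def is_notice(comment):
    """Crude test for revision and copyright notices (keywords are assumptions)."""
    lowered = comment.lower()
    return "copyright" in lowered or "revision" in lowered


source = "/* Get matrix row. */\nint i; // index variable\n/* Copyright 1994 */\n"
kept = [c for c in strip_comments(source) if not is_notice(c)]
print(kept)
```

Comments that consist of disabled code would still have to be removed by hand, as was done in the study.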
Fig. 1. Sentence-style [...]
Fig. 1 shows a comparison of the number of sentence-style to non-sentence-style comments. Note that 54% of all comments are in sentence style. This is important from the standpoint of a natural language processing tool, since it means that up to 54% of all comments could be parsable by natural language parsing software. From Fig. 2, it can be seen that 82% of all sentence-style comments are in present tense, 8% are in simple future tense, and 4% are in simple past tense; only 6% of sentence-style comments are not in present tense, simple future tense, or simple past tense. The common restriction of comments to present tense, simple future tense, and simple past tense shows a syntactic restriction that corresponds to the syntactic restrictions category of Lehrberger’s sublanguage criteria.
Fig. 2. Analysis of tense in sentence-style [...]
Fig. 3. Mood and voice of present [...]
From Fig. 3, it can be seen that 59% of present tense, sentence style comments are in indicative mood, active voice, while another 31% are in imperative mood, active
voice. This shows a syntactic restriction that corresponds to the syntactic restrictions category of Lehrberger’s sublanguage criteria. The common usage of present tense, indicative mood, active voice and present tense, imperative mood, active voice also corresponds to the high frequency of certain constructions category of Lehrberger’s sublanguage criteria.
Fig. 4. Non-sentence-style [...]
From Fig. 4 it can be seen that definition format comments make up 90% of the non-sentence-style comments. The particular packages examined in the study had few comments of the Itemname - definition format, or of mathematical formulas.
Fig. 5. Content [...]
Fig. 5 shows that 51% of the comments provide an operational description. Only a few instructions to the reader occurred in this study. Note that only 3% of all comments were not directly related to the immediate description of computer software. The fact that comments have a specific purpose implies that a limited amount of subject matter is employed, related to the specific operation of the code. This tends to imply that comments have a limited subject matter, as required by the limited subject matter category of Lehrberger’s sublanguage criteria. To further examine comments in relation to Lehrberger’s limited subject matter and lexical restrictions categories, an additional study was performed. Words from the comments in each package were placed on separate lines in a file by the use of a utility program. The English words and the number of different English words (not including duplicate words) were counted, also by the use of a utility program. Then an analysis of the comment words was performed using PC-KIMMO version 2, which includes a description of the English morphology and lexicon known as Englex (Antworth, 1995). PC-KIMMO version 2 is a morphological parser based on Kimmo Koskenniemi’s model of two-level morphology (Koskenniemi, 1983) - in addition to decomposing a word into morphemes, PC-KIMMO version 2 also provides parse trees and feature structures. The Englex description of the English morphology and lexicon provides a substantial analysis of English. Using Englex/PC-KIMMO version 2, the number of different particles and the number of different roots present in the words was determined. In Englex, a particle is a member of a sublexicon of words that do not accept affixes. This includes auxiliaries, prepositions, pronouns, interjections, and determiners. Table 1 shows the results of this study.
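The two word counts described above (total words and different words) are simple to reproduce. The Python sketch below shows one way to do it; it is not the utility program used in the study, and the tokenization rule (lower-cased alphabetic runs) and the name `count_words` are assumptions. The root and particle counts depend on PC-KIMMO and Englex and are not reproduced here.

```python
import re
from collections import Counter


def count_words(comments):
    """Return (total number of words, number of different words)."""
    words = []
    for comment in comments:
        # Lower-case the text and keep alphabetic runs only (an assumption).
        words.extend(re.findall(r"[a-z]+", comment.lower()))
    frequencies = Counter(words)
    return len(words), len(frequencies)


total, different = count_words(["Read the data", "This routine reads the data"])
print(total, different)
```

Note that ‘read’ and ‘reads’ count as two different words here; it is the morphological analysis that reduces them to a single root.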
Table 1. Word analysis for the graphical [...] analyzed

Package                    Total number   Number of         Number of             Number of         Number of not
                           of words       different words   different particles   different roots   recognized words
Gina                       12272          74x               59                    403               256
Watson                     3608           670               70                    430               109
wxwindows                  30706          965               88                    536               244
Total over GUI Packages    46586          1786              [...]                 849               [...]
PAT                        [...]          [...]             [...]                 [...]             [...]

For the overall total category shown in Table 1, the number of different words, number of different particles, number of different roots, and number of not recognized words do not simply represent a summation of those categories over the three graphical user interface packages. Rather, the comment words in all three of the graphical user interface packages were combined into a single file, and that file was analyzed separately. Therefore the number of different words, number of different particles, number of different roots, and number of not recognized words represent true numbers over all three packages. This makes a difference in the totals when compared to a simple summation, since with a simple summation of, for example, the number of different words per category, it would have been possible for a particular word to appear in one graphical user interface package, and then to reappear in another graphical user interface package, and thus be counted more than once in the total. Since the total was run as a separate test, this problem does not occur. As can be seen from Table 1, over three graphical user interface packages, the total number of words that occurred was 46586. Of these words, there were only 1786 different words, from 849 roots. In the root count, a root that can serve as multiple parts of speech was counted multiple times. The number of not recognized words, shown in Table 1, represents the comment words that were not recognized by the basic, unmodified Englex lexicon that is provided with PC-KIMMO. Examples of words not included in the Englex lexicon are some technical words such as ‘print’ and ‘ascii’; some abbreviations such as ‘com’ to mean ‘communications’; an occasional misspelling or typographical error such as ‘contorl’ instead of ‘control’ (comments are seldom spellchecked); and some specialized words peculiar to a particular C++ package, such as ‘gnview’ to mean a view subroutine within the Gina package (‘gn’ here is an abbreviation meaning ‘Gina’). In the case of the Gina package, a few German comments were scattered throughout the code. The German words were also not recognizable by Englex.
The overall small number of different words used in the graphical user interface packages, along with the small number of different roots, shows that the comments in the graphical user interface packages are significantly lexically restricted when compared to natural language as a whole. (Compare this to the over two million words in the Penn Treebank corpus (Linguistic Data Consortium, 1998), which provides hand-parsed words from the Dow Jones News service, from the Brown corpus, from IBM computer manuals, and from the Wall Street Journal, among other sources.) This fact corresponds to the lexical restrictions category of Lehrberger’s sublanguage criteria. The relatively small number of words employed also implies that comments have a limited subject matter, which corresponds to the limited subject matter category of Lehrberger’s sublanguage criteria. The small number of root words tends to imply that the semantic restrictions category of Lehrberger’s sublanguage criteria is also descriptive of comments. The results were similar for the parallelization tool, in that the number of different words and the number of different roots were reasonably small. It is clear that the comments in graphical user interface software (along with one parallelization tool) could be classified, using Lehrberger’s sublanguage criteria, as a sublanguage of the English language.
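The reason the overall count of different words is not the sum of the per-package counts can be shown with a small set-based example. The Python sketch below uses two invented one-line ‘packages’ (assumptions for illustration only); the final ratio uses the totals reported in the text above.

```python
def distinct_words(text):
    """Return the set of different words in a text."""
    return set(text.lower().split())


# Two invented word lists standing in for two packages (assumptions).
package_a = distinct_words("read the data")
package_b = distinct_words("write the data")

summed = len(package_a) + len(package_b)   # counts 'the' and 'data' twice
combined = len(package_a | package_b)      # set union counts each word once
print(summed, combined)

# Share of different words among all words, from the totals in the text.
ratio = 1786 / 46586
print(round(ratio, 3))
```

Taking the union of the word sets corresponds to combining the comment words of all packages into a single file before counting, as was done for the overall total.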
However, after this study it was still possible to wonder whether the sublanguage resulted primarily from the fact that the corpus was made up of comments, or from the fact that the corpus was made up of comments drawn primarily from the graphical user interface area of computer software. The examination of comments drawn from only graphical user interface packages provided an additional domain restriction. To answer this question, a further study was undertaken, in which a corpus of comments from eleven C++ software packages, drawn from four different application areas, was examined. The study examined packages from the following application areas:
- Real time packages: DOSThread (English, 1993), ISC (Laor, 1992), Serial (Serial, 1992).
- Text analysis packages: JPEG (Lane et al., 1996), String++ (Moreland, 1994), DOC++ (Wunderling and Zoeckler, 1996).
- Database packages: Combits (Combits, 1997), Quick Database (Curtis, 1996).
- Mathematical packages: NEWMATOX (Davies, 1995), MFLOAT (Kaufmann and Meuller, 1995), BLITZ++ (Veldhuizen, 1997).
Comments that consisted of code were ignored, since it was felt that in this case the comment characters were performing a different duty - removing code from the compilation process rather than documenting the code implementation. Comments from .H files and .CPP files were examined separately, since it was felt there might be a difference in comment structure. In the C++ language, .H files for the most part contain data structure and class definitions, while the .CPP files contain the code. First, the comments were stripped from all the .H files in a particular package, using a utility program, and placed in a separate comment file. Comments were stripped sequentially from each file; however, the files were chosen in a random manner. The first 100 comments in the comment file for each package were examined, excluding revision and copyright notices. Typically, this included comments from several different .H files (up to 8 files, with an average of 4.25 files). Then this process was repeated for the .CPP files. Figs. 6 through 20 show the results of this study.
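The sampling procedure just described (files in random order, comments taken sequentially within each file, notices excluded, stopping at the first 100 comments) can be sketched as follows. This is an illustration under stated assumptions, not the utility used in the study: the function name `sample_comments`, the dictionary input format, and the keyword test for notices are all hypothetical.

```python
import random


def sample_comments(comments_by_file, limit=100, seed=None):
    """Take the first `limit` comments after shuffling the file order.

    comments_by_file maps a file name to its comments in source order.
    Returns (sampled comments, number of files that contributed).
    """
    rng = random.Random(seed)
    file_names = sorted(comments_by_file)
    rng.shuffle(file_names)          # files are chosen in a random manner
    sample, files_used = [], 0
    for name in file_names:
        if len(sample) >= limit:
            break
        # Keep source order within a file; drop notices (keyword test is an assumption).
        kept = [c for c in comments_by_file[name]
                if "copyright" not in c.lower() and "revision" not in c.lower()]
        if kept:
            files_used += 1
        sample.extend(kept[: limit - len(sample)])
    return sample, files_used


data = {"a.h": ["c1", "c2", "Copyright X"], "b.h": ["c3", "c4"]}
sample, files_used = sample_comments(data, limit=3, seed=0)
print(len(sample), files_used)
```

Passing a fixed `seed` makes a particular sample reproducible, which the original description does not mention and is added here only for convenience.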
Fig. 6. Sentence form vs. non-sentence form in .H files.
From Figs. 6, 7, and 8 it can be seen that large numbers of comments are in sentence form, and thus possibly can be handled by standard natural language parsing techniques. For comments in .CPP files, 65% of comments are in sentence form, whereas the proportion of comments in sentence form in .H files is lower, 42%, which could be expected from the fact that .H files are more concerned with definitions, while .CPP files are more concerned with implementation. Overall, for .H files and .CPP files combined, the proportion of sentence-form comments is 53%.

Fig. 7. Sentence form vs. non-sentence form in .CPP files.

Fig. 8. Overall sentence form vs. non-sentence form in both .H and .CPP files.

Figs. 9, 10, and 11 show that most comments are in present tense. All but a small percentage of comments are in present tense, simple future tense, or simple past tense. In .H files, 75% of comments are in present tense, while in .CPP files, 77% of comments are in present tense. Overall, for both .H and .CPP files, 77% of comments are in present tense. Comments not in present tense, simple future tense, or simple past tense (the ‘other’ category) make up 16% of total comments, which is higher than the 6% found in the study of the graphical user interface packages. However, this is still small compared to the 84% of comments that are in present tense, simple future tense, or simple past tense. This restriction of tense corresponds to the syntactic restrictions category of Lehrberger’s sublanguage criteria.

Fig. 9. Tense in comments from .H files.

Fig. 10. Tense in comments from .CPP files.

Fig. 11. Overall tense in comments from both .H and .CPP files.
Fig. 12. Mood in comments from .H files.

Fig. 13. Mood in comments from .CPP files.
Figs. 12, 13, and 14 show that most comments are either in indicative mood, active voice (37% overall), or in imperative mood, active voice (45% overall). These results are reasonably consistent with the 59% found for indicative mood, active voice and the 31% found for imperative mood, active voice in the earlier graphical user interface packages study, in that together these two categories form the two largest mood categories in both cases. This restriction of mood corresponds to the syntactic restrictions category of Lehrberger’s sublanguage criteria. The common usage of present tense, indicative mood, active voice and present tense, imperative mood, active voice also corresponds to the high frequency of certain constructions category of Lehrberger’s sublanguage criteria.

Fig. 14. Overall mood in comments from both .H and .CPP files.
Fig. 15. Non-sentence-form comments from .H files.
Figs. 15, 16, and 17 examine the non-sentence-form comments. Overall, for both .H and .CPP files, 85% of non-sentence-form comments are either in Itemname definition format, or in definition format. One difference between the results of this study and the earlier study of the graphical user interface packages is that here 9% of non-sentence-form comments are mathematical formulas, while in the graphical user interface packages a negligible number of comments were mathematical formulas. This was due to the presence of mathematical formulas in the .H files in the mathematical packages examined. A negligible number of mathematical formulas was found in the other packages; similarly, a negligible number of mathematical formulas was found in the .CPP files in the mathematical packages. The occasional use of mathematical symbols in comments could be considered to meet Lehrberger’s sublanguage criterion related to the use of special symbols. It also sometimes occurs that programming-related symbols are used in comments. Such usage also meets this sublanguage criterion.

Fig. 16. Non-sentence-form comments from .CPP files.

Fig. 17. Overall non-sentence-form comments from both .H and .CPP files.
Fig. 18. Contents of comments in .H files.
Figs. 18, 19, and 20 show that most comments provide an operational description of computer code (60% overall). The proportion of comments that provided instructions to the reader (3% in this case) was larger than in the earlier graphical user interface packages study. Note that only 3% of all comments were not directly related to the immediate description of computer software. The fact that comments tend to have a specific purpose related to computer code, regardless of the application area, tends to imply that a limited amount of subject matter is employed, as required by the limited subject matter category of Lehrberger’s sublanguage criteria.

Fig. 19. Contents of comments in .CPP files.

Fig. 20. Overall contents of comments in both .H and .CPP files.

To further examine comments drawn from several different application areas in relation to Lehrberger’s limited subject matter and lexical restrictions categories, an additional study was performed. Words from the comments in each package were placed on separate lines in a file by the use of a utility program. The English words and the number of different English words (not including duplicate words) were counted, also by the use of a utility program. Then an analysis of the comment words was performed using PC-KIMMO version 2 with Englex (Antworth, 1995). Table 2 shows the results of this study. The totals from the graphical user interface packages study were also included in Table 2, and the graphical user interface packages were used in the calculation of overall totals. (However, graphical user interface packages were not included in the analysis that was presented earlier, in Figs. 6 through 20.) The parallelization tool was not included in the table, since it represented only a single package in the area of parallelization tools. As with the earlier study involving only the graphical user interface packages and the one parallelization tool, all totals were calculated separately, from a file that included all the words from all the packages examined. For example, to calculate the total over all three real time packages, all words from all of the real time packages were combined into a single file, and that file was separately analyzed. To calculate the overall total, all words from all of the packages examined, including the graphical user interface packages, were combined into a single file, and that file was separately analyzed.

Table 2. Word analysis for all packages. Rows: total over 3 graphical user interface packages; total over 3 real time packages; total over 3 text analysis packages; total over 2 database packages; total over 3 mathematical packages; total over all packages. Columns: total number of words; number of different roots; number of not recognized words.

As can be seen from Table 2, over fourteen C++ software packages, the total number of words that occurred was 92359. Of these words, there were only 2723 different words, from 1144 roots. The very small number of different words used overall in all of the packages examined, along with the small number of different roots, shows that the comments in the graphical user interface packages are significantly lexically restricted when compared to natural language as a whole. The small number of words employed also implies that comments have a limited subject matter, which corresponds to the limited subject matter category of Lehrberger’s sublanguage criteria. The small number of root words tends to imply that the semantic restrictions category of Lehrberger’s sublanguage criteria is also descriptive of comments.
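The word-counting step of this analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the utility program used in the study: the function name `word_statistics` and the tokenization rule (maximal runs of ASCII letters, compared case-insensitively) are introduced here, and the sketch does not reproduce the root analysis, which the study performed with PC-KIMMO and Englex.

```python
import re
from collections import Counter


def word_statistics(comments):
    """Count total and different words over a collection of comment strings.

    Assumption: a "word" is a maximal run of ASCII letters, compared
    case-insensitively. Contractions such as "don't" are therefore split.
    """
    words = []
    for comment in comments:
        words.extend(w.lower() for w in re.findall(r"[A-Za-z]+", comment))
    counts = Counter(words)
    total = len(words)
    different = len(counts)
    return {
        "total": total,
        "different": different,
        # Share of different words among all words; smaller values indicate
        # a more restricted vocabulary.
        "ratio": different / total if total else 0.0,
    }


stats = word_statistics(["Returns the size of the list", "the list is empty"])
```

For comparison, the overall figures reported in the text (2723 different words among 92359 words in total) correspond to a ratio of about 2.9%.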
Conclusions and future research
Two separate studies of comments as a sublanguage have been performed. The first study was performed over three C++ graphical user interface packages (Backer et al., 1991; Smart, 1994; Watson, 1993) and one parallelization tool (used as a control) (PAT, 1994). The second study was performed over three C++ packages drawn from the application area of real time packages (English, 1993; Laor, 1992; Serial, 1992), three text analysis packages (Lane et al., 1996; Moreland, 1994; Wunderling and Zoeckler, 1996), two database packages (Combits, 1997; Curtis, 1996), and three mathematical packages (Davies, 1995; Kaufmann and Meuller, 1995; Veldhuizen, 1997). In each of these studies, it was demonstrated that comments meet the following sublanguage criteria specified by Lehrberger (1986):

- Lexical restrictions
- Syntactic restrictions
- Semantic restrictions
- High frequency of certain constructions
- Limited subject matter
- Use of special symbols

The common telegraphic nature of comments, and the common writing of comments either as inline or as header block comments, were discussed. These aspects of comments entail that comments also meet the following sublanguage criteria specified by Lehrberger:

- Deviant rules of grammar
- Text structure

Thus it can be confidently stated that English language comments in computer software form a sublanguage of English.

Treating comments as a sublanguage made possible the implementation of a tool called the Program Analysis Tool for Reuse (the PATRicia system) that was developed to determine whether certain software components are useful and thus reusable in a particular domain (Etzkorn and Davis, 1994, 1996, 1997). In addition to serving as a reusability analysis tool, the PATRicia system also has potential for use in software development, viz., to enforce standards for good comments. It also can be extended for use in software maintenance, where it could serve as an interactive assistant to the maintenance programmer. Another possible use for the PATRicia system is as an educational tool, to train computer science students to use good commenting practices.

The PATRicia system has been shown to successfully identify software components that are potentially reusable in a certain area of interest (Etzkorn and Davis, 1997). In a comparison of the PATRicia system to human computer experts, the PATRicia system was between 70% and 90% complete (percentage of recall) and accurate (percentage of precision) in identifying reusable software components.

As a reusability analysis tool, the PATRicia system could be used by large companies or government agencies to identify potentially reusable software components in already existing software. Those software components could be automatically categorized by the PATRicia system and stored in a library of reusable components.
Then, software developers in future projects could select an existing component from the library to fill a particular need, rather than having to write a new software component. Many companies and government agencies have such a large base of existing software that having a human computer expert perform a similar analysis to that provided by the PATRicia system is prohibitively expensive in both time and cost, while also being tedious for the computer expert.

In enforcing standards for good comments, the PATRicia system could help improve software quality and speed up code reviews. Software development often takes place under tight schedules. Thus, good comments that explain the operation of the software are often not produced, due to time constraints on the software developer (as well as to occasional laziness). Most companies with a good software development process have a peer review of the computer software produced, before the software is pronounced acceptable. However, this can be time consuming for the software development personnel involved in the review. The PATRicia system can provide a quick examination of computer software, and a quantitative answer as to whether the documentation, in the form of comments, is sufficient.

For this purpose, metrics, or measurement methods, for what constitutes well-commented software have recently been developed (Etzkorn and Delugach, 2000). The following metrics have been defined: Overall Class Documentation Quality A (OCDQa), Overall Class Documentation Quality B (OCDQb), the Class Comment Quality (CCQ), and the Class Identifier Quality (CIQ). For example, OCDQa is defined as class domain complexity divided by class size, where class domain complexity is a measure of the relatedness of the comment description to the domain of interest, based on a knowledge-representation technique known as conceptual graphs. OCDQb is defined as semantic class definition entropy divided by class size, where semantic class definition entropy is derived from using an information theory technique known as entropy applied to comments in a class. These metrics make the assumption that the best comment is a comment that is the most descriptive in terms of the application area. These metrics are further explained in Etzkorn and Delugach (2000).

Further study in this area could include applying syntactic as well as semantic considerations to developing new metrics for comment quality. It could also involve an investigation of the audience for which comments are intended: other programmers, the software developer himself or herself, a sub-clan of users, or the general public.
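To make the general shape of an entropy-based measure divided by class size concrete, a small sketch follows. It should be read with caution: it is not the definition of semantic class definition entropy from Etzkorn and Delugach (2000), which is not reproduced in this article. As a stand-in, the sketch uses the ordinary Shannon entropy of the word-frequency distribution of a class's comments, and `class_size` is a hypothetical size measure supplied by the caller; both function names are likewise introduced here.

```python
import math
import re
from collections import Counter


def comment_word_entropy(comments):
    """Shannon entropy, in bits, of the word frequencies in the comments.

    Assumption: plain word frequencies stand in for the semantic analysis
    used by the actual metric.
    """
    words = [w.lower() for c in comments for w in re.findall(r"[A-Za-z]+", c)]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # H = sum over words of p * log2(1 / p), with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_per_class_size(comments, class_size):
    """Entropy of the comments divided by a caller-supplied class size."""
    if class_size <= 0:
        raise ValueError("class_size must be positive")
    return comment_word_entropy(comments) / class_size


example = entropy_per_class_size(["read read write write"], class_size=4)
```

In this toy case two words occur equally often, so the entropy is 1 bit and the normalized value is 0.25. A measure of this kind rewards varied vocabulary only; unlike the metrics described above, it carries no notion of relatedness to the application domain.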
References

Abd-El-Hafiz, Salwa K. and Victor R. Basili, 1996. A knowledge-based approach to the analysis of loops. IEEE Transactions on Software Engineering 22(5): 339-360.
Allen, James, 1995. Natural language understanding. Redwood City, CA: Benjamin/Cummings.
Anick, Peter G., 1994. Adapting a full-text information retrieval system to the computer troubleshooting domain. In: Proceedings of the Seventeenth International ACM SIGIR Conference on Research and Development in Information Retrieval, 349-358. ACM Press.
Antworth, Evan L., 1995. User’s guide to PC-KIMMO version 2. Academic Computing Department, The Summer Institute of Linguistics.
Backer, Andreas, Andreas Genau and Markus Sohlenkamp, 1991. The generic interactive application for C++ and OSF/MOTIF user manual and tutorial, version 2.0. Human-Computer Interaction Research Division, Institute for Applied Information Technology, German National Research Center for Computer Science. Anonymous ftp at ftp.gmd.de, directory gmd/ginaplus.
Biggerstaff, Ted J., B.G. Mitbander and D.E. Webster, 1994. Program understanding and the concept assignment problem. Communications of the ACM 37(5): 72-82.
Combits, 1997. The CS libraries: A database kit. P.O. Box 3301, 2280 GH Rijswijk, The Netherlands. http://www.combits.nl.
Curtis, David, 1996. Quick database. Joint Academic Department of Psychological Medicine, London Hospital Medical College.
Davies, Robert, 1995. NEWMATOX: A matrix library in C++. 16 Gloucester Street, Wilton, Wellington, New Zealand.
English, J., 1993. Class DOSThread: A base class for multithreaded DOS programs. Department of Computing, University of Brighton.
Etzkorn, Letha and Carl Davis, 1994. A documentation-related approach to object-oriented program understanding. In: Proceedings of the IEEE Third Workshop on Program Comprehension, 39-45. Los Alamitos, CA: IEEE Computer Society Press.
Etzkorn, Letha and Carl Davis, 1996. Automated object-oriented reusable component identification. Knowledge-Based Systems 9: 517-524.
Etzkorn, Letha and Carl Davis, 1997. Automatically identifying reusable components in object-oriented legacy code. IEEE Computer 30: 66-71.
Etzkorn, Letha and Harry Delugach, 2000. Towards a semantic metrics suite for object oriented design. In: Proceedings of the Thirty-Fourth International Conference on Technology of Object-Oriented Languages and Systems, 71-80. Los Alamitos, CA: IEEE Computer Society Press.
Fitzpatrick, Eileen, Joan Bachenko and Don Hindle, 1986. The status of telegraphic sublanguages. In: Ralph Grishman and Richard Kittredge, eds., Analyzing language in restricted domains, 39-51. Hillsdale, NJ: Erlbaum.
Grishman, Ralph and Richard Kittredge, eds., 1986. Analyzing language in restricted domains. Hillsdale, NJ: Erlbaum.
Kaufmann, Friedrich and Walter Meuller, 1995. MFLOAT 2.0. Technical University of Graz.
Kittredge, Richard, 1982. Variation and homogeneity of sublanguages. In: Richard Kittredge and John Lehrberger, eds., Sublanguage: Studies of language in restricted semantic domains, 107-137. New York: Walter de Gruyter.
Kittredge, Richard and John Lehrberger, eds., 1982. Sublanguage: Studies of language in restricted semantic domains. New York: Walter de Gruyter.
Koskenniemi, Kimmo, 1983. Two-level morphology: A general computational model for word form recognition and production. Publication No. 11. Helsinki: University of Helsinki Department of General Linguistics.
Krovetz, Robert and W. Bruce Croft, 1992. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems 10(2): 115-141.
Lane, Tom, Phil Gladstone, Luis Ortiz, Jim Boucher, Lee Crocker, Julian Minguillon, George Phillips, Davide Rossi and GC Weijers, 1996. JPEG. Independent JPEG Group. ftp.uu.net.
Laor, Ofer, 1992. ISC: Interrupt service class V.3.66. 27 KKL St., Kiriat Bialik, Israel.
Lehrberger, John, 1982. Automatic translation and the concept of sublanguage. In: Richard Kittredge and John Lehrberger, eds., Sublanguage: Studies of language in restricted semantic domains, 81-106. New York: Walter de Gruyter.
Lehrberger, John, 1986. Sublanguage analysis. In: Ralph Grishman and Richard Kittredge, eds., Analyzing language in restricted domains, 19-38. New York: Walter de Gruyter.
Lehrberger, John and L. Bourbeau, 1988. Machine translation: Linguistic characteristics of MT systems and general methodology of evaluation. Philadelphia, PA: Benjamin/Cummings.
Linguistics Data Consortium. The University of Pennsylvania, 3615 Market Street, Suite 200, Philadelphia, PA 19104-2608.
Marsh, Elaine and Carol Friedman, 1985. Transporting the linguistic string project system from a medical to a navy domain. ACM Transactions on Office Information Systems 3(2): 121-140.
Moreland, Carl, 1994. String++ V. 3.1. 4314 Filmore Road, Greensboro, NC 27409. http://oak.oakland.edu.
Navarro, Gonzalo and Ricardo Baeza-Yates, 1997. Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems 15(4): 400-435.
Ning, Jim, Andre Engberts and W. Voytek Kozaczynski, 1994. Automated support for legacy code understanding. Communications of the ACM 37(5): 39-51.
PAT, 1994. PAT-Parallelization Tool. Georgia Institute of Technology, ftp.cc.gatech.edu.
Rich, Elaine and Kevin Knight, 1991. Artificial intelligence, 399-417. New York: McGraw-Hill.
Rich, C. and Linda M. Wills, 1990. Recognizing a program’s design: A graph-parsing approach. IEEE Software 7(1): 82-89.
Serial, 1992. Serial. http://www.rt66.com/ftp/pc/windows.
Smart, Julian, 1994. Reference manual for wxwindows 1.60: A portable C++ GUI toolkit. Artificial Intelligence Applications Institute, University of Edinburgh. http://www.aiai.ed.ac.uk/~jacs/wx.
Veldhuizen, T., 1997. BLITZ++ users manual. Department of Computer Science, University of Waterloo, Canada. http://monet.uwaterloo.canada/blitz.
Watson, Mark, 1993. Portable GUI development with C++. New York: McGraw Hill.
Wunderling, Roland and Malte Zoeckler, 1996. DOC++: A C++ documentation system for LaTeX and HTML. Electronic Library for Mathematical Software, Berlin. http://www.zib.de/Visual/software/doc++/index.html.
Letha H. Etzkorn is an Assistant Professor in the computer science department at the University of Alabama in Huntsville. Prior to this, she spent several years in industry, most recently with Motorola, where she was a senior engineer responsible for real time hardware and software design. Her primary research interest is in artificial intelligence applications to software engineering. This research also includes natural language processing and knowledge-based systems. She is also interested in object-oriented software and object-oriented software metrics.

Carl G. Davis recently retired as chair of the computer science department at the University of Alabama in Huntsville, where he had been since 1986. Prior to joining the university, he worked for the United States government, where he was a senior executive responsible for R&D programs in real-time software development techniques. He was involved in the development of several early software development support environments for large, real time systems. His research interest is in software engineering, including object-oriented designs. He also works on a federally funded pre-college teacher training program to develop computational science as a new teaching paradigm.

Lisa L. Bowen is a lecturer and Ph.D. student in the computer science department at the University of Alabama in Huntsville. Prior to pursuing her Ph.D., she worked in industry for several years for various companies. Her most recent industrial employment was with Coleman Research Corporation, Huntsville, AL, where she was responsible for simulation software development and integration. She also served as leader of the Requirements Management Process Action Team. Her primary research interests are in software engineering and object-oriented software.