Diffstat (limited to 'gcc/f/ffe.texi')
-rw-r--r-- | gcc/f/ffe.texi | 1141 |
1 files changed, 1139 insertions, 2 deletions
diff --git a/gcc/f/ffe.texi b/gcc/f/ffe.texi index 4108bb850bc..e30333280d1 100644 --- a/gcc/f/ffe.texi +++ b/gcc/f/ffe.texi @@ -11,15 +11,1047 @@ This chapter describes some aspects of the design and implementation of the @code{g77} front end. +Much of the information below applies not to current +releases of @code{g77}, +but to the 0.6 rewrite being designed and implemented +as of late May, 1999. + +To find out about things that are ``To Be Determined'' or ``To Be Done'', +search for the string TBD. +If you want to help by working on one or more of these items, +email me at @email{@value{email-burley}}. +If you're planning to do more than just research issues and offer comments, +see @uref{http://www.gnu.org/software/contribute.html} for steps you might +need to take first. @menu +* Overview of Sources:: +* Overview of Translation Process:: * Philosophy of Code Generation:: * Two-pass Design:: * Challenges Posed:: * Transforming Statements:: * Transforming Expressions:: +* Internal Naming Conventions:: @end menu +@node Overview of Sources +@section Overview of Sources + +The current directory layout includes the following: + +@table @file +@item @value{srcdir}/gcc/ +Non-g77 files in gcc + +@item @value{srcdir}/gcc/f/ +GNU Fortran front end sources + +@item @value{srcdir}/libf2c/ +@code{libg2c} configuration and @code{g2c.h} file generation + +@item @value{srcdir}/libf2c/libF77/ +General support and math portion of @code{libg2c} + +@item @value{srcdir}/libf2c/libI77/ +I/O portion of @code{libg2c} + +@item @value{srcdir}/libf2c/libU77/ +Additional interfaces to Unix @code{libc} for @code{libg2c} +@end table + +Components of note in @code{g77} are described below. + +@file{f/} as a whole contains the source for @code{g77}, +while @file{libf2c/} contains a portion of the separate program +@code{f2c}. +Note that the @code{libf2c} code is not part of the program @code{g77}, +just distributed with it.
+ +@file{f/} contains text files that document the Fortran compiler, source +files for the GNU Fortran Front End (FFE), and some other stuff. +The @code{g77} compiler code is placed in @file{f/} because it, +along with its contents, +is designed to be a subdirectory of a @code{gcc} source directory, +@file{gcc/}, +which is structured so that language-specific front ends can be ``dropped +in'' as subdirectories. +The C++ front end (@code{g++}) is an example of this---it resides in +the @file{cp/} subdirectory. +Note that the C front end (also referred to as @code{gcc}) +is an exception to this, as its source files reside +in the @file{gcc/} directory itself. + +@file{libf2c/} contains the run-time libraries for the @code{f2c} program, +also used by @code{g77}. +These libraries are normally referred to collectively as @code{libf2c}. +When built as part of @code{g77}, +@code{libf2c} is installed under the name @code{libg2c} to avoid +conflict with any existing version of @code{libf2c}, +and thus is often referred to as @code{libg2c} when the +@code{g77} version is specifically being referred to. + +The @code{netlib} version of @code{libf2c/} +contains two distinct libraries, +@code{libF77} and @code{libI77}, +each in its own subdirectory. +In @code{g77}, this distinction is not made, +beyond maintaining the subdirectory structure in the source-code tree. + +@file{libf2c/} is not part of the program @code{g77}, +just distributed with it. +It contains files not present +in the official (@code{netlib}) version of @code{libf2c}, +and also contains some minor changes made from @code{libf2c}, +to fix some bugs, +and to facilitate automatic configuration, building, and installation of +@code{libf2c} (as @code{libg2c}) for use by @code{g77} users. +See @file{libf2c/README} for more information, +including licensing conditions +governing distribution of programs containing code from @code{libg2c}.
+ +@code{libg2c}, @code{g77}'s version of @code{libf2c}, +adds Dave Love's implementation of @code{libU77}, +in the @file{libf2c/libU77/} directory. +This library is distributed under the +GNU Library General Public License (LGPL)---see the +file @file{libf2c/libU77/COPYING.LIB} +for more information, +as this license +governs distribution conditions for programs containing code +from this portion of the library. + +Files of note in @file{f/} and @file{libf2c/} are described below: + +@table @file +@item f/BUGS +Lists some important bugs known to be in g77. +Or use Info (or GNU Emacs Info mode) to read +the ``Actual Bugs'' node of the @code{g77} documentation: + +@smallexample +info -f f/g77.info -n "Actual Bugs" +@end smallexample + +@item f/ChangeLog +Lists recent changes to @code{g77} internals. + +@item libf2c/ChangeLog +Lists recent changes to @code{libg2c} internals. + +@item f/NEWS +Contains the per-release changes. +These include the user-visible +changes described in the node ``Changes'' +in the @code{g77} documentation, plus internal +changes of import. +Or use: + +@smallexample +info -f f/g77.info -n News +@end smallexample + +@item f/g77.info* +The @code{g77} documentation, in Info format, +produced by building @code{g77}. + +All users of @code{g77} (not just installers) should read this, +using the @code{more} command if neither the @code{info} command, +nor GNU Emacs (with its Info mode), are available, or if users +aren't yet accustomed to using these tools. +All of these files are readable as ``plain text'' files, +though they're easier to navigate using Info readers +such as @code{info} and GNU Emacs Info mode. +@end table + +If you want to explore the FFE code, which lives entirely in @file{f/}, +here are a few clues. 
+The file @file{g77spec.c} contains the @code{g77}-specific source code +for the @code{g77} command only---this just forms a variant of the +@code{gcc} command, so, +just as the @code{gcc} command itself does not contain the C front end, +the @code{g77} command does not contain the Fortran front end (FFE). +The FFE code ends up in an executable named @file{f771}, +which does the actual compiling, +so it contains the FFE plus the @code{gcc} back end (GBE), +the latter to do most of the optimization, and the code generation. + +The file @file{parse.c} is the source file for @code{yyparse()}, +which is invoked by the GBE to start the compilation process, +for @file{f771}. + +The file @file{top.c} contains the top-level FFE function @code{ffe_file} +and it (along with top.h) define all @samp{ffe_[a-z].*}, @samp{ffe[A-Z].*}, +and @samp{FFE_[A-Za-z].*} symbols. + +The file @file{fini.c} is a @code{main()} program that is used when building +the FFE to generate C header and source files for recognizing keywords. +The files @file{malloc.c} and @file{malloc.h} comprise a memory manager +that defines all @samp{malloc_[a-z].*}, @samp{malloc[A-Z].*}, and +@samp{MALLOC_[A-Za-z].*} symbols. + +All other modules named @var{xyz} +are comprised of all files named @samp{@var{xyz}*.@var{ext}} +and define all @samp{ffe@var{xyz}_[a-z].*}, @samp{ffe@var{xyz}[A-Z].*}, +and @samp{FFE@var{XYZ}_[A-Za-z].*} symbols. +If you understand all this, congratulations---it's easier for me to remember +how it works than to type in these regular expressions. +But it does make it easy to find where a symbol is defined. +For example, the symbol @samp{ffexyz_set_something} would be defined +in @file{xyz.h} and implemented there (if it's a macro) or in @file{xyz.c}. 
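The symbol-naming rules above are easier to apply mechanically than to read. As a rough sketch (the helper name `module_files` is hypothetical and not part of the FFE, and it assumes every module follows the `.h`/`.c` pairing used in the `ffexyz_set_something` example), a lookup from symbol to files might look like this:

```python
import re

# Hypothetical helper, not part of g77: given an exported symbol, guess
# the header and source file that the naming rules say define it.
def module_files(symbol):
    # Top-level FFE symbols: ffe_[a-z].*, ffe[A-Z].*, FFE_[A-Za-z].*
    if re.match(r"ffe_[a-z]|ffe[A-Z]|FFE_[A-Za-z]", symbol):
        return ("top.h", "top.c")
    # Memory manager: malloc_[a-z].*, malloc[A-Z].*, MALLOC_[A-Za-z].*
    if re.match(r"malloc_[a-z]|malloc[A-Z]|MALLOC_[A-Za-z]", symbol):
        return ("malloc.h", "malloc.c")
    # Any other module xyz: ffexyz_[a-z].*, ffexyz[A-Z].*, FFEXYZ_[A-Za-z].*
    m = (re.match(r"ffe([a-z0-9]+)_[a-z]", symbol)
         or re.match(r"ffe([a-z0-9]+)[A-Z]", symbol)
         or re.match(r"FFE([A-Z0-9]+)_[A-Za-z]", symbol))
    if m is None:
        return None  # not an FFE-style symbol
    mod = m.group(1).lower()
    return (mod + ".h", mod + ".c")
```

For instance, `module_files("ffexyz_set_something")` yields the `xyz.h`/`xyz.c` pair named in the text, while a symbol such as `printf` yields `None`.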
+ +The ``porting'' files of note currently are: + +@table @file +@item proj.c +@itemx proj.h +This defines the ``language'' used by all the other source files, +the language being Standard C plus some useful things +like @code{ARRAY_SIZE} and such. + +@item target.c +@itemx target.h +These describe the target machine +in terms of what data types are supported, +how they are denoted +(to what C type does an @code{INTEGER*8} map, for example), +how to convert between them, +and so on. +Over time, versions of @code{g77} rely less on this file +and more on run-time configuration based on GBE info +in @file{com.c}. + +@item com.c +@itemx com.h +These are the primary interface to the GBE. + +@item ste.c +@itemx ste.h +This contains code for implementing recognized executable statements +in the GBE. + +@item src.c +@itemx src.h +These contain information on the format(s) of source files +(such as whether they are never to be processed as case-insensitive +with regard to Fortran keywords). +@end table + +If you want to debug the @file{f771} executable, +for example if it crashes, +note that the global variables @code{lineno} and @code{input_filename} +are usually set to reflect the current line being read by the lexer +during the first-pass analysis of a program unit and to reflect +the current line being processed during the second-pass compilation +of a program unit. + +If an invocation of the function @code{ffestd_exec_end} is on the stack, +the compiler is in the second pass, otherwise it is in the first. + +(This information might help you reduce a test case and/or work around +a bug in @code{g77} until a fix is available.) 
+ +@node Overview of Translation Process +@section Overview of Translation Process + +The order of phases translating source code to the form accepted +by the GBE is: + +@enumerate +@item +Stripping punched-card sources (@file{g77stripcard.c}) + +@item +Lexing (@file{lex.c}) + +@item +Stand-alone statement identification (@file{sta.c}) + +@item +Parsing (@file{stb.c} and @file{expr.c}) + +@item +Constructing (@file{stc.c}) + +@item +Collecting (@file{std.c}) + +@item +Expanding (@file{ste.c}) +@end enumerate + +To get a rough idea of how a particularly twisted Fortran statement +gets treated by the passes, consider: + +@smallexample + FORMAT(I2 4H)=(J/ + & I3) +@end smallexample + +The job of @file{lex.c} is to know enough about Fortran syntax rules +to break the statement up into distinct lexemes without requiring +any feedback from subsequent phases: + +@smallexample +`FORMAT' +`(' +`I24H' +`)' +`=' +`(' +`J' +`/' +`I3' +`)' +@end smallexample + +The job of @file{sta.c} is to figure out the kind of statement, +or, at least, statement form, that sequence of lexemes represents. + +The sooner it can do this (in terms of using the smallest number of +lexemes, starting with the first for each statement), the better, +because that leaves diagnostics for problems beyond the recognition +of the statement form to subsequent phases, +which can usually better describe the nature of the problem. + +In this case, the @samp{=} at ``level zero'' +(not nested within parentheses) +tells @file{sta.c} that this is an @emph{assignment-form}, +not @code{FORMAT}, statement. + +An assignment-form statement might be a statement-function +definition or an executable assignment statement. + +To make that determination, +@file{sta.c} looks at the first two lexemes. + +Since the second lexeme is @samp{(}, +the first must represent an array for this to be an assignment statement, +else it's a statement function.
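The ``level zero'' test described above is simple enough to sketch. This is only an illustration under simplifying assumptions, not how @file{sta.c} is (or will be) written: the function name `statement_form` and the label `keyword-form` are made up here, and statements such as `DO`, which also have a level-zero `=`, are deliberately ignored.

```python
# Illustrative sketch only: decide whether a lexeme sequence is an
# assignment-form statement by looking for "=" outside all parentheses.
def statement_form(lexemes):
    depth = 0  # current parenthesis nesting level
    for lexeme in lexemes:
        if lexeme == "(":
            depth += 1
        elif lexeme == ")":
            depth -= 1
        elif lexeme == "=" and depth == 0:
            return "assignment-form"
    return "keyword-form"  # hypothetical label for everything else


# The lexemes produced for the twisted FORMAT example above.
example = ["FORMAT", "(", "I24H", ")", "=", "(", "J", "/", "I3", ")"]
```

Applied to `example`, the function reports an assignment-form statement, since the `=` follows the closing parenthesis of `(I24H)` and so sits at level zero.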
+ +Either way, @file{sta.c} hands off the statement to @file{stb.c} +(either its statement-function parser or its assignment-statement parser). + +@file{stb.c} forms a +statement-specific record containing the pertinent information. +That information includes a source expression and, +for an assignment statement, a destination expression. +Expressions are parsed by @file{expr.c}. + +This record is passed to @file{stc.c}, +which copes with the implications of the statement +within the context established by previous statements. + +For example, if it's the first statement in the file +or after an @code{END} statement, +@file{stc.c} recognizes that, first of all, +a main program unit is now being lexed +(and tells that to @file{std.c} +before telling it about the current statement). + +@file{stc.c} attaches whatever information it can, +usually derived from the context established by the preceding statements, +and passes the information to @file{std.c}. + +@file{std.c} saves this information away, +since the GBE cannot cope with information +that might be incomplete at this stage. + +For example, @samp{I3} might later be determined +to be an argument to an alternate @code{ENTRY} point. + +When @file{std.c} is told about the end of an external (top-level) +program unit, +it passes all the information it has saved away +on statements in that program unit +to @file{ste.c}. + +@file{ste.c} ``expands'' each statement, in sequence, by +constructing the appropriate GBE information and calling +the appropriate GBE routines. + +Details on the transformational phases follow. +Keep in mind that Fortran numbering is used, +so the first character on a line is column 1, +decimal numbering is used, and so on. 
+ +@menu +* g77stripcard:: +* lex.c:: +* sta.c:: +* stb.c:: +* expr.c:: +* stc.c:: +* std.c:: +* ste.c:: + +* Gotchas (Transforming):: +* TBD (Transforming):: +@end menu + +@node g77stripcard +@subsection g77stripcard + +The @code{g77stripcard} program handles removing content beyond +column 72 (adjustable via a command-line option), +optionally warning about that content being something other +than trailing whitespace or Fortran commentary. + +This program is needed because @code{lex.c} doesn't pay attention +to maximum line lengths at all, to make it easier to maintain, +as well as faster (for sources that don't depend on the maximum +column length vis-a-vis trailing non-blank non-commentary content). + +Just how this program will be run---whether automatically for +old source (perhaps as the default for @file{.f} files?)---is not +yet determined. + +In the meantime, it might as well be implemented as a typical UNIX pipe. + +It should accept a @samp{-fline-length-@var{n}} option, +with the default line length set to 72. + +When the text it strips off the end of a line is not blank +(not spaces and tabs), +it should insert an additional comment line +(beginning with @samp{!}, +so it works for both fixed-form and free-form files) +containing the text, +following the stripped line. +The inserted comment should have a prefix of some kind, +TBD, that distinguishes the comment as representing stripped text. +Users could use that to @code{sed} out such lines, if they wished---it +seems silly to provide a command-line option to delete information +when it can be so easily filtered out by another program. + +(This inserted comment should be designed to ``fit in'' well +with whatever the Fortran community is using these days for +preprocessor, translator, and other such products, like OpenMP. +What that's all about, and how @code{g77} can elegantly fit its +special comment conventions into it all, is TBD as well. 
+We don't want to reinvent the wheel here, but if there turn out +to be too many conflicting conventions, we might have to invent +one that looks nothing like the others, but which offers their +host products a better infrastructure in which to fit and coexist +peacefully.) + +@code{g77stripcard} probably shouldn't do any tab expansion or other +fancy stuff. +People can use @code{expand} or other pre-filtering if they like. +The idea here is to keep each stage quite simple, while providing +excellent performance for ``normal'' code. + +(Code with junk beyond column 73 is not really ``normal'', +as it comes from a card-punch heritage, +and will be increasingly hard for tomorrow's Fortran programmers to read.) + +@node lex.c +@subsection lex.c + +To help make the lexer simple, fast, and easy to maintain, +while also having @code{g77} generally encourage Fortran programmers +to write simple, maintainable, portable code by maximizing the +performance of compiling that kind of code: + +@itemize @bullet +@item +There'll be just one lexer, for both fixed-form and free-form source. + +@item +It'll care about the form only when handling the first 7 columns of +text, stuff like spaces between strings of alphanumerics, and +how lines are continued. + +Some other distinctions will be handled by subsequent phases, +so at least one of them will have to know which form is involved. + +For example, @samp{I = 2 . 4} is acceptable in fixed form, +and works in free form as well given the implementation @code{g77} +presently uses. +But the standard requires a diagnostic for it in free form, +so the parser has to be able to recognize that +the lexemes aren't contiguous +(information the lexer @emph{does} have to provide) +and that free-form source is being parsed, +so it can provide the diagnostic. + +The @code{g77} lexer doesn't try to gather @samp{2 . 4} into a single lexeme. 
+Otherwise, it'd have to know a whole lot more about how to parse Fortran, +or subsequent phases (mainly parsing) would have two paths through +lots of critical code---one to handle the lexeme @samp{2}, @samp{.}, +and @samp{4} in sequence, another to handle the lexeme @samp{2.4}. + +@item +It won't worry about line lengths +(beyond the first 7 columns for fixed-form source). + +That is, once it starts parsing the ``statement'' part of a line +(column 7 for fixed-form, column 1 for free-form), +it'll keep going until it finds a newline, +rather than ignoring everything past a particular column +(72 or 132). + +The implication here is that there shouldn't @emph{be} +anything past that last column, other than whitespace or +commentary, because users using typical editors +(or viewing output as typically printed) +won't necessarily know just where the last column is. + +Code that has ``garbage'' beyond the last column +(almost certainly only fixed-form code with a punched-card legacy, +such as code using columns 73-80 for ``sequence numbers'') +will have to be run through @code{g77stripcard} first. + +Also, keeping track of the maximum column position while also watching out +for the end of a line @emph{and} while reading from a file +just makes things slower. +Since a file must be read, and watching for the end of the line +is necessary (unless the typical input file was preprocessed to +include the necessary number of trailing spaces), +dropping the tracking of the maximum column position +is the only way to reduce the complexity of the pertinent code +while maintaining high performance. + +@item +ASCII encoding is assumed for the input file. + +Code written in other character sets will have to be converted first. + +@item +Tabs (ASCII code 9) +will be converted to spaces via the straightforward +approach. 
+ +Specifically, a tab is converted to between one and eight spaces +as necessary to reach column @var{n}, +where dividing @samp{(@var{n} - 1)} by eight +results in a remainder of zero. + +@item +Linefeeds (ASCII code 10) +mark the ends of lines. + +@item +A carriage return (ASCII code 13) +is accepted if it immediately precedes a linefeed, +in which case it is ignored. + +Otherwise, it is rejected (with a diagnostic). + +@item +Any characters other than the above +that are not part of the GNU Fortran Character Set +(@pxref{Character Set}) +are rejected with a diagnostic. + +This includes backspaces, form feeds, and the like. + +(It might make sense to allow a form feed in column 1 +as long as that's the only character on a line. +It certainly wouldn't seem to cost much in terms of performance.) + +@item +The end of the input stream (EOF) +ends the current line. + +@item +The distinction between uppercase and lowercase letters +will be preserved. + +It will be up to subsequent phases to decide to fold case. + +Current plans are to permit any casing for Fortran (reserved) keywords +while preserving casing for user-defined names. +(This might not be made the default for @file{.f} files, though.) + +Preserving case seems necessary to provide more direct access +to facilities outside of @code{g77}, such as to C or Pascal code. + +Names of intrinsics will probably be matchable in any case. +However, there probably won't be any option to require +a particular mixed-case appearance of intrinsics +(as there was for @code{g77} prior to version 0.6), +because that's painful to maintain, +and probably nobody uses it. + +(How @samp{external SiN; r = sin(x)} would be handled is TBD. +I think old @code{g77} might already handle that pretty elegantly, +but whether we can cope with allowing the same fragment to reference +a @emph{different} procedure, even with the same interface, +via @samp{s = SiN(r)}, needs to be determined.
+If it can't, we need to make sure that when code introduces +a user-defined name, any intrinsic matching that name +using a case-insensitive comparison +is ``turned off''.) + +@item +Backslashes in @code{CHARACTER} and Hollerith constants +are not allowed. + +This avoids the confusion introduced by some Fortran compiler vendors +providing C-like interpretation of backslashes, +while others provide straight-through interpretation. + +Some kind of lexical construct (TBD) will be provided to allow +flagging of a @code{CHARACTER} +(but probably not a Hollerith) +constant that permits backslashes. +It'll necessarily be a prefix, such as: + +@smallexample +PRINT *, C'This line has a backspace \b here.' +PRINT *, F'This line has a straight backslash \ here.' +@end smallexample + +Further, command-line options might be provided to specify that +one prefix or the other is to be assumed as the default +for @code{CHARACTER} constants. + +However, it seems more helpful for @code{g77} to provide a program +that prefixes all constants +(or just those containing backslashes) +with the desired designation, +so printouts of code can be read +without knowing the compile-time options used when compiling it. + +If such a program is provided +(let's name it @code{g77slash} for now), +then a command-line option to @code{g77} should not be provided. +(Though, given that it'll be easy to implement, it might be hard +to resist user requests for it ``to compile faster than if we +have to invoke another filter''.) + +This program would take a command-line option to specify the +default interpretation of backslashes, +affecting which prefix it uses for constants. + +@code{g77slash} probably should automatically convert Hollerith +constants that contain backslashes +to the appropriate @code{CHARACTER} constants. +Then @code{g77} wouldn't have to define a prefix syntax for Hollerith +constants specifying whether they want C-style or straight-through +backslashes.
+@end itemize + +The above implements nearly exactly what is specified by +@ref{Character Set}, +and +@ref{Lines}, +except it also provides automatic conversion of tabs +and ignoring of newline-related carriage returns. + +It also effects the ``pure visual'' model, +by which is meant that a user viewing his code +in a typical text editor +(assuming it's not preprocessed via @code{g77stripcard} or similar) +doesn't need any special knowledge +of whether spaces on the screen are really tabs, +whether lines end immediately after the last visible non-space character +or after a number of spaces and tabs that follow it, +or whether the last line in the file is ended by a newline. + +Most editors don't make these distinctions, +the ANSI FORTRAN 77 standard doesn't require them to, +and it permits a standard-conforming compiler +to define a method for transforming source code to +``standard form'' however it wants. + +So, GNU Fortran defines it such that users have the best chance +of having the code be interpreted the way it looks on the screen +of the typical editor. + +(Fancy editors should @emph{never} be required to correctly read code +written in classic two-dimensional-plaintext form. +By correct reading I mean ability to read it, book-like, without +mistaking text ignored by the compiler for program code and vice versa, +and without having to count beyond the first several columns. +The vague meaning of ASCII TAB, among other things, complicates +this somewhat, but as long as ``everyone'', including the editor, +other tools, and printer, agrees about the every-eighth-column convention, +the GNU Fortran ``pure visual'' model meets these requirements. +Any language or user-visible source form +requiring special tagging of tabs, +the ends of lines after spaces/tabs, +and so on, is broken by this definition. +Fortunately, Fortran @emph{itself} is not broken, +even if most vendor-supplied defaults for their Fortran compilers @emph{are} +in this regard.) 
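The tab rule in the list above (advance to the next column n for which n - 1 is divisible by eight, always emitting at least one space) can be sketched as follows. This is a minimal illustration, not the planned lexer code; the function name `expand_tabs` is an assumption, and columns are numbered from 1 as in the rest of this chapter.

```python
# Minimal sketch of the every-eighth-column tab rule, for one line of text.
def expand_tabs(line):
    out = []
    for ch in line:
        if ch == "\t":
            col = len(out) + 1                 # 1-based column of the tab
            n = ((col - 1) // 8 + 1) * 8 + 1   # next column with (n - 1) % 8 == 0
            out.extend(" " * (n - col))        # between one and eight spaces
        else:
            out.append(ch)
    return "".join(out)
```

For a single line of text this gives the same result as Python's built-in `str.expandtabs(8)`; the explicit loop is shown only to make the column arithmetic visible.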
+ +Further, this model provides a clean interface +to whatever preprocessors or code-generators are used +to produce input to this phase of @code{g77}. +Mainly, they need not worry about long lines. + +@node sta.c +@subsection sta.c + +@node stb.c +@subsection stb.c + +@node expr.c +@subsection expr.c + +@node stc.c +@subsection stc.c + +@node std.c +@subsection std.c + +@node ste.c +@subsection ste.c + +@node Gotchas (Transforming) +@subsection Gotchas (Transforming) + +This section is not about transforming ``gotchas'' into something else. +It is about the weirder aspects of transforming Fortran, +however that's defined, +into a more modern, canonical form. + +@subsubsection Multi-character Lexemes + +Each lexeme carries with it a pointer to where it appears in the source. + +To provide the ability for diagnostics to point to column numbers, +in addition to line numbers and names, +lexemes that represent more than one (significant) character +in the source code need, generally, +to provide pointers to where each @emph{character} appears in the source. + +This provides the ability to properly identify the precise location +of the problem in code like + +@smallexample +SUBROUTINE X +END +BLOCK DATA X +END +@end smallexample + +which, in fixed-form source, would result in single lexemes +consisting of the strings @samp{SUBROUTINEX} and @samp{BLOCKDATAX}. +(The problem is that @samp{X} is defined twice, +so a pointer to the @samp{X} in the second definition, +as well as a follow-up pointer to the corresponding pointer in the first, +would be preferable to pointing to the beginnings of the statements.) + +This need also arises when parsing (and diagnosing) @code{FORMAT} +statements. + +Further, it arises when diagnosing +@code{FMT=} specifiers that contain constants +(or partial constants, or even propagated constants!) 
+in I/O statements, as in: + +@smallexample +PRINT '(I2, 3HAB)', J +@end smallexample + +(A pointer to the beginning of the prematurely-terminated Hollerith +constant, and/or to the close parenthesis, is preferable to a pointer +to the open-parenthesis or the apostrophe that precedes it.) + +Multi-character lexemes, which would seem to naturally include +at least digit strings, alphanumeric strings, @code{CHARACTER} +constants, and Hollerith constants, therefore need to provide +location information on each character. +(Maybe Hollerith constants don't, but it's unnecessary to except them.) + +The question then arises, what about @emph{other} multi-character lexemes, +such as @samp{**} and @samp{//}, +and Fortran 90's @samp{(/}, @samp{/)}, @samp{::}, and so on? + +Turns out there's a need to identify the location of the second character +of these two-character lexemes. +For example, in @samp{I(/J) = K}, the slash needs to be diagnosed +as the problem, not the open parenthesis. +Similarly, it is preferable to diagnose the second slash in +@samp{I = J // K} rather than the first, given the implicit typing +rules, which would result in the compiler disallowing the attempted +concatenation of two integers. +(Though, since that's more of a semantic issue, +it's not @emph{that} much preferable.) + +Even sequences that could be parsed as digit strings could use location info, +for example, to diagnose the @samp{9} in the octal constant @samp{O'129'}. +(This probably will be parsed as a character string, +to be consistent with the parsing of @samp{Z'129A'}.) + +To avoid the hassle of recording the location of the second character, +while also preserving the general rule that each significant character +is distinctly pointed to by the lexeme that contains it, +it's best to simply not have any fixed-size lexemes +larger than one character.
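To make the rule just stated concrete, here is a small sketch of a fixed-form lexer in which every punctuation character is its own lexeme and every alphanumeric lexeme records the column of each of its characters. It is an illustration under heavy assumptions, not the planned @file{lex.c}: the name `lex_fixed_form` is made up, columns are counted from the start of the string passed in, and `CHARACTER` and Hollerith constants are not handled at all.

```python
# Sketch: lex one fixed-form statement into (text, columns) pairs.
# Blanks are insignificant, so "BLOCK DATA X" becomes one lexeme whose
# column list still says where each character came from.
def lex_fixed_form(text):
    lexemes = []
    run_chars, run_cols = [], []   # alphanumeric run being gathered

    def flush():
        if run_chars:
            lexemes.append(("".join(run_chars), list(run_cols)))
            run_chars.clear()
            run_cols.clear()

    for col, ch in enumerate(text, start=1):
        if ch == " ":
            continue                    # blanks never end a lexeme here
        if ch.isalnum():
            run_chars.append(ch)
            run_cols.append(col)
        else:
            flush()
            lexemes.append((ch, [col]))  # punctuation: always one character
    flush()
    return lexemes
```

With this, `lex_fixed_form("I = J // K")` produces two separate `/` lexemes at columns 7 and 8, so a diagnostic can point at either slash, and `lex_fixed_form("BLOCK DATA X")` produces a single `BLOCKDATAX` lexeme whose last column entry points at the `X`.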
+ +This new design is expected to make checking for two +@samp{*} lexemes in a row much easier than the old design, +so this is not much of a sacrifice. +It probably makes the lexer much easier to implement +than it makes the parser harder. + +@subsubsection Space-padding Lexemes + +Certain lexemes need to be padded with virtual spaces when the +end of the line (or file) is encountered. + +This is necessary in fixed form, to handle lines that don't +extend to column 72, assuming that's the line length in effect. + +@subsubsection Bizarre Free-form Hollerith Constants + +Last I checked, the Fortran 90 standard actually required the compiler +to silently accept something like + +@smallexample +FORMAT ( 1 2 Htwelve chars ) +@end smallexample + +as a valid @code{FORMAT} statement specifying a twelve-character +Hollerith constant. + +The implication here is that, since the new lexer is a zero-feedback one, +it won't know that the special case of a @code{FORMAT} statement being parsed +requires apparently distinct lexemes @samp{1} and @samp{2} to be treated as +a single lexeme. + +(This is a horrible misfeature of the Fortran 90 language. +It's one of many such misfeatures that almost make me want +to not support them, and forge ahead with designing a new +``GNU Fortran'' language that has the features, +but not the misfeatures, of Fortran 90, +and provide utility programs to do the conversion automatically.) + +So, the lexer must gather distinct chunks of decimal strings into +a single lexeme in contexts where a single decimal lexeme might +start a Hollerith constant. + +(Which probably means it might as well do that all the time +for all multi-character lexemes, even in free-form mode, +leaving it to subsequent phases to pull them apart as they see fit.) 
+ +Compare the treatment of this to how + +@smallexample +CHARACTER * 4 5 HEY +@end smallexample + +and + +@smallexample +CHARACTER * 12 HEY +@end smallexample + +must be treated---the former must be diagnosed, due to the separation +between lexemes, the latter must be accepted as a proper declaration. + +@subsubsection Hollerith Constants + +Recognizing a Hollerith constant---specifically, +that an @samp{H} or @samp{h} after a digit string begins +such a constant---requires some knowledge of context. + +Hollerith constants (such as @samp{2HAB}) can appear after: + +@itemize @bullet +@item +@samp{(} + +@item +@samp{,} + +@item +@samp{=} + +@item +@samp{+}, @samp{-}, @samp{/} + +@item +@samp{*}, except as noted below +@end itemize + +Hollerith constants don't appear after: + +@itemize @bullet +@item +@samp{CHARACTER*}, +which can be treated generally as +any @samp{*} that is the second lexeme of a statement +@end itemize + +@subsubsection Confusing Function Keyword + +While + +@smallexample +REAL FUNCTION FOO () +@end smallexample + +must be a @code{FUNCTION} statement and + +@smallexample +REAL FUNCTION FOO (5) +@end smallexample + +must be a type-definition statement, + +@smallexample +REAL FUNCTION FOO (@var{names}) +@end smallexample + +where @var{names} is a comma-separated list of names, +can be one or the other. + +The only way to disambiguate that statement +(short of mandating free-form source or a short maximum +length for names of external procedures) +is based on the context of the statement. + +In particular, if the statement is known to be within an +already-started program unit +(but not at the outer level of the @code{CONTAINS} block), +it is a type-declaration statement. + +Otherwise, the statement is a @code{FUNCTION} statement, +in that it begins a function program unit +(external, or, within @code{CONTAINS}, nested).
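The three cases above reduce to a short decision procedure. The sketch below is only a restatement of the rule in code, with assumed names throughout: `classify_real_function` is hypothetical, `arg_lexemes` stands for the lexemes between the parentheses, and the two booleans stand for context that a later phase would have to supply.

```python
# Sketch: classify "REAL FUNCTION FOO (...)" in fixed-form source.
def classify_real_function(arg_lexemes, in_program_unit, at_contains_outer_level):
    items = [a for a in arg_lexemes if a != ","]
    if not items:
        return "FUNCTION"            # "()" can only be a FUNCTION statement
    # Simplified test for a name: starts with a letter, alphanumeric throughout.
    if not all(a[0].isalpha() and a.isalnum() for a in items):
        return "type-declaration"    # e.g. "(5)" declares an array
    # A list of names is ambiguous, so fall back on context.
    if in_program_unit and not at_contains_outer_level:
        return "type-declaration"
    return "FUNCTION"
```

So a name list such as `A, B` is classified as a type declaration inside an already-started program unit and as a `FUNCTION` statement otherwise, while the empty and numeric cases never consult the context.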
+ +@subsubsection Weird READ + +The statement + +@smallexample +READ (N) +@end smallexample + +is equivalent to either + +@smallexample +READ (UNIT=(N)) +@end smallexample + +or + +@smallexample +READ (FMT=(N)) +@end smallexample + +depending on which would be valid in context. + +Specifically, if @samp{N} is type @code{INTEGER}, +@samp{READ (FMT=(N))} would not be valid, +because parentheses may not be used around @samp{N}, +whereas they may around it in @samp{READ (UNIT=(N))}. + +Further, if @samp{N} is type @code{CHARACTER}, +the opposite is true---@samp{READ (UNIT=(N))} is not valid, +but @samp{READ (FMT=(N))} is. + +Strictly speaking, if anything follows + +@smallexample +READ (N) +@end smallexample + +in the statement, whether the first lexeme after the close +parenthesis is a comma could be used to disambiguate the two cases, +without looking at the type of @samp{N}, +because the comma is required for the @samp{READ (FMT=(N))} +interpretation and disallowed for the @samp{READ (UNIT=(N))} +interpretation. + +However, in practice, many Fortran compilers allow +the comma for the @samp{READ (UNIT=(N))} +interpretation anyway +(in that they generally allow a leading comma before +an I/O list in an I/O statement), +and much code takes advantage of this allowance. + +(This is quite a reasonable allowance, since the +juxtaposition of a comma-separated list immediately +after an I/O control-specification list, which is also comma-separated, +without an intervening comma, +looks sufficiently ``wrong'' to programmers +that they can't resist the itch to insert the comma. +@samp{READ (I, J), K, L} simply looks cleaner than +@samp{READ (I, J) K, L}.) + +So, type-based disambiguation is needed unless strict adherence +to the standard is always assumed, and we're not going to assume that. + +@node TBD (Transforming) +@subsection TBD (Transforming) + +Continue researching gotchas, designing the transformational process, +and implementing it.
+ +Specific issues to resolve: + +@itemize @bullet +@item +Just where should @code{INCLUDE} processing take place? + +Clearly before (or part of) statement identification (@file{sta.c}), +since determining whether @samp{I(J)=K} is a statement-function +definition or an assignment statement requires knowing the context, +which in turn requires having processed @code{INCLUDE} files. + +@item +Just where should (if it were implemented) @code{USE} processing take place? + +This gets into the whole issue of how @code{g77} should handle the concept +of modules. +I think GNAT already takes on this issue, but don't know more than that. +Jim Giles has written extensively on @code{comp.lang.fortran} +about his opinions on module handling, as have others. +Jim's views should be taken into account. + +Actually, Richard M. Stallman (RMS) also has written up +some guidelines for implementing such things, +but I'm not sure where I read them. +Perhaps the old @email{gcc2@@cygnus.com} list. + +If someone could dig up references to these and get them to me, +that would be much appreciated! +Even though modules are not on the short-term list for implementation, +it'd be helpful to know @emph{now} how to avoid making them harder to +implement @emph{later}. + +@item +Should the @code{g77} command become just a script that invokes +all the various preprocessing that might be needed, +thus making it seem slower than necessary for legacy code +that people are unwilling to convert, +or should we provide a separate script for that, +thus encouraging people to convert their code once and for all? + +At least, a separate script to behave as old @code{g77} did, +perhaps named @code{g77old}, might ease the transition, +as might a corresponding one that converts source codes +named @code{g77oldnew}.
+ +These scripts would take all the pertinent options @code{g77} used +to take and run the appropriate filters, +passing the results to @code{g77} or just making new sources out of them +(in a subdirectory, leaving the user to do the dirty deed of +moving or copying them over the old sources). + +@item +Do other Fortran compilers provide a prefix syntax +to govern the treatment of backslashes in @code{CHARACTER} +(or Hollerith) constants? + +Knowing what other compilers provide would help. + +@item +Is it okay to drop support for the @samp{-fintrin-case-initcap}, +@samp{-fmatch-case-initcap}, @samp{-fsymbol-case-initcap}, +and @samp{-fcase-initcap} options? + +I've asked @email{info-gnu-fortran@@gnu.org} for input on this. +Not having to support these makes it easier to write the new front end, +and might also avoid complicating its design. +@end itemize + @node Philosophy of Code Generation @section Philosophy of Code Generation @@ -476,7 +1508,7 @@ Further, after the @code{SYSTEM_CLOCK} library routine returns, the compiler must ensure that the temporary variable it wrote is copied into the appropriate element of the @samp{CLOCKS} array. (This assumes the compiler doesn't just reject the code, -which it should if it is compiling under some kind of a "strict" option.) +which it should if it is compiling under some kind of a ``strict'' option.) @item To determine the correct index into the @samp{CLOCKS} array, @@ -882,6 +1914,111 @@ to hold the value of the expression. @item Other stuff??? +@end itemize +@node Internal Naming Conventions +@section Internal Naming Conventions -@end itemize +Names exported by FFE modules have the following (regular-expression) forms. +Note that all names beginning @code{ffe@var{mod}} or @code{FFE@var{mod}}, +where @var{mod} is lowercase or uppercase alphanumerics, respectively, +are exported by the module @code{ffe@var{mod}}, +with the source code doing the exporting in @file{@var{mod}.h}.
+(Usually, the source code for the implementation is in @file{@var{mod}.c}.) + +Identifiers that don't fit the following forms +are not considered exported, +even if they are according to the C language. +(For example, they might be made available to other modules +solely for use within expansions of exported macros, +not for use within any source code in those other modules.) + +@table @code +@item ffe@var{mod} +The single typedef exported by the module. + +@item FFE@var{umod}_[A-Z][A-Z0-9_]* +(Where @var{umod} is the uppercase form of @var{mod}.) + +A @code{#define} or @code{enum} constant of the type @code{ffe@var{mod}}. + +@item ffe@var{mod}[A-Z][A-Z][a-z0-9]* +A typedef exported by the module. + +The portion of the identifier after @code{ffe@var{mod}} is +referred to as @code{ctype}, a capitalized (mixed-case) form +of @code{type}. + +@item FFE@var{umod}_@var{type}[A-Z][A-Z0-9_]*[A-Z0-9]? +(Where @var{umod} is the uppercase form of @var{mod}.) + +A @code{#define} or @code{enum} constant of the type +@code{ffe@var{mod}@var{type}}, +where @var{type} is the lowercase form of @var{ctype} +in an exported typedef. + +@item ffe@var{mod}_@var{value} +A function that does or returns something, +as described by @var{value} (see below). + +@item ffe@var{mod}_@var{value}_@var{input} +A function that does or returns something based +primarily on the thing described by @var{input} (see below). +@end table + +Below are names used for @var{value} and @var{input}, +along with their definitions. + +@table @code +@item col +A column number within a line (first column is number 1). + +@item file +An encapsulation of a file's name. + +@item find +Looks up an instance of some type that matches specified criteria, +and returns that, even if it has to create a new instance or +crash trying to find it (as appropriate). + +@item initialize +Initializes, usually a module. No type. + +@item int +A generic integer of type @code{int}.
+ +@item is +A generic integer that contains a true (non-zero) or false (zero) value. + +@item len +A generic integer that contains the length of something. + +@item line +A line number within a source file, +or a global line number. + +@item lookup +Looks up an instance of some type that matches specified criteria, +and returns that, or returns nil. + +@item name +A @code{text} that points to a name of something. + +@item new +Makes a new instance of the indicated type. +Might return an existing one if appropriate---if so, +similar to @code{find} without crashing. + +@item pt +Pointer to a particular character (line, column pairs) +in the input file (source code being compiled). + +@item run +Performs some herculean task. No type. + +@item terminate +Terminates, usually a module. No type. + +@item text +A @code{char *} that points to generic text. +@end table
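Some of the exported-name forms can be checked mechanically.
The Python sketch below is an illustration only, not part of the FFE:
the helper @code{classify_export} and the module name @code{foo} are hypothetical,
and it assumes that @var{value} is a single lowercase word
with any remainder after the next underscore taken as @var{input}.
It covers just the module typedef, the @code{FFE@var{umod}_} constants,
and the two function forms.

```python
import re


def classify_export(name, mod):
    """Return (kind, value, input) for an exported name, or None.

    kind is "typedef", "constant", or "function" (hypothetical labels).
    """
    umod = mod.upper()

    # ffe<mod>: the single typedef exported by the module.
    if name == "ffe" + mod:
        return ("typedef", None, None)

    # FFE<umod>_[A-Z][A-Z0-9_]*: a #define or enum constant.
    if re.fullmatch("FFE" + re.escape(umod) + r"_[A-Z][A-Z0-9_]*", name):
        return ("constant", None, None)

    # ffe<mod>_<value> or ffe<mod>_<value>_<input>: a function.
    m = re.fullmatch(
        "ffe" + re.escape(mod) + r"_([a-z0-9]+)(?:_([a-z0-9_]+))?", name)
    if m:
        return ("function", m.group(1), m.group(2))

    # Anything else is not considered exported by this sketch.
    return None
```

For example, under these assumptions @code{ffefoo_len_text} reads as a function
of module @code{foo} that returns a length based primarily on some text.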