Diffstat (limited to 'gcc/doc/cppinternals.texi')
-rw-r--r-- gcc/doc/cppinternals.texi | 430
1 files changed, 430 insertions, 0 deletions
diff --git a/gcc/doc/cppinternals.texi b/gcc/doc/cppinternals.texi
new file mode 100644
index 00000000000..95d09817b1d
--- /dev/null
+++ b/gcc/doc/cppinternals.texi
@@ -0,0 +1,430 @@
+\input texinfo
+@setfilename cppinternals.info
+@settitle The GNU C Preprocessor Internals
+
+@ifinfo
+@dircategory Programming
+@direntry
+* Cpplib: (cppinternals). Cpplib internals.
+@end direntry
+@end ifinfo
+
+@c @smallbook
+@c @cropmarks
+@c @finalout
+@setchapternewpage odd
+@ifinfo
+This file documents the internals of the GNU C Preprocessor.
+
+Copyright 2000, 2001 Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+@ignore
+Permission is granted to process this file through TeX and print the
+results, provided the printed document carries copying permission
+notice identical to this one except for the removal of this paragraph
+(this paragraph not being relevant to the printed manual).
+
+@end ignore
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided also that
+the entire resulting derived work is distributed under the terms of a
+permission notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions.
+@end ifinfo
+
+@titlepage
+@c @finalout
+@title Cpplib Internals
+@subtitle Last revised Jan 2001
+@subtitle for GCC version 3.0
+@author Neil Booth
+@page
+@vskip 0pt plus 1filll
+@c man begin COPYRIGHT
+Copyright @copyright{} 2000, 2001
+Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided also that
+the entire resulting derived work is distributed under the terms of a
+permission notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions.
+@c man end
+@end titlepage
+@contents
+@page
+
+@node Top, Conventions,, (DIR)
+@chapter Cpplib - the core of the GNU C Preprocessor
+
+The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is
+now implemented as a library, cpplib, so it can be easily shared between
+a stand-alone preprocessor, and a preprocessor integrated with the C,
+C++ and Objective C front ends. It is also available for use by other
+programs, though this is not recommended as its exposed interface has
+not yet reached a point of reasonable stability.
+
+This library has been written to be re-entrant, so that it can be used
+to preprocess many files simultaneously if necessary. It has also been
+written with the preprocessing token as the fundamental unit; the
+preprocessor in previous versions of GCC would operate on text strings
+as the fundamental unit.
+
+This brief manual documents some of the internals of cpplib, and a few
+tricky issues encountered. It also describes certain behaviour we would
+like to preserve, such as the format and spacing of its output.
+
+Topics covered include identifiers, macro expansion, hash nodes and
+lexing.
+
+@menu
+* Conventions:: Conventions used in the code.
+* Lexer:: The combined C, C++ and Objective C Lexer.
+* Whitespace:: Input and output newlines and whitespace.
+* Hash Nodes:: All identifiers are hashed.
+* Macro Expansion:: Macro expansion algorithm.
+* Files:: File handling.
+* Index:: Index.
+@end menu
+
+@node Conventions, Lexer, Top, Top
+@unnumbered Conventions
+@cindex interface
+@cindex header files
+
+cpplib has two interfaces - one is exposed internally only, and the
+other is for both internal and external use.
+
+The convention is that functions and types that are exposed to multiple
+files internally are prefixed with @samp{_cpp_}, and are to be found in
+the file @samp{cpphash.h}. Functions and types exposed to external
+clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}. For
+historical reasons this is no longer quite true, but we should strive to
+stick to it.
+
+We are striving to reduce the information exposed in cpplib.h to the
+bare minimum necessary, and then to keep it there. This makes clear
+exactly what external clients are entitled to assume, and allows us to
+change internals in the future without worrying whether library clients
+are perhaps relying on some kind of undocumented implementation-specific
+behaviour.
+
+@node Lexer, Whitespace, Conventions, Top
+@unnumbered The Lexer
+@cindex lexer
+@cindex tokens
+
+The lexer is contained in the file @samp{cpplex.c}. We want to have a
+lexer that is single-pass, for efficiency reasons. We would also like
+the lexer to only step forwards through the input files, and not step
+back. This will make future changes to support different character
+sets, in particular state or shift-dependent ones, much easier.
+
+This file also contains all information needed to spell a token, i.e. to
+output it either in a diagnostic or to a preprocessed output file. This
+information is not exported, but made available to clients through such
+functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
+
+The most painful aspect of lexing ISO-standard C and C++ is handling
+trigraphs and backslash-escaped newlines. Trigraphs are processed before
+any interpretation of the meaning of a character is made, and unfortunately
+there is a trigraph representation for a backslash, so it is possible for
+the trigraph @samp{??/} to introduce an escaped newline.
+
+Escaped newlines are tedious because theoretically they can occur
+anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
+within the characters of an identifier, and even between the @samp{*}
+and @samp{/} that terminate a comment. Moreover, you cannot be sure
+there is just one - there might be an arbitrarily long sequence of them.
+
+So the routine @samp{parse_identifier}, which lexes an identifier, cannot
+assume that it can scan forwards until the first non-identifier
+character and be done with it, because this could be the @samp{\}
+introducing an escaped newline, or the @samp{?} introducing the trigraph
+sequence that represents the @samp{\} of an escaped newline. Similarly
+for the routine that handles numbers, @samp{parse_number}. If these
+routines stumble upon a @samp{?} or @samp{\}, they call
+@samp{skip_escaped_newlines} to skip over any potential escaped newlines
+before checking whether they can finish.
+
+Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
+check for a @samp{=} after a @samp{+} character to determine whether it
+has a @samp{+=} token; it needs to be prepared for an escaped newline of
+some sort. These cases use the function @samp{get_effective_char},
+which returns the first character after any intervening escaped newlines.
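
The idea of looking past escaped newlines before deciding on a token can be sketched as below. This is a minimal illustration, not cpplib's code: the names escaped_newline_len and next_effective_char are hypothetical, only a bare "\n" is treated as a newline, and trigraphs are assumed to be enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if an escaped newline starts at s, return its
   length (2 for backslash-newline, 4 for the backslash trigraph followed
   by a newline); otherwise return 0.  The trigraph is tested one
   character at a time so this source file contains no trigraph itself. */
static size_t escaped_newline_len(const char *s)
{
    if (s[0] == '\\' && s[1] == '\n')
        return 2;
    if (s[0] == '?' && s[1] == '?' && s[2] == '/' && s[3] == '\n')
        return 4;
    return 0;
}

/* Skip any run of escaped newlines starting at buf[*pos], then return
   the character found there and advance *pos past it.  Returns '\0'
   (without advancing) at the end of the buffer. */
static int next_effective_char(const char *buf, size_t *pos)
{
    size_t n;

    assert(buf != NULL && pos != NULL);
    while ((n = escaped_newline_len(buf + *pos)) != 0)
        *pos += n;
    if (buf[*pos] == '\0')
        return '\0';
    return (unsigned char) buf[(*pos)++];
}
```

A caller that has just consumed a `+` would ask for the next effective character and form a `+=` token only if the answer is `=`, however many escaped newlines sit in between.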
+
+The lexer needs to keep track of the correct column position,
+including counting tabs as specified by the @samp{-ftabstop=} option.
+This should be done even within comments; C-style comments can appear in
+the middle of a line, and we want to report diagnostics in the correct
+position for text appearing after the end of the comment.
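
As a rough sketch of tab-aware column counting, the following assumes 1-based columns and tab stops every tabstop columns, which is one plausible reading of @samp{-ftabstop=}. The function names are hypothetical, and the real lexer tracks columns incrementally rather than rescanning the line.

```c
#include <assert.h>
#include <stddef.h>

/* Advance a 1-based column over one character.  A tab moves to the
   column just after the next tab stop; anything else advances by one. */
static unsigned next_column(unsigned col, char c, unsigned tabstop)
{
    assert(col >= 1 && tabstop >= 1);
    if (c == '\t')
        return ((col - 1) / tabstop + 1) * tabstop + 1;
    return col + 1;
}

/* Column of the character at line[offset], counting every preceding
   character on the line, including those inside comments. */
static unsigned column_of(const char *line, size_t offset, unsigned tabstop)
{
    unsigned col = 1;
    size_t i;

    for (i = 0; i < offset; i++)
        col = next_column(col, line[i], tabstop);
    return col;
}
```

Because comment text is counted like any other text, a diagnostic for a token that follows a C-style comment on the same line still points at the right column.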
+
+Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
+may be invalid and require a diagnostic. However, if they appear in a
+macro expansion we don't want to complain with each use of the macro.
+It is therefore best to catch them during the lexing stage, in
+@samp{parse_identifier}. In both cases, whether a diagnostic is needed
+or not is dependent upon lexer state. For example, we don't want to
+issue a diagnostic for re-poisoning a poisoned identifier, or for using
+@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
+Therefore @samp{parse_identifier} makes use of flags to determine
+whether a diagnostic is appropriate. Since we change state on a
+per-token basis, and don't lex whole lines at a time, this is not a
+problem.
+
+Another place where state flags are used to change behaviour is whilst
+parsing header names. Normally, a @samp{<} is lexed as a token in its
+own right. After a @code{#include} directive, though, everything from
+the @samp{<} up to the nearest @samp{>} character should be lexed as a
+single header-name token. Note that we don't allow the terminators of
+header names to be escaped; the first @samp{"} or @samp{>} terminates
+the header name.
+
+Interpretation of some character sequences depends upon whether we are
+lexing C, C++ or Objective C, and on the revision of the standard in
+force. For example, @samp{::} is a single token in C++, but two
+separate @samp{:} tokens, and almost certainly a syntax error, in C.
+Such cases are handled in the main function @samp{_cpp_lex_token}, based
+upon the flags set in the @samp{cpp_options} structure.
+
+Note we have almost, but not quite, achieved the goal of not stepping
+backwards in the input stream. Currently @samp{skip_escaped_newlines}
+does step back, though with care it should be possible to adjust it so
+that this does not happen. One tricky case is meeting a trigraph when
+the command line option @samp{-trigraphs} is not in force but
+@samp{-Wtrigraphs} is: we need to warn about it, but then buffer it and
+continue to treat it as 3 separate characters.
+
+@node Whitespace, Hash Nodes, Lexer, Top
+@unnumbered Whitespace
+@cindex whitespace
+@cindex newlines
+@cindex escaped newlines
+@cindex paste avoidance
+@cindex line numbers
+
+The lexer has been written to treat each of @samp{\r}, @samp{\n},
+@samp{\r\n} and @samp{\n\r} as a single newline indicator. This allows
+it to transparently preprocess MS-DOS, Macintosh and Unix files without
+their needing to pass through a special filter beforehand.
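
A minimal sketch of recognising the four newline indicators follows. The names newline_len and count_newlines are hypothetical and this is not the code of cpplib's own newline routine; it only illustrates that a CR/LF pair in either order counts once.

```c
#include <assert.h>
#include <stddef.h>

/* If a newline indicator starts at s, return its length in characters:
   2 for a CR LF or LF CR pair, 1 for a lone CR or LF, 0 otherwise. */
static size_t newline_len(const char *s)
{
    if (s[0] != '\r' && s[0] != '\n')
        return 0;
    if ((s[1] == '\r' || s[1] == '\n') && s[1] != s[0])
        return 2;
    return 1;
}

/* Count the newline indicators in a NUL-terminated buffer. */
static unsigned count_newlines(const char *text)
{
    unsigned lines = 0;

    assert(text != NULL);
    while (*text != '\0') {
        size_t n = newline_len(text);
        if (n != 0) {
            lines++;
            text += n;
        } else {
            text++;
        }
    }
    return lines;
}
```

Note that two identical characters in a row, such as two line feeds, are two newlines; only a mixed pair is merged into one.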
+
+We also decided to treat a backslash, either @samp{\} or the trigraph
+@samp{??/}, separated from one of the above newline indicators by
+non-comment whitespace only, as intending to escape the newline. It
+tends to be a typing mistake, and cannot reasonably be mistaken for
+anything else in any of the C-family grammars. Since handling it this
+way is not strictly conforming to the ISO standard, the library issues a
+warning wherever it encounters it.
+
+Handling newlines like this is made simpler by doing it in one place
+only. The function @samp{handle_newline} takes care of all newline
+characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
+long sequences of escaped newlines, deferring to @samp{handle_newline}
+to handle the newlines themselves.
+
+Another whitespace issue only concerns the stand-alone preprocessor: we
+want to guarantee that re-reading the preprocessed output results in an
+identical token stream. Without taking special measures, this might not
+be the case because of macro substitution. We could simply insert a
+space between adjacent tokens, but ideally we would like to keep this to
+a minimum, both for aesthetic reasons and because it causes problems for
+people who still try to abuse the preprocessor for things like Fortran
+source and Makefiles.
+
+The token structure contains a flags byte, and two flags are of interest
+here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE}
+indicates that the token was preceded by whitespace; if this is the case
+we need not worry about it incorrectly pasting with its predecessor.
+The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
+indicates that paste avoidance by insertion of a space to the left of
+the token may be necessary. Recursively, the first token of a macro
+substitution, the first token after a macro substitution, the first
+token of a substituted argument, and the first token after a substituted
+argument are all flagged @samp{AVOID_LPASTE} by the macro expander.
+
+If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
+and the routine @samp{cpp_avoid_paste} determines that it might be
+misinterpreted by the lexer if a space is not inserted between it and
+the immediately preceding token, then stand-alone CPP's output routines
+will insert a space between them. To avoid excessive spacing,
+@samp{cpp_avoid_paste} tries hard to only request a space if one is
+likely to be necessary, but for reasons of efficiency it is slightly
+conservative and might recommend a space where one is not strictly
+needed.
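
The decision described above can be sketched as follows. This is an illustration under stated assumptions: the flag values, struct tok, might_paste and space_before are all hypothetical, the operator table is deliberately incomplete (it ignores preprocessing numbers, for instance), and the real @samp{cpp_avoid_paste} works on token types rather than spellings.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Flag names follow the text; the bit values are assumptions. */
enum { PREV_WHITE = 1 << 0, AVOID_LPASTE = 1 << 1 };

struct tok {
    const char *spelling;
    unsigned char flags;
};

static int is_idchar(int c)
{
    return isalnum(c) || c == '_';
}

/* Could the last character of left and the first character of right be
   re-lexed as part of one token if written with no space between? */
static int might_paste(const char *left, const char *right)
{
    static const char *const pairs[] = {
        "++", "--", "<<", ">>", "&&", "||", "->", "##", "::", "==",
        "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "<=", ">=",
        "!=", "//", "/*", "..", NULL
    };
    size_t n, i;
    char l, r;

    assert(left != NULL && right != NULL);
    n = strlen(left);
    if (n == 0 || right[0] == '\0')
        return 0;
    l = left[n - 1];
    r = right[0];
    if (is_idchar((unsigned char) l) && is_idchar((unsigned char) r))
        return 1;
    for (i = 0; pairs[i] != NULL; i++)
        if (pairs[i][0] == l && pairs[i][1] == r)
            return 1;
    return 0;
}

/* Should the output routine print a space before tok? */
static int space_before(const struct tok *prev, const struct tok *tok)
{
    if (tok->flags & PREV_WHITE)
        return 1;                     /* whitespace was there already */
    if (tok->flags & AVOID_LPASTE)
        return might_paste(prev->spelling, tok->spelling);
    return 0;
}
```

Tokens that were adjacent in the original source carry neither flag and are printed as they were written, so only tokens brought together by macro expansion can gain a space.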
+
+Finally, the preprocessor takes great care to ensure it keeps track of
+both the position of a token in the source file, for diagnostic
+purposes, and where it should appear in the output file, because using
+CPP for other languages like assembler requires this. The two positions
+may differ for the following reasons:
+
+@itemize @bullet
+@item
+Escaped newlines are deleted, so lines spliced in this way are joined to
+form a single logical line.
+
+@item
+A macro expansion replaces the tokens that form its invocation, but any
+newlines appearing in the macro's arguments are interpreted as a single
+space, with the result that the macro's replacement appears in full on
+the same line that the macro name appeared in the source file. This is
+particularly important for stringification of arguments - newlines
+embedded in the arguments must appear in the string as spaces.
+@end itemize
+
+The source file location is maintained in the @samp{lineno} member of the
+@samp{cpp_buffer} structure, and the column number is inferred from the
+current position in the buffer relative to the @samp{line_base} buffer
+variable, which is updated with every newline whether escaped or not.
+
+TODO: Finish this.
+
+@node Hash Nodes, Macro Expansion, Whitespace, Top
+@unnumbered Hash Nodes
+@cindex hash table
+@cindex identifiers
+@cindex macros
+@cindex assertions
+@cindex named operators
+
+When cpplib encounters an "identifier", it generates a hash code for it
+and stores it in the hash table. By "identifier" we mean tokens with
+type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
+well as keywords, directive names, macro names and so on. For example,
+all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed
+when lexed.
+
+Each node in the hash table contains various information about the
+identifier it represents, such as its length and type. At any one
+time, each identifier falls into exactly one of three categories:
+
+@itemize @bullet
+@item Macros
+
+These have been declared to be macros, either on the command line or
+with @code{#define}. A few, such as @samp{__TIME__}, are builtins
+entered in the hash table during initialisation. The hash node for a
+normal macro points to a structure with more information about the
+macro, such as whether it is function-like, how many arguments it takes,
+and its expansion. Builtin macros are flagged as special, and instead
+contain an enum indicating which of the various builtin macros it is.
+
+@item Assertions
+
+Assertions are in a separate namespace to macros. To enforce this, cpp
+actually prepends a @code{#} character to the assertion's name before
+hashing it and entering it in the hash table. An assertion's node
+points to a chain of answers to that assertion.
+
+@item Void
+
+Everything else falls into this category - an identifier that is not
+currently a macro, or a macro that has since been undefined with
+@code{#undef}.
+
+When preprocessing C++, this category also includes the named operators,
+such as @samp{xor}. In expressions these behave like the operators they
+represent, but in contexts where the spelling of a token matters they
+are spelt differently. This spelling distinction is relevant when they
+are operands of the stringizing and pasting macro operators @code{#} and
+@code{##}. Named operator hash nodes are flagged, both to catch the
+spelling distinction and to prevent them from being defined as macros.
+@end itemize
+
+Identical identifiers share the same hash node. Since each identifier
+token, after lexing, contains a pointer to its hash node, this is used
+to provide rapid lookup of various information. For example, when
+parsing a @code{#define} directive, CPP flags each argument's identifier
+hash node with the index of that argument. This makes duplicated
+argument checking an O(1) operation for each argument. Similarly, for
+each identifier in the macro's expansion, lookup to see if it is an
+argument, and which argument it is, is also an O(1) operation. Further,
+each directive name, such as @samp{endif}, has an associated directive
+enum stored in its hash node, so that directive lookup is also O(1).
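
The duplicate-argument check can be illustrated with a toy interning table. Everything here is a sketch under assumptions: struct node, intern, mark_params and clear_params are hypothetical names, and the fixed-size chained table stands in for whatever cpplib actually uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64   /* arbitrary size for the sketch */

/* Hypothetical hash node: one per distinct identifier spelling. */
struct node {
    struct node *next;    /* bucket chain */
    char *name;
    unsigned arg_index;   /* 1-based parameter index, 0 if not one */
};

static struct node *table[TABLE_SIZE];

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char) *s++;
    return h % TABLE_SIZE;
}

/* Return the unique node for name, creating it on first sight. */
static struct node *intern(const char *name)
{
    unsigned h;
    size_t len;
    struct node *n;

    assert(name != NULL);
    h = hash_name(name);
    len = strlen(name) + 1;
    for (n = table[h]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    n = malloc(sizeof *n);
    if (n == NULL || (n->name = malloc(len)) == NULL)
        abort();
    memcpy(n->name, name, len);
    n->arg_index = 0;
    n->next = table[h];
    table[h] = n;
    return n;
}

/* Flag each parameter's node with its index.  Returns 0 on success, or
   the 1-based position of the first duplicated parameter.  Once the
   node is in hand, the duplicate test is a single field comparison. */
static unsigned mark_params(const char **names, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++) {
        struct node *h = intern(names[i]);
        if (h->arg_index != 0)
            return i + 1;
        h->arg_index = i + 1;
    }
    return 0;
}

/* Undo mark_params once the macro definition has been parsed. */
static void clear_params(const char **names, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++)
        intern(names[i])->arg_index = 0;
}
```

In the lexer itself the token already carries a pointer to its node, so no lookup by name is needed at all; the calls to intern here merely stand in for that pointer.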
+
+@node Macro Expansion, Files, Hash Nodes, Top
+@unnumbered Macro Expansion Algorithm
+
+@node Files, Index, Macro Expansion, Top
+@unnumbered File Handling
+@cindex files
+
+Fairly obviously, the file handling code of cpplib resides in the file
+@samp{cppfiles.c}. It takes care of the details of file searching,
+opening, reading and caching, for both the main source file and all the
+headers it recursively includes.
+
+The basic strategy is to minimize the number of system calls. On many
+systems, the basic @code{open ()} and @code{fstat ()} system calls can
+be quite expensive. For every @code{#include}-d file, we need to try
+all the directories in the search path until we find a match. Some
+projects, such as glibc, pass twenty or thirty include paths on the
+command line, so this can rapidly become time-consuming.
+
+For a header file we have not encountered before we have little choice
+but to do this. However, it is often the case that the same headers are
+repeatedly included, and in these cases we try to avoid repeating the
+filesystem queries whilst searching for the correct file.
+
+For each file we try to open, we store the constructed path in a splay
+tree. This path first undergoes simplification by the function
+@code{_cpp_simplify_pathname}. For example,
+@samp{/usr/include/bits/../foo.h} is simplified to
+@samp{/usr/include/foo.h} before we enter it in the splay tree and try
+to @code{open ()} the file. CPP will then find subsequent uses of
+@samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
+save system calls.
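
The kind of textual simplification described above might look like the following. This is a sketch, not the body of @code{_cpp_simplify_pathname}: simplify_path is a hypothetical name, only absolute POSIX-style paths are handled, and symbolic links are not consulted.

```c
#include <assert.h>
#include <string.h>

/* Simplify an absolute path in place: collapse repeated slashes, drop
   "." components and resolve ".." against the preceding component.
   The result is never longer than the input. */
static void simplify_path(char *path)
{
    char *out = path;        /* write cursor */
    const char *in = path;   /* read cursor */

    assert(path != NULL && path[0] == '/');
    while (*in != '\0') {
        const char *start;
        size_t len;

        while (*in == '/')                  /* skip separators */
            in++;
        start = in;
        while (*in != '\0' && *in != '/')   /* find end of component */
            in++;
        len = (size_t) (in - start);

        if (len == 0 || (len == 1 && start[0] == '.'))
            continue;
        if (len == 2 && start[0] == '.' && start[1] == '.') {
            /* Back up over the last component written, if any. */
            while (out > path && *--out != '/')
                ;
            continue;
        }
        *out++ = '/';
        memmove(out, start, len);
        out += len;
    }
    if (out == path)         /* everything cancelled out: keep the root */
        *out++ = '/';
    *out = '\0';
}
```

Keying the cache on the simplified string means two different spellings of the same path share one entry, which is what lets later lookups skip the filesystem.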
+
+Further, it is likely the file contents have also been cached, saving a
+@code{read ()} system call. We don't bother caching the contents of
+header files that are re-inclusion protected, and whose re-inclusion
+macro is defined when we leave the header file for the first time. If
+the host supports it, we try to map suitably large files into memory,
+rather than reading them in directly.
+
+The include paths are internally stored on a null-terminated
+singly-linked list, starting with the @code{"header.h"} directory search
+chain, which then links into the @code{<header.h>} directory chain.
+
+Files included with the @code{<foo.h>} syntax start the lookup directly
+in the second half of this chain. However, files included with the
+@code{"foo.h"} syntax start at the beginning of the chain, but with one
+extra directory prepended. This is the directory of the current file;
+the one containing the @code{#include} directive. Prepending this
+directory on a per-file basis is handled by the function
+@code{search_from}.
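
The two starting points on the chain can be sketched as below. The types and functions (struct dir, search_chain, lookup) are hypothetical, and fake_exists is a stub standing in for the real filesystem probe so that the example needs no files on disk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One include directory; the quote chain's last entry links to the
   first entry of the angle-bracket chain. */
struct dir {
    const char *name;
    struct dir *next;
};

typedef int (*exists_fn)(const char *path);

/* Stub filesystem for the sketch: only these two files "exist". */
static int fake_exists(const char *path)
{
    return strcmp(path, "/proj/src/local.h") == 0
        || strcmp(path, "/usr/include/stdio.h") == 0;
}

/* Try header in each directory from start onwards.  On success the
   full path is left in out and 1 is returned. */
static int search_chain(const struct dir *start, const char *header,
                        exists_fn exists, char *out, size_t outsz)
{
    for (; start != NULL; start = start->next) {
        int n = snprintf(out, outsz, "%s/%s", start->name, header);
        if (n > 0 && (size_t) n < outsz && exists(out))
            return 1;
    }
    return 0;
}

/* Angle-bracket includes start at bracket_chain.  Quoted includes start
   at the including file's directory, then fall through the whole chain. */
static int lookup(const char *current_dir, const char *header, int angled,
                  struct dir *quote_chain, struct dir *bracket_chain,
                  exists_fn exists, char *out, size_t outsz)
{
    struct dir here;

    assert(header != NULL && out != NULL);
    if (angled)
        return search_chain(bracket_chain, header, exists, out, outsz);
    here.name = current_dir;
    here.next = quote_chain;
    return search_chain(&here, header, exists, out, outsz);
}
```

Building the extra entry on the stack keeps the shared chain untouched, so each including file can prepend its own directory without affecting any other.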
+
+Note that a header included with a directory component, such as
+@code{#include "mydir/foo.h"} and opened as
+@samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
+the basename @samp{foo.h} as the current directory.
+
+Enough information is stored in the splay tree that CPP can immediately
+tell whether it can skip the header file because of the multiple include
+optimisation, whether the file didn't exist or couldn't be opened for
+some reason, or whether the header was flagged not to be re-used, as it
+is with the obsolete @code{#import} directive.
+
+For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
+CPP offers the ability to treat various include file names as aliases
+for the real header files with shorter names. The map from one to the
+other is found in a special file called @samp{header.gcc}, stored in the
+command line (or system) include directories to which the mapping
+applies. This may be higher up the directory tree than the full path to
+the file minus the base name.
+
+@node Index,, Files, Top
+@unnumbered Index
+@printindex cp
+
+@bye