General Docs

Top

Contains:

  • AST_Design: Overview of ASTs
  • AST Types: Listing of all AST types along with their Code type interface.
  • Parsing: Overview of the parsing interface.
  • Parser Algo: In-depth breakdown of the parser's implementation.

CURRENTLY UNSUPPORTED

There is no support for validating expressions.
It's a todo.

Only trivial template support is provided. The intention is for only simple, non-recursive substitution. The parameters of the template are treated like regular parameter AST entries. This means that the typename entry for the parameter AST would be either:

  • class
  • typename
  • A fundamental type, function, or pointer type.

Concepts and Constraints are not supported
It's a todo.

Feature Macros

  • GEN_DEFINE_ATTRIBUTE_TOKENS : Allows user to define their own attribute macros for use in parsing.
    • This can be generated using base.cpp.
  • GEN_DEFINE_LIBRARY_CODE_CONSTANTS : Defines optional typename codes; they are non-standard to C/C++ and not necessary for library usage.
  • GEN_DONT_ENFORCE_GEN_TIME_GUARD : By default, the library ( gen.hpp/ gen.cpp ) expects the macro GEN_TIME to be defined; this disables that.
  • GEN_ENFORCE_STRONG_CODE_TYPES : Enforces casts to filtered code types.
  • GEN_EXPOSE_BACKEND : Will expose symbols meant for internal use only.
  • GEN_ROLL_OWN_DEPENDENCIES : Optional override so that user may define the dependencies themselves.
  • GEN_DONT_ALLOW_INVALID_CODE (Not implemented yet) : Will fail when an invalid code is constructed, parsed, or serialized.
  • GEN_C_LIKE_CPP : Setting to <true or 1> will prevent usage of function definitions using references and structs with member functions. Structs will still have user-defined operator conversions, for-range support, and other operator overloads.

The Data & Interface

The library's persistent state is tracked by a context struct: global Context* _ctx; defined within static_data.cpp

struct Context
{
// User Configuration

    // Persistent Data Allocation
    AllocatorInfo Allocator_DyanmicContainers; // By default will use a general slab allocator (TODO(Ed): Currently does not)
    AllocatorInfo Allocator_Pool;              // By default will use the growing vmem reserve (TODO(Ed): Currently does not)
    AllocatorInfo Allocator_StrCache;          // By default will use a dedicated slab allocator (TODO(Ed): Currently does not)

    // Temporary Allocation
    AllocatorInfo Allocator_Temp;

    // LoggerCallaback* log_callback; // TODO(Ed): Impl user logger callback as an option.

    u32 Max_CommentLineLength; // Used by def_comment
    u32 Max_StrCacheLength;    // Any cached string longer than this is always allocated again.

    u32 InitSize_BuilderBuffer;
    u32 InitSize_CodePoolsArray;
    u32 InitSize_StringArenasArray;

    u32 CodePool_NumBlocks;

    // TODO(Ed): Review these... (No longer needed if using the proper allocation strategy)
    u32 InitSize_LexArena;
    u32 SizePer_StringArena;

    // TODO(Ed): Symbol Table
    // Keep track of all resolved symbols (namespaced identifiers)

// Parser

    // Used by the lexer to persistently treat all these identifiers as preprocessor defines.
    // Populate with strings via gen::cache_str.
    // Functional defines must have format: id( ; at minimum to indicate that the define is only valid with arguments.
    Array(StrCached) PreprocessorDefines;

// Backend

    // The fallback allocator is utilized if any of the three above allocators is not specified by the user.
    u32 InitSize_Fallback_Allocator_Bucket_Size;
    Array(Arena) Fallback_AllocatorBuckets;

    // Array(Token) LexerTokens;

    Array(Pool)  CodePools;
    Array(Arena) StringArenas;

    StringTable StrCache;

    // TODO(Ed): This needs to be just handled by a parser context
    Arena        LexArena;
    StringTable  Lexer_defines;
    Array(Token) Lexer_Tokens;

    // TODO(Ed): Active parse context vs a parse result need to be separated conceptually
    ParseContext parser;
};

The interface for the context:

  • init: Initialization.
  • deinit: De-initialization.
  • reset: Clears the allocations, but doesn't free the memory, then calls init() on _ctx again.
  • get_context: Retrieve the currently tracked context.
  • set_context: Swap out the current tracked context.

Allocator usage

  • Allocator_DyanmicContainers: Growing arrays, hash tables. (Unbounded sized containers)
  • Allocator_Pool: Fixed-sized object allocations (ASTs, etc)
  • Allocator_StrCache: StrCached allocations
  • Allocator_Temp: Temporary allocations mostly intended for StrBuilder usage. Manually cleared by the user at their own discretion.

The allocator definitions used are exposed to the user in case they want to dictate memory usage:

  • Allocators are defined with the AllocatorInfo structure found in memory.hpp
  • Most of the work is just defining the allocation procedure:
    void* ( void* allocator_data, AllocType type, ssize size, ssize alignment, void* old_memory, ssize old_size, u64 flags );

For any allocator above that the user does not define before init, a fallback allocator will be assigned that utilizes the fallback_allocator_proc within interface.cpp.
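
As a rough illustration of what defining an allocation procedure involves, here is a minimal sketch of a malloc-backed procedure matching the signature above. The enumerator names of AllocType, the two-field layout of AllocatorInfo, and the CountingHeap state are assumptions made for this sketch, not the library's definitions; check memory.hpp for the real ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-ins for the library's typedefs; the real definitions live in its headers.
using ssize = std::ptrdiff_t;
using u64   = std::uint64_t;

// Assumed request kinds; the library's actual AllocType enumerators may differ.
enum AllocType { AllocType_Alloc, AllocType_Free, AllocType_FreeAll, AllocType_Resize };

using AllocatorProc = void*( void* allocator_data, AllocType type, ssize size, ssize alignment, void* old_memory, ssize old_size, u64 flags );

// Assumed shape of AllocatorInfo: a procedure plus the user data handed back to it.
struct AllocatorInfo
{
    AllocatorProc* Proc;
    void*          Data;
};

// Hypothetical user state: counts the allocations that are still live.
struct CountingHeap
{
    int live = 0;
};

void* counting_heap_proc( void* allocator_data, AllocType type, ssize size, ssize alignment, void* old_memory, ssize old_size, u64 flags )
{
    (void)alignment; (void)old_size; (void)flags; // this sketch relies on malloc's default alignment
    assert( allocator_data != nullptr );
    CountingHeap* heap = static_cast<CountingHeap*>( allocator_data );

    switch ( type )
    {
        case AllocType_Alloc:
        {
            void* memory = std::malloc( static_cast<std::size_t>( size ) );
            if ( memory ) ++heap->live;
            return memory;
        }
        case AllocType_Resize:
        {
            void* memory = std::realloc( old_memory, static_cast<std::size_t>( size ) );
            if ( memory && old_memory == nullptr ) ++heap->live; // resize of nothing acts as an allocation
            return memory;
        }
        case AllocType_Free:
            if ( old_memory ) { std::free( old_memory ); --heap->live; }
            return nullptr;
        case AllocType_FreeAll:
            return nullptr; // a plain heap cannot release everything at once
    }
    return nullptr;
}
```

Under these assumptions, a user would fill an AllocatorInfo with the procedure and a pointer to their state, then assign it to the relevant Context field before calling init.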

As mentioned in the root readme, the user is provided Code objects by calling the constructor functions to generate them or find existing matches.

The AST is managed by the library and provided to the user via its interface.
However, the user may specify the memory configuration.

Data layout of AST struct (subject to heavy change with upcoming todos)

struct AST
{
    union {
        struct
        {
            Code InlineCmt;  // Class, Constructor, Destructor, Enum, Friend, Function, Operator, OpCast, Struct, Typedef, Using, Variable
            Code Attributes; // Class, Enum, Function, Struct, Typedef, Union, Using, Variable
            Code Specs;      // Destructor, Function, Operator, Typename, Variable
            union {
                Code InitializerList; // Constructor
                Code ParentType;      // Class, Struct, ParentType->Next has a possible list of interfaces.
                Code ReturnType;      // Function, Operator, Typename
                Code UnderlyingType;  // Enum, Typedef
                Code ValueType;       // Parameter, Variable
            };
            union {
                Code Macro;               // Parameter
                Code BitfieldSize;        // Variable (Class/Struct Data Member)
                Code Params;              // Constructor, Function, Operator, Template, Typename
                Code UnderlyingTypeMacro; // Enum
            };
            union {
                Code ArrExpr;     // Typename
                Code Body;        // Class, Constructor, Destructor, Enum, Friend, Function, Namespace, Struct, Union
                Code Declaration; // Friend, Template
                Code Value;       // Parameter, Variable
            };
            union {
                Code NextVar;       // Variable; Possible way to handle comma separated variables declarations. ( , NextVar->Specs NextVar->Name NextVar->ArrExpr = NextVar->Value )
                Code SuffixSpecs;   // Only used with typenames, to store the function suffix if typename is function signature. ( May not be needed )
                Code PostNameMacro; // Only used with parameters for specifically UE_REQUIRES (Thanks Unreal)
            };
        };
        StrCached Content; // Attributes, Comment, Execution, Include
        struct {
            Specifier ArrSpecs[AST_ArrSpecs_Cap]; // Specifiers
            Code      NextSpecs;                  // Specifiers; If ArrSpecs is full, then NextSpecs is used.
        };
    };
    StrCached Name;
    union {
        Code Prev;
        Code Front;
        Code Last;
    };
    union {
        Code Next;
        Code Back;
    };
    Token*     Token; // Reference to starting token, only available if it was derived from parsing.
    Code       Parent;
    CodeType   Type;
    // CodeFlag CodeFlags;
    ModuleFlag ModuleFlags;
    union {
        b32 IsFunction; // Used by typedef to not serialize the name field.
        struct {
            b16          IsParamPack; // Used by typename to know if type should be considered a parameter pack.
            ETypenameTag TypeTag;     // Used by typename to keep track of explicitly declared tags for the identifier (enum, struct, union)
        };
        Operator   Op;
        AccessSpec ParentAccess;
        s32        NumEntries;
        s32        VarParenthesizedInit; // Used by variables to know that initialization is using a constructor expression instead of an assignment expression.
    };
};
static_assert( sizeof(AST) == AST_POD_Size, "ERROR: AST is not size of AST_POD_Size" );

StrCached is a typedef for Str (a string slice), to denote it is an interned string
CodeType is an enum tagging the type of code. It has an underlying type of u32
OperatorT is a typedef for EOperator::Type which has an underlying type of u32
StrBuilder is the dynamically allocating string builder type for the library

AST widths are set up to be AST_POD_Size (128 bytes by default). The width dictates how much the static array can hold before it must give way to using an allocated array:

constexpr static
int AST_ArrSpecs_Cap =
(
    AST_POD_Size
    - sizeof(Code)
    - sizeof(StrCached)
    - sizeof(Code) * 2
    - sizeof(Token*)
    - sizeof(Code)
    - sizeof(CodeType)
    - sizeof(ModuleFlag)
    - sizeof(u32)
)
/ sizeof(Specifier) - 1;
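
To make the arithmetic concrete, the sketch below evaluates the same expression with stand-in types. Every size here is an assumption for a typical 64-bit target (8-byte pointers, a 16-byte string slice, 4-byte enums); the library's real types decide the actual value.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in types with assumed sizes; the real ones come from the library headers.
struct Code      { void* ast; };                           // wrapper around an AST pointer
struct StrCached { char const* Ptr; std::ptrdiff_t Len; }; // string slice: pointer + length
struct Token;                                              // only used through a pointer
using CodeType   = std::uint32_t;
using ModuleFlag = std::uint32_t; // assumed to be 4 bytes
using Specifier  = std::uint32_t; // assumed to be 4 bytes
using u32        = std::uint32_t;

constexpr int AST_POD_Size = 128;

constexpr int AST_ArrSpecs_Cap =
(
    AST_POD_Size
    - sizeof(Code)       // 8
    - sizeof(StrCached)  // 16
    - sizeof(Code) * 2   // 16
    - sizeof(Token*)     // 8
    - sizeof(Code)       // 8
    - sizeof(CodeType)   // 4
    - sizeof(ModuleFlag) // 4
    - sizeof(u32)        // 4
)
/ sizeof(Specifier) - 1; // 60 bytes left / 4 bytes per specifier - 1
```

With these assumed sizes, 60 bytes remain after the fixed fields, so the inline array would hold 14 specifiers before NextSpecs has to be used.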

Data Notes:

  • ASTs are wrapped for the user in a Code struct, which is a wrapper for an AST* type.
  • Code types have member symbols but their data layout is enforced to be POD types.
  • This library treats memory failures as fatal.
  • Cached Strings are stored in their own set of arenas. AST constructors use cached strings for names, and content.
  • Strings used for serialization and file buffers are not contained by those used for cached strings.
    • _ctx->Allocator_Temp is used.
  • It's intended to generate the AST in one go and serialize after. The constructors and serializer are designed to be a "one pass, front to back" setup.
    • Any modifications to an existing AST should be done by constructing another with the modifications applied on-demand while traversing the AST (non-destructive).

The following CodeTypes are used which the user may optionally use strong typing with if they enable: GEN_ENFORCE_STRONG_CODE_TYPES

  • CodeBody : Supports range-based for ( for : range ) iteration across Code objects.
  • CodeAttributes
  • CodeComment
  • CodeClass
  • CodeConstructor
  • CodeDefine
  • CodeDefineParams
  • CodeDestructor
  • CodeEnum
  • CodeExec
  • CodeExtern
  • CodeInclude
  • CodeFriend
  • CodeFn
  • CodeModule
  • CodeNS
  • CodeOperator
  • CodeOpCast : User defined member operator conversion
  • CodeParams : Supports range-based for ( for : range ) iteration across parameters.
  • CodePreprocessCond
  • CodePragma
  • CodeSpecifiers : Supports range-based for ( for : range ) iteration across specifiers.
  • CodeStruct
  • CodeTemplate
  • CodeTypename
  • CodeTypedef
  • CodeUnion
  • CodeUsing
  • CodeVar

Each struct Code<Name> has an associated "filtered AST" with the naming convention: AST_<CodeName>. Unrelated fields of the AST for that node type are omitted and only necessary padding members are defined otherwise.
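
The range-based for support mentioned for CodeBody, CodeParams, and CodeSpecifiers can be pictured with a small sketch. Node, BodyView, and sum_values are hypothetical names invented for this illustration; the only idea borrowed from the AST layout is that entries are chained through a Next link starting from a Front entry.

```cpp
#include <cassert>

// Hypothetical, pared-down node: only the fields needed for traversal.
struct Node
{
    int   Value;
    Node* Next;
};

// Hypothetical body wrapper: holds the first entry and iterates through Next.
struct BodyView
{
    Node* Front;

    struct Iterator
    {
        Node* Current;

        Node&     operator*() const { assert( Current != nullptr ); return *Current; }
        Iterator& operator++()      { Current = Current->Next; return *this; }
        bool      operator!=( Iterator const& other ) const { return Current != other.Current; }
    };

    Iterator begin() const { return Iterator{ Front }; }
    Iterator end()   const { return Iterator{ nullptr }; }
};

// Walks every entry of the body with a range-based for loop.
int sum_values( BodyView body )
{
    int total = 0;
    for ( Node& entry : body )
        total += entry.Value;
    return total;
}
```

Providing begin() and end() on the wrapper is all the language needs for for ( entry : body ) to work, which keeps the wrapper itself a plain pointer-sized value.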

For the interface related to these code types see:

  • ast.hpp: Under the region pragma Code C-Interface
  • code_types.hpp: Under the region pragma Code C-Interface. Additional functionality for C++ will be within the struct definitions or at the end of the file.

There are three categories of interfaces for Code AST generation & reflection

  • Upfront
  • Parsing
  • Untyped

Upfront Construction

All component ASTs must be previously constructed, and provided on creation of the code AST. The construction will fail and return CodeInvalid otherwise.

Interface :

  • def_attributes
    • This is prepended right before the function symbol, or placed after the class or struct keyword for any flavor of attributes used.
    • It's up to the user to use the desired attribute formatting: [[]] (standard), __declspec (Microsoft), or __attribute__ (GNU).
  • def_comment
  • def_class
  • def_constructor
  • def_define
  • def_define_params
  • def_destructor
  • def_enum
  • def_execution
    • This is equivalent to untyped_str, except that it's intended for use only in execution scopes.
  • def_extern_link
  • def_friend
  • def_function
  • def_include
  • def_module
  • def_namespace
  • def_operator
  • def_operator_cast
  • def_param
  • def_params
  • def_pragma
  • def_preprocess_cond
  • def_specifier
  • def_specifiers
  • def_struct
  • def_template
  • def_type
  • def_typedef
  • def_union
  • def_using
  • def_using_namespace
  • def_variable

Bodies:

  • def_body
  • def_class_body
  • def_enum_body
  • def_export_body
  • def_extern_link_body
  • def_function_body
    • Use this for operator bodies as well
  • def_global_body
  • def_namespace_body
  • def_struct_body
  • def_union_body

Usage:

<name> = def_<function type>( ... );

Code <name>
{
    ...
    <name> = def_<function name>( ... );
}

All optional parameters are defined within struct Opts_def_<function name>. This was done to set up a macro trick for default optional parameters in the C library:

struct gen_Opts_def_struct
{
	gen_CodeBody       body;
	gen_CodeTypename   parent;
	gen_AccessSpec     parent_access;
	gen_CodeAttributes attributes;
	gen_CodeTypename*  interfaces;
	gen_s32            num_interfaces;
	gen_ModuleFlag     mflags;
};
typedef struct gen_Opts_def_struct gen_Opts_def_struct;

GEN_API gen_CodeClass gen_def__struct( gen_Str name, gen_Opts_def_struct opts GEN_PARAM_DEFAULT );
#define gen_def_struct( name, ... ) gen_def__struct( name, ( gen_Opts_def_struct ) { __VA_ARGS__ } )

In the C++ library, the def_<function name> is not wrapped in a macro.
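
For the C++ side, the same options-struct idea can work through an ordinary default argument instead of a macro. The sketch below is illustrative only: Opts_def_thing, its fields, and def_thing are hypothetical names, and the function returns a string rather than a Code object to stay self-contained.

```cpp
#include <cassert>
#include <string>

// Hypothetical options struct; the field names are illustrative only.
struct Opts_def_thing
{
    char const* parent         = nullptr;
    int         num_interfaces = 0;
    bool        is_export      = false;
};

// Hypothetical constructor: required argument first, optional settings last.
// Callers that need no options simply omit the second argument.
std::string def_thing( char const* name, Opts_def_thing opts = {} )
{
    assert( name != nullptr );
    std::string result = opts.is_export ? "export struct " : "struct ";
    result += name;
    if ( opts.parent )
    {
        result += " : ";
        result += opts.parent;
    }
    return result;
}
```

A caller that only needs the defaults writes def_thing( "Plain" ), while one that needs a parent fills in just that field of the options struct and passes it along.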

When using the body functions, it's recommended to use the args macro to automatically determine the number of arguments for the variadic call:

def_global_body( args( ht_entry, array_ht_entry, hashtable ));

// instead of:
def_global_body( 3, ht_entry, array_ht_entry, hashtable );
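
One common way such a counting macro can be built is shown below. This is a sketch of the general preprocessor technique, not the library's definition of args: the SKETCH_ macro names and sum_entries are hypothetical, and this version only handles one to five arguments.

```cpp
#include <cassert>
#include <cstdarg>

// Picks the sixth argument; the caller pads the list so that slot holds the count.
#define SKETCH_NUM_ARGS_IMPL( _1, _2, _3, _4, _5, N, ... ) N
// Appends a descending sequence so the real arguments push the right count into slot N.
#define SKETCH_NUM_ARGS( ... ) SKETCH_NUM_ARGS_IMPL( __VA_ARGS__, 5, 4, 3, 2, 1, 0 )
// Expands to "count, arg1, arg2, ..." for a variadic function that takes the count first.
#define SKETCH_ARGS( ... ) SKETCH_NUM_ARGS( __VA_ARGS__ ), __VA_ARGS__

// Hypothetical variadic function: takes the number of entries, then the entries.
int sum_entries( int num, ... )
{
    assert( num >= 0 );
    va_list entries;
    va_start( entries, num );
    int total = 0;
    for ( int index = 0; index < num; ++index )
        total += va_arg( entries, int );
    va_end( entries );
    return total;
}
```

With this in place, sum_entries( SKETCH_ARGS( 10, 20, 30 ) ) expands to sum_entries( 3, 10, 20, 30 ), so the count can never drift out of sync with the argument list.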

If a more incremental approach is desired for the body ASTs, Code def_body( CodeT type ) can be used to create an empty body. When the members have been populated, use code_validate_body to verify that the members are valid entries for that type.

Parse construction

A string provided to the API is parsed for the intended language construct.

Interface :

  • parse_class
  • parse_constructor
  • parse_define
  • parse_destructor
  • parse_enum
  • parse_export_body
  • parse_extern_link
  • parse_friend
  • parse_function
  • parse_global_body
  • parse_namespace
  • parse_operator
  • parse_operator_cast
  • parse_struct
  • parse_template
  • parse_type
  • parse_typedef
  • parse_union
  • parse_using
  • parse_variable

Usage:

Code <name> = parse_<function name>( string with code );

Code <name> = def_<function name>( ..., parse_<function name>(
    <string with code>
));

Untyped constructions

Code ASTs are constructed using unvalidated strings.

Interface :

  • token_fmt_va
  • token_fmt
  • untyped_str
  • untyped_fmt
  • untyped_token_fmt

During serialization, any untyped Code AST has its string value injected directly inline into whatever context it is an entry of. Even though these strings are not validated as correct C/C++ syntax or components, that doesn't mean untyped code can be added as any component of a Code AST:

  • Untyped code cannot have children, thus there cannot be recursive injection this way.
  • Untyped code can only be a child of a body AST, or the value of an assignment (ex: variable assignment).

These restrictions help prevent abuse of untyped code to some extent.

Usage Conventions:

Code <name> = def_variable( <type>, <name>, untyped_<function name>(
    <string with code>
));

Code <name> = untyped_str( code(
    <some code without "" quotes>
));

Optionally, the code_str and code_fmt macros can be used so that the code macro doesn't have to be used:

Code <name> = code_str( <some code without "" quotes > )

Template metaprogramming in the traditional sense becomes possible with the use of token_fmt and parse constructors:

Str value = txt("Something");

char const* template_str = txt(
    Code with <key> to replace with token_values
    ...
);
char const* gen_code_str = token_fmt( "key", value, template_str );
Code        <name>       = parse_<function name>( gen_code_str );
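
The substitution step can be pictured with a small standalone sketch. substitute_token is a hypothetical helper written with std::string; it only mirrors the idea of replacing each <key> marker in a template with a value and is not the library's token_fmt.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replaces every "<key>" marker in template_str with value.
std::string substitute_token( std::string const& key, std::string const& value, std::string template_str )
{
    assert( ! key.empty() );
    std::string const marker = "<" + key + ">";

    std::string::size_type position = 0;
    while ( ( position = template_str.find( marker, position ) ) != std::string::npos )
    {
        template_str.replace( position, marker.size(), value );
        position += value.size(); // continue after the inserted text so it is not rescanned
    }
    return template_str;
}
```

The resulting string would then be handed to one of the parse constructors, which is what turns plain text substitution into a typed Code AST.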

Predefined Codes

The following are predefined by the library, as they are commonly used:

  • enum_underlying_macro
  • access_public
  • access_protected
  • access_private
  • attrib_api_export
  • attrib_api_import
  • module_global_fragment
  • module_private_fragment
  • fmt_newline
  • pragma_once
  • param_varaidc (Used for variadic definitions)
  • preprocess_else
  • preprocess_endif
  • spec_const
  • spec_consteval
  • spec_constexpr
  • spec_constinit
  • spec_extern_linkage (extern)
  • spec_final
  • spec_forceinline
  • spec_global (global macro)
  • spec_inline
  • spec_internal_linkage (internal macro)
  • spec_local_persist (local_persist macro)
  • spec_mutable
  • spec_neverinline
  • spec_noexcept
  • spec_override
  • spec_ptr
  • spec_pure
  • spec_ref
  • spec_register
  • spec_rvalue
  • spec_static_member (static)
  • spec_thread_local
  • spec_virtual
  • spec_volatile
  • t_empty (Used for variadic macros)
  • t_auto
  • t_void
  • t_int
  • t_bool
  • t_char
  • t_wchar_t
  • t_class
  • t_typename

Optionally the following may be defined if GEN_DEFINE_LIBRARY_CODE_CONSTANTS is defined

  • t_b32
  • t_s8
  • t_s16
  • t_s32
  • t_s64
  • t_u8
  • t_u16
  • t_u32
  • t_u64
  • t_ssize (ssize_t)
  • t_usize (size_t)
  • t_f32
  • t_f64

Extent of operator overload validation

The AST and constructors will be able to validate that the arguments provided for the operator type match the expected form:

  • If return type must match a parameter
  • If number of parameters is correct
  • If added as a member symbol to a class or struct, that operator matches the requirements for the class (types match up)
  • There is no support for validating new & delete operations (yet)

The user is responsible for making sure the code types provided are correct and have the desired specifiers assigned to them beforehand.

Code generation and modification

There are two provided auxiliary interfaces:

  • Builder
  • Scanner

Builder is a similar object to the Jai language's String_Builder

  • The purpose of it is to generate a file.
  • A file is specified and opened for writing using the open( file_path) function.
  • Code provided via the print( code ) function is serialized to its buffer.
  • When all serialization is finished, use the write() command to write the buffer to the file.
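
The open, print, write flow can be sketched with standard streams. SketchBuilder is a hypothetical stand-in, and as a deliberate simplification its open() takes any std::ostream rather than a file path, so the example stays self-contained; the library's Builder works on files.

```cpp
#include <cassert>
#include <ostream>
#include <sstream> // handy for trying the sketch against an in-memory stream
#include <string>

// Hypothetical stand-in for the Builder: buffers text, then flushes it once.
struct SketchBuilder
{
    std::ostream* Target = nullptr;
    std::string   Buffer;

    // Selects the output target and starts from an empty buffer.
    void open( std::ostream& target )
    {
        Target = &target;
        Buffer.clear();
    }

    // Appends already-serialized code to the buffer; nothing reaches the target yet.
    void print( std::string const& code )
    {
        Buffer += code;
        Buffer += '\n';
    }

    // Flushes the whole buffer to the target in a single step.
    void write()
    {
        assert( Target != nullptr ); // open() must be called first
        *Target << Buffer;
        Buffer.clear();
    }
};
```

Buffering everything before a single write matches the "generate in one go, serialize after" approach described earlier: the output only changes once all serialization has finished.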

Scanner

  • The purpose is to scan or parse files
  • Comes with two basic functions to convert a file to code: scan_file and parse_file
    • scan_file: Merely grabs the file and stores it in an untyped Code.
    • parse_file: Will parse the file using parse_global_body and return a CodeBody.
  • Two basic functions for grabbing columns from a CSV: parse_csv_one_column and parse_csv_two_columns