Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Make isIncludeFile() return TRUE for files with an unknown extension #3696

Closed
wants to merge 1 commit into from

Conversation

techee
Copy link
Contributor

@techee techee commented Apr 12, 2023

This PR is inspired by the problem reported here in Geany:

geany/geany#3454

Basically the problem is that if ctags parses a file with an unknown file extension, the isIncludeFile() function returns false which makes the code at various places in C/C++ parser like

if (isFunctionDeclaration && !isHeader())
    tag->isFileScope = true;

set tag->isFileScope = true. I think that when we don't know for sure whether a file is a header or source, it would be safer to treat it as a header and not to generate the fileScope extras flag for it (see more in geany/geany#3457 (comment)).

I'm posting this as a draft here only for further discussion. The patch makes unit tests fail (I didn't want to spend time on it unless I know this PR is something that is wanted in ctags), misses a unit test and also the documentation of the -h command-line option should be changed. Since all files would be treated as headers by default, the only use for -h would be to override source files like *.c to be treated as headers.

@b4n
Copy link
Member

b4n commented Apr 12, 2023

The idea sounds reasonable to me, defaulting to non-local in doubt is probably safer.

if (isFunctionDeclaration && !isHeader())
    tag->isFileScope = true;

Shouldn't this include whether the declaration is static or similar? I mean, even in a C file, a non-static declaration is going to be exported (not "file scope"), even if there was no declaration anywhere. C even accepts use of implicit declarations, although it's a very bad idea to actually use that.
OK, it's a fair bit trickier in C++, but then again being on the conservative side is probably better. E.g. false negatives are better than false positives.

@techee
Copy link
Contributor Author

techee commented Apr 12, 2023

Shouldn't this include whether the declaration is static or similar?

I simplified this one a bit just for the sake of the problem of this PR where this isn't relevant. For function definitions (i.e. functions with bodies) the parser really checks whether they are static or not. For function declarations (i.e. function prototypes without bodies) it doesn't check for the static because such declarations have no chance leaving the current file. So I think the parser does a reasonable thing here.

Back to this patch, I was also thinking it would maybe be more appropriate to introduce a new function isSource() because what we want is:

if (isFunctionDeclaration && we_are_sure_this_is_a_source())
    tag->isFileScope = true;

Note that if

  • isHeader() returns TRUE when we are sure it's a header file, FALSE otherwise (this is what happens without this patch and in isolation, this is a reasonable thing the function should do)
  • isSource() returns TRUE when we are sure it's a source file, FALSE otherwise

then isSource() isn't simply a negation of isHeader() because both of them return FALSE for cases where we aren't sure. This is the assumption the parser makes and the source of this problem.

@b4n
Copy link
Member

b4n commented Apr 16, 2023

@techee indeed that makes even more sense, and can thus be used as best fits the caller, depending on the actual use for the information. And indeed here the question is rather whether it's definitely a source file rather than not definitely a header.

@b4n
Copy link
Member

b4n commented Apr 16, 2023

@masatake do you have an opinion here?

@b4n
Copy link
Member

b4n commented Apr 16, 2023

BTW, a couple of things you can figure out by grepping:

  • isInputHeaderFile() is always negated (e.g. !isInputHeaderFile())
  • isInputHeaderFile() is almost always used to determine tag->isFileScope (but for one initialization in the C++ parser to better support guessing the language of .h files)
  • isIncludeFile() is only used to set inputFileInfo::isHeader
  • inputFileInfo::isHeader is only used by the above, plus a debug print.

So I think whether or not there should be both isHeader and isSource depends on whether the two use cases actually need different results. And I'd think they don't: they both check whether it's a header to adjust heuristics outside source files, so in both case an answer to "is this really a source file?" is a better default behavior, especially as from my experience uncommon extensions are almost always only used for headers, not sources.

So I would think that the best solution is to forget entirely about "is this a header", and replace the whole concept with "is this definitely not a header?" (and so, the probably more reasonable user option "is this a source file? (rather than a header)"), possibly including replacing the related option altogether.

@masatake
Copy link
Member

(Catching up slowly)

@masatake
Copy link
Member

I'm sorry for not getting back to you sooner.
"How the 'tag->isFileScope' (file: field in a tag file) should be" is a hard issue.

As a code reader, I think the file: is useful.

However, as a ctags developer, I want to delete it from ctags. The file: is too high-level information that the current implementation of ctags provides. ctags should provide basic raw information to client tools. The client tools should decide whether a language object is file: or not based on the tags output.

e.g.

included.c:

static int f(void) { return X; }

main.c

#define X 3
#include "included.c"

It is hard to know whether f has "file:" or not when parsing included.c.
The knowledge about the source tree, especially includes/included knowledge, is needed.


Let's think about short-term solutions.

I wonder if we extend -h to support 'negate marker'.
Quoted from the description for -h:

      -h (<list>|default)
              Specifies  a  <list>  of  file extensions, separated by periods,
              which are to be interpreted as include (or header) files. 
...
              The  default list is .h.H.hh.hpp.hxx.h++.inc.def. To restore the
              default list, specify "-h default".

Instead of listing the file extension for headers, make ctags accept the file extensions for none headers like:

  ctags ... -h '!.c++.cc.cp.cpp.cxx.C.CPP.CXX.c' ...

Geany can pass the extensions for non-headers to ctags before running ctags.

p.s.
The Cxx parser provides the raw information via its "properties" field:

$ ctags --list-fields=C
#LETTER NAME       ENABLED LANGUAGE JSTYPE FIXED OP DESCRIPTION
-       macrodef   no      C        s--    no    -- macro definition
-       properties no      C        s--    no    -- properties (static, inline, mutable,...)

@masatake
Copy link
Member

The code I illustrated for my idea:

diff --git a/main/options.c b/main/options.c
index 3f6f3d1d6..78e3d2b07 100644
--- a/main/options.c
+++ b/main/options.c
@@ -152,6 +152,7 @@ optionValues Option = {
        .fileList = NULL,
        .tagFileName = NULL,
        .headerExt = NULL,
+       .headerExtNegated = false,
        .etagsInclude = NULL,
        .tagFileFormat = DEFAULT_FILE_FORMAT,
 #ifdef HAVE_ICONV
@@ -1141,7 +1142,11 @@ extern bool isIncludeFile (const char *const fileName)
        bool result = false;
        const char *const extension = fileExtension (fileName);
        if (Option.headerExt != NULL)
+       {
                result = stringListExtensionMatched (Option.headerExt, extension);
+               if (Option.headerExtNegated)
+                       result = !result;
+       }
        return result;
 }
 
@@ -2547,16 +2552,27 @@ static void processHeaderListOption (const int option, const char *parameter)
                installHeaderListDefaults ();
        else
        {
+               bool negated = false;
                bool clear = true;
 
-               if (parameter [0] == '+')
+               if (parameter [0] == '!')
+               {
+                       negated = true;
+                       clear = true;
+                       parameter++;
+                       if (parameter [0] == '+')
+                               error (FATAL, "don't combine ! and + in -%c", option);
+               }
+               else if (parameter [0] == '+')
                {
                        ++parameter;
                        clear = false;
                }
                if (Option.headerExt == NULL)
                        Option.headerExt = stringListNew ();
-               verbose ("    Header Extensions:\n");
+               verbose ("    %sHeader Extensions:\n", negated? "(negated) ": "");
+
+               Option.headerExtNegated = negated;
                addExtensionList (Option.headerExt, parameter, clear);
        }
 }
diff --git a/main/options_p.h b/main/options_p.h
index 4895b644f..0ee0e4b48 100644
--- a/main/options_p.h
+++ b/main/options_p.h
@@ -84,6 +84,7 @@ typedef struct sOptionValues {
        char *fileList;         /* -L  name of file containing names of files */
        char *tagFileName;      /* -o  name of tags file */
        stringList* headerExt;  /* -h  header extensions */
+       bool headerExtNegated;  /* -h  is specified with !, the prefix for negating. */
        stringList* etagsInclude;/* --etags-include  list of TAGS files to include*/
        unsigned int tagFileFormat;/* --format  tag file format (level) */
 #ifdef HAVE_ICONV

@techee
Copy link
Contributor Author

techee commented Apr 24, 2023

However, as a ctags developer, I want to delete it from ctags. The file: is too high-level information that the current implementation of ctags provides. ctags should provide basic raw information to client tools. The client tools should decide whether a language object is file: or not based on the tags output.

But is there any such field based on which clients could decide whether it's a local or a non-local declaration (e.g. whether a function is static or not)? C/C++ with all of their includes and preprocessors are very specific and it's kind of understandable that the parsing won't be 100% perfect (which is kind of true for basically any ctags parser, there are always some heuristics and assumptions that have to be made).

The file: flag is useful though to get rid of tags that are "invalid" in certain file. In Geany, we use it for autocompletion where the user types some prefix, say pref in file1.c and we show all tags in a popup that start with pref. if there's for example static void preference2() in file2.c, this one can be excluded from the autocompletion results because there's no way for such tag to "leave" file2.c. On the other hand, void preference3() will be included from file3.c as it may be visible in file1.c which we are editing. Of course, ideally we'd analyze all the includes to see whether it's really visible in file1.c but that would require real parser. But even such simple elimination of static variables helps to get rid of a lot of garbage from such simple autocompletion implementation.

I wonder if we extend -h to support 'negate marker'.

Sounds good and I think that makes sense for ctags if you want to preserve some compatibility with existing tools. I'm just wondering if

-h '!.c++.cc.cp.cpp.cxx.C.CPP.CXX.c'

shouldn't be the default when -h is not specified on the command-line because I think it's safer not to generate the file: flag if the parser runs into an unknown extension.

Geany can pass the extensions for non-headers to ctags before running ctags.

I'm wondering if we can do it somehow (I didn't investigate it though). Remember we are just calling some internal ctags functions like in the mini-geany.c client and I'm not sure if we can inject some command-line options this way.

@masatake
Copy link
Member

But is there any such field based on which clients could decide whether it's a local or a non-local declaration (e.g. whether a function is static or not)?

See the "p.s." in my last comment. properties: is the field for string "static".

[jet@living]~/var/ctags-github% cat /tmp/foo.c
static void foo(void) {}
static void bar(void);
void baz(void);
void baz(void) {}
[jet@living]~/var/ctags-github% ./ctags --options=NONE --fields-C=+'{properties}' --kinds-C=+p -o - /tmp/foo.c
./ctags --options=NONE --fields-C=+'{properties}' --kinds-C=+p -o - /tmp/foo.c
ctags: Notice: No options will be read from files or environment
bar	/tmp/foo.c	/^static void bar(void);$/;"	p	typeref:typename:void	file:	properties:static
baz	/tmp/foo.c	/^void baz(void) {}$/;"	f	typeref:typename:void
baz	/tmp/foo.c	/^void baz(void);$/;"	p	typeref:typename:void	file:
foo	/tmp/foo.c	/^static void foo(void) {}$/;"	f	typeref:typename:void	file:	properties:static

ctags also records #include like:

[jet@living]~/var/ctags-github% cat /tmp/bar.c
#include <a.h>
#include "b.h"
[jet@living]~/var/ctags-github% ./ctags --options=NONE --fields=+r --extras=+r -o - /tmp/bar.c
ctags: Notice: No options will be read from files or environment
a.h	/tmp/bar.c	/^#include <a.h>/;"	h	roles:system
b.h	/tmp/bar.c	/^#include "b.h"/;"	h	roles:local

Though #if/#idef conditions are not tracked at all, a client tool, having its own database, can do exciting things with the tags output. See also #2428 . I guess @pragmaware's code3 may utilize these tags. (@pragmaware designed the properties: field).

About -h !, when thinking about compatibility, I don't want to change its default value.
Anyway, we can add ! marker whether we will change -h's default value or not.
I'll make a pull request.

I'm wondering if we can do it somehow (I didn't investigate it though). Remember we are just calling some internal ctags functions like in the mini-geany.c client and I'm not sure if we can inject some command-line options this way.

Extending the interface for Geany is the next step. I believe it is not hard.
I

@techee
Copy link
Contributor Author

techee commented Apr 24, 2023

See the "p.s." in my last comment. properties: is the field for string "static".

Cool, I wasn't aware of that. In this case I'm actually wondering whether it's worth introducing any changes in ctags - I think we'll be able to flag those file scope tags by ourselves. Especially if the defaults for -h are going to stay and there would have to be some special interface against Geany for it, I think it's not worth it.

ctags also records #include like:

I'm aware of that but you can never be sure whether some include file is used or not because you can have code like

#ifdef FOO
#include "bar.h"
#endif

and FOO can be defined on the command line. What we do though is that we sort the tags from filed directly included from the edited file towards the top of the autocompletion list which seems to help offering better results for autocompletion.

@masatake
Copy link
Member

So the refined issue here is the design of the new API for using the "properties:" field of C and C++ parsers.

Alright?

@techee
Copy link
Contributor Author

techee commented Apr 25, 2023

So the refined issue here is the design of the new API for using the "properties:" field of C and C++ parsers.

Maybe there's already everything present but if you could give me a hint, it would be great. If I understand correctly, the "properties" field is stored into parserFieldsDynamic of tagEntryInfo from which we could get it by

ctags/main/entry.c

Lines 1009 to 1021 in 5ba7dfc

extern const tagField* getParserFieldForIndex (const tagEntryInfo * tag, int index)
{
if (index < 0
|| tag->usedParserFields <= ((unsigned int)index) )
return NULL;
else if (index < PRE_ALLOCATED_PARSER_FIELDS)
return tag->parserFields + index;
else
{
unsigned int n = index - PRE_ALLOCATED_PARSER_FIELDS;
return ptrArrayItem(tag->parserFieldsDynamic, n);
}
}

The only problem is how to get the index from the "properties" filed name here:

.name = "properties", \

Even though the index will probably be 0 in this case, I'd prefer a more robust solution in case someone starts reordering the definitions in the parser. Is there a function somewhere that we could use to convert field name into this index?

@masatake
Copy link
Member

mini-geany helps us!

$ git diff | cat
diff --git a/main/entry.c b/main/entry.c
index 9e2601213..cfb3f5a2a 100644
--- a/main/entry.c
+++ b/main/entry.c
@@ -1020,7 +1020,7 @@ extern const tagField* getParserFieldForIndex (const tagEntryInfo * tag, int ind
 	}
 }
 
-extern const char* getParserFieldValueForType (tagEntryInfo *const tag, fieldType ftype)
+extern const char* getParserFieldValueForType (const tagEntryInfo *const tag, fieldType ftype)
 {
 	for (int i = 0; i < tag->usedParserFields; i++)
 	{
diff --git a/main/entry.h b/main/entry.h
index 21d9f60ae..7b27aeae6 100644
--- a/main/entry.h
+++ b/main/entry.h
@@ -298,7 +298,7 @@ extern bool isTagExtra (const tagEntryInfo *const tag);
  */
 extern void attachParserField (tagEntryInfo *const tag, fieldType ftype, const char* value);
 extern void attachParserFieldToCorkEntry (int index, fieldType ftype, const char* value);
-extern const char* getParserFieldValueForType (tagEntryInfo *const tag, fieldType ftype);
+extern const char* getParserFieldValueForType (const tagEntryInfo *const tag, fieldType ftype);
 
 extern int makePlaceholder (const char *const name);
 extern void markTagAsPlaceholder (tagEntryInfo *e, bool placeholder);
diff --git a/main/mini-geany.c b/main/mini-geany.c
index 0f0571995..24390d1b9 100644
--- a/main/mini-geany.c
+++ b/main/mini-geany.c
@@ -27,6 +27,11 @@
 #include <string.h>
 #include <errno.h>
 
+langType C_lang = LANG_IGNORE;
+langType CXX_lang = LANG_IGNORE;
+fieldType c_props_ft = FIELD_UNKNOWN; /* Ideally, this should be passed via clientData of writeEntry(, ...) */
+fieldType cxx_props_ft = FIELD_UNKNOWN; /* Ideally, this should be passed via clientData of writeEntry(, ...) */
+
 static int writeEntry (tagWriter *writer, MIO * mio, const tagEntryInfo *const tag, void *clientData);
 static void rescanFailed (tagWriter *writer, unsigned long validTagNum, void *clientData);
 
@@ -209,6 +214,7 @@ typedef struct {
 	unsigned long lineNumber;
 	int lang;
 	bool isAnon;
+	bool isStatic;
 } Tag;
 
 
@@ -234,6 +240,11 @@ static Tag *createTag(const tagEntryInfo *info)
 	tag->lineNumber = info->lineNumber;
 	tag->lang = info->langType;
 	tag->isAnon = isTagExtraBitMarked(info, XTAG_ANONYMOUS);
+
+	const char *props = (info->langType == C_lang)? getParserFieldValueForType (info, c_props_ft):
+		(info->langType == CXX_lang)? getParserFieldValueForType (info, cxx_props_ft): NULL;
+	tag->isStatic = (props && !!strstr(props, "static"));
+
 	return tag;
 }
 
@@ -306,11 +317,12 @@ static void processCollectedTags(ptrArray *tagArray)
 	for (i = 0; i < num; i++)
 	{
 		Tag *tag = ptrArrayItem(tagArray, i);
-		printf("%s\tline: %lu\tkind: %s\t lang: %s\n",
+		printf("%s\tline: %lu\tkind: %s\t lang: %s%s\n",
 			tag->name,
 			tag->lineNumber,
 			ctagsGetKindName(tag->kindLetter, tag->lang),
-			ctagsGetLangName(tag->lang));
+			ctagsGetLangName(tag->lang),
+			tag->isStatic? " [static]": "");
 	}
 
 	/* prepare for the next parsing by clearing the tag array */
@@ -323,6 +335,15 @@ extern int main (int argc, char **argv)
 	/* called once when Geany starts */
 	ctagsInit();
 
+	{
+		C_lang = getNamedLanguage("C", 0);
+		c_props_ft = getFieldTypeForNameAndLanguage ("properties", C_lang);
+		enableField (c_props_ft, true);
+		CXX_lang = getNamedLanguage("C++", 0);
+		cxx_props_ft = getFieldTypeForNameAndLanguage ("properties", CXX_lang);
+		enableField (cxx_props_ft, true);
+	}
+
 	/* create empty tag array *
 	 * this is where we store the collected tags
 	 * NOTE: Geany doesn't use the ptrArray type - it is used just for the purpose of this demo */
$ cat /tmp/foo.c 
static void foo(void) {}
static void bar(void);
void baz(void);
void baz(void) {}
$ ./mini-geany /tmp/foo.c
This parser only parses C files - provide them as arguments on the command line or get a hard-coded buffer parsed when no arguments are provided


Parsing /tmp/foo.c:
foo	line: 1	kind: function	 lang: C [static]
bar	line: 2	kind: prototype	 lang: C [static]
baz	line: 3	kind: prototype	 lang: C
baz	line: 4	kind: function	 lang: C

@masatake
Copy link
Member

The changes for entry.[ch] are already merged in #3699.

@techee
Copy link
Contributor Author

techee commented May 8, 2023

@masatake Thanks!

After some thinking, though, I think for us it will be easiest to still rely on isFileScope and only ignore it in cases when it's set for files with an unknown extension. See:

geany/geany#3490

The code you provided could still be interesting for us in the future if we decide we'd like to get some info from some parser-specific fields.

@techee techee closed this May 8, 2023
@masatake
Copy link
Member

masatake commented May 9, 2023

@techee I see I already wrote code for the ! negate marker. I wonder I should merge it or not...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants