Query-Based Code Analysis Engine
In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design.
CodeFuse-Query reimagines code analysis as a data computation task, supporting scans of over 10 billion lines of code daily across more than 300 different tasks. It optimizes resource utilization, prioritizes data reusability, applies incremental code extraction, and introduces task types designed specifically for Code Change, underscoring its domain-optimized design. The system’s logic-oriented facet employs Datalog, using a two-tiered schema, COREF, to convert source code into data facts. Through Gödel, a purpose-built query language, CodeFuse-Query lets users formulate complex tasks as logical expressions, drawing on Datalog’s declarative nature.
Overall, the CodeFuse-Query platform is divided into three main parts: code data model, code query DSL, and platform productization services.
We have defined COREF, a standardized data model for code. Language-specific extractors convert all source code into this model.
COREF mainly contains the following information:
COREF = AST (Abstract Syntax Tree) + ASG (Abstract Semantic Graph) + CFG (Control Flow Graph) + PDG (Program Dependency Graph) + Call Graph + Class Hierarchy + Documentation (Documentation/Comments)
Note: Since the computation difficulty of each type of information varies, not all languages’ COREF information includes all the above. The basic information mainly consists of AST, ASG, Call Graph, Class Hierarchy, and Documentation, while other information (CFG and PDG) is still under construction and will be gradually supported.
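To make the idea of "code as data facts" concrete, here is a minimal sketch in Python. It is not the COREF schema and not one of the CodeFuse-Query extractors: it uses Python's standard `ast` module, and the two tables (`functions` and `calls`) and their columns are hypothetical, chosen only to show how AST and Call Graph information can be flattened into relational rows.

```python
import ast

SOURCE = '''
def helper(x):
    return x + 1

def main():
    return helper(41)
'''

def extract_facts(source: str):
    """Turn Python source into two hypothetical fact tables (not the COREF schema)."""
    tree = ast.parse(source)
    functions = []  # rows: (function_name, line_number)
    calls = []      # rows: (caller_name, callee_name)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            functions.append((node.name, node.lineno))
            for inner in ast.walk(node):
                # Only direct calls to plain names are recorded in this sketch.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.append((node.name, inner.func.id))
    return sorted(functions), sorted(calls)

functions, calls = extract_facts(SOURCE)
print(functions)  # [('helper', 2), ('main', 5)]
print(calls)      # [('main', 'helper')]
```

Once the code is in this tabular form, an analysis becomes a query over rows rather than a traversal of syntax trees.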
Based on the generated COREF code data, CodeFuse-Query uses a custom DSL called Gödel for queries to meet code analysis needs.
Gödel is a logical reasoning language based on Datalog, which derives new facts from existing “facts” and “rules”. Gödel is also a declarative language: compared to imperative programming, it focuses on describing “what is needed” and leaves the implementation to the computation engine.
Since the code has been transformed into relational data (COREF data is stored in relational tables), one might ask why users should learn a new DSL instead of using SQL directly or an SDK. The reason is that Datalog evaluation is monotonic and guaranteed to terminate: Datalog gives up some expressive power to obtain these properties, and Gödel inherits this trade-off.
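The following sketch illustrates how facts and rules derive new facts, and why monotonic evaluation terminates. It is plain Python, not Gödel syntax and not the CodeFuse-Query engine; the relation names `calls` and `reach` are hypothetical. The rules are evaluated naively bottom-up until no new fact appears.

```python
# Facts: direct call edges (caller, callee). The cycle b -> c -> b is deliberate.
calls = {("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")}

def reachable(calls):
    """Naive bottom-up evaluation of two Datalog-style rules:

    reach(x, y) :- calls(x, y).
    reach(x, z) :- reach(x, y), calls(y, z).
    """
    reach = set(calls)  # first rule: every direct call is reachable
    while True:
        # Second rule: extend known paths by one more call edge.
        derived = {(x, z) for (x, y) in reach for (y2, z) in calls if y == y2}
        new_facts = derived - reach
        if not new_facts:   # fixpoint: no rule yields anything new
            return reach
        reach |= new_facts  # monotone: facts are only added, never removed

result = reachable(calls)
print(sorted(result))
```

Because the set of facts only grows and the number of possible `(caller, callee)` pairs is finite, the loop must stop, even though the call graph contains a cycle. A hand-written recursive traversal would need explicit cycle handling to get the same guarantee.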
CodeFuse-Query includes the Sparrow CLI and the online service Query Center. Sparrow CLI contains all components and dependencies, such as extractors, data model, compiler, etc., allowing users to generate code data and conduct queries locally (for Sparrow CLI usage, please see Section 3: Installation, Configuration, and Running). If users require online queries, they can experiment using the Query Center.
As of now, CodeFuse-Query supports data analysis for 11 programming languages. Support for 5 of them (Java, JavaScript, TypeScript, XML, Go) is mature, while the remaining 6 (Objective-C, C++, Python3, Swift, SQL, Properties) are in beta and still have room for improvement. The specific support status is shown in the table below:
Language | Status | COREF Model Node Count |
---|---|---|
Java | Mature | 162 |
XML | Mature | 12 |
TS/JS | Mature | 392 |
Go | Mature | 40 |
OC/C++ | Beta | 53/397 |
Python3 | Beta | 93 |
Swift | Beta | 248 |
SQL | Beta | 750 |
Properties | Beta | 9 |
Note: The maturity level of each language is determined by the types of information contained in its COREF model and by the actual implementation. Except for OC/C++, all languages support complete AST information and Documentation. COREF for Java additionally supports ASG, Call Graph, Class Hierarchy, and some CFG information.
Installation, Configuration, and Running
- cli: The entry point for the command-line tool. It provides a unified command-line interface and calls the other modules to complete specific functions.
- language: Core data and data modeling (lib) for various languages. Regarding the degree of openness, please refer to the section “Some Notes on the Scope of Open Source”.
- doc: Reference documents.
- examples: Gödel query language examples.
- tutorial: CodeFuse-Query Development Container Usage Tutorial.

As of now, it is not possible to build an executable program from the source code, because not all modules have been open-sourced in this release; the missing modules will be released over the next year. To ensure a complete experience nonetheless, we have released complete installation packages for download; please see the Release page.
Regarding the openness of languages, you can refer to the table below:
Language | Data Modeling Open Source | Data Core Open Source | Maturity |
---|---|---|---|
Python | Y | Y | RELEASE |
Java | Y | Y | RELEASE |
JavaScript | Y | Y | RELEASE |
Go | Y | Y | RELEASE |
XML | Y | Y | RELEASE |
Cfamily | Y | Y | BETA |
SQL | Y | Y | BETA |
Swift | N | N | BETA |
Properties | Y | Y | BETA |