The three levels of source code analysis and their adversaries

In our software composition analysis (SCA) projects, we have run into three cases of open source code being added to a code base, each calling for a different level of analysis to identify it.

  • Code scanning. In the first case, a developer innocently copies open source components into your project or product repository. They don’t change anything, so the code can be identified comparatively easily by using regular expressions to match license texts and copyright notices. Open source license text and copyright notice scanners provide this service.
  • Snippet matching. In the second case, a developer copies open source code into your project or product, again often innocently, and superficially modifies it to fit the programming problem at hand. Such snippets can be found by comparing them with the source code of known open source projects; commercial tools can provide this service.
  • Semantic analysis. In the third and hardest case, a developer copies code into your project or product and heavily modifies it. To find such copies, you need to apply semantic analysis, which goes beyond the source text to the underlying algorithm. Tools to do so, and the experts who can handle them, are available but often highly specialized.
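
To make the first two levels more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular scanner or commercial tool works. The names scan_text, normalize, shingles, and similarity are hypothetical, the three license patterns are a tiny stand-in for the much larger rule sets real scanners use, and the snippet comparison (Jaccard overlap of normalized token windows) is one simple technique chosen for illustration.

```python
import re

# Level 1: code scanning. Illustrative patterns only (assumption);
# real scanners ship far larger and more precise rule sets.
LICENSE_PATTERNS = {
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.I),
    "Apache-2.0": re.compile(r"Apache License,?\s+Version 2\.0", re.I),
    "GPL": re.compile(r"GNU General Public License", re.I),
}
COPYRIGHT_RE = re.compile(
    r"^.*\bCopyright\b\s+(?:\(c\)\s*)?\d{4}.*$", re.I | re.M
)


def scan_text(text):
    """Return (license names, copyright notice lines) found in text."""
    licenses = sorted(
        name for name, pat in LICENSE_PATTERNS.items() if pat.search(text)
    )
    # Strip surrounding comment characters from each matched line.
    notices = [m.group(0).strip(" \t#/*") for m in COPYRIGHT_RE.finditer(text)]
    return licenses, notices


# Level 2: snippet matching. Normalize away superficial edits
# (renamed identifiers, changed numbers, comments), then compare.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}


def normalize(code):
    """Turn code into tokens with identifiers and numbers abstracted."""
    tokens = []
    for line in code.splitlines():
        line = line.split("#", 1)[0]  # drop Python-style comments (naive)
        for tok in TOKEN_RE.findall(line):
            if tok in KEYWORDS:
                tokens.append(tok)
            elif re.match(r"[A-Za-z_]", tok):
                tokens.append("ID")
            elif tok.isdigit():
                tokens.append("NUM")
            else:
                tokens.append(tok)
    return tokens


def shingles(tokens, k=5):
    """Return the set of all windows of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=5):
    """Jaccard similarity of the normalized token windows of a and b."""
    sa, sb = shingles(normalize(a), k), shingles(normalize(b), k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A file whose header carries an unmodified MIT notice is caught by scan_text alone. A copy in which only variable names and comments were changed keeps the same normalized token sequence, so similarity still rates it as a full match against the original. Neither function would detect the third case: a heavily rewritten copy shares few token windows with its source, which is why semantic analysis is needed there.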

We provide both code scanning and snippet matching services and partner with specialized experts on semantic analysis, if requested by a client.