The 10K Project


This demo is part of a larger research project (with Andrew Wu and Jun Li) that aims to extract structured information from 10-K filings to answer interesting questions in strategy and organizational behavior:

  • Given that disclosure of named competitors is voluntary, what motivates a company to disclose them? Is the decision strategic?
  • Can we use network analysis to measure a given company's industry diversification?

The research is still in early stages.

Research involving large-scale textual analysis of 10-K filings has shown great promise in areas such as industry classification and asset pricing.

This project is the first in this line of research to look at competitor mentions in 10-K filings.


Data Extraction
10-K filings were crawled from the SEC EDGAR database.
HTML documents were converted to text and competition sections were parsed using a probabilistic parsing method.
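As a rough illustration of the conversion and section-location step, the sketch below strips HTML to plain text with Python's standard library and then pulls out the text under a "Competition" heading. The heading rule is a simple deterministic stand-in, not the probabilistic parser used in the project, and the names html_to_text, looks_like_heading, and competition_section are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks around block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore the contents of <script> and <style> elements.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def looks_like_heading(line):
    """Assumed heuristic: headings are short and do not end in punctuation."""
    return len(line) <= 60 and not line.endswith((".", ":", ";", ","))


def competition_section(text):
    """Return the lines between a 'Competition' heading and the next heading."""
    section = []
    inside = False
    for line in text.splitlines():
        if not inside:
            if re.fullmatch(r"competit(ion|ors)", line, re.IGNORECASE):
                inside = True
            continue
        if looks_like_heading(line):
            break
        section.append(line)
    return "\n".join(section)
```

Real filings vary widely in layout, so a fixed rule like this would miss many sections; that variability is presumably why a probabilistic method was preferred.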
Extraction of Proper Nouns
Proper nouns were extracted using part-of-speech tagging and named entity recognition.
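To make the idea concrete without depending on a trained tagger, the sketch below collects runs of capitalized words as proper-noun candidates. This capitalization heuristic is only a stand-in for part-of-speech tagging and named entity recognition; the function name proper_noun_candidates and the SENTENCE_STARTERS list are assumptions made for illustration.

```python
import re

# Function words that are often capitalized only because they start a sentence.
SENTENCE_STARTERS = {"The", "We", "Our", "In", "It", "This", "These", "A", "An"}

# One or more capitalized tokens separated by single spaces.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][\w&'-]*(?: [A-Z][\w&'-]*)*")


def proper_noun_candidates(text):
    """Return runs of capitalized words, in order of appearance."""
    candidates = []
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group().split()
        # Drop leading sentence-initial function words.
        while words and words[0] in SENTENCE_STARTERS:
            words.pop(0)
        if words:
            candidates.append(" ".join(words))
    return candidates


sample = ("We compete with Microsoft and Oracle in cloud services. "
          "Our largest rival is International Business Machines.")
print(proper_noun_candidates(sample))
```

A statistical tagger would also handle lowercase brand names and capitalized common nouns, which this heuristic cannot.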
Filtering of Companies
A subset of the proper nouns was identified as companies through collaborative filtering using the Google Knowledge Graph API and Wikipedia.
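One way to picture combining two sources is a simple vote, sketched below. How the project actually combines the Knowledge Graph and Wikipedia signals is not described here, so the voting rule, the min_votes parameter, and the lookup functions are all assumptions; the lookups are placeholders backed by made-up in-memory sets rather than real API calls.

```python
from typing import Callable, Iterable, List

# A lookup answers: does this source list the name as a company?
Lookup = Callable[[str], bool]


def filter_companies(candidates: Iterable[str],
                     lookups: List[Lookup],
                     min_votes: int = 2) -> List[str]:
    """Keep candidates that at least `min_votes` sources recognize."""
    kept = []
    for name in candidates:
        votes = sum(1 for lookup in lookups if lookup(name))
        if votes >= min_votes:
            kept.append(name)
    return kept


# Placeholder data (assumption): real lookups would query the
# Knowledge Graph API and Wikipedia instead of these made-up sets.
KG_HITS = {"Microsoft", "Oracle"}
WIKI_HITS = {"Microsoft", "Oracle", "Java"}


def in_knowledge_graph(name: str) -> bool:
    return name in KG_HITS


def in_wikipedia(name: str) -> bool:
    return name in WIKI_HITS


companies = filter_companies(
    ["Microsoft", "Java", "Oracle", "Europe"],
    [in_knowledge_graph, in_wikipedia],
)
print(companies)
```

Requiring agreement from both sources trades recall for precision; lowering min_votes to 1 would keep any name that either source recognizes.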
Company Mapping
The above method was also used to canonicalize company names. For example, [, Amazon Inc., Amazon] -> Amazon.
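For intuition about what canonicalization has to do, the sketch below maps name variants to a canonical form using plain string normalization and fuzzy matching from Python's difflib. This is a simpler string-based stand-in, not the Knowledge Graph and Wikipedia method described above; the suffix list, the 0.8 cutoff, and the function names normalize and canonicalize are assumptions.

```python
import re
from difflib import get_close_matches

# Common legal suffixes to ignore when comparing names (assumed list).
LEGAL_SUFFIXES = re.compile(
    r"\b(inc|corp|corporation|co|ltd|llc|plc)\b\.?", re.IGNORECASE)


def normalize(name):
    """Lowercase a name and strip legal suffixes and punctuation."""
    name = LEGAL_SUFFIXES.sub("", name)
    name = re.sub(r"[^\w\s&]", " ", name)
    return re.sub(r"\s+", " ", name).strip().lower()


def canonicalize(name, canonical_names, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is close enough."""
    keys = {normalize(c): c for c in canonical_names}
    match = get_close_matches(normalize(name), list(keys), n=1, cutoff=cutoff)
    return keys[match[0]] if match else None


known = ["Amazon", "Microsoft"]
print(canonicalize("Amazon, Inc.", known))
print(canonicalize("Microsoft Corporation", known))
print(canonicalize("Unknown Widgets", known))
```

String similarity alone cannot link names that share no text, such as a former name or a ticker symbol, which is where an external knowledge source helps.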