This demo is part of a larger research project (with Andrew Wu and Jun Li) that aims to extract structured information from 10-K filings to answer questions in strategy and organizational behavior.

The research is still in early stages.

Research involving large-scale textual analysis of 10-K filings has shown promise in areas such as industry classification and asset pricing.

This project is the first in this line of research to look at competitor mentions in 10-K filings.


Data Extraction
10-K filings were crawled from the SEC EDGAR database.
HTML documents were converted to text and competition sections were parsed using a probabilistic parsing method.
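
As a rough illustration of the conversion and section-location step, here is a minimal sketch using only the Python standard library. It is not the probabilistic parser used in the project: it swaps in a simple heading heuristic (a line reading "Competition" starts the section, and the next heading-like line ends it). The names TextExtractor, html_to_text, looks_like_heading, and competition_section are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block elements become line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Convert an HTML document to plain text, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = [re.sub(r"[ \t\xa0]+", " ", ln).strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)


def looks_like_heading(line):
    """Assumed heuristic: short, no closing punctuation, long words capitalized."""
    words = line.split()
    if not 0 < len(words) <= 8 or line.endswith((".", ",", ";", ":")):
        return False
    return all(w[0].isupper() for w in words if len(w) > 3 and w[0].isalpha())


def competition_section(text):
    """Return the lines between a 'Competition' heading and the next heading."""
    lines = text.splitlines()
    start = None
    for i, ln in enumerate(lines):
        if start is None:
            if re.fullmatch(r"competition\.?", ln, re.IGNORECASE):
                start = i + 1
        elif looks_like_heading(ln):
            return "\n".join(lines[start:i])
    return "\n".join(lines[start:]) if start is not None else ""
```

Real filings vary widely in layout, so a fixed heuristic like this will miss many sections. That variability is presumably what motivates a probabilistic approach.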
Extraction of Proper Nouns
Proper nouns were extracted using part-of-speech tagging and named entity recognition.
Filtering of Companies
A subset of proper nouns was identified as companies through collaborative filtering using the Google Knowledge Graph API and Wikipedia.
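
To make the Knowledge Graph lookup concrete, the sketch below queries the Google Knowledge Graph Search API for a candidate name and checks whether any returned entity is typed as an organization. It covers only the Knowledge Graph side, not the Wikipedia check or the project's actual filtering logic. The helper names, the set of accepted types, and the min_score threshold are assumptions made for illustration, and an API key is required to run fetch_kg.

```python
import json
import urllib.parse
import urllib.request

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"
COMPANY_TYPES = {"Organization", "Corporation"}  # assumed set of accepted types


def build_kg_url(query, api_key, limit=3):
    """Build a search URL restricted to entities of type Organization."""
    params = {"query": query, "key": api_key, "limit": limit, "types": "Organization"}
    return KG_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_kg(query, api_key):
    """Call the API and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_kg_url(query, api_key)) as resp:
        return json.load(resp)


def is_company(response, min_score=100.0):
    """True if any result is organization-typed and scores at least min_score.

    min_score is a hypothetical cutoff; a real pipeline would need to tune it.
    """
    for item in response.get("itemListElement", []):
        types = item.get("result", {}).get("@type", [])
        if isinstance(types, str):
            types = [types]
        if COMPANY_TYPES & set(types) and item.get("resultScore", 0.0) >= min_score:
            return True
    return False
```

A call such as is_company(fetch_kg("Amazon", api_key)) would then give a yes/no signal for one proper noun, which could be combined with other evidence.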
Company Mapping
The above method was also used to canonicalize company names. For example, [Amazon.com, Amazom Inc., Amazon] -> Amazon.
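
The sketch below shows what mapping name variants to one canonical name can look like. It is not the Knowledge Graph and Wikipedia method described above: it swaps in plain string normalization (stripping suffixes such as ".com" and "Inc.") followed by fuzzy matching with Python's difflib. The suffix list, the 0.8 cutoff, and the function names are assumptions.

```python
import re
from difflib import get_close_matches

# Assumed list of suffixes to strip; real filings use many more variants.
SUFFIXES = re.compile(
    r"(\.com|,?\s+(inc|corp|corporation|co|ltd|llc)\.?)$", re.IGNORECASE
)


def normalize(name):
    """Lowercase a name after repeatedly stripping trailing corporate suffixes."""
    name = name.strip()
    prev = None
    while prev != name:
        prev = name
        name = SUFFIXES.sub("", name).strip()
    return name.lower()


def canonicalize(name, canonical_names, cutoff=0.8):
    """Map a raw mention to the closest canonical name, or None if none is close."""
    keys = {normalize(c): c for c in canonical_names}
    match = get_close_matches(normalize(name), list(keys), n=1, cutoff=cutoff)
    return keys[match[0]] if match else None


known = ["Amazon", "Walmart"]
for raw in ["Amazon.com", "Amazom Inc.", "Amazon"]:
    print(raw, "->", canonicalize(raw, known))
```

Fuzzy matching of this kind needs a list of canonical names to match against and can merge distinct companies with similar names, which is one reason to prefer an external knowledge source.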