This demo is part of a larger research project (with Andrew Wu and Jun Li) that aims to extract structured information from 10-K filings to answer interesting questions in strategy and organizational behavior:
- Given that disclosure of named competitors is voluntary, what motivates a company to disclose them? Is the decision strategic?
- Can we use network analysis to measure a given company's industry diversification?
The research is still in its early stages.
Research involving large-scale textual analysis of 10-K filings has shown great promise in areas such as industry classification and asset pricing.
This project is the first in this line of research to look at competitor mentions in 10-K filings.
- Data Extraction
  - 10-K filings were crawled from the SEC EDGAR database.
  - HTML documents were converted to text and competition sections were parsed using a
- Extraction of Proper Nouns
  - Proper nouns were extracted using part-of-speech tagging and named entity recognition.
- Filtering of Companies
  - A subset of the proper nouns was identified as companies through collaborative filtering using the Google Knowledge Graph API and Wikipedia.
- Company Mapping
  - The above method was also used to canonicalize company names. For example, [Amazon.com, Amazon Inc., Amazon] -> Amazon.
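
To make the Company Mapping step more concrete, here is a minimal sketch of what canonicalization does to a list of name variants. It is not the project's method: the project relies on the Google Knowledge Graph API and Wikipedia, whereas this sketch swaps in a purely local heuristic (lowercasing, dropping punctuation, and stripping trailing corporate suffixes). The suffix list and the function names `normalize` and `canonicalize` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical list of trailing tokens to ignore; not taken from the project.
_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc", "com"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Keep at least one token so a name like "Inc." never becomes empty.
    while len(tokens) > 1 and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def canonicalize(mentions):
    """Map every raw mention to one display name per normalized key."""
    groups = {}
    for mention in mentions:
        groups.setdefault(normalize(mention), []).append(mention)

    canonical = {}
    for variants in groups.values():
        # Assumption: the shortest variant is the most readable label.
        label = min(variants, key=len)
        for variant in variants:
            canonical[variant] = label
    return canonical


if __name__ == "__main__":
    print(canonicalize(["Amazon.com", "Amazon Inc.", "Amazon"]))
```

A string heuristic like this only merges variants that differ in punctuation or suffixes. It would not link a subsidiary or brand name to its parent, which is one reason an external knowledge source is a plausible choice for the real pipeline.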