The 10K Project

About

This demo is part of a larger research project (with Andrew Wu and Jun Li) that aims to extract structured information from 10-K filings to answer interesting questions in strategy and organizational behavior:

Given that disclosure of named competitors is voluntary, what motivates a company to disclose them? Is the choice strategic?
Can we use network analysis to measure industry diversification for a given company?

The research is still in its early stages.

Research involving large-scale textual analysis of 10-K filings has shown great promise in areas such as industry classification and asset pricing.

This project is the first in this line of research to look at competitor mentions in 10-K filings.

Methodology

Data Extraction: 10-K filings were crawled from the SEC EDGAR database.
Parsing: HTML documents were converted to text and competition sections were parsed using a probabilistic parsing method.
Extraction of Proper Nouns: Proper nouns were extracted using part-of-speech tagging and named entity recognition.
Filtering of Companies: A subset of proper nouns was identified as companies through collaborative filtering using the Google Knowledge Graph API and Wikipedia.
Company Mapping: The above method was also used to canonicalize company names. For example, [Amazon.com, Amazon Inc., Amazon] -> Amazon.
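
The company mapping step can be illustrated with a minimal sketch. This is not the project's method, which relied on the Google Knowledge Graph API and Wikipedia; it swaps in a much simpler rule-based normalization so the idea can be shown without external services. The suffix list, the function names normalize and canonicalize, and the choice of the shortest mention as the canonical label are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of corporate suffixes to ignore when comparing names.
# The project itself used Knowledge Graph and Wikipedia lookups instead.
_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
             "co", "company", "ltd", "llc", "plc"}
_TLD = re.compile(r"\.(com|net|org)$")


def normalize(name: str) -> str:
    """Return a crude lookup key for a company mention."""
    key = _TLD.sub("", name.strip().lower())   # "amazon.com" -> "amazon"
    tokens = re.findall(r"[a-z0-9&]+", key)    # split on punctuation and spaces
    # Drop trailing corporate suffixes, but never empty the name entirely.
    while len(tokens) > 1 and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def canonicalize(mentions):
    """Map each mention to one canonical label per normalized key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    # Assumption: the shortest surface form serves as the canonical label.
    return {m: min(group, key=len) for group in groups.values() for m in group}


mapping = canonicalize(["Amazon.com", "Amazon Inc.", "Amazon"])
print(mapping)
```

With these three mentions, every variant maps to Amazon. Rules like these only catch surface-level variation such as suffixes and domain endings; resolving misspellings, abbreviations, or subsidiaries would need an external knowledge source of the kind the project used.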
