The phenomena called ‘inverted index’. An evaluation and explanation

Nowadays, indexes are essential in every database to support fast search to retrieve specific database rows. Not only databases but also applications could use data structures,  for example the System.Collections.Generic.Dictionary in C# and the java.util.HashMap in Java could be used for for fast searches using specific keys. In terms of the Big O notation, the retrieval time for items in a hashmap and dictionary based on a key is O(c) (constant). These data structures are very handy to support fast searches in specific items in applications. Database indexes are a bit slower compared to the map and dictionary, because most databases use a B-Tree on the background. The retrieval time of these indexes are O(log n) in a worst case scenario but still it is very fast. Last month, I got in touch with the phenomena called ‘inverted index’. I never heart of the inverted index before, but people told me positive stories about it and also the usual stories that this type of index is ‘tha bomb’ and should be used for everything related to search, so I decided to dive into this topic.

The inverted index is, like the name said, an index but inverted. That makes sense, but how does it work? Like the normal index, the inverted index is nothing more than a data structure. I am going to explain the inverted index based on the following use case: We have a list of files and we want to make an application that supports full text search. In table 1, we have modelled the table files that we want to use.


Table 1; a relational table contain file information

The id is the primary key of the table, the filename column contains the name of the file on the file system, the directory column contains the absolute folder structure the file is positioned and the column content contains the text content of the file. This is a very common way to model your table that references to files with additional information. In this structure, you can search files by filename and directory by placing indexes on the column filename and directory. You can also place an index on content to search for text and search for specific text. For example, queries you can build:

SELECT f.* FROM Files f WHERE f.content LIKE ‘%example%’ returns row 1 and 2.

SELECT * FROM Files f WHERE f.content LIKE ‘%with%’  returns row 3, 4 and 5.

This strategy works fine for the queries and data above, but this strategy won’t work for every search, for example:

SELECT * FROM Files f WHERE f.content LIKE ‘%content%’

I expect it would return row 1 till 5 (because every row contains the text ‘content’), but in reality, it will only return row 2 till 5. It is possible to achieve the right result by expanding the query, but it is not the most optimal case in terms of the execution path the database is using. In our use case, we want to search for a word and retrieve all the files that contain the word. Basically, what we need to do is to invert the table; based on a row word, you want a list of files instead of having a row file that contain a list of words. We take the column id and content from the table ‘Files’ and invert the table into a new table called ‘Words’. In the image below, you see the result of the modifications:

Extraction of the original Table ‘Files’ into 2 separate tables.

In the new situation we extracted the table files into words and files. We collect all unique words from the column content in our old table ‘Files’, fill the table ‘Words’ with all unique words we can find, add references into the column ‘Files’ that contain the word and finally we place an unique index on the column ‘Word’. The column ‘Files’ is an array structure, which contain a reference to another table, ‘Files’ in our case. Using the table ‘Words’, we can search files that contain specific words. Using this data structure, it is much more efficient to search for files that contain a specific word(s).

Implementation

Based on this information, I have made an implementation by myself how an inverted index structure could be implemented using Java. See the example below.

//Folder with pdf files to search for
private static final String FOLDER_WITH_PDF_FILES = "C:\\JDeveloper\\mywork\\Playground\\TestIText\\pdf";

//Hashmap which simulates the inverted index using a string key and a Set as a value
private static HashMap<String, Set<String>> invertedIndex;

public static void main(String[] args) {

    //Scanner for reading commandline input
    Scanner scanner = new Scanner(System.in);
    
    //Build the invertedIndex/Table
    buildInvertedIndex(FOLDER_WITH_PDF_FILES);
    System.out.println("Found " + invertedIndex.size() + " unique words.");
    
    // Want to exit?
    while(true) {

        // Ask for a word and retreive the input
        System.out.print("Search words in file: ");
        String words = scanner.nextLine();
            
        if(words.equalsIgnoreCase("q"))
            break;
        else {
            ArrayList<String> wordsToSearch = extractWords(words);
            searchInFiles(wordsToSearch);
        }
    }
}

The main is the start point of the application. For my inverted index, I use a HashMap<String, Set<String>>. The key in this hashmap is in our case the word and the Set<String> is the list of files that contain the word. First, it builds the inverted index using the buildInvertedIndex function. This function is divided in the following following functions

/**
 * Builds the inverted index using the pdf files in the folder
 * @param pathToSearch the path which is used to search PDF files
 * @return an inverted index containing the words and file names
 * @throws IOException
 */
private static void buildInvertedIndex(String pathToSearch) {
    
    invertedIndex = new HashMap<String, Set<String>>();
    File fPathToSearch = new File(pathToSearch);

    //Read all the files in the folder passed by the parameter (in this demo, I expect every file is PDF)
    for(File file : fPathToSearch.listFiles()) {
        processPDF(file);
    }
}

/**
 * Processes a PDF file and fills the inverted index. This contains
 * - Text extraction
 * - Word extraction
 * - Fills the inverted index
 * @param file
 */
private static void processPDF(File file) {
    
    //Open file as pdf (in this situation, we expect that every file is a pdf file)
    PdfReader reader = null;
    PdfReaderContentParser parser = null;
    
    try {
        reader = new PdfReader(file.getAbsolutePath());
        parser = new PdfReaderContentParser(reader);    
    } catch(IOException e){
        e.printStackTrace();
        return;
    }
    
    //For every page...
    TextExtractionStrategy strategy = null;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        
        //Extract the text 
        try {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }
        
        //Extract the words from the text
        ArrayList<String> words = extractWords(strategy.getResultantText());
        
        //For every word, create/edit the index
        for(String word : words) {
            if(!invertedIndex.containsKey(word)) {
                invertedIndex.put(word, new HashSet<String>());
            }
            invertedIndex.get(word).add(file.getName());
        }
    }
}

/**
 * Extracts words using regex from the given input. No diacritics where included
 * @param input the value to extract words from
 * @return a list of words
 */
private static ArrayList<String> extractWords(String input) {
    
    //TODO: no words with diacritics included
    Pattern p = Pattern.compile("[\\w']+");
    Matcher m = p.matcher(input);
    
    ArrayList<String> returnList = new ArrayList<String>();
    while (m.find()) {
        returnList.add(input.substring(m.start(), m.end()).toLowerCase());
    }
    return returnList;
}

For this code example I decided to use the opensource library iText, which is a library that could be used for displaying and creation of PDF files (http://www.itextpdf.com/). I use itext in the processPDF function, for extracting textual content from PDF files. After extracting the textual content from files, I need to extract all words. This is done by the extractWords function using a regex pattern. After all words are processed for all files and the index is created, we can use the index to search for content. This is done by the following part of the code:

/**
 * Search files that contain a given words
 * @param wordsToSearchFor words that the file should contain
 */
private static void searchInFiles(ArrayList<String> wordsToSearchFor) {
    
    //Add every found row to to searcher
    TextSearcher searcher = new TextSearcher();
    
    int noWordsCurrentCount = 0;
    int noWordsToSearch = wordsToSearchFor.size();
    for(String wordToSearch : wordsToSearchFor) {
        
        ++noWordsCurrentCount;    
        Set<String> items = invertedIndex.get(wordToSearch);
        if(items == null)
            break;
        
        boolean finalIncrement = noWordsCurrentCount == noWordsToSearch ? true : false;
        searcher.incrementalSearch(items, finalIncrement);
    }
    
    //If the amount of increments of the searcher is not equal to 
    if(noWordsToSearch != searcher.getIncrements()) {
        System.out.println("No files found with the specified words");
        return;
    }
    
    //Print the results
    Set<String> itemsFound = searcher.getResults();
    System.out.println("Found: " + itemsFound.size() + " file(s)");
    
    int foundCount = 1;
    for(String item : itemsFound) {
        System.out.println(foundCount + ": " + item);
        ++foundCount;
    }
}

//Helper class for searching using the inverted index
public class TextSearcher {
    
    private HashMap<String, Integer> hits;
    private Set<String> results;
    private int increments;
    
    public TextSearcher() {
        hits = new HashMap<String, Integer>();
        results = new HashSet<String>();
        increments = 0;
    }
    
    public void incrementalSearch(Set<String> hitsToAdd, boolean finalIncrement) {
        
        ++increments;
        for(String hitToAdd : hitsToAdd){
            
            if(hits.containsKey(hitToAdd)) {
                
                Integer hitsOnFile = hits.get(hitToAdd);
                if(hitsOnFile + 1 < increments)
                    hits.remove(hitToAdd);
                else {
                    registerHit(hitToAdd, hitsOnFile + 1, finalIncrement);
                }
            }
            else if(increments == 1) {
                registerHit(hitToAdd, 1, finalIncrement);
            }
        }
    }
    
    private void registerHit(String key, Integer value, boolean finalIncrement) {
        hits.put(key, value);
        if(finalIncrement) {
            results.add(key);
        }
    }
    
    public Set<String> getResults(){
        return results;
    }

    public int getIncrements() {
        return increments;
    }
}

For the inverted index, I build a helper class called TextSearcher. The TextSearcher class is an internal class and introduced to cache the results per word when you want to search with multiple words. After a word is found in the inverted index, the TextSearcher is used to process the files found by the word. After all words are processed, we can get the results from the TextSearcher and print the results in the console.

Using this console application, you can type one or more words. Then the inverted index is used to determine if one or set of words is used in one of the processed documents. In the image below, I processed some pdf files I found on the internet and my thesis. Then I did some searches based on a set of words and it prints the files which contain the words. The image below shows some results.

Conclusion

The inverted index is a different way to model your data based on a key that contains multiple rows instead of a key that results in a single row. In my opinion, it is not a complete new technique, but it is a more clever way to store data in such a way that it  results in a better search performance for a 1 to many data cardinality. For full text searches, this technique is an excellent option. A normal index is a perfect strategy for data based on one key, one value (1 to 1 data cardinality). My final conclusion; There isn’t really a best index, it really depends on your data. So my advice; first analyse your data and then choose the proper index. 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *