March 16, 2018

1. Intro

Structure

  1. Introduction
  2. Character Convolutional Network
    1. Idea CNN
    2. Adaptation to text mining
    3. Architecture and conclusion
  3. Implementation
    1. Data
    2. Interactive App
  4. Explainable AI with LIME
    1. Idea
    2. Integration in App
  5. Conclusion and Outlook

Applications

  • Genre classification in movie reviews
  • Fraud detection in emails
  • Readability assessment of texts
  • Topic analysis in news articles
  • Pre-sorting of emails in customer service
  • Sentiment analysis of Yelp reviews

2. Character CNN

Idea 2D Convolution

  • Essential part of image processing neural networks as described in Goodfellow, Bengio, and Courville (2016)
  • Classic ML: experts manually extract features (transformations) from the data
  • DL: automated feature extraction, optimized for predictive power
  • To do this, CNNs use adaptive filters that convolve over the images
  • E.g. detecting wheels for car classification:

Illustration of filters at different depths, by LeCun and Ranzato (2013)

Idea 2D Convolution

Credits to Niklas Klein for this example.

Idea 2D Pooling

  • Reduce the dimension of the feature map while preserving the major information
  • Introduce robustness into the network

Text Encoding

  • Match each (lowercase) character with the 72-symbol alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?:’’“´`^/|_@#$%&*˜‘+-=<>()
  • Each review encoded as matrix with dimension #chars x #alphabet
  • This can be treated as a 1-channel image in the CNN (see the encoding sketch below)
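
A minimal sketch of this encoding in Python; the fixed length of 1014 characters follows the later slides, and the helper name encode_review as well as the handling of out-of-alphabet characters are illustrative assumptions, not the exact preprocessing code:

```python
import numpy as np

# alphabet roughly as on the slide (the exact 72-symbol set is listed above)
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 1014  # reviews are truncated / padded to 1014 characters

def encode_review(text):
    """One-hot encode a review as a (MAX_LEN, alphabet size) matrix."""
    x = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, char in enumerate(text.lower()[:MAX_LEN]):
        j = CHAR_TO_IDX.get(char)
        if j is not None:      # characters outside the alphabet stay all-zero
            x[i, j] = 1.0
    return x

print(encode_review("My go-to ice cream place in the summer!").shape)  # (1014, alphabet size)
```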

1D convolution

  • Extract information from neighbouring words via a convolutional filter
  • kernel size \(\approx\) text window from GloVe etc.
  • Filter has dimension: kernel size x #alphabet
  • Filter moves in the word direction (y-axis) over the image representation
  • Use same padding
  • The first convolutional filter thus extracts a feature vector of length 1014 (one value per character position)
  • A filter with kernel size 3 on the 72-symbol alphabet contains 3 x 72 = 216 parameters (verified in the sketch below)
  • Rectified linear unit (ReLU) as activation to introduce non-linearity
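
A hedged Keras sketch that reproduces the numbers above (one filter, kernel size 3, 72-symbol alphabet, same padding); the bias term is switched off so that the parameter count matches:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1014, 72))          # 1014 characters, 72-symbol alphabet
conv = layers.Conv1D(filters=1, kernel_size=3,   # one filter sliding in the word direction
                     padding="same", activation="relu",
                     use_bias=False)(inputs)

model = models.Model(inputs, conv)
model.summary()   # 3 x 72 x 1 = 216 weights, output length stays 1014 thanks to same padding
```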

1D convolution - animation

  • Same padding:
    • padding rows added at the top and bottom
    • Output vector has the same length as the input
  • Kernel size: 3
  • Filter extracts one scalar value in each step
  • Many filters per convolutional layer in real applications: 256 per layer in Zhang, Zhao, and LeCun (2015)
  • Results in a feature map with dimensions 1014 x 256

Temporal max-pooling

  • We use pooling to decrease the size of the feature vectors while retaining the major information
  • Enhances robustness
  • Key for training deep networks
  • 1D convolution \(\rightarrow\) 1D max pooling
  • From kernel_size neighbouring values, select the largest one
  • Here we use stride = kernel_size: non-overlapping windows
  • Example 1 (sketched below):
    • Feature vector with dimension 1014 x 1
    • kernel_size = stride = 3
    • Reduced to dimension 338 x 1
  • Example 2: blackboard
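
A small sketch of Example 1 in Keras (illustrative only):

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1014, 1))                        # feature vector of length 1014
pooled = layers.MaxPooling1D(pool_size=3, strides=3)(inputs)  # non-overlapping windows of 3

model = models.Model(inputs, pooled)
print(model.output_shape)   # (None, 338, 1): only the largest of each 3 neighbours survives
```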

Architecture
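
A hedged Keras sketch of a character-level CNN in the spirit of the small model of Zhang, Zhao, and LeCun (2015): six temporal convolution blocks followed by fully connected layers. The exact layer configuration of the model presented here may differ.

```python
from tensorflow.keras import layers, models

def build_char_cnn(n_chars=1014, alphabet_size=72, n_classes=2):
    """Sketch of a character-level CNN (in the spirit of Zhang et al. 2015, small model)."""
    inputs = layers.Input(shape=(n_chars, alphabet_size))
    x = inputs
    # six 1D convolution blocks; temporal max-pooling after blocks 1, 2 and 6
    for kernel_size, pool in [(7, True), (7, True), (3, False),
                              (3, False), (3, False), (3, True)]:
        x = layers.Conv1D(256, kernel_size, padding="same", activation="relu")(x)
        if pool:
            x = layers.MaxPooling1D(pool_size=3, strides=3)(x)
    x = layers.Flatten()(x)
    # fully connected part with dropout, then the softmax output
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_char_cnn()
model.summary()
```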

Dataset augmentation

  • Prominent technique to increase the model’s generalization power
  • Artificially augment your training data set
  • Images: rotation, blurring, cropping …
  • Text: replace words with their synonyms
  • The authors boost performance marginally using the LibreOffice thesaurus

Results from Zhang, Zhao, and LeCun (2015)

Classification errors on different tasks and models, from Zhang, Zhao, and LeCun (2015)

Conclusion on CNNs for text classification:

  • No a priori information needed (syntax, semantic structure, splitting in pre-defined words, …)
  • Frequent abnormal character combinations could also be learned (misspellings, slang, emoticons, …)
  • Performance highly dependent on data set size:
    • classic models such as n-grams or bag of words are strong competitors on data sets with fewer than 1 million samples
    • Could be mitigated by combinations: use pre-trained embeddings for the encoding and train the CNN on those
  • Strong performance with raw, user-generated texts from real world applications
  • Works in various classification settings (product categories, fraud detection, …) and is not restricted to sentiment analysis
  • Interpret text and language as a 1-dimensional signal

3. Implementation

Data

  • Yelp Polarity data set provided by Zhang, Zhao, and LeCun (2015)
  • Built upon data from the Yelp Dataset Challenge 2015
  • English language
  • Train data: 560,000 balanced positive (4, 5 stars) and negative (1, 2 stars) reviews of restaurants, doctors, bars …
  • Test data: 38,000 balanced reviews
Example of a positive Yelp review

Data

Word count distribution of the train reviews

Data

positive review

My go-to ice cream place in the summer! The lines are usually long and you have to wait a while, but they have some delicious snowstorms! They also have a drive-thru, which I have never used, but that is because I like to sit out front at the tables in the parking lot. Perfect place to go with friends on a beautiful summer night for a yummy treat!

negative review

I have been to this Mesa AZ location on Alma School Road a few times with good results but this last time Tuesday 8/19/2014 will be my last. The waitress gave me a crazy look when I asked if there was a house italian salad dressing. The deep dish pizza was terrible. They hardly put any cheese on the thing and all is left is one inch of dough for 16 dollars. Just Terrible.

Framework

  • Data preprocessing in R
  • Python library keras with tensorflow backend
  • Trained on GTX1070 GPU
  • Training parameters:
    • loss: categorical cross entropy
    • optimizer: adam with:
      • learning rate = 0.001
      • beta 1 = 0.09
      • beta 2 = 0.999
      • decay = 0.95
    • Regularization: early stopping with patience = 3
    • batch size: 1000
  • Code is public on my GitHub account (a sketch of the training setup follows below)
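
A hedged sketch of how these training settings translate to Keras; the model object is the CNN sketched in the Architecture section, and the decay argument follows the Keras 2 API (newer versions use learning-rate schedules instead):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# optimizer settings as listed above
optimizer = Adam(learning_rate=0.001, beta_1=0.09, beta_2=0.999, decay=0.95)

model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# regularization: stop once the validation loss has not improved for 3 epochs
early_stopping = EarlyStopping(monitor="val_loss", patience=3)
```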

Memory Problems

  • Training with 560,000 reviews quickly leads to memory problems
  • Not enough RAM to store 560,000 one-hot encoded training samples
  • Solution 1: train with only 10,000 reviews
  • Solution 2: create a data generator function (sketched below):
    • In each batch iteration, reads batch_size = 1000 reviews from the .csv file
    • Encodes them into one-hot character matrices on the CPU
    • Feeds them to the network for parallel training on the GPU
    • Deletes the batch from memory and reads the next batch
    • Callable in keras via the method model.fit_generator(generator=trainGenerator ...
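
A minimal sketch of such a generator; the file name, the column layout of the .csv, and the encode_review helper from the Text Encoding sketch are assumptions, not the exact code from the repository:

```python
import csv
import numpy as np

def train_generator(csv_path="train.csv", batch_size=1000, n_classes=2):
    """Yield (X, y) batches forever; keras pulls one batch per training step."""
    while True:
        with open(csv_path, newline="", encoding="utf-8") as f:
            batch_x, batch_y = [], []
            for label, text in csv.reader(f):        # assumed columns: label, review text
                batch_x.append(encode_review(text))  # one-hot (1014, alphabet) matrix
                batch_y.append(int(label) - 1)       # assumed labels 1 (negative) / 2 (positive)
                if len(batch_x) == batch_size:
                    yield np.stack(batch_x), np.eye(n_classes)[batch_y]
                    batch_x, batch_y = [], []        # free the finished batch from memory

# 560,000 reviews with batch_size = 1000 give 560 steps per epoch:
# model.fit_generator(generator=train_generator(), steps_per_epoch=560,
#                     epochs=..., callbacks=[early_stopping])
```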

Results

Confusion matrices for different train data sizes

           positive   negative
  560K        17723        406
                554      19317
  280K        16921       1208
                615      19256
  140K        16698       1431
               1049      18822

Results

Comparison of different train data sizes

               AUC    Accuracy
  560K   0.9959275   0.9747368
  280K   0.9892149   0.9520263
  140K   0.9817623   0.9347368

App

4. Explainable AI with LIME

Why does the model decide that way?

Problem

The CNN as a black-box model

Idea of LIME

  • Ribeiro, Singh, and Guestrin (2016) developed a method for Local Interpretable Model-agnostic Explanations (LIME)
  • Our CNN consists of 12,127,746 weight parameters which shape the decision rule of the net \(\rightarrow\) impossible to interpret directly
  • Idea: use the reactions of the model to perturbations of the input to interpret a local prediction of the model
  • Model-agnostic technique that works for any black-box model
  • Interpretable representations:
    • text: words
    • image: parts of the image, so called superpixels

Illustration

Illustration of LIME by Ribeiro, Singh, and Guestrin (2016). Blue and pink areas depict the complex decision function of \(f\). The bold cross marks the observation to be explained; the other crosses and circles mark perturbations \(z\) of \(x\), sized proportionally to their proximity. The dashed line is the locally learned explanation.

Artificial Design Matrix

The binary columns indicate which words of the original review “I like the good expensive beer” are kept in each perturbation; P(positive) is the prediction of the black-box model.

Perturbed example             I   like   the   good   expensive   beer   P(positive)
I like beer                   1    1      0     0        0         1        0.50
I like good                   1    1      0     1        0         0        0.80
I like good expensive beer    1    1      0     1        1         1        0.74
I the good beer               1    0      1     1        0         1        0.54
expensive beer                0    0      0     0        1         1        0.27

LIME in Action

Screenshot of LIME in action

LIME Pseudo-Algorithm

  1. Create artificial design matrix
  2. Calculate the similarity of each perturbed observation to the original observation (text domain: cosine distance)
  3. Set the number of explanatory features n and select them (e.g. via lasso, forward selection, …)
  4. Fit an interpretable surrogate model on the label output of the black-box model
    • Use only the n selected features
    • Weight the perturbations according to their similarity score
  5. Use the coefficients of the surrogate model as explanations (a compact sketch follows below)
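
A compact sketch of these steps for the text case; predict_proba stands in for the black-box CNN (a function mapping a list of texts to class probabilities), and the feature-selection step is simplified to taking the largest surrogate coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_distances

def explain_text(text, predict_proba, num_samples=1000, num_features=5, kernel_width=25):
    words = text.split()
    # 1. artificial design matrix: random binary masks over the words
    Z = np.random.binomial(1, 0.5, size=(num_samples, len(words)))
    Z[0, :] = 1                                   # first row keeps the full original text
    texts = [" ".join(w for w, keep in zip(words, z) if keep) for z in Z]
    # 2. similarity to the original via cosine distance and an exponential kernel
    dist = cosine_distances(Z[:1], Z).ravel()
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. weighted interpretable surrogate fitted on the black-box outputs
    y = predict_proba(texts)[:, 1]                # P(positive) from the black box
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    # 3./5. pick the most important words via the surrogate coefficients
    top = np.argsort(np.abs(surrogate.coef_))[::-1][:num_features]
    return [(words[i], surrogate.coef_[i]) for i in top]
```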

Ingredients and settings

  • Ridge regression (a penalized linear model) is used as the interpretable surrogate model
  • Important hyperparameters:
    • num_features = 5: number of features to be selected
    • num_samples = 1000: the size of the design matrix
    • bow = False: controls whether all occurrences of a word are perturbed together or one by one
    • feature_selection = "lasso_path": selection method for step 3
  • Refer to the official documentation of LIME or this blog post with examples in R for more details (a usage sketch follows below)
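
With the lime package itself, the settings above map directly onto LimeTextExplainer; predict_fn is assumed to be a wrapper around the CNN that takes a list of texts and returns class probabilities:

```python
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["negative", "positive"],
                              bow=False,
                              feature_selection="lasso_path")

explanation = explainer.explain_instance(
    "The deep dish pizza was terrible.",   # review to be explained
    predict_fn,                            # wrapper: list of texts -> class probabilities
    num_features=5,
    num_samples=1000)

print(explanation.as_list())               # [(word, weight), ...]
```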

App

5. Conclusion and Outlook

Conclusion

  • Char-CNNs show impressive results even though almost no domain knowledge is included
  • LIME offers a good first method for trying to understand those black-box predictions
  • The performance on specific tasks is highly dependent on the training data
  • The results are still partly counter-intuitive

Outlook

  • Use GloVe embeddings instead of the character encoding to include outside information
  • Try the model with different languages
  • Try different settings of LIME
  • Toy around with the length of the encoding (currently set to 1014)
  • LIME plans to implement local predictions as a measure of trust in the future

Thank you!

References

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Ribeiro, Marco Túlio, Sameer Singh, and Carlos Guestrin. 2016. “"Why Should I Trust You?": Explaining the Predictions of Any Classifier.” CoRR abs/1602.04938. http://arxiv.org/abs/1602.04938.

LeCun, Yann, and Marc’Aurelio Ranzato. 2013. “Deep Learning Tutorial.” http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf.

Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. “Character-Level Convolutional Networks for Text Classification.” In Advances in Neural Information Processing Systems, 649–57.