March 16, 2018

1. Intro

Structure

  1. Introduction
  2. Character Convolutional Network
    1. Idea CNN
    2. Adaptation to text mining
    3. Architecture and conclusion
  3. Implementation
    1. Data
    2. Interactive App
  4. Explainable AI with LIME
    1. Idea
    2. Integration in App
  5. Conclusion and Outlook

Applications

  • Genre classification in movie reviews
  • Fraud detection in emails
  • Readability assessment of texts
  • Topic analysis in news articles
  • Pre-sorting of emails in customer service
  • Sentiment analysis of Yelp reviews

2. Character CNN

Idea 2D Convolution

  • Essential part of image processing neural networks as described in Goodfellow, Bengio, and Courville (2016)
  • Classic ML: experts manually extract features (transformations) from the data
  • DL: automated feature extraction, optimized for predictive power
  • To do this, CNNs use adaptive filters that convolve over the images
  • E.g. detecting wheels for car classification:

Illustration of filters at different depths, by LeCun and Ranzato (2013)

Idea 2D Convolution

Credits to Niklas Klein for this example.

Idea 2D Pooling

  • Reduce the dimension of the feature map while preserving the major information
  • Introduce robustness into the network

Text Encoding

  • Match each (lowercase) character with the 72-symbol alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?:’’“´`^/|_@#$%&*˜‘+-=<>()
  • Each review encoded as matrix with dimension #chars x #alphabet
  • This can be treated as a 1-channel image in the CNN (see the encoding sketch below)
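
A minimal sketch of this encoding in Python; the fixed length of 1014 characters follows the later slides, and the helper name encode_review as well as the handling of out-of-alphabet characters are illustrative assumptions, not the exact preprocessing code:

```python
import numpy as np

# alphabet roughly as on the slide (the exact 72-symbol set is listed above)
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 1014  # reviews are truncated / padded to 1014 characters

def encode_review(text):
    """One-hot encode a review as a (MAX_LEN, alphabet size) matrix."""
    x = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, char in enumerate(text.lower()[:MAX_LEN]):
        j = CHAR_TO_IDX.get(char)
        if j is not None:      # characters outside the alphabet stay all-zero
            x[i, j] = 1.0
    return x

print(encode_review("My go-to ice cream place in the summer!").shape)  # (1014, alphabet size)
```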

1D convolution

  • Extract information from neighbouring words via a convolutional filter
  • kernel size \(\approx\) text window from GloVe etc.
  • Filter has dimension: kernel size x #alphabet
  • Filter moves in the word direction (y-axis) over the image representation
  • Use same padding
  • The first convolutional filter thus extracts a feature vector of length 1014 (one value per character position)
  • A filter with kernel size 3 on the 72-symbol alphabet contains 3 x 72 = 216 parameters (verified in the sketch below)
  • Rectified linear unit (ReLU) as activation to introduce non-linearity
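
A hedged Keras sketch that reproduces the numbers above (one filter, kernel size 3, 72-symbol alphabet, same padding); the bias term is switched off so that the parameter count matches:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1014, 72))          # 1014 characters, 72-symbol alphabet
conv = layers.Conv1D(filters=1, kernel_size=3,   # one filter sliding in the word direction
                     padding="same", activation="relu",
                     use_bias=False)(inputs)

model = models.Model(inputs, conv)
model.summary()   # 3 x 72 x 1 = 216 weights, output length stays 1014 thanks to same padding
```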

1D convolution - animation

  • Same padding:
    • padding rows added at the top and bottom
    • Output vector has the same length as the input
  • Kernel size: 3
  • Filter extracts one scalar value in each step
  • Many filters per convolutional layer in real applications: 256 per layer in Zhang, Zhao, and LeCun (2015)
  • Results in a feature map with dimensions 1014 x 256

Temporal max-pooling

  • We use pooling to decrease the size of the feature vectors while retaining the major information
  • Enhances robustness
  • Key for training deep networks
  • 1D convolution \(\rightarrow\) 1D max pooling
  • From kernel_size neighbouring values, select the largest one
  • Here we use stride = kernel_size: non-overlapping windows
  • Example 1 (sketched below):
    • Feature vector with dimension 1014 x 1
    • kernel_size = stride = 3
    • Reduced to dimension 338 x 1
  • Example 2: blackboard
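
A small sketch of Example 1 in Keras (illustrative only):

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1014, 1))                        # feature vector of length 1014
pooled = layers.MaxPooling1D(pool_size=3, strides=3)(inputs)  # non-overlapping windows of 3

model = models.Model(inputs, pooled)
print(model.output_shape)   # (None, 338, 1): only the largest of each 3 neighbours survives
```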

Architecture
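
A hedged Keras sketch of a character-level CNN in the spirit of the small model of Zhang, Zhao, and LeCun (2015): six temporal convolution blocks followed by fully connected layers. The exact layer configuration of the model presented here may differ.

```python
from tensorflow.keras import layers, models

def build_char_cnn(n_chars=1014, alphabet_size=72, n_classes=2):
    """Sketch of a character-level CNN (in the spirit of Zhang et al. 2015, small model)."""
    inputs = layers.Input(shape=(n_chars, alphabet_size))
    x = inputs
    # six 1D convolution blocks; temporal max-pooling after blocks 1, 2 and 6
    for kernel_size, pool in [(7, True), (7, True), (3, False),
                              (3, False), (3, False), (3, True)]:
        x = layers.Conv1D(256, kernel_size, padding="same", activation="relu")(x)
        if pool:
            x = layers.MaxPooling1D(pool_size=3, strides=3)(x)
    x = layers.Flatten()(x)
    # fully connected part with dropout, then the softmax output
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_char_cnn()
model.summary()
```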

Dataset augmentation

  • Prominent technique to increase the model’s generalization power
  • Artificially augment your training data set
  • Images: rotation, blurring, cropping …
  • Text: replace words with their synonyms
  • The authors boost performance marginally using the LibreOffice thesaurus

Results from Zhang, Zhao, and LeCun (2015)

Classification errors on different tasks and models, from Zhang, Zhao, and LeCun (2015)

Conclusion on CNNs for text classification:

  • No a priori information needed (syntax, semantic structure, splitting in pre-defined words, …)
  • Frequent abnormal character combinations could also be learned (misspellings, slang, emoticons, …)
  • Performance highly dependent on data set size:
    • classic models such as n-grams or bag of words are strong competitors on data sets with fewer than 1 million samples
    • Could be mitigated by combinations: use pre-trained embeddings for the encoding and train the CNN on those
  • Strong performance with raw, user-generated texts from real world applications
  • Works in various classification settings (product categories, fraud detection, …) and is not restricted to sentiment analysis
  • Interpret text and language as a 1-dimensional signal

3. Implementation

Data

  • Yelp Polarity data set provided by Zhang, Zhao, and LeCun (2015)
  • Built upon data from the Yelp Dataset Challenge 2015
  • English language
  • Train data: 560,000 balanced positive (4, 5 stars) and negative (1, 2 stars) reviews of restaurants, doctors, bars …
  • Test data: 38,000 balanced reviews
Example of a positive Yelp review

Data

Word count distribution of the train reviews

Data

positive review

My go-to ice cream place in the summer! The lines are usually long and you have to wait a while, but they have some delicious snowstorms! They also have a drive-thru, which I have never used, but that is because I like to sit out front at the tables in the parking lot. Perfect place to go with friends on a beautiful summer night for a yummy treat!

negative review

I have been to this Mesa AZ location on Alma School Road a few times with good results but this last time Tuesday 8/19/2014 will be my last. The waitress gave me a crazy look when I asked if there was a house italian salad dressing. The deep dish pizza was terrible. They hardly put any cheese on the thing and all is left is one inch of dough for 16 dollars. Just Terrible.

Framework

  • Data preprocessing in R
  • Python library keras with tensorflow backend
  • Trained on GTX1070 GPU
  • Training parameters:
    • loss: categorical cross entropy
    • optimizer: adam with:
      • learning rate = 0.001
      • beta 1 = 0.09
      • beta 2 = 0.999
      • decay = 0.95
    • Regularization: early stopping with patience = 3
    • batch size: 1000
  • Code is public on my GitHub account (a sketch of the training setup follows below)
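
A hedged sketch of how these training settings translate to Keras; the model object is the CNN sketched in the Architecture section, and the decay argument follows the Keras 2 API (newer versions use learning-rate schedules instead):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# optimizer settings as listed above
optimizer = Adam(learning_rate=0.001, beta_1=0.09, beta_2=0.999, decay=0.95)

model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# regularization: stop once the validation loss has not improved for 3 epochs
early_stopping = EarlyStopping(monitor="val_loss", patience=3)
```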

Memory Problems

  • Training with 560,000 reviews quickly leads to memory problems
  • Not enough RAM to store 560,000 one-hot encoded training samples
  • Solution 1: train with only 10,000 reviews
  • Solution 2: create a data generator function (sketched below):
    • In each batch iteration, reads batch_size = 1000 reviews from the .csv file
    • Encodes them into one-hot character matrices on the CPU
    • Feeds them to the network for parallel training on the GPU
    • Deletes the batch from memory and reads the next batch
    • Callable in keras via the method model.fit_generator(generator=trainGenerator ...
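
A minimal sketch of such a generator; the file name, the column layout of the .csv, and the encode_review helper from the Text Encoding sketch are assumptions, not the exact code from the repository:

```python
import csv
import numpy as np

def train_generator(csv_path="train.csv", batch_size=1000, n_classes=2):
    """Yield (X, y) batches forever; keras pulls one batch per training step."""
    while True:
        with open(csv_path, newline="", encoding="utf-8") as f:
            batch_x, batch_y = [], []
            for label, text in csv.reader(f):        # assumed columns: label, review text
                batch_x.append(encode_review(text))  # one-hot (1014, alphabet) matrix
                batch_y.append(int(label) - 1)       # assumed labels 1 (negative) / 2 (positive)
                if len(batch_x) == batch_size:
                    yield np.stack(batch_x), np.eye(n_classes)[batch_y]
                    batch_x, batch_y = [], []        # free the finished batch from memory

# 560,000 reviews with batch_size = 1000 give 560 steps per epoch:
# model.fit_generator(generator=train_generator(), steps_per_epoch=560,
#                     epochs=..., callbacks=[early_stopping])
```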

Results

Confusion matrices for different train data sizes

           positive   negative
  560K        17723        406
                554      19317
  280K        16921       1208
                615      19256
  140K        16698       1431
               1049      18822

Results

Comparison of different train data sizes

               AUC    Accuracy
  560K   0.9959275   0.9747368
  280K   0.9892149   0.9520263
  140K   0.9817623   0.9347368

App

4. Explainable AI with LIME

Why does the model decide that way?

Problem

The CNN as a black-box model

Idea of LIME

  • Ribeiro, Singh, and Guestrin (2016) developed a method for Local Interpretable Model-agnostic Explanations (LIME)
  • Our CNN consists of 12,127,746 weight parameters which shape the decision rule of the net \(\rightarrow\) impossible to interpret directly
  • Idea: use the reactions of the model to perturbations of the input to interpret a local prediction of the model
  • Model-agnostic technique that works for any black-box model
  • Interpretable representations:
    • text: words
    • image: parts of the image, so called superpixels

Illustration

Illustration of LIME by Ribeiro, Singh, and Guestrin (2016). Blue and pink areas depict the complex decision function of \(f\). The bold cross marks the observation to be explained; the other crosses and circles mark perturbations \(z\) of \(x\), sized proportionally to their proximity. The dashed line is the locally learned explanation.

Artificial Design Matrix

The binary columns indicate which words of the original review “I like the good expensive beer” are kept in each perturbation; P(positive) is the prediction of the black-box model.

Perturbed example             I   like   the   good   expensive   beer   P(positive)
I like beer                   1    1      0     0        0         1        0.50
I like good                   1    1      0     1        0         0        0.80
I like good expensive beer    1    1      0     1        1         1        0.74
I the good beer               1    0      1     1        0         1        0.54
expensive beer                0    0      0     0        1         1        0.27

LIME in Action

Screenshot of LIME in action

LIME Pseudo-Algorithm

  1. Create artificial design matrix
  2. Calculate the similarity of each perturbed observation to the original observation (text domain: cosine distance)
  3. Set the number of explanatory features n and select them (e.g. via lasso, forward selection, …)
  4. Fit an interpretable surrogate model on the label output of the black-box model
    • Use only the n selected features
    • Weight the perturbations according to their similarity score
  5. Use the coefficients of the surrogate model as explanations (a compact sketch follows below)
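
A compact sketch of these steps for the text case; predict_proba stands in for the black-box CNN (a function mapping a list of texts to class probabilities), and the feature-selection step is simplified to taking the largest surrogate coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_distances

def explain_text(text, predict_proba, num_samples=1000, num_features=5, kernel_width=25):
    words = text.split()
    # 1. artificial design matrix: random binary masks over the words
    Z = np.random.binomial(1, 0.5, size=(num_samples, len(words)))
    Z[0, :] = 1                                   # first row keeps the full original text
    texts = [" ".join(w for w, keep in zip(words, z) if keep) for z in Z]
    # 2. similarity to the original via cosine distance and an exponential kernel
    dist = cosine_distances(Z[:1], Z).ravel()
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. weighted interpretable surrogate fitted on the black-box outputs
    y = predict_proba(texts)[:, 1]                # P(positive) from the black box
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    # 3./5. pick the most important words via the surrogate coefficients
    top = np.argsort(np.abs(surrogate.coef_))[::-1][:num_features]
    return [(words[i], surrogate.coef_[i]) for i in top]
```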

Ingredients and settings

  • Ridge regression (a penalized linear model) is used as the interpretable surrogate model
  • Important hyperparameters:
    • num_features = 5: number of features to be selected
    • num_samples = 1000: the size of the design matrix
    • bow = False: controls whether all occurrences of a word are perturbed together or one by one
    • feature_selection = "lasso_path": selection method for step 3
  • Refer to the official documentation of LIME or this blog post with examples in R for more details (a usage sketch follows below)
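
With the lime package itself, the settings above map directly onto LimeTextExplainer; predict_fn is assumed to be a wrapper around the CNN that takes a list of texts and returns class probabilities:

```python
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["negative", "positive"],
                              bow=False,
                              feature_selection="lasso_path")

explanation = explainer.explain_instance(
    "The deep dish pizza was terrible.",   # review to be explained
    predict_fn,                            # wrapper: list of texts -> class probabilities
    num_features=5,
    num_samples=1000)

print(explanation.as_list())               # [(word, weight), ...]
```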

App

5. Conclusion and Outlook

Conclusion

  • Char-CNNs show impressive results even though almost no domain knowledge is included
  • LIME offers a good first method for trying to understand those black-box predictions
  • The performance on specific tasks is highly dependent on the training data
  • The results are still partly counter-intuitive

Outlook

  • Use GloVe embeddings instead of the character encoding to include outside information
  • Try the model with different languages
  • Try different settings of LIME
  • Toy around with the length of the encoding (currently set to 1014)
  • LIME plans to implement local predictions as a measure of trust in the future

Thank you!

References

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Ribeiro, Marco Túlio, Sameer Singh, and Carlos Guestrin. 2016. “"Why Should I Trust You?": Explaining the Predictions of Any Classifier.” CoRR abs/1602.04938. http://arxiv.org/abs/1602.04938.

LeCun, Yann, and Marc’Aurelio Ranzato. 2013. “Deep Learning Tutorial.” http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf.

Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. “Character-Level Convolutional Networks for Text Classification.” In Advances in Neural Information Processing Systems, 649–57.