
Commit 7880528

Support for more complex pipelines (Closes stefan-grafberger#20) (PR stefan-grafberger#24)
* WIP: Creating complex example (×2)
* Minor linter fixes
* Added synthetic data
* Trying to get test pipeline to work: works with decision tree classifier, but not Keras NN
* Example works without w2v
* Complex example runs. TODO: mlinspect support for it
* Removed unnecessary statement
* Commented out parts of the complex example to make the pipeline run
* Added Join DAG node
* Fixed a bug in code_reference_to_module extraction
* Fixed a bug involving chained method calls
* Groupby DAG nodes work in progress
* Added end_lineno and end_col_offset to CodeReference to support chained function call edge cases (see the sketch after this list)
* WIP: fixing tests after code_reference change (×4)
* All tests work again after modifying code_reference
* Groupby-Aggregate DAG node with description
* Discovered a huge bug with subscript instrumentation
* Resulting changes from instrumentation bug fix
* DAG extraction works for projections with lists of columns as argument
* Found a way subscript assigns might be possible
* Index-Assign WIR extraction works
* Index-Assign runs, but no DAG extraction yet. TODO: add module info
* Index-Assign is doable but requires other operators to work
* Added some comments
* Select does not work yet but no longer throws errors
* Train-Test-Split no longer causes mlinspect to fail; still need to implement WIR extraction for tuple unpacking
* Added WIR support for tuple unpacking in assignments
* Imputer now in DAG
* Nested pipeline DAG creation starting to work. TODO: delete left-over original nodes after copying for column transformer
* Nested pipeline DAG extraction works
* Added support for W2V transformer
* DAG extraction works for the whole complex pipeline; only some runtime parts are missing that require analyzer instrumentation
* Fixed a test
* Subscript-Assign DAG nodes work completely now
* Projection/selection differentiation for df.__getitem__ almost works
* Select DAG node works now
* Finished selection changes and updated tests
* Added a TODO
* Adding analyzer support statement by statement: Data Source
* WIP: join (×2)
* Pandas backend now supports joins
* Pandas backend now supports groupby aggregates by treating them as a data source
* Pandas backend now supports the 2nd join (found a bug in the initial join implementation)
* Preparing set-label
* Pandas backend now supports the set-label syntax
* Fixed bug introduced in last commit
* Projection with double-list syntax already works
* Select by series works
* Train-test splits work, although we'll need to revisit them once we model the test set in the DAG
* Started with sklearn pipeline
* Analyzer support for a simple version of the sklearn pipeline
* Analyzer support for W2V transformer
* Analyzers work for the complex example; some code will need to be cleaned up though. TODO: write demo analyzers
* Moved demo into a notebook and a new directory
* Fixed bug with printed score
* Some cleanup. For some reason, due to moving demo_utils, the healthcare example is slower
* More cleanup (×2)
* Started with demo analyzer. Works: propagating age_group. TODO: histograms, embeddings
* Propagating age_group and race to calculate histograms
* First histogram plots work
* Histograms for Data Source and Groupby-Aggregate nodes if the column is available
* Added matplotlib to dependencies
* Included race histograms
* Added missing embedding inspection
* Updated demo notebook, added simple pipeline time measurement. Note: iterators are not yet shared between inspections, so each inspection slows everything down a bit
* Added a simple lineage inspection for demo purposes
* Renamed analyzers to inspections
* Improved package structure a bit
* Some cleanup
* More cleanup
* Changes from the CIDR version of the repo
* Disable demo notebook image output when running the corresponding pytest test
* Update README after renaming analyzers to inspections
1 parent f9adae0 commit 7880528
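
For readers skimming the CodeReference change called out above ("Added end_lineno and end_col_offset to CodeReference"): below is a minimal, hypothetical sketch of what such a source-span record might look like, assuming a plain dataclass carrying the four position fields named in the message. The actual definition in mlinspect's instrumentation code may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeReference:
    """Hypothetical sketch of the source span of an instrumented call.

    Assumption: adding end_lineno/end_col_offset lets chained calls on one
    line, e.g. data.groupby('age_group').agg(...), map to distinct,
    non-overlapping spans instead of colliding on (lineno, col_offset).
    """
    lineno: int
    col_offset: int
    end_lineno: int
    end_col_offset: int
```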

File tree

61 files changed: +5181 −1449 lines


README.md

Lines changed: 3 additions & 3 deletions

@@ -36,17 +36,17 @@ Prerequisite: python >= 3.8

Make it easy to analyze your pipeline and automatically check for common issues.

```python
from mlinspect.pipeline_inspector import PipelineInspector
-from mlinspect.instrumentation.analyzers.materialize_first_rows_analyzer import MaterializeFirstRowsAnalyzer
+from mlinspect.inspections.materialize_first_rows_inspection import MaterializeFirstRowsInspection

IPYNB_PATH = ...

inspection_result = PipelineInspector \
    .on_pipeline_from_ipynb_file(IPYNB_PATH) \
-    .add_analyzer(MaterializeFirstRowsAnalyzer(2)) \
+    .add_inspection(MaterializeFirstRowsInspection(2)) \
    .execute()

extracted_dag = inspection_result.dag
-analyzer_results = inspection_result.analyzer_to_annotations
+inspection_to_annotations = inspection_result.inspection_to_annotations
```

## Notes
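
As a hedged follow-up to the README snippet above: the attribute names `dag` and `inspection_to_annotations` come straight from the diff, while everything else below (the notebook path, iterating the mapping, printing the results) is an illustrative assumption about how the result might be consumed, not documented API behaviour.

```python
from mlinspect.pipeline_inspector import PipelineInspector
from mlinspect.inspections.materialize_first_rows_inspection import MaterializeFirstRowsInspection

IPYNB_PATH = "some_pipeline.ipynb"  # hypothetical path for illustration

inspection_result = PipelineInspector \
    .on_pipeline_from_ipynb_file(IPYNB_PATH) \
    .add_inspection(MaterializeFirstRowsInspection(2)) \
    .execute()

# Result fields renamed by this commit (formerly analyzer_to_annotations).
extracted_dag = inspection_result.dag
for inspection, annotations in inspection_result.inspection_to_annotations.items():
    # assumption: a mapping from each inspection to its per-node output
    print(inspection, annotations)
```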
File renamed without changes.
File renamed without changes.

demo/healthcare/demo_utils.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@

```python
"""
Some useful utils for the project
"""
import numpy
from sklearn.exceptions import NotFittedError
from gensim.sklearn_api import W2VTransformer
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.python.keras.optimizer_v2.gradient_descent import SGD


class MyW2VTransformer(W2VTransformer):
    """Some custom w2v transformer."""

    def partial_fit(self, X):
        # pylint: disable=useless-super-delegation
        super().partial_fit([X])

    def fit(self, X, y=None):
        X = X.iloc[:, 0].tolist()
        return super().fit([X], y)

    def transform(self, words):
        words = words.iloc[:, 0].tolist()
        if self.gensim_model is None:
            raise NotFittedError(
                "This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method."
            )

        # The input as array of array
        vectors = []
        for word in words:
            if word in self.gensim_model.wv:
                vectors.append(self.gensim_model.wv[word])
            else:
                vectors.append(numpy.zeros(self.size))
        return numpy.reshape(numpy.array(vectors), (len(words), self.size))


def create_model(input_dim):
    """Create a simple neural network"""
    clf = Sequential()
    clf.add(Dense(9, activation='relu', input_dim=input_dim))
    clf.add(Dense(9, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer=SGD(), metrics=["accuracy"])
    return clf
```
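
A small, hedged usage sketch for `MyW2VTransformer` above: it assumes a single-column pandas DataFrame of strings (mirroring how healthcare.py routes the 'last_name' column through a ColumnTransformer), and the `size`/`min_count` values are illustrative only, not taken from the commit.

```python
import pandas as pd

from demo.healthcare.demo_utils import MyW2VTransformer

# Illustrative single-column input; the transformer takes column 0 of the
# DataFrame and treats it as one list of tokens.
names = pd.DataFrame({'last_name': ['smith', 'jones', 'smith', 'doe']})

w2v = MyW2VTransformer(size=5, min_count=1)  # hypothetical hyperparameters
w2v.fit(names)
vectors = w2v.transform(names)
print(vectors.shape)  # (4, 5); out-of-vocabulary names map to zero vectors
```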

demo/healthcare/healthcare.png

353 KB

demo/healthcare/healthcare.py

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@

```python
"""
An example pipeline
"""
import os

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from demo.healthcare.demo_utils import MyW2VTransformer, create_model
from mlinspect.utils import get_project_root

COUNTIES_OF_INTEREST = ['county2', 'county3']

# load input data sources (data generated with https://www.mockaroo.com as a single file and then split into two)
patients = pd.read_csv(os.path.join(str(get_project_root()), "demo", "healthcare", "healthcare_patients.csv"), na_values='?')
histories = pd.read_csv(os.path.join(str(get_project_root()), "demo", "healthcare", "healthcare_histories.csv"),
                        na_values='?')

# combine input data into a single table
data = patients.merge(histories, on=['ssn'])

# compute mean complications per age group, append as column
complications = data.groupby('age_group').agg(mean_complications=('complications', 'mean'))

data = data.merge(complications, on=['age_group'])

# target variable: people with a high number of complications
data['label'] = data['complications'] > 1.2 * data['mean_complications']

# project data to a subset of attributes
data = data[['smoker', 'last_name', 'county', 'num_children', 'race', 'income', 'label']]

# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
# model evaluation
print(model.score(test_data, test_data['label']))
```
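
To tie the new demo back to the commit's main point (mlinspect can now extract a DAG and run inspections over this more complex pipeline), here is a hedged sketch that reuses only the entry points shown in the README diff above; the notebook path exists in this commit, but the exact inspection set and output shape are assumptions.

```python
import os

from mlinspect.utils import get_project_root
from mlinspect.pipeline_inspector import PipelineInspector
from mlinspect.inspections.materialize_first_rows_inspection import MaterializeFirstRowsInspection

# Run mlinspect over the healthcare demo notebook added in this commit.
IPYNB_PATH = os.path.join(str(get_project_root()), "demo", "healthcare", "healthcare_demo.ipynb")

inspection_result = PipelineInspector \
    .on_pipeline_from_ipynb_file(IPYNB_PATH) \
    .add_inspection(MaterializeFirstRowsInspection(2)) \
    .execute()

# The extracted DAG should now include the operators this commit adds support
# for: joins, groupby-aggregate, projections, selections, train-test split,
# the nested sklearn pipeline, and the W2V transformer.
extracted_dag = inspection_result.dag
print(extracted_dag)
```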

demo/healthcare/healthcare_demo.ipynb

Lines changed: 571 additions & 0 deletions
Large diffs are not rendered by default.

0 commit comments
