
Commit 84f9185

Analyzers and initial instrumentation for pandas (Closes stefan-grafberger#6) (PR stefan-grafberger#18)
* Started backend introduction
* Started backend introduction
* Printing first 5 read_csv values working
* Alternative
* Analyzer with iterator and returning iterator
* More experimenting: now with pd concat and merge to join annotations, input, and output data into one row
* Can create iterator, but it is slow currently
* Start implementing iterator creation
* Fast iterator creation works, some tuple naming not correct yet
* Efficiently pass pandas rows to a test analyzer. There still are some bugs with identifying rows/annotations
* Extremely fast annotation propagation works for read_csv and dropna. Still have to introduce a map for storing and deleting annotations for WIR nodes as necessary. The overall overhead of our complete library for adult_easy seems to be on the order of 30-60 ms only
* Minor cleanup
* Added some todo comments
* Added some todo comments
* Added some todo comments
* Added some todo comments
* Added some todo comments
* Removed _x and _y leftovers in the column names analyzers see after merging dataframes in the pandas backend
* Passing pandas annotations without maps or similar by subclassing DataFrame and Series (a sketch of this pattern follows below)
* Update todos
* Code cleanup
* Started introducing analyzer interface
* Passing analyzers from outside to inspector and executor
* Some cleanup in pandas backend
* Introduce InspectionResult class
* Renamed graph to dag in inspection result
* Test for print_first_rows_analyzer and cleanup
* Fixed bug with passing analyzers
* Improve PipelineExecutor annotation output
* Each analyzer now only sees its own annotations
* Renamed print_first_rows_analyzer to materialize_first_rows_analyzer
* Renamed print_first_rows_analyzer to materialize_first_rows_analyzer
* Update todo and minor cleanup
* Update todo
* Added test for backend annotation propagation
* Fixed bug resulting from storing state in static variables that only occurred in tests
* Fix pylint duplicate code warning
* Update readme
* Added test to see if everything works for multiple analyzers and fixed a bug discovered by this test
* Added OperatorType enum, introduced operator_context to analyzer input, and added type hints
* Minor cleanup
* Use data classes for WIR and DAG node classes
* Introduce code_reference
* Renamed WIR node attribute
* Cleaned up function_info extraction in pipeline_executor
* Move all backend-specific code into backends
* Pandas backend now supports projections
* Very minor cleanup
* Removed leftover comment
* Update two docstrings
* Delete some leftover WIP code
1 parent db0ba0c commit 84f9185

35 files changed (+1267, -559 lines)
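The commit message above mentions propagating annotations "by subclassing DataFrame and Series" rather than keeping side maps. A minimal sketch of that general pattern using pandas' documented subclassing hooks (`_metadata` and `_constructor`); the class and attribute names are made up, not taken from this commit:

```python
import pandas as pd


class AnnotatedDataFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that carries analyzer annotations along"""

    # pandas copies attributes listed in _metadata onto the results of
    # many operations (e.g. dropna) via __finalize__
    _metadata = ["annotations"]

    @property
    def _constructor(self):
        # Make pandas operations return this subclass instead of a plain DataFrame
        return AnnotatedDataFrame
```

With this pattern, `df.dropna()` yields another `AnnotatedDataFrame`, so the `annotations` attribute travels with the result; aligning it to the surviving rows is still the backend's job.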

README.md

Lines changed: 7 additions & 3 deletions
````diff
@@ -36,13 +36,17 @@ Prerequisite: python >= 3.8
 Make it easy to analyze your pipeline and automatically check for common issues.
 
 ```python
 from mlinspect.pipeline_inspector import PipelineInspector
+from mlinspect.instrumentation.analyzers.materialize_first_rows_analyzer import MaterializeFirstRowsAnalyzer
 
 IPYNB_PATH = ...
 
-extracted_annotated_dag = PipelineInspector\
-    .on_jupyter_pipeline(IPYNB_PATH)\
-    .add_analyzer("test")\
+inspection_result = PipelineInspector \
+    .on_pipeline_from_ipynb_file(IPYNB_PATH) \
+    .add_analyzer(MaterializeFirstRowsAnalyzer(2)) \
     .execute()
+
+extracted_dag = inspection_result.dag
+analyzer_results = inspection_result.analyzer_to_annotations
 ```
 
 ## Notes
````

mlinspect/instrumentation/analyzers/__init__.py

Whitespace-only changes.
mlinspect/instrumentation/analyzers/analyzer.py

Lines changed: 45 additions & 0 deletions

```python
"""
The Interface for the analyzers
"""
import abc
from typing import Any, Iterable, Union

from mlinspect.instrumentation.analyzers.analyzer_input import OperatorContext, AnalyzerInputDataSource, \
    AnalyzerInputUnaryOperator


class Analyzer(metaclass=abc.ABCMeta):
    """
    The Interface for the analyzers
    """

    @property
    @abc.abstractmethod
    def analyzer_id(self):
        """The id of the analyzer"""
        raise NotImplementedError

    @abc.abstractmethod
    def visit_operator(self, operator_context: OperatorContext,
                       row_iterator: Union[Iterable[AnalyzerInputDataSource],
                                           Iterable[AnalyzerInputUnaryOperator]]) -> Iterable[Any]:
        """Visit an operator in the DAG"""
        raise NotImplementedError

    @abc.abstractmethod
    def get_operator_annotation_after_visit(self) -> Any:
        """Get the output to be included in the DAG"""
        raise NotImplementedError

    def __eq__(self, other):
        """Analyzers must implement equals"""
        return (isinstance(other, self.__class__) and
                self.analyzer_id == other.analyzer_id)

    def __hash__(self):
        """Analyzers must be hashable"""
        return hash((self.__class__.__name__, self.analyzer_id))

    def __repr__(self):
        """A readable string representation for analyzers"""
        return "{}({})".format(self.__class__.__name__, self.analyzer_id)
```
mlinspect/instrumentation/analyzers/analyzer_input.py

Lines changed: 57 additions & 0 deletions

```python
"""
Data classes used as input for the analyzers
"""
import dataclasses
from typing import Tuple

from mlinspect.instrumentation.dag_node import OperatorType


@dataclasses.dataclass(frozen=True)
class AnalyzerInputRow:
    """
    A class we use to efficiently pass pandas/sklearn rows
    """
    values: list
    fields: list

    def get_index_of_column(self, column_name):
        """
        Get the values index for some column
        """
        if column_name in self.fields:
            return self.fields.index(column_name)
        return None

    def get_value_by_column_index(self, index):
        """
        Get the value at some index
        """
        return self.values[index]


@dataclasses.dataclass(frozen=True)
class AnalyzerInputDataSource:
    """
    Wrapper class for the only operator without a parent: a Data Source
    """
    output: AnalyzerInputRow


@dataclasses.dataclass(frozen=True)
class AnalyzerInputUnaryOperator:
    """
    Wrapper class for operators with one parent, like Selections and Projections
    """
    input: AnalyzerInputRow
    annotation: AnalyzerInputRow
    output: AnalyzerInputRow


@dataclasses.dataclass(frozen=True)
class OperatorContext:
    """
    Additional context for the analyzer. Contains, most importantly, the operator type.
    """
    operator: OperatorType
    function_info: Tuple[str, str]
```
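As a small illustration of how an analyzer can combine these row helpers (the column names and values are made up):

```python
from mlinspect.instrumentation.analyzers.analyzer_input import AnalyzerInputRow

# Hypothetical row with two columns; an analyzer looks up the "age" value
row = AnalyzerInputRow(values=[42, "engineer"], fields=["age", "occupation"])

index = row.get_index_of_column("age")
if index is not None:
    age = row.get_value_by_column_index(index)  # 42
```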
mlinspect/instrumentation/analyzers/materialize_first_rows_analyzer.py

Lines changed: 46 additions & 0 deletions

```python
"""
A simple example analyzer
"""
from typing import Any, Iterable, Union

from mlinspect.instrumentation.analyzers.analyzer_input import OperatorContext, AnalyzerInputDataSource, \
    AnalyzerInputUnaryOperator
from mlinspect.instrumentation.analyzers.analyzer import Analyzer


class MaterializeFirstRowsAnalyzer(Analyzer):
    """
    A simple example analyzer: materialize the first row_count output rows of each operator
    """

    def __init__(self, row_count: int):
        self.row_count = row_count
        self._analyzer_id = self.row_count
        self._operator_output = None

    @property
    def analyzer_id(self):
        return self._analyzer_id

    def visit_operator(self, operator_context: OperatorContext,
                       row_iterator: Union[Iterable[AnalyzerInputDataSource],
                                           Iterable[AnalyzerInputUnaryOperator]]) -> Iterable[Any]:
        """
        Visit an operator
        """
        current_count = -1
        operator_output = []

        for row in row_iterator:
            current_count += 1
            if current_count < self.row_count:
                operator_output.append(row.output)
            yield None

        self._operator_output = operator_output

    def get_operator_annotation_after_visit(self) -> Any:
        assert self._operator_output  # May only be called after the operator visit is finished
        result = self._operator_output
        self._operator_output = None
        return result
```
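The generator contract is easy to miss: `visit_operator` yields one annotation per row (here always `None`) and must be fully consumed before `get_operator_annotation_after_visit` is called. A hedged sketch driving the analyzer by hand on synthetic rows; the `OperatorType.DATA_SOURCE` member and the function_info tuple are assumptions, not confirmed by this diff:

```python
from mlinspect.instrumentation.analyzers.analyzer_input import (AnalyzerInputDataSource,
                                                                AnalyzerInputRow, OperatorContext)
from mlinspect.instrumentation.analyzers.materialize_first_rows_analyzer import MaterializeFirstRowsAnalyzer
from mlinspect.instrumentation.dag_node import OperatorType

analyzer = MaterializeFirstRowsAnalyzer(1)

rows = [AnalyzerInputDataSource(output=AnalyzerInputRow(values=[1], fields=["a"])),
        AnalyzerInputDataSource(output=AnalyzerInputRow(values=[2], fields=["a"]))]
# Assumed enum member and (module, function) pair for a pandas read_csv data source
context = OperatorContext(operator=OperatorType.DATA_SOURCE,
                          function_info=("pandas.io.parsers", "read_csv"))

annotations = list(analyzer.visit_operator(context, iter(rows)))  # [None, None]
first_rows = analyzer.get_operator_annotation_after_visit()       # only the first output row
```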

mlinspect/instrumentation/backends/__init__.py

Whitespace-only changes.
Lines changed: 13 additions & 0 deletions

```python
"""
Get all available backends
"""
from typing import List

from mlinspect.instrumentation.backends.backend import Backend
from mlinspect.instrumentation.backends.pandas_backend import PandasBackend
from mlinspect.instrumentation.backends.sklearn_backend import SklearnBackend


def get_all_backends() -> List[Backend]:
    """Get the list of all currently available backends"""
    return [PandasBackend(), SklearnBackend()]
```
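A hedged sketch of how a caller could combine this list with each backend's `prefix` property to dispatch an intercepted call; the helper below is illustrative and not part of this commit:

```python
from typing import Optional, Tuple


def backend_for_call(function_info: Tuple[str, str]) -> Optional[Backend]:
    """Pick the backend whose library module prefix matches the called function"""
    module_name, _function_name = function_info
    for backend in get_all_backends():
        if module_name.startswith(backend.prefix):
            return backend
    return None
```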
mlinspect/instrumentation/backends/backend.py

Lines changed: 66 additions & 0 deletions

```python
"""
The Interface for the different instrumentation backends
"""
import abc

import networkx


class Backend(metaclass=abc.ABCMeta):
    """
    The Interface for the different instrumentation backends
    """

    def __init__(self):
        self.code_reference_to_description = {}
        self.code_reference_analyzer_output_map = {}
        self.analyzers = []

    @property
    @abc.abstractmethod
    def prefix(self):
        """The prefix of the module of the library the backend is for"""
        raise NotImplementedError

    @property
    @abc.abstractmethod
    def operator_map(self):
        """The list of known operator mappings"""
        raise NotImplementedError

    @property
    @abc.abstractmethod
    def replacement_type_map(self):
        """The list of used data type replacements"""
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def preprocess_wir(wir: networkx.DiGraph) -> networkx.DiGraph:
        """Preprocess the WIR if necessary"""
        raise NotImplementedError

    @abc.abstractmethod
    def before_call_used_value(self, function_info, subscript, call_code, value_code, value_value,
                               code_reference):
        """The value or module a function may be called on"""
        # pylint: disable=too-many-arguments, unused-argument
        raise NotImplementedError

    @abc.abstractmethod
    def before_call_used_args(self, function_info, subscript, call_code, args_code, code_reference, args_values):
        """The arguments a function may be called with"""
        # pylint: disable=too-many-arguments, unused-argument
        raise NotImplementedError

    @abc.abstractmethod
    def before_call_used_kwargs(self, function_info, subscript, call_code, kwargs_code, code_reference, kwargs_values):
        """The keyword arguments a function may be called with"""
        # pylint: disable=too-many-arguments, unused-argument
        raise NotImplementedError

    @abc.abstractmethod
    def after_call_used(self, function_info, subscript, call_code, return_value, code_reference):
        """The return value of some function"""
        # pylint: disable=too-many-arguments, unused-argument
        raise NotImplementedError
```
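For orientation, a hedged sketch of the order in which an executor would invoke these hooks around a single instrumented call; the wrapper function and its argument values are assumptions, not code from this commit:

```python
import pandas as pd


def instrumented_read_csv(backend: Backend, path: str, code_reference):
    """Illustrative only: the real executor generates such calls from the rewritten pipeline"""
    function_info = ("pandas.io.parsers", "read_csv")  # assumed (module, function) pair
    call_code = "pd.read_csv(path)"

    # Hooks fire before the call: the value called on, then args, then kwargs
    backend.before_call_used_value(function_info, False, call_code, "pd", pd, code_reference)
    backend.before_call_used_args(function_info, False, call_code, ["path"], code_reference, [path])
    backend.before_call_used_kwargs(function_info, False, call_code, {}, code_reference, {})

    return_value = pd.read_csv(path)

    # After the call, the backend sees (and may wrap) the return value
    backend.after_call_used(function_info, False, call_code, return_value, code_reference)
    return return_value
```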
