feat: Citi Hackathon code submission #810

Open
wants to merge 35 commits into
base: main
Changes from 5 commits
2e77117
feat(csv): add sensitive data check for .csv and .xlsx files
Psingle20 Oct 26, 2024
079dfc1
feat(logging): add checks for .log and .json files
Psingle20 Oct 26, 2024
5f66f08
fix: revert previous changes
Psingle20 Oct 26, 2024
c449a9c
feat: add support for .log and .json files
Psingle20 Oct 26, 2024
8b257a2
test: add test for edge cases
ChaitanyaD48 Oct 27, 2024
027a459
refactor: corrected the way filepaths are geting extracted from diff
Psingle20 Oct 28, 2024
df82548
refactor: modified the test to provide exact action and diff content
Psingle20 Oct 28, 2024
fd26523
refactor: modified proxy.config to support the feature
Psingle20 Nov 10, 2024
77809a3
feat: added logic for EXIF metadata retrieval
shabbirflow Nov 10, 2024
2cb09c0
feat: added logic for EXIF metadata retrieval
shabbirflow Nov 10, 2024
7f60205
feat: create checkCryptoImplementation file for detecting non-standar…
ChaitanyaD48 Nov 10, 2024
d6510e8
test: add test cases for checkCryptoImplementation
ChaitanyaD48 Nov 10, 2024
57add83
feat: integrate checkCryptoImplementation into the main processing chain
ChaitanyaD48 Nov 10, 2024
f258864
feat: added test cases for exif data retrieval & push - blocking
shabbirflow Nov 10, 2024
36e07d9
feat(CheckExif): modified the CheckExif file and integrated it with t…
Psingle20 Nov 11, 2024
d2314ac
Merge branch 'finos:main' into CheckFilesData
Psingle20 Nov 11, 2024
c03d3fe
Merge branch 'finos:main' into GetEXIFData
Psingle20 Nov 11, 2024
2c2441a
feat: integrate checkEXIFJpeg validation in push action chain
Psingle20 Nov 11, 2024
bba8ff1
feat: added logic for ai/ml usage detection
shabbirflow Nov 11, 2024
f8d6ad0
Merge branch 'main' of https://github.com/Psingle20/git-proxy
Psingle20 Nov 14, 2024
f07aacb
feat: add Gitleaks vulnerability detection feature
Psingle20 Nov 14, 2024
193dbf6
refactor: integrated with the workflow
Psingle20 Nov 14, 2024
ddc4e98
Merge branch 'finos:main' into CheckFilesData
Psingle20 Nov 14, 2024
7ef0dcf
Merge branch 'finos:main' into DetectAiMlUsage
Psingle20 Nov 14, 2024
7e8cb6b
Merge branch 'finos:main' into GetEXIFData
Psingle20 Nov 14, 2024
45b5024
Merge branch 'finos:main' into detect-cryptography
Psingle20 Nov 14, 2024
c72bbaa
Merge branch 'finos:main' into DetectAiMlUsage
shabbirflow Nov 15, 2024
4e01fa1
Merge branch 'finos:main' into main
Psingle20 Nov 23, 2024
fee5329
Merge branch 'finos:main' into detect-cryptography
ChaitanyaD48 Nov 25, 2024
994d080
Merge branch 'CheckFilesData' into CitiCodeSubmission
Psingle20 Nov 28, 2024
ddd9901
feat: exif data check merged
Psingle20 Nov 28, 2024
c8ea591
feat: merge Aiml usage detection feat
Psingle20 Nov 28, 2024
0be5039
feat: Gitleaks check merged
Psingle20 Nov 28, 2024
518234f
feat: crypto check feature merged
Psingle20 Nov 28, 2024
cca6713
reafactor: gitleaks rules update and general code clean up
Psingle20 Nov 28, 2024
8 changes: 7 additions & 1 deletion proxy.config.json
Original file line number Diff line number Diff line change
@@ -78,9 +78,15 @@
"literals": [],
"patterns": [],
"providers": {},
"ProxyFileTypes" : [".jpg"]
"ProxyFileTypes" : [".jpg"],
"aiMlUsage": {
"enabled": true,
"blockPatterns": ["modelWeights", "largeDatasets", "aiLibraries", "configKeys", "aiFunctionNames"]
}


}

}

},
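Review note: the processor in this PR reads `commitConfig.diff.block.aiMlUsage` directly, so a deployment whose `proxy.config.json` lacks the new block would throw on the first push, and a pattern name the processor does not recognise is silently ignored. A minimal sketch of a defensive reader follows. The helper name `readAiMlUsageConfig`, the `KNOWN_PATTERNS` list, and the "disabled by default" behaviour are assumptions for illustration, not part of git-proxy.

```javascript
// Hypothetical helper (not part of git-proxy): normalise the optional
// aiMlUsage block so a missing or partial config cannot throw.
const KNOWN_PATTERNS = [
  'modelWeights',
  'largeDatasets',
  'aiLibraries',
  'configKeys',
  'aiFunctionNames',
];

const readAiMlUsageConfig = (commitConfig) => {
  // Optional chaining tolerates configs that predate the aiMlUsage block
  const raw = commitConfig?.diff?.block?.aiMlUsage ?? {};
  const requested = Array.isArray(raw.blockPatterns) ? raw.blockPatterns : [];
  return {
    // Assumption: the check stays off unless explicitly enabled
    enabled: raw.enabled === true,
    // Names the processor has a regex for
    blockPatterns: requested.filter((name) => KNOWN_PATTERNS.includes(name)),
    // Names that would otherwise be ignored without any warning
    unknownPatterns: requested.filter((name) => !KNOWN_PATTERNS.includes(name)),
  };
};
```

Logging `unknownPatterns` at startup would surface a misspelled pattern name before it quietly disables a check.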
1 change: 1 addition & 0 deletions src/proxy/chain.js
@@ -12,6 +12,7 @@ const pushActionChain = [
proc.push.getDiff,
proc.push.checkSensitiveData, // checkSensitiveData added
proc.push.checkExifJpeg,
proc.push.checkForAiMlUsage,
proc.push.clearBareClone,
proc.push.scanDiff,
proc.push.blockForAuth,
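Review note: property lookups on the processor exports are case-sensitive, so a chain entry whose name differs from its export by a single letter resolves to `undefined` and only fails once a push reaches that position. A minimal sketch of a load-time sanity check follows. `findMissingProcessors` and the stand-in `push` object are hypothetical and only illustrate the idea.

```javascript
// Hypothetical check (not part of git-proxy): report the positions of chain
// entries that are not functions.
const findMissingProcessors = (chain) =>
  chain
    .map((fn, index) => ({ fn, index }))
    .filter(({ fn }) => typeof fn !== 'function')
    .map(({ index }) => index);

// Stand-in for the push-action exports; the second key is deliberately
// spelled with a lowercase "u" to show the failure mode.
const push = {
  getDiff: async (req, action) => action,
  checkForAiMlusage: async (req, action) => action,
};

// The chain refers to the processor with an uppercase "U", so the lookup misses.
const chain = [push.getDiff, push.checkForAiMlUsage];
const missing = findMissingProcessors(chain); // returns [1]
```

Running such a check when the chain module loads turns a runtime "is not a function" error into an immediate, descriptive failure.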
143 changes: 143 additions & 0 deletions src/proxy/processors/push-action/checkForAiMlUsage.js
@@ -0,0 +1,143 @@
const { Step } = require('../../actions');
const config = require('../../../config');
const commitConfig = config.getCommitConfig();

const fs = require('fs');

// Patterns for detecting different types of AI/ML assets
const FILE_PATTERNS = {
  // Model weight files such as .h5, .pb, .pt, .ckpt, or .pkl
  modelWeights: /\.(h5|pb|pt|ckpt|pkl)$/,
  // Dataset files (matched by extension only; file size is not checked)
  largeDatasets: /\.(csv|json|xlsx)$/,
  // AI/ML library and tokenizer imports; the library names are grouped so that
  // a bare "torch" or "keras" elsewhere in the file does not match
  aiLibraries:
    /(?:import\s+(?:tensorflow|torch|keras|sklearn|tokenizer)|require\(['"](?:tensorflow|torch|keras|sklearn|tokenizer)['"]\))/,
  // Config keys in JSON/YAML, including token-related keys
  configKeys: /\b(epochs|learning_rate|batch_size|token)\b/,
  // AI/ML function/class names, including tokenize/tokenizer
  aiFunctionNames: /\b(train_model|predict|evaluate|fit|transform|tokenize|tokenizer)\b/,
};


// Check whether a file name suggests an AI/ML asset (model weights or dataset)
const isAiMlFileByExtension = (fileName) => {
  const { blockPatterns } = commitConfig.diff.block.aiMlUsage;
  // Common model weight file extensions
  if (blockPatterns.includes('modelWeights') && FILE_PATTERNS.modelWeights.test(fileName)) {
    return true;
  }
  // Dataset file extensions
  if (blockPatterns.includes('largeDatasets') && FILE_PATTERNS.largeDatasets.test(fileName)) {
    return true;
  }
  return false;
};

// Check whether file content suggests AI/ML usage
const isAiMlFileByContent = (fileContent) => {
  const { blockPatterns } = commitConfig.diff.block.aiMlUsage;
  // AI/ML library imports
  if (blockPatterns.includes('aiLibraries') && FILE_PATTERNS.aiLibraries.test(fileContent)) {
    return true;
  }
  // Training-related config keys
  if (blockPatterns.includes('configKeys') && FILE_PATTERNS.configKeys.test(fileContent)) {
    return true;
  }
  // AI/ML function/class names
  if (
    blockPatterns.includes('aiFunctionNames') &&
    FILE_PATTERNS.aiFunctionNames.test(fileContent)
  ) {
    return true;
  }
  return false;
};


// Check each file path for AI/ML indicators.
// Returns one boolean per path: true when the file is clean, false when an indicator was found.
const detectAiMlUsageFiles = async (filePaths) => {
  const results = [];
  for (const filePath of filePaths) {
    try {
      const fileName = filePath.split('/').pop();
      // A matching file name is enough; skip the content check
      if (isAiMlFileByExtension(fileName)) {
        results.push(false);
        continue;
      }
      // Check for AI/ML indicators within the file content
      const content = await fs.promises.readFile(filePath, 'utf8');
      results.push(!isAiMlFileByContent(content));
    } catch (err) {
      console.error(`Error reading file ${filePath}:`, err);
      results.push(true); // Treat read errors as no AI/ML usage found
    }
  }

  return results;
};

// Helper function to parse file paths from git diff content
const extractFilePathsFromDiff = (diffContent) => {
  const filePaths = [];
  const lines = diffContent.split('\n');

  lines.forEach((line) => {
    const match = line.match(/^diff --git a\/(.+?) b\/(.+?)$/);
    if (match) {
      filePaths.push(match[1]); // Path from the "a/" side of the diff header
    }
  });

  return filePaths;
};

// Main exec function
const exec = async (req, action, log = console.log) => {
  const diffStep = action.steps.find((s) => s.stepName === 'diff');
  const step = new Step('checkForAiMlUsage');
  action.addStep(step);
  if (!commitConfig.diff.block.aiMlUsage.enabled) {
    return action;
  }

  if (diffStep && diffStep.content) {
    const filePaths = extractFilePathsFromDiff(diffStep.content);

    if (filePaths.length) {
      // One entry per file: true when clean, false when AI/ML usage was found
      const cleanResults = await detectAiMlUsageFiles(filePaths);
      const isBlocked = cleanResults.some((clean) => !clean);

      if (isBlocked) {
        step.blocked = true;
        step.error = true;
        step.errorMessage = 'Your push has been blocked due to AI/ML usage detection';
        log(step.errorMessage);
      }
    } else {
      log('No files found in the diff content.');
    }
  } else {
    log('No diff content available.');
  }

  return action;
};

exec.displayName = 'checkForAiMlUsage.exec';
module.exports = { exec };
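Review note: in JavaScript regular expressions, alternation binds more loosely than concatenation, so a list of library names inside a `require(...)` pattern has to be wrapped in a group. Without the group, each name becomes a top-level alternative and matches anywhere in the file. The sketch below compares the two forms; `LIBS`, `ungrouped`, `grouped`, and the sample strings are illustrative names, not code from this PR.

```javascript
const LIBS = 'tensorflow|torch|keras|sklearn|tokenizer';

// Ungrouped: parsed as  require(['"]tensorflow | torch | keras | sklearn | tokenizer['"])
// so the bare words "torch", "keras" and "sklearn" match on their own.
const ungrouped = new RegExp(`require\\(['"]${LIBS}['"]\\)`);

// Grouped: the alternation is confined to the module name inside require('...').
const grouped = new RegExp(`require\\(['"](?:${LIBS})['"]\\)`);

const harmless = "const light = 'torch'; // a flashlight, not PyTorch";
const realImport = "const tf = require('tensorflow');";

const falsePositive = ungrouped.test(harmless); // true: bare "torch" matches
const noFalsePositive = grouped.test(harmless); // false
const stillDetects = grouped.test(realImport); // true
```

The same grouping concern applies to any pattern that mixes a fixed prefix or suffix with a list of alternatives.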
1 change: 1 addition & 0 deletions src/proxy/processors/push-action/index.js
@@ -13,3 +13,4 @@ exports.checkUserPushPermission = require('./checkUserPushPermission').exec;
exports.clearBareClone = require('./clearBareClone').exec;
exports.checkSensitiveData = require('./checkSensitiveData').exec;
exports.checkExifJpeg = require('./checkExifJpeg').exec;
exports.checkForAiMlUsage = require('./checkForAiMlUsage').exec;
92 changes: 92 additions & 0 deletions test/checkAiMlUsage.test.js
@@ -0,0 +1,92 @@
const { exec } = require('../src/proxy/processors/push-action/checkForAiMlUsage.js');
const sinon = require('sinon');
const { Action } = require('../src/proxy/actions/Action.js');
const { Step } = require('../src/proxy/actions/Step.js');


describe('Detect AI/ML usage from git diff', () => {
  let logStub;

  beforeEach(() => {
    // Stub console.log so each test can assert on the logged messages
    logStub = sinon.stub(console, 'log');
  });

  afterEach(() => {
    // Restore the stub to avoid cross-test interference
    logStub.restore();
  });

  const createDiffContent = (filePaths) => {
    // Creates one "diff --git" header line per file path to simulate git diff output
    return filePaths.map((filePath) => `diff --git a/${filePath} b/${filePath}`).join('\n');
  };

  it('Block push if AI/ML file extensions detected', async () => {
    // Create action and step instances with test data that should trigger blocking
    const action = new Action('action_id', 'push', 'create', Date.now(), 'owner/repo');
    const step = new Step('diff');

    const filePaths = [
      'test/test_data/ai_test_data/model.h5',
      'test/test_data/ai_test_data/dataset.csv',
    ];
    step.setContent(createDiffContent(filePaths));
    action.addStep(step);

    await exec(null, action);

    // Check that console.log was called with the blocking message
    sinon.assert.calledWith(
      logStub,
      sinon.match(/Your push has been blocked due to AI\/ML usage detection/),
    );
  });

  it('Block push if AI/ML file content detected', async () => {
    // Create action and step instances with test data that should trigger blocking
    const action = new Action('action_id', 'push', 'create', Date.now(), 'owner/repo');
    const step = new Step('diff');

    const filePaths = [
      'test/test_data/ai_test_data/ai_script.py',
      'test/test_data/ai_test_data/ai_config.json',
    ];
    step.setContent(createDiffContent(filePaths));
    action.addStep(step);

    await exec(null, action);

    // Check that console.log was called with the blocking message
    sinon.assert.calledWith(
      logStub,
      sinon.match(/Your push has been blocked due to AI\/ML usage detection/),
    );
  });

  it('Allow push if no AI/ML usage is detected', async () => {
    // Use a file with no AI/ML indicators in its name or content
    const action = new Action('action_id', 'push', 'create', Date.now(), 'owner/repo');
    const step = new Step('diff');

    const filePaths = ['test/test_data/ai_test_data/non_ai_script.py'];
    step.setContent(createDiffContent(filePaths));
    action.addStep(step);

    await exec(null, action);

    // Ensure no blocking message was logged
    sinon.assert.neverCalledWith(
      logStub,
      sinon.match(/Your push has been blocked due to AI\/ML usage detection/),
    );
  });
});
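Review note: the `createDiffContent` helper in this test file emits only `diff --git` header lines, which is also the only thing the processor's path extraction inspects. Real `git diff` output adds extended header lines such as `deleted file mode`, and a deleted file can no longer be read from disk. The sketch below shows one way to skip deleted files and take the path from the `b/` side. `extractChangedPaths` is a hypothetical name, and the approach assumes paths do not themselves contain the sequence ` b/`.

```javascript
// Hypothetical alternative to the PR's path extraction: collect the "b/" path
// of every file in a unified diff, skipping files the diff marks as deleted.
const extractChangedPaths = (diffContent) => {
  const paths = [];
  let current = null; // path of the file section currently being scanned

  for (const line of diffContent.split('\n')) {
    const header = line.match(/^diff --git a\/(.+?) b\/(.+)$/);
    if (header) {
      // A new file section starts; keep the previous one unless it was dropped
      if (current) paths.push(current);
      current = header[2];
    } else if (current && line.startsWith('deleted file mode')) {
      // Deleted files have no content left to scan
      current = null;
    }
  }
  if (current) paths.push(current);
  return paths;
};

const sampleDiff = [
  'diff --git a/src/a.js b/src/a.js',
  'index 1111111..2222222 100644',
  'diff --git a/old_model.h5 b/old_model.h5',
  'deleted file mode 100644',
  'diff --git a/new_script.py b/new_script.py',
  'new file mode 100644',
].join('\n');

const changed = extractChangedPaths(sampleDiff); // ['src/a.js', 'new_script.py']
```

A test fixture that includes a `deleted file mode` line would exercise this case, which the header-only helper cannot express.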
4 changes: 4 additions & 0 deletions test/test_data/ai_test_data/ai_config.json
@@ -0,0 +1,4 @@
{
  "epochs": 100,
  "learning_rate": 0.01
}
4 changes: 4 additions & 0 deletions test/test_data/ai_test_data/ai_script.py
@@ -0,0 +1,4 @@
import tensorflow as tf
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(10, activation='relu'))
model.compile(optimizer='adam', loss='mse')
3 changes: 3 additions & 0 deletions test/test_data/ai_test_data/dataset.csv
@@ -0,0 +1,3 @@
id,feature1,feature2,label
1,0.5,0.3,1
2,0.6,0.2,0
1 change: 1 addition & 0 deletions test/test_data/ai_test_data/model.h5
@@ -0,0 +1 @@
This is a dummy model file
1 change: 1 addition & 0 deletions test/test_data/ai_test_data/non_ai_script.py
@@ -0,0 +1 @@
print("Hello World") # No AI/ML content