-
Notifications
You must be signed in to change notification settings - Fork 9
Addition of the function cleaning_cluster_population #149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Addition of the function cleaning_cluster_population #149
Conversation
| NumPy array containing the label values. | ||
| The array should have dimensions corresponding | ||
| to either (n_atoms, n_frames) for 2D inputs, | ||
| or (n_atoms, n_frames, n_dims) for 3D inputs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the labels, not the data... aren't they always a 2D array? Each atom at each frame has one integer label, even if the data were multivariate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the same reason, I think all the logic in the function for the 3D case is never necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends how we do want to use this function. I've thought different applications:
- 2D labels: cleaning the labels to color the trajectory
- 2D labels: can come also from other clustering methods, not necessarily from onion, and can be used as above
- 3D labels: come from get_onion_analysis and can be use to redo the plot with the number of clusters and the unclassified fraction cleaned. By keeping this function out from get_onion_analysis you are free to change the threshold how many times you want without running the Onion calculation multiple times. In this case, I should also think a way to do that plot again. On the other hand, it's true that's enough to keep only the 2D option and then run it in a for cicle when we need it.
There's no problem for me to delete it in case; since it wasn't difficult to do I've added both the options.
| array is 2D (n_atoms, n_frames), the output will be a 2D array of | ||
| the same shape. Otherwise, if the input is 3D | ||
| (n_atoms, n_frames, n_dims), the output will also be a 3D array | ||
| of the same shape. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before, labels are always 2D.
SimoneMartino98
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two easy comments. I’ll leave the discussion on the label dimensions to @matteobecchi.
Anyway, nice work!
| missing = np.setdiff1d(excluded_arr, np.unique(labels)) | ||
|
|
||
| if missing.size > 0: | ||
| logger.log(f"Excluded value(s) not found in labels: {missing}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not actually a log, rather it should be a warning. It is also true that at the moment we don't have a warning log function. My suggestion here is to change the line with:
logger.warning(f"Excluded value(s) not found in labels: {missing}")and then add something like the following function in the _internal.logs.py file (after the log function):
def warning(self, msg: str) -> None:
"""Records an informational warning message to the log.
Parameters:
msg:
The message to record.
"""
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
history_entry = f"[{timestamp}] {msg}"
console.warning(msg)
self._log.append(history_entry)It should work and give a yellow warning msg. If not let's see what happen.
|
|
||
| from .case_data import CleanPopCaseData | ||
|
|
||
| # ---------------- Tests ---------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this; we are already in the test folder. Up to you though.
| "Clean_pop test files were not present. They have been created." | ||
| ) | ||
| exp_clean_pop = np.load(expected_clean_pop) | ||
| assert np.allclose(exp_clean_pop, test_clean_pop, atol=1e-6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure; but i think that for this case the arrays should be exactly equal to each other (we are not using float here, they are all int, i'm not seeing any machine dependencies).
Maybe is the same thing, but i think that here we can be more restrictive with:
https://numpy.org/devdocs/reference/generated/numpy.array_equal.html
instead of np.allclose
I have created the function the works for 2D and 3D labels' arrays. It cleans the clusters with low population.
Solved #136