Python Data Processing: How To Select Data in NumPy

Learn how to select and modify data in Numpy.
By Boris Delovski • Updated on Nov 30, 2022

Ndarrays expose a very flexible and powerful API for indexing and slicing data, not unlike the one typically used for slicing lists. This makes it very easy to select and manipulate data stored inside ndarrays in various ways. You can use standard indexing and slicing techniques, but you can also take advantage of boolean indexing. Using boolean indexing makes filtering your arrays a trivial task and allows you to fully utilize the power of NumPy not only for selecting data but also for modifying it.

 

How to index and slice ndarrays

By accessing the index of some element stored in an ndarray, you can basically access that particular element. If you are working with one-dimensional arrays then the process itself is not very different from indexing data stored in lists, for example. However, once you start working with arrays that have two or more dimensions, the situation changes a bit. 

 

 

For a one-dimensional array, indexing functions is identical to indexing lists. You start by indexing the first element of the array with a 0, and as you move towards the end of the array, you increment that starting value by one. To access an element, you just call on its index. You can use positive indexing, negative indexing, and you can even slice ndarrays.

# Create an ndarray
arr = np.array([6, 7.5, 8, 0, 1])

# Select the first element
arr[0]

# Select the last element
arr[-1]

# Slice the array
arr[1:3]

To get elements from some two-dimensional arrays, you need to enter two values: one for the so-called rows selector and one for the so-called columns selector. The same principle also applies to slicing. A visual representation of the technique can be seen below:

ndarray slicingImage Source: Edlitera
 

If you enter an integer, you are defining a row number or a column number. 

If you enter a list, you are defining multiple elements that you want to select. The first element of the rows selector and the first element of the columns selector define one element that you want to select, the second pair the second element, etc.

If you enter a slice, you are defining a set of rows or a set of columns that contain elements that you want to select.

 

 

 Let's demonstrate using integers on an example:

# Create an ndarray
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Access the item at row 1, position 3
arr[1, 3]


Let's demonstrate using lists on an example:

# Create an ndarray
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Access the second element of the first row and the third element of the last row
arr[[0,-1], [1, 2]]


Let's demonstrate slicing on an example:

# Create an ndarray
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Access the second element of the first and last rows
arr[-2:, -3:]


Aside from this standard way of selecting data, you can also use boolean arrays and boolean indexing.

 

 

How to use boolean arrays, boolean indexing, and filtering

Often you want to select data that is stored inside some ndarrays based on some condition. For example, I want only those numbers from the ndarray that are lower than some other number, or only those strings that match a certain string. Basically, I want to filter the ndarrays. The way you achieve this is pretty easy: you use boolean arrays and indexing. To be more specific, you can filter your data using logical operators. Just as a refresher, logical operators are:

  • ==
  • !=
  • &
  • |
  • >
  • <
  • >=
  • <=

 

Boolean arrays

The way that the logical operators interact with ndarrays is a bit different than how they interact with data types such as lists. Let's demonstrate this with an example. There's the following Python list of strings, where each string represents a name: 

['Mike', 'Joe', 'Will', 'Mike', 'Bob']

If you try to use a comparison operator to check whether this list of names is equal to the string 'Mike', you will of course get False because your list is not equal to your string.

['Mike', 'Joe', 'Will', 'Mike', 'Bob']  ==  'Mike'

However, if you convert your list into an array and repeat the same code, the result you will get is a bit different.

names = np.array(['Mike', 'Joe', 'Will', 'Mike', 'Bob'])  
names  ==  'Mike'

Instead of getting a boolean value True or False, you get an array of boolean values. The reason for that is simple: instead of checking whether the whole array is equal to your string, you actually check each member of your ndarray separately. This means that the result of comparing an array of values with a single value will always result in an array of boolean values that has as many members as the original array. These arrays are called boolean arrays. This is extremely important because it allows you to select elements of your array without using their positions inside the array by using boolean indexing.

 

Boolean indexing

Typically, if I wanted to select the string 'Mike', my code would look like this:

names[[0, 3]] 

However, I can achieve the same result using boolean indexing. To be more specific, I will get the same result if I use an array that is the same size as the names array and that contains boolean values to select elements of your array.

names[[True, False, False, True, False]]

This way of indexing, called boolean indexing, might seem very impractical to use since it is more prone to human error than just inputting numerical indexes, but that is not true because you don't actually need to create the boolean array yourself. You can use comparison operators to perform that task. This is the basis of filtering arrays.

 

Filtering

I can replace the previously used boolean array with a comparison between the original array and the string 'Mike'. So to get the same result as the previous one, you can use the following code:

names[names == 'Mike']

And this is essentially how to filter your ndarrays by combining boolean indexing with comparison operators. You can also perform much more complex operations, for example, by combining multiple conditions:

names[(names == 'Mike') | (names == 'Bob')]

names[(names == 'Joe') & (names == 'Will')]

 

 

How to modify data

To finish everything off, let's talk about using boolean arrays for slice value assignment, i.e. using filtering to modify values in your arrays. Remember, by filtering your arrays you are actually selecting all elements in your array that satisfy a particular condition. Once the elements are selected, you can assign new values to them the same way you would change the value of a variable in Python, using the = operator. 


Let's break down the process of replacing each occurrence of 'Mike' in the names array with 'John'. First off, you need to define the boolean array that will serve as the filter:

names == 'Mike'

That boolean array can then be used to filter the array of names, which makes it easy to assign a new value to the filtered elements:

names[names == 'Mike'] = 'John'

There is one thing you need to be careful about here if you are working with arrays of strings and not numbers, and that is the maximum length of strings that appear in your array. Python always tries to be as efficient as possible, so if the longest string in your array consists of four characters, it is going to encode the values in the array accordingly. In layman's terms, if the longest string in your original array is four characters long then even if you assign a new value to some element of your array, it is going to shorten it to four characters. For example, if I try changing 'Mike' into 'Alexander' using the following code, Python will not throw an error but it will shorten 'Alexander' to 'Alex'.

names[names == 'Mike'] = 'Alexander'

There is a pretty simple way to solve this problem. Using typecasting, you can change the encoding used for strings in your array, which will allow you to introduce longer strings. To do that you can use the following code:

names = names.astype('U20')

Using this code, you set the maximum length of your strings to 20, which is more than enough. So if you now try to replace ‘Alex’ (which previously replaced ‘Mike’) with 'Alexander’, your code will lead to the expected result:

names[names == 'Alex'] = 'Alexander'

 

Conclusion

Indexing ndarrays, slicing them, and performing more complex operations such as filtering data and assigning new values to elements inside an array are all made easy by the very intuitive API exposed by the NumPy package. Using very similar code to what you use when working with standard data types in Python, such as lists, you can perform powerful operations that allow you to manipulate multidimensional data. In the following and final article on NumPy, I will demonstrate how the ability to use the indexing, slicing, and filtering techniques covered in this article makes performing complex mathematical computations not only easy but also very fast and computationally efficient.

 

Boris Delovski

Data Science Trainer

Boris Delovski

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.