I have a usage question. Suppose I have a 2D time series that I want to feed to, e.g., an LSTM model. I first convert it to a 3D array and then pass it to the model. This is normally done in memory with NumPy. But what happens when the data is a large file managed with Spark? The solutions I've seen so far all do the processing in Spark and then convert the result to a 3D NumPy array at the end, which puts everything in memory. Or am I misunderstanding something?
A common Spark LSTM solution looks like this:
# create fake dataset
import random
import numpy as np
from keras import models
from keras import layers
data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
df = spark.createDataFrame(data,['Node', 'day','Temp','hum','press'])
# transform the data
df_trans = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)
# make train/test data (days 0-69 for training, days 70-99 for testing)
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]
################## we lost the SPARK #############################
# create train/test array
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(trainDF.columns).collect())
# drop the target columns
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]
# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]
# reshape 2D to 3D
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))
# build the model
model = models.Sequential()
# take the feature count from the data instead of hard-coding it
model.add(layers.LSTM(1, input_shape=(1, xtrain.shape[2])))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# train the model
loss = model.fit(xtrain, ytrain, batch_size=10, epochs=100)
My problem with this is that if my Spark data has millions of rows and thousands of columns, the "# create train/test array" step tries to pull all of it into memory and runs out of memory.
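To make the scale concrete, here is a rough back-of-the-envelope estimate of the size of the dense array that step would build. The row and column counts are hypothetical stand-ins for "millions of rows and thousands of columns", the helper name `dense_array_gib` is made up for this illustration, and the 8 bytes per value assumes 64-bit integers or floats. The collected Python row objects add further overhead on top of this.

```python
def dense_array_gib(n_rows, n_cols, itemsize=8):
    """Approximate size in GiB of a dense n_rows x n_cols array.

    itemsize=8 assumes 64-bit values (an assumption, not measured).
    """
    return n_rows * n_cols * itemsize / 2**30


# Hypothetical sizes: 5 million rows, 2,000 columns.
print(f"{dense_array_gib(5_000_000, 2_000):.1f} GiB")  # about 74.5 GiB
```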
How do I use Tempo with an LSTM to solve the 2D -> 3D problem?
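For reference, one general way to avoid materializing the whole 3D array is to build the windows lazily, one batch at a time. The following is only a minimal sketch of that idea. It does not use Tempo, the helper names `sliding_windows` and `window_batches` are hypothetical, and it assumes the rows arrive ordered by time with the target in the last column, as in the example above.

```python
from collections import deque
from itertools import islice


def sliding_windows(rows, timesteps):
    """Yield (window, target) pairs from rows that are ordered by time.

    Each row is a sequence whose last element is the target. A window
    holds the features of the last `timesteps` rows, and the target is
    taken from the newest row in the window.
    """
    window = deque(maxlen=timesteps)
    for row in rows:
        *features, target = row
        window.append(features)
        if len(window) == timesteps:
            yield [list(f) for f in window], target


def window_batches(rows, timesteps, batch_size):
    """Yield (x, y) batches; x is nested as (batch, timesteps, features)."""
    pairs = sliding_windows(rows, timesteps)
    while True:
        chunk = list(islice(pairs, batch_size))
        if not chunk:
            return
        x = [w for w, _ in chunk]
        y = [[t] for _, t in chunk]
        yield x, y


# Tiny in-memory stand-in for a row iterator.
demo_rows = [(0, 1, 2, 10), (1, 2, 3, 11), (2, 3, 4, 12), (3, 4, 5, 13)]
for x, y in window_batches(demo_rows, timesteps=2, batch_size=2):
    print(len(x), len(x[0]), len(x[0][0]), y)
```

With PySpark, `rows` could be something like `trainDF.toLocalIterator()`, which brings one partition at a time to the driver rather than collecting the whole DataFrame. Each batch could then be converted with `np.asarray` and passed to `model.train_on_batch`. Only one batch of windows is held in memory at a time, at the cost of training from a single driver process.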